Handbook of Ambient Intelligence and Smart Environments
Hideyuki Nakashima · Hamid Aghajan · Juan Carlos Augusto Editors
Editors

Hideyuki Nakashima
Future University Hakodate
Kameda-Nakano 116-2
Hakodate, Hokkaido 041-8655, Japan
[email protected]

Hamid Aghajan
Department of Electrical Engineering
Stanford University
350 Serra Mall
Stanford, CA 94305-9515, USA
[email protected]

Juan Carlos Augusto
School of Computing & Mathematics
University of Ulster at Jordanstown
Shore Road, Newtownabbey, Co. Antrim
BT37 0QB, UK
[email protected]
ISBN 978-0-387-93807-3
e-ISBN 978-0-387-93808-0
DOI 10.1007/978-0-387-93808-0
Springer New York Dordrecht Heidelberg London

Library of Congress Control Number: 2009935679

© Springer Science+Business Media, LLC 2010
© Chapter 11: 2009 Frank Stajano. Used with permission.

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
Contents
Part I Introduction

Ambient Intelligence and Smart Environments: A State of the Art
Juan Carlos Augusto, Hideyuki Nakashima, Hamid Aghajan
1 Introduction
2 Sensors, Vision, and Networks
3 Mobile and Pervasive Computing
4 Human-centered Interfaces
5 Artificial Intelligence and Robotics
6 Multi-Agents
7 Applications
8 Societal Implications and Impact
9 Selected Research Projects
10 Perspectives of the Area
11 Conclusions
References
Part II Sensor, Vision and Networks

A Survey of Distributed Computer Vision Algorithms
Richard J. Radke
1 Introduction
2 Distributed Algorithms
3 Topology Estimation
3.1 Non-overlapping Topology Estimation
3.2 Overlapping Topology Estimation
4 Camera Network Calibration
4.1 Non-overlapping Camera Calibration
4.2 Overlapping Camera Calibration
4.3 Improving Calibration Consistency
5 Tracking and Classification
6 Conclusions
References
Video-Based People Tracking
Marcus A. Brubaker, Leonid Sigal and David J. Fleet
1 Introduction
1.1 Tracking as Inference
2 Generative Model for Human Pose
2.1 Kinematic Parameterization
2.2 Body Geometry
2.3 Image Formation
3 Image Measurements
3.1 2D Points
3.2 Background Subtraction
3.3 Appearance Models
3.4 Edges and Gradient Based Features
3.5 Discussion
4 Motion Models
4.1 Joint Limits
4.2 Smoothness and Linear Dynamical Models
4.3 Activity Specific Models
4.4 Physics-based Motion Models
5 Inference
5.1 Particle Filter
5.2 Annealed Particle Filter
5.3 Markov Chain Monte Carlo Filtering
6 Initialization and Failure Recovery
6.1 Introduction to Discriminative Methods for Pose Estimation
6.2 Discriminative Methods as Proposals for Inference
7 Conclusions
References
Locomotion Activities in Smart Environments
Björn Gottfried
1 Introduction
1.1 Why Locomotion Matters
1.2 Overview
2 Related Work on Motion Analysis and Smart Environments
2.1 Motion Parameters
2.2 Precision
2.3 Viewpoint
2.4 Wayfinding
2.5 Modelling
2.6 Summary
3 Locomotion Activities in Smart Environments
3.1 Smart Hospitals
3.2 Scenarios in a Smart Hospital
3.3 Characterising Locomotion Activities
4 The Representation of Locomotion Activities
4.1 Spatiotemporal Information
4.2 Functional Specification
4.3 Allocentric View
5 Locomotion Based Algorithms
5.1 Planning Scenario
5.2 Wayfinding Scenario
5.3 Searching Scenario
5.4 Monitoring Scenario
6 Discussion
7 Outlook
8 Summary
References

Tracking in Urban Environments Using Sensor Networks Based on Audio-Video Fusion
Manish Kushwaha, Songhwai Oh, Isaac Amundson, Xenofon Koutsoukos, Akos Ledeczi
1 Introduction
2 Challenges and Related Work
3 Architecture
4 Audio Beamforming
5 Video Tracking
6 Time Synchronization
6.1 Synchronization Methodology
6.2 Evaluation of HSN Time Synchronization
6.3 Synchronization Service
7 Multimodal Target Tracking
7.1 Sequential Bayesian Estimation
7.2 Sensor Models
7.3 Multiple-Target Tracking
8 Evaluation
8.1 Sequential Bayesian Estimation
8.2 MCMCDA
9 Conclusions
References
Multi-Camera Vision for Surveillance
Noriko Takemura and Hiroshi Ishiguro
1 Introduction
2 Camera Systems
2.1 Fixed Camera
2.2 Active Camera
2.3 Mixed System
2.4 Multi-modal System
2.5 Surveillance Issues
2.6 Location and Subjects
2.7 Occlusion
3 Case Studies
3.1 Real-time cooperative multi-target tracking by dense communication among Active Vision Agents
3.2 Tracking Multiple Occluding People by Localizing on Multiple Scene Planes
3.3 Applying a Form of Memory-based Attention Control to Activity Recognition at a Subway Station
3.4 DMCtrac: Distributed Multi Camera Tracking
3.5 Abnormal behavior-detection using sequential syntactical classification in a network of clustered cameras
4 Industry Systems
5 Conclusion
6 Further Reading
References
Part III Mobile and Pervasive Computing

Collaboration Support for Mobile Users in Ubiquitous Environments
Babak A. Farshchian and Monica Divitini
1 Introduction
2 Characteristics of Collaboration
2.1 Collaboration and Shared Context
2.2 Embodied Interactions and Artifacts as Resources
2.3 Mobility of People and Resources
2.4 Physical Distribution of People
2.5 Flexibility and the Need for Tailoring
3 Existing Technology in Support of Ubiquitous Collaboration
4 Human Grid as a Unifying Concept for AmI and CSCW
4.1 The Example of UbiBuddy
4.2 Shared Context in Human Grid
4.3 Embodiment in Human Grid
4.4 Mobility in Human Grid
4.5 Support for Physical Distribution in Human Grid
4.6 Flexibility and Tailoring in a Human Grid
5 Implementation of Human Grid: The UbiCollab Platform
5.1 UbiNode Overall Architecture
5.2 Collaboration Instance Manager
5.3 Session Manager
5.4 Collaboration Space Manager
5.5 Service Domain Manager
5.6 Resource Discovery Manager
5.7 Identity Manager
6 Conclusions
7 Acknowledgements
References
Pervasive Computing Middleware
Gregor Schiele, Marcus Handte and Christian Becker
1 Introduction
2 Design Considerations
2.1 Organizational Model
2.2 Provided Level of Abstraction
2.3 Supported Tasks
3 Spontaneous Interaction
3.1 Ubiquitous Communication and Interaction
3.2 Integration of Heterogeneous Devices
3.3 Dynamic Mediation
4 Context Management
4.1 Acquisition and Fusion
4.2 Modeling and Distribution
4.3 Provisioning and Access
5 Application Adaptation
5.1 Inter-Application Adaptation
5.2 Intra-Application Adaptation
6 Conclusion
References
Case Study of Middleware Infrastructure for Ambient Intelligence Environments
Tatsuo Nakajima
1 Introduction
2 High Level Abstraction
3 Case Studies
3.1 Middleware Infrastructure for Distributed Audio and Video Appliances
3.2 Middleware Infrastructure to Hide Complex Details in Underlying Infrastructures
3.3 Middleware Infrastructure to Support Spontaneous Smart Object Integration
3.4 Middleware Infrastructure for Context Analysis
4 Middleware Design Issues for Ambient Intelligence
4.1 High Level Abstraction and Middleware Design
4.2 Interface vs. Protocol
4.3 Non Functional Properties
4.4 Portability
4.5 Human Factors
5 Future Directions
References
Collaborative Context Recognition for Mobile Devices
Pertti Huuskonen, Jani Mäntyjärvi and Ville Könönen
1 Introduction
1.1 Humans: Context-aware Animals
2 Context From Collaboration
3 Mobile Context Awareness
3.1 Context Sources
3.2 Application Areas
3.3 Context Recognition
3.4 Distributed Context Awareness
4 Making Rational Decisions
4.1 Distributed Decision Making
4.2 Voting Protocols as Distributed Decision Making Strategies
4.3 Voting as CCR Method
5 Case Study: Physical Activity Recognition
5.1 Data Set and Test Settings
5.2 Empirical Results
6 Context Recognition Platform
7 The Way Forward
7.1 Energy Savings
7.2 Risks
7.3 Knowledge-based CCR?
References
Security Issues in Ubiquitous Computing
Frank Stajano
1 Fundamental Concepts
1.1 Ubiquitous (Pervasive, Sentient, Ambient...) Computing
1.2 Security
1.3 Security Issues in Ubiquitous Computing
2 Application Scenarios and Technical Security Contributions
2.1 Wearable Computing
2.2 Location Privacy
2.3 RFID
2.4 Authentication and Device Pairing
2.5 Beyond Passwords
3 Going Up a Level: the Rest of the Story
3.1 Security and Usability
3.2 Understanding People
3.3 Motivations and Incentives
References
Pervasive Systems in Health Care
Achilles Kameas, Ioannis Calemis
1 Introduction
2 Overview of Pervasive Healthcare Systems
2.1 @HOME
2.2 HEARTFAID
2.3 ALARM-NET
2.4 CAALYX
2.5 TeleCARE
2.6 CHRONIC
2.7 MyHeart
2.8 OLDES
2.9 SAPHIRE
3 Reference Architecture
3.1 Input-Output System
3.2 Local System
3.3 Remote Subsystem
4 The HEARTS System
4.1 Local Devices/Services
4.2 HEARTS-OS
4.3 HEARTS Local Subsystem Manager (LSM)
4.4 Remote Monitoring
5 Conclusions
6 Acknowledgement
References
Part IV Human-centered Interfaces

Human-centered Computing
Nicu Sebe
1 Introduction
2 What is Human-centered Computing?
2.1 Gateways and Barriers
2.2 Definitions
2.3 Scope of HCC
2.4 HCC, HCI, CSCW, User-centered Design, and Human Computation
3 Areas of Human-centered Computing
3.1 Multimedia Production
3.2 Multimedia Analysis
3.3 Multimedia Interaction
4 Applications
4.1 Human Spaces
4.2 Ubiquitous Devices
4.3 Users with Disabilities
4.4 Public and Private Spaces
4.5 Virtual Environments
4.6 Art
4.7 Other
5 Integrating HCC into a Real (Human) World
5.1 Integrating Modalities and Media
5.2 Integrating Access
5.3 Integrating Resources
6 Research Agenda for HCC
7 Conclusions
References
End-user Customisation of Intelligent Environments
Jeannette Chin, Victor Callaghan and Graham Clarke
1 Introduction
2 The Home of the Future – A Vision
3 Contemporary Work
3.1 Pre-programmed Rules
3.2 Agent-programmed Rules
3.3 User-programmed Rules
3.4 Supporting Studies
3.5 Discussion
4 Pervasive interactive Programming (PiP) – An Example of End-User Programming
4.1 PiP Terminology
4.2 PiP System Architecture
4.3 How the System Works
5 The dComp Ontology
5.1 dComp Rationale
5.2 The dComp Ontology – An Overview
6 Evaluation
6.1 PiP Testbed
6.2 dComp Performance Evaluation
6.3 The PiP Evaluation
6.4 Results
7 Concluding Discussion
References
Intelligent Interfaces to Empower People with Disabilities
Margrit Betke
1 Introduction
2 The Camera Mouse
3 Intelligent Devices for People with Motion Impairments
4 Camera-Based Music Therapy for People with Motion Impairments
5 Enabling Users to Create Their Own Smart Environments
6 Assistive Software for Users with Severe Motion Impairments
7 Discussion
7.1 Successful Design of Camera-based Interfaces for People with Disabilities Requires a User-oriented Approach to Computer Vision Research
7.2 Designing Interfaces for People with Disabilities – Research Efforts in Human-Computer Interaction
7.3 Progress in Assistive Software Development
7.4 The Importance of Computational Social Science to Intelligent Interface Design
References

Visual Attention, Speaking Activity, and Group Conversational Analysis in Multi-Sensor Environments
Daniel Gatica-Perez and Jean-Marc Odobez
1 Introduction
2 Applications
2.1 On-line Regulation of Group Interaction
2.2 Assistance of Remote Group Interaction
3 Perceptual Components: Speaking Turns and Visual Attention
3.1 Estimating Speaking Turns
3.2 Estimating Visual Focus of Attention
4 From Speaking Activity and Visual Attention to Social Inference: Dominance Modeling
4.1 Dominance in Social Psychology
4.2 Dominance in Computing
4.3 Dominance Estimation from Joint Visual Attention and Speaking Activity
5 Concluding Remarks
References

Using Multi-modal Sensing for Human Activity Modeling in the Real World
Beverly L. Harrison, Sunny Consolvo, and Tanzeem Choudhury
1 Introduction
2 Technologies for Tracking Human Activities
2.1 Methods for Logging Physical Activity
2.2 The Mobile Sensing Platform (MSP) and UbiFit Garden Application
2.3 User Studies
3 Usability, Adaptability, and Credibility
3.1 Usability of Mobile Inference Technology
3.2 Adaptability
3.3 Credibility
4 Conclusions
References
Recognizing Facial Expressions Automatically from Video
Caifeng Shan and Ralph Braspenning
1 Introduction
2 Problem Space and State of the Art
2.1 Level of Description
2.2 Static Versus Dynamic Expression
2.3 Facial Feature Extraction and Representation
2.4 Expression Subspace Learning
2.5 Facial Expression Classification
2.6 Spontaneous Versus Posed Expression
2.7 Expression Intensity
2.8 Controlled Versus Uncontrolled Data Acquisition
2.9 Correlation with Bodily Expression
2.10 Databases
3 Facial Expression Recognition with Discriminative Local Features
3.1 Local Binary Patterns
3.2 Learning Discriminative LBP-Histogram Bins
3.3 Experiments
4 Discussion
References
Sharing Content and Experiences in Smart Environments
Johan Plomp, Juhani Heinilä, Veikko Ikonen, Eija Kaasinen and Pasi Välkkynen
1 Introduction
2 What is Experience, Can You Share It?
2.1 Defining Experience
2.2 Experience Design Modeling
2.3 On Sharing Experiences
3 The Changing Role of the User in Content Provision (Prosumers)
4 Smart Environments and Experience Sharing Features
4.1 From Ubiquitous Computing to Digital Ecologies
4.2 Features for Sharing Content and Experiences
5 Examples of Research in Sharing Content and Experiences
5.1 Video Content Management in Candela
5.2 Context Aware Delivery and Interaction in Nomadic Media
5.3 Kontti - Context Aware Services for Mobile Users
5.4 m:Ciudad - A Metropolis of Ubiquitous Services
5.5 Sharing with My Peer - ExpeShare
5.6 Ubimedia Based on Memory Tags
6 Discussion
7 Acknowledgements
References

User Interfaces and HCI for Ambient Intelligence and Smart Environments
Andreas Butz
1 Input/Output Devices
1.1 Displays
1.2 Interactive Display Surfaces
1.3 Tangible/Physical Interfaces
1.4 Adapting Traditional Input Devices
1.5 Multi-Device User Interfaces
2 Humans Interacting with the Smart Environment
2.1 Metaphors and Conceptual Models
2.2 Physicality as a Particularly Strong Conceptual Model
2.3 Another Example: the Peephole Metaphor
2.4 Human-centric UIs
3 Designing User Interfaces for Ambient Intelligence and Smart Environments
3.1 The User-centered Process
3.2 Novel Challenges
4 Evaluating User Interfaces for Ambient Intelligence and Smart Environments
4.1 Adaptation of Existing Evaluation Techniques
4.2 Novel Evaluation Criteria
5 Case Studies
5.1 Applications in the ParcTab project
5.2 The MIT Intelligent Room
5.3 Implicit interaction: the MediaCup
5.4 Tangible Interaction: The MediaBlocks
5.5 Multi Device interaction: Sony CSL Interactive Workspaces
5.6 Hybrid Interfaces: The PhotoHelix
5.7 Summary and Perspective
References

Multimodal Dialogue for Ambient Intelligence and Smart Environments
Ramón López-Cózar and Zoraida Callejas
1 Introduction
2 Context Awareness
2.1 Context Detection and Processing
2.2 User Modelling
3 Handling of Input Information
3.1 Abstracting the Environment
3.2 Processing Multimodal Input
4 Dialogue Management
4.1 Interaction Strategies
4.2 Confirmation strategies
5 Response Generation
6 Evaluation
6.1 Contextual Evaluation
6.2 Evaluation from the Perspective of Human-computer Interaction
7 Conclusions
References
Part V Artificial Intelligence and Robotics

Smart Monitoring for Physical Infrastructures
Florian Fuchs, Michael Berger, and Claudia Linnhoff-Popien
1 Introduction
2 Infrastructure Monitoring in the Context of Ambient Intelligence
2.1 State-of-the-Art in Infrastructure Monitoring
2.2 Context-Awareness in Ambient Intelligence
2.3 Knowledge Representation and Reasoning in the Semantic Web
3 Pattern-based Modeling of Context Information and Infrastructure Conditions
3.1 Key Terms and Concepts of Condition Monitoring
3.2 Ontology Design Pattern for Infrastructure Conditions
4 Smart Monitoring Architecture
5 Reasoning over Distributed Context Information
5.1 Answering Conjunctive Queries
5.2 Implementation as a Framework
6 European Railway Monitoring as a Case Study
7 Discussion and Conclusion
References

Spatio-Temporal Reasoning and Context Awareness
Hans W. Guesgen and Stephen Marsland
1 Introduction
2 Motivation
3 Sensory Input
4 Spatio-Temporal Reasoning
5 Symbolic Approaches
6 Machine Learning Approaches
6.1 Frequent Pattern Data Mining
6.2 Graphical Models
6.3 Novelty Detection and Habituation
6.4 Other Methods
7 Conclusion
References
From Autonomous Robots to Artificial Ecosystems
Fulvio Mastrogiovanni, Antonio Sgorbissa, and Renato Zaccaria
1 Introduction
2 State-of-the-art: Integrated Frameworks for Robots in Smart Environments
2.1 Ambience
2.2 PEIS Ecology
2.3 Ubibot: Sobots, Embots and Mobots
2.4 The Artificial Ecosystem
3 Application Scenario
3.1 Cooperation and Competition for Transportation and Surveillance
3.2 Ambience, Ubibot and PEIS Ecology at Work
4 Case Study: the Artificial Ecosystem Approach
4.1 Basic Principles
4.2 Cooperation and Competition in the Reference Scenario
4.3 Distributed knowledge representation and reasoning
5 Conclusion
References
Behavior Modeling for Detection, Identification, Prediction, and Reaction (DIPR) in AI Systems Solutions
Rachel E. Goshorn, Deborah E. Goshorn, Joshua L. Goshorn, and Lawrence A. Goshorn
1 Introduction
2 AI and Behavior Modeling for DIPR Subsystems
2.1 DIPR "Detection" Subsystem
2.2 DIPR "Identification" Subsystem
2.3 DIPR "Predict" Subsystem
2.4 DIPR "Reaction" Subsystem
3 Infrastructure for AI Systems Solutions
3.1 Physical Infrastructure for AI Systems Solutions
3.2 Overlay Architecture for AI Systems Solutions
4 Application Models
4.1 Application Model (A): AI Decision Making
4.2 Application Model (B): Abnormal Behavior Detection
4.3 Application Model (C): Enabling "Smart Environments"
4.4 Application Model (D): Environment or "Ambient Intelligence" (Learning)
4.5 Application Model (E): Turbocharging Detection Classifier
5 Advanced DIPR Systems Engineering
5.1 Advanced Fusion Methods Using the DIPR System
5.2 Adaptive Learning in the DIPR System
6 Summary
References
Part VI Multi-Agents

Multi-Agent Social Simulation
Itsuki Noda, Peter Stone, Tomohisa Yamashita and Koichi Kurumatani
1 Introduction
2 Issues on Multiagent Social Simulation
3 Autonomous Vehicles at Intersections
3.1 The Intersection Management Protocol
3.2 Main Result
3.3 Extensions
3.4 Summary
4 Evaluation of Dial-a-Ride System by Simulation
4.1 Problem Domain and Formalization
4.2 Simulation Setup
4.3 Results and Discussion
4.4 Summary
5 User Model Constructed Based on Sensor Network
5.1 Introduction
5.2 Integrated Exhibition Support System
5.3 Proofing Experiments in Expo 2005 Aichi
5.4 Tour Trend Analysis and Its Application
5.5 Summary
6 Concluding Remark
References

Multi-Agent Strategic Modeling in a Specific Environment
Matjaz Gams and Andraz Bezek
1 Introduction
2 Overview of the Field
3 The Multi-Agent Strategy Discovering Algorithm
3.1 Domain-Dependent Knowledge
3.2 Graphical Representation
3.3 Symbolic Representation
4 Human Analysis of Results
5 Evaluation
5.1 RoboCup
5.2 3vs2 Keepaway
6 Advantages of Strategic Modeling
References
699 699 700 701 703 706 708 710 711 711 712 713 715 715 715 716 720 721 724 724 724 727 727 728 731 732 733 736 738 740 740 742 744 745
Contents
xix
Learning Activity Models for Multiple Agents in a Smart Space
Aaron Crandall and Diane J. Cook
1 Introduction
2 Testbeds
3 Data Representation
4 Classifier
5 Results
5.1 Time Delta Enhanced Classification
5.2 Markov Model Classification Accuracy
6 Conclusions
References
Mobile Agents
Ichiro Satoh
1 Introduction
1.1 Mobility and Distribution
2 Mobile Agent Platform
2.1 Remote Procedure Call
2.2 Mobile Agent Languages
2.3 Agent Execution Management
2.4 Inter-agent Communication
2.5 Locating Mobile Agents
2.6 Security
2.7 Remarks
3 Mobile Agent Applications
3.1 Remote Information Retrieval
3.2 Network Management
3.3 Cloud Computing
3.4 Mobile Computing
3.5 Software Testing
3.6 Active Networking
3.7 Active Documents
4 Ambient Computing
4.1 Mobile Agent-based Middleware for Ambient Computing
4.2 Context-aware Mobile Agent
4.3 Personal-Assistant Agents
5 Conclusion
References
Part VII Applications

Ambient Human-to-Human Communication
Aki Härmä
1 Introduction
2 The Long Call
3 Spatial Attributes Of Presence
4 Ambient Speech Technology
4.1 Ambient Telephone Architecture
4.2 Speech Capture And Enhancement
4.3 Speech Transmission
4.4 Spatial Speech Reproduction
5 Tracking And Interaction
6 Calibration And Configuration
7 Ambient Visual Communication
8 Future Research
9 Acknowledgements
References
Smart Environments for Occupancy Sensing and Services
Susanna Pirttikangas, Yoshito Tobe, and Niwat Thepvilojanapong
1 Overview
2 Detecting Devices
2.1 Infrared
2.2 Ultrasonic Sound
2.3 FM Radio
2.4 WiFi
2.5 Ultra-Wideband
2.6 Vision
2.7 Pressure
3 Estimation Algorithms
3.1 Location Estimation
3.2 WiFi Locationing
3.3 GSM cells and GPS
3.4 Bayes Filtering
3.5 Particle Filtering
4 From Measurements to Meaning and Services
4.1 Location-Aware Reminders
4.2 A Scenario on Routine Learning
5 Platforms of smart environments
5.1 EasyLiving
5.2 Aware Home
5.3 PlaceLab House_n
5.4 Robotic Room
6 Conclusion
References
819
Smart Offices and Intelligent Decision Rooms . . . . . . . . . . . . . . . . . . . . . . . Carlos Ramos, Goreti Marreiros, Ricardo Santos, Carlos Filipe Freitas 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Smart Offices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Active Badge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Monica SmartOffice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
845
819 819 820 820 823 823 824 824 825 826 826 827 827 829 829 830 830 830 834 834 835 836 837 838 839
845 846 849 849
Contents
2.3 Stanford’s Interactive Worksapaces . . . . . . . . . . . . . . . . . . 2.4 Intelligent Environment Laboratory of IGD Rostock . . . 2.5 Sens-R-Us . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6 Smart Doorplate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.7 Ambient Agoras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Intelligent Meeting Rooms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 The IDIAP Smart Meeting Room . . . . . . . . . . . . . . . . . . . 3.2 SMaRT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 M4 and AMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 NIST Smart Space Project . . . . . . . . . . . . . . . . . . . . . . . . . 3.5 MIT Intelligent Room Project . . . . . . . . . . . . . . . . . . . . . . 4 LAID - Laboratory of Ambient Intelligence for Decision Support 4.1 Ubiquitous Group Decision Making . . . . . . . . . . . . . . . . . 4.2 Ubiquitous System Architecture . . . . . . . . . . . . . . . . . . . . 4.3 ABS4GD and WebMeeting Plus . . . . . . . . . . . . . . . . . . . . 4.4 Idea Generation Techniques . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Smart Classroom: Bringing Pervasive Computing into Distance Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuanchun Shi, Weijun Qin, Yue Suo, Xin Xiao 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Smart Classroom: Sensor-Rich Smart Environment . . . . . . . . . . . . 2.1 The Layout of Smart Classroom . . . . . . . . . . . . . . . . . . . . 2.2 A Typical User Experience Scenario in Smart Classroom 3 Key Multimodal Interface and Context-Awareness Technologies . 3.1 Location Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Direct Manipulation based on Laser Pointer Tracking . . 3.3 Speaker Tracking based on Microphone Array and Location Sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Context-awareness in Smart Classroom . . . . . . . . . . . . . . 4 Large-Scale Real-Time Collaborative Distance Learning Supporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Totally Ordered Reliable Multicast . . . . . . . . . . . . . . . . . . 4.2 Adaptive Multimedia Transport Model . . . . . . . . . . . . . . . 4.3 SameView: Real-Time Interactive Distance Learning Application in Smart Classroom . . . . . . . . . . . . . . . . . . . . 5 Smart Platform: Multiagent-Based Infrastructure . . . . . . . . . . . . . . 5.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 The Features and Design Objectives . . . . . . . . . . . . . . . . . 6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
xxi
850 850 851 851 851 852 853 853 853 854 855 855 857 858 860 865 865 869 870 870 875 875 877 877 878 879 880 883 887 890 896 896 897 898 900 901 902 903 903
Ambient Intelligence in the City . . . 905
Marc Böhlen and Hans Frei
1 Overview and Motivation and Limitations . . . 905
2 Ambient intelligence and Urbanism – a Reality Check . . . 906
3 Ambient Intelligence and Philosophies of Technology . . . 908
4 Out of Balance . . . 909
4.1 Hydra's New Heads . . . 910
4.2 Information Cities . . . 911
5 Ambient Agoras . . . 912
5.1 Combat Zones . . . 914
6 Amateur Ambient Intelligence . . . 915
6.1 Slow Space Design . . . 916
6.2 Unusual Encounters . . . 917
6.3 Mobile Devices Everywhere . . . 917
7 Ambient Intelligence for The People . . . 918
7.1 Knowledge from Many Minds . . . 919
8 Remaking Public Space, with and in Spite of Ambient Intelligence . . . 919
8.1 Paths of Surveillance . . . 920
8.2 Staging City Life . . . 921
8.3 Paths of Production . . . 921
8.4 Personal Pollution Control . . . 922
8.5 Full Body Ambient Intelligence . . . 922
9 AmI in the City, Revitalized . . . 925
10 Outlook . . . 926
References . . . 927

The Advancement of World Digital Cities . . . 933
Mika Yasuoka, Toru Ishida and Alessandro Aurigi
1 Introduction . . . 933
2 American Community Networks . . . 935
3 European Digital Cities . . . 937
4 Asian City Informatization . . . 939
5 Advances in Information and Communication Technologies . . . 941
6 Technologies in Digital City Kyoto . . . 942
6.1 The Information Layer . . . 943
6.2 The Interface Layer . . . 945
6.3 The Interaction Layer . . . 948
6.4 Lessons Learned . . . 949
7 Conclusion . . . 950
References . . . 951
Part VIII Societal Implications and Impact

Human Factors Consideration for the Design of Collaborative Machine Assistants . . . 955
Sung Park, Arthur D. Fisk, and Wendy A. Rogers
1 Introduction . . . 955
2 Collaborative Machine Assistants (CMAs) . . . 956
3 User Acceptance of CMA . . . 957
3.1 What is Acceptance? . . . 957
3.2 User Characteristics . . . 958
3.3 Technology Characteristics . . . 961
3.4 Relational Characteristics . . . 963
4 Design Guidelines . . . 969
5 Conclusions . . . 971
References . . . 972
Privacy Sensitive Surveillance for Assisted Living – A Smart Camera Approach . . . 979
Sven Fleck and Wolfgang Straßer
1 Introduction . . . 979
2 Classification of AAL & Video Surveillance Systems . . . 981
3 Related Work . . . 982
3.1 Related Work – Surveillance . . . 982
3.2 Related Work – Smart Cameras . . . 984
4 Identified Problems of Current Approaches and Resulting Impact . . . 985
4.1 Privacy . . . 985
5 The Proposed SmartSurv Approach for AAL . . . 987
5.1 The Smart Camera Paradigm . . . 987
5.2 SmartSurv 3D Surveillance Architecture . . . 987
5.3 Privacy Filters . . . 988
5.4 Smart Camera Hardware . . . 989
5.5 SafeSenior - The Installation at An Assisted Living Home . . . 989
6 Smart Camera Architecture . . . 991
6.1 Background Modeling and AutoInit . . . 991
6.2 Tracking . . . 992
6.3 2D → 3D Conversion Unit . . . 993
6.4 Handover . . . 993
6.5 Activity Recognition . . . 993
6.6 The Classification & Activity Recognition Flow . . . 995
7 Visualization Fully Decoupled from Sensor Layer . . . 997
7.1 3D Visualization Node - XGRT . . . 997
7.2 Google Earth as Visualization Node . . . 998
7.3 Web Application for Visualization . . . 1000
7.4 Statistics . . . 1001
8 Conclusion and Impact . . . 1003
8.1 Future Directions . . . 1004
References . . . 1005

Data Mining for User Modeling and Personalization in Ubiquitous Spaces . . . 1009
Alejandro Jaimes
1 Introduction . . . 1009
2 User Modeling . . . 1010
3 Data Mining & Knowledge Discovery (KDD) . . . 1013
4 Context & Mobile User Modeling . . . 1015
5 Data Mining in User Modeling & Applications . . . 1020
6 Human Factors . . . 1021
7 Evaluation . . . 1024
8 Conclusions & Future Directions . . . 1025
References . . . 1027

Experience Research: a Methodology for Developing Human-centered Interfaces . . . 1033
Boris de Ruyter and Emile Aarts
1 Introduction . . . 1033
1.1 From Entertaining to Caring . . . 1034
1.2 From System Intelligence to Social Intelligence . . . 1035
2 Experience Research . . . 1037
2.1 Context Studies . . . 1038
2.2 Lab Studies . . . 1040
2.3 Field Studies . . . 1041
3 ExperienceLab Infrastructure . . . 1042
3.1 HomeLab: the Home Environment . . . 1042
3.2 ShopLab: the Retail Environment . . . 1043
3.3 CareLab: the Assisted Living Environment . . . 1044
4 The ExperienceLab as an Assessment Instrument . . . 1045
4.1 Method . . . 1046
4.2 Procedure . . . 1046
4.3 Materials . . . 1047
4.4 Results . . . 1048
4.5 Discussions . . . 1052
5 Case Study of the Experience Research Approach . . . 1053
5.1 Context-Mapping Study . . . 1053
5.2 Laboratory Study . . . 1055
5.3 Field Study . . . 1057
6 Discussion . . . 1058
References . . . 1058
Part IX Projects

Computers in the Human Interaction Loop . . . 1065
A. Waibel, R. Stiefelhagen, R. Carlson, J. Casas, J. Kleindienst, L. Lamel, O. Lanz, D. Mostefa, M. Omologo, F. Pianesi, L. Polymenakos, G. Potamianos, J. Soldatos, G. Sutschet, J. Terken
1 Introduction . . . 1065
2 Audio-Visual Perceptual Technologies . . . 1067
2.1 Speech Recognition . . . 1068
2.2 Person Tracking . . . 1070
2.3 Person Identification . . . 1073
2.4 Interaction Cues: Gestures, Body Pose, Head Pose, Attention . . . 1074
2.5 Activity Analysis, Situation Modeling . . . 1076
2.6 Audio-Visual Output Technologies . . . 1078
2.7 Interaction . . . 1078
2.8 Natural Language . . . 1079
3 Technology Evaluations and Data Collection . . . 1080
3.1 Data Collection . . . 1081
3.2 Corpus Annotation . . . 1084
4 Software Infrastructure . . . 1085
4.1 SitCom - The Situation Composer . . . 1087
4.2 Agent Infrastructure . . . 1088
5 CHIL Services . . . 1090
5.1 The Memory Jog . . . 1091
5.2 The Collaborative Workspace . . . 1093
5.3 Managing the Social Interaction: The Relational Cockpit and the Relational Report . . . 1096
5.4 The Connector: A Virtual Secretary in Smart Offices . . . 1100
6 Conclusion . . . 1101
References . . . 1102

Eye-based Direct Interaction for Environmental Control in Heterogeneous Smart Environments . . . 1111
Fulvio Corno, Alastair Gale, Päivi Majaranta, and Kari-Jouko Räihä
1 Introduction . . . 1111
2 The COGAIN network . . . 1113
3 Domotic Technology . . . 1114
3.1 Overview of Domotic Technologies . . . 1115
3.2 Architecture for Intelligent Domotic Environment . . . 1119
3.3 The DOG Domotic OSGi Gateway . . . 1122
4 Eye-based Environmental Control . . . 1125
4.1 Overview of Eye Based Environmental Control . . . 1125
4.2 The ART system . . . 1127
4.3 Eye Based Environmental Control and Communication Systems . . . 1129
5 Conclusions . . . 1130
References . . . 1130

Middleware Architecture for Ambient Intelligence in the Networked Home . . . 1133
Nikolaos Georgantas, Valerie Issarny, Sonia Ben Mokhtar, Yerom-David Bromberg, Sebastien Bianco, Graham Thomson, Pierre-Guillaume Raverdy, Aitor Urbieta and Roberto Speicys Cardoso
1 Introduction . . . 1133
2 Achieving Interoperability in AmI Environments . . . 1135
2.1 The Amigo Service Architecture for Interoperability . . . 1136
2.2 Service Discovery Interoperability . . . 1137
2.3 Service Communication Interoperability . . . 1139
3 AmIi Interoperable Service Discovery . . . 1141
3.1 ASDM: A Model for Semantic and Syntactic Service Specification . . . 1142
3.2 ASDL: A Language for Semantic and Syntactic Service Specification . . . 1144
3.3 Interoperable Matching of Service Capabilities . . . 1147
3.4 Ranking Heterogeneous Matching Results . . . 1149
4 AmIi Interoperable Service Communication . . . 1150
4.1 Interoperable Service Discovery and Communication . . . 1150
4.2 RPC Communication Stack . . . 1153
4.3 Event-based Interoperability . . . 1155
4.4 AmIi-COM Instances . . . 1157
5 Conclusion . . . 1158
References . . . 1160

The PERSONA Service Platform for AAL Spaces . . . 1165
Mohammad-Reza Tazari and Francesco Furfari and Juan-Pablo Lázaro Ramos and Erina Ferro
1 An Introduction to the PERSONA Project . . . 1165
2 Requirements on a Service Platform for AAL Spaces . . . 1167
2.1 User Requirements . . . 1167
2.2 Technical Requirements . . . 1169
3 Evaluation of Existing Solutions . . . 1170
4 The PERSONA Architectural Design . . . 1173
4.1 The Abstract Physical Architecture . . . 1173
4.2 The Interoperability Framework . . . 1174
4.3 The PERSONA Platform Services . . . 1177
4.4 The Middleware . . . 1181
4.5 The Ontological Modeling . . . 1184
5 Sensing the environment . . . 1185
5.1 Wireless Sensor Networks . . . 1186
5.2 ZigBee Networks Integration . . . 1188
6 Conclusions . . . 1191
References . . . 1191
ALADIN - a Magic Lamp for the Elderly? . . . 1195
Edith Maier and Guido Kempter
1 ALADIN - Magic Lighting for the Elderly? . . . 1195
1.1 The Impact of Lighting on People's Health and Wellbeing . . . 1196
1.2 Wellbeing - a Multifaceted Concept . . . 1197
2 ALADIN's Journey . . . 1198
2.1 Theoretical and Methodological Underpinnings . . . 1198
2.2 Analyzing User Requirements . . . 1199
2.3 Tests in a Laboratory Setting . . . 1200
2.4 User Testing . . . 1202
3 The Modern Version of the Enchanted Lamp . . . 1204
3.1 Technical Architecture - Overview . . . 1205
3.2 Hardware Components . . . 1206
3.3 Software Architecture . . . 1209
4 Discussion of Field Test Results and Outlook . . . 1216
4.1 Findings from Field Trials . . . 1216
4.2 Outlook On ALADIN's Future . . . 1217
4.3 Conclusions . . . 1219
References . . . 1220

Japanese Ubiquitous Network Project: Ubila . . . 1223
Masayoshi Ohashi
1 Introduction . . . 1223
2 Ubila's Vision . . . 1225
3 Approach . . . 1225
4 Various Ambient Service Designs and Prototypes . . . 1226
4.1 Basic Component Technology . . . 1226
4.2 Pilot Applications and Services . . . 1227
4.3 Collection and Dissemination of User Profiles . . . 1230
5 Field Trials . . . 1234
5.1 Live Commerce Akiba . . . 1234
5.2 Temperature Monitoring with Airy Notes . . . 1235
5.3 Lifelog Trial . . . 1235
6 Creation of Videos . . . 1237
6.1 Motivation . . . 1237
6.2 First Video: Small Stories in 2008 . . . 1237
6.3 Second Video: Aura . . . 1238
7 Building and Operating Smart Spaces . . . 1239
7.1 Akihabara Smart Space . . . 1239
7.2 Yurakucho Smart Space (uPlatea) . . . 1239
7.3 Kitakyushu Smart Space . . . 1240
8 Network or Protocol Oriented R&D . . . 1240
8.1 Protocol Enhancement: OSNAP and CASTANET . . . 1241
8.2 Adaptive Network Control Based on Network Context . . . 1242
8.3 Automatic Network Configuration . . . 1243
9 Defining Context Aware System Architecture . . . 1243
9.1 Ubiquitous Network Architecture and Context . . . 1243
9.2 Contributions to ITU-T . . . 1244
10 Symposium and Contributions to Academia and Fora . . . 1244
11 Related Activities and Future Trends in Japan . . . 1246
11.1 National R&D Projects in Parallel to Ubila . . . 1246
11.2 Other Ubiquitous Computing Related R&D Projects in Japan . . . 1246
11.3 Next Phase . . . 1247
12 Conclusion . . . 1247
References . . . 1248

Ubiquitous Korea Project . . . 1251
Minkoo Kim, We Duke Cho, Jaeho Lee, Rae Woong Park, Hamid Mukhtar, and Ki-Hyung Kim
1 Introduction . . . 1251
2 Ubiquitous Computing Network . . . 1252
2.1 Project Overview . . . 1252
2.2 Vision: Ubiquitous Smart Space (USS) . . . 1252
2.3 Reference System Architecture and Deployment of USS . . . 1252
2.4 R&D Strategy . . . 1252
2.5 Actual Results . . . 1253
2.6 Sub-projects . . . 1254
2.7 Test Bed . . . 1256
3 Intelligent Service Robots in Ubiquitous Environment . . . 1256
3.1 Introduction . . . 1256
3.2 Task and Knowledge Management Framework . . . 1257
3.3 Task Manager . . . 1259
3.4 Knowledge Manager . . . 1259
3.5 Service Agent . . . 1261
3.6 Results and Conclusion . . . 1262
4 U-Health Projects in Korea . . . 1263
4.1 Concepts of u-Health . . . 1263
4.2 Why u-Health? . . . 1264
4.3 U-Health Projects in Korea . . . 1266
4.4 Today's Reality to Overcome . . . 1269
5 USN Technology for Nation-wide Monitoring . . . 1269
5.1 IP-USN as an Integrating Technology . . . 1270
5.2 Design Goals of WSN Management . . . 1273
5.3 LNMP Architecture . . . 1274
5.4 Nationwide Deployment in Korea . . . 1275
References . . . 1276

Index . . . 1279
Part I
Introduction
Ambient Intelligence and Smart Environments: A State of the Art Juan Carlos Augusto, Hideyuki Nakashima, Hamid Aghajan
1 Introduction

Advances in the miniaturization of electronics are allowing computing devices with various capabilities and interfaces to become part of our daily life. Sensors, actuators, and processing units can now be purchased at very affordable prices. This technology can be networked and used, with the coordination of highly intelligent software, to understand the events and relevant context of a specific environment and to take sensible decisions in real-time or a posteriori. AmI is aligned with the concept of the “disappearing computer” [23, 24, 22]: “The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it.”
It is particularly interesting to contrast this time in history, when accomplishing Weiser’s aims sounds plausible, with the situation five decades ago, when computers were the size of a room and the idea of a computer being camouflaged within any environment was an unimaginable notion. Today different appliances have become so successfully integrated into our daily surroundings that we use them without consciously thinking about them. Computing devices have transitioned in this past half century from big mainframes to small chips that can be embedded in a variety of places. This has allowed various industries to silently distribute computing devices all around us, often without us even noticing, both in public spaces and in our more private surroundings. Now washing machines, heating systems and even toys come equipped with some capability for autonomous decision making. The automotive industry has become increasingly interested in embedding sensing mechanisms that allow the car to make decisions for a safer and less expensive journey, and as a result cars now have dozens of sensor and actuator systems that make decisions on behalf of, or to assist, the driver. Public spaces have become increasingly occupied with tracking devices which enable a range of applications such as sensing shoplifted objects and monitoring crowd behavior in a shopping mall. Whether it is our home anticipating when the lights should be turned on and off, when to wake us up or when to order our favorite food from the supermarket, a transport station facilitating commuting, or a hospital room helping to care for a patient, there are strong reasons to believe that our lives are going to be transformed in the next decades by the introduction of a wide range of devices which will equip many diverse environments with computing power. These computing devices will have to be coordinated by intelligent systems that integrate the resources available to provide an “intelligent environment”. This confluence of topics has led to the introduction of the area of “Ambient Intelligence” (AmI): “a digital environment that proactively, but sensibly, supports people in their daily lives.” [7]
Ambient Intelligence started to be used as a term to describe this type of development about a decade ago (see for example [25, 1, 16, 3]) and it has now been adopted to refer to a multidisciplinary area which embraces a variety of pre-existing fields of computer science as well as engineering, see Figure 1.

Fig. 1 Relationship between AmI and several scientific areas.

Given the diversity of potential applications, this relationship naturally extends to other areas of science such as education, health and social care, entertainment, sports, transportation, etc. Whilst AmI draws on all those areas, it should not be confused with any of them in particular. Networks, sensors, human-computer interfaces, pervasive computing and Artificial Intelligence (here we also include robotics and multi-agent systems) are all relevant, but none of them conceptually covers AmI in full. It is AmI which brings together all these resources, as well as many other areas, to provide flexible and intelligent services to users acting in their environments. By Ambient Intelligence we refer to the mechanisms that rule the behavior of the environment. The definition of Ambient Intelligence given above includes the need for a “sensible” system, and this means a system with intelligence. The definition reflects an analogy with how a trained assistant, e.g. a nurse, typically behaves. The assistant will help when needed but will refrain from intervening except when necessary. Being sensible demands recognizing the user, learning or knowing her/his preferences, and the capability to exhibit empathy with, or react to, the user’s mood and the prevailing situation; i.e., it implicitly requires the system to be sensitive. We reserve here the term “Smart Environments” (see for example [12]) to emphasize the physical infrastructure (sensors, actuators and networks) that supports the system. An important aspect of AmI has to do with interactivity.
On the one hand, there is a motivation to reduce explicit human-computer interaction (HCI), as the system is supposed to use its intelligence to infer the situations and user needs from the observed activities, as if a passive human assistant were observing the activities unfold with the expectation of helping when (and only if) required. Systems are expected to provide situation-aware information (I-want-here-now) through natural interfaces [18]. On the other hand, a diversity of users may need, or voluntarily seek, direct interaction with the system to indicate preferences, needs, etc. HCI has been an important area of computer science since the inception of computing as an area of study and development. Today, with so many gadgets incorporating computing power of some sort, HCI continues to thrive as an important AmI topic. It is also worthwhile to note that the concept of AmI is closely related to “service science” [21] in the sense that the objective is to offer proper services to users. It may not be of interest to a user what kind of sensors are embedded in the environment or what type of middleware architecture is deployed to connect them. Only the services given to the user matter to them. Therefore, the main thrust of research in AmI should be the integration of existing technologies rather than the development of each elemental device. Within the notion of service science, it is important to consider the implications of user evaluation in a “service-evaluation-research” loop (Figure 2). As part of their investigation, AmI researchers should devote efforts to setting up working intelligent environments in which users conduct their normal activities. As the system is distributed in the environment, users only notice the services they receive. The researchers then observe the interaction and (re-)evaluate the system. The result of the evaluation may lead to identifying new services, new viewpoints or even new evaluation criteria. Then all of the findings can be fed back into new research and development.

This Handbook is organized in several thematic parts. Each of these parts provides a state of the art in research and development related to the specific sub-areas contributing to the broader subject of this volume. Section 2 focuses on the infrastructure and how sensors (including vision as a high-level way of capturing information from an environment) are networked and utilized in a few application settings. Section 3 considers the technology that can be built over a networked sensing infrastructure to make resources widely available in an unobtrusive way. Section 4 gathers contributions on the interaction between humans and artificial systems. Section 5 focuses on developments that aim to make artificial systems more rational. This is complemented with Section 6, focused on Multi-Agent Systems (MAS); while MAS is an area with its own development trajectory, these systems contribute to the resources an artificial system can use to understand different situations and to decide intelligently. Section 7 provides a wide range of applications, as well as consideration of the impact this technology can have on our lives (safety, privacy, etc.). Section 9 provides insight into some of the recent major projects developed around the world.
Fig. 2 The loop of R&D, service, and evaluation.
2 Sensors, Vision, and Networks

The focus of this thematic part is on the role vision and sensor networks play in acquiring information from the environment and events for responsive and reactive applications. As illustrated in Figure 3, these networks can offer a variety of inferences from the events and human activities of interest [5]. The information is then transferred to high-level reasoning modules for knowledge accumulation in applications involving behavior monitoring, or for reacting to the situation in applications based on ambient intelligence and smart environments. Another type of interaction between vision and higher-level application context may involve visualization of events or human actions, which can take the form of video distribution or avatar-based visual communication [4]. In both kinds of interfaces, two-way communication between the vision and data processing modules on the one hand, and the high-level reasoning and visualization modules on the other, can enable effective information acquisition by guiding vision to the features of interest and validating the vision output based on prior knowledge, context, and accumulated behavior models.
Fig. 3 The role of vision and sensor networks in interaction with high-level reasoning and visualization.
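To make the two-way exchange depicted in Figure 3 more concrete, the sketch below shows one way a vision or sensor node could package its metadata for a high-level reasoning module, and how that module might return validation feedback. It is only an illustrative sketch: the names (VisionEvent, ReasoningFeedback, validate) and the confidence-threshold rule are hypothetical and are not drawn from any chapter in this part.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import time

@dataclass
class VisionEvent:
    """Metadata a camera or sensor node might push to the reasoning layer."""
    source_id: str                     # which camera/sensor produced the observation
    timestamp: float                   # seconds since the epoch
    event_type: str                    # e.g. "person_detected", "pose_update"
    confidence: float                  # detector confidence in [0, 1]
    attributes: Dict[str, float] = field(default_factory=dict)  # e.g. 3D position

@dataclass
class ReasoningFeedback:
    """Guidance the reasoning layer might send back to the vision layer."""
    accepted: bool                                        # consistent with context and behavior models?
    refocus_on: List[str] = field(default_factory=list)   # features the node should attend to next

def validate(event: VisionEvent, min_confidence: float = 0.6) -> ReasoningFeedback:
    # Trivial stand-in for validation against prior knowledge and context:
    # low-confidence observations are rejected and the node is asked to re-observe.
    if event.confidence < min_confidence:
        return ReasoningFeedback(accepted=False, refocus_on=[event.event_type])
    return ReasoningFeedback(accepted=True)

if __name__ == "__main__":
    ev = VisionEvent("cam-03", time.time(), "person_detected", 0.82,
                     {"x": 1.4, "y": 0.2, "z": 0.0})
    print(validate(ev))   # ReasoningFeedback(accepted=True, refocus_on=[])
```

In a deployment the feedback loop would carry much richer guidance, such as regions of interest and expected behavior models; the point here is only the shape of the two-way exchange.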
In this section the contributing authors offer an overview of the latest trends in algorithm and application development based on visual and sensory inputs.
1. A Survey of Distributed Computer Vision Algorithms
Recent years have seen great advances in computer vision research dealing with large numbers of cameras. However, many multi-camera computer vision algorithms assume that the information from all cameras is communicated in a lossless fashion to a central processor that solves the problem. This assumption is unrealistic for emerging wireless camera networks, which may contain processors with limited capability, antennas with limited power, and batteries with limited life. In this chapter, an overview of algorithms that solve computer vision problems in a distributed manner is presented, and their implementation on a visual sensor network is discussed. Methods for topology estimation, camera calibration, and visual tracking are introduced as case studies.

2. Vision-based People Tracking
Human pose tracking is a key enabling technology for myriad applications, such as the analysis of human activities for perceptive environments and novel man-machine interfaces. Nevertheless, despite years of research, 3D human pose tracking remains a challenging problem. Pose estimation and tracking is currently possible in constrained environments with multiple cameras and under limited occlusion and scene clutter. Monocular 3D pose tracking in unconstrained natural scenes is much more difficult, but recent results are promising. This chapter provides an introduction to the principal elements of an approach to people tracking in terms of Bayesian filtering. It discusses likelihood functions which characterize the consistency between a hypothetical 3D pose and image measurements, including image appearance and image gradients. To help constrain pose estimation solutions, common prior models over plausible 3D poses and motions are considered. Toward this end, due to the high dimensionality of human pose, dimensionality reduction techniques are often employed as a means of constraining the inferred poses to lie on a manifold spanned by training examples. Since the underlying problems are non-linear and result in non-Gaussian posterior distributions, online inference is achieved with particle filters or direct optimization (a schematic particle-filter loop is sketched after this list). The chapter also discusses discriminative methods for static pose estimation, which are found helpful for initialization and transient failure recovery.

3. Locomotion Activities in Smart Environments
One sub-area in the context of ambient intelligence concerns the support of moving objects, i.e. to monitor the course of events while an object crosses a smart environment and to intervene if the environment should provide assistance. For this purpose, the smart environment has to employ methods of knowledge representation and spatio-temporal reasoning. This enables the support of such diverse tasks as way-finding, spatial search, and collaborative spatial work. A number of case studies are introduced and examined in this chapter to demonstrate how smart environments can effectively make use of the information about the movement of people. A specific focus lies on the research question of what minimal information suffices for the smart environment to deal with approximate motion information in a robust and efficient way, and under realistic conditions. The study indicates that a thorough analysis of the task at hand can enable the smart environment to achieve results using a minimum of sensory information.

4. Tracking in Urban Environments using Sensor Networks based on Audio-Video Fusion
Recent advances in sensor networks, micro-scale, and embedded vision technologies are enabling the vision of ambient intelligence and smart environments. Environments will react in an attentive, adaptive and active way to the presence and activities of humans and other objects in order to provide intelligent and smart services. Heterogeneous sensor networks with embedded vision, acoustic, and other sensor modalities offer a pivotal technology solution for realizing such systems. Networks with multiple sensing modalities are gaining popularity in different fields because they can support multiple applications that may require diverse resources. This chapter discusses a heterogeneous network consisting of audio and video sensors for multimodal tracking in an urban environment. If a moving target emits sound, then both audio and video sensors can be utilized for tracking. Audio and video can complement each other, and each can improve the performance of the other modality through data fusion. Multimodal tracking can also provide cues for the other available modalities for actuation. An approach for multimodal target tracking in urban environments is demonstrated, utilizing a network of mote-class devices equipped with microphone arrays as audio sensors and embedded PCs equipped with web cameras as video sensors.

5. Multi-Camera Vision for Surveillance
In this chapter, a variety of visual surveillance systems are surveyed. Visual surveillance methods are classified according to criteria which distinguish between systems intended to acquire detailed information on a small number of subjects, systems designed to track as many subjects as possible, and systems designed to acquire trajectories of subjects and to recognize scenes in specific environments. These systems are also classified based on camera types and situations.
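Because the people-tracking chapter above (item 2) casts tracking as Bayesian filtering with particle filters, the fragment below sketches the generic predict-weight-resample cycle on a deliberately simplified one-dimensional state. It is not that chapter's algorithm: the random-walk motion model, Gaussian likelihood, and all parameter values are toy assumptions chosen only to make the loop runnable.

```python
import numpy as np

def particle_filter(observations, n_particles=500, motion_std=0.5, obs_std=1.0):
    """Bootstrap particle filter for a 1D state observed with Gaussian noise."""
    rng = np.random.default_rng(0)
    particles = rng.normal(0.0, 5.0, n_particles)   # diffuse prior over the state
    estimates = []
    for z in observations:
        # Predict: propagate each particle through a random-walk motion model.
        particles = particles + rng.normal(0.0, motion_std, n_particles)
        # Weight: likelihood of the observation under each hypothesized state.
        weights = np.exp(-0.5 * ((z - particles) / obs_std) ** 2) + 1e-300
        weights /= weights.sum()
        # Estimate: posterior mean over the weighted particle set.
        estimates.append(float(np.dot(weights, particles)))
        # Resample: draw particles in proportion to their weights.
        particles = particles[rng.choice(n_particles, size=n_particles, p=weights)]
    return estimates

if __name__ == "__main__":
    true_path = np.cumsum(np.full(50, 0.3))          # a target drifting slowly to the right
    noisy_obs = true_path + np.random.default_rng(1).normal(0.0, 1.0, 50)
    print(particle_filter(noisy_obs)[-5:])           # estimates should approach about 15
```

In real pose tracking the state would be a high-dimensional joint configuration and the likelihood would compare a hypothesized pose against image appearance and gradients, but the structure of the loop is the same.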
3 Mobile and Pervasive Computing

This thematic part focuses on the physical aspects of mobility and the pervasiveness of devices, as well as on systems that support their mobility. Logical mobility, such as mobile agents, is treated in Section 6, and contextual issues are treated in Section 4.
Middleware architecture plays an important role here, since the connections between devices are not fixed. We need a flexible and lightweight operating system, not only for connecting sensors and actuators, but for providing seamless connections from devices to users (Figure 4). For context-aware computation, some of the context information, such as location, time, temperature, etc., must be handed over to application software modules in some predefined format. To accomplish this, data conversion, and even data fusion from many different devices, must be handled by the middleware. Handling security also becomes a different issue due to mobility, since traditional methods that assume a fixed configuration may not be applicable.
Fig. 4 AmI architecture with the middleware.
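As a small illustration of the data-conversion role described above, the sketch below normalizes device-specific readings into a single predefined context record before delivering them to application modules. The ContextBroker class, the ContextRecord fields, and the Fahrenheit-reporting sensor are all hypothetical; the real middleware systems discussed in this part are considerably richer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ContextRecord:
    """A predefined format for context handed to application modules."""
    source: str
    kind: str          # e.g. "location", "temperature", "time"
    value: float
    unit: str

class ContextBroker:
    """Toy middleware core: converts device-specific readings and notifies subscribers."""
    def __init__(self) -> None:
        self._converters: Dict[str, Callable[[dict], ContextRecord]] = {}
        self._subscribers: List[Callable[[ContextRecord], None]] = []

    def register_converter(self, device_type: str,
                           fn: Callable[[dict], ContextRecord]) -> None:
        self._converters[device_type] = fn

    def subscribe(self, fn: Callable[[ContextRecord], None]) -> None:
        self._subscribers.append(fn)

    def publish(self, device_type: str, raw: dict) -> None:
        record = self._converters[device_type](raw)     # device-specific -> common format
        for fn in self._subscribers:
            fn(record)                                  # deliver to application modules

if __name__ == "__main__":
    broker = ContextBroker()
    # A hypothetical sensor reporting Fahrenheit is converted to Celsius on the way in.
    broker.register_converter(
        "acme_thermo",
        lambda raw: ContextRecord(raw["id"], "temperature",
                                  (raw["temp_f"] - 32) * 5 / 9, "C"))
    broker.subscribe(lambda rec: print(f"{rec.kind} from {rec.source}: {rec.value:.1f} {rec.unit}"))
    broker.publish("acme_thermo", {"id": "kitchen-1", "temp_f": 72.5})
```

A real platform would also have to deal with discovery, security, and unreliable connectivity, which this toy broker ignores.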
This thematic part consists of the following chapters:

1. Collaboration Support for Mobile Users in Ubiquitous Environments
Collaboration among people happens ubiquitously. User mobility and seamless access to artifacts and devices are natural parts of collaboration. This chapter first discusses some core characteristics of collaboration, addressing issues connected with the physical embodiment of resources, informal collaboration, and changes in context, considering that the context of any collaboration includes physical as well as social elements. Discussing these characteristics, it focuses on the intersection between the research fields of Computer Supported Cooperative Work (CSCW), Ambient Intelligence (AmI) and ubiquitous computing. A brief overview of technology supporting ubiquitous collaboration is provided. The last part of the chapter focuses on current research activities at Ubicollab, including the development of a platform for the design of ubiquitous applications to support cooperation among distributed users.

2. Pervasive Computing Middleware
Pervasive computing envisions seamless application support for user tasks. In most scenarios, these applications are executed on sets of networked computers that are integrated invisibly into everyday objects. Due to user mobility, the resulting systems are dynamic. As a consequence, applications need to cope with ever-changing execution environments and vastly different user context. Without further support, the development of such applications is a complex and error-prone task. Middleware systems can help to reduce the resulting complexity by providing a set of supportive services. This chapter discusses such middleware systems and their services. To do that, the chapter first presents a number of design considerations that have a significant impact on middleware and service design. Thereafter, it discusses the three main functionalities of pervasive computing middleware. These functionalities provide support for (1) spontaneous remote interaction, (2) context management, and (3) application adaptation. To enable the analysis of a broad range of existing systems, the chapter introduces subcategories, compares their advantages and disadvantages, and classifies approaches accordingly.

3. Case Study of Middleware Infrastructure for Ambient Intelligence Environments
A variety of application services in ambient intelligence environments can be realized at reduced cost by encapsulating complex issues in middleware infrastructures that are shared by the applications. Since these attractive services are typically quite complicated, their development is not easy for average programmers. High-level abstractions offered by middleware infrastructures make it possible to hide the complexities of ambient intelligence environments. This chapter first discusses several potential future services in ambient intelligence environments, and how high-level abstractions offered by middleware infrastructures facilitate their development. Then, it illustrates the effectiveness of the high-level abstractions through several case studies of middleware designs. Finally, the chapter describes a number of important design issues for the future development of middleware infrastructures.

4. Collaborative Context Recognition for Mobile Devices
This chapter investigates collaborative methods for context awareness with mobile devices. Interpretation of context often has to rely on unreliable sensor information that is noisy, conflicting, or simply missing. To limit the search space of possible interpretations it is useful to have reasonable default conditions. The chapter proposes to obtain the defaults from other people (or their devices). In a way, mobile devices would then mimic humans: observe other people and see how they behave in unfamiliar situations. For instance, if everyone else keeps quiet in a concert, perhaps we should keep quiet too. If everyone has switched their phone to “silent” mode, maybe ours should be set to the same – automatically. The chapter discusses experiments suggesting that such collaboration not only provides useful first guesses for context analysis, but also improves the recognition accuracy over single-node algorithms. Through short-range links, mobile devices can query each other’s context information and then jointly (and individually) arrive at a shared interpretation of the situation. The chapter describes possible methods for using the shared context data and outlines some potential application areas as well as the potential risks of the approach (a toy illustration of this kind of collaborative inference is sketched after this list).

5. Security Issues in Ubiquitous Computing
With ubiquitous computing, everyday objects become networked computing devices. What new security risks does this revolution introduce? Small devices such as RFID tags severely constrain the gate count and the energy budget: how can we build secure protocols on such a foundation? At the next level up, devices with greater resources such as cell phones and A/V components still need to be paired up securely so as to be able to give orders to each other: how can this be done, especially over radio links where you can never be sure of the origin of a message? Going up another level, how can you defend your privacy if wearing networked devices makes you trackable everywhere you go? And finally, how can we reconcile security with usability? This last point is probably one of the most pressing challenges facing designers of ubiquitous systems today.

6. Pervasive Systems in Health Care
As the computer disappears in the environments surrounding our activities, the objects therein are transformed into artifacts, that is, they are augmented with Information and Communication Technology (ICT) components (i.e. sensors, actuators, processor, memory, wireless communication modules) and can receive, store, process and transmit information. Artifacts may be totally new objects or improved versions of existing objects, which, by using the ambient technology, allow people to carry out novel or traditional tasks in unobtrusive and effective ways. This chapter surveys the application of pervasive systems in supporting activities with health significance. Furthermore, an abstract architecture of pervasive health care systems is presented. Then, as an example of a model architecture, the chapter describes and analyzes the HEARTS system, a pervasive system architecture that integrates a number of technologies to enable and support patients with congestive heart failure.
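To illustrate the collaborative idea in item 4 above, the toy fragment below combines a device's own uncertain context guess with the contexts reported by nearby devices, using a confidence-weighted vote. This is only a sketch of the general principle, not the method proposed in that chapter; the labels and confidence values are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def fuse_context(own_guess: Tuple[str, float],
                 neighbour_reports: List[Tuple[str, float]]) -> str:
    """Pick the context label with the largest total confidence.

    Each report is a (label, confidence) pair, e.g. ("concert", 0.7);
    neighbours' reports act as default evidence when local sensing is weak.
    """
    scores: Dict[str, float] = defaultdict(float)
    for label, confidence in [own_guess, *neighbour_reports]:
        scores[label] += confidence
    return max(scores, key=scores.get)

if __name__ == "__main__":
    # Local sensing is unsure, but most nearby phones report a concert and are silent.
    own = ("street", 0.4)
    nearby = [("concert", 0.9), ("concert", 0.8), ("concert", 0.7), ("street", 0.5)]
    if fuse_context(own, nearby) == "concert":
        print("switching profile to silent")
```

A real system would also weight reports by proximity and trust, and would have to address the privacy risks the chapter discusses.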
4 Human-centered Interfaces

The area of smart environment design is the confluence of a multitude of disciplines involving sensor fusion, human-computer interfaces, networking and pervasive computing, and actuation and response services. Developed upon the premise of a user-centric data extraction and decision making paradigm, many smart environment applications support interactivity based on interpretation of the gestures, regions of interest, and interactions of the user with the surrounding environment components [6]. Thanks to the proliferation of inexpensive sensors such as cameras, as well as embedded processors, unprecedented potential exists for realizing novel real-time applications such as immersive human-computer interfaces for virtual reality, gaming, teleconferencing, smart presentations, and gesture-based control, as well as other human-centric applications such as patient monitoring and assistive technologies. Integration of multiple sensors in a distributed human-centric interface embodies new research and development opportunities in algorithm design methods based on collaborative sensing and processing, data fusion, event interpretation, context extraction, and the generation of behavior models. While the notion of human centricity finds many interpretations, its true meaning goes beyond any of the individual aspects that have been discussed in the literature. It truly refers to a new paradigm in developing and utilizing technology to serve its user in whatever form and flavor best offers the intended experience of the application of interest to the user. In the new paradigm, privacy management, ease of use, unobtrusive design, and customized systems can each be part of the definition of a human-centric interface. However, these factors are not regarded as rigid requirements when the essence of human centricity is considered, meaning that different levels of each factor, and the relative priorities between them, need to be derived from the context and objective of the application. In other words, the same vision sensing mechanism that may be perceived as intrusive when used in surveillance applications can, for example, be employed in a human-centric paradigm to improve the quality of life of patients and the elderly by timely detection of abnormal events or accidents.

Chapters in this thematic part offer an overview of the studies and methodologies of human-centric interface design for applications in smart environments.

1. Human-centered Computing
Human-centered Computing (HCC) is a set of methodologies that applies to any field that uses computers, in any form, in applications in which humans directly interact with devices or systems that use computer technologies. This chapter offers an overview of HCC from different perspectives. It describes the three main areas of HCC: content production, analysis, and interaction. In addition, the chapter identifies the core characteristics of HCC, provides example applications and investigates the deployment of existing systems into real world conditions. Finally, a research agenda for HCC is proposed for future development.
2. End-User Customization of Intelligent Environments This chapter describes a methodology that enables non-technical occupants of intelligent environments to translate conceptual ideas for the functionality of networked environments and appliances into concrete designs. It builds on a technique called “programming by example” which was first proposed by David Canfield Smith (Stanford University) in the mid-seventies and, later, continued by Henry Lieberman (MIT) in the nineties. The chapter shows how this approach, which has hitherto been applied only to single computing platforms, was extended to allow non-technical people to program distributed computing platforms of the type that make up intelligent environments. In order to achieve this vision, the chapter introduces a number of novel concepts and methodologies, in particular a new networked service aggregation model (the deconstructed appliance model), a representation for virtual appliances (meta-appliances/applications), an ontology to support decomposition (dComp) and a tool for programming coordinated behavior in rule-based network artifacts (Pervasive Interactive Programming). Finally, the chapter reports on a user evaluation of this methodology, which demonstrates that users find the methods both simple and enjoyable to use. 3. Intelligent Interfaces to Empower People with Disabilities Smart environments are essential for the well-being of people with severe paralysis. Intelligent interfaces can immensely improve their daily lives by allowing them to communicate and participate in the information society, for example, by browsing the web, posting messages, or emailing friends. People with motion impairments often prefer camera-based interfaces, because these are customizable, comfortable, and do not require user-borne accessories that could draw attention to their disability. This chapter presents smart environments that enable users to communicate with the computer via inexpensive video cameras. The video input is processed in real time so that users can work effectively with assistive software for text entry, web browsing, image editing, and animation. Because motion abilities differ from person to person and may change due to a progressive disease, the solution offers a number of environments from which the user may choose. Each environment addresses a different special need. Users may also create their own smart environments that understand their gestures. 4. Visual Attention, Speaking Activity, and Group Conversational Analysis in Multi-Sensor Environments One of the most challenging domains in multi-sensor environments is the automatic analysis of multi-party conversations from sensor data. While spoken language constitutes a strong communicative channel, nonverbal communication is also fundamental, has been substantially documented in social psychology
and cognition, and is opening new computational problems in smart spaces equipped with microphones and cameras. This chapter discusses the role of two nonverbal cues for analysis of group conversations, namely visual attention and speaking activity. The chapter reviews techniques for the extraction of such cues from real conversations, summarizing novel approaches based on contextual and audio-visual integration, and discusses their use in emerging applications on face-to-face and remote conversation enhancement and in the recognition of social behavior. 5. Using Multimodal Sensing for Human Activity Modeling in the Real World This chapter describes experiences from a five-year investigation into the design and deployment of a wearable system for automatically sensing, inferring, and logging a variety of human physical activity. The chapter highlights some of the key findings resulting from the deployment of a working system in three-week and three-month real-world field trials, which are framed in terms of system usability, adaptability, and credibility. 6. Recognizing Facial Expressions Automatically from Video Facial expressions are the facial changes that occur in response to a person’s internal emotional states, intentions, or social communications. Computer recognition of facial expressions has many important applications such as intelligent human-computer interaction, computer animation, and awareness systems. This chapter introduces recent advances in computer recognition of facial expressions. It first describes the problem space, which includes multiple dimensions: level of description, static versus dynamic expression, facial feature extraction and representation, expression subspace learning, facial expression recognition, posed versus spontaneous expression, expression intensity, controlled versus uncontrolled data acquisition, correlation with bodily expression, and databases. The state of the art of facial expression recognition is also surveyed from these different aspects. In the second part, the chapter proposes an approach to recognizing facial expressions using discriminative local statistical features. In particular, Local Binary Patterns are investigated for facial representation (a brief illustrative sketch of this representation is given at the end of this section). Finally, the current status, open problems, and research directions are discussed. 7. Sharing Content and Experiences in Smart Environments This chapter discusses the possibilities and challenges arising from the growing amount of content that can be accessed on-line anywhere, including being carried around in mobile devices or even being embedded in our environment. The chapter focuses on consumer contents such as music, pictures, video clips, streaming content and games, which people may create, carry around and share with others. The content may be enjoyed alone or together with others via mobile devices or appliances in the environment. Mobility allows novel content delivery mechanisms based on the identified context. Current and future application potentials, the issues relating to user expectations, the concept of experience
as opposed to simple content, and the options for experience sharing are discussed. The chapter also reviews interaction solutions that exploit the environment to provide a synergetic interface for content management and rendering. User-driven design is emphasized: the user acts not only as a content consumer but also as a content producer and co-designer when setting up shared content environments. Technological and other challenges arising from these issues are presented. 8. User Interfaces and HCI for Ambient Intelligence and Smart Environments This chapter aims to present a systematic structure of the field of User Interfaces, as it applies to Ambient Intelligence and Smart Environments. It starts with the machine side (i.e., the hardware in use) and the human side (i.e., conceptual models, human issues in interaction), then proceeds to sections about designing and evaluating UIs, and ends with some case studies. 9. Multimodal Dialogue for Ambient Intelligence and Smart Environments One goal of Ambient Intelligence and Smart Environments is to make it possible for users to interact naturally with their environment, for example, using speech and/or gestures to switch on the TV or turn off a lamp. This goal can be addressed by setting up a specialized dialogue system that interacts with the user and his environment. Such a system must process the data provided by a diversity of sensors placed in the environment (e.g., microphones, cameras, and presence detectors), which capture information from the user. The dialogue system may need to employ several dialogue management strategies, user models and contextual information to interact with the user and the environment. The interaction may be initiated by the user or by the system if the latter is designed to be proactive. The dialogue system must then decide on the proper actions to be performed in response to the data captured from the user and the status of the environment. This chapter discusses several issues regarding the implementation of this kind of system, focusing on context awareness, handling of multimodal input, dialogue management, response generation and evaluation.
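To make the Local Binary Pattern representation mentioned in the facial-expression chapter above a little more concrete, the following minimal Python sketch computes the basic 8-neighbor LBP code for every interior pixel of a grayscale image and summarizes the image as a normalized histogram of those codes. This is an illustration of the general idea only, not the implementation used in that chapter; the NumPy input array and the 256-bin histogram are assumptions of the example.

    import numpy as np

    def lbp_histogram(gray):
        # Basic 8-neighbor Local Binary Pattern histogram of a 2-D grayscale array.
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                   (1, 1), (1, 0), (1, -1), (0, -1)]
        h, w = gray.shape
        center = gray[1:h-1, 1:w-1]
        codes = np.zeros((h - 2, w - 2), dtype=int)
        for bit, (dy, dx) in enumerate(offsets):
            neighbor = gray[1+dy:h-1+dy, 1+dx:w-1+dx]
            # Set bit 'bit' wherever the neighbor is at least as bright as the center pixel.
            codes += (neighbor >= center).astype(int) * (1 << bit)
        # The normalized 256-bin histogram of codes serves as the texture descriptor.
        hist, _ = np.histogram(codes, bins=256, range=(0, 256))
        return hist / hist.sum()

Practical systems typically divide the face into blocks and concatenate per-block histograms; optimized LBP variants are also available in libraries such as scikit-image.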
5 Artificial Intelligence and Robotics Artificial Intelligence and Robotics are well-established areas of research which have witnessed good progress in the past few decades. Yet it is recognized that they have not reached their full potential, and much work needs to be done in these areas to develop methods for practical AmI and SmE applications. From expert systems to multi-agents there has been significant work and many lessons learned. The arrival of the topic of AmI, which calls for new computing methods to deliver services in the expected way (see [9] and [20]), and the deployment of AmI
concepts over Smart Environments, which are characterized by a network of interconnected devices, each with limited capability, present both a new opportunity and a new challenge. The opportunity lies in the chance for AI to get closer to people and improve the way they live. The focus has shifted from justifying the need for AI and robots to support humankind in missions where it is not possible or desirable to involve humans (e.g., going to an inhospitable planet like Mars, entering a building in flames or in danger of collapse, or deactivating a bomb) to more humble services that are closer to people. Now systems can prove their intelligence in simpler (yet still enormously challenging) tasks such as helping drivers find their way around a city, helping tourists make the most of a museum visit, or making vulnerable people feel safer by notifying carers when their health or safety seems compromised. All this is very healthy for AI, because it allows the field to demonstrate its usefulness in practical applications, to reach outside the lab, and to deal with the uncertainties of everyday life in the real world. The challenge lies in the complexity that is part of any element of real applications. Systems developed under this concept should be intelligent enough to detect the situations that matter and to react to them appropriately and reliably. Relevant to the development of AI systems that can make an environment truly intelligent is the need for such an environment to be able to: (a) learn the habits, preferences and needs of the occupants, (b) correctly diagnose situations, (c) be aware of where and when the events of interest occur, (d) integrate mobile elements like robots, and (e) provide a structured way to analyze, decide and react over that environment. The chapters in this thematic part provide a step forward in addressing these issues. 1. Learning Situation Models in Smart Spaces Smart Spaces enhance user capabilities and comfort by providing new services and automated service execution based on sensed user activity. However, to become really “smart”, such spaces should not only provide some service automation, but also learn and adapt their behavior during use. This chapter motivates and investigates learning of situation models in a smart space, covering knowledge acquisition from observation as well as evolving situation models during use. An integral framework for acquiring and evolving different layers of a situation model is detailed. Different learning methods are presented as part of this framework: role detection per entity, unsupervised extraction of situations from multimodal data, supervised learning of situation representations, and the evolution of a predetermined situation model with feedback. The situation model serves as frame and support for the different methods, permitting them to stay within an intuitive declarative framework. An implementation of the whole framework for a smart home environment is described, and the results of several evaluations are presented. 2. Smart Monitoring for Physical Infrastructures Condition monitoring of physical infrastructures such as power and railway networks is an important prerequisite for optimizing operations and maintenance. Today, various sensor systems are in place but generally lack integration for assessing
conditions more precisely. This is a typical challenge faced by context-aware ambient intelligence systems, which tackle it using formal knowledge representation and reasoning. This chapter adopts the knowledge-based approach for infrastructure monitoring: Description Logic-based ontologies are used for developing declarative, extensible and machine-workable models of symptoms, faults, and conditions. A novel approach for reasoning over distributed knowledge bases of context information is introduced. Finally, a case study on European railway monitoring demonstrates the feasibility and potential of the proposed approach. 3. Spatio-Temporal Reasoning and Context Awareness An important aspect of ambient intelligence in smart environments is the ability to classify complex human behavior and to react to it in a context-specific way. For example, a heating system can detect if a human is present in the room, but for it to react appropriately and adjust the temperature to a comfortable level, it needs to consider a wider context: someone watching TV, for example, might require a warmer room than somebody engaged in exercise, whereas someone entering simply to look for a book may be leaving the room shortly. The classification of complex human behavior requires an advanced form of spatio-temporal reasoning, since such classification is often dependent on where or when the behavior occurs. For example, someone walking around a triangle repeatedly, stopping at various points along the route, would look odd unless it occurred in a kitchen, with the triangle being the space between the fridge, sink, and stove. Even in this space, a recurring pattern of behavior has to be interpreted in different ways, depending on the time interval it occurs in – cooking a large meal in the middle of the night would still be unusual. This chapter looks at some of the major conceptual and methodological challenges around spatio-temporal reasoning and context awareness and at methods of approaching them. 4. From Autonomous Robots to Artificial Ecosystems In recent years, the new paradigm of ubiquitous robotics has attracted many researchers worldwide for its tremendous implications: mobile robots are moving entities within smart environments, where they coexist and cooperate with a plethora of distributed and smart nodes that can be assigned a certain amount of computational resources and a certain degree of autonomy. This chapter is organized in three main parts: the first part surveys the state of the art, pointing out analogies and correspondences between different approaches; the second part describes a real-world scenario: the setup of a robot team for autonomous transportation and surveillance within an intelligent building; and the third part introduces the artificial ecosystem framework as a case study to investigate how ubiquitous robotics can be exploited for the deployment of robotic applications in complex real-world scenarios.
5. Behavior Modeling for Detection, Identification, Prediction, and Reaction in AI Systems Solutions Behavior modeling automation is a requirement in today’s AI application development, and the need to fuse data from a large number of sensors demands such automation in the operations that observe, fuse and interpret the data. Abnormal behavior detection, ambient intelligence, and intelligent decision making are a few of the many applications of automating human intelligence. This chapter presents a generalized software approach to implementing such solutions, breaking the behavior modeling problem into four distinct intelligence functions, described and implemented using the subsystems of Detection, Identification, Prediction, and Reaction (DIPR). These subsystems also allow all sensor features to be fused for intelligent use. Behaviors can be modeled as sequences of classified features, fused with intelligence rules and events in time and space. Multiple features are extracted with low-level classifiers in time and space by the “Detection” subsystem. The resulting temporal (feature, space) matrix is combined by the “Identification” subsystem with simple low-level rules to form intelligent states (symbols). The “Prediction” subsystem concatenates symbols over time and space to form spatio-temporal sequences; these sequences are classified into behaviors and then passed through a module that infers the predicted behavior outcome. The final stage is the “Reaction” subsystem, which turns the prediction into an action-producing intelligent solution. The DIPR system is shown overlaid onto a number of different distributed networks, and several pre-engineered application models are presented to demonstrate its applications.
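As a purely illustrative sketch of how such a Detection-Identification-Prediction-Reaction chain can be organized in software (the stage names follow the chapter, but the data structures, rules, and helper callables below are hypothetical and not the chapter's actual system), the four subsystems can be composed as follows:

    from collections import deque

    class DIPRPipeline:
        # Toy Detection -> Identification -> Prediction -> Reaction chain.
        def __init__(self, thresholds, rules, behavior_classifier, reactions, history=20):
            self.thresholds = thresholds                    # sensor name -> detection threshold
            self.rules = rules                              # symbol -> predicate over detected features
            self.behavior_classifier = behavior_classifier  # maps a symbol sequence to a behavior label
            self.reactions = reactions                      # behavior label -> action
            self.symbols = deque(maxlen=history)            # recent symbols, oldest first

        def detect(self, readings):
            # Detection: keep only the sensor readings that exceed their thresholds.
            return {name: value for name, value in readings.items()
                    if value >= self.thresholds.get(name, float("inf"))}

        def identify(self, features):
            # Identification: combine detected features with simple rules to form a symbol.
            for symbol, rule in self.rules.items():
                if rule(features):
                    return symbol
            return "unknown"

        def predict(self):
            # Prediction: classify the recent symbol sequence into a behavior.
            return self.behavior_classifier(tuple(self.symbols))

        def react(self, behavior):
            # Reaction: map the predicted behavior to an action (or a no-op).
            return self.reactions.get(behavior, "no-op")

        def step(self, readings):
            # One pass through the chain for a single time step of sensor data.
            self.symbols.append(self.identify(self.detect(readings)))
            return self.react(self.predict())

The point of the sketch is the separation of concerns: each subsystem consumes only the output of the previous one, so classifiers, rules, and reactions can be swapped independently.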
6 Multi-Agents Studies of intelligent agents are roughly divided into three categories: (1) those trying to design human-like intelligent agents, (2) those trying to implement (semi-)intelligent autonomous software modules, and (3) those emphasizing the coordination or collaboration among many agents (multi-agents). Agents of the latter two types are extremely useful in AmI. To provide useful functionality in the environments, we have to implement some autonomously functioning software into devices embedded in the environment. Agent technologies are useful in creating such functions. Moreover, to make those devices collaborate with each other and to make them function in unity, we need multi-agent technologies. 1. Multi-Agent Social Simulation The difficulty in the introduction of a new IT system lies in the uncertainty of its effect on society. The benefit of any IT system appears when it is implemented deeply into society and impacts social activities in a large scope. Social simulation, especially multi-agent simulation, will provide an important
tool to evaluate how the IT system may change the behavior of society. This chapter discusses the capabilities of multi-agent systems in social simulations, and provides some examples of multi-agent simulations that evaluate new IT methodologies. 2. Multi-Agent Strategic Modeling in a Specific Environment One of the tasks in ambient intelligence is to deduce the behavior of agents in the environment. Here we present an algorithm for multi-agent strategic modeling applied in a robotic soccer domain. The algorithm transforms a multi-agent action sequence into a set of strategic action descriptions in a graphical and symbolic form. Moreover, graphic and symbolic strategic action descriptions are enriched with corresponding symbolic rules at different levels of abstraction. The domain-independent method was evaluated on the RoboCup Soccer Server Internet League data, using hierarchically ordered domain knowledge. 3. Learning Activity Models for Multiple Agents in a Smart Space Intelligent environment research has resulted in many useful tools such as activity recognition, prediction, and automation. However, most of these techniques have been applied in the context of a single resident. A looming issue for intelligent environment systems is performing these same techniques when multiple residents are present in the environment. This chapter investigates the problem of attributing sensor events to individuals in a multi-resident intelligent environment. Specifically, a naive Bayesian classifier and a hidden Markov model are used to identify the resident responsible for a sensor event (a toy sketch of such an attribution step is given at the end of this section). Results of experimental validation in a real intelligent workplace testbed are presented, and the unique issues that arise in addressing this challenging problem are discussed. 4. Mobile Agents Mobile agent technology has been promoted as an emerging technology that makes it much easier to design, implement, and maintain distributed systems, including cloud computing and sensor networks. It provides an infrastructure not only for executing autonomous agents but also for migrating them between computers. This chapter discusses the potential uses of mobile agents in distributed systems, lists their potential advantages and disadvantages, and describes technologies for executing, migrating, and implementing mobile agents. The chapter presents several practical and potential applications of mobile agents in distributed systems and smart environments.
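As a toy illustration of the attribution step described in the multi-resident chapter above, a naive Bayes classifier can score each resident for an observed sensor event from counts of previously labelled events. The sensor and resident labels, the smoothing constant, and the tiny training set below are invented for the example; the chapter's actual models are trained on real testbed data and also use a hidden Markov model over event sequences.

    from collections import defaultdict
    import math

    class ResidentAttributor:
        # Naive Bayes attribution of sensor events to residents, trained on labelled events.
        def __init__(self, alpha=1.0):
            self.alpha = alpha                                         # Laplace smoothing constant
            self.event_counts = defaultdict(lambda: defaultdict(int))  # resident -> sensor -> count
            self.resident_counts = defaultdict(int)                    # resident -> total events
            self.sensors = set()

        def train(self, labelled_events):
            # labelled_events: iterable of (sensor_id, resident_id) pairs.
            for sensor, resident in labelled_events:
                self.event_counts[resident][sensor] += 1
                self.resident_counts[resident] += 1
                self.sensors.add(sensor)

        def attribute(self, sensor):
            # Return the resident maximizing log P(resident) + log P(sensor | resident).
            total = sum(self.resident_counts.values())
            best, best_score = None, float("-inf")
            for resident, n in self.resident_counts.items():
                prior = n / total
                likelihood = ((self.event_counts[resident][sensor] + self.alpha) /
                              (n + self.alpha * len(self.sensors)))
                score = math.log(prior) + math.log(likelihood)
                if score > best_score:
                    best, best_score = resident, score
            return best

    # Hypothetical data: motion sensors M1/M2, residents A and B.
    model = ResidentAttributor()
    model.train([("M1", "A"), ("M1", "A"), ("M2", "B"), ("M2", "B"), ("M1", "B")])
    print(model.attribute("M1"))   # -> "A"

A real system would condition on more features than the sensor identifier alone (for example, time of day or the previously triggered sensor), but the scoring structure stays the same.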
7 Applications This thematic part overviews the application areas of Ambient Intelligence and Smart Environments (AISE). One of the important aspects of many concepts developed in AISE is their closeness to society. While other areas of computer science also impact everyday life in our society, AISE provides a framework in which information technologies are applied to real-world situations in a systemic and often multi-disciplinary fashion to enhance our lives. It is worth noting that even subjects in computer science not covered in this handbook, for example the Internet, language processing and search engines, are nevertheless related to AISE because they are part of the intelligence or smartness of the environments we live or work in. We do not cover them here because if we did, this handbook would have become a handbook of the whole of computer science! The list of topics covered in this part is not an exhaustive enumeration of applications based on AISE, but a selected set of representative topics from various application fields. The subjects are roughly ordered from applications that support an individual to those that provide services to a large number of people. 1. Ambient Human-to-Human Communication The traditional telephone and video conference systems are examples of a terminal-centered interaction scenario where the user is either holding the terminal in the hand or seated in front of it. The terminal centricity guides the interaction to take the form of an on/off session which starts by making a call and ends by terminating the call. A terminal-centered communication session has the same social characteristics as a formal face-to-face meeting. When people are physically together there is a larger variety of interaction modes available, which may be characterized by different ranges of social or interpersonal distances beyond the face-to-face meeting mode. The model for ambient communication technology is the natural interaction between people who are physically present. In this chapter we give a general overview of various topics related to ambient communication. A detailed description of the architecture and essential components of a voice-based ambient telephone system is provided as an example. 2. Smart Environments for Occupancy Sensing and Services This chapter discusses recent developments in occupancy sensing and applications related to sensing the location of users and physical objects, and occupancy in smart environments, mainly houses, offices, and urban spaces. The sensory equipment widely used for sensing is presented, as well as the different tools and algorithms used to process, present, and apply the information. The focus is on the different detecting devices, data mining, and statistical techniques that have been utilized in pervasive computing applications. Examples that utilize occupancy information to enable services for ordinary or professional users are studied.
3. Smart Offices and Smart Decision Rooms Decision making is one of the noblest activities of the human being. Most of the time, decisions are made involving several persons (group decision making) in specific spaces (e.g. meeting rooms). On the other hand, many scenarios (e.g. increased business opportunities, intense competition in the business community) demand fast decision making to achieve success. Smart offices and decision rooms contribute to reducing the decision cycle, and offer connectivity wherever the users are, aggregating the knowledge and information sources. This chapter provides a survey of smart offices and intelligent meeting rooms, considering both the hardware and software perspectives. It presents the Smart Decision Room, a project platform created to provide support to meeting room participants, assisting them in the argumentation and decision making processes, while combining the rational and emotional aspects. 4. Smart Classroom: Bringing Pervasive Computing into Distance Learning The Smart Classroom project aims to build a real-time interactive classroom with a tele-education experience by bringing pervasive computing technologies into traditional distance learning. This chapter presents the current state of the project, which includes four key technologies: 1) a Smart Classroom, which embeds multiple sensors and actuators (e.g. location sensor, RFID, microphone array) to enhance the conventional classroom-based learning environment; 2) key multimodal interface technologies and context-aware applications, which merge speech, handwriting, gestures, location tracking, direct manipulation, large projected touch-sensitive displays, and laser pointer tracking, providing an enhanced interaction experience for teachers and students; 3) a dedicated software system called “SameView”, which uses a hybrid application-layer multicast protocol and an adaptive content delivery scheme, providing a rich set of functions for teachers, local students and large-scale remote students to carry out a real-time, collaborative distance learning experience; 4) a software infrastructure based on a multi-agent architecture called Smart Platform, which integrates, coordinates and manages all the distributed collaborative modules in the smart environment, supporting various data transmission modes in terms of different quality of service (QoS) levels. The smart classroom system has been demonstrated as a prototype system at Tsinghua University. 5. Ambient Intelligence in the City This chapter aims to address deficits in the way ambient intelligence conceives of and designs for urban environments. It begins with an overview of issues surrounding ambient intelligence in the city, followed by critical voices and examples of alternative ambient intelligence systems. The text argues for the inclusion of pervasive sensing issues into urban planning, in particular to negotiate between private and public interests.
6. The Advancement of World Digital Cities This chapter reviews the advances in several worldwide activities on regional information spaces empowered by new technologies in sensors, information terminals, broadband networks and wireless networks. In the US and Canada, a large number of community networks supported by grass-roots activities appeared in the early 1990s. In Europe, more than one hundred digital cities have been tried, often supported by local or central governments and the EU in the name of local digitalization. Asian countries have actively adopted the latest information technologies as a part of national initiatives. In the 15 years since the first stage of digital cities, the development of the original digital cities has leveled off or stabilized. In spite of that, looking back at the trajectory of the development of digital cities, it is clear that digital environments in cities have often benefited from previous activities on various regional information technology spaces.
8 Societal Implications and Impact It is well known within the community that the areas covered in this volume can potentially have profound impacts on our society. On the one hand they can provide unprecedented support for society to enter a truly modern digital era where computers everywhere silently assist us. On the other hand, this can also prove to be a technical minefield where many problems arise as a result of attempting to offer the intended services [13, 15, 2, 10]. User-oriented design, privacy preservation, personalization and social intelligence are some of the fundamental concepts the area has to approach efficiently in order to encourage acceptance from the masses. Failing to do well in those aspects will most certainly cause users to feel overwhelmed by, and distrustful of, the technology that is entering their daily lives in a disruptive way. The chapters in this thematic part address some of these important user-related issues, namely how to achieve truly unobtrusive systems, give privacy issues a well-deserved relevance, personalize systems to a user, and consider social interaction as an essential element of design. 1. Human Factors Consideration for the Design of Collaborative Machine Assistants The basic idea of ambient intelligence is to connect and integrate technological devices into our environment such that the technology itself disappears from sight and only the user interface remains. Representative technologies used in ambient intelligent environments are collaborative machine assistants (CMAs), for example in the form of a virtual agent or robot. Designing effective CMAs requires attention to the user interaction, making it critical to consider user characteristics (e.g., age, knowledge), technology characteristics (e.g., perceived usefulness, ease of use), and relational characteristics (e.g., voluntariness, attraction) during the design. We propose a framework for understanding the user
variables that influence the acceptance of CMAs. Our goal is to provide designers with heuristic tools to be used during the design process to increase user acceptance of such technology. 2. Privacy Sensitive Surveillance for Assisted Living: A Smart Camera Approach Western societies are aging rapidly. Automated 24/7 surveillance that ensures the safety of the elderly while respecting their privacy becomes a major challenge. At the same time, this is representative of the novel video surveillance applications that have emerged recently besides the classic protection applications in airports, government buildings and industrial plants. Three problems of current surveillance systems are identified. A distributed and automated smart-camera-based approach is proposed that addresses these problems. The proposed system’s goal is to analyze the real world and reflect all relevant – and only relevant – information live in an integrated virtual counterpart for visualization. It covers geo-referenced person tracking and activity recognition (falling person detection). A prototype system installed in a home for assisted living has been running 24/7 for several months and shows quite promising performance. 3. Data Mining for User Modeling and Personalization in Ubiquitous Spaces User modeling has traditionally been concerned with analyzing a user’s interaction with a system and developing cognitive models that aid in the design of user interfaces and interaction mechanisms. In recent years, the large amounts of data that can be collected from users of technology in multiple scenarios (web, mobile, etc.) have created additional opportunities for research, not only for traditional data mining applications, but also for creating new user models that lead to personalized interaction and services. The explosion of available user data presents many opportunities and challenges, requiring the development of new algorithms and approaches for data mining, modeling and personalization, particularly considering the importance of the socio-cultural context in which technologies are used. This chapter presents an overview of data mining of ubiquitous data, reviews user modeling and personalization methods, and discusses how these can be used in ubiquitous spaces, not just for designing new interfaces, but also for determining how user models can inform decisions concerning what services and functionalities are adequate for particular users. 4. Experience Research: a Methodology for Developing Human-centered Interfaces Ten years of Ambient Intelligence (AmI) research have led to new insights and the understanding that Ambient Intelligence should not be so much about system intelligence and computing but more about social intelligence and innovation. We discuss these novel insights and their resulting impact on the AmI research landscape. We argue that new ways of working are required, applying the concept of Experience Research and resulting in a truly user-centered approach
to ambient intelligence. The Experience Research approach is illustrated with a small case study on applying this methodology in the area of Ambient Assisted Living.
9 Selected Research Projects Another sign of the interest these areas are attracting is the number of projects that have been started in the last decade around the world. The number of such projects has grown exponentially thanks to the realization by governments (predominantly in the northern hemisphere) of the strategic importance this problem has for the future development of the most industrialized nations. Both governments and companies have started to invest heavily in addressing different bottlenecks for the future development of the area and in the development of a first generation of systems which can increase safety, entertainment, healthy lifestyle habits, and many other services that people can enjoy during work, life at home, and leisure time. This thematic part provides a glance at some of those projects. Given space restrictions we can only offer a small number of them, but they illustrate a variety of topics (human-centric models of computing, non-traditional ways of interaction, robust middleware, ambient assisted living, efficient and automated light control, and large ubiquitous experiments in Japan and Korea) which have been effectively addressed by large consortia of academic and industrial organizations in a collaborative way. 1. Computers in the Human Interaction Loop Computers have become an essential part of modern life, supporting human activities in a multiplicity of ways. Access to the services that computers provide, however, comes at a price: human attention is bound and directed toward a technical artifact rather than at the human activity or interaction that computing is to serve. In CHIL (Computers in the Human Interaction Loop) Computing we explore an alternative paradigm for computing in a human-centered way. Rather than humans attending to computers, we are developing computing services and technologies that attend to humans while they go about their normal activities and attempt to provide implicit support. By analogy to a human butler hovering in the background, computers that attend to human needs implicitly would require a much richer understanding and quite a bit of careful observation of human interaction and activities. This presumes a wide set of perceptual capabilities, allowing the system to determine who was doing what to whom, where, why and how, based on a multimodal integration of many sensors. It also presumes the ability to derive rich descriptions of human activities and state, so that suitable inferences for the determination of implied need can be made. In this chapter, we present several prototypical CHIL services and their supporting perceptual technologies. We also discuss the evaluation framework under which technological advances are made, and the software architecture by which perceptual components are integrated. Finally, we examine the social impact of CHIL computers;
how increasingly social machines can better cooperate with humans and how CHIL computers enhance their effectiveness. 2. Eye-based Direct Interaction for Environmental Control in Heterogeneous Smart Environments Eye-based environmental control requires innovative solutions for supporting effective user interaction, for allowing home automation and control, and for making homes more “attentive” to user needs. This chapter describes the research efforts within the COGAIN project, where the gaze-based home automation problem is tackled as a whole, exploiting state-of-the-art technologies and trying to integrate interaction modalities that are currently supported and that may be supported in the near future. User-home interaction is sought through an innovative interaction pattern: direct interaction. Integration between the eye tracking interfaces, or other interaction modalities, and the wide variety of equipment that may be present in an intelligent house (appliances and devices) is achieved by the implementation of a Domotic House Gateway that adopts technologies derived from the semantic web to abstract devices and to enable interoperability of different automation subsystems. Innovative points can be identified in the wide flexibility of the approach, which allows on one hand the integration of virtually all home devices having a communication interface, and, on the other hand, direct user interaction paradigms, exploiting the immediacy of command and feedback. 3. Middleware Architecture for Ambient Intelligence in the Networked Home The European IST FP6 Amigo project aimed at developing a networked home system enabling the ambient intelligence / pervasive computing vision. A key feature of the Amigo system is its effective integration and composition of heterogeneous devices and services from the four application domains (i.e., personal computing, mobile computing, consumer electronics and home automation) that are met in today’s home. The design of the supporting interoperability middleware architecture is the focus of this chapter. The Amigo middleware architecture is based on the service-oriented architectural style, hence enabling a networked home system structured around autonomous, loosely coupled services that are able to communicate, compose and evolve in the open, dynamic and heterogeneous home environment. The distinguishing feature of the proposed middleware architecture is that it poses limited technology-specific restrictions: interoperability among heterogeneous services is supported through semantic-based interoperability mechanisms that are key elements of the Amigo architecture. 4. The PERSONA Service Platform for Ambient Assisted Living Spaces A European project started in January 2007 in the field of Ambient Assisted Living, PERSONA strives for the development of sustainable and affordable solutions for the independent living of senior citizens. AAL services with a focus on social integration, assistance in daily activities, safety and security, and
mobility should be developed based on an open, scalable technological platform. This chapter summarizes the conceptual design of this service platform for AAL spaces, which is being finalized at the end of the second year after the project start. It introduces a physical and a logical architecture for AAL spaces at an abstract level, along with a set of more concrete functional components that are identified as part of the platform. The abstract physical architecture treats an AAL space as a dynamic ensemble of networked nodes, and the interoperability framework abstracts all functionality not leading to context, input, or output events (or the direct processing of such events) as a service. Hence, the proposed architecture is service oriented. A last topic treated in this chapter concerns the PERSONA solution for binding ultra-thin devices through a Sensor Abstraction and Integration Layer. 5. ALADIN - a Magic Lamp for the Elderly? The overall aim of ALADIN, an EU-funded research project in the realm of Ambient Assisted Living (AAL), is to extend our knowledge about the impact of lighting on the well-being and comfort of older people and to translate this into a cost-effective open solution. The development approach is underpinned by an ambient intelligence framework which focuses on user experience and emphasizes activity support. The adaptive lighting system developed in ALADIN consists of an intelligent open-loop control, which can adapt various light parameters in response to the psycho-physiological data it receives. For adaptive control, simulated annealing and genetic algorithms have been developed. In the lab tests that were carried out to determine the psycho-physiological target values for activation and relaxation, skin conductance response emerged as the most appropriate target value, with heart rate as the second best candidate. Tests with different light parameters (color temperature, intensity, distribution) showed a clear impact on older adults both in terms of activation and relaxation. The technical architecture and hardware prerequisites (sensor device, lighting installation) as well as the different software applications are described. Finally, the preliminary results of the extensive field trials conducted in twelve private households are discussed. The user feedback provides the basis for future application scenarios and exploitation plans of the Consortium partners. 6. Japanese Ubiquitous Network Project: Ubila From 2003 to March 2008, Ubila, a part of the “u-Japan” project, was a pioneering national ubiquitous network R&D project in Japan. Its objective was to develop core technologies to realize ambient environments through the cooperation of networks, middleware and information processing technologies. The Ubila project has demonstrated various activities: every year, its research outputs were shown at exhibitions, and demonstrations were also conducted periodically in smart spaces for guests. The project has released two service prototype videos that reflect its R&D directions. Overall, the Ubila project marked a unique and important step toward the advent of the ambient world.
7. Ubiquitous Korea Project In 2006 the Korean government planned to construct u-Korea by 2015. Although the plan has not been executed exactly as envisaged, the government has established several large national projects for constructing u-Korea. In this chapter we introduce four major projects related to ubiquitous computing technology which have been conducted by the Korean government.
10 Perspectives of the Area When we look back at the research in intelligent systems, we can see the general direction moving from the individual to the group, and from closed systems to open systems. When the research area of Artificial Intelligence was first established, the dominating concept was “the physical symbol system hypothesis”, which postulates that the essence of intelligence is captured by and only by symbol processing [19]. As a natural consequence, an intelligent system was modeled as an input-compute-output, or recognize-compute-act, system (Figure 5).
Fig. 5 The old view of intelligence.
As experience was gained in the area, the community began to recognize the importance of the environment. There even exists a viewpoint declaring that only the environment is the key to intelligent behavior: An agent (or any actor in the environment) only demonstrates behavior by picking up an affordance provided by the environment [14]. This view, although regarded by some as extreme, has some alignment with the concept of Ambient Intelligence in emphasizing the various influences and multi-faceted services the environment has to offer to its occupants.
Around the same period, the concept of Autopoiesis, which denies a predefined distinction between the system and the environment, was proposed [17]. These concepts also match research directions taken in robotics, where sensors and actuators play more important roles than symbolic planning. Brooks [11] even denied the necessity of internal symbolic representation. This new view of intelligence can be formalized as in Figure 6, where recognition, computation and action take place in parallel. The essence of the general research direction in Ambient Intelligence, placing more emphasis on the environment than on the agents acting in the environment, can be captured along the same direction but in a wider scope. The area is not limited to a single agent, but rather aims to support a team of agents in one environment, or even the society as a whole. This reflects the trend from focusing on the individual towards serving a group.
Fig. 6 A new view of intelligence, which includes the environment within the system.
11 Conclusions This handbook offers a snapshot of the state-of-the-art in Ambient Intelligence and Smart Environments. The various core disciplines participating in the creation of the concepts and applications of AmI and SmE are represented through examples of research and development efforts by leading experts in each field. The domain of potentials is vast with applications in all aspects of life for an individual or a community. The extent of the area overlaps with such diverse groups of disciplines as sensor networks and sociology, and artificial intelligence and communication networks. While this offers ample opportunities for many areas of science and technology to impact the future of AmI and SmE by providing tools from within the
focus area of each discipline, truly groundbreaking impact is possible through collaboration among researchers from several disciplines. This handbook aims to offer and encourage such opportunities by introducing a representative set of fundamental concepts from the different areas. It is hoped that the presented material will inspire researchers to steer their work towards making novel contributions to this promising area. The rest of the handbook is divided into 8 thematic parts, each with a collection of contributions from recent studies. We hope you will find the material as exciting and inspiring as we have.
Acknowledgments We wish to thank Jennifer Evans and Melissa Fearon of Springer for their encouragement and support of this project. We are also grateful to Valerie Schofield and Jennifer Maurer of Springer for their assistance with the production of this handbook. Many thanks go to the contributors of the chapters for the book. Their active and timely cooperation is highly appreciated.
References
[1] Aarts E, Appelo L (1999) Ambient intelligence: thuisomgevingen van de toekomst. IT Monitor (9):7–11
[2] Aarts E, de Ruyter B (2009) New research perspectives on ambient intelligence. Journal on Ambient Intelligence and Smart Environments (JAISE) 1(1):5–14
[3] Aarts E, Harwig R, Schuurmans M (2002) Ambient intelligence. In: The invisible future: the seamless integration of technology into everyday life, McGraw-Hill, pp 235–250
[4] Aghajan Y, Lacroix J, Cui J, van Halteren A, Aghajan H (2009) Home Exercise in a Social Context: Real-Time Experience Sharing using Avatars. Intetain’09
[5] Aghajan H, Cavallaro A (2009) Multi-Camera Networks: Principles and Applications. Academic Press
[6] Aghajan H, Augusto J, Delgado R (2009) Human-centric Interfaces for Ambient Intelligence. Academic Press
[7] Augusto JC (2007) Ambient Intelligence: the Confluence of Ubiquitous/Pervasive Computing and Artificial Intelligence. In: Intelligent Computing Everywhere, Springer London, pp 213–234
[8] Augusto JC, Nugent CD (eds) (2006) Designing Smart Homes. The Role of Artificial Intelligence. Springer-Verlag
[9] Augusto JC, Nugent CD (2006) Smart homes can be smarter. In: [8], Springer-Verlag, pp 1–15
[10] Böhlen M (2009) Second order ambient intelligence. Journal on Ambient Intelligence and Smart Environments (JAISE) 1(1):63–67
[11] Brooks R (1990) Elephants Don’t Play Chess. Robotics and Autonomous Systems 6:3–15
[12] Cook D, Das S (2004) Smart Environments: Technology, Protocols and Applications (Wiley Series on Parallel and Distributed Computing). Wiley-Interscience
[13] Dertouzos M (2001) Human-centered systems. In: The Invisible Future, McGraw-Hill, pp 181–192
[14] Gibson J (1979) The Ecological Approach to Visual Perception. Houghton Mifflin, Boston
[15] Huuskonen P (2007) Run to the hills! Ubiquitous computing meltdown. In: Augusto JC, Shapiro D (eds) Advances in Ambient Intelligence, no. 164 in Frontiers in Artificial Intelligence and Applications (FAIA), IOS Press
[16] IST Advisory Group (2001) The European Union report: Scenarios for ambient intelligence in 2010. URL: ftp://ftp.cordis.lu/pub/ist/docs/istagscenarios2010.pdf
[17] Maturana H, Varela F (1980) Autopoiesis and Cognition: The Realization of the Living. D. Reidel Publishing Co., Dordrecht, Holland
[18] Nakashima H (2007) Cyber assist project for ambient intelligence. In: Augusto JC, Shapiro D (eds) Advances in Ambient Intelligence, IOS Press
[19] Newell A, Simon H (1963) GPS: A Program that Simulates Human Thought. In: Feigenbaum E, Feldman J (eds) Computers and Thought, McGraw-Hill
[20] Ramos C, Dey A, Kameas A (2009) Thematic Issue on Contribution of Artificial Intelligence to AmI. Journal on Ambient Intelligence and Smart Environments (JAISE) 1(3)
[21] Spohrer J, Maglio P, Bailey J, Gruhl D (2007) Steps Toward a Science of Service Systems. IEEE Computer 40(1):71–77
[22] Streitz NA, Kameas A, Mavrommati I (eds) (2007) The Disappearing Computer. Lecture Notes in Computer Science, vol 4500, Springer
[23] Weiser M (1991) The computer for the twenty-first century. Scientific American 265(3):94–104
[24] Weiser M (1993) Hot topics: Ubiquitous computing. IEEE Computer 26(10):71–72
[25] Zelkha E (1998) The future of information appliances and consumer devices. Palo Alto Ventures, Palo Alto, California
Part II
Sensor, Vision and Networks
A Survey of Distributed Computer Vision Algorithms Richard J. Radke
Richard J. Radke, Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY, USA
1 Introduction Over the past twenty years, the computer vision community has made great strides in the automatic solution to such problems as camera localization and visual tracking. Many algorithms have been made tractable by the rapid increases in computational speed and memory size now available to a single computer. However, the world of visual sensor networks poses several challenges to the direct application of traditional computer vision algorithms. First, visual sensor networks are assumed to contain tens to hundreds of cameras, many more than are considered in many vision applications. Second, these cameras are likely to be spread over a wide geographical area, much larger than the typical computer lab. Third, the cameras are likely to have modest local processors with no ability to communicate beyond a short range. Fourth, and perhaps most importantly, no powerful central processor typically exists to collect the images from all cameras and solve computer vision problems relating them all. Even if a central processor is available, the sheer number of cameras now feasible in a visual sensor network may preclude the use of a centralized algorithm. For example, more than half a million CCTV cameras observe the streets of London, with thousands in the airports and subways alone. Near real-time processing of this amount of data at a central processor (either with automatic vision algorithms or human observation) is simply infeasible, arguing for so-called “smart cameras” that can perform image processing and computer vision tasks locally before transmission to the broader network. These trends all point to the need for distributed algorithms for computer vision problems in visual sensor networks. In addition to being well-suited to ad-hoc wireless
networks, distributed algorithms have no single point of failure, generally make fairer use of underlying communication links and computational resources, and are more robust and scalable than centralized algorithms. In this chapter, we survey the recent literature on distributed computer vision algorithms designed for implementation in a visual sensor network. We note that many algorithms described as “distributed” either require data to be transmitted to a central processor after some on-board processing at the sensor, or require each node to collect data from all other nodes in the network, effectively making every node perform the same central computation. While such research is valuable, in this chapter we emphasize techniques that are truly distributed: that is, nodes in the network make local decisions based only on information from their immediate neighbors. The design goal for such algorithms should be that the distributed result approaches what could be obtained had all the information been available at a central processor. We also do not consider algorithms that are “distributed” in the sense of being efficiently implemented in parallel on a central multi-core processor. The algorithms we survey fall into a few main categories. Section 2 reviews general work on non-computer-vision-related distributed algorithms, many of which have recently found new applications in wireless sensor networks. Section 3 discusses the problem of topology estimation in a visual sensor network, that is, the determination of which nodes share overlapping fields of view, or have corresponding entry/exit points. Section 4 discusses the distributed calibration of a camera network, that is, the estimation of the three-dimensional location and orientation of each camera with respect to a common coordinate system. Section 5 discusses a common application of visual sensor networks: the distributed tracking of single or multiple moving objects, which may also include identification or classification of the objects. Section 6 concludes the chapter with a discussion of future trends and opportunities. We note that our interest here is at the application layer of a visual sensor network, that is, the computer vision algorithms running at each node, as opposed to the physical layer of the sensor nodes. The algorithms described here are generally applicable to the many “smart camera” nodes under development in academia and industry that consist of a visual sensor, a set of embedded network and digital signal processors, and an antenna. An excellent taxonomy and historical overview of smart cameras was recently presented by Rinner et al. [83, 81]. Several chapters in the rest of this book go into more detail on specific instances of the algorithms generally surveyed here. We also note the increased academic and industrial interest in the area of distributed computer vision. Much of the most innovative work has appeared in the 2006 Distributed Smart Camera Workshop (http://www.iti.tugraz.at/dsc06/) and the 2007 and 2008 International Conference on Distributed Smart Cameras (http://www.icdsc.org/). Recent special issues of the Proceedings of the IEEE [82], the IEEE Journal on Selected Topics in Signal Processing [1], and the EURASIP Journal on Advances in Signal Processing [44] dedicated to multi-camera networks further address these topics.
2 Distributed Algorithms The basic theory of distributed algorithms is well-described in the textbook by Lynch [52]. Distributed algorithms are categorized by whether they are synchronous, asynchronous, or partially synchronous, and whether communication is established using shared memory, broadcast messages, or point-to-point communication. Typical problems include electing a leader, conducting a breadth-first search, finding shortest paths, constructing a minimum spanning tree, and reaching consensus. In general, provably convergent algorithms are more difficult to derive in asynchronous networks than in synchronous ones. However, most of the work surveyed in this chapter implicitly assumes at least partially synchronous networks, which may not be realistic for real-world visual sensor networks. The book by Bertsekas and Tsitsiklis [9] discusses somewhat higher-level parallel and distributed algorithms in the context of numerical analysis and optimization. They consider linear and nonlinear asynchronous iterative methods for algebraic and differential equations, dynamic programming, unconstrained and constrained optimization, and variational inequalities. The book by Varshney [92] treats classical detection theory and hypothesis testing in a distributed framework, including Bayesian, Neyman-Pearson, minimax, sequential, and locally optimum detection. We note that many distributed estimation problems arising from visual sensor networks can be viewed as special types of consensus problems. That is, all nodes in the network should come to an agreement about global parameters describing the network, or objects moving in it, without communicating all sensor data to a centralized location. For example, all the cameras in the network might need to agree on the estimated position of a central landmark. The past several years have seen an explosion of interest in distributed algorithms specifically designed for wireless ad-hoc sensor networks, often taking into account networking, communication, and power considerations. A substantial body of work exists on the distributed localization of nodes in wireless sensor networks based on considerations such as the time-of-flight between nodes. For example, Iyengar and Sikdar [38] proposed a distributed, scalable algorithm for sensor node localization in the plane that used time-of-arrival measurements instead of GPS. Marinakis and Dudek [59, 60] presented several innovative probabilistic algorithms for sensor network localization based on noisy inter-sensor range data. Langendoen and Reijers [45] overviewed and compared several algorithms for the localization problem. We can think of the algorithms discussed in Sections 3 and 4 as extensions of the localization problem in 3D that use visual information as the basis for estimation. Similarly, these algorithms, as well as the tracking algorithms considered in Section 5, can be considered as instances of distributed parameter estimation or optimization
mization problems, another topic of interest in the sensor networking community. For example, Boyd et al. [11] overviewed gossip algorithms for information exchange in sensor networks. They showed how such algorithms could be applied to the distributed estimation of an average value in a wireless sensor network. Nowak [69] showed how the parameters of a mixture of Gaussians for which each node of a sensor network had different mixing coefficients could be estimated using a distributed version of the well-known expectation-maximization (EM) algorithm. This message-passing algorithm involves the transmission of sufficient statistics between neighboring nodes in a specific order, and was experimentally shown to converge to the same results as centralized EM. Kowalczyk and Vlassis [42] proposed a related gossip-based distributed algorithm called Newscast EM for estimating the parameters of a Gaussian mixture. Random pairs of nodes repeatedly exchange their parameter estimates and combine them by weighted averaging. Rabbat and Nowak [76] investigated a general class of distributed optimization algorithms for sensor networks. They showed that many natural optimizations could be solved incrementally, by each node updating the estimated parameters it receives to reduce its local cost and passing the updated parameters on. Using the theory of incremental subgradient optimization, they showed that the distributed algorithm converges to within a small amount of the globally optimal value after only a few cycles through the network. They applied the approach to robust least-squares estimation, energy-based source localization, and decision boundary problems. Finally, we note research on activating or orienting directional sensors to maximize spatial coverage. Ai and Abouzeid [2] described a distributed algorithm for maximizing spatial coverage in an ad-hoc network of directional sensors. They proposed a distributed greedy algorithm that compared favorably to a central greedy algorithm, as well as a dynamic sensor scheduling algorithm that favorably distributed its energy requirements. Hoffmann and Hähner [35] proposed a simple local algorithm to maximize visual coverage in a network of smart cameras. They simulated their results on networks of up to 80 cameras, and analyzed the underlying networking issues using the NS2 network simulator. Bash and Desnoyers [8] proposed an algorithm for the distributed exact computation of Voronoi cells in sensor networks, which could be useful for addressing coverage and tracking problems.
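To make the flavor of these gossip-style algorithms concrete, the sketch below simulates randomized pairwise averaging in a small network. It is an illustrative toy, not the algorithm of any particular paper cited above: the ring topology, node readings, and round count are invented. Each node repeatedly averages its value with a randomly chosen neighbor, and every node converges to the network-wide mean without any central coordinator.

```python
import random

def gossip_average(values, neighbors, num_rounds=2000, seed=0):
    """Randomized pairwise gossip: each round, a random node and a random
    neighbor replace their values with the pair's average. All nodes
    converge toward the global mean of the initial values."""
    rng = random.Random(seed)
    x = list(values)
    for _ in range(num_rounds):
        i = rng.randrange(len(x))
        j = rng.choice(neighbors[i])
        avg = 0.5 * (x[i] + x[j])
        x[i] = x[j] = avg
    return x

# Illustrative example: 10 nodes on a ring, each starting with a local sensor
# reading; after gossiping, every node holds (approximately) the global mean.
readings = [3.0, 7.5, 1.2, 9.8, 4.4, 6.1, 2.2, 8.3, 5.0, 0.7]
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
result = gossip_average(readings, ring)
print(result[0], sum(readings) / len(readings))  # both close to 4.82
```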
3 Topology Estimation

We consider two main types of topology estimation problems in visual sensor networks. The first is what we refer to as the overlapping problem, where it is assumed that the cameras observe parts of the same environment from different perspectives. The relationships between the cameras can be modeled as an undirected graph in which an edge appears between two cameras if they observe some of the same scene points from different perspectives. In our research, we called this the vision graph [17]. We might naturally expect cameras’ fields of view to overlap in wireless
ad-hoc camera networks, in which a large number of cameras are densely distributed around a large, open environment. The second topology estimation problem is what we refer to as the non-overlapping problem, where it is assumed that no two cameras observe the same part of the environment. Instead, relationships between the cameras are induced by the likelihood that an object in one camera appears in another after some amount of time. The relationships between the cameras can again be modeled as a graph, but in this case a directed graph with expected transition probabilities is a more natural model (for example, it may be much more likely for objects to travel from Camera A to B than vice versa). This situation naturally arises in indoor surveillance scenarios, such as cameras mounted in corridors. For this reason, most research on non-overlapping networks assumes a planar environment. In either case, we should keep in mind that the topology graph estimation problem in a wireless visual sensor network is solved by sending messages along underlying one-hop communication links. These links can be abstracted as a communication graph [29], which is mostly determined by the locations of the nodes, the powers of their antennas, and the topography of the environment. The camera topology estimation problem can be viewed as an overlay graph on this network. We note that the presence of an edge in the communication graph does not imply the presence of the same edge in the vision graph, since nearby cameras may be pointed in different directions, and conversely, physically distant cameras can image part of the same scene. Figure 1 illustrates the communication and vision graphs for a simulated example in which 30 cameras are scattered around several buildings. We can see that the communication and vision graphs are not highly correlated.
Fig. 1 (a) A simulated camera network (the focal lengths of the cameras have been exaggerated). (b) The corresponding communication graph, assuming each camera has the same fixed antenna range. (c) The corresponding vision graph.
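The interplay between the two graphs can be made concrete with a toy example. In the sketch below (with invented edges), the communication graph holds the one-hop radio links and the vision graph holds the overlapping-view relationships; a breadth-first search counts how many communication hops each vision-graph edge must be routed over, illustrating that visual overlap and radio adjacency are largely independent.

```python
from collections import deque

# Toy illustration (edges invented): the communication graph carries one-hop
# radio links, while the vision graph links cameras with overlapping views.
comm_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
vision_edges = [(1, 4), (2, 6), (3, 4), (5, 6)]

adj = {}
for a, b in comm_edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def hops(src, dst):
    """Breadth-first search over the communication graph."""
    frontier, seen = deque([(src, 0)]), {src}
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for n in adj.get(node, ()):
            if n not in seen:
                seen.add(n)
                frontier.append((n, d + 1))
    return None

for a, b in vision_edges:
    print(f"vision edge {a}-{b}: {hops(a, b)} communication hop(s)")
# Vision edge 1-4 needs 3 hops even though the cameras' views overlap.
```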
As Figure 1 illustrates, there is generally no way to rule out the possibility of overlap between any two visual sensors without comparing their images (or extracted features from them). Consequently, correctly estimating the camera network topology inherently involves collecting the information from all sensors in one place, either at a central node, or at all nodes in parallel. We review several
such methods here, emphasizing those designed with the constraints of a visual sensor network in mind. Even though such algorithms may be better characterized as decentralized than distributed, they are an important first step for the higher-level vision problems we discuss subsequently. We review approaches to the topology estimation problem for both cases below. Virtually all research in the area casts itself in one domain or the other: that is, either the cameras are viewed as entirely non-overlapping, or non-overlapping cameras are not considered in the topology. The paper by Wang et al. [94] is an exception that allows for no, little, or large overlaps between fields of view, but the focus is on activity analysis of multi-camera trajectories rather than topology estimation.
3.1 Non-overlapping Topology Estimation

The non-overlapping topology estimation problem is well-studied. Most researchers have approached the problem using the spatio-temporal paths of tracked objects that enter and exit the field of view of each camera. Marinakis and Dudek [58] described how to estimate a Markov model characterizing transitions between cameras, using only non-discriminative detected motions as input. They reconstructed object trajectories that best explain the cameras’ observations, and used these to estimate the transition graph. The number of objects moving through the camera network was assumed known, and their probabilistic behavior was assumed to be homogeneous. Makris et al. [55] proposed a similar approach that temporally correlates incoming and outgoing targets in detected entry and exit zones for each camera. Tieu et al. [89] used mutual information as a measure of statistical dependence to estimate object correspondences for topology estimation. Special cases of objects used for non-overlapping topology estimation include identifiable people [96] and vehicles [68]. Mandel et al. [56] described a simple algorithm for estimating camera network topology based on broadcast messages regarding time-stamped detected motions in different spatial regions. They applied a sequential probability ratio test (SPRT) to accept or reject the possibility that two cameras observe the same scene based on the correspondence (or lack thereof) of the aggregated detections. The use of the SPRT is a noteworthy idea for decision problems in sensor networks. Van den Hengel et al. [34] took a slightly different approach they called exclusion, in which the vision graph was initially assumed to be fully connected, and edges that were contradicted by observed evidence over time were removed. They tested their algorithm on a network of 100 real cameras. Javed et al. [39] incorporated the appearance of objects as well as their spatiotemporal motion into the topology estimation problem. While their emphasis was on tracking, they learned the topology of a set of cameras during a training phase that used Parzen windows to model the joint density of entry/exit times, locations, and velocities, and a Gaussian model for the Bhattacharyya distances between his-
tograms of corresponding objects. They later extended their approach [40] to more accurately model the brightness transfer function relating pairs of cameras. It would be interesting to see how this approach scales to large networks of cameras. Farrell and Davis [26] proposed a different decentralized algorithm for topology discovery based on an information-theoretic appearance matching criterion on tracked objects moving through the network. More distinctive objects contribute proportionally more to the estimated transition model between cameras. This idea seems interesting, but it was not tested with real images. Medeiros et al. [61] took a slightly different approach to estimating the relationships between cameras based on what they termed event-driven clustering. Instead of explicitly constructing a graph representing the camera network topology, they used estimated detections of the same object to construct clusters of the vision graph. The proposed protocol addresses cluster formation, merging, splitting, and interaction in a distributed camera network. An alternate method for distributed feature matching (which implicitly defines a vision graph) was described by Avidan et al. [4], who used a probabilistic argument based on random graphs to analyze the propagation of wide-baseline stereo matching results obtained for a small number of image pairs to the remaining cameras. Farrell et al. [27] described a Bayesian framework for learning higher-order transition models in camera networks. That is, in addition to estimating the adjacency of cameras, the method can help answer questions such as “what fraction of objects passing through cameras X and Y will at some later time reach camera Z?”. They claimed that the higher-order models can improve tracking and inference performance. It would be interesting to see the implementation of this idea in a real network of cameras.
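The core of many of the entry/exit-based methods above can be sketched in a few lines. The fragment below is a deliberately crude stand-in (not the model of [58] or [55]): it counts how often an entry at one camera follows an exit at another within a time window and row-normalizes the counts into rough transition frequencies. Real systems must additionally handle multiple simultaneous objects, unknown lags, and false detections.

```python
import numpy as np

def estimate_transitions(exits, entries, num_cams, max_lag=10.0):
    """Crude transition-frequency estimate for a non-overlapping network:
    count how often an entry at camera j follows an exit at camera i within
    max_lag seconds, then row-normalize. exits/entries are lists of
    (timestamp, camera_index) tuples from anonymous motion detections."""
    counts = np.zeros((num_cams, num_cams))
    for t_exit, i in exits:
        for t_entry, j in entries:
            if t_exit < t_entry <= t_exit + max_lag:
                counts[i, j] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)

# Toy data: objects tend to leave camera 0 and appear at camera 1 a few
# seconds later, suggesting a directed topology edge 0 -> 1.
exits = [(1.0, 0), (20.0, 0), (40.0, 0)]
entries = [(4.0, 1), (23.5, 1), (44.0, 1), (60.0, 2)]
print(estimate_transitions(exits, entries, num_cams=3))
```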
3.2 Overlapping Topology Estimation

When the images from the cameras overlap, the topology estimation problem is often combined with the calibration problem discussed in the next section. Here we mention several techniques that “stop” at the point of topology estimation. Kulkarni et al. [43] proposed a method using the time-stamped appearances of a moving reference object to provide an approximate initialization for the cameras’ regions of overlap, which is indirectly related to their topology. Similar approaches applied to the camera calibration problem are discussed in the next section. However, for large-scale visual sensor networks, we believe it is important to develop techniques in which neither the cameras nor the scene are altered from their natural state, with no assumptions about camera synchronization or interactions between the user and the environment. Cheng et al. [17] described an algorithm for estimating the vision graph based on a fixed-length “feature digest” of automatically detected distinctive keypoints that each camera broadcasts to the rest of the network. These keypoints arise from the popular and successful Scale-Invariant Feature Transform (SIFT) detec-
tor/descriptor proposed by Lowe [51]. Each sending camera makes a tradeoff between the number of keypoints it can send and how much their descriptors are compressed, assuming a fixed bit budget for their message. Each receiver camera decompresses the feature digest to recover the approximate feature descriptors, which are matched with its own features to generate putative matches. If enough matches are found, a vision graph edge is established. They analyzed the tradeoffs between the size of the feature digest, the number of transmitted features, the level of compression, and the overall performance of the edge generation. This method is appealing since it requires no user interaction with the environment, using static scene features as the basis for edge generation.
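The following toy sketch conveys the spirit of the feature-digest idea on synthetic data; the random projection used for compression, the thresholds, and the fabricated descriptors are illustrative assumptions, not the actual compression scheme of [17]. A sender compresses its strongest descriptors into a small digest, and a receiver matches the digest against its own compressed features to decide whether a vision-graph edge exists.

```python
import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((128, 16))  # shared random projection, 128-D -> 16-D

def make_digest(descriptors, k=50):
    """Sender side: keep the k strongest features (here simply the first k)
    and compress their 128-D descriptors to 16-D for a fixed bit budget."""
    return descriptors[:k] @ PROJ

def count_matches(digest, own_descriptors, ratio=0.8):
    """Receiver side: compress its own descriptors the same way and count
    nearest-neighbour matches that pass Lowe's ratio test."""
    own = own_descriptors @ PROJ
    matches = 0
    for d in digest:
        dists = np.sort(np.linalg.norm(own - d, axis=1))
        if len(dists) >= 2 and dists[0] < ratio * dists[1]:
            matches += 1
    return matches

# Synthetic example: cameras A and B share 30 scene features (plus noise),
# camera C sees an unrelated scene, so only the A-B vision-graph edge appears.
shared = rng.standard_normal((30, 128))
cam_a = np.vstack([shared + 0.05 * rng.standard_normal((30, 128)),
                   rng.standard_normal((40, 128))])
cam_b = np.vstack([shared + 0.05 * rng.standard_normal((30, 128)),
                   rng.standard_normal((40, 128))])
cam_c = rng.standard_normal((70, 128))

digest_a = make_digest(cam_a)
for name, other in [("B", cam_b), ("C", cam_c)]:
    m = count_matches(digest_a, other)
    print(f"A-{name}: {m} matches ->", "edge" if m >= 8 else "no edge")
```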
4 Camera Network Calibration

Formally, camera calibration is the estimation of the parameters that describe how a perspective camera projects 3-D points in a fixed world coordinate system to 2-D points on an image plane. These calibration parameters can be divided into four or more internal parameters related to a camera’s optics, such as its principal point, aspect ratio, lens distortion, and focal length, and six external parameters, namely the rotation angles and translation vector that relate the camera coordinate frame to the world coordinate frame. Since the fixed internal parameters of a given camera can usually be estimated individually prior to deployment [19, 30, 85], most research is mainly concerned with estimating the external parameters. The classical problem of externally calibrating a pair of cameras is well-understood [91]; the parameter estimation usually requires a set of feature point correspondences in both images. When no points with known 3-D locations in the world coordinate frame are available, the cameras can be calibrated up to a similarity transformation [31]. That is, the cameras’ positions and orientations can be accurately estimated relative to each other, but not in an absolute sense. Without metric information about the scene, an unknown scale parameter also remains; for example, the same set of images would be produced by cameras twice as far away from a scene that is twice as large. Multi-camera calibration can be accomplished by minimizing a nonlinear cost function of the calibration parameters and a collection of unknown 3-D scene points projecting to matched image correspondences; this problem is also known as Structure from Motion (SFM). The optimization process for estimating the parameters is called bundle adjustment [90]. Good results are achievable when the images and correspondences are all accessible to a powerful, central processor. It is beyond the scope of this chapter to survey the centralized SFM problem here, but the book by Hartley and Zisserman [31] is an excellent reference. Also, Ma et al. [53] outlined a complete step-by-step recipe for the SFM problem, assuming the information from all cameras is collected in one place. We note that SFM is closely related to a problem in the robotics community called Simultaneous Localization and Mapping
(SLAM) [49, 88], in which mobile robots must estimate their locations from sensor data as they move through a scene.
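For readers unfamiliar with the parameterization being estimated, the short sketch below projects 3-D world points into an image using internal parameters K and external parameters (R, t); the numeric values are arbitrary examples, not a calibrated camera. External calibration amounts to recovering R and t for every camera in a common world frame.

```python
import numpy as np

def project(X_world, K, R, t):
    """Project 3-D world points (N x 3) to pixel coordinates using a pinhole
    model: internal parameters K (3 x 3) and external parameters R, t that
    map world coordinates into the camera frame."""
    X_cam = X_world @ R.T + t            # world -> camera coordinates
    x = X_cam @ K.T                      # apply intrinsics
    return x[:, :2] / x[:, 2:3]          # perspective division

# Illustrative parameters (not from any real camera): focal length 800 px,
# principal point (320, 240), camera 5 m back from the scene.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])
points = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, 0.0]])
print(project(points, K, R, t))          # [[320. 240.] [480. 320.]]
```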
4.1 Non-overlapping Camera Calibration

Most calibration algorithms for camera networks assume overlapping fields of view, as in the topology estimation methods of Section 3.2. In the non-overlapping camera case, estimating the actual 3D positions of the cameras based only on their visual information is generally a poorly-posed problem. That is, multiple cameras must actually observe many of the same 3D points as the basis for accurate estimation of localization parameters. However, several researchers have proposed solutions that make assumptions about the dynamics of objects transiting the environment to get rough estimates of camera positions and orientations. In this area, the work by Rahimi et al. [77] is the most notable. They encoded prior knowledge about the state (location and velocity) of targets as a linear Gaussian Markov model, and then combined this prior with measurements from the camera network to produce maximum a posteriori estimates of global trajectories and hence camera calibration parameters. The approach was illustrated with networks of four to ten downward-facing cameras observing a single target moving on a ground plane. It would be interesting to see how this method generalizes to large numbers of cameras in a fully 3-D environment. However, it is not reasonable to expect that highly accurate calibration parameters could be obtained with this type of approach using naturally-moving objects in a real-world, well-separated camera network due to difficult-to-model dynamics and unmodeled sources and sinks. Rekleitis et al. [80, 79] proposed the use of a mobile robot holding a calibration pattern (a checkerboard or an array of unique fiducial markers) as the basis for non-overlapping camera calibration. The robot’s pose on the ground plane, obtained from odometry, is used to provide each camera with the 3D positions of points on the calibration target.
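When known 3-D positions of target points are available, as in the robot-assisted setup just described, each camera can resection itself independently. The sketch below uses the standard direct linear transform (DLT) to recover a projection matrix from 3-D/2-D correspondences; the world points and ground-truth camera are synthetic, and a practical system would refine this linear estimate nonlinearly and handle measurement noise.

```python
import numpy as np

def dlt_resection(X, x):
    """Estimate a 3x4 projection matrix P from n >= 6 correspondences between
    known 3-D points X (n x 3) and their observed pixels x (n x 2), using the
    direct linear transform. Each correspondence gives two rows of A."""
    A = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        A.append([Xw, Yw, Zw, 1, 0, 0, 0, 0, -u * Xw, -u * Yw, -u * Zw, -u])
        A.append([0, 0, 0, 0, Xw, Yw, Zw, 1, -v * Xw, -v * Yw, -v * Zw, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 4)          # null vector = flattened P (up to scale)

def reproject(P, X):
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]

# Hypothetical data: target points whose world coordinates come from the
# robot's odometry, and their detected pixel positions in one camera.
X = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
              [1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
P_true = np.array([[700.0, 0, 300, 2000],
                   [0, 700.0, 250, 1500],
                   [0, 0, 1, 6]])
x = reproject(P_true, X)
P_est = dlt_resection(X, x)
print(np.allclose(reproject(P_est, X), x, atol=1e-4))   # True: exact data
```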
4.2 Overlapping Camera Calibration

Several methods, while still centralized, have used visual sensor networks as the motivation for the design of calibration algorithms. These algorithms are usually based on different cameras’ observations of the same object or objects in their environment. For this purpose, several researchers have suggested modulated light sources moved throughout a darkened room [5, 16, 7] or placed on the cameras themselves [86]. Maas [54] used a moving reference bar of known length, while Baker and Aloimonos [6] used grids of parallel lines and square boxes. Liu et al. [50] proposed an algorithm in which each camera independently calibrates itself based on images
of a placed reference device with known world coordinates. As mentioned previously, algorithms that require no user interaction with the environment are generally preferable for large-scale ad-hoc deployments. Similar to the non-overlapping topology estimation problem, several researchers have proposed to obtain corresponding points for calibration by matching trajectories of objects that naturally move through the cameras’ environment. Lee et al. [47] described one such approach in which pedestrians and vehicles moving on a ground plane were used to calibrate two or three overlapping surveillance cameras. Other approaches along this line were proposed by Jaynes [41] and Meingast et al. [63]. Several researchers have addressed distributed versions of this type of camera network calibration. For example, Lee and Aghajan [46] used the joint observations of a moving target to solve a set of nonlinear equations for its position (and hence, the camera positions) using a distributed version of the Gauss-Newton method. Another excellent algorithm of this type called SLAT (Simultaneous Location and Tracking) was proposed by Funiak et al. [28]. The key innovations were a relative overparameterization of the cameras that allows Gaussian densities to be used, and a linearization procedure to address the uncertainty in camera angle for proper use of a Kalman filter. It would be very interesting to see how such an approach would scale to full calibration of large outdoor camera networks with many moving objects. Many distributed algorithms for full 3D camera network calibration rely on using image correspondences to estimate the epipolar geometry between image pairs. When the intrinsic camera parameters are known, this information can be used to extract estimates of the rotation and translation between each camera pair. For example, Barton-Sweeney et al. [7] used an Extended Kalman Filtering framework on the estimated epipolar geometry to estimate the rotation and translation between a camera pair. Mantzel et al. [57] proposed an algorithm called DALT (Distributed Alternating Localization and Triangulation) for camera network calibration. As the title suggests, each camera calibrates itself by alternating between 1) triangulating 2D image projections into 3D space, assuming the camera matrices are known, and 2) estimating the camera matrices based on putative image-to-world correspondences. While their optimization approach is simpler than bundle adjustment, this work includes some interesting analysis of the computational and power requirements that would be incurred by the algorithm on a realistic platform. These algorithms used a user-inserted calibration object as the basis for establishing point correspondences. Devarajan et al. [21, 23] proposed a calibration algorithm in which each camera only communicates with the cameras connected to it by an edge in the vision graph, exchanging image projections of common 3D points. After performing a local projective-to-metric factorization and bundle adjustment, each camera has an estimate of 1) its own location, orientation, and focal length, 2) the corresponding parameters for each of its neighbors in the vision graph, and 3) the 3D positions of the image feature points it has in common with its neighbors. They observed that the distributed calibration algorithm not only resulted in good calibration accuracy compared to centralized bundle adjustment, but also that the messages per
communication link required to calibrate the network were more fairly distributed, implying a longer node lifetime. Medeiros et al. [62] proposed a distributed cluster-based calibration algorithm for which they observed a similar result: the distributed algorithm required significantly fewer transmit and receive bits than a centralized algorithm, while having comparable accuracy. However, their approach makes a restrictive assumption that distinctive target objects with fixed features at a known scale are used for calibration. We note that Brand et al. [13] presented a novel approach to large-scale camera network calibration using a graph embedding formalism. They first estimated node-to-node direction vectors in 3D using corresponding features and vanishing points (the signs and lengths of the vectors were unknown). They then showed how the solution to a simple eigenvalue problem resulted in the embedding of 3D camera locations that was maximally consistent with the measured directions. The algorithm is remarkably fast and accurate, albeit centralized. It would be very interesting to investigate whether the eigenvalue problems can be solved in a distributed manner.
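Since several of the methods in this section start from pairwise epipolar geometry, the following self-contained sketch shows the standard recipe in its simplest form: estimate an essential matrix from normalized correspondences with the linear 8-point algorithm and enumerate the four candidate relative poses. It is a textbook construction on synthetic, noise-free data, not the estimator used in any of the cited systems; robust estimation (e.g., RANSAC) and the cheirality test that selects the valid candidate are omitted.

```python
import numpy as np

def relative_pose_candidates(x1, x2):
    """Given normalized image correspondences x1, x2 (N x 2, i.e. pixel
    coordinates premultiplied by K^-1), estimate the essential matrix with
    the linear 8-point algorithm and return the four candidate (R, t) pairs.
    The physically valid pair is the one that places triangulated points in
    front of both cameras (cheirality test, omitted here for brevity)."""
    A = np.array([[u2*u1, u2*v1, u2, v2*u1, v2*v1, v2, u1, v1, 1.0]
                  for (u1, v1), (u2, v2) in zip(x1, x2)])
    E = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, s, Vt = np.linalg.svd(E)
    U, Vt = U * np.sign(np.linalg.det(U)), Vt * np.sign(np.linalg.det(Vt))
    W = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 1]])
    R1, R2, t = U @ W @ Vt, U @ W.T @ Vt, U[:, 2]
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# Synthetic check with an assumed ground-truth motion between two cameras.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (20, 3)) + np.array([0, 0, 6.0])   # points in front
angle = np.deg2rad(10.0)
R_true = np.array([[np.cos(angle), 0, np.sin(angle)],
                   [0, 1, 0],
                   [-np.sin(angle), 0, np.cos(angle)]])
t_true = np.array([1.0, 0.2, 0.0])
x1 = X[:, :2] / X[:, 2:3]                                  # camera 1 at origin
Xc2 = X @ R_true.T + t_true
x2 = Xc2[:, :2] / Xc2[:, 2:3]                              # camera 2
cands = relative_pose_candidates(x1, x2)
print(any(np.allclose(R, R_true, atol=1e-5) for R, _ in cands))  # True
```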
4.3 Improving Calibration Consistency

Many researchers have proposed methods for dealing with inconsistent parameter estimates in wireless sensor networks. These are typically based on verifying whether estimates of the same parameter are sufficiently close [84, 14, 95] and/or consistent with prior knowledge [25]. Most such approaches are well-suited for sensors that measure a scalar quantity such as temperature or pressure. In contrast, camera networks contain inconsistencies in a continuous, high-dimensional parameter space, and require principled statistical methods for resolving them. Devarajan and Radke [22] proposed a calibration refinement method based on belief propagation [67], a method for probabilistic inference in networks that several researchers have recently applied to discrete or low-dimensional sensor networking and robotics problems (e.g., [93, 18, 87, 36]). The belief propagation algorithm involved passing messages about common camera parameters along the edges of the vision graph that iteratively brought inconsistent estimates of camera parameters closer together. They discussed solutions to several challenges unique to the camera calibration problem. They showed that the inconsistency in camera localization was reduced by factors of 2 to 6 after applying their belief propagation algorithm, while still maintaining high accuracy in the parameter estimates. Several researchers have recently applied more sophisticated and robust distributed inference algorithms than belief propagation to sensor fusion problems; the work of Paskin, Guestrin, and McFadden [71, 72] and Dellaert et al. [20] is notable. Future distributed camera calibration techniques could benefit from these new algorithms. Some researchers have suggested improvements in calibration by augmenting the input from visual sensors with non-visual sensor measurements. For example,
Meingast et al. [64] fused image data with radio interferometry to estimate visual sensors’ position and orientation. However, this method relied on centralized processing.
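To give a feel for what "bringing inconsistent estimates closer together" means operationally, the toy fragment below runs a plain neighborhood-averaging consensus over the vision graph. It is a much-simplified stand-in for the Gaussian belief propagation of [22] (no covariances, no message schedules), with invented camera estimates of a shared landmark position.

```python
import numpy as np

def reconcile(estimates, vision_edges, num_iters=50, alpha=0.5):
    """Toy consensus step: each camera repeatedly nudges its estimate of a
    shared parameter toward the average of its vision-graph neighbours'
    estimates, so initially inconsistent values converge to a common one.
    estimates: dict camera -> parameter estimate (array-like)."""
    x = {c: np.asarray(v, dtype=float).copy() for c, v in estimates.items()}
    nbrs = {c: [] for c in x}
    for a, b in vision_edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    for _ in range(num_iters):
        new_x = {}
        for c in x:
            if nbrs[c]:
                avg = np.mean([x[n] for n in nbrs[c]], axis=0)
                new_x[c] = (1 - alpha) * x[c] + alpha * avg
            else:
                new_x[c] = x[c]
        x = new_x
    return x

# Three cameras disagree about the 3-D position of a shared landmark.
estimates = {0: [1.0, 2.0, 0.9], 1: [1.2, 1.8, 1.1], 2: [0.8, 2.1, 1.0]}
print(reconcile(estimates, vision_edges=[(0, 1), (1, 2), (0, 2)]))
```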
5 Tracking and Classification

A visual sensor network is usually calibrated for an application-level purpose. Commonly, the camera network is tasked to track and/or classify objects as they move through the environment. A massive literature on single-camera or centralized multi-camera tracking exists; here we focus on distributed algorithms for the tracking problem designed with sensor networking considerations in mind. Rao and Durrant-Whyte [78] presented early work that recognized the importance of a decentralized algorithm for target tracking. They proposed a Bayesian approach, but required a fully connected network topology that is unrealistic for modern sensor networks. Hoffmann et al. [35] described a distributed algorithm for multi-camera tracking, modeled as a finite state machine. The proposed algorithm was validated in a simulated network in which it was assumed that object detection results were entirely accurate. Dockstader and Tekalp [24] presented an algorithm for distributed real-time multi-object tracking, based on a Bayesian belief network formalism for fusing observations with various confidence estimates. The basic strategy is for each camera to track each object independently in 2D, apply a distributed fusion/triangulation algorithm to obtain a 3D location, and then independently apply a 3D Kalman filter at each camera. Results were demonstrated for a three-camera network. Qu et al. [73] proposed a Bayesian algorithm for distributed multi-target tracking using multiple collaborative cameras. Their approach was based on a dynamic graphical model that updates estimates of objects’ 2D positions instead of tracking in 3D. The algorithm is well-suited for challenging environments such as crowds of similar pedestrians. Heath and Guibas [33] described an algorithm for multi-person tracking in a network of stereo (i.e., two closely-separated) cameras. The use of stereo cameras enables each node to estimate the trajectories of 3D points on each moving object. Each node communicates these feature positions to its neighbors and performs independent 3D particle filtering based on its received messages. They demonstrated results on small networks of real cameras as well as a larger simulated camera network observing a crowded scene. They analyzed the tradeoffs relating the communication budget for each camera, the number of tracked people in the environment, and the tracking precision. Heath and Guibas [32] also described an algorithm for tracking people and acquiring canonical (i.e., frontal-facing) face images in a camera network. They proposed a tracking algorithm based on 2D visual hulls synthesized at clusterhead nodes, and used the direction of pedestrian motion as a simple estimate for head orientation. The idea seems promising and a more fully distributed system would be interesting.
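The per-camera 3D filtering step in the strategy just described is typically a standard Kalman filter. The sketch below implements a generic constant-velocity filter over noisy 3-D position measurements (such as triangulated object locations); the noise levels and simulated trajectory are arbitrary, and it is not the specific formulation of [24].

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal 3-D constant-velocity Kalman filter over the state
    [x, y, z, vx, vy, vz], with noisy 3-D position measurements.
    A generic textbook filter, not the formulation of any cited system."""

    def __init__(self, dt=1.0, q=0.01, r=0.05):
        self.F = np.eye(6)
        self.F[:3, 3:] = dt * np.eye(3)          # position += velocity * dt
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])
        self.Q = q * np.eye(6)                   # process noise
        self.R = r * np.eye(3)                   # measurement noise
        self.x = np.zeros(6)
        self.P = np.eye(6)

    def step(self, z):
        # Predict.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with the 3-D measurement z.
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3]

kf = ConstantVelocityKF()
for k in range(10):                               # object moving along +x
    z = np.array([0.5 * k, 2.0, 0.0]) + 0.05 * np.random.randn(3)
    smoothed = kf.step(z)
print(smoothed)                                   # near the last true position
```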
Mensink et al. [65] described a multi-camera tracking algorithm in which the observation at each camera was modeled as a mixture of Gaussians, with each Gaussian representing the appearance features of a single person. The authors extended the gossip-based Newscast EM algorithm [42] discussed in Section 2 to incorporate multiple observations into the iterative, distributed computation of means, covariances, and mixing weights in the Gaussian mixture model. The preliminary results showed that the distributed estimation algorithm converged as quickly and accurately as a centralized EM algorithm. This is an excellent example of a distributed computer vision algorithm. Leistner et al. [48] described an interesting technique for automatic person detection that applied an online learning method called co-training to improve the local detection results. Cameras viewing the same scene exchange image patches that they are confident contain positive and negative examples of an object, comprising a few hundred bytes per frame. They implemented the algorithm on smart cameras and demonstrated that the approach resulted in increasingly better detection results. The problem of matching objects between images from different cameras in a visual sensor network is sometimes called reacquisition or reidentification, and often results in a handoff of the responsibility of object tracking from one camera to another. Quaritsch et al. [74] described a distributed object tracking algorithm that illustrates the main ideas of decentralized tracking handoff. The basic idea is that the responsibility for tracking each object lies with a particular master camera. When the object nears the periphery of this camera’s field of view, neighboring cameras are put into slave mode, searching for the object until it appears. When a slave discovers the object in its field of view, it becomes the new master. Arth et al. [3] proposed an algorithm for reacquisition based on signatures, or compact representations of local object features. The signatures are much less expensive to transmit through the network than the original images. The idea has some resemblance to the feature digest of Cheng et al. [17] discussed in Section 3. The algorithm was demonstrated on a simulated traffic camera network, and its performance and communication properties with respect to noise were analyzed in some detail. Park et al. [70] proposed an algorithm for the distributed construction of look-up tables stored locally at each node that are used to select the cameras that are most likely to observe a specific location. These lookup tables could be used to improve camera handoff for multiple moving targets. Qureshi and Terzopoulos [75] described distributed algorithms for segmentation, visual tracking, camera assignment, and handoff in the context of a large-scale virtual camera network in a train station environment. Objects are detected and matched between cameras that are not assumed to have uniform imaging characteristics using a color indexing method based on histogram backprojection. Iwaki et al. [37] pointed out the importance of developing tracking algorithms for large-scale camera networks that did not assume synchronized image capture. They presented an algorithm based on evidence accumulation that operates in a hierarchical manner. Cameras individually detect human heads in the scene and send the results to clusterheads for outlier detection and localization refinement.
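The master/slave handoff logic described for [74] can be summarized as a small per-camera state machine. The sketch below is schematic: the message names, neighbor lists, and the detection and field-of-view-boundary tests are placeholders, not the actual protocol.

```python
# Schematic master/slave handoff logic in the spirit of the approach described
# above. Each camera runs this loop; the tests and messages are placeholders.
IDLE, SLAVE, MASTER = "idle", "slave", "master"

class CameraNode:
    def __init__(self, cam_id, neighbors):
        self.cam_id, self.neighbors, self.mode = cam_id, neighbors, IDLE

    def on_frame(self, object_visible, near_fov_boundary, send):
        if self.mode == MASTER:
            if near_fov_boundary:
                for n in self.neighbors:          # ask neighbours to watch
                    send(n, "search")
            if not object_visible:
                self.mode = IDLE                  # lost it; a slave takes over
        elif self.mode == SLAVE and object_visible:
            self.mode = MASTER                    # handoff: slave becomes master
            for n in self.neighbors:
                send(n, "stop_search")

    def on_message(self, msg):
        if msg == "search" and self.mode == IDLE:
            self.mode = SLAVE
        elif msg == "stop_search" and self.mode == SLAVE:
            self.mode = IDLE

# Minimal usage: a slave camera spots the object and takes over as master.
send_log = []
cam = CameraNode(1, neighbors=[2, 3])
cam.on_message("search")
cam.on_frame(object_visible=True, near_fov_boundary=False,
             send=lambda n, m: send_log.append((n, m)))
print(cam.mode, send_log)   # master [(2, 'stop_search'), (3, 'stop_search')]
```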
Some researchers have suggested improvements in tracking by augmenting the input from visual sensors with non-visual sensor measurements. For example, Miyaki et al. [66] fused image tracking information with the signal strength from pedestrians’ Wi-Fi mobile devices to better estimate their locations.
6 Conclusions

The past five years have seen an encouraging upturn in the development of distributed solutions to classical computer vision problems. We believe that there are many viable distributed alternatives for accurately estimating transition models and visual overlap, 3D calibration parameters, and 2D and 3D object tracks. The highest-level algorithms surveyed here emphasize simple tracking rather than the estimation of more subtle object/environment characteristics. We view this as a natural next step in the development of distributed computer vision algorithms. As an early example, Chang and Aghajan [15] described fusion algorithms based on linear quadratic regression and Kalman filtering to estimate the orientation of faces imaged by a camera network. Most algorithms also assume that the camera network is entirely engaged in performing the same vision task at every node, and that all tasks have the same priority. However, for distributed wireless camera networks deployed in the real world, it will be important to consider calibration and other computer vision tasks in the context of the many other tasks the node processors and network need to undertake. This will entail a tight coupling between the vision algorithms and the MAC, network, and link-layer protocols, organization, and channel conditions of the network, as well as the power supply, transmitter/receiver and kernel scheduler of each node. This tight integration would be critical for making the algorithms described here viable for embedded platforms, such as networks of cell-phone cameras [10]. Bramberger et al. [12] recently described an algorithm for allocating general tasks in a camera network, posed as a distributed constraint satisfaction problem. When a smart camera poses a new task, the request is cycled through the cameras in the network, incrementally merging partial task allocations until the request arrives back at the originating camera. Finally, we note that prototyping vision algorithms in a real network of cameras is a time-, labor-, and money-consuming proposition. Visually accurate, large-scale camera network simulation systems such as those developed by Qureshi and Terzopoulos [75] or Heath and Guibas [33] will enable rapid prototyping of distributed computer vision algorithms without requiring the resources for an actual deployment.
References [1] Aghajan H, Kleihorst R, Rinner B, Wolf W (eds) (2008) IEEE Journal of Selected Topics in Signal Processing Special Issue on Distributed Processing in Vision Networks, vol 2. IEEE [2] Ai J, Abouzeid A (2006) Coverage by directional sensors in randomly deployed wireless sensor networks. Journal of Combinatorial Optimization 11(1):21 – 41 [3] Arth C, Leistner C, Bischof H (2007) Object reacquisition and tracking in large-scale smart camera networks. In: First ACM/IEEE International Conference on Distributed Smart Cameras, pp 156–163 [4] Avidan S, Moses Y, Moses Y (2007) Centralized and distributed multi-view correspondence. International Journal of Computer Vision 71(1):49–69 [5] Baker P, Aloimonos Y (2000) Complete calibration of a multi-camera network. In: Proceedings of IEEE Workshop on Omnidirectional Vision 2000 [6] Baker P, Aloimonos Y (2003) Calibration of a multicamera network. In: Proceedings of the IEEE Workshop on Omnidirectional Vision [7] Barton-Sweeney A, Lymberopoulos D, Savvides A (2006) Sensor localization and camera calibration in distributed camera sensor networks. In: Proceedings of the 3rd International Conference on Broadband Communications, Networks and Systems (BROADNETS), San Jose, CA, USA, pp 1–10 [8] Bash BA, Desnoyers PJ (2007) Exact distributed Voronoi cell computation in sensor networks. In: International Conference on Information Processing in Sensor Networks [9] Bertsekas DP, Tsitsiklis JN (1997) Parallel and Distributed Computation: Numerical Methods. Athena Scientific [10] Bolliger P, Köhler M, Römer K (2007) Facet: Towards a smart camera network of mobile phones. In: Proceedings of Autonomics 2007 (ACM First International Conference on Autonomic Computing and Communication Systems), Rome, Italy [11] Boyd S, Ghosh A, Prabhakar B, Shah D (2005) Gossip algorithms: design, analysis and applications. In: IEEE INFOCOM, vol 3, pp 1653–1664 [12] Bramberger M, Doblander A, Maier A, Rinner B, Schwabach H (2006) Distributed embedded smart cameras for surveillance applications. Computer 39(2):68–75 [13] Brand M, Antone M, Teller S (2004) Spectral solution of large-scale extrinsic camera calibration as a graph embedding problem. In: Proceedings of the European Conference on Computer Vision [14] Capkun S, Hubaux JP (2005) Secure positioning of wireless devices with application to sensor networks. In: Proceedings of IEEE INFOCOM [15] Chang CC, Aghajan H (2006) Collaborative face orientation detection in wireless image sensor networks. In: Workshop on Distributed Smart Cameras [16] Chen X, Davis J, Slusallek P (2000) Wide area camera calibration using virtual calibration objects. In: IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition
[17] Cheng Z, Devarajan D, Radke R (2007) Determining vision graphs for distributed camera networks using feature digests. EURASIP Journal of Applied Signal Processing, Special Issue on Visual Sensor Networks Article ID 57034 [18] Christopher C, Avi P (2003) Loopy belief propagation as a basis for communication in sensor networks. In: Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI-03), Morgan Kaufmann Publishers, San Francisco, CA, pp 159–166 [19] Collins RT, Tsin Y (1999) Calibration of an outdoor active camera system. In: IEEE Computer Vision and Pattern Recognition 1999 [20] Dellaert F, Kipp A, Krauthausen P (2005) A multifrontal QR factorization approach to distributed inference applied to multirobot localization and mapping. In: Proceedings of National Conference on Artificial Intelligence (AAAI 05), pp 1261–1266 [21] Devarajan D, Radke R (2004) Distributed metric calibration for large-scale camera networks. In: Proceedings of the First Workshop on Broadband Advanced Sensor Networks (BASENETS) 2004 (in conjunction with BroadNets 2004), San Jose, CA [22] Devarajan D, Radke R (2007) Calibrating distributed camera networks using belief propagation. EURASIP Journal of Applied Signal Processing, Special Issue on Visual Sensor Networks Article ID 60696 [23] Devarajan D, Radke R, Chung H (2006) Distributed metric calibration of adhoc camera networks. ACM Transactions on Sensor Networks 2(3):380–403 [24] Dockstader S, Tekalp A (2001) Multiple camera tracking of interacting and occluded human motion. Proceedings of the IEEE 89(10):1441–1455 [25] Du W, Fang L, Peng N (2006) LAD: localization anomaly detection for wireless sensor networks. J Parallel Distrib Comput 66(7):874–886 [26] Farrell R, Davis L (2008) Decentralized discovery of camera network topology. In: Second ACM/IEEE International Conference on Distributed Smart Cameras, Stanford University, Palo Alto, CA, USA [27] Farrell R, Doermann D, Davis L (2007) Learning higher-order transition models in medium-scale camera networks. In: IEEE 11th International Conference on Computer Vision [28] Funiak S, Guestrin C, Paskin M, Sukthankar R (2006) Distributed localization of networked cameras. In: Proceedings of the Fifth International Conference on Information Processing in Sensor Networks [29] Haas Z, et al (2002) Wireless ad hoc networks. In: Proakis J (ed) Encyclopedia of Telecommunications, John Wiley [30] Hartley R (1994) Self-calibration from multiple views with a rotating camera. In: Proc. ECCV ’94, vol 1, pp 471–478 [31] Hartley R, Zisserman A (2000) Multiple View Geometry in Computer Vision. Cambridge University Press [32] Heath K, Guibas L (2007) Facenet: Tracking people and acquiring canonical face images in a wireless camera sensor network. In: First ACM/IEEE International Conference on Distributed Smart Cameras, pp 117–124
[33] Heath K, Guibas L (2008) Multi-person tracking from sparse 3D trajectories in a camera sensor network. In: Second ACM/IEEE International Conference on Distributed Smart Cameras, Stanford University, Palo Alto, CA, USA [34] van den Hengel A, Dick A, Detmold H, Cichowski A, Hill R (2007) Finding camera overlap in large surveillance networks. In: Proceedings of the 8th Asian Conference on Computer Vision [35] Hoffmann M, Hähner J (2007) Rocas: A robust online algorithm for spatial partitioning in distributed smart camera systems. In: First ACM/IEEE International Conference on Distributed Smart Cameras, pp 267–274 [36] Ihler A, Fisher J, Moses R, Willsky A (2004) Nonparametric belief propagation for self-calibration in sensor networks. In: Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, Berkeley, CA [37] Iwaki H, Srivastava G, Kosaka A, Park J, Kak A (2008) A novel evidence accumulation framework for robust multi-camera person detection. In: Second ACM/IEEE International Conference on Distributed Smart Cameras, Stanford University, Palo Alto, CA, USA [38] Iyengar R, Sikdar B (2003) Scalable and distributed GPS free positioning for sensor networks. In: IEEE International Conference on Communications, vol 1, pp 338–342 [39] Javed O, Rasheed Z, Shafique K, Shah M (2003) Tracking across multiple cameras with disjoint views. In: Proceedings of the 9th International Conference on Computer Vision, Nice, France [40] Javed O, Shafique K, Shah M (2005) Appearance modeling for tracking in multiple non-overlapping cameras. In: The IEEE Computer Society Conference on Computer Vision and Pattern Recognition [41] Jaynes C (1999) Multi-view calibration from planar motion for video surveillance. In: Second IEEE Workshop on Visual Surveillance (VS’99), pp 59–66 [42] Kowalczyk W, Vlassis N (2004) Newscast EM. In: Neural Information Processing Systems Conference [43] Kulkarni P, Shenoy P, Ganesan D (2007) Approximate initialization of camera sensor networks. In: Proceedings of the 4th European Conference on Wireless Sensor Networks [44] Kundur D, Lin CY, Lu CS (eds) (2007) EURASIP Journal on Advances in Signal Processing Special Issue on Visual Sensor Networks. Hindawi Publishing Corporation [45] Langendoen K, Reijers N (2003) Distributed localization in wireless sensor networks: A quantitative comparison. Computer Networks 43(4):499–518 [46] Lee H, Aghajan H (2006) Collaborative node localization in surveillance networks using opportunistic target observations. In: Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, pp 9–18 [47] Lee L, Romano R, Stein G (2000) Monitoring activities from multiple video streams: Establishing a common coordinate frame. IEEE Transactions on Pattern Recognition and Machine Intelligence 22(8)
[48] Leistner C, Roth P, Grabner H, Bischof H, Starzacher A, Rinner B (2008) Visual on-line learning in distributed camera networks. In: Second ACM/IEEE International Conference on Distributed Smart Cameras, Stanford University, Palo Alto, CA, USA [49] Leonard J, Durrant-Whyte H (1991) Simultaneous map building and localization for an autonomous mobile robot. In: Proceedings of the IEEE/RSJ International Workshop on Intelligent Robots and Systems, pp 1442–1447 [50] Liu X, Kulkarni P, Shenoy P, Ganesan D (2006) Snapshot: A self-calibration protocol for camera sensor networks. In: Proceedings of IEEE/CreateNet BASENETS 2006 [51] Lowe DG (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2):91–110 [52] Lynch N (1996) Distributed Algorithms. Morgan Kaufmann [53] Ma Y, Soatto S, Košecká J, Sastry S (2004) An Invitation to 3-D Vision: From Images to Geometric Models, Springer, chap 11 [54] Maas HG (1999) Image sequence based automatic multi-camera system calibration techniques. ISPRS Journal of Photogrammetry and Remote Sensing 54(5-6):352–359 [55] Makris D, Ellis T, Black J (2004) Bridging the gaps between cameras. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition [56] Mandel Z, Shimshoni I, Keren D (2007) Multi-camera topology recovery from coherent motion. In: First ACM/IEEE International Conference on Distributed Smart Cameras, pp 243–250 [57] Mantzel W, Choi H, Baraniuk R (2004) Distributed camera network localization. In: Proceedings of the 38th Asilomar Conference on Signals, Systems and Computers [58] Marinakis D, Dudek G (2005) Topology inference for a vision-based sensor network. In: In Proc. of Canadian Conference on Computer and Robot Vision [59] Marinakis D, Dudek G (2006) Probabilistic self-localization for sensor networks. In: AAAI National Conference on Artificial Intelligence [60] Marinakis D, Dudek G (2008) Occam’s razor applied to network topology inference. IEEE Transactions on Robotics 24(2):293 – 306 [61] Medeiros H, Park J, Kak A (2007) A light-weight event-driven protocol for sensor clustering in wireless camera networks. In: First ACM/IEEE International Conference on Distributed Smart Cameras, pp 203–210 [62] Medeiros H, Iwaki H, Park J (2008) Online distributed calibration of a large network of wireless cameras using dynamic clustering. In: Second ACM/IEEE International Conference on Distributed Smart Cameras, Stanford University, Palo Alto, CA, USA [63] Meingast M, Oh S, Sastry S (2007) Automatic camera network localization using object image tracks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshop on Visual Representations and Modeling of Large-scale Environments
[64] Meingast M, Kushwaha M, Oh S, Koutsoukos X, Ledeczi A, Sastry S (2008) Heterogeneous camera network localization using data fusion. In: Second ACM/IEEE International Conference on Distributed Smart Cameras, Stanford University, Palo Alto, CA, USA [65] Mensink T, Zajdel W, Krose B (2007) Distributed EM learning for appearance based multi-camera tracking. In: First ACM/IEEE International Conference on Distributed Smart Cameras, pp 178–185 [66] Miyaki T, Yamasaki T, Aizawa K (2007) Multi-sensor fusion tracking using visual information and wi-fl location estimation. In: First ACM/IEEE International Conference on Distributed Smart Cameras, pp 275–282 [67] Murphy KP, Weiss Y, Jordan MI (1999) Loopy belief propagation for approximate inference: An empirical study. In: Proceedings of Uncertainty in Artificial Intelligence (UAI ’99), pp 467–475 [68] Niu C, Grimson E (2006) Recovering non-overlapping network topology using far-field vehicle tracking data. In: Proceedings of the 18th International Conference on Pattern Recognition [69] Nowak R (2003) Distributed EM algorithms for density estimation and clustering in sensor networks. IEEE Transactions on Signal Processing 51(8):2245– 2253 [70] Park J, Bhat PC, Kak AC (2006) A look-up table based approach for solving the camera selection problem in large camera networks. In: Workshop on Distributed Smart Cameras [71] Paskin MA, Guestrin CE (2004) Robust probabilistic inference in distributed systems. In: Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI ’04), pp 436–445 [72] Paskin MA, Guestrin CE, McFadden J (2005) A robust architecture for inference in sensor networks. In: 4th International Symposium on Information Processing in Sensor Networks (IPSN ’05) [73] Qu W, Schonfeld D, Mohamed M (2007) Distributed Bayesian multipletarget tracking in crowded environments using multiple collaborative cameras. EURASIP Journal on Advances in Signal Processing 2007(38373) [74] Quaritsch M, Kreuzthaler M, Rinner B, Bischof H, Strobl B (2007) Autonomous multicamera tracking on embedded smart cameras. EURASIP Journal on Advances in Signal Processing 2007(92827) [75] Qureshi F, Terzopoulos D (2008) Smart camera networks in virtual reality. Proceedings of the IEEE 96(10):1640–1656 [76] Rabbat M, Nowak R (2004) Distributed optimization in sensor networks. In: International Conference on Information Processing in Sensor Networks [77] Rahimi A, Dunagan B, Darrell T (2004) Simultaneous calibration and tracking with a network of non-overlapping sensors. In: The IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol 1, pp 187–194 [78] Rao B, Durrant-Whyte H (1993) A decentralized Bayesian algorithm for identification of tracked targets. IEEE Transactions on Systems, Man and Cybernetics 23(6):1683–1698
[79] Rekleitis I, Meger D, Dudek G (2006) Simultaneous planning, localization, and mapping in a camera sensor network. Robotics and Autonomous Systems 54(11):921–932 [80] Rekleitis IM, Dudek G (2005) Automated calibration of a camera sensor network. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp 401–406 [81] Rinner B, Wolf W (2008) An introduction to distributed smart cameras. Proceedings of the IEEE 96(10):1565–1575 [82] Rinner B, Wolf W (eds) (2008) Proceedings of the IEEE Special Issue on Distributed Smart Cameras, vol 96. IEEE [83] Rinner B, Winkler T, Schriebl W, Quaritsch M, Wolf W (2008) The evolution from single to pervasive smart cameras. In: Second ACM/IEEE International Conference on Distributed Smart Cameras, Stanford University, Palo Alto, CA, USA [84] Sastry N, Shankar U, Wagner D (2003) Secure verification of location claims. In: Proceedings of WiSe ’03, the 2nd ACM workshop on Wireless Security, pp 1–10 [85] Stein GP (1995) Accurate internal camera calibration using rotation, with analysis of sources of error. In: ICCV, pp 230–236 [86] Taylor CJ, Shirmohammadi B (2006) Self localizing smart camera networks and their applications to 3D modeling. In: Proceedings of the International Workshop on Distributed Smart Cameras [87] Thorpe J, McEliece R (2002) Data Fusion Algorithms for Collaborative Robotic Exploration. Interplanetary Network Progress Report 149:1–14 [88] Thrun S, Burgard W, Fox D (2005) Probabilistic Robotics: Intelligent Robotics and Autonomous Agents. MIT Press [89] Tieu K, Dalley G, Grimson W (2005) Inference of non-overlapping camera network topology by measuring statistical dependence. In: Tenth IEEE International Conference on Computer Vision, vol 2, pp 1842–1849 [90] Triggs B, McLauchlan P, Hartley R, Fitzgibbon A (2000) Bundle adjustment – A modern synthesis. In: Triggs W, Zisserman A, Szeliski R (eds) Vision Algorithms: Theory and Practice, LNCS, Springer Verlag, pp 298–375 [91] Tsai R (1992) A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. In: Wolff L, Shafer S, Healey G (eds) Radiometry – (Physics-Based Vision), Jones and Bartlett [92] Varshney PK (1997) Distributed Detection and Data Fusion. Springer [93] Venkatesh S, Alanyali M, Savas O (2006) Distributed detection in sensor networks with packet losses and finite capacity links. IEEE Transactions on Signal Processing 54(11):4118–4132 [94] Wang B, Sung K, Ng T (2002) The localized consistency principle for image matching under non-uniform illumination variation and affine distortion. In: European Conference on Computer Vision, vol LNCS 2350, pp 205–219, copenhagen, Denmark
[95] Wei Y, Yu Z, Guan Y (2007) Location verification algorithms for wireless sensor networks. In: Proceedings of the 27th International Conference on Distributed Computing Systems, p 70 [96] Zou X, Bhanu B, Song B, Roy-Chowdhury A (2007) Determining topology in a distributed camera network. In: Proceedings of the IEEE International Conference on Image Processing
Video-Based People Tracking

Marcus A. Brubaker, Leonid Sigal and David J. Fleet
1 Introduction

Vision-based human pose tracking promises to be a key enabling technology for myriad applications, including the analysis of human activities for perceptive environments and novel man-machine interfaces. While progress toward that goal has been exciting, and limited applications have been demonstrated, the recovery of human pose from video in unconstrained settings remains challenging. One of the key challenges stems from the complexity of the human kinematic structure itself. The sheer number and variety of joints in the human body (the nature of which is an active area of biomechanics research) entails the estimation of many parameters. The estimation problem is also challenging because muscles and other body tissues obscure the skeletal structure, making it impossible to directly observe the pose of the skeleton. Clothing further obscures the skeleton, and greatly increases the variability of individual appearance, which further exacerbates the problem. Finally, the imaging process itself produces a number of ambiguities, either because of occlusion, limited image resolution, or the inability to easily discriminate the parts of a person from one another or from the background. Some of these issues are inherent, yielding ambiguities that can only be resolved with prior knowledge; others lead to computational burdens that require clever engineering solutions. The estimation of 3D human pose is currently possible in constrained situations, for example with multiple cameras, with little occlusion or confounding background clutter, or with restricted types of movement. Nevertheless, despite a decade of
Fig. 1 Challenges in human pose estimation. Variation in body size and shape (a), occlusions of body parts (b), inability to observe the skeletal motion due to clothing (c), difficulty segmenting the person from the background (d), and complex interactions between people in the environment (e), are challenges that plague the recovery of human pose in unconstrained scenes.
active research, monocular 3D pose tracking remains largely unsolved. From a single view it is hard to escape ambiguities in depth and scale, reflection ambiguities where different 3D poses produce similar images, and missing observations of certain parts of the body because of self-occlusions. This chapter introduces the basic elements of modern approaches to pose tracking. We focus primarily on monocular pose tracking with a probabilistic formulation. While multiview tracking in constrained settings, e.g., with minimal occlusion, may be relatively straightforward [29, 10], the problems faced in monocular tracking often arise in the general multiview case as well. This chapter is not intended to be a thorough review of human tracking but rather a tutorial introduction for practitioners interested in applying vision-based human tracking systems. For a more exhaustive review of the literature we refer readers to [17, 38].
1.1 Tracking as Inference

Because of the inescapable uncertainty that arises due to ambiguity, and the prevalence of noisy or missing observations of body parts, it has become common to formulate human pose tracking in probabilistic terms. As such, the goal is to determine the posterior probability distribution over human poses or motions, conditioned on the image measurements (or observations). Formally, let s_t denote the state of the body at time t. It represents the unknown parameters of the model we wish to estimate. In our case it typically comprises the joint angles of the body along with the position and orientation of the body in world coordinates. We also have observations at each time, denoted z_t. This might simply be the image at time t or it might be a set of image measurements (e.g., edge locations or optical flow). Tracking can then be formulated as the problem of inferring the probability distribution over state sequences, s_{1:t} = (s_1, ..., s_t), conditioned on the observation history, z_{1:t} = (z_1, ..., z_t); that is, p(s_{1:t} | z_{1:t}). Using Bayes’ rule, it is common to express the posterior distribution as
p(s_{1:t} | z_{1:t}) = \frac{p(z_{1:t} | s_{1:t}) \, p(s_{1:t})}{p(z_{1:t})} .    (1)
Here, p(z_{1:t} | s_{1:t}) is called the likelihood. It is the probability of observing the image measurements given a state sequence. In effect the likelihood provides a measure of the consistency between a hypothetical motion and the given image measurements. The other major factor in (1) is the prior probability of the state sequence, p(s_{1:t}). This prior distribution captures whether a given motion is plausible or not. During pose tracking we aim to find motions that are both plausible and consistent with the image measurements. Finally, the denominator in (1), p(z_{1:t}), often called the partition function, does not depend on the state sequence, and is therefore considered to be constant for the purposes of this chapter. To simplify the task of approximating the posterior distribution over human motion (1), or of finding the most probable motion (i.e., the MAP estimate), it is common to assume that the likelihood and prior models can be factored further. For example, it is common to assume that the observations at each time are independent given the states. This allows the likelihood to be rewritten as a product of simpler likelihoods, one at each time:
t
∏ p(zi |si ) .
(2)
i=1
This assumption and the resulting factorization allow for more efficient inference and easier specification of the likelihood. Common measurement models and likelihood functions are discussed in Section 3. The prior distribution over human motion also plays a key role. In particular, ambiguities and noisy measurements often necessitate a prior model to resolve uncertainty. The prior model typically involves a specification of which poses are plausible or implausible, and which sequences of poses are plausible. Often this involves learning dynamical models from training data. This is discussed in Section 4. The last two elements in a probabilistic approach to pose tracking are inference and initialization. Inference refers to the process of finding good computational approximations to the posterior distribution, or to motions that are most probable. This is discussed in Section 5. Furthermore, tracking most often requires a good initial guess for the pose at the first frame, to initialize the inference. Section 6 discusses methods for automatic initialization of tracking and for recovery from tracking failures.
2 Generative Model for Human Pose To begin to formulate pose tracking in more detail, we require a parameterization of human pose. While special parameterizations might be required for certain tasks, most approaches to pose tracking assume an articulated skeleton, comprising connected, rigid parts. We also need to specify the relation between this skeleton and
Fig. 2 Simple image formation model for a human pose. The skeleton of the human body is overlaid with soft tissue and clothing. Furthermore, the formation of an image of the person (on the right) also depends on lighting and camera parameters.
the image observations. This is complex since we do not observe the skeleton directly. Rather, as illustrated in Figure 2, the skeleton is overlaid with soft tissue, which in turn is often covered by clothes. The image of the resulting surface also then depends on the viewpoint of the camera, the perspective projection onto the image plane, the scene illumination and several other factors.
2.1 Kinematic Parameterization An articulated skeleton, comprising rigid parts connected by joints, can be represented as a tree. One part, such as the upper torso, is defined to be the root node and all remaining parts are either a child of the root or of another part. In this way, the entire pose can be described by the position and orientation of the root node in a global coordinate frame, and the position and orientation of each part in the coordinate frame of its parent. The state s then comprises these positions and orientations. If parts are rigidly attached at joints then the number of degrees of freedom (DOFs) required will be less than the full 6 DOFs necessary to represent pose in 3D space. The precise number of degrees of freedom varies based on the type of joint. For instance, a hinge joint is commonly used to represent the knee and has one rotational DOF, while a ball-and-socket joint, often used to represent the hip, has three rotational DOFs. While real joints in the body are significantly more complex, such simple models greatly reduce the number of parameters to estimate. One critical issue when designing the state space is the parameterization of rotations. Formally, rotations in R3 are represented by 3×3 orthogonal matrices with determinant 1, the set of which is denoted SO(3). Unfortunately, 3×3 matrices have significantly more parameters than necessary to specify a rotation, and it is extremely difficult to keep a matrix in SO(3) as it changes over time. Lower-dimensional parameterizations of rotations are therefore preferred. Most common are Euler angles, which represent a rotation as a sequence of 3 elementary rotations about fixed axes. Unfortunately, Euler angles suffer from several problems including ambiguities and singularities
known as gimbal lock. The most commonly used alternatives include exponential maps [21] and quaternions [34].
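To make the kinematic-tree idea concrete, the following minimal sketch (in Python with NumPy) composes exponential-map rotations down a parent–child hierarchy to obtain global joint positions. The three-part chain, its offsets and the example joint angles are illustrative assumptions, not values taken from the chapter.

```python
import numpy as np

def exp_map(omega):
    """Rodrigues' formula: map an axis-angle (exponential map) vector to a 3x3 rotation."""
    theta = np.linalg.norm(omega)
    if theta < 1e-8:
        return np.eye(3)
    k = omega / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

# A toy kinematic tree: each part stores its parent index and the location of
# its joint in the parent's coordinate frame.
parents = [-1, 0, 1]                       # root (torso), thigh, calf
offsets = np.array([[0.0, 0.0, 0.0],       # root offset is unused
                    [0.1, 0.0, -0.3],      # hip location in the torso frame
                    [0.0, 0.0, -0.45]])    # knee location in the thigh frame

def forward_kinematics(root_pos, rotations):
    """Compose per-joint rotations down the tree to obtain global joint positions."""
    n = len(parents)
    R_global = [None] * n
    p_global = np.zeros((n, 3))
    for j in range(n):
        R_local = exp_map(rotations[j])
        if parents[j] < 0:
            R_global[j] = R_local
            p_global[j] = root_pos
        else:
            pa = parents[j]
            R_global[j] = R_global[pa] @ R_local
            p_global[j] = p_global[pa] + R_global[pa] @ offsets[j]
    return p_global

# Example: bend the knee (a hinge about the x-axis) by 0.5 radians.
state = np.zeros((3, 3))
state[2] = [0.5, 0.0, 0.0]
print(forward_kinematics(np.array([0.0, 0.0, 1.0]), state))
```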
2.2 Body Geometry The skeleton is overlaid with soft tissue and clothing. Indeed, we do not observe the skeleton but rather the surface properties of the resulting 3D volume. Both the geometry and the appearance of the body (and clothing) are therefore critical factors in the estimation of human pose and motion. Body geometry has been modeled in many ways and remains a largely unexplored issue in tracking and pose estimation. A commonly used model treats the segments of the body as rigid parts whose shapes can be approximated using simple primitives such as cylinders or ellipsoids. These geometric primitives have the advantage of being simple to design and efficient to work with under perspective projection [64, 73]. Other, more complex shape models have been used, such as deformable super-quadrics [37], and implicit functions comprising mixtures of Gaussian densities to model 3D occupancy [47]. The greater expressiveness allows one to more accurately model the body, which can improve pose estimation, but it increases the number of parameters to estimate, and the projection of the body onto the image plane becomes more computationally expensive. Recent efforts have been made to build detailed models of shape in terms of deformable triangulated meshes that are anchored to a skeleton; a well-known example is the SCAPE model [2]. By using dimensionality reduction, the triangulated mesh is parameterized using a small number of variables, avoiding the potential explosion in the number of parameters. Using multiple cameras one can accurately recover both the shape and pose [4]. However, the computational cost of such models is high, and may only be practical with offline processing. Good results on 3D monocular hand tracking have also been reported, based on a mesh-based surface model with approximately 1000 triangular facets [11]. However, even deformable mesh body models cannot account for loose fitting clothing. Dresses and robes are extreme examples, but even loose fitting shirts and pants can be difficult to handle, since the relationship between the surface geometry observed in the image and the underlying skeleton is very complex. In most current tracking algorithms, clothing is assumed to be tight fitting so that the observed geometry is similar to the underlying body. To handle the resulting errors due to these assumptions, the observation models (and the likelihood functions) must be robust to the kinds of appearance variations caused by clothing. Some have attempted to explicitly model the effects of clothing and its interaction with the body to account for this, but these models are complex and computationally costly [3, 54]. This remains a challenging research direction.
2.3 Image Formation Given the pose and geometry of the body, the formation of an image of the person depends on several other factors. These include properties of the camera (e.g., the lens, aperture and shutter speed), the rest of the scene geometry (and perhaps other people), surface reflectance properties of clothing and background objects, the illumination of the scene, etc. In practice much of this information is unavailable or tedious to acquire. The exception to this is the geometric calibration of the camera. Standard methods exist (e.g., Forsyth and Ponce [16]) which can estimate camera parameters based on images of calibration targets.1 With fixed cameras this need only be done once. If the camera moves then certain camera parameters can be included in the state, and estimated during tracking. In either case, the camera parameters define a perspective projection, P(X), which maps a 3D point X ∈ R3 to a point on the 2D image plane.
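As a rough illustration of the projection P(X), the sketch below implements a basic pinhole camera. The intrinsic and extrinsic values are made-up placeholders standing in for parameters obtained by calibration.

```python
import numpy as np

# Illustrative intrinsics (focal lengths and principal point) and extrinsics;
# in practice these come from camera calibration.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                       # camera rotation (world -> camera)
t = np.array([0.0, 0.0, 3.0])       # camera translation

def project(X):
    """Perspective projection P(X): map a 3D world point to 2D pixel coordinates."""
    Xc = R @ X + t                  # express the world point in the camera frame
    uvw = K @ Xc                    # apply the intrinsics
    return uvw[:2] / uvw[2]         # perspective divide

print(project(np.array([0.1, -0.2, 1.0])))
```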
3 Image Measurements Given the skeleton, body geometry and image formation model, it remains to formulate the likelihood distribution p(z|s) in (1).2 Conceptually, the observations are the image pixels, and the likelihood function is derived from one's generative model that maps the human pose to the observed image. As suggested in Figure 2, this involves modeling the surface shape and reflectance properties, the sources of illumination in the scene, and a photo-realistic rendering process for each pixel. While this can be done for some complex objects such as the human hand [11], this is extremely difficult for clothed people and natural scenes in general. Many of the necessary parameters about scene structure, clothing, reflectance and lighting are unknown, difficult to measure and not of direct interest. Instead, approximations are used that explain the available data while being (to varying degrees) independent of many of these unknown parameters. Toward that end it is common to extract a collection of image measurements, such as edge locations, which are then treated as the observations. This section briefly introduces the most common measurements and likelihood functions that are often used in practice.
1 Standard calibration code and tools are available as part of OpenCV (The Open Computer Vision Library), available from http://sourceforge.net/projects/opencvlibrary/.
2 In this section we drop the time subscript for clarity.

3.1 2D Points One of the simplest ways to constrain 3D pose is with a set of image locations that are projections of known points on the body. These 3D points might be joint centers or
points on the surface of the body geometry. For instance, it is easy to show that one can recover 3D pose up to reflection ambiguities from the 2D image positions to which the joint centers project [66]. If one can identify such points (e.g., by manual initialization or by ensuring that subjects wear textured clothing that produces distinct features), then the observation z comprises a set of 2D image locations, {m_i}_{i=1}^M, where measurement m_i corresponds to location i on part j(i). If we assume that the 2D image observations are corrupted by additive noise then the likelihood function can be written as

p(\{m_i\}_{i=1}^{M} \mid s) = \prod_{i=1}^{M} p_i\!\left( \|m_i - P(K_{j(i)}(i \mid s))\| \right)    (3)

where P(X) is the 2D camera projection of the 3D point X, and K_j(· | s) is the 3D position in the global coordinate frame of a point on part j given the current state s. The function p_i(d) is the probability density function of the mean-zero additive noise on point i. This is often chosen to be Gaussian with standard deviation σ_i, i.e.,

p_i(d) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left( -\frac{d^2}{2\sigma_i^2} \right) .    (4)

However, if it is believed that some of the points may be unreliable, for instance if they are not tracked reliably from the image sequence, then it is necessary to use a likelihood density with heavy tails, such as a Student's t-distribution. The greater probability density in the tails reflects the belief that measurement outliers exist, and reduces the influence of such outliers on the likelihood function. One way to find the image locations to which the joint centers project is to detect and track a 2D articulated model [15, 51, 58]; unfortunately this problem is almost as challenging as the 3D pose estimation problem itself. Another approach is to find image patches that are projections of points on the body (possibly joint centers) and can be reliably tracked over time, e.g., by the KLT tracker [67] or the WSL tracker [28]. Such a likelihood is easy to implement and has been used effectively [69, 70]. Nevertheless, acquiring 2D point tracks frequently requires hand initialization and tuning of the tracking algorithm. Further, patch trackers often fail when parts are occluded or move quickly, requiring reinitialization or other modifications to maintain a reliable set of tracks.
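A possible implementation of this point-based likelihood, under the simplifying assumption of isotropic noise, is sketched below. The function and parameter names are our own, and the Student-t variant is written only up to an additive constant (which is all that is needed when likelihoods are compared).

```python
import numpy as np

def gaussian_point_loglik(m_obs, m_pred, sigma):
    """Log-likelihood of observed 2D points under isotropic Gaussian reprojection noise (Eqs. 3-4)."""
    d2 = np.sum((m_obs - m_pred) ** 2, axis=1)
    return np.sum(-0.5 * d2 / sigma**2 - np.log(2.0 * np.pi * sigma**2))

def student_t_point_loglik(m_obs, m_pred, sigma, nu=3.0):
    """Heavier-tailed alternative (up to a constant); outlying points are penalized less severely."""
    d2 = np.sum((m_obs - m_pred) ** 2, axis=1) / sigma**2
    return np.sum(-0.5 * (nu + 2.0) * np.log1p(d2 / nu))

m_obs = np.array([[100.0, 200.0], [150.0, 260.0], [400.0, 90.0]])   # last point is an outlier
m_pred = np.array([[102.0, 198.0], [149.0, 263.0], [160.0, 250.0]]) # projected model points P(K(i|s))
print(gaussian_point_loglik(m_obs, m_pred, sigma=3.0))
print(student_t_point_loglik(m_obs, m_pred, sigma=3.0))
```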
3.2 Background Subtraction If the camera is in a fixed location and the scene is relatively static, then it is reasonable to assume that a background image B(x, y) of the scene can be acquired (see Figure 3 (b)). This can then be subtracted from an observed image I(x, y) and thresholded to determine a mask that indicates which pixels correspond to the foreground person [23, 49]. That is, M(x, y) = 1 if |I(x, y) − B(x, y)| > ε and M(x, y) = 0
Fig. 3 Background subtraction. Original image is illustrated in (a); the corresponding background image of the scene, B(x, y), in (b); (c) shows the log probability of each pixel I(x, y) belonging to the background (with light color corresponding to high probability); (d) illustrates the foreground mask (silhouette image) obtained by thresholding the probabilities in (c); in (e) a cleaned up version of the foreground mask in (d) obtained by simple morphological operations.
otherwise (e.g., Figure 3 (d)). The mask can be used to formulate a likelihood by penalizing discrepancies between the observed mask M(x, y) and a mask M̂(x, y | s) predicted from the image projection of the body geometry. For instance, Deutscher and Reid [13] used

p(M \mid s) = \prod_{(x,y)} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{|M(x,y) - \hat{M}(x,y \mid s)|}{2\sigma^2} \right)    (5)

where σ controls how strongly disagreements are penalized. Such a likelihood is attractive for its simplicity, but there will be significant difficulty in setting the threshold ε to an appropriate value; there may be no universally satisfactory value. One can also consider a probabilistic version of background subtraction which avoids the need for a threshold (see Figure 3 (c)). Instead, it is assumed that background pixels are corrupted with mean-zero, additive Gaussian noise. This yields the likelihood function

p(I \mid s) = \prod_{(x,y)} p_B(I(x,y))^{\,1 - \hat{M}(x,y \mid s)}    (6)

where p_B(I(x, y)) is the probability that pixel I(x, y) is consistent with the background. For instance, a hand-specified Gaussian model can be used, or more complex models such as mixtures of Gaussians can be learned in advance or during tracking [63]. Such a likelihood will be more effective than one based on thresholding. Nevertheless, background models will have difficulty coping with body parts that appear similar to the background; in such regions, like the lower part of the torso in Figure 3, the model will be penalized incorrectly. Problems also arise when limbs occlude the torso or other parts of the body, since then one cannot resolve them from the silhouette. Finally, background models often fail when the illumination changes (unless an adaptive model is used), when cameras move, or when scenes contain moving objects in the background.
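The following toy sketch illustrates the probabilistic background-subtraction likelihood of (6) under an assumed per-pixel Gaussian background model; the image size and values are synthetic placeholders.

```python
import numpy as np

def bg_loglik(image, bg_mean, bg_std, fg_mask_pred):
    """Probabilistic background-subtraction log-likelihood (cf. Eq. 6):
    background pixels are scored under a per-pixel Gaussian background model;
    pixels the pose hypothesis predicts to be foreground are left unscored."""
    log_pB = -0.5 * ((image - bg_mean) / bg_std) ** 2 - np.log(np.sqrt(2 * np.pi) * bg_std)
    return np.sum((1.0 - fg_mask_pred) * log_pB)

# Toy example with a 4x4 grayscale image; real systems would use color images
# and a background model estimated from many frames.
rng = np.random.default_rng(0)
bg_mean = np.full((4, 4), 0.5)
bg_std = np.full((4, 4), 0.05)
image = bg_mean + 0.05 * rng.standard_normal((4, 4))
image[1:3, 1:3] = 0.9                      # a bright "person" region
mask_good = np.zeros((4, 4)); mask_good[1:3, 1:3] = 1.0
mask_bad = np.zeros((4, 4))                # a hypothesis that misses the person
print(bg_loglik(image, bg_mean, bg_std, mask_good) > bg_loglik(image, bg_mean, bg_std, mask_bad))
```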
Fig. 4 Background likelihood. The behavior of the background subtraction likelihood described by Equation (5) is illustrated. A true pose, consistent with the pose of the subject illustrated in Figure 3, is taken, and the probability of that pose as a function of varying a single degree of freedom in the state is illustrated in (a) and (c); in (a) the entire body is shifted up and down (along the Z-axis), in (c) along the optical axis of the camera. In (b) and (d) the poses corresponding to the strongest peak in the likelihoods of (a) and (c), respectively, are illustrated. While ideally one would prefer the likelihood to have a single global maximum at the true value (designated by the vertical line in (a) and (c)), in practice the likelihoods tend to be noisy and multi-modal, and may not have a peak in the desired location. In particular in (c), due to the insensitivity of monocular likelihoods to depth, noise in the obtained foreground mask and inaccuracies in the geometric model of the body lead to severe problems. Also note that, in both figures, the noise in the likelihood indicates that simple search methods are likely to get stuck in local optima.
3.3 Appearance Models In order to properly handle uncertainty, e.g., when some region of the foreground appears similar to the background, it is useful to explicitly model the foreground appearance. Accordingly, the likelihood becomes

p(I \mid s) = \prod_{(x,y)} p_B(I(x,y))^{\,1 - \hat{M}(x,y \mid s)} \; p_F(I(x,y) \mid s)^{\,\hat{M}(x,y \mid s)}    (7)

where p_F(I(x, y) | s) is the probability of pixel I(x, y) belonging to the foreground. Notice that if a uniform foreground model is assumed, i.e., p_F(·) ∝ 1, then (7) simply becomes the probabilistic background subtraction model of (6). An accurate foreground model p_F(I(x, y) | s) is often much harder to develop than a background model, because appearance varies depending on surface orientation with respect to the light sources and the camera, and due to complex non-rigid deformation of the body and clothing over time. It therefore requires offline learning based on a reasonable training ensemble of images [27, 50], or it can be updated online [76]. Simple foreground models are often learned from the image pixels to which the body projects in one or more frames. For example, one could learn the mean RGB color and its covariance for the body, or for each part of the body if the parts differ in appearance. One can also model the statistics of simple filter outputs (e.g., gradient filters). One important consideration about likelihoods is computational expense, as evaluating every pixel in the image can be burdensome. Fortunately, this can usually be avoided as a likelihood function typically needs only be specified up to a multiplicative constant. By dividing the likelihood by the background model for each
Fig. 5 Modeling appearance. The appearance likelihood described by Equation (8) is illustrated. The observed image is shown in (a); the log probability of a pixel belonging to the background in (b); the probability of the pixel belonging to a foreground model (modeled by a mixture of Gaussians) for a given body part in (c); the final log ratio of the foreground to background probability is illustrated in (d). Notice that unlike the background likelihood, the appearance likelihood is able to attribute parts of the image to individual segments of the body.
Fig. 6 Appearance likelihood. The behavior of the appearance likelihood described by Equation (8) is illustrated. As in Figure 4, a true pose, consistent with the pose of the subject illustrated in Figure 3, is taken, and the probability of that pose as a function of varying a single degree of freedom in the state is illustrated in (a) and (c); as before, in (a) the entire body is shifted up and down (along the Z-axis), in (c) along the optical axis of the camera. In (b) and (d) the poses corresponding to the strongest peak in the likelihoods of (a) and (c), respectively, are illustrated. Notice that due to the strong separation between foreground and background in this image sequence, the appearance likelihood performs similarly to the background likelihood model (illustrated in Figure 4); in sequences where foreground and background contain similar colors, appearance likelihoods tend to produce superior performance.
pixel, terms cancel out, leaving

p(I \mid s) \propto \prod_{(x,y)\,:\,\hat{M}(x,y \mid s)=1} \frac{p_F(I(x,y) \mid s)}{p_B(I(x,y))}    (8)

where the product is only over the foreground pixels, allowing a significant savings in computation. This technique can be more generally used to speed up other types of likelihood functions.
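A minimal sketch of this foreground/background ratio likelihood, assuming the per-pixel log-probabilities have already been computed by some learned models, might look as follows.

```python
import numpy as np

def fg_bg_ratio_loglik(log_pF, log_pB, fg_mask_pred):
    """Log of the likelihood ratio in Eq. (8): sum of log(pF/pB) over the pixels the
    hypothesized pose predicts to be foreground. Constant background terms cancel."""
    fg = fg_mask_pred.astype(bool)
    return np.sum(log_pF[fg] - log_pB[fg])

# Toy per-pixel log-probabilities; in practice these come from learned
# foreground (e.g., per-part color) and background models.
rng = np.random.default_rng(1)
log_pB = rng.normal(-1.0, 0.2, size=(4, 4))
log_pF = rng.normal(-1.5, 0.2, size=(4, 4))
log_pF[1:3, 1:3] = -0.3                      # the foreground model explains this region well
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1
print(fg_bg_ratio_loglik(log_pF, log_pB, mask))
```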
3.4 Edges and Gradient Based Features Unfortunately, foreground and background appearance models have several problems. In general they have difficulty handling large changes in appearance such as those caused by varying illumination and clothing. Additionally, near boundaries they can become inaccurate since most foreground models do not capture the shading variations that occur near edges, and the pixels near the boundary are a mixture of foreground and background colors due to limited camera resolution. For this reason, and to be relatively invariant to lighting and small errors in surface geometry, it has been common to use edge-based likelihoods [73]. These models assume that the projected edges of the person should correspond to some local structure in image intensity. Perhaps the simplest approach to the use of edge information is the Chamfer distance [5] or the Hausdorff distance [25]. Edges are first extracted from the observed image using standard edge detection methods [16] and a distance map is computed, where d(x) is the squared Euclidean distance from pixel x to the nearest edge pixel. The outline of the subject in the image is computed and the boundary is sampled at a set of points {b_i}_{i=1}^M. In the case of Chamfer matching the likelihood function is

p(d \mid s) = \exp\!\left( -\frac{1}{M} \sum_{i=1}^{M} d(b_i) \right) .    (9)

Chamfer matching is fast, as the distance map need only be computed once and is evaluated only at the sampled points. Additionally, it is robust to changes in illumination and other appearance changes of the subject. However, it can be difficult to obtain a clean set of edges, as texture and clutter in the scene can produce spurious edges. Gavrila and Davis [19] successfully used a variant of Chamfer matching for pose tracking. To minimize the impact of spurious edges they performed an outlier rejection step on the points b_i. Chamfer matching is also robust to inaccuracies in the geometry of the subject. If the edges of the subject can be predicted with a high degree of accuracy, then predictive models of edge structure can be used. Kollnig and Nagel [32] built hand-specified models which predicted large gradient magnitudes near the outer edges of the target. Later, this work was extended to predict gradient orientations and applied to human pose tracking by Wachter and Nagel [73]. Similarly, Nestares and Fleet [43] learned a probabilistic model of local edge structure which was used by Poon and Fleet [48] to track people. Such models can be effective; however, sufficiently accurate shape models can be difficult to build.
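The sketch below illustrates a Chamfer-style likelihood in the spirit of (9). It uses a brute-force nearest-edge search for clarity, whereas a practical system would precompute a distance transform once per frame; the edge and boundary points are synthetic.

```python
import numpy as np

def chamfer_loglik(edge_points, boundary_points):
    """Chamfer-style log-likelihood (cf. Eq. 9): negative mean squared distance from
    each sampled model-boundary point to its nearest detected image edge point."""
    diffs = boundary_points[:, None, :] - edge_points[None, :, :]
    d = np.min(np.sum(diffs ** 2, axis=2), axis=1)   # squared distance to nearest edge
    return -np.mean(d)

edges = np.array([[10, 10], [10, 11], [10, 12], [30, 40]], dtype=float)   # detected edge pixels
boundary = np.array([[10.5, 10.0], [10.5, 12.0]])                         # projected model outline samples
print(np.exp(chamfer_loglik(edges, boundary)))     # likelihood value p(d | s)
```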
Fig. 7 Number of views. Effect of combining measurements from a number of image views on the (background) likelihood. With a single view the likelihood exhibits a wide noisy mode relatively far from the true value of the translation considered (denoted by the vertical line); with more views contributing image measurements the ambiguity can be resolved, producing a stronger peak closer to the desired value.
3.5 Discussion There is no consensus as to which form of likelihood is best. However, some cues are clearly more powerful than others. For instance, if 2D points are practical in a given application then they should certainly be used as they are an extremely strong cue. Similarly, some form of background model is invaluable and should be used whenever it is available. Another effective technique is to use multiple measurements. To correctly combine measurements, the joint probability of the two observations p(z^{(1)}, z^{(2)} | s) needs to be specified. This is often done by assuming conditional independence of the observations:

p(z^{(1)}, z^{(2)} \mid s) = p(z^{(1)} \mid s)\, p(z^{(2)} \mid s) .    (10)
This assumption, often referred to as naïve Bayes, is unlikely to hold as errors in one observation source are often correlated with errors in others. However, it is reasonable when, for instance, the two observations are from different cameras or when one set of observations is explaining edges and the other is explaining pixels not at the boundary. The behavior of the background likelihood (previously illustrated in Figure 4) as a function of image measurements combined from multiple views is illustrated in Figure 7.
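Under the independence assumption of (10), combining cues or views reduces to summing log-likelihoods, as in this small sketch (the numbers are hypothetical).

```python
import numpy as np

def combined_loglik(per_view_logliks):
    """Combine measurements under the conditional-independence assumption of Eq. (10):
    the joint log-likelihood is the sum of the per-view (or per-cue) log-likelihoods."""
    return float(np.sum(per_view_logliks))

# Hypothetical log-likelihoods of one pose hypothesis under three camera views.
print(combined_loglik([-12.4, -9.8, -15.1]))
```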
4 Motion Models Prior information about human pose and motion is essential for resolving ambiguity, for combining noisy measurements, and for coping with missing observations. A prior model biases pose estimation toward plausible poses, when pose might otherwise be under-constrained. In principle one would like to have priors that are weak enough to admit all (or most) allowable motions of the human body,
but strong enough to constrain ambiguities and alleviate challenges imposed by the high-dimensional inference. The balance between these two competing goals is often elusive. This section discusses common forms of motion models and introduces some emerging research directions.
4.1 Joint Limits The kinematic structure of the human body permits a limited range of motion in each joint. For example, knees cannot hyperextend and the torso cannot tilt or twist arbitrarily. A central role of prior models is to ensure that recovered poses satisfy such biomechanical limits. While joint limits can be encoded by thresholds imposed on each rotational DOF, the true nature of joint limits in the human body is more complex. In particular, the joint limits are dynamic and dependent on other joints [22]. Unfortunately, joint limits by themselves do not encode enough prior knowledge to facilitate tractable and robust inference.
4.2 Smoothness and Linear Dynamical Models Perhaps the simplest commonly used prior model is a low-order Markov model, based on an assumption that human motion is smooth [73, 56, 48]. A typical first-order model specifies that the pose at one time is equal to the previous pose up to additive noise:

s_{t+1} = s_t + \eta    (11)

where the process noise η is usually taken to be Gaussian, η ∼ N(0, Σ). The resulting prior is then easily shown to be

p(s_{t+1} \mid s_t) = G(s_{t+1};\, s_t, \Sigma)    (12)

where G(x; m, C) is the Gaussian density function with mean m and covariance C, evaluated at x. Second-order models express s_{t+1} in terms of s_t and s_{t−1}, allowing one to use velocity in the motion model. For example, a common, damped second-order model is

s_{t+1} = s_t + \kappa (s_t - s_{t-1}) + \eta    (13)

where κ is a damping constant which is typically between zero and one. Equations (11) and (13) are instances of linear models, the general form of which is s_{t+1} = \sum_{n=1}^{N} A_n s_{t-n+1} + \eta, i.e., an N-th order linear dynamical model. In many cases, as in (11) and (13), it is common to set the parameters of the transition model by hand, e.g., setting A_n, assuming a fixed diagonal covariance matrix Σ, or letting the diagonal elements of the covariance matrix in (12) be proportional to ||s_t − s_{t−1}||^2 [13]. One can also learn dynamical models from motion capture data [45]. This
would allow one, for example, to capture the coupling between different joints. Nevertheless, learning good parameters is challenging due to the high dimensionality of the state space, for which the transition matrices, A_n ∈ R^{D×D} (where D is the dimension of the state), can easily suffer from over-fitting. Smoothness priors are relatively weak, and as such allow a diversity of motions. While useful, this is detrimental when the model is too weak to adequately constrain tracking in monocular videos. In constrained settings, where observations from 3 or more cameras are available and occlusions are few, such models have been shown to achieve satisfactory performance [13]. It is also clear that human motion is not always smooth, thereby violating smoothness assumptions. Motion at ground contact, for example, is usually discontinuous. One way to accommodate this is to assume a heavy-tailed model of process noise that allows occasional, large deviations from the smooth model. One might also consider the use of switching linear dynamical models, which produce piece-wise linear motions [46].
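For illustration, a damped second-order model like (13) can be simulated as follows; the damping constant, noise level and 5-dimensional toy state are assumptions made only for the example.

```python
import numpy as np

def sample_next_pose(s_t, s_tm1, kappa=0.9, noise_std=0.01, rng=None):
    """Sample s_{t+1} from the damped second-order model of Eq. (13):
    a constant-velocity prediction, damped by kappa, plus Gaussian process noise."""
    rng = rng or np.random.default_rng()
    eta = noise_std * rng.standard_normal(s_t.shape)
    return s_t + kappa * (s_t - s_tm1) + eta

# Simulate a short trajectory for a toy 5-dimensional pose vector.
rng = np.random.default_rng(2)
s_prev, s_curr = np.zeros(5), 0.02 * np.ones(5)
for _ in range(3):
    s_prev, s_curr = s_curr, sample_next_pose(s_curr, s_prev, rng=rng)
    print(s_curr)
```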
4.3 Activity Specific Models Assuming that one knows or can infer the type of motion being tracked, or the identity of the person performing the motion, one can apply stronger prior models that are specific to the activity or subject [35]. The most common approach is to learn models off-line (prior to tracking) from motion capture data. Typically one is looking for some low-dimensional parameterization of the pose and motions. To introduce the idea, consider a dataset Ψ = {ψ^{(i)}} consisting of K kinematic poses ψ^{(i)}, i ∈ (1, ..., K), obtained, for example, using a motion capture system. Since humans often exhibit characteristic patterns of motion, these poses will often lie on or near a low-dimensional manifold in the original high-dimensional pose space. Using such data for training, methods like Principal Component Analysis (PCA) can be used to approximate poses by the linear combination of a mean pose μ_Ψ = (1/K) \sum_{i=1}^{K} ψ^{(i)} and a set of learned principal directions of variation. These principal directions are computed using the singular value decomposition (SVD) of a matrix S whose i-th row is ψ^{(i)} − μ_Ψ. Using the SVD, S is decomposed into orthonormal matrices U and V and a diagonal matrix Λ of ordered singular values, such that S = U Λ V^T; the columns of V, denoted u_1, u_2, ..., u_m, are the principal directions (a.k.a. eigen-poses). Given this learned model, a pose can be approximated by

ψ ≈ μ_Ψ + \sum_{i=1}^{q} u_i c_i    (14)

where the c_i are scalar coefficients and q ≪ m controls the amount of variance accounted for by the model. As such, inference over the pose can be replaced by inference over the coefficients s = [c_1, c_2, ..., c_q]. Since q is typically
Fig. 8 Illustration of the latent space motion prior model. Results of learning a Gaussian Process Dynamical Model that encodes both the non-linear low-dimensional latent pose space and the dynamics in that space. On the left a few walking motions are shown embedded in the 3D latent space. Each point on a trajectory is an individual pose. For six of the points the corresponding mean pose in the full pose space is shown. On the right the distribution over plausible poses in the latent space is shown. This figure is re-printed from [74].
small (e.g., 2–5) relative to the dimensionality of the pose space, this transformation facilitates faster pose estimation. However, the new low-dimensional state space representation also requires a new model of dynamics that operates on the coefficients. Models of dynamics in a linear latent space, such as the one obtained using the eigen-decomposition, are typically more complex than those in the original pose space and are often nonlinear. One alternative for simplifying the motion models is to learn the eigen-decomposition for entire trajectories of motion rather than for individual poses [56, 71]. Regardless, linear models such as the one described here are typically insufficient to capture the intricacies of real human poses or motion. More recent methods have shown that non-linear embeddings are more effective [60]. Gaussian Process Latent Variable Models (GPLVMs) have become a popular choice since they have been shown to generalize from small amounts of training data [69]. Furthermore, one can learn a low-dimensional embedding that not only models the manifold for a given class of motions, but also captures the dynamics in that learned manifold [36, 70]. This allows the inference to proceed entirely in the low-dimensional space, alleviating the complexities imposed by the high-dimensional pose space altogether. An example of a 3D latent space for walking motions is illustrated in Figure 8, and results of tracking with that model are shown in Figure 9. Alternatively, methods that use motion capture directly to implicitly specify stronger priors have also been proposed. These types of priors assume that the observed motion should be akin to the motion exhibited in a database of exemplar motions. Simply put, given a pose at time t, such approaches find an exemplar motion from the database that contains a closely resembling pose, and use that motion to look up the next pose in the sequence. Priors of this form can also be formulated probabilistically [57].
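The PCA construction of (14) can be sketched in a few lines; the synthetic 20-dimensional "poses" below merely stand in for motion capture data.

```python
import numpy as np

def learn_pca_pose_model(poses, q):
    """Learn the linear pose model of Eq. (14): a mean pose plus q principal directions."""
    mu = poses.mean(axis=0)
    S = poses - mu                        # K x m matrix of centered poses
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    U_q = Vt[:q].T                        # m x q matrix of eigen-poses (principal directions)
    return mu, U_q

def reconstruct(mu, U_q, c):
    """Approximate a pose from the latent coefficients c."""
    return mu + U_q @ c

# Toy training set: 50 poses in a 20-dimensional joint-angle space that vary
# along only two directions (plus noise), standing in for motion capture data.
rng = np.random.default_rng(3)
basis = rng.standard_normal((2, 20))
coeffs = rng.standard_normal((50, 2))
poses = 0.3 + coeffs @ basis + 0.01 * rng.standard_normal((50, 20))

mu, U_q = learn_pca_pose_model(poses, q=2)
c = U_q.T @ (poses[0] - mu)               # project a pose into the latent space
print(np.allclose(reconstruct(mu, U_q, c), poses[0], atol=0.1))
```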
Fig. 9 Tracking with the GPDM. 56 frames of a walking motion that ends with almost total occlusion (just the head is visible) in a cluttered and moving background. Note how the prior encourages realistic motion as occlusion becomes a problem. This figure is re-printed from [70].
All of these methods have proven effective for monocular pose inference in specific scenarios with relatively simple motions. However, due to their action-specific nature, success in learning models that generalize to, and represent, multiple motions and the transitions between those motions has been limited.
4.4 Physics-based Motion Models Recently, there has been preliminary success in using physics-based motion models as priors. Physics-based models have the potential to be as generic as simple smoothness priors but more informative. Further, they may be able to recover subtleties of more realistic motion which would be difficult, if not impossible, with existing motion models. The use of physics-based prior models in human tracking dates back to the early 1990s with pioneering work by Metaxas and Terzopoulos [37] and Wren and Pentland [75]. However, it is only recently that success with monocular imagery has been shown. The fundamental motivation for physics-based motion models is the possibility that motions are best described by the forces which generated them, rather than by a sequence of kinematic poses. These forces include not only the internal (e.g., muscle-generated) forces used to propel limbs, but also external forces such as gravity, ground reaction forces, friction and so on. Many of these forces can be derived from first principles and provide important constraints on motion. Modeling the remaining forces, either deterministically or stochastically, is then the central difficulty of physics-based motion models. This class of models remains a promising but relatively unexplored area for future research. The primary difficulty with physics-based models is the instability of complex dynamical systems. Sensitivity to initial conditions, discontinuities of motion and other non-linearities have made robust, realistic control of humanoid robots an elusive goal of robotics research. To address this in the context of tracking, Brubaker, Fleet, and Hertzmann [8] used a simplified physical model that is stable and easy to control. While this model was restricted to simple walking motions, the work was extended by Brubaker and Fleet [7] to a more complex physical model, capable
of a wider range of motions. An alternative strategy employed by Vondrak, Sigal, and Jenkins [72] used a motion capture database to guide the dynamics. Using inverse dynamics, they solved for the forces necessary to mimic motions found in the database.
5 Inference In a probabilistic framework our goal is to compute some approximation to the distribution p(s_{1:t} | z_{1:t}). Often this is formulated as online inference, where the distribution is computed one frame at a time as the observations arrive, exploiting the well-known recursive form of the posterior (assuming conditional independence of the observations):

p(s_{1:t} \mid z_{1:t}) \propto p(z_t \mid s_t)\, p(s_t \mid s_{1:t-1})\, p(s_{1:t-1} \mid z_{1:t-1}) .    (15)
The motion model, p(st |s1:t−1 ), is often a first order Markov model which simplifies to p(st |st−1 ). While this is not strictly necessary for the inference methods presented here, it is important because the motion model then depends only on the last state as opposed to the entire trajectory. The classic, and perhaps simplest, approach to this problem is the Kalman filter [73]. However, the Kalman Filter is not suitable for human pose tracking where the dynamics are non-linear and the likelihood functions are non-Gaussian. As a consequence, Sequential Monte Carlo techniques are amongst the most commonly used to perform this inference. Sequential Monte Carlo methods were first applied to visual tracking with the CONDENSATION algorithm of Isard and Blake [26] but were applied earlier for time series analysis by Gordon, Salmond, and Smith [20] and Kong, Liu, and Wong [33]. For a more detailed discussion of Sequential Monte Carlo methods, we refer the reader to the review article by Doucet, Godsill, and Andrieu [14]. In this section, we present a very simple algorithm, particle filtering, in which stochastic simulation of the motion model is combined with weighting by the likelihood to produce weighted samples which approximate the posterior. We also present two variants which attempt to work around the deficiencies of the basic particle filter.
5.1 Particle Filter A particle filter represents a distribution with a weighted set of sample states, denoted {(s_{1:t}^{(i)}, w_{1:t}^{(i)}) | i = 1, . . . , N}. When the samples are fairly weighted, then sample statistics approximate expectations under the target distribution, i.e.,

\sum_{i=1}^{N} \hat{w}_{1:t}^{(i)} f(s_{1:t}^{(i)}) \approx E[f(s_{1:t})]    (16)
where \hat{w}_{1:t}^{(i)} = \left( \sum_{j=1}^{N} w_{1:t}^{(j)} \right)^{-1} w_{1:t}^{(i)} is the normalized weight. The sample statistics approach those of the target distribution as the number of samples, N, increases. In a simple particle filter, given a fairly weighted sample set from time t, the samples at time t + 1 are obtained with importance sampling. First, samples are drawn from a proposal distribution q(s_{t+1} | s_{1:t}^{(i)}, z_{1:t+1}). Then the weights are updated,

w_{1:t+1}^{(i)} = w_{1:t}^{(i)} \, \frac{p(z_{t+1} \mid s_{t+1}^{(i)})\, p(s_{t+1}^{(i)} \mid s_{1:t}^{(i)})}{q(s_{t+1}^{(i)} \mid s_{1:t}^{(i)}, z_{1:t+1})} ,    (17)
to be fairly weighted samples at time t + 1. The proposal distribution must be non-zero everywhere the posterior is non-zero, but it is otherwise largely unconstrained. The simplest and most common proposal distribution is the motion model, p(s_{t+1} | s_{1:t}), which simplifies the weight update to w_{1:t+1}^{(i)} = w_{1:t}^{(i)} p(z_{t+1} | s_{t+1}^{(i)}). This simple procedure, while theoretically correct, is known to be degenerate. As t increases, the normalized weight of one particle approaches 1 while the others approach 0. Weights near zero require a significant amount of computation but contribute very little to the posterior approximation, effectively reducing the posterior approximation to a point estimate. To mitigate this problem, a resampling step is introduced where particles with small weights are discarded. Before the propagation stage a new set of samples {\tilde{s}_{1:t}^{(i)} | i = 1, . . . , N} is created by drawing an index j with probability p(j) = \hat{w}_{1:t}^{(j)}, and then setting \tilde{s}_{1:t}^{(i)} = s_{1:t}^{(j)}. The weights for this new set of particles are then \tilde{w}_{1:t}^{(i)} = 1/N for all i. This resampling procedure can be done at every frame, at a fixed frequency, or only when heuristically necessary. While it may seem good to do this at every frame, as done by Isard and Blake [26], it can cause problems. Specifically, resampling introduces bias in finite sample sets, as the samples are no longer independent, and it can even exacerbate particle depletion over time. Resampling only when necessary balances the need to avoid degeneracy without introducing undue bias. One of the most commonly used heuristics is an estimate of the effective sample size,

N_{eff} = \left( \sum_{i=1}^{N} \left( \hat{w}_{1:t}^{(i)} \right)^2 \right)^{-1}    (18)

which takes values from 1 to N. Intuitively, this can be thought of as the average number of independent samples that would survive a resampling step. Notice that after resampling, N_{eff} is equal to N. With this heuristic, resampling is performed when N_{eff} < N_{thresh}, and otherwise it is skipped. The particle filtering algorithm, with this heuristic resampling strategy, is outlined in Algorithm 1.
Initialize the particle set {(s_1^{(i)}, w_1^{(i)}) | i = 1, . . . , N}
for t = 1, 2, . . . do
  Compute the normalized weights \hat{w}_{1:t}^{(i)} and the effective number of samples N_{eff} as in (18)
  if N_{eff} < N_{thresh} then
    for i = 1, . . . , N do
      Randomly choose an index j ∈ (1, ..., N) with probability p(j) = \hat{w}_{1:t}^{(j)}
      Set \tilde{s}_{1:t}^{(i)} = s_{1:t}^{(j)} and \tilde{w}_{1:t}^{(i)} = 1
    end for
    Replace the old particle set {(s_{1:t}^{(i)}, w_{1:t}^{(i)}) | i = 1, . . . , N} with {(\tilde{s}_{1:t}^{(i)}, \tilde{w}_{1:t}^{(i)}) | i = 1, . . . , N}
  end if
  for i = 1, . . . , N do
    Sample s_{t+1}^{(i)} from the proposal distribution q(s_{t+1} | s_{1:t}^{(i)}, z_{1:t+1})
    Construct the new state trajectory s_{1:t+1}^{(i)} = (s_{1:t}^{(i)}, s_{t+1}^{(i)})
    Update the weights w_{1:t+1}^{(i)} according to equation (17)
  end for
end for
Algorithm 1: Particle Filtering.
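A compact sketch of Algorithm 1 is given below for a toy one-dimensional state; the proposal and likelihood functions here are placeholders standing in for the motion models and image likelihoods discussed earlier.

```python
import numpy as np

def particle_filter_step(particles, weights, propose, log_likelihood, n_thresh, rng):
    """One step of Algorithm 1: resample when the effective sample size is low,
    propagate each particle through the proposal, and reweight by the likelihood."""
    w_hat = weights / np.sum(weights)
    n_eff = 1.0 / np.sum(w_hat ** 2)                      # effective sample size, Eq. (18)
    if n_eff < n_thresh:                                   # resample with replacement
        idx = rng.choice(len(particles), size=len(particles), p=w_hat)
        particles, weights = particles[idx], np.ones(len(particles))
    new_particles = np.array([propose(s, rng) for s in particles])
    new_weights = weights * np.exp([log_likelihood(s) for s in new_particles])
    return new_particles, new_weights

# Toy 1D demo: the proposal is a smoothness (random-walk) motion model and the
# likelihood prefers states near an "observed" value of 1.0.
rng = np.random.default_rng(4)
propose = lambda s, rng: s + 0.1 * rng.standard_normal()
log_likelihood = lambda s: -0.5 * ((s - 1.0) / 0.2) ** 2
particles, weights = rng.standard_normal(200), np.ones(200)
for _ in range(10):
    particles, weights = particle_filter_step(particles, weights, propose,
                                              log_likelihood, n_thresh=100, rng=rng)
print(np.average(particles, weights=weights / weights.sum()))   # posterior mean estimate
```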
5.2 Annealed Particle Filter Unfortunately, resampling does not solve all the problems of the basic particle filter described above. Specifically, entire modes of the posterior can still be missed, particularly if they are far from the modes of the proposal distribution or if modes are extremely peaked. One solution to this problem is to increase the number of particles, N. While this will solve the problem in theory, the number of samples theoretically needed is generally computationally untenable. Further, many samples will end up representing uninteresting parts of the space. While these issues remain challenges, several techniques have been proposed in an attempt to improve the efficiency of particle filters. One approach, inspired by simulated annealing and continuation methods, is Annealed Particle Filtering (APF) [13]. The APF algorithm is outlined in Algorithm 2. At each time t, the APF goes through L levels of annealing. For each particle i, annealing level ℓ begins by choosing a sample from the previous annealing level, s_{1:t+1,ℓ-1}^{(j)}, with probability p(j) = \hat{w}_{1:t+1,ℓ-1}^{(j)}. The state at time t + 1 of sample j is then diffused to create a new hypothesis s_{t+1,ℓ}^{(i)} according to

T(s_{t+1,ℓ} \mid s_{1:t+1,ℓ-1}) = N(s_{t+1,ℓ};\, s_{t+1,ℓ-1}, \alpha \Sigma) ,    (19)

and the new hypothesis is weighted by
Initialize the weighted particle set {(s_{1,L}^{(i)}, w_{1,L}^{(i)}) | i = 1, . . . , N}
for t = 1, 2, . . . do
  for i = 1, . . . , N do
    Sample s_{t+1}^{(i)} from the proposal distribution q(s_{t+1} | s_{1:t,L}^{(i)}, z_{1:t+1})
    Construct the new state trajectory s_{1:t+1,0}^{(i)} = (s_{1:t,L}^{(i)}, s_{t+1}^{(i)})
    Assign the weights w_{1:t+1,0}^{(i)} = W_0(s_{1:t+1,0}^{(i)} | z_{1:t+1}) / q(s_{t+1,0}^{(i)} | s_{1:t,L}^{(i)}, z_{1:t+1})
  end for
  for ℓ = 1, . . . , L do
    for i = 1, . . . , N do
      Compute the normalized weights \hat{w}_{1:t+1,ℓ-1}^{(i)} = ( \sum_{j=1}^{N} w_{1:t+1,ℓ-1}^{(j)} )^{-1} w_{1:t+1,ℓ-1}^{(i)}
      Randomly choose an index j ∈ (1, ..., N) with probability p(j) = \hat{w}_{1:t+1,ℓ-1}^{(j)}
      Sample s_{t+1,ℓ}^{(i)} from the diffusion distribution T(s_{t+1,ℓ} | s_{1:t+1,ℓ-1}^{(j)})
      Construct the new state trajectory s_{1:t+1,ℓ}^{(i)} = (s_{1:t}^{(i)}, s_{t+1,ℓ}^{(i)})
      Compute the annealed weights w_{1:t+1,ℓ}^{(i)} = W_ℓ(s_{1:t+1,ℓ}^{(i)} | z_{1:t+1})
    end for
  end for
end for
Algorithm 2: Annealed Particle Filtering.

W_ℓ(s_{1:t+1,ℓ} \mid z_{1:t+1}) = \left[ p(z_{t+1} \mid s_{t+1,ℓ})\, p(s_{t+1,ℓ} \mid s_{t,ℓ}) \right]^{\beta_ℓ} .    (20)
In the above, α ∈ (0, 1) is called the annealing rate and is used to control the scale of the covariance, Σ, in the diffusion process. The β_ℓ is a temperature parameter, derived from the annealing rate α and the survival diagnostics of the particle set (for details see Deutscher and Reid [13]) to ensure that a fixed fraction of samples survive from one stage of annealing to the next. The sequence β_0, . . . , β_L is a gradually increasing sequence between zero and 1, ending with β_L = 1. When β_ℓ is small, the difference in height between peaks and troughs of the posterior is attenuated. As a result it is less likely that one mode, by chance, will dominate and attract all the particles, thereby neglecting other, potentially important modes. In this way the APF allows particles to broadly explore the posterior in the early stages. This means that the particles are better able to find different potential peaks, which then attract the particles more strongly as the likelihood becomes more strongly peaked (as β_ℓ increases). It is worth noting that, with L = 1, the APF reduces to the standard particle filter discussed in the previous section. While often effective in finding significant modes of the posterior, the APF does not produce fairly weighted samples from the posterior. As such it does not
Fig. 10 Annealed Particle Filter (APF) tracking from multi-view observations. This figure is re-printed from [13].
Initialize {(s_1^{(i)}, w_1^{(i)}) | i = 1, . . . , N}
for t = 1, 2, . . . do
  for i = 1, . . . , N do
    Sample \tilde{s}_{t+1}^{(i)} from the proposal distribution q(\tilde{s}_{t+1} | s_{1:t}^{(i)}, z_{1:t+1})
    Construct the new state trajectory \tilde{s}_{1:t+1}^{(i)} = (s_{1:t}^{(i)}, \tilde{s}_{t+1}^{(i)})
    Update the weights w_{1:t+1}^{(i)} according to equation (17) with s_{t+1}^{(i)} = \tilde{s}_{t+1}^{(i)}
  end for
  Compute the normalized weights \hat{w}_{1:t+1}^{(i)} = ( \sum_{j=1}^{N} w_{1:t+1}^{(j)} )^{-1} w_{1:t+1}^{(i)}
  for i = 1, . . . , N do
    Randomly choose an index j ∈ (1, ..., N) with probability p(j) = \hat{w}_{1:t+1}^{(j)}
    Set the target distribution to be P(q) ∝ p(z_{t+1} | q) p(q | \tilde{s}_{1:t}^{(j)})
    Set the initial state of the Markov Chain to q_0 = \tilde{s}_{t+1}^{(j)}
    for r = 1, . . . , R do
      Sample q_r from the MCMC transition density T(q_r | q_{r-1}), e.g., using Hybrid Monte Carlo as described in Algorithm 4
    end for
    Set s_{1:t+1}^{(i)} = (\tilde{s}_{1:t}^{(i)}, q_R) and w_{1:t+1}^{(i)} = 1
  end for
end for
Algorithm 3: Markov Chain Monte Carlo Filtering.
accurately represent the posterior and the sample statistics of (16) are not representative of expectations under the posterior. Recent research has shown that by restricting the form of the diffusion and properly weighting the samples, one can obtain fairly weighted samples [18, 42].
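The effect of the annealing exponent β_ℓ in (20) can be illustrated with a small sketch; the particle log-posteriors used here are hypothetical.

```python
import numpy as np

def annealed_weights(log_post, beta):
    """Annealed weighting in the spirit of Eq. (20): raise the (unnormalized)
    posterior to the power beta, which flattens it for small beta so that
    particles can explore broadly before the full likelihood is applied."""
    w = np.exp(beta * (log_post - np.max(log_post)))   # subtract the max for numerical stability
    return w / np.sum(w)

log_post = np.array([-0.5, -4.0, -9.0])   # hypothetical particle log-posteriors
for beta in (0.1, 0.5, 1.0):              # increasing beta_l across annealing levels
    print(beta, annealed_weights(log_post, beta))      # weights become progressively more peaked
```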
5.3 Markov Chain Monte Carlo Filtering Another way to improve the efficiency of particle filters is with the help of Markov Chain Monte Carlo (MCMC) methods to explore the posterior. A Markov chain^3 is a sequence of random variables q_0, q_1, q_2, . . . with the property that for all i, p(q_i | q_0, . . . , q_{i−1}) = p(q_i | q_{i−1}). In MCMC, the goal is to construct a Markov chain such that, as i increases, p(q_i) approaches the desired target distribution P(q). In the context of particle filtering at time t, the random variables, q, are hypothetical states s_t and the target distribution, P(q), is the posterior p(s_{1:t} | z_{1:t}). The key to MCMC is the definition of a suitable transition density p(q_i | q_{i−1}) = T(q_i | q_{i−1}). To this end there are several properties that must be satisfied, one of which is

P(q) = \int T(q \mid \hat{q})\, P(\hat{q})\, d\hat{q} .    (21)
This means that the chain has the target distribution P(q) as its stationary distribution. For a good review of the various types of Markov transition densities used, and a more thorough introduction to MCMC in general, see [41]. A general MCMC-filtering algorithm is given in Algorithm 3. It begins by propagating samples through time and updating their weights according to a conventional particle filter. These particles are then chosen, with probability proportional to their weights, as the initial states in N independent Markov chains. The target distribution for each chain is the posterior p(s_t | z_{1:t}). The final states of each chain are then taken to be fair samples from the posterior. Choo and Fleet [9] used an MCMC method known as Hybrid Monte Carlo. The Hybrid Monte Carlo algorithm (Algorithm 4) is an MCMC technique, based on ideas developed for molecular dynamics, which uses the gradient of the posterior to efficiently find high probability states. A single step begins by sampling a vector p_0 ∈ R^m from a Gaussian distribution with mean zero and unit variance. Here m is the dimension of the state vector q_0. The randomly drawn vector, known as the momentum, is then used to perform a simulation of the system of differential equations ∂p/∂τ = −∂E(q)/∂q and ∂q/∂τ = p, where τ is an artificial time variable and E(q) = −log P(q). The simulation begins at (q_0, p_0) and proceeds using a leapfrog scheme which is given explicitly in Algorithm 4. There, L is the number of steps to simulate and Δ is a diagonal matrix whose entries specify the size of step to take in each dimension of the state vector q. At the end of the simulation, the ending state q_L is accepted with probability a = min(1, e^{−c}) where

c = \left( E(q_L) + \tfrac{1}{2}\|p_L\|^2 \right) - \left( E(q_0) + \tfrac{1}{2}\|p_0\|^2 \right) .    (22)
The specific form of the simulation procedure and the acceptance test at the end are designed such that P(q) is the stationary distribution of the transition distribution.
3 A full review of MCMC methods is well beyond the scope of this chapter and only a brief introduction will be presented here.
Given a starting state q_0 ∈ R^m and a target distribution P(q), define E(q) = −log P(q).
Draw a momentum vector p_0 ∈ R^m from a Gaussian distribution with mean 0 and unit variance.
for ℓ = 1, . . . , L do
  p_{ℓ-0.5} = p_{ℓ-1} − (1/2) Δ ∂E(q_{ℓ-1})/∂q
  q_ℓ = q_{ℓ-1} + Δ p_{ℓ-0.5}
  p_ℓ = p_{ℓ-0.5} − (1/2) Δ ∂E(q_ℓ)/∂q
end for
Compute the acceptance probability a = min(1, e^{−c}), where c is computed according to (22)
Set u to be a uniformly sampled random number between zero and one
if u < a then
  return q_L
else
  return q_0
end if
Algorithm 4: Hybrid Monte Carlo Sampling.
The parameters of the algorithm, L and the diagonal entries of Δ, are set by hand. As a rule of thumb, Δ and L should be set so that roughly 75% of transitions are accepted. An important caveat is that the values of L and Δ cannot be set based on q and must remain constant throughout a simulation. For more information on Hybrid Monte Carlo see [41].
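A minimal sketch of a single Hybrid Monte Carlo transition, along the lines of Algorithm 4 but with a scalar step size and a toy Gaussian target, is given below; the tuning values are illustrative.

```python
import numpy as np

def hmc_step(q0, energy, grad_energy, step_size, n_leapfrog, rng):
    """One Hybrid (Hamiltonian) Monte Carlo transition (cf. Algorithm 4):
    simulate Hamiltonian dynamics with leapfrog steps, then accept or reject."""
    p0 = rng.standard_normal(q0.shape)          # momentum ~ N(0, I)
    q, p = q0.copy(), p0.copy()
    p -= 0.5 * step_size * grad_energy(q)       # half step for the momentum
    for _ in range(n_leapfrog):
        q += step_size * p                       # full step for the position
        p -= step_size * grad_energy(q)          # full step for the momentum
    p += 0.5 * step_size * grad_energy(q)        # convert the last full step into a half step
    c = (energy(q) + 0.5 * p @ p) - (energy(q0) + 0.5 * p0 @ p0)
    return q if rng.random() < np.exp(-c) else q0

# Toy target: a 2D standard Gaussian, so E(q) = 0.5 * ||q||^2 and grad E(q) = q.
rng = np.random.default_rng(5)
energy = lambda q: 0.5 * float(q @ q)
grad_energy = lambda q: q
q = np.array([3.0, -2.0])
samples = []
for _ in range(500):
    q = hmc_step(q, energy, grad_energy, step_size=0.2, n_leapfrog=10, rng=rng)
    samples.append(q)
print(np.mean(samples, axis=0))   # should be near the target mean (0, 0)
```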
6 Initialization and Failure Recovery The final issue we must address concerns initialization and the recovery from tracking failures. Because of the large number of unknown state variables one cannot assume that the filter can effectively search the entire state space without a good prior model (or initial guess). Fortunately, over the last few years, progress on the development of discriminative methods for detecting people and pose inference has been encouraging.
6.1 Introduction to Discriminative Methods for Pose Estimation Discriminative approaches aim to recover pose directly from a set of measurements, usually through some form of regression applied to a set of measurements from a single frame. Discriminative techniques are typically learned from a set of
Fig. 11 Illustration of the simple discriminative model. The model introduced by [1] is illustrated. From left to right the figure shows (1) silhouette image, (2) contour of the silhouette image, (3) shape context feature descriptor for a point on a contour, (4) a set of shape context descriptors in the high (60-dimensional) space, (5) a 100-dimensional vector quantized histogram of shape descriptors that is used to obtain (6) the pose of the person through linear regression. This figure is re-printed from [1].
training exemplars, D = {(s^{(i)}, z^{(i)}) ∼ p(s, z) | i = 1, ..., N}, which are assumed to be fair samples from the joint distribution over states and measurements. The goal is to learn to predict an output for a given input. The inputs, z ∈ R^M, are generic image measurements,^4 and the outputs, s ∈ R^N, as above, represent the 3D poses of the body. The simplest discriminative method is Nearest-Neighbor (NN) lookup [24, 39], where, given a set of features observed in an image, the exemplar from the training database with the closest features is found, i.e., k* = arg min_k d(z̃, z^{(k)}). The pose s^{(k*)} for that exemplar is returned. The main challenge is to define a useful similarity measure d(·, ·) and a fast indexing scheme. One such approach was proposed by Shakhnarovich, Viola, and Darrell [55]. Unfortunately, this simple approach has three drawbacks: (1) large training sets are required, (2) all the training data must be stored and used for inference, and (3) it produces unimodal predictions, and hence ambiguities (or multi-modality) in image-to-pose mappings cannot be accounted for (e.g., see Sminchisescu, Kanaujia, Li, and Metaxas [61]). To address (1) and (2) a variety of global [1] and local [53] parametric models have been proposed. These models learn a functional mapping from image features to 3D pose. While these methods have been demonstrated successfully on restricted domains, and with moderately large training sets, they do not provide one-to-many mappings, and therefore do not cope with multi-modality. Multi-modal mappings have been formulated in a probabilistic setting, where one explicitly models multi-modal conditional distributions, p(s | z, Θ), where Θ are the parameters of the mapping, learned by maximizing the likelihood of the training data D. One example is the conditional Mixture of Experts (cMoE) model introduced by Sminchisescu, Kanaujia, Li, and Metaxas [61], which takes the form

p(s \mid z, \Theta) = \sum_{k=0}^{K} g_k(z \mid \Theta)\, e_k(s \mid z, \Theta) ,    (23)
4 For instance, histograms-of-oriented-gradients, vector quantized shape contexts, HMAX, spatial pyramids, vocabulary trees and so on. See Kanaujia, Sminchisescu, and Metaxas [30] for details.
Fig. 12 Discriminative Output. Pose estimation results obtained using the discriminative method introduced by Kanaujia, Sminchisescu, and Metaxas. This figure is re-printed from [30].
where K is the number of experts, gk are positive gating functions which depend on the input features, and ek are experts that predict the pose (e.g., kernel regressors). This model under various incarnations has been shown to work effectively with large datasets [6] and with partially labeled data5 [30]. The MoE model, however, still requires moderate to large amounts of training data to learn parameters of the gates and experts. Recently, methods that utilize an intermediate low dimensional embedding have been shown to be particularly effective in dealing with little training data in this setting [40, 31]. Alternatively, nonparametric approaches for handling large amounts of training data efficiently that can deal with multi-modal probabilistic predictions have also been recently proposed by Urtasun and Darrell [68]. Similar in spirit to the simple NN method above, their model uses the local neighborhood of the query to approximate a mixture of Gaussian Process (GP) regressors.
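The nearest-neighbor lookup described above can be sketched as follows; the feature and pose dimensions and the training data are invented for the example.

```python
import numpy as np

def nn_pose_lookup(z_query, Z_train, S_train):
    """Nearest-neighbor pose lookup: return the training pose whose image
    features are closest to the query features (Euclidean distance)."""
    k_star = np.argmin(np.sum((Z_train - z_query) ** 2, axis=1))
    return S_train[k_star]

# Hypothetical training data: 1000 exemplars with 60-D feature vectors
# (e.g., shape-context histograms) paired with 30-D pose vectors.
rng = np.random.default_rng(6)
Z_train = rng.standard_normal((1000, 60))
S_train = rng.standard_normal((1000, 30))
z_query = Z_train[42] + 0.01 * rng.standard_normal(60)
print(np.allclose(nn_pose_lookup(z_query, Z_train, S_train), S_train[42], atol=1.0))
```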
5 Since joint samples span a very high dimensional space, R^{N+M}, obtaining a dense sampling of the joint space for the purposes of training is impractical. Hence, incorporating samples from marginal distributions p(s) and p(z) is of great practical benefit.

6.2 Discriminative Methods as Proposals for Inference While discriminative methods are promising alternatives to generative inference, it is not clear that they will be capable of solving the pose estimation problem in a general sense. The inability to generalize to novel motions, deal with significant occlusions and a variety of other realistic phenomena seems to suggest that some generative component is required. Fortunately, discriminative models can be incorporated within the generative setting in an elegant way. For example, the multimodal conditional distributions that are the basis of most recent discriminative methods [6, 40, 61, 68] can serve directly as proposal distributions (i.e., q(s_{t+1} | s_{1:t}, z_{1:t+1})) to improve the sampling efficiency of the Sequential Monte Carlo methods discussed above. Some preliminary work on
combining discriminative and generative methods in this and other ways has shown promise. It has been shown that discriminative components provide for effective initialization and the recovery from transient failures, and that generative components provide effective means to generalize and better fit image observations [59, 61, 62].
7 Conclusions This chapter introduced the basic elements of modern approaches to pose tracking. Using the probabilistic formulation introduced in this chapter one should be able to build a state-of-the-art framework for tracking relatively simple motions of single isolated subjects in a compliant (possibly instrumented) environment. The more general problem of tracking arbitrary motion in monocular image sequences of unconstrained environments remains a challenging and active area of research. While many advances have been made, and the progress is promising, no system to date can robustly deal with all the complexities of recovering human pose and motion in an entirely general setting. While the need to track human motion from images is motivated by a variety of applications, currently there have been relatively few systems that utilize the image-based recovery of articulated body pose for higher-level tasks or consumer applications. This can, to a large extent, be attributed to the complexity of obtaining an articulated pose in the first place. Nevertheless, a few very promising applications in biomechanics [10] and human computer interfaces [12, 52, 65] have been developed. The articulated pose has also proved useful as a front end for action recognition applications [44]. We believe that as the technologies for image-based recovery of articulated pose grow over the next years, so will the applications that utilize that technology.
References

[1] Agarwal A, Triggs B (2006) Recovering 3d human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(1):44–58
[2] Anguelov D, Srinivasan P, Koller D, Thrun S, Rodgers J, Davis J (2005) SCAPE: Shape Completion and Animation of People. ACM Transactions on Graphics 24(3):408–416
[3] Balan A, Black MJ (2008) The naked truth: Estimating body shape under clothing. In: IEEE European Conference on Computer Vision
[4] Balan AO, Sigal L, Black MJ, Davis JE, Haussecker HW (2007) Detailed human shape and pose from images. In: IEEE Conference on Computer Vision and Pattern Recognition
[5] Barrow HG, Tenenbaum JM, Bolles RC, Wolf HC (1977) Parametric correspondence and chamfer matching: Two new techniques for image matching. In: International Joint Conference on Artificial Intelligence, pp 659–663
[6] Bo L, Sminchisescu C, Kanaujia A, Metaxas D (2008) Fast algorithms for large scale conditional 3d prediction. In: IEEE Conference on Computer Vision and Pattern Recognition
[7] Brubaker M, Fleet DJ (2008) The kneed walker for human pose tracking. In: IEEE Conference on Computer Vision and Pattern Recognition
[8] Brubaker M, Fleet DJ, Hertzmann A (2007) Physics-based person tracking using simplified lower-body dynamics. In: IEEE Conference on Computer Vision and Pattern Recognition
[9] Choo K, Fleet DJ (2001) People tracking using hybrid Monte Carlo filtering. In: IEEE International Conference on Computer Vision, vol II, pp 321–328
[10] Corazza S, Muendermann L, Chaudhari A, Demattio T, Cobelli C, Andriacchi T (2006) A markerless motion capture system to study musculoskeletal biomechanics: visual hull and simulated annealing approach. Annals of Biomedical Engineering 34(6):1019–1029
[11] de la Gorce M, Paragos N, Fleet DJ (2008) Model-based hand tracking with texture, shading and self-occlusions. In: IEEE Conference on Computer Vision and Pattern Recognition
[12] Demirdjian D, Ko T, Darrell T (2005) Untethered gesture acquisition and recognition for virtual world manipulation. Virtual Reality 8(4):222–230
[13] Deutscher J, Reid I (2005) Articulated body motion capture by stochastic search. International Journal of Computer Vision 61(2):185–205
[14] Doucet A, Godsill S, Andrieu C (2000) On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10(3):197–208
[15] Felzenszwalb P, Huttenlocher DP (2005) Pictorial structures for object recognition. International Journal of Computer Vision 61(1):55–79
[16] Forsyth DA, Ponce J (2003) Computer Vision: A Modern Approach. Prentice Hall
[17] Forsyth DA, Arikan O, Ikemoto L, O'Brien J, Ramanan D (2006) Computational studies of human motion: Part 1, tracking and motion synthesis. Foundations and Trends in Computer Graphics and Vision 1(2&3):1–255
[18] Gall J, Potthoff J, Schnorr C, Rosenhahn B, Seidel HP (2007) Interacting and annealing particle filters: Mathematics and a recipe for applications. Journal of Mathematical Imaging and Vision 28:1–18
[19] Gavrila DM, Davis LS (1996) 3-D model-based tracking of humans in action: a multi-view approach. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 73–80
[20] Gordon N, Salmond DJ, Smith AFM (1993) Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings Part F Radar and Signal Processing 140:107–113
[21] Grassia FS (1998) Practical parameterization of rotations using the exponential map. Journal of Graphics Tools 3(3):29–48
[22] Herda L, Urtasun R, Fua P (2005) Hierarchical implicit surface joint limits for human body tracking. Computer Vision and Image Understanding 99(2):189–209
[23] Horprasert T, Harwood D, Davis L (1999) A statistical approach for real-time robust background subtraction and shadow detection. In: FRAME-RATE: Frame-rate applications, methods and experiences with regularly available technology and equipment
[24] Howe N (2007) Silhouette lookup for monocular 3d pose tracking. Image and Vision Computing 25:331–341
[25] Huttenlocher DP, Klanderman GA, Rucklidge WJ (1993) Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(9):850–863
[26] Isard M, Blake A (1998) CONDENSATION - conditional density propagation for visual tracking. International Journal of Computer Vision 29(1):5–28
[27] Isard M, MacCormick J (2001) BraMBLe: a Bayesian multiple-blob tracker. In: IEEE International Conference on Computer Vision, vol 2, pp 34–41
[28] Jepson AD, Fleet DJ, El-Maraghi TF (2003) Robust Online Appearance Models for Visual Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(25):1296–1311
[29] Kakadiaris L, Metaxas D (2000) Model-based estimation of 3D human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12):1453–1459
[30] Kanaujia A, Sminchisescu C, Metaxas D (2007) Semi-supervised hierarchical models for 3d human pose reconstruction. In: IEEE Conference on Computer Vision and Pattern Recognition
[31] Kanaujia A, Sminchisescu C, Metaxas D (2007) Spectral latent variable models for perceptual inference. In: IEEE International Conference on Computer Vision
[32] Kollnig H, Nagel HH (1997) 3d pose estimation by directly matching polyhedral models to gray value gradients. International Journal of Computer Vision 23(3):283–302
[33] Kong A, Liu JS, Wong WH (1994) Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association 89(425):278–288
[34] Kuipers JB (2002) Quaternions and Rotation Sequences: A Primer with Applications to Orbits, Aerospace, and Virtual Reality. Princeton University Press
[35] Lee CS, Elgammal A (2007) Modeling view and posture manifolds for tracking. In: IEEE International Conference on Computer Vision
[36] Li R, Tian TP, Sclaroff S (2007) Simultaneous learning of non-linear manifold and dynamical models for high-dimensional time series. In: IEEE International Conference on Computer Vision
[37] Metaxas D, Terzopoulos D (1993) Shape and nonrigid motion estimation through physics-based synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(6):580–591
[38] Moeslund TB, Hilton A, Krüger V (2006) A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104(2-3):90–126
[39] Mori G, Malik J (2002) Estimating human body configurations using shape context matching. In: IEEE European Conference on Computer Vision, pp 666–680
[40] Navaratnam R, Fitzgibbon A, Cipolla R (2007) The joint manifold model for semi-supervised multi-valued regression. In: IEEE International Conference on Computer Vision
[41] Neal RM (1993) Probabilistic inference using Markov chain Monte Carlo methods. Tech. Rep. CRG-TR-93-1, Department of Computer Science, University of Toronto
[42] Neal RM (2001) Annealed importance sampling. Statistics and Computing 11:125–139
[43] Nestares O, Fleet DJ (2001) Probabilistic tracking of motion boundaries with spatiotemporal predictions. In: IEEE Conference on Computer Vision and Pattern Recognition, vol II, pp 358–365
[44] Ning H, Xu W, Gong Y, Huang TS (2008) Latent pose estimator for continuous action recognition. In: IEEE European Conference on Computer Vision
[45] North B, Blake A (1997) Using expectation-maximisation to learn dynamical models from visual data. In: British Machine Vision Conference
[46] Pavlovic V, Rehg J, Cham TJ, Murphy K (1999) A dynamic Bayesian network approach to figure tracking using learned dynamic models. In: IEEE International Conference on Computer Vision, pp 94–101
[47] Plankers R, Fua P (2001) Articulated soft objects for video-based body modeling. In: IEEE International Conference on Computer Vision, vol 1, pp 394–401
[48] Poon E, Fleet DJ (2002) Hybrid Monte Carlo filtering: edge-based people tracking. In: Workshop on Motion and Video Computing, pp 151–158
[49] Prati A, Mikic I, Trivedi MM, Cucchiara R (2003) Detecting moving shadows: Algorithms and evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(7):918–923
[50] Ramanan D, Forsyth DA, Zisserman A (2007) Tracking people by learning their appearance. IEEE Transactions on Pattern Analysis and Machine Intelligence 29:65–81
[51] Rehg J, Kanade T (1995) Model-based tracking of self-occluding articulated objects. In: IEEE International Conference on Computer Vision, pp 612–617
[52] Ren L, Shakhnarovich G, Hodgins J, Pfister H, Viola P (2005) Learning silhouette features for control of human motion. ACM Transactions on Graphics 24(4):1303–1331
[53] Rosales R, Sclaroff S (2002) Learning body pose via specialized maps. In: Advances in Neural Information Processing Systems
[54] Rosenhahn B, Kersting U, Powel K, Seidel HP (2006) Cloth X-Ray: MoCap of people wearing textiles. In: Pattern Recognition, DAGM
[55] Shakhnarovich G, Viola P, Darrell TJ (2003) Fast pose estimation with parameter-sensitive hashing. In: IEEE International Conference on Computer Vision, pp 750–757
[56] Sidenbladh H, Black M, Fleet D (2000) Stochastic tracking of 3d human figures using 2d image motion. In: IEEE European Conference on Computer Vision, vol 2, pp 702–718
[57] Sidenbladh H, Black MJ, Sigal L (2002) Implicit probabilistic models of human motion for synthesis and tracking. In: IEEE European Conference on Computer Vision, vol 1, pp 784–800
[58] Sigal L, Black MJ (2006) Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition, vol 2, pp 2041–2048
[59] Sigal L, Balan A, Black MJ (2007) Combined discriminative and generative articulated pose and non-rigid shape estimation. In: Advances in Neural Information Processing Systems
[60] Sminchisescu C, Jepson A (2004) Generative modeling for continuous non-linearly embedded visual inference. In: International Conference on Machine Learning, pp 759–766
[61] Sminchisescu C, Kanaujia A, Li Z, Metaxas D (2005) Discriminative density propagation for 3d human motion estimation. In: IEEE Conference on Computer Vision and Pattern Recognition, vol 1, pp 390–397
[62] Sminchisescu C, Kanaujia A, Metaxas D (2006) Learning joint top-down and bottom-up processes for 3d visual inference. In: IEEE Conference on Computer Vision and Pattern Recognition, vol 2, pp 1743–1752
[63] Stauffer C, Grimson W (1999) Adaptive background mixture models for real-time tracking. In: IEEE Conference on Computer Vision and Pattern Recognition, vol 2, pp 246–252
[64] Stenger BDR (2004) Model-based hand tracking using a hierarchical Bayesian filter. PhD thesis, University of Cambridge
[65] Sukel K, Catrambone R, Essa I, Brostow G (2003) Presenting movement in a computer-based dance tutor. International Journal of Human-Computer Interaction 15(3):433–452
[66] Taylor CJ (2000) Reconstruction of articulated objects from point correspondences in a single uncalibrated image. Computer Vision and Image Understanding 80(10):349–363
[67] Tomasi C, Kanade T (1991) Detection and tracking of point features. Tech. Rep. CMU-CS-91-132, Carnegie Mellon University
[68] Urtasun R, Darrell T (2008) Local probabilistic regression for activity-independent human pose inference. In: IEEE Conference on Computer Vision and Pattern Recognition
[69] Urtasun R, Fleet DJ, Hertzmann A, Fua P (2005) Priors for people tracking from small training sets. In: IEEE International Conference on Computer Vision, vol 1, pp 403–410
[70] Urtasun R, Fleet DJ, Fua P (2006) 3D people tracking with Gaussian process dynamical models. In: IEEE Conference on Computer Vision and Pattern Recognition, vol 1, pp 238–245
[71] Urtasun R, Fleet DJ, Fua P (2006) Motion models for 3D people tracking. Computer Vision and Image Understanding 104(2-3):157–177
[72] Vondrak M, Sigal L, Jenkins OC (2008) Physical simulation for probabilistic motion tracking. In: IEEE Conference on Computer Vision and Pattern Recognition
[73] Wachter S, Nagel HH (1999) Tracking persons in monocular image sequences. Computer Vision and Image Understanding 74(3):174–192
[74] Wang JM, Fleet DJ, Hertzmann A (2006) Gaussian process dynamical models. In: Advances in Neural Information Processing Systems 18, pp 1441–1448
[75] Wren CR, Pentland A (1998) Dynamic models of human motion. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp 22–27
[76] Wren CR, Azarbayejani A, Darrell T, Pentland AP (1997) Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7):780–785
Locomotion Activities in Smart Environments

Björn Gottfried, Centre for Computing Technologies, University of Bremen, e-mail: [email protected]
1 Introduction

One subarea in the context of ambient intelligence concerns the support of moving objects, i.e. monitoring the course of events while an object crosses a smart environment and intervening if the environment can provide assistance. For this purpose, the smart environment has to employ methods of knowledge representation and spatiotemporal reasoning. This enables the support of such diverse tasks as wayfinding, spatial search, and collaborative spatial work. While this chapter provides methods for dealing with these issues, a specific focus lies on the research question of which minimal information suffices for a smart environment to deal with approximate motion information in a robust and efficient way, that is, under realistic conditions. It will be shown that a thorough analysis of the task at hand enables the smart environment to get by on a minimum of sensory information: the choice of an adequate representation, together with knowledge about the smart environment, allows a simple technical implementation that needs to process only a minimum of sensory data. A number of case studies will be introduced and carefully examined in order to demonstrate how smart environments can effectively make use of locomotion patterns.
1.1 Why Locomotion Matters

The objective of smart environments is to support human beings who live and act in such environments. This requires the analysis of their behaviours in order to enable a smart environment to determine how it can help. A fundamental category of behaviours is locomotion behaviours, addressing the question as to how someone
changes his location over time. This chapter investigates locomotion behaviours. It looks at how the motion of someone who crosses a smart environment can be described. For this purpose a top-down approach is proposed, which asks for the distinctions that are relevant for specific tasks. The goal of this chapter is to clarify in logical terms what the essential ingredients of smart environments are insofar as such environments make use of how, and whether at all, people move around. This makes it necessary to analyse locomotion behaviours in relation to specific tasks and problems. In doing so, only a specific aspect of the concept of motion, or rather locomotion, is of interest here: motion always results in a change of location, and what we shall consider in the context of a smart environment is where objects are within this environment, that is, which locations an object has visited or should visit in order to cope with a specific task. The result will provide a specification which aids the implementation of smart environments that model object motion in terms of locations and changes of locations.
1.2 Overview The body of this chapter is structured as follows. The next section describes how the presented approach fits into the related work (Section 2). After that it is analysed which concepts are relevant in the context of a smart environment that monitors how people move around. For this purpose, a number of scenarios describe in which way a smart environment could support those people (Section 3). From these scenarios a formalisation can be derived that defines functional relationships among locations (Section 4). These will be used to define more specific algorithms that show how a smart environment will deal with the challenges described in the previous scenarios (Section 5). The chapter concludes with a discussion, outlook, and summary (Sections 6, 7 and 8).
2 Related Work on Motion Analysis and Smart Environments

Any analysis of object motion has to make several commitments. We have identified in particular five objectives we should look at when dealing with motion information in smart environments. (1) Different motion parameters exist. Which are relevant to realise smart environments? (2) The precision with which object motion is described can be arbitrarily fine. How fine should the precision be in order to solve the problems at hand? (3) Object motion can be described from different viewpoints. Which viewpoint should a smart environment adopt? (4) Human navigation and wayfinding play an important role in smart environments. Which aspects should be taken into account? (5) Finally, bottom-up and top-down approaches to motion analysis can be distinguished. Does one of these strategies take precedence over the other in the context of smart environments? Besides the consideration of issues concerning
motion analysis, a look at past investigations on smart environments shows that the topics of this chapter complement those investigations.
2.1 Motion Parameters

The motion of a body is frequently described in terms of the relation between the distance it covers and the time it needs to cover this distance. The notions of velocity and acceleration can then be used in order to describe how fast an object moves and how it changes its speed. But the velocity of a body is not the only quantity of interest in motion analysis. In the context of smart environments we are primarily interested in where an object moves and where it intends to move to. This is why we shall analyse how to locate objects in smart environments. In recent years many techniques have been invented that enable indoor localisation of objects, for instance by means of RFID [52, 26], smart floors [49, 44], wireless LAN [32, 28], or infrared positioning techniques [29, 27]; in particular, different techniques provide different resolutions of location information. Overviews of techniques at the sensory level for locating objects and identifying them are provided by [36, 24]. A location model for smart environments that considers in particular the locations of computing devices and services, and which focuses on location-aware and user-aware services, is proposed by [48]. Regardless of the positioning technique used, algorithms are necessary which make meaningful use of positional information in smart environments. Such algorithms should be independent of the sensors employed because sensor technology improves rapidly. That is, motion analysis algorithms should abstract from the sensory level.

Assumption 1 (Sensor independence) Methods for motion analysis should be sensor independent.
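A minimal sketch of what such sensor independence can look like in code is given below: the reasoning layer is written against an abstract localisation interface, while concrete back ends (the class names and the dictionary-based RFID example are purely illustrative assumptions) can be swapped without touching the rest of the system.

    from abc import ABC, abstractmethod

    class Localiser(ABC):
        """Abstract localisation interface the reasoning layer depends on."""

        @abstractmethod
        def region_of(self, object_id: str) -> str:
            """Return the identifier of the region the object is currently in."""

    class RFIDLocaliser(Localiser):
        """Hypothetical back end: the region of the RFID reader that last saw an object."""

        def __init__(self, last_seen_by_reader: dict):
            self.last_seen_by_reader = last_seen_by_reader

        def region_of(self, object_id: str) -> str:
            return self.last_seen_by_reader[object_id]

    # Usage: the reasoning code only ever calls region_of().
    localiser: Localiser = RFIDLocaliser({"nurse-7": "hallway-2"})
    print(localiser.region_of("nurse-7"))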
2.2 Precision

At first glance, it seems advantageous to deploy sensors so as to obtain the locations of objects as precisely as possible. However, more important than precision is accuracy, which concerns the correctness of information. We argue that coarse but correct information about the location of someone in a smart environment is better than precise information with a high probability of inaccuracy. Spatial and temporal information which falls into this class of coarse concepts is investigated in the field of qualitative representations [6]; with regard to built environments see in particular [5]. Even more important is that coarse information is frequently adequate. It is this latter statement that we shall prove to be valid for a couple of meaningful tasks in smart environments.
Assumption 2 (Information coarseness) Coarse sensory information is sufficient for a number of important problems arising in the context of motion analysis.
2.3 Viewpoint

For a robust and efficient solution to the problem of motion analysis, qualitative spatial reasoning methods are employed [43]. These methods originally derive from combinatorial, computational, and discrete geometry [9, 15] and have been brought into the field of qualitative navigation by [35]. While such techniques are employed for simultaneous localisation and mapping problems, SLAM for short [34], the design of a smart environment differs significantly from the design of a mobile agent. A smart environment is distinguished by knowing the infrastructure that supports agents moving in it, whereas the task of mobile agents frequently consists in finding their way in an unknown environment. In both cases qualitative spatial reasoning techniques are useful. What distinguishes these cases are their viewpoints: a smart environment has an allocentric view on the entire environment, while a mobile agent has just an egocentric, and hence partial, view on a larger area it is contained in. As opposed to the egocentric perspective, an allocentric view is considered to be independent of the actual position of the perceiver since it is anchored to the external space and defines spatial relations with respect to elements of the environment [50]. It is the aim of the present chapter to analyse how to solve motion analysis problems by means of the allocentric view of a smart environment.

Assumption 3 (Allocentric view) A smart environment should adhere to an allocentric view.
2.4 Wayfinding Investigating how a smart environment could support locomotion activities of humans relates to a large body of work on human navigation and wayfinding. Problems are taken into account from the point of view of architectural design, spatial orientation and wayfinding [12]; means to generate adaptive route directions are elaborated on [47]; and, from the psychological perspective effects of familiarity, visual access, and orientation aids are considered [13]. As opposed to such external cues that help in navigation and wayfinding, [25] investigate how cognitive maps determine wayfinding tasks from the point of view of Cognitive Sciences. The choice of salient features which are used in wayfinding tasks is treated by [53] and means for analysing behaviour patterns of pedestrians are investigated in [41]. Additionally, in the Artificial Intelligence community much work exists on navigation in physical environments [11, 33, 38, 30].
Fig. 1 Assumptions to be satisfied by smart environments. (The diagram relates Smart Environment, Allocentric View, Location Maintenance, Information Coarseness, Sensor Independence, and Sensory Equipment via the links "has", "needs", "is based on", "allows", and "defines".)
We shall consider human navigation and wayfinding problems. For this purpose a smart environment will be assumed that has a map of its infrastructure (in accordance with the allocentric view Assumption) and that maintains the whereabouts of each mobile object using this infrastructure. On this basis algorithms can be defined which solve wayfinding problems. Assumption 4 (Location maintenance) A smart environment should maintain the whereabouts of each mobile object.
2.5 Modelling One line of research employs data mining and learning techniques in order to search for regularities in the data. In [8] for example anomalies in event data of a health care smart home are sought by data mining techniques. Such approaches argue that smart environments are richly equipped with sensors that accumulate huge amounts of sensory data. This makes it necessary to investigate how to make sense of these data, similar to sensor networks for which challenges arise from the volumes of data collected by the huge number of mobile and stationary sensors [4]. Methods in the field of visual analytics are yet another way to search for regularities in large data sets [3]. We shall follow another line of research that instead analyses a given problem domain in order to determine which sensory information is relevant for specific tasks. Such a top-down analysis enables one to determine the sensors which are actually necessary, and hence, leads to data reduction and avoids the need for data mining. Assumption 5 (Sensory equipment) An application driven top-down analysis determines the necessary sensors.
2.6 Summary A number of assumptions have been identified which should be realised by smart environments that make use of locomotion behaviours. Later on methods will be introduced that satisfy these assumptions. They are summarised in Fig. 1.
3 Locomotion Activities in Smart Environments This section analyses locomotion behaviours in smart environments. For this purpose a specific scenario is chosen – a smart hospital (Section 3.1). The complexity of such an environment should enable the thorough investigation of most of the relevant issues concerned with the analysis of motion activities in smart environments. Four scenarios are discussed that differ with regard to significant aspects (Section 3.2). Their analysis aids in identifying fundamental characteristics of smart environments (Section 3.3).
3.1 Smart Hospitals

The idea of a smart hospital is not new. Until now, however, smart hospitals have concentrated on increasing the efficiency of clinical processes and knowledge management [23, 39]. Moreover, they support medication and primarily aim at paperless documentation. For example, El Camino Hospital in Mountain View, USA, is using technologies to improve communication among staff members and to make processes more efficient: nurses and other staff members use wireless, hands-free devices to communicate with each other; devices are employed that allow authorised staff members to open cabinet doors to medications and other supplies with thumbprint readings; tablet PCs and handhelds are used by doctors who will eventually do away with the clipboard; supply-replenishing systems automatically alert the suppliers of the hospital when anything from needles to dressing kits runs low; and lab samples move through various processing stations on an automated production line [7]. Apart from the lab samples, will locomotion-based technologies also have a future in a smart hospital? Motion behaviours come into existence in several different ways. Here we shall focus on those motion activities in which a human being changes her location over time, that is, we are concerned with locomotion behaviours. Considering what is going on at a hospital site allows a broad spectrum of different locomotion behaviours that might arise in a smart environment to be taken into account. The environment of a hospital has a complex structure. It consists of a building or even of a set of buildings on a larger area that define a complex spatial infrastructure. This infrastructure serves several purposes which require people to move around. In this context, a smart hospital should support people finding their way or coping with
other tasks within the hospital site. This support strongly depends on several factors including the urgency of a task, the roles of the people involved, as well as temporal and spatial constraints.

Fig. 2 A taxonomy of motion processes at a hospital site.

Fig. 2 shows a taxonomy of some typical locomotion processes at a hospital site. It distinguishes single moving objects as opposed to groups of moving objects. Then, it looks at different spatial scales in order to find specific scenarios that involve specific locomotion behaviours. We shall investigate some of them in Section 3.2. Their analysis aids in identifying fundamental characteristics which are relevant for realising these scenarios within a smart environment (Section 3.3). The results of this analysis will be used for a functional specification which relates the different characteristics (Section 4). Finally, algorithms can be defined on this specification (Section 5). Fig. 3 summarises this design process: the arrows indicate the information flow. It shows that the concepts are derived from the introduced scenarios, that the specification is based on the identified concepts, and that the algorithms are defined by the scenarios and by the specification. While scenarios and algorithms are specific in that they describe individual cases, the concepts and specification are defined in a general way.
Fig. 3 Design process used: numbers indicate corresponding sections, arrows information flow. Scenarios and algorithms are specific, while concepts and specification are generally defined. (The diagram relates Scenarios (3.2), Concepts (3.3), Specification (4), and Algorithms (5).)
3.2 Scenarios in a Smart Hospital

For each of the following scenarios a problem or challenge is identified. Afterwards it is described how the smart environment should cope with this challenge. This concerns the first design stage (see Fig. 4).

Scenario 1 (Planning) A nurse has to take care of several patients. To decide whom to visit next, urgency, space constraints, and other tasks are to be taken into account.

In order to realise this scenario the smart environment should provide means for tracking individuals. It is, however, sufficient to determine positions at the level of single rooms. The smart environment can then tell the nurse where she has to go next. More formally, the whereabouts of the nurse as well as the rooms of the patients she has to visit are represented by regions. A nurse is contained in a region, and since regions represent disjoint places (different rooms and hallways), she is contained in exactly one region. The smart environment further represents the arrangement of regions at the hospital site. In this way, it can provide the nurse with an efficient plan about where to go next when she is finished with a patient. Describing the path the nurse has to take amounts to considering an ordered list of regions she has to cross.

Scenario 2 (Wayfinding) A visitor is going to visit a patient and he wants to talk to the responsible doctor. The visitor is in this hospital for the first time and does not know where to go. An additional challenge is that the responsible doctor will himself move around instead of being in his office all the time.

The smart environment has to track an individual. In general the visitor will not be familiar with the environment, meaning the smart environment has to provide him with information about where to go. For this purpose a map could be displayed on the wall, which the visitor will most likely perceive depending on where he enters a specific room or hallway from. In particular, a personalised map will be confined to the most relevant content for the given situation; for customised maps see [51].
Fig. 4 Design process used: numbers indicate corresponding sections, arrows information flow. Scenarios and algorithms are specific, while concepts and specification are generally defined.
This is in fact a challenge for the smart environment as soon as a number of visitors cross the same hallway. In short, the smart environment will lead the visitor to his destination after he has told the system at the entrance who he wants to see. More formally, the visitor is contained in a specific region, namely the region of the entrance hall. There he lets the system know who he wants to visit, patient or doctor. Since the system knows the whereabouts of everybody, it can compute the order of regions the visitor has to pass. Even if the patient or doctor changes his location in the meantime, the smart environment can react promptly by adapting the order of regions the visitor has to cross to reach his goal.

Scenario 3 (Searching) A patient needs help, requiring a doctor to be called as fast as possible and also a nurse to provide assistance.

The smart environment has to look for the doctor and the nurse who are closest to the patient; it might also consider whether doctor and nurse are available, and if not, look for the next available doctor and nurse. As soon as the smart hospital has identified doctor and nurse it has to show them effectively where to go. Furthermore, it can provide them with as much meaningful information about the state of the patient as is available. More formally, doctor and nurse are contained in either the same region or in different regions. As in the wayfinding scenario the smart environment can tell doctor and nurse the order of regions they have to cross in order to get to the patient as fast as possible. Although they are familiar with the environment, the smart environment will help them get to the target region as fast as possible. Additionally, further information about the patient and his state can be shown on smart displays at appropriate places on their way or conveyed by a smart audio system.

Scenario 4 (Monitoring) It is of interest for the doctor to monitor not only the vital signs of a patient but also how they correlate with his spatial activities. This is for the purpose of analysing his progress.

The system has to track an individual who is a patient and who will therefore be within the smart environment for a longer period of time, for example a week.
Table 1 Locomotion scenarios summarised in terms of participating continuants and activities

Scenario     Continuant   Activities
planning     nurse        perform tasks at different locations
wayfinding   visitor      search for locations
searching    doctor       follow path to locations
monitoring   patient      show modes of movements at locations
The system will monitor his recovery by observing how often he leaves his bed, room, or even the building, and by correlating his locomotion with his vital signs. The smart environment will tell the responsible doctor whether the patient is making progress. This might take place during the ward round, that is, a display will show the doctor how the vital signs of the patient correlate with his activities. More formally, as opposed to the previous scenarios, finer locomotion behaviours might be of interest, for instance how fast the patient moves through his room, say between bed and window, or between bed and washing room. Those parts of the room can also be represented by regions, but instead of measuring only whether the patient is contained in a region, the speed with which he reaches another region is also of interest.
3.3 Characterising Locomotion Activities

In the following, four different categories are identified that derive from the previous scenarios, which are summarised in Table 1. These categories will aid in designing smart environments by showing which aspects are to be taken into account (see Fig. 5).
3.3.1 Continuants The first category, continuants, is about the people who are to be supported by the smart environment, and hence, which are to be monitored: a patient, a visitor, a nurse, or a doctor. In most cases the smart environment has to provide support for a single person. But as the searching scenario shows sometimes the collaboration of a number of people is required and needs to be supported by the smart environment. When a group of people, such as a team of doctors and nurses who are preparing an operation, have to work together the smart environment could suggest a room, could even reserve it, as well as remind and lead the continuants to it when it is time to get together — the smart environment is the common memory of the continuants who have to work together. Many possibilities exist on how a collection of people need to act together. A classification scheme for collectives can be found in [54]
who analyse typical characteristics of different kinds of groups and their movement patterns.

Fig. 5 Design process used: numbers indicate corresponding sections, arrows information flow. Scenarios and algorithms are specific, while concepts and specification are generally defined.
3.3.2 Events Each scenario can be conceived of as some more or less specific event that corresponds to a specific problem or task the continuants are confronted with: an event might be defined by a couple of tasks which are to be performed at different locations (planning scenario); an event might consist in finding the way between locations in an unknown environment (wayfinding scenario); an event might be defined as a search problem, for example specific people with a specific competence are to be sought as fast as possible (searching scenario); not so much an event but rather an ongoing process is the observation of the spatial activities of the patient; but then, an event is defined by a specific time interval for which measurements are taken (monitoring scenario). This summary makes clear how the scenarios generalise to other domains in which also planning, wayfinding, searching, and monitoring problems in space arise. In general, such events can be separated in domain independent events: {wayfinding, searching, hiding, provide security} and domain dependent events: {operation, round, serving meals, visiting, cleaning}
3.3.3 Time

Time is a fundamental dimension in that the speed of supporting someone might be crucial, as in the searching scenario. The speed with which someone gets from one location to another is relevant in this scenario, as is the speed with which someone crosses a region, as in the monitoring scenario. Speed is normally captured by the concept of velocity, which relates a temporal interval to
a distance someone has travelled during this time interval. Instead of measuring a precise velocity value it might be sufficient to relate a temporal interval to the number of regions someone has passed or has to pass while walking between specific locations; additionally the smart environment knows the distances covered for every pair of regions. For a qualitative concept of velocity see [42]. Also, the temporal order of how something is performed might be relevant (planning scenario), or rather sufficient, as well as how events temporally relate [2].
3.3.4 Space Each event takes place in space and the smart environment itself has a specific spatial structure that determines how it has to provide support. Among others, spatial aspects are about considering space at different scales. The largest scale concerns the whole hospital site (of interest for all scenarios and in particular for the wayfinding scenario); intermediate scales concern connected portions of it (primarily of interest for the planning scenario); the smallest scale is that one of a single human (monitoring scenario). It is the idea to approximate locations at specific scales which are relevant for the given tasks. We will do this later at the level of rooms. In other words a function called position will be introduced that returns regions corresponding to rooms and hallways. In addition, the distance between locations might be of interest and should be defined in accordance to the spatial scale used. Moreover, the neighbourhood of a region refers to all other regions which are directly connected to that region. This concept is of interest when the smart environment provides the user with information about which region he should enter next in order to get to a specific destination, as in the wayfinding scenario. Other spatial aspects such as a closer relation might be of interest. Then, both the neighbourhood and the closer relation can be employed for realising the planning scenario.
3.3.5 Distinguishing Events Each of these four categories determines locomotion behaviours and can therefore be used for linking observable criteria, primarily concerning temporal and spatial constraints, to knowledge about specific events, thus provide the smart environment with what it needs to know in order to support people. Then, the nurse who starts to work notifies the smart environment that it knows that it has to provide her assistance during her round; this relieves her from planning this round. By contrast, the visitor registers himself at the entrance and triggers a wayfinding process, while the urgency case of a patient triggers a search process. Finally, a monitoring process is launched when a new patient checks in. There might be further temporal and spatial aspects of interest in other domains. An overview on spatial and temporal reasoning techniques for smart homes can be found in [19] and further issues about spatiotemporal reasoning and context awareness can be found in [21].
4 The Representation of Locomotion Activities

The allocentric viewpoint of a smart environment enables it to determine where a moving agent is. In this way it can track any mobile object that moves through it. A functional specification that realises this is presented in this section. First, some general issues concerning the treatment of spatiotemporal information are discussed in Section 4.1. Then a representation describing the functional relationships for realising the previous scenarios is presented in Section 4.2; that is, a representation is proposed which is entirely based on regions and allows one to distinguish both the locations where objects are and the relations among those locations. Accordingly, motion patterns are described by means of the locations of objects and the changes of those locations. Section 4.3 describes the allocentric component in this context and how the allocentric viewpoint is part of every process that solves a problem within the smart environment. Section 5 will afterwards show what the algorithmic solutions of the previously described scenarios look like on this basis.
4.1 Spatiotemporal Information

Spatiotemporal information has been thoroughly dealt with in the past. Many approaches are related to the Triad framework of [45], who distinguishes the three dimensions referred to as What, Where and When, similar to [40], who distinguish location, time, and theme perspectives. Here the What or theme component divides into continuants and occurrences (events), continuants enduring through time and occurrences being extended in time [20]. As the examples of the previous section indicate, a formalisation based on regions is sufficient. Region-based formalisms include the 9-intersection calculus [10] and the region connection calculus [46], which relate pairs of regions topologically. Here we are not interested in comparing how regions relate topologically but in describing how a person crosses a smart environment that is divided into regions. This is similar to [16], who proposes to consider the tessellation of an environment as a basis relative to which locomotion patterns are described. Another related approach is that of [31], who describe movement patterns of single agents represented by directed lines in relation to regions.
4.2 Functional Specification

We shall now turn our attention to specifying relations on the concepts previously introduced (see Fig. 6).

Fig. 6 Design process used: numbers indicate corresponding sections, arrows information flow. Scenarios and algorithms are specific, while concepts and specification are generally defined.

According to what has been derived in Section 3.3, we assume a primitive ontology of continuants c ∈ C, events e ∈ E, times t ∈ T, and regions r ∈ R. In the case of a hospital the regions represent places such as entrance hall, hallways, and rooms. The four domains are related by means of a couple of
functions that define how the smart environment makes sense of these domains. The first one determines the position of a continuant at a specific time:

    position : C × T → R    (1)

It is this function that solely defines the sensory equipment necessary for the smart environment. This is because for each other function nothing more is required than to determine in which region a specific continuant is. The sophistication of these functions is due to the smart environment knowing its spatial structure, so that for instance it can derive what the shortest path between two regions is. The smart environment defines a kind of metric that returns for two regions their minimal distance:

    distance : R × R → ℝ    (2)

Based on this a function can be defined that returns which of two regions is closer to a third one:

    closer : R × R × R → R    (3)

Then this function can itself be used to determine the region one should enter when being in a specific region r_i and when aiming at reaching another region:

    visitOnRoute : R × R → R    (4)

A somewhat more general function is needed when one is interested in determining for a set of regions which to visit next, e.g. the nearest one, given that one is contained in a specific region r_i:

    visitNearest : R × 2^R → R    (5)

Another distinction between (4) and (5) is that the former is restricted to search in the neighbourhood of the current region r_i, while (5) searches in a set of regions that are not necessarily in the neighbourhood of r_i. A number of continuants might be contained in the same region. The following function is needed in order to determine for a region who is contained in that region:

    containedIn : R → 2^C    (6)

and another function is needed that returns the role of a continuant:

    getRole : C → {nurse, doctor, patient, visitor, other}    (7)

All rooms which are in the direct neighbourhood of a given room, more precisely which are connected by a door, are returned by the following function:

    neighbourhood : R → 2^R    (8)
The function position tells the smart environment the position of the person under consideration, and a couple of functions can be defined on the basis of the distance function. Since the smart environment has a static spatial structure, much can be computed already when the smart environment is defined. This includes distances between regions as well as all those functions that build upon the distance between regions: closer, visitOnRoute, and visitNearest. In principle, for a given smart environment these functions can be defined in advance by means of tables that enable efficient access. The defined functions can be conceived of as an abstract data type with which a number of more specific functions can be defined for the smart environment. Besides the position, the role has to be determined automatically, for instance by RFID and a table that maps identifiers to roles. For determining the speed with which someone is crossing a specific region, in the simplest implementation one just measures the time during which this person is contained in that region; this time and the size of the region are then set into relation.
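The following Python sketch illustrates one possible realisation of this abstract data type on a room-adjacency graph. It is a sketch under simplifying assumptions (unit-length edges, hop counts as distances, only the current position kept per continuant); class and method names are illustrative and do not come from an existing library.

    from collections import deque

    class SmartEnvironment:
        """Graph-based sketch of the functions (1)-(8)."""

        def __init__(self, adjacency, roles):
            self.adjacency = adjacency     # region -> set of neighbouring regions   (8)
            self.roles = roles             # continuant -> role                      (7)
            self.positions = {}            # continuant -> current region            (1)
            # Precompute all pairwise hop distances, as suggested in the text.
            self.dist = {r: self._bfs(r) for r in adjacency}

        def _bfs(self, source):
            d, queue = {source: 0}, deque([source])
            while queue:
                r = queue.popleft()
                for n in self.adjacency[r]:
                    if n not in d:
                        d[n] = d[r] + 1
                        queue.append(n)
            return d

        def position(self, c):                      # position : C x T -> R (current time only)
            return self.positions[c]

        def distance(self, r1, r2):                 # distance : R x R -> reals
            return self.dist[r1][r2]

        def closer(self, r1, r2, r3):               # closer : R x R x R -> R
            return r1 if self.distance(r1, r3) <= self.distance(r2, r3) else r2

        def visit_on_route(self, current, goal):    # visitOnRoute : R x R -> R
            return min(self.adjacency[current], key=lambda n: self.distance(n, goal))

        def visit_nearest(self, current, regions):  # visitNearest : R x 2^R -> R
            return min(regions, key=lambda r: self.distance(current, r))

        def contained_in(self, region):             # containedIn : R -> 2^C
            return {c for c, r in self.positions.items() if r == region}

        def get_role(self, c):                      # getRole : C -> roles
            return self.roles.get(c, "other")

        def neighbourhood(self, region):            # neighbourhood : R -> 2^R
            return self.adjacency[region]

Updating self.positions is the only place where sensory input enters; everything above it is independent of the particular localisation technology, in line with Assumption 1.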
4.3 Allocentric View

Each process solving a given problem of the smart environment has an allocentric view on this environment. This means that each process has access to a data structure that represents the spatial structure of the whole hospital site, enabling it to evaluate the functions of the previous section. The distinct places of the environment can be represented by a graph that captures the topological structure of the environment. In particular, the neighbourhood of a place, which is represented by a node, can be read directly off this graph by following the edges which emanate from it. Each node, that is, each region or room, is annotated with distances, shortest-path information and everything else which can be computed in advance. Also, the whereabouts of each continuant are maintained in such a graph structure. The latter is the only part which changes over time, and a change is triggered whenever a mobile object crosses the boundary between two distinct places. This is the only challenge the sensory equipment of the smart environment has to manage. Independent of the sensory equipment, this structure is used by the reasoning mechanisms
of the smart environment. Conversely, any sensory equipment that maintains this graph structure can be deployed. In this way the structure forms the level of abstraction between the sensors at the bottom and the reasoning mechanisms at the top. Since each process has access to the current whereabouts of all persons within the smart environment, algorithms that rely on the current position of somebody are straightforward; that is, the position function can always be used instead of documenting in a separate data structure where someone currently is or should be, which is in fact quite susceptible to errors. In short, the allocentric paradigm enables each process to determine at any time the whereabouts of everybody, and this is the fundamental basis on which the following algorithms build. A role and an event type are assigned to each continuant. This enables each part of the smart environment to determine how it has to support the individual concerned. Additional parameters might also be necessary, such as the identifier of another continuant whom the former continuant is looking for, or a destination location.

Fig. 7 Design process used: numbers indicate corresponding sections, arrows information flow. Scenarios and algorithms are specific, while concepts and specification are generally defined.
5 Locomotion Based Algorithms In Section 4 a representation has been defined that includes both a number of domains and a number of functions that show the interrelationships of these domains. Thereby, regions and continuants are the two most important domains which manifest how people move around in space. We shall now investigate how we can make use of these domains and functions in order to define algorithms (see Fig. 7); these algorithms show that our assumptions are valid: the scenarios described earlier can be solved by the smart environment on the basis of a simple sensory equipment. This sensory equipment determines the whereabouts of continuants solely at the level of rooms and hallways, as well as their roles.
5.1 Planning Scenario

There is a nurse who is to be supported by the smart environment, which tells her which patient out of a set, called patients, she has to visit next. Here, the patient who is nearest to the nurse is selected. Other constraints are conceivable, such as the degree of urgency.

    rooms = ∅
    for i = 1 to |patients| do
        rooms = rooms ∪ {position(patients(i), now)}
    end for
    if rooms ≠ ∅ then
        return visitNearest(position(nurse, now), rooms)
    else
        return report that no one is left
    end if
Algorithm 1: Visit Next (nurse ∈ C, patients ⊂ C)

This algorithm returns the room the nurse has to visit next, if there is any patient left. For this purpose the rooms that contain the patients are determined. Afterwards the room which is nearest to the position of the nurse is computed. The latter computation is to be replaced by another one as soon as other constraints, which identify the patient to be visited next, are to be taken into account.
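A possible Python rendering of Algorithm 1, written against the SmartEnvironment sketch from Section 4.2 (and therefore sharing its illustrative, non-normative names), could look as follows.

    def visit_next(env, nurse, patients):
        """Return the room the nurse should visit next, or None if no patient is left."""
        rooms = {env.position(p) for p in patients}
        if rooms:
            return env.visit_nearest(env.position(nurse), rooms)
        return None  # report that no one is left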
5.2 Wayfinding Scenario

A visitor is going to visit a patient. The visitor identifies himself at the entrance and the smart environment will lead him to the room of the patient.

    source = position(visitor, now)
    destination = position(patient, now)
    while source ≠ destination do
        source = visitOnRoute(source, destination)
        print Now enter source
        source = position(visitor, now)
        destination = position(patient, now)
    end while
Algorithm 2: Wayfinding (visitor ∈ C, patient ∈ C)
Assuming that the patient never leaves his room, the last line in the loop could be omitted (destination = position(patient, now)). Similarly, the line before it could be omitted if the visitor always took the correct path (source = position(visitor, now)); but this algorithm takes into account that the visitor might sometimes take a wrong path and therefore re-determines his position now and then. While this algorithm just outlines the main steps for supporting a wayfinding task, several extensions are conceivable. The smart environment could order a lift in advance when the visitor has to change floors, or a lift could even be reserved for a doctor who has to get to a specific place as fast as possible; visitor or doctor do not even have to choose the correct button inside the lift, since the lift already knows the destination. Similarly, doors that lead to other parts of the hallway or into other rooms could be opened automatically, so that the user is afforded the appropriate door or lift; this is an example of how Gibson's concept of affordance [14] might become an integral part of a smart environment. Or, the smart environment can determine how crowded a hallway on a route is, and consequently suggest alternative routes, especially if the doctor is in a hurry. Finally, one might choose between the fastest route, the most comfortable route for those who cannot walk very well, or routes which support wheelchairs.
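A hedged Python counterpart of Algorithm 2, again on top of the SmartEnvironment sketch, is shown below; the print call stands in for whatever display or audio channel guides the visitor, and the polling pause is an assumption of this sketch rather than part of the original algorithm.

    import time

    def wayfinding(env, visitor, patient, poll_seconds=1.0):
        """Guide the visitor region by region until he reaches the patient."""
        source = env.position(visitor)
        destination = env.position(patient)
        while source != destination:
            print(f"Now enter {env.visit_on_route(source, destination)}")
            time.sleep(poll_seconds)             # wait for the visitor (and patient) to move
            source = env.position(visitor)       # re-read both positions: the visitor may stray
            destination = env.position(patient)  # and the patient or doctor may have moved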
5.3 Searching Scenario

The next algorithm searches for a doctor. The search starts at the room of the patient and successively takes into account the rooms in the neighbourhood of the patient's room, then the rooms in the neighbourhood of those rooms, and so on, until it has found a doctor. In this way, the doctor who is closest to the patient can be identified. A simple extension would check whether the doctor currently has time to visit the patient.

    d := null
    N := neighbourhood(position(patient, now))
    checkedRooms := {position(patient, now)}
    while d = null do
        people := containedIn(N)
        for i := 1 to |people| do
            if getRole(people(i)) = doctor then
                d := people(i)
            end if
        end for
        checkedRooms := checkedRooms ∪ N
        N := neighbourhood(N) − checkedRooms
    end while
Algorithm 3: Find Doctor for Patient (patient ∈ C)

Note how the variable checkedRooms takes care of terminating the search. Without subtracting the rooms already checked from the new neighbourhood, those rooms would always be included in the next iteration and the algorithm would endlessly check where the next doctor is. One might argue that this would not pose a problem: either there is a doctor in the hospital, in which case the algorithm will find him, or there is currently no doctor in the hospital, in which case it would make sense to loop through this algorithm again and again until a doctor becomes available. One can easily imagine how this function generalises to finding a specific person having a particular role, and hence how the searching scenario is realised by the smart environment:

    d := Find Person for Patient (doctor, patient)
    n := Find Person for Patient (nurse, patient)
    Wayfinding (d, patient)
    Wayfinding (n, patient)
Algorithm 4: Lead Doctor and Nurse to Patient (patient ∈ C)

This algorithm can easily be adapted for the purpose of calling someone with a particular competence to a specific place.
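The sketch below gives one possible Python version of the generalised search and of Algorithm 4, using the SmartEnvironment and wayfinding sketches introduced earlier. The explicit loop guard (stopping when no unvisited rooms remain) is an addition of this sketch; the original algorithm leaves that case open, as discussed above.

    def find_person_for_patient(env, role, patient):
        """Ring-by-ring search for the closest person with the given role."""
        start = env.position(patient)
        checked = {start}
        frontier = set(env.neighbourhood(start))
        while frontier:                               # guard added: give up if nobody is found
            for room in frontier:
                for person in env.contained_in(room):
                    if env.get_role(person) == role:
                        return person
            checked |= frontier
            frontier = {n for r in frontier for n in env.neighbourhood(r)} - checked
        return None

    def lead_doctor_and_nurse_to_patient(env, patient):
        doctor = find_person_for_patient(env, "doctor", patient)
        nurse = find_person_for_patient(env, "nurse", patient)
        wayfinding(env, doctor, patient)
        wayfinding(env, nurse, patient)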
5.4 Monitoring Scenario The monitoring scenario is an example of how the smart environment can make use of the available motion data which is acquired over a longer period of time. It memorises the whereabouts of every patient. This enables it to determine in particular how long one needs to reach the washing room or to pace along a hallway or to get to the canteen. The speed information with which a patient crosses different regions can be set into relation with his vital signs and this provides the doctor additional information concerning the activities of the patient and how well he deals with these activities. This example illustrates how monitoring scenarios split up into two different parts, the monitoring and the interpretation part. The monitoring part captures the positional data, and therewith the locomotion behaviour of a patient; it repeatedly saves positional information by taking into account a defined time-lag: The function put is a database function that saves the position of a given continuant at a given time and the function delay stops the tracking of a continuant for an amount of time which is given by break. Several interpretation methods for positional data are conceivable, depending on what a doctor will evaluate. The following example determines how long the patient stayed on average in a region, which approximately indicates the mean velocity with which the patient crossed this specific region, for example a hallway:
t := now
put positionalDB for patient to position(patient, t) at t
delay(break)
Monitor Patient Position (patient, break)
Algorithm 5: Monitor Patient Position (patient ∈ C, break ∈ R)

let durations = select p.tend − p.tstart
                from positionalDB as p
                where p.id = patient and p.location = r
averageDuration := 0
numOfVisits := 0
for i := 1 to |durations| do
    averageDuration := averageDuration + durations(i)
    numOfVisits := numOfVisits + 1
end for
return averageDuration / numOfVisits
Algorithm 6: Observe Patient in Region (patient ∈ C, r ∈ R)
This example shows how spatiotemporal databases become of interest for evaluating what has been monitored about the locomotion of people in a smart environment. Note that the let statement is conceived of as a database function that retrieves, in this case, the temporal intervals (tend − tstart) during which the patient stayed in the given region. For a survey on spatiotemporal databases see [1]; a number of specific methods for moving object databases are provided by [22].
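As an illustration of how the interpretation step of Algorithm 6 could be expressed against an ordinary relational store, the following Python sketch assumes a table positionalDB(id, location, tstart, tend) filled by the monitoring step; the schema and function name are hypothetical.

import sqlite3

def average_stay(db_path, patient_id, region):
    """Mean duration of the patient's visits to one region (cf. Algorithm 6)."""
    with sqlite3.connect(db_path) as con:
        rows = con.execute(
            "SELECT tend - tstart FROM positionalDB WHERE id = ? AND location = ?",
            (patient_id, region),
        ).fetchall()
    durations = [r[0] for r in rows]
    return sum(durations) / len(durations) if durations else None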
6 Discussion Whereas the present work exemplifies important issues of smart environments using the spatial infrastructure of a hospital, in [18] spatial aspects of diseases are investigated. There it is analysed how diseases develop in space and time, and a result of this analysis is that diseases themselves define the environment that has to be considered in order to track them. This is complementary to the current work, which presumes a specific spatial infrastructure of a smart environment and analyses how events are realised within this infrastructure. Technical details about how to localise people in this infrastructure have been deliberately omitted. As argued in Section 2, there exist many sensors which would provide the smart environment with the required information. One could, for instance, easily imagine equipping everybody entering the hospital with an RFID chip and each door frame with RFID reading devices. Specific identifiers
can then be mapped to specific roles of people. However, the goal of this work is to show how confining the system to simple sensory equipment already enables a smart environment to provide people with sophisticated support. An adequate representation makes this possible. Coming back to the assumptions stated in Section 2, it can be shown how the chosen representation fulfils them. Sensor independence is realised by the introduction of the position and getRole functions, which form the interface between the sensory level and the reasoning level. The information coarseness assumption is satisfied because the scenarios do not require distinguishing locations finer than rooms and hallways. The position function simultaneously enables an allocentric view and takes care of location maintenance. The required sensory equipment follows from how the position and getRole functions are implemented in a specific case. Taking these assumptions together, it has been shown how an abstraction level clearly defines the interface between the sensory level and the reasoning level: a graph structure representing the topology of the environment is used for maintaining the locations of mobile objects. The sensory level can be interchanged whenever new technologies become available, and changes at the sensory level do not influence the reasoning level. Moreover, this strategy argues in favour of sensory equipment that is defined by what is actually needed from the point of view of the reasoning components, and thus by the tasks at hand. Such a top-down view on designing smart environments is meaningful not only for dealing with locomotion activities but also for monitoring other stationary activities or changes. By confining the sensory technologies to what is actually relevant for a specific environment, data mining and learning techniques become unnecessary. Those techniques might in fact be useful for learning something about the behaviours of people, which can help in inventing new smart devices and smart environments. But the final solution for a new technology should work properly from the very beginning, immediately after installation. Otherwise the system behaviour would not be deterministic from the point of view of the user, i.e. regarding his perception; he would consider the system defective as long as it behaves irregularly.
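The interface between the sensory level and the reasoning level can be made explicit in code. The following Python sketch is one possible rendering of that abstraction; the class names and the RFID backend are illustrative assumptions, not part of the original specification.

from typing import Protocol, Hashable

Room = Hashable  # rooms and hallways are opaque identifiers

class SensoryLayer(Protocol):
    """Interface used by the reasoning algorithms of this chapter.

    How position and getRole are implemented (RFID gates, WLAN
    positioning, infrared badges, ...) stays hidden behind this interface.
    """
    def position(self, continuant: str) -> Room: ...
    def get_role(self, continuant: str) -> str: ...

class RFIDGateLayer:
    """One hypothetical backend: door-frame readers maintain a room-level map."""
    def __init__(self):
        self._rooms: dict[str, Room] = {}
        self._roles: dict[str, str] = {}
    def on_gate_event(self, tag: str, room: Room) -> None:
        self._rooms[tag] = room          # coarse, room-level location maintenance
    def position(self, continuant: str) -> Room:
        return self._rooms[continuant]
    def get_role(self, continuant: str) -> str:
        return self._roles[continuant]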
7 Outlook Several extensions are conceivable, in particular when deploying further positional sensors that allow finer distinctions, such as how often nurses have to go to specific cupboards in order to fetch medicine or particular devices; knowledge about this might be of interest for optimising the spatial organisation of the hospital. Another scenario based solely on the proposed positional technology is the generation of evacuation plans, which in the case of fire enable the rescue service to determine where everybody is, who is attached to specific devices, who is not able to walk on his own, and so forth; such a plan can be printed at any time, so that as soon as
the whole technology breaks down, at least the last plan will be available at some service point outside the burning hospital building. Or a doctor or nurse might look for a colleague: since the smart environment knows the whereabouts of the staff, including who is currently not within the hospital, it can tell everybody where to find another person; it can additionally check whether this person has enough time to meet, and if this is the case, it can suggest to both continuants a place where they could meet, for example a nearby conference room where they can talk in private. While the present chapter has focused on a concept of locomotion that is restricted to rough locations and changes of location, finer descriptions of locomotion behaviours might be of interest. Such small-scale locomotion behaviours are relevant, for instance, for people suffering from cognitive impairments. In this context, [37] distinguish different modes of locomotion that the elderly show in a nursing home: random, i.e. frequently changing the direction of movement and hesitating now and then; lapping, i.e. taking awkward paths and going a long way round; and pacing, i.e. running to and fro between specific locations. A general calculus for characterising locomotion patterns among groups of people is given by [17]. Such a calculus becomes of interest when modelling collaborative spatial work, for example when a team of doctors and nurses perform an operation: the intersection of movement paths should be avoided, the spatial interplay among all people should be supported, and other means of optimising the spatial activities of a group of people should be planned and monitored. Therefore, small-scale motion patterns of single people [37] or groups of people [17] should be elaborated further in order to make them available as additional tools for monitoring complex behaviours in smart environments. This chapter has essentially been framed as a vision piece that argues for the deployment of locomotion-based support within smart environments. In doing so, a first formalisation in terms of a functional specification has been suggested which will enable such technologies to be realised in the future. The next step will consist in further elaborating the formal aspects of this specification and in putting it into a broader context that shows how other application areas will profit from these technologies too.
8 Summary A class of tasks has been presented which can be supported by smart environments. These tasks have in common that they include a motion component; that is, during such tasks people move around in space: they have to follow an unfamiliar path, they want to meet or are looking for others, or they have to perform tasks successively at different locations. For these purposes, the smart environment can show them the way (wayfinding tasks), it can tell them about the whereabouts of others and of themselves (search tasks), it can provide help in deciding where to go next in order to stick to constraints which optimise spatial conditions
(spatial planning tasks), and it can evaluate the past whereabouts of patients in order to analyse their behaviours (monitoring tasks). For dealing with these tasks it has been shown that a coarse determination of the locations of people is sufficient. That is to say, the sensory equipment of a smart environment can be restricted in the sense that it only has to sense the whereabouts of people at the level of rooms and hallways, as opposed to tracking people precisely. Such an approach is likely to be more robust in that the sensory output needs no particular precision. As a consequence, being at different locations at different times is the only way movements of people are expressed in this work. This suffices to cope with a number of tasks which involve the locomotion of people in smart environments. A yet more general result is achieved: it could be shown that a top-down approach to designing smart environments has many advantages. It defines a clear interface between the sensory level and the reasoning level, which makes it possible to interchange the sensory level with more advanced technologies once they become available, without users becoming aware of this. Moreover, it is possible to decide which sensory information needs to be processed at all in order to realise a smart environment with a specific functionality. This allows a least-commitment strategy to be employed that looks for the sensory information that is already sufficient to realise the intended functionality. In the sense of the information coarseness assumption (Assumption 2), this improves the robustness of the whole system, which need not rely on precise measurements.
References
[1] Abraham T, Roddick JF (1999) Survey of spatio-temporal databases. GeoInformatica 3(1):61–99
[2] Allen JF (1983) Maintaining knowledge about temporal intervals. Communications of the ACM 26(11):832–843
[3] Andrienko N, Andrienko G (2007) Extracting patterns of individual movement behaviour from a massive collection of tracked positions. In: Gottfried B (ed) Proceedings of the 1st Workshop on Behaviour Monitoring and Interpretation (BMI’07), CEURS Proceedings, vol 296, pp 1–16
[4] Balazinska M, Deshpande A, Franklin MJ, Gibbons PB, Gray J, Handsen M, Liebhold M, Nath S, Szalay A, Tao V (2007) Data Management in the Worldwide Sensor Web. Pervasive Computing 6(2):30–40
[5] Bittner T (2001) The qualitative structure of built environments. Fundamenta Informaticae 46(1-2):97–128
[6] Cohn AG, Hazarika SM (2001) Qualitative spatial representation and reasoning: An overview. Fundamenta Informaticae 43:2–32
[7] Colliver V (2004) Hospital of the future: High-tech system helps improve quality of care. San Francisco Chronicle p E 1
[8] Cook DJ (2006) Health Monitoring and Assistance to Support Aging in Place. Journal of Universal Computer Science 12(1):15–29
[9] Edelsbrunner H (1987) Algorithms in combinatorial geometry. Springer, Berlin
[10] Egenhofer M, Franzosa R (1991) Point-Set Topological Spatial Relations. International Journal of Geographical Information Systems 5(2):161–174
[11] Epstein SL (1997) Spatial representation for pragmatic navigation. In: Hirtle SC, Frank AU (eds) Spatial Information Theory COSIT ’97, LNCS, Springer, pp 373–388
[12] Gaerling T, Lindberg E, Maentylae T (1983) Orientation in buildings: Effects of familiarity, visual access, and orientation aids. Applied Psychology 68:177–186
[13] Gaerling T, Book A, Lindberg E (1986) Spatial orientation and wayfinding in the designed environment: A conceptual analysis and some suggestions for postoccupancy evaluation. Journal of Architectural Planning Resources 3:55–64
[14] Gibson J (1979) The ecological approach to visual perception. Houghton Mifflin Company, Boston
[15] Goodman J, Pollack R (1993) Allowable sequences and order types in discrete and computational geometry. In: Pach J (ed) New trends in discrete and computational geometry, Springer, Berlin, pp 103–134
[16] Gottfried B (2006) Spatial health systems. In: Bardram JE, Chachques JC, Varshney U (eds) 1st International Conference on Pervasive Computing Technologies for Healthcare (PCTH 2006), November 29 – December 1, Innsbruck, Austria, IEEE Press, p 7
[17] Gottfried B (2008) Representing short-term observations of moving objects by a simple visual language. Journal of Visual Languages and Computing 19:321–342
[18] Gottfried B (2009) Modelling spatiotemporal developments in spatial health systems. In: Olla P, Tan J (eds) Mobile Health Solutions for Biomedical Applications, IGI Global (Idea Group Publishing), pp 270–284
[19] Gottfried B, Guesgen HW, Hübner S (2006) Spatiotemporal reasoning for smart homes. In: Augusto JC, Nugent CD (eds) Designing Smart Homes, The Role of Artificial Intelligence, LNCS, vol 4008, Springer, Heidelberg, pp 16–34
[20] Grenon P, Smith B (2004) SNAP and SPAN: Towards dynamic spatial ontology. Spatial Cognition and Computation 4:69–104
[21] Guesgen HW, Marsland S (2009) Spatio-temporal reasoning and context awareness. In: Nakashima H, Augusto JC, Aghajan H (eds) Handbook of Ambient Intelligence and Smart Environments, LNCS (this volume), Springer, Heidelberg
[22] Güting RH, Schneider M (2005) Moving Object Databases. Morgan Kaufmann Publishers, San Francisco, CA, USA
[23] Heine C, Kirn S (2004) Adapt at agent.hospital - agent based support of clinical processes. In: Proceedings of the 13th European Conference on Information Systems, The European IS Profession in the Global Networking Environment, ECIS 2004, Turku, Finland, June 14–16, p 14
[24] Hightower J, Borriello G (2001) A Survey and Taxonomy of Location Systems for Ubiquitous Computing. Technical Report UW-CSE 01-08-03, University of Washington, Computer Science and Engineering, Box 352350, Seattle, WA 98195
[25] Hirtle S, Heidron B (1993) The structure of cognitive maps: Representations and processes. In: Gaerling T, Golledge R (eds) Behaviour and Environment: Psychological and Geographical Aspects, Elsevier Science Publishers, pp 170–192
[26] Hopkinson A (2007) State of the Art in RFID Technology. INFOTHECA – Journal of Informatics and Librarianship 8(1–2):179–187
[27] Kaufmann R, Bollhalder H, Gysi M (2003) Infrared positioning systems to identify the location of animals. In: Werner A, Jarfe A (eds) Joint conference of ECPA-ECPLF, ATB Agrartechnik Bornim / Wageningen Academic Publishers, p 721
[28] Kitasuka T, Hisazumi K, Nakanishi T, Fukuda A (2005) Positioning Technique of Wireless LAN Terminals Using RSSI between Terminals. In: Yang LT, Ma J, Takizawa M, Shih TK (eds) Proceedings of the 2005 International Conference on Pervasive Systems and Computing, PSC 2005, Las Vegas, Nevada, June 27–30, 2005, CSREA Press, pp 47–53
[29] Krohn A, Beigl M, Hazas M, Gellersen HW (2005) Using fine-grained infrared positioning to support the surface-based activities of mobile users. In: 25th International Conference on Distributed Computing Systems Workshops (ICDCS 2005 Workshops), 6–10 June 2005, Columbus, OH, USA, IEEE Computer Society, pp 463–468
[30] Kuipers B (1978) Modeling spatial knowledge. Cognitive Science 2:129–154
[31] Kurata Y, Egenhofer M (2007) The 9+-Intersection for Topological Relations between a Directed Line Segment and a Region. In: Gottfried B (ed) Proceedings of the 1st Workshop on Behaviour Monitoring and Interpretation (BMI’07), CEURS Proceedings, vol 296, pp 62–76
[32] Kushki A, Plataniotis KN, Venetsanopoulos AN (2008) Indoor Positioning with Wireless Local Area Networks (WLAN). In: Shekhar S, Xiong H (eds) Encyclopedia of GIS, Springer, pp 566–571
[33] Leiser D, Zilbershatz A (1989) The traveler – a computational model for spatial network learning. Environment and Behaviour 21(3):435–463
[34] Leonard JJ, Durrant-Whyte HF (1991) Simultaneous map building and localisation for an autonomous mobile robot. In: Proceedings of IEEE/RSJ International Workshop on Intelligent Robots and Systems, pp 1442–1447
[35] Levitt T, Lawton D (1990) Qualitative navigation for mobile robots. Artificial Intelligence 44(6):305–360
[36] Liu H, Darabi H, Banerjee P, Liu J (2007) Survey of wireless indoor positioning techniques and systems. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 37(6):1067–1080
[37] Martino-Saltzman D, Blasch B, Morris R, McNeal L (1991) Travel behaviour of nursing home residents perceived as wanderers and nonwanderers. Gerontologist 11:666–672
[38] McDermott D, Davis E (1984) Planning routes through uncertain territory. Artificial Intelligence 22(1):551–560
[39] Mea VD, Pittaro M, Roberto V (2004) Knowledge management and modelling in health care organizations: The standard operating procedures. In: Wimmer M (ed) Knowledge Management in Electronic Government, 5th IFIP International Working Conference, KMGov 2004, Krems, Austria, May 17–19, 2004, Proceedings, Springer, LNCS, vol 3035, pp 136–146
[40] Mennis JL, Peuquet DJ, Qian L (2000) A conceptual framework for incorporating cognitive principles into geographical database recognition. International Journal of Geographical Information Science 14:501–520
[41] Millonig A, Gartner G (2007) Monitoring pedestrian spatio-temporal behaviour. In: Gottfried B (ed) Proceedings of the 1st Workshop on Behaviour Monitoring and Interpretation (BMI’07), CEURS Proceedings, vol 296, pp 29–42
[42] Monferrer EMT, Lobo FT (2002) Qualitative Velocity. In: Escrig MT, Toledo F, Golobardes E (eds) CCIA 2002, Springer Berlin Heidelberg, LNAI, vol 2504, pp 29–39
[43] Muller P (1998) A qualitative theory of motion based on spatio-temporal primitives. In: Cohn A, Schubert L, Shapiro S (eds) Proceedings of the 6th International Conference on Principles of Knowledge Representation and Reasoning, Morgan Kaufmann, pp 131–141
[44] Orr RJ, Abowd GD (2000) The Smart Floor: A Mechanism for Natural User Identification and Tracking. In: Proceedings of the 2000 Conference on Human Factors in Computing Systems (CHI 2000), pp 1–6
[45] Peuquet DJ (1994) It’s about time: A conceptual framework for the representation of temporal dynamics in geographic information systems. Annals of the Association of American Geographers 84:441–461
[46] Randell DA, Cui Z, Cohn AG (1992) A spatial logic based on regions and connection. In: Proc 3rd Int. Conf. on Knowledge Representation and Reasoning, Cambridge, Massachusetts, USA, October 25–29, 1992, Morgan Kaufman, San Mateo, pp 165–176
[47] Richter KF, Tomko M, Winter S (2008) A Dialog-Driven Process of Generating Route Directions. Computers, Environment and Urban Systems 32(3):26
[48] Satoh I (2007) A location model for smart environments. Pervasive and Mobile Computing 3:158–179
[49] Steinhage A, Lauterbach C (2008) Monitoring movement behaviour by means of a large area proximity sensor array in the floor. In: Gottfried B, Aghajan H (eds) 2nd Workshop on Behaviour Monitoring and Interpretation (BMI’08), CEURS Proceedings, vol 396, pp 15–27
[50] Volcic R, Kappers AML (2008) Allocentric and egocentric reference frames in the processing of three-dimensional haptic space. Experimental Brain Research 188:199–213
[51] Weakliam J, Bertolotto M, Wilson DC (2005) Implicit interaction profiling for recommending spatial content. In: Shahabi C, Boucelma O (eds) 13th ACM International Workshop on Geographic Information Systems, ACM-GIS 2005, November 4–5, 2005, Bremen, Germany, Proceedings, ACM, pp 285–294
[52] Weinstein R (2005) RFID: a technical overview and its application to the enterprise. IT Professional 7(3):27–33
[53] Winter S (2003) Route Adaptive Selection of Salient Features. In: Kuhn W, Worboys M, Timpf S (eds) Spatial Information Theory: Foundations of Geographic Information Science, Int. Conference COSIT 2003, LNCS, Springer-Verlag, Ittingen, Switzerland, pp 101–117
[54] Wood Z, Galton A (2008) Collectives and how they move: A tale of two classifications. In: Gottfried B, Aghajan H (eds) 2nd Workshop on Behaviour Monitoring and Interpretation (BMI’08), CEURS Proceedings, vol 396, pp 57–71
Tracking in Urban Environments Using Sensor Networks Based on Audio-Video Fusion Manish Kushwaha, Songhwai Oh, Isaac Amundson, Xenofon Koutsoukos, Akos Ledeczi
1 Introduction Heterogeneous sensor networks (HSNs) are gaining popularity in diverse fields, such as military surveillance, equipment monitoring, and target tracking [41]. They are natural steps in the evolution of wireless sensor networks (WSNs), driven by several factors. Increasingly, WSNs will need to support multiple, although not necessarily concurrent, applications. Different applications may require different resources. Some applications can make use of nodes with different capabilities. As the technology matures, new types of nodes will become available and existing deployments will be refreshed. Diverse nodes will need to coexist and support old and new applications. Furthermore, as WSNs are deployed for applications that observe more complex phenomena, multiple sensing modalities become necessary. Different sensors usually have different resource requirements in terms of processing power, memory capacity, or communication bandwidth. Instead of using a network of homogeneous devices supporting resource-intensive sensors, an HSN can have different nodes for different sensing tasks [22, 34, 10]. For example, at one end of the spectrum, low data-rate sensors measuring slowly changing physical parameters such as temperature, humidity or light require minimal resources, while at the other end even low-resolution video sensors require orders of magnitude more resources.
Tracking is one such application that can benefit from multiple sensing modalities [9]. If the moving target emits a sound signal, then both audio and video sensors can be utilized. These modalities can complement each other in the presence of high background noise that impairs the audio, or in the presence of visual clutter that handicaps the video. Additionally, tracking based on the fusion of audio-video data can improve the performance of audio-only or video-only approaches. Audio-video tracking can also provide cues for actuating the other modality. For example, visual steering information from a video sensor may be used to steer the audio sensor (microphone array) toward a moving target. Similarly, information from the audio sensors can be used to steer a pan-tilt-zoom camera toward a speaker. Although audio-visual tracking has been applied to smart videoconferencing applications [42], these do not use a wide-area distributed platform. The main challenge in target tracking is to find tracks from noisy observations, which requires solutions to both the data association and the state estimation problems. Our system employs a Markov Chain Monte Carlo Data Association (MCMCDA) algorithm for tracking. MCMCDA is a data-oriented, combinatorial optimization approach to the data association problem that avoids the enumeration of tracks and enables us to track an unknown number of targets in noisy urban environments. In this chapter, we describe our ongoing research in multimodal target tracking in urban environments utilizing an HSN of mote-class devices equipped with microphone arrays as audio sensors and embedded PCs equipped with web cameras as video sensors. The targets to be tracked are moving vehicles emitting engine noise. Our system has many components including audio processing, video processing, WSN middleware services, multimodal sensor fusion, and target tracking based on sequential Bayesian estimation and MCMCDA. While none of these is necessarily novel, their composition and implementation on an actual HSN requires addressing a number of significant challenges. Specifically, we have implemented audio beamforming on audio sensors utilizing an FPGA-based sensor board and evaluated its performance as well as its energy consumption. While we are using a standard motion detection algorithm on the video sensors, we have implemented post-processing filters that represent the video data in a format similar to the audio data, which enables seamless audio-video data fusion. Furthermore, we have extended time synchronization techniques to HSNs consisting of mote and PC networks. Finally, the main challenge we address is system integration, including making the system work on an actual platform in a realistic deployment. The chapter provides results gathered in an uncontrolled urban environment and presents a thorough evaluation, including a comparison of different fusion and tracking approaches. The rest of the chapter is organized as follows. Section 2 presents related work. In Section 3, we describe the overall system architecture. It is followed by the description of the audio processing in Section 4 and the video processing in Section 5. In Section 6, we present our time synchronization approach for HSNs. Multimodal target tracking is described in Section 7. Section 8 presents the experimental deployment and its evaluation. Finally, we conclude in Section 9.
2 Challenges and Related Work In this section, we present the challenges of multimodal sensor fusion and tracking, as well as various existing approaches. In multimodal multisensor fusion, the data may be fused at a variety of levels: the raw-data level, where the raw signals are fused; the feature level, where representative characteristics of the data are extracted and fused; and the decision level, where target estimates from each sensor are fused [17]. At successive levels more information may be lost, but in collaborative applications, such as WSN applications, the communication requirement of transmitting large amounts of data is reduced. Another significant challenge is sensor conflict, when different sensors report conflicting data. When sensor conflict is very high, fusion algorithms produce false or meaningless results [18]. Reasons for sensor conflict may be sensor locality, different sensor modalities, or sensor faults. If a sensor node is far from a target of interest, the data from that sensor will not be useful and will have higher variance. Different sensor modalities observe different physical phenomena; for example, audio and video sensors observe sound sources and moving objects, respectively. If a sound source is stationary or a moving target is silent, the two modalities might produce conflicting data. Also, different modalities are affected by different types of background noise. Finally, poor calibration and sudden changes in local conditions can also cause conflicting sensor data. Another classical problem in multitarget tracking is to find the track of each target from the noisy data. If the association of the sequence of data points with each target is known, multitarget tracking reduces to a set of state estimation problems. The data association problem is to find which data points are generated by which targets, or in other words, to associate each data point with either a target or noise. The rest of the section presents existing work in acoustic localization, video tracking, time synchronization and multimodal fusion.
Acoustic Localization An overview of the theoretical aspects of Time Difference of Arrival (TDOA) based acoustic source localization and beamforming is presented in [7]. Beamforming methods have successfully been applied to detect single or even multiple sources in noisy and reverberant environments. A maximum likelihood (ML) method for single target and an approximate maximum likelihood (AML) method for direction of arrival (DOA) based localization in reverberant environments are proposed in [7, 4]. In [1], an empirical study of collaborative acoustic source localization based on an implementation of the AML algorithm is shown. An alternative technique called accumulated correlation combines the speed of time-delay estimation with the accuracy of beamforming [5]. Time of Arrival (TOA) and TDOA based methods are proposed for collaborative source localization for wireless sensor network
applications [24]. A particle filtering based approach for acoustic source localization and tracking using beamforming is described in [39].
Video Tracking A simple approach to motion detection from video data is via frame differencing, which requires a robust background model. There exist a number of challenges for the estimation of robust background models [38], including gradual and sudden illumination changes, vacillating backgrounds, shadows, visual clutter, occlusion, etc. In practice, most simple motion detection algorithms perform poorly when faced with these challenges. Many adaptive background-modeling methods have been proposed to deal with them. The work in [13] models each pixel in a camera frame by an adaptive parametric mixture model of three Gaussian distributions. A Kalman filter to track the changes in background illumination for every pixel is used in [21]. An adaptive nonparametric Gaussian mixture model to address background modeling challenges is proposed in [36]. A kernel estimator for each pixel, with kernel exemplars from a moving window, is proposed in [11]. Other techniques using high-level processing to assist the background modeling have been proposed; for instance, the Wallflower tracker [38] circumvents some of these problems using high-level processing rather than tackling the inadequacies of the background model. The algorithm in [19] proposes an adaptive background modeling method based on the framework in [36]; the main differences lie in the update equations, the initialization method and the introduction of a shadow detection algorithm.
Time Synchronization Time synchronization in sensor networks has been studied extensively in the literature and several protocols have been proposed. Reference Broadcast Synchronization (RBS) [12] is a protocol for synchronizing a set of receivers to the arrival of a reference beacon. The Timing-sync Protocol for Sensor Networks (TPSN) is a sender-receiver protocol that uses the round-trip latency of a message to estimate the time a message takes along the communication pathway from sender to receiver. This time is then added to the sender timestamp and transmitted to the receiver to determine the clock offset between the two nodes. Elapsed Time on Arrival (ETA) [23] provides a set of application programming interfaces for an abstract time synchronization service. RITS [35] is an extension of ETA over multiple hops. It incorporates a set of routing services to efficiently pass sensor data to a network sink node for data fusion. In [15], mote-PC synchronization was achieved by connecting the GPIO ports of a mote and an iPAQ PDA using the mote-NIC model. Although this technique can achieve microsecond-accurate synchronization, it was implemented as a hardware modification rather than in software.
Multimodal Tracking A thorough introduction to multisensor data fusion, with focus on data fusion applications, process models, and architectures is provided in [17]. The paper surveys a number of related techniques, and reviews standardized terminology, mathematical techniques, and design principles for multisensor data fusion systems. A modular tracking architecture that combines several simple tracking algorithms is presented in [31]. Several simple and rapid algorithms run on different CPUs, and the tracking results from different modules are integrated using a Kalman filter. In [37], two different fusion architectures based on decentralized recursive (extended) Kalman filters are described for fusing outputs from independent audio and video trackers. Multimodal sensor fusion and tracking have been applied to smart videoconferencing and indoor tracking applications [44, 42, 3, 6]. In [44], data is obtained using multiple cameras and microphone arrays. The video data consist of pairs of image coordinates of features on each tracked object, and the audio data consist of TDOA for each microphone pair in the array. A particle filter is implemented for tracking. In [3], a graphical model based approach is taken using audiovisual data. The graphical model is designed for single target tracking using a single camera and a microphone pair. The audio data consist of the signals received at the microphones, and the video data consist of monochromatic video frames. Generative models are described for the audio-video data that explain the data in terms of the target state. An expectation-maximization (EM) algorithm is described for parameter learning and tracking, which also enables automatic calibration by learning the audio-video calibration parameters during each EM step. In [6], a probabilistic framework is described for multitarget tracking using audio and video data. The video data consist of plan-view images of the foreground. Plan-view images are projections of the images captured from different points-of-view to a 2D plane, usually the ground plane where the targets are moving [8]. Acoustic beamforms for a fixed length signal are taken as the audio data. A particle filter is implemented for tracking.
3 Architecture Figure 1 shows the system architecture. The audio sensors, consisting of MICAz motes with acoustic sensor boards equipped with a microphone array, form an IEEE 802.15.4 network. This network does not need to be connected; it can consist of multiple connected components as long as each of these components has a dedicated mote-PC gateway. The video sensors are based on Logitech QuickCam Pro 4000 cameras attached to OpenBrick-E Linux embedded PCs. These video sensors, along with the mote-PC gateways, the sensor fusion node and the reference broadcaster for time synchronization (see Section 6), are all PCs forming a peer-to-peer 802.11b wireless network. Figure 2 shows the conceptual layout of the sensor network. The audio sensors perform beamforming and transmit the audio detections to the corresponding mote-PC gateway utilizing a multi-hop message routing service [2].
Fig. 1 Architecture for the multimodal target tracking system. Here, τ denotes measurement timestamps, λ denotes sensor measurements (also called detection functions that are described in later sections), and t denotes time. The blocks shown inside the sensor fusion node are circular buffers that store timestamped measurements.
Fig. 2 Conceptual layout of sensor network for multimodal target tracking. Circles represent the audio sensors, camera silhouettes represent the video sensors and rectangles represent PCs.
The routing service also performs time translation of the detection timestamps. The video sensors run a motion detection algorithm and compute timestamped video detection functions. Both audio and video detections are routed to the central sensor fusion node. Time translation is also carried out in the 802.11b network utilizing an RBS-based time synchronization service. This approach ensures that sensor data arrives at the sensor fusion node with timestamps in a common timescale. On the sensor fusion node, the sensor measurements are stored in appropriate sensor buffers, one for each sensor. A sensor fusion scheduler triggers periodically and generates a fusion timestamp. The trigger is used to retrieve the sensor measurements from the sensor buffers with timestamps closest to the generated fusion timestamp. The
retrieved sensor measurements are then used for multimodal fusion and target tracking. In this study, we developed and compared target tracking algorithms based on both Sequential Bayesian Estimation (SBE) and MCMCDA. Note that the triggering mechanism of the scheduler decouples the target tracking rate from the audio and video sensing rates, which allows us to control the rate of the tracking application independently of the sensing rate.
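The buffering and scheduling step described above can be illustrated with a short Python sketch: each sensor stream is kept in a bounded buffer of timestamped measurements, and for every fusion timestamp the scheduler picks the measurement whose timestamp is closest to it. Class names and the buffer length are assumed values, not the deployed implementation.

from collections import deque

class SensorBuffer:
    """Bounded buffer of (timestamp, measurement) pairs for one sensor."""

    def __init__(self, maxlen=64):
        self.items = deque(maxlen=maxlen)

    def push(self, timestamp, measurement):
        self.items.append((timestamp, measurement))

    def closest(self, fusion_time):
        """Measurement whose timestamp is nearest to the fusion timestamp."""
        if not self.items:
            return None
        return min(self.items, key=lambda item: abs(item[0] - fusion_time))[1]

def fusion_trigger(buffers, fusion_time):
    """Collect one measurement per sensor for a single fusion round."""
    return {name: buf.closest(fusion_time) for name, buf in buffers.items()}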
4 Audio Beamforming Beamforming is a signal processing algorithm for DOA estimation of a signal source. In a typical beamforming array, each of the spatially separated microphones receives a time-delayed source signal. The amount of time delay at each microphone in the array depends on the microphone arrangement and the location of the source. A typical delay-and-sum single-source beamformer discretizes the sensing region into directions, or beams, and computes a beam energy for each of them. The beam energies are collectively called the beamform. The beam with maximum energy indicates the direction of the acoustic source.
Beamforming Algorithm The data-flow diagram of our beamformer is shown in Fig. 3. The amplified microphone signal is sampled at a high frequency (100 kHz) to provide high resolution for the time delay, which is required for the closely placed microphones. The raw signals are filtered to remove unwanted noise components. The signal is then fed to a tapped delay line (TDL), which has M different outputs to provide the required delays for each of the M beams. The delays are set by taking into consideration the exact relative positions of the microphones, so that the resulting beams are steered to the beam angles θ_i = i · 360/M degrees, for i = 0, 1, ..., M − 1. The signal is downsampled
Fig. 3 Data-flow diagram of the real-time beamforming sensor.
and the M beams are formed by adding the four delayed signals together. Data blocks are formed from the data streams (with a typical block length of 5–20 ms) and an FFT is computed for each block. The block power values, μ(θ_i), are smoothed by exponential averaging into the beam energy λ(θ_i):

λ^t(θ_i) = α λ^(t−1)(θ_i) + (1 − α) μ(θ_i)    (1)

where α is an averaging factor.
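As an illustration of this processing chain, the following Python sketch computes beam energies for a four-microphone delay-and-sum beamformer and smooths them with the exponential average of Equation (1). The sample rate, array geometry, smoothing factor and sign convention for the steering delays are assumptions; this is not the FPGA implementation.

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed

def beamform_block(block, mic_xy, fs, M=36, prev_energy=None, alpha=0.9):
    """Delay-and-sum beam energies for one block of filtered samples.

    block  : (num_samples, 4) array of microphone samples
    mic_xy : (4, 2) microphone coordinates in metres
    Returns the exponentially smoothed beam energies lambda(theta_i).
    """
    n, num_mics = block.shape
    energies = np.empty(M)
    for i in range(M):
        theta = np.deg2rad(i * 360.0 / M)
        direction = np.array([np.cos(theta), np.sin(theta)])
        # far-field plane-wave delays, in samples (steers the beam toward theta)
        delays = np.rint((mic_xy @ direction) / SPEED_OF_SOUND * fs).astype(int)
        delays -= delays.min()
        summed = np.zeros(n - delays.max())
        for m in range(num_mics):
            summed += block[delays[m]:delays[m] + len(summed), m]
        energies[i] = np.mean(summed ** 2)       # block power mu(theta_i)
    if prev_energy is None:
        return energies
    return alpha * prev_energy + (1.0 - alpha) * energies  # Equation (1)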
Audio Hardware In our application, the audio sensor is a MICAz mote with an onboard Xilinx XC3S1000 FPGA chip that is used to implement the beamformer [40]. The onboard Flash (4MB) and PSRAM (8MB) modules allow storing raw samples of several acoustic events. The board supports four independent analog channels, featuring an electret microphone each, sampled at up to 1 MS/s (million samples per second). A small beamforming array of four microphones arranged in a 10cm × 6cm rectangle is placed on the sensor node, as shown in Fig. 4(a). Since the distances between the microphones are small compared to the possible distances of sources, the sensors perform far-field beamforming. The sources are assumed to be on the same two-dimensional plane as the microphone array, thus it is sufficient to perform planar beamforming by dissecting the angular space into M equal angles, providing a resolution of 360/M degrees. In the experiments, the sensor boards are configured to perform simple delay-and-sum-type beamforming in real time with M = 36, and an angular resolution of 10 degrees. Finer resolution increases the communication requirements.
Fig. 4 (a) Sensor Node Showing the Microphones, (b) Beamform of acoustic source at a distance of 50 feet and an angle of 120 degrees.
Evaluation We test the beamformer node outdoors using recorded speech as the acoustic source. Measurements are taken by placing the source at distances of 3, 5, 10, 25, 50, 75, 100, and 150 feet from the sensor, and at an angle from −180° to +180° in 5° increments. Figure 4(b) shows the beamforming result for a single audio sensor when the source was at a distance of 50 feet. Mean DOA measurement error for 1800 human speech experiments is shown in Figure 5. The smallest error was recorded when the source was at a distance of 25 feet from the sensor. The error increases as the source moves closer to the sensor. This is because the beamforming algorithm assumes a planar wavefront, which holds for far-field sources but breaks down when the source is close to the sensor. However, as the distance between the source and sensor grows, error begins accumulating again as the signal-to-noise ratio decreases.
Fig. 5 Human speech DOA error for distances of 3, 5, 10, 25, 50, 75 feet between acoustic source and beamformer node.
Messages containing the audio detection functions require 83 bytes, and include the node ID, a sequence number, timestamps, and 72 bytes for the 36 beam energies. These data are transmitted through the network in a single message. The default TinyOS message size of 36 bytes was changed to 96 bytes to accommodate the entire audio message. The current implementation uses less than half of the total resources (logic cells, RAM blocks) of the selected mid-range FPGA device. The application runs at 20 MHz, which is relatively slow in this domain; the inherent parallel processing topology allows this slow speed. Nonetheless, the FPGA approach has a significant impact on the power budget: the sensor draws 130 mA of current (at 3.3 V), which is nearly an order of magnitude higher than typical wireless sensor node currents. Flash-based FPGAs and smart duty-cycling techniques are promising new directions in our research project for reducing the power requirements.
5 Video Tracking Video tracking systems seek to automatically detect moving targets and track their movement in a complex environment. Due to the inherent richness of the visual medium, video-based tracking typically requires a pre-processing step that focuses the attention of the system on regions of interest in order to reduce the complexity of data processing. This step is similar to the visual attention mechanism in human observers. Since the region of interest is primarily characterized by regions containing moving targets (in the context of target tracking), robust motion detection is the first step in video tracking. A simple approach to motion detection from video data is frame differencing. It compares each incoming frame with a background model and classifies the pixels of significant variation into the foreground. The foreground pixels are then processed for identification and tracking. The success of frame differencing depends on the robust extraction and maintenance of the background model. Performance of such techniques tends to degrade when there is significant camera motion, or when the scene has a significant amount of change.
Algorithm The dataflow in Figure 6 shows the motion detection algorithm and its components used in our tracking application. The first component is background-foreground segmentation
Fig. 6 Data-flow diagram of real-time motion detection algorithm
of the currently captured frame (I_t) from the camera. We use the algorithm described in [19]. This algorithm uses an adaptive background mixture model for real-time background and foreground estimation. The mixture method models each background pixel as a mixture of K Gaussian distributions. The algorithm provides multiple tunable parameters for desired performance. In order to reduce speckle noise and smooth the estimated foreground (F_t), the foreground is passed through a median filter. In our experiments, we use a median filter of size 3 × 3. Since our sensor fusion algorithm (Section 7) utilizes only the angle of moving targets, it is desirable and sufficient to represent the foreground in a simpler detection function. Similar to the beam angle concept in audio beamforming (Section 4), the field-of-view of the camera is divided into M equally-spaced angles

θ_i = θ_min + (i − 1) (θ_max − θ_min)/M,   i = 1, 2, ..., M    (2)
where θ_min and θ_max are the minimum and maximum field-of-view angles for the camera. The detection function value for each angle is simply the number of foreground pixels in that direction. Formally, the detection function for the video sensors can be defined as

λ(θ_i) = ∑_{j∈θ_i} ∑_{k=1}^{H} F(j, k),   i = 1, 2, ..., M    (3)

where F is the binary foreground image, H and W are the vertical and horizontal resolutions in pixels, respectively, and j ∈ θ_i indicates columns in the frame that fall within angle θ_i. Figure 7 shows a snapshot of motion detection.
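The following Python/OpenCV sketch shows one way to obtain the detection function of Equation (3) from a camera frame: an adaptive mixture-of-Gaussians background model, a 3 × 3 median filter, and a per-angle count of foreground pixels. The MOG2 variant and the parameter values are assumptions; the deployed system used the algorithm of [19].

import cv2
import numpy as np

M = 160                      # number of angles, as in Equation (2)
bg_model = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def video_detection(frame):
    """Return lambda(theta_i), i = 1..M, for one frame."""
    fg = bg_model.apply(frame)                 # background/foreground segmentation
    fg = cv2.medianBlur(fg, 3)                 # remove speckle noise
    binary = (fg > 0).astype(np.uint32)        # F(j, k)
    column_counts = binary.sum(axis=0)         # foreground pixels per image column
    # group the W columns into M equally spaced angular bins
    bins = np.array_split(column_counts, M)
    return np.array([b.sum() for b in bins])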
Fig. 7 Video detection. The frame on the left shows the input image, the frame in the middle shows the foreground and the frame on the right shows the video detection function.
Video Post-processing In our experiments, we gathered video data of vehicles from multiple sensors in an urban street setting. The data contained a number of real-life artifacts such as vacillating backgrounds, shadows, sunlight reflections and glint. The algorithm described above is not able to filter out such artifacts from the detections. We implemented two post-processing filters to improve the detection performance. The first filter removes any undesirable persistent background. For this purpose we keep a background of moving averages, which is subtracted from each detection. The background update and filter equations are

b^t(θ) = α b^(t−1)(θ) + (1 − α) λ^t(θ)    (4)

λ^t(θ) = λ^t(θ) − b^(t−1)(θ)    (5)
where b^t(θ) is the moving-average background and λ^t(θ) is the detection function at time t. The second filter removes any sharp spikes (typically caused by sunlight reflections and glint). For this we convolve the detection function with a small linear kernel to add a blurring effect, which reduces the effect of any sharp spikes in the detection function due to glints. The equation for this filter is

λ^t(θ) = λ^t(θ) ∗ k    (6)
where k is a 7 × 1 vector of equal weights, and ∗ denotes convolution. We implemented the motion detection algorithm using the OpenCV (open source computer vision) library. We use Linux PCs equipped with the QuickCam Pro 4000 as video sensors. The OpenBrick-E has a 533 MHz CPU, 128 MB RAM, and an 802.11b wireless adapter. The QuickCam Pro supports up to 640 × 480 pixel resolution and up to a 30 frames-per-second frame rate. Our motion detection algorithm implementation runs at 4 frames per second and 320 × 240 pixel resolution. The number of angles in Equation (2) is M = 160.
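A minimal NumPy sketch of the two post-processing filters follows; it mirrors Equations (4)–(6), with the averaging factor treated as an assumed parameter, and is not the deployed implementation.

import numpy as np

class DetectionPostProcessor:
    """Background removal (Eqs. 4-5) followed by spike smoothing (Eq. 6)."""

    def __init__(self, num_angles, alpha=0.95, kernel_size=7):
        self.alpha = alpha
        self.background = np.zeros(num_angles)            # b^t(theta)
        self.kernel = np.full(kernel_size, 1.0 / kernel_size)

    def __call__(self, detection):
        # subtract the previous (slowly adapting) background, as in Eq. (5)
        filtered = detection - self.background
        # then update the moving-average background, as in Eq. (4)
        self.background = (self.alpha * self.background
                           + (1.0 - self.alpha) * detection)
        # blur with a small linear kernel to suppress glint spikes, Eq. (6)
        return np.convolve(filtered, self.kernel, mode="same")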
6 Time Synchronization In order to seamlessly fuse time-dependent audio and video sensor data for tracking moving objects, participating nodes must have a common notion of time. Although several microsecond-accurate synchronization protocols have emerged for wireless sensor networks (e.g. [12, 14, 27, 35]), achieving accurate synchronization in a heterogeneous sensor network is not a trivial task.
6.1 Synchronization Methodology Attempting to synchronize the entire network using a single protocol will introduce a large amount of error. For example, TPSN [14], FTSP [27], and RITS [35] were all designed to run on the Berkeley Motes, and assume the operating system is tightly integrated with the radio stack. Attempting to use such protocols on an 802.11 PC network will result in poor synchronization accuracy because the necessary low-level hardware control is difficult to achieve. Reference Broadcast Synchronization (RBS) [12], although flexible when it comes to operating system and computing platform, is accurate only when all nodes have access to a common network medium. A combined network of motes and PCs will be difficult to synchronize using RBS because each platform uses different communication protocols and wireless frequencies. Instead, we adopted a hybrid approach [2], which pairs each network with the synchronization protocol that provides the most accuracy with the least amount of overhead. To synchronize the entire network, it is necessary for gateway nodes (i.e., nodes connecting multiple networks) to handle several synchronization mechanisms.
Mote Network We used Elapsed Time on Arrival (ETA) [23] to synchronize the mote network. ETA timestamps synchronization messages at transmit and receive time, thus removing the largest amount of nondeterministic message delay from the communi-
cation pathway. On the sender side, a timestamp is taken upon transmission of a designated synchronization byte, and placed at the end of the outgoing message. On the receiver side, a timestamp is taken upon arrival of the synchronization byte. The difference between these two timestamps is generally deterministic, and is based primarily on the bitrate of the transceivers and the size of the message. We selected ETA for the mote network due to its accuracy and low message overhead.
PC Network We used RBS to synchronize the PC network. RBS synchronizes a set of nodes to the arrival of a reference beacon. Participating nodes timestamp the arrival of a message broadcast over the network, and by exchanging these timestamps, neighboring nodes are able to maintain reference tables for timescale transformation. RBS minimizes synchronization error by taking advantage of the broadcast channel of wireless networks.
Mote-PC Network To synchronize a mote with a PC in software, we adopted the underlying methodology of ETA and applied it to serial communication. On the mote, a timestamp is taken upon transfer of a synchronization byte and inserted into the outgoing message. On the PC, a timestamp is taken immediately after the UART issues the interrupt, and the PC regards the difference between these two timestamps as the PC-mote offset. The serial communication bit rate between the mote and PC is 57600 baud, which amounts to a transfer time of approximately 139 microseconds per byte. However, the UART will not issue an interrupt to the CPU until its 16-byte buffer nears capacity or a timeout occurs. Because the synchronization message is six bytes, the reception time in this case will consist of the transfer time of the entire message in addition to the timeout and the time it takes the CPU to transfer the data from the UART buffer into main memory. This time is compensated for by the receiver, and the clock offset between the two devices is determined as the difference between the PC receive time and the mote transmit time.
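The offset computation can be summarized in a few lines of Python. The constants follow the description above (57600 baud, roughly 139 μs per byte, a 6-byte message), but the UART timeout value and the exact compensation used on the PCs are assumptions, so this is only a sketch of the idea.

BAUD = 57600
BYTE_TIME_US = 8 * 1e6 / BAUD          # ~139 us per byte, as stated in the text
MSG_BYTES = 6                          # synchronization message length
UART_TIMEOUT_US = 0.0                  # assumed; platform dependent

def pc_mote_offset(mote_tx_timestamp_us, pc_rx_timestamp_us):
    """ETA-style clock offset between PC and mote over the serial link.

    The receive timestamp is taken after the UART interrupt, so the transfer
    time of the whole message plus the (assumed) timeout is compensated
    before the offset is computed.
    """
    reception_delay = MSG_BYTES * BYTE_TIME_US + UART_TIMEOUT_US
    corrected_rx = pc_rx_timestamp_us - reception_delay
    return corrected_rx - mote_tx_timestamp_us

# converting a mote event timestamp into the PC timescale:
# pc_time = mote_event_time + pc_mote_offset(mote_tx, pc_rx)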
6.2 Evaluation of HSN Time Synchronization Our HSN contains three communication pathways: mote-mote, mote-PC, and PC-PC. We first examine the synchronization accuracy of each of these paths, then present the synchronization results for the entire network.
Mote Network We evaluated synchronization accuracy in the mote network with the pairwise difference method. Two nodes simultaneously timestamp the arrival of an event beacon, then forward the timestamps to a sink node two hops away. At each hop, the timestamps are converted to the local timescale. The synchronization error is the difference between the timestamps at the sink node. For 100 synchronizations, the average error is 5.04 μs, with a maximum of 9 μs.
PC Network We used a separate RBS transmitter to broadcast a reference beacon every ten seconds over 100 iterations. Synchronization error, determined using the pairwise difference method, was as low as 17.51 μs on average, and 2050.16 μs maximum. The worst-case error is significantly higher than reported in [12] because the OpenBrick-E wireless network interface controllers in our experimental setup are connected via USB, which has a default polling frequency of 1 kHz. For our tracking application this is acceptable, because we use a sampling rate of 4 Hz. Note that such a low sampling rate was selected due to bandwidth constraints and interference, not because of synchronization error.
Mote-PC Connection GPIO pins on the mote and PC were connected to an oscilloscope, and set high upon timestamping. The resulting output signals were captured and measured. The test was performed over 100 synchronizations, and the resulting error was 7.32 μs on average and did not exceed 10 μs. The majority of the error is due to jitter, both in the UART and the CPU. A technique to compensate for such jitter on the motes is presented in [27]; however, we did not attempt it on the PCs.
HSN We evaluated synchronization accuracy across the entire network using the pairwise difference method. Two motes timestamped the arrival of an event beacon, and forwarded the timestamp to the network sink, via one mote and two PCs. RBS beacons were broadcast at four-second intervals, and therefore clock skew compensation was unnecessary, because synchronization error due to clock skew would be insignificant compared with offset error. The average error over the 3-hop network was 101.52 μs, with a maximum of 1709 μs. The majority of this error is due to the polling delay from the USB wireless network controller. However, synchronization accuracy is still sufficient for audio and video sensor fusion at 4 Hz.
6.3 Synchronization Service The implementation used in these experiments was bundled into a time synchronization and routing service for sensor fusion applications. Figure 8 illustrates the interaction of each component within the service. The service interacts with sensing applications that run on the local PC, as well as with other service instances running on remote PCs. The service accepts timestamped event messages on a specific port, converts the embedded timestamp to the local timescale, and forwards the message toward the sensor-fusion node. The service uses reference broadcasts to maintain synchronization with the rest of the PC network. In addition, the service accepts mote-based event messages, and converts the embedded timestamps using the ETA-based serial timestamp synchronization method outlined above. Kernel modifications in the serial and wireless drivers were required in order to take accurate timestamps. Upon receipt of a designated synchronization byte, the time is recorded and passed up to the synchronization service in user space. The modifications are unobtrusive to other applications using the drivers, and the modules are easy to load into the kernel. The mote implementation uses the TimeStamping interface provided with the TinyOS distribution [25]. A modification was made to the UART interface to insert a transmission timestamp into the event message as it is being transmitted between the mote and PC. The timestamp is taken immediately before a synchronization byte is transmitted, then inserted at the end of the message.
Fig. 8 PC-based time synchronization service.
7 Multimodal Target Tracking This section describes the tracking algorithm and the approach for fusing the audio and video data based on sequential Bayesian estimation. We use the following notation: superscript t denotes discrete time (t ∈ Z+), subscript k ∈ {1, ..., K} denotes the sensor index, where K is the total number of sensors in the network, the target state at time t is denoted as x^(t), and the sensor data at time t is denoted as z^(t).
7.1 Sequential Bayesian Estimation

We use sequential Bayesian estimation to estimate the target state $x^{(t)}$ at time $t$, similar to the approach presented in [26]. In sequential Bayesian estimation, the target state is estimated by computing the posterior probability density $p(x^{(t+1)} \mid z^{(t+1)})$ using a Bayesian filter described by

$$p(x^{(t+1)} \mid z^{(t+1)}) \propto p(z^{(t+1)} \mid x^{(t+1)}) \int p(x^{(t+1)} \mid x^{(t)})\, p(x^{(t)} \mid z^{(t)})\, dx^{(t)} \qquad (7)$$

where $p(x^{(t)} \mid z^{(t)})$ is the prior density from the previous step, $p(z^{(t+1)} \mid x^{(t+1)})$ is the likelihood given the target state, and $p(x^{(t+1)} \mid x^{(t)})$ is the prediction for the target state $x^{(t+1)}$ given the current state $x^{(t)}$ according to a target motion model. Since we are tracking moving vehicles, it is reasonable to use a directional motion model based on the target velocity. The directional motion model is described by

$$x^{(t+1)} = x^{(t)} + v + U[-\delta, +\delta] \qquad (8)$$

where $x^{(t)}$ is the target state at time $t$, $x^{(t+1)}$ is the predicted state, $v$ is the target velocity, and $U[-\delta, +\delta]$ is a uniform random variable.

Because the sensor models (described later in Subsection 7.2) are nonlinear, the probability densities cannot be represented in closed form. It is therefore reasonable to use a nonparametric representation for the probability densities. The nonparametric densities are represented as discrete grids in 2D space, similar to [26]. With this representation, the integration term in Equation (7) becomes a convolution between the motion kernel and the prior distribution. The resolution of the grid is a trade-off between tracking resolution and computational capacity.
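As a concrete illustration of the grid-based filter, the sketch below performs one update of Equation (7): the prediction is a 2D convolution of the prior grid with a motion kernel derived from Equation (8), and the correction multiplies by a likelihood grid. The grid dimensions, kernel construction, and function names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.signal import convolve2d

def motion_kernel(vx, vy, delta, cell):
    """Kernel for the motion model x' = x + v + U[-delta, +delta]:
    uniform mass over a square of side 2*delta centred at the velocity
    offset (vx, vy), discretised to the grid cell size."""
    half = int(np.ceil(delta / cell))
    sx, sy = int(round(vx / cell)), int(round(vy / cell))
    reach = half + max(abs(sx), abs(sy))
    size = 2 * reach + 1
    k = np.zeros((size, size))
    cx = cy = reach  # kernel centre corresponds to zero displacement
    k[cy + sy - half:cy + sy + half + 1, cx + sx - half:cx + sx + half + 1] = 1.0
    return k / k.sum()

def bayes_update(prior, likelihood, kernel):
    """One step of grid-based sequential Bayesian estimation (Equation (7)):
    predict by convolving the prior with the motion kernel, then correct by
    multiplying with the likelihood grid and renormalising."""
    predicted = convolve2d(prior, kernel, mode="same", boundary="fill")
    posterior = predicted * likelihood
    return posterior / posterior.sum()

# Example on a 40 x 70 grid (0.5 m cells over a 20 m x 35 m region).
prior = np.full((40, 70), 1.0 / (40 * 70))        # uninformative prior
likelihood = np.random.rand(40, 70)               # stand-in sensor likelihood
kernel = motion_kernel(vx=2.0, vy=0.0, delta=1.0, cell=0.5)
posterior = bayes_update(prior, likelihood, kernel)
```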
Centralized Bayesian Estimation

Since we use resource-constrained mote-class sensors, centralized Bayesian estimation is a reasonable approach because of its modest computational requirements. The likelihood function in Equation (7) can be calculated either as a product or as a weighted sum of the individual likelihood functions. We describe the two methods next.
Product of Likelihood Functions

Let $p_k(z^{(t)} \mid x^{(t)})$ denote the likelihood function from sensor $k$. If the sensor observations are mutually independent conditioned on the target state, the likelihood functions from multiple sensors are combined as

$$p(z^{(t)} \mid x^{(t)}) = \prod_{k=1,\dots,K} p_k(z^{(t)} \mid x^{(t)}) \qquad (9)$$
Weighted-Sum of Likelihood Functions

An alternative approach to combining the likelihood functions is to compute their weighted sum. This approach allows us to give different weights to different sensor data. These weights can be used to incorporate sensor reliability and the quality of the sensor data. We define a quality index $q_k^{(t)}$ for each sensor $k$ as

$$q_k^{(t)} = r_k \max_{\theta} \big( \lambda_k^{(t)}(\theta) \big)$$

where $r_k$ is a measure of sensor reliability and $\lambda_k^{(t)}(\theta)$ is the sensor detection function. The combined likelihood function is given by

$$p(z^{(t)} \mid x^{(t)}) = \frac{\sum_{k=1,\dots,K} q_k^{(t)}\, p_k(z^{(t)} \mid x^{(t)})}{\sum_{k=1,\dots,K} q_k^{(t)}} \qquad (10)$$
We experimented with both methods in our evaluation. The product method produces more accurate results with low uncertainty in target state. The weighted-sum method performs better in cases with high sensor conflict, though it suffers from high uncertainty. The results are presented in Section 8.
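The two combination rules can be sketched over per-sensor likelihood grids as follows. The array shapes, the reliability values, and the use of the grid maximum as a stand-in for the detection-function peak are illustrative assumptions, not the deployed code.

```python
import numpy as np

def product_fusion(likelihoods):
    """Equation (9): element-wise product of per-sensor likelihood grids."""
    combined = np.ones_like(likelihoods[0])
    for lk in likelihoods:
        combined *= lk
    return combined / combined.sum()

def weighted_sum_fusion(likelihoods, reliabilities, detection_peaks):
    """Equation (10): quality-weighted sum, with q_k = r_k * max_theta lambda_k."""
    q = np.asarray(reliabilities) * np.asarray(detection_peaks)
    combined = sum(qk * lk for qk, lk in zip(q, likelihoods)) / q.sum()
    return combined / combined.sum()

# Example with three sensors on a small grid; the grid maxima stand in for
# the peaks of the corresponding detection functions.
grids = [np.random.rand(40, 70) for _ in range(3)]
p_prod = product_fusion(grids)
p_sum = weighted_sum_fusion(grids, reliabilities=[1.0, 0.8, 0.9],
                            detection_peaks=[g.max() for g in grids])
```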
Hybrid Bayesian Estimation

A major challenge in sensor fusion is accounting for conflicting sensor data. When sensor conflict is very high, sensor fusion algorithms produce false or meaningless fusion results [18]. Reasons for sensor conflict include sensor locality, differing sensor modalities, and sensor faults. Selecting and clustering the sensors into different groups based on locality or modality can mitigate the poor performance due to sensor conflict. For example, clustering the sensors close to the target and fusing the data from only the sensors in that cluster would remove the conflict caused by distant sensors.
The sensor network deployment in this paper is small, and the sensing region is comparable to the sensing ranges of the audio and video sensors. For this reason, we do not use locality-based clustering. However, we have multimodal sensors that can report conflicting data. Hence, we developed a hybrid Bayesian estimation framework [17] by clustering the sensors based on modality, and we compare it with the centralized approach. Figure 9 illustrates the framework.
Fig. 9 Hybrid Bayesian estimation framework.
The likelihood functions from the sensors in a cluster are fused using Equation (9) or (10). The combined likelihood is then used in Equation (7) to calculate the posterior density for that cluster. The posterior densities from all the clusters are then combined to estimate the target state. For hybrid Bayesian estimation with audio-video clustering, we compute the cluster likelihood functions using Equation (9) or (10). The audio posterior density is calculated as

$$p_{\text{audio}}(x^{(t+1)} \mid z^{(t+1)}) \propto p_{\text{audio}}(z^{(t+1)} \mid x^{(t+1)}) \int p(x^{(t+1)} \mid x^{(t)})\, p(x^{(t)} \mid z^{(t)})\, dx^{(t)}$$

while the video posterior density is calculated as

$$p_{\text{video}}(x^{(t+1)} \mid z^{(t+1)}) \propto p_{\text{video}}(z^{(t+1)} \mid x^{(t+1)}) \int p(x^{(t+1)} \mid x^{(t)})\, p(x^{(t)} \mid z^{(t)})\, dx^{(t)}$$

The two posterior densities are combined either by product fusion,

$$p(x^{(t+1)} \mid z^{(t+1)}) \propto p_{\text{audio}}(x^{(t+1)} \mid z^{(t+1)})\, p_{\text{video}}(x^{(t+1)} \mid z^{(t+1)})$$

or by weighted-sum fusion,

$$p(x^{(t+1)} \mid z^{(t+1)}) \propto \alpha\, p_{\text{audio}}(x^{(t+1)} \mid z^{(t+1)}) + (1-\alpha)\, p_{\text{video}}(x^{(t+1)} \mid z^{(t+1)})$$

where $\alpha$ is a weighting factor.
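A sketch of the hybrid combination step over the two cluster posteriors follows; the grid shapes and the α value are illustrative, and this is not the authors' implementation.

```python
import numpy as np

def hybrid_fusion(post_audio: np.ndarray, post_video: np.ndarray, alpha=None):
    """Combine the audio- and video-cluster posterior grids: product fusion
    when alpha is None, otherwise the weighted sum
    alpha * audio + (1 - alpha) * video."""
    if alpha is None:
        combined = post_audio * post_video
    else:
        combined = alpha * post_audio + (1.0 - alpha) * post_video
    return combined / combined.sum()

# Example: combine two normalised posterior grids with a 60/40 audio weighting.
pa = np.random.rand(40, 70); pa /= pa.sum()
pv = np.random.rand(40, 70); pv /= pv.sum()
p_hybrid = hybrid_fusion(pa, pv, alpha=0.6)
```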
7.2 Sensor Models

We use a nonparametric model for the audio sensors, while a parametric mixture-of-Gaussians model is used for the video sensors to mitigate the effect of sensor conflict in object detection.
Audio Sensor Model

The nonparametric DOA sensor model for a single audio sensor is the piecewise-linear interpolation of the audio detection function,

$$\lambda(\theta) = w\,\lambda(\theta_{i-1}) + (1-w)\,\lambda(\theta_i), \quad \text{if } \theta \in [\theta_{i-1}, \theta_i],$$

where $w = (\theta_i - \theta)/(\theta_i - \theta_{i-1})$.
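This interpolation amounts to one-dimensional linear interpolation over the beam angles, wrapped around 2π. A minimal sketch follows; the beam count and the beam values are chosen only for illustration.

```python
import numpy as np

def audio_detection(theta, beam_angles, beam_values):
    """Piecewise-linear interpolation of the audio detection function lambda
    over the beam angles, treating the angle axis as periodic (0..2*pi)."""
    # Append the first beam again at 2*pi so the interpolation wraps around.
    angles = np.append(beam_angles, beam_angles[0] + 2 * np.pi)
    values = np.append(beam_values, beam_values[0])
    return np.interp(np.mod(theta, 2 * np.pi), angles, values)

# Example: 36 beams (10-degree spacing), as in the experimental setup.
beam_angles = np.linspace(0, 2 * np.pi, 36, endpoint=False)
beam_values = np.random.rand(36)                  # stand-in beam energies
print(audio_detection(0.3, beam_angles, beam_values))
```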
Video Sensor Model

The video detection algorithm captures the angle of one or more moving objects. The detection function from Equation (3) can be parametrized as a mixture of Gaussians,

$$\lambda(\theta) = \sum_{i=1}^{n} a_i f_i(\theta)$$

where $n$ is the number of components, $f_i(\theta)$ is the probability density, and $a_i$ is the mixing proportion for component $i$. Each component is a Gaussian density, $f_i(\theta) = \mathcal{N}(\theta \mid \mu_i, \sigma_i^2)$, where the component parameters $\mu_i$, $\sigma_i^2$, and $a_i$ are calculated from the detection function.
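Evaluating the parametrized detection function is then a direct sum over the Gaussian components. The component parameters below are illustrative, not values from the deployed system.

```python
import numpy as np

def video_detection(theta, weights, means, stds):
    """Mixture-of-Gaussians video detection function
    lambda(theta) = sum_i a_i * N(theta | mu_i, sigma_i^2)."""
    theta = np.atleast_1d(theta)[:, None]                 # shape (T, 1)
    dens = np.exp(-0.5 * ((theta - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return (dens * weights).sum(axis=1)

# Example: two detected objects at roughly 30 and 75 degrees.
weights = np.array([0.6, 0.4])
means = np.radians([30.0, 75.0])
stds = np.radians([3.0, 5.0])
print(video_detection(np.radians([28.0, 60.0, 76.0]), weights, means, stds))
```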
Likelihood Function

The 2D search space is divided into $N$ rectangular cells with center points $(x_i, y_i)$, for $i = 1, 2, \dots, N$, as illustrated in Figure 10. The likelihood function value for the $k$th sensor at the $i$th cell is the average value of the detection function in that cell, given by

$$p_k(z \mid x) = p_k(x_i, y_i) = \frac{1}{\varphi_B^{(k,i)} - \varphi_A^{(k,i)}} \sum_{\varphi_A^{(k,i)} \le \theta \le \varphi_B^{(k,i)}} \lambda_k(\theta)$$
Fig. 10 Computation of the likelihood value for the kth sensor at the ith cell, $(x_i, y_i)$. The cell is centered at $P_0$ with vertices at $P_1$, $P_2$, $P_3$, and $P_4$. The angular interval subtended at the sensor by the ith cell is $\theta \in [\varphi_A^{(k,i)}, \varphi_B^{(k,i)}]$, or $\theta \in [0, 2\pi]$ if the sensor is inside the cell.
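A sketch of the per-cell likelihood computation follows. The angular bounds φ_A and φ_B would come from the cell geometry in Figure 10 and are passed in directly here, and the detection function used in the example is only a stand-in.

```python
import numpy as np

def cell_likelihood(detection_fn, phi_a, phi_b, n_samples=16):
    """Likelihood of the kth sensor for a cell: the average of the detection
    function over the angular interval [phi_a, phi_b] that the cell subtends
    at the sensor, approximated here by sampling the interval."""
    if np.isclose(phi_a, phi_b):                  # degenerate interval
        return float(detection_fn(np.array([phi_a]))[0])
    thetas = np.linspace(phi_a, phi_b, n_samples)
    return float(np.mean(detection_fn(thetas)))

# Example with a stand-in detection function peaked around 30 degrees, for a
# cell subtending 20 to 35 degrees as seen from sensor k.
detection = lambda th: np.exp(-0.5 * ((th - np.radians(30)) / np.radians(5)) ** 2)
print(cell_likelihood(detection, np.radians(20), np.radians(35)))
```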
7.3 Multiple-Target Tracking

The essence of the multi-target tracking problem is to find a track for each object from noisy measurements. If the sequence of measurements associated with each object were known, multi-target tracking would reduce to a set of state estimation problems, for which many efficient algorithms are available. Unfortunately, the association between measurements and objects is unknown. The data association problem is to work out which measurements were generated by which objects; more precisely, we require a partition of the measurements such that each element of the partition is a collection of measurements generated by a single object or by noise. Due to this data association problem, the complexity of the posterior distribution of the object states grows exponentially as time progresses. The data association problem is known to be NP-hard [32], so we do not expect to find efficient, exact algorithms for solving it.

In order to handle highly nonlinear and non-Gaussian dynamics and observations, a number of methods based on particle filters have recently been developed to track multiple objects in video [30, 20]. Although particle filters are highly effective in single-target tracking, they have been reported to perform poorly in multi-target tracking [20], because a fixed number of particles is insufficient to represent a posterior distribution whose complexity increases exponentially due to the data association problem. As shown in [20, 43], an efficient alternative is to use Markov chain Monte Carlo (MCMC) to handle the data association problem in multi-target tracking.

For our problem, there is an additional complexity: we do not assume that the number of objects is known. A single-scan approach, which updates the posterior based only on the current scan of measurements, can be used to track an unknown number
of targets with the help of trans-dimensional MCMC [43, 20] or a detection algorithm [30]. However, a single-scan approach cannot maintain tracks over long periods because it cannot revisit previous, possibly incorrect, association decisions in the light of new evidence. This issue can be addressed by using a multi-scan approach, which updates the posterior based on both current and past scans of measurements. The well-known multiple hypothesis tracking (MHT) algorithm [33] is a multi-scan tracker; however, it is not widely used due to its high computational complexity. A more recently developed algorithm, Markov chain Monte Carlo data association (MCMCDA), provides a computationally attractive alternative to MHT [29]. The simulation study in [29] showed that MCMCDA is computationally efficient compared to MHT with heuristics (i.e., pruning, gating, clustering, N-scan-back logic, and k-best hypotheses). In this chapter, we use the online version of MCMCDA to track multiple objects in a 2D plane. Due to space limitations, we omit the description of the algorithm and refer interested readers to [29].
8 Evaluation

In this section, we evaluate the target tracking algorithms based on sequential Bayesian estimation and MCMCDA.
8.1 Sequential Bayesian Estimation
Fig. 11 Experimental setup
The deployment of our multimodal target tracking system is shown in Figure 11. We deploy six audio sensors and three video sensors on either side of a road. The complex urban street environment presents many challenges, including gradual changes of illumination, sunlight reflections from windows, glints due to cars, high visual clutter due to swaying trees, and high background acoustic noise due
to construction, and acoustic multipath effects due to tall buildings. The objective of the system is to detect and track vehicles using both audio and video under these conditions.

Sensor localization and calibration are required for both the audio and video sensors. In our experimental setup, the sensor nodes are manually placed at marked locations and orientations. The audio sensors are placed on 1-meter-high tripods to minimize audio clutter near the ground. An accurate self-calibration technique, e.g. [16, 28], is desirable for a target tracking system.

Our experimental setup consists of two wireless networks, as described in Section 3. The mote network operates on channel 26 (2.480 GHz), while the 802.11b network operates on channel 6 (2.437 GHz). Both channels are non-overlapping and different from the infrastructure wireless network, which operates on channel 11 (2.462 GHz). We choose these non-overlapping channels to minimize interference and are able to achieve less than 2% packet loss. We gather audio and video detection data for a total duration of 43 minutes. Table 1 presents the parameter values that we use in our tracking system.

  Number of beams in audio beamforming, M_audio     36
  Number of angles in video detection, M_video      160
  Sensing region (meters)                           35 × 20
  Cell size (meters)                                0.5 × 0.5

Table 1 Parameters used in the experimental setup

We run our
sensor fusion and tracking system online using centralized sequential Bayesian estimation based on the product of likelihood functions. We also collect all the audio and video detection data for offline evaluation. This way, we are able to experiment with different fusion approaches on the same data set.

For the offline evaluations, we shortlist 10 vehicle tracks where there is only a single target in the sensing region. The average duration of the tracks is 4.25 s, with a 3.0 s minimum and a 5.5 s maximum. The tracked vehicles are part of an uncontrolled experiment; the vehicles are traveling on the road at a speed of 20-30 mph. Sequential Bayesian estimation requires a prior density for the target state. We initialize the prior density using a simple detection algorithm based on the audio data: if the maximum of the audio detection functions exceeds a threshold, we initialize the prior density based on the audio detection.

In our simulations, we experiment with eight different approaches. We use audio-only, video-only, and audio-video sensor data for sensor fusion. For each of these data sets, the likelihood is computed either as the weighted sum or as the product of the likelihood functions of the individual sensors. For the audio-video data, we use centralized and hybrid fusion. The different target tracking approaches are listed below.

1. audio-only, weighted-sum (AS)
2. video-only, weighted-sum (VS)
3. audio-video, centralized, weighted-sum (AVCS)
4. audio-video, hybrid, weighted-sum (AVHS)
5. audio-only, likelihood product (AP)
6. video-only, likelihood product (VP)
7. audio-video, centralized, likelihood product (AVCP)
8. audio-video, hybrid, likelihood product (AVHP)
The ground truth is estimated post facto based on the video recording from a separate camera. The standalone ground-truth camera is not part of any network and has the sole responsibility of recording the ground-truth video. For the evaluation of tracking accuracy, the center of mass of the vehicle is considered to be the true location. Figure 12 shows the tracking error for a representative vehicle track.
Fig. 12 Tracking error (a) weighted sum, (b) product
The tracking error when the audio data is used is consistently lower than when the video data is used. When we use both audio and video data, the tracking error is lower than with either modality alone. Figure 13 shows the determinant of the covariance of the target state for the same vehicle track.
Fig. 13 Tracking variance (a) weighted sum, (b) product
The covariance, which is an indicator of the uncertainty in the target state, is significantly lower for product fusion than for weighted-sum fusion. In general, the covariance for audio-only tracking is higher than for video-only tracking, while using both modalities lowers the uncertainty.

Figure 14(a) shows the tracking error in the case of fusion based on the weighted sum. The tracking error when using only video data shows a large value at time t = 1067 seconds. In this case, the video data has false peaks that do not correspond to the target. The audio fusion works well for this track. As expected, when we use both
Fig. 14 Tracking error (a) poor video tracking, (b) poor audio tracking
audio and video data together, the tracking error is decreased. Furthermore, Figure 14(a) shows the tracking error when using 6, 3, and 2 audio sensors for the centralized audio-video fusion. Using as few as two audio sensors can assist video tracking in disambiguating and removing false peaks. Similarly, when there is ambiguity in the audio data, the video sensors can assist and improve tracking performance. Figure 14(b) shows the tracking error for a track where the audio data is poor due to multiple sound sources. When fused with the video data, the tracking performance is improved. The figure shows the tracking performance when using two and three video sensors for the centralized and hybrid fusion.
Fig. 15 Tracking errors (a) weighted sum, (b) product
Figure 15 shows the average tracking errors for all ten vehicle tracks for all of the target tracking approaches mentioned above.
Fig. 16 (a) Average tracking errors and (b) average of determinant of covariance for all tracks
Figure 16(a) averages the tracking errors over all the tracks to compare the different tracking approaches. The audio and video modalities are each able to track vehicles successfully, though they suffer from poor performance in the presence of high background noise and clutter. In general, audio sensors are
able to track vehicles with good accuracy, but they suffer from high uncertainty and poor sensing range. Video tracking is not very robust in the presence of multiple targets and noise. As expected, fusing the two modalities consistently gives better performance. There are some cases where audio-only tracking performs better than fusion; this is due to the poor performance of video tracking in those cases. Fusion based on the product of likelihood functions gives better performance, but it is more vulnerable to sensor conflict and errors in sensor calibration. The weighted-sum approach is more robust to conflicts and sensor errors, but it suffers from high uncertainty. The centralized estimation framework consistently performs better than the hybrid framework.

Figure 17 shows the determinant of the covariance for all tracks and all approaches.
Fig. 17 Determinant of covariance (a) weighted sum, (b) product
Figure 16(b) presents the average covariance measure over all tracks to compare the performance of the tracking approaches. Among the modalities, the video sensors have lower uncertainty than the audio sensors. Among the fusion techniques, product fusion produces lower uncertainty, as expected. There was no definitive difference between the centralized and hybrid approaches, though the latter seems to produce lower uncertainty in the case of weighted-sum fusion. The average tracking error of 2 meters is reasonable considering that a vehicle is not a point source and that the cell size used in fusion is 0.5 meters.
8.2 MCMCDA

The audio and video data gathered for target tracking based on sequential Bayesian estimation are reused to evaluate target tracking based on MCMCDA. For the MCMCDA evaluation, we experiment with six different approaches. We use audio-only (A), video-only (V), and audio-video (AV) sensor data for sensor fusion. For each of these data sets, the likelihood is computed either as the weighted sum or as the product of the likelihood functions of the individual sensors.
8.2.1 Single Target

We shortlist 9 vehicle tracks with a single target in the sensing region. The average duration of the tracks is 3.75 s, with a 2.75 s minimum and a 4.5 s maximum. Figure 18 shows the target tracking results for two representative vehicle tracks. The figure also shows the raw observations obtained from the multimodal sensor fusion and peak detection algorithms.
Fig. 18 Target Tracking (a) no missed detection (b) with missed detections
Fig. 19 Tracking errors (a) weighted sum fusion, and (b) product fusion
Figure 19 shows the average tracking errors for all vehicle tracks for the weighted-sum and product fusion approaches. The missing bars indicate that the data association algorithm was not able to successfully estimate a track for the target. Figure 20 averages the tracking errors over all the tracks to compare the different tracking approaches; it also compares the performance of tracking based on sequential Bayesian estimation with MCMCDA-based tracking. The performance of MCMCDA is consistently better than that of sequential Bayesian estimation. Table 2 compares the average tracking errors and tracking success across likelihood fusion methods and sensor modalities. Tracking success is defined as the percentage of correct tracks that the algorithm is able to estimate. Table 3 shows the reduction in tracking error for audio-video fusion over the audio-only and video-only approaches.
Fig. 20 Average tracking errors for all estimated tracks. A comparison with sequential Bayesian estimation based tracking is also shown.
                    Average error (m)   Tracking success
Fusion    Summ            1.93                74%
          Prod            1.58                59%
Modality  Audio           1.49                89%
          Video           2.44                50%
          AV              1.34                61%

Table 2 Average tracking error and tracking success
For summation fusion, the audio-video fusion reduces the tracking error by an average of 0.26 m and 1.04 m relative to the audio-only and video-only approaches, respectively. The audio-video fusion improves accuracy for 57% and 75% of the tracks relative to the audio and video approaches, respectively; for the rest of the tracks, the tracking error either increased or remained the same. Similar results are presented for product fusion in Table 3. In general, audio-video fusion improves over either the audio-only or video-only approach, or both.
             Summ fusion                       Prod fusion
         Average error   Tracks            Average error   Tracks
         reduction (m)   improved          reduction (m)   improved
Audio         0.26         57%                  0.14         100%
Video         1.04         75%                  0.90          67%
Table 3 Average reduction in tracking error for AV over audio and video-only for all estimated tracks
The video cameras were placed at an angle along the road to maximize coverage of the road. This makes video tracking very sensitive to camera calibration errors and camera placement. Also, an occasional obstruction in front of a camera confused the tracking algorithm, which took a while to recover. An accurate self-calibration technique, e.g. [16, 28], is desirable for better performance of a target tracking system.
8.2.2 Multiple Targets

Many tracks with multiple moving vehicles in the sensing region were recorded during the experiment. Most of them have vehicles moving in the same direction; only a few tracks include multiple vehicles crossing each other.
Fig. 21 Multiple Target Tracking (a) XY plot (b) X-Coordinate with time
Figure 21 shows the multiple-target tracking result for three vehicles, two of which cross each other. Figure 21(a) shows the three tracks together with the ground truth, while Figure 21(b) shows the x-coordinate of the tracks over time. The average tracking errors for the three tracks are 1.29 m, 1.60 m, and 2.20 m. Figure 21 shows the result when only the video data from the three video sensors is used. Multiple-target tracking with the audio data could not distinguish between the targets when they cross each other. This is because beamforming is performed assuming that the acoustic signal is generated from
a single source. Acoustic beamforming methods exist for detecting and estimating multiple targets [7].
9 Conclusions

We have developed a multimodal tracking system for an HSN consisting of audio and video sensors. We presented various approaches for multimodal sensor fusion and two approaches for target tracking, based on sequential Bayesian estimation and on the MCMCDA algorithm. We evaluated the performance of the tracking system using an HSN of six mote-based audio sensors and three PC- and webcam-based video sensors, and compared the two tracking approaches. Time synchronization across the HSN enables the fusion of the sensor data. We deployed the HSN and evaluated its performance by tracking moving vehicles in an uncontrolled urban environment. We have shown that, in general, the fusion of audio and video data can improve tracking performance. Currently, our system is not robust to multiple acoustic sources or multiple moving targets. An accurate self-calibration technique and robust audio and video sensing algorithms for multiple targets are required for better performance. A related challenge is sensor conflict, which can degrade the performance of any fusion method and needs to be carefully considered. As in all sensor network applications, scalability is an important aspect that has to be considered as well.
References

[1] A. M. Ali, K. Yao, T. C. Collier, C. E. Taylor, D. T. Blumstein, and L. Girod. An empirical study of collaborative acoustic source localization. In IPSN '07: Proceedings of the 6th International Conference on Information Processing in Sensor Networks, pages 41–50, 2007.
[2] I. Amundson, B. Kusy, P. Volgyesi, X. Koutsoukos, and A. Ledeczi. Time synchronization in heterogeneous sensor networks. In International Conference on Distributed Computing in Sensor Networks (DCOSS 2008), 2008.
[3] M. J. Beal, N. Jojic, and H. Attias. A graphical model for audiovisual object tracking. Volume 25, pages 828–836, 2003.
[4] P. Bergamo, S. Asgari, H. Wang, D. Maniezzo, L. Yip, R. E. Hudson, K. Yao, and D. Estrin. Collaborative sensor networking towards real-time acoustical beamforming in free-space and limited reverberence. IEEE Transactions on Mobile Computing, volume 3, pages 211–224, 2004.
[5] S. T. Birchfield. A unifying framework for acoustic localization. In Proceedings of the 12th European Signal Processing Conference (EUSIPCO), Vienna, Austria, September 2004.
[6] N. Checka, K. Wilson, V. Rangarajan, and T. Darrell. A probabilistic framework for multi-modal multi-person tracking. In IEEE Workshop on Multi-Object Tracking, 2003.
[7] J. C. Chen, K. Yao, and R. E. Hudson. Acoustic source localization and beamforming: theory and practice. EURASIP Journal on Applied Signal Processing, pages 359–370, April 2003.
[8] T. Darrell, D. Demirdjian, N. Checka, and P. Felzenszwalb. Plan-view trajectory estimation with dense stereo background models. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), 2001.
[9] M. Ding, A. Terzis, I.-J. Wang, and D. Lucarelli. Multi-modal calibration of surveillance sensor networks. In Military Communications Conference (MILCOM 2006), 2006.
[10] E. Duarte-Melo and M. Liu. Analysis of energy consumption and lifetime of heterogeneous wireless sensor networks. In IEEE Globecom, 2002.
[11] A. Elgammal, D. Harwood, and L. Davis. Non-parametric model for background subtraction. In IEEE ICCV'99 Frame-Rate Workshop, 1999.
[12] J. Elson, L. Girod, and D. Estrin. Fine-grained network time synchronization using reference broadcasts. In Operating Systems Design and Implementation (OSDI), 2002.
[13] N. Friedman and S. Russell. Image segmentation in video sequences: A probabilistic approach. In Conference on Uncertainty in Artificial Intelligence, 1997.
[14] S. Ganeriwal, R. Kumar, and M. B. Srivastava. Timing-sync protocol for sensor networks. In ACM SenSys, 2003.
[15] L. Girod, V. Bychkovsky, J. Elson, and D. Estrin. Locating tiny sensors in time and space: A case study. In ICCD, 2002.
[16] L. Girod, M. Lukac, V. Trifa, and D. Estrin. The design and implementation of a self-calibrating distributed acoustic sensing platform. In SenSys '06, 2006.
[17] D. Hall and J. Llinas. An introduction to multisensor data fusion. Proceedings of the IEEE, volume 85, pages 6–23, 1997.
[18] H. Y. Hau and R. L. Kashyap. On the robustness of Dempster's rule of combination. In IEEE International Workshop on Tools for Artificial Intelligence, 1989.
[19] P. KaewTraKulPong and R. B. Jeremy. An improved adaptive background mixture model for realtime tracking with shadow detection. In Workshop on Advanced Video Based Surveillance Systems (AVBS), 2001.
[20] Z. Khan, T. Balch, and F. Dellaert. MCMC-based particle filtering for tracking a variable number of interacting targets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(11):1805–1819, Nov. 2005.
[21] D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russell. Towards robust automatic traffic scene analysis in realtime. In IEEE Conference on Decision and Control, 1994.
[22] R. Kumar, V. Tsiatsis, and M. B. Srivastava. Computation hierarchy for in-network processing. 2003.
[23] B. Kusy, P. Dutta, P. Levis, M. Maroti, A. Ledeczi, and D. Culler. Elapsed time on arrival: A simple and versatile primitive for time synchronization services. International Journal of Ad Hoc and Ubiquitous Computing, 2(1), January 2006.
[24] A. Ledeczi, A. Nadas, P. Volgyesi, G. Balogh, B. Kusy, J. Sallai, G. Pap, S. Dora, K. Molnar, M. Maroti, and G. Simon. Countersniper system for urban warfare. ACM Transactions on Sensor Networks, 1(2), 2005.
[25] P. Levis, S. Madden, D. Gay, J. Polastre, R. Szewczyk, A. Woo, E. Brewer, and D. Culler. The emergence of networking abstractions and techniques in TinyOS. In NSDI, 2004.
[26] J. Liu, J. Reich, and F. Zhao. Collaborative in-network processing for target tracking. EURASIP Journal on Applied Signal Processing, 2002.
[27] M. Maroti, B. Kusy, G. Simon, and A. Ledeczi. The flooding time synchronization protocol. In ACM SenSys, 2004.
[28] M. Meingast, M. Kushwaha, S. Oh, X. Koutsoukos, S. Sastry, and A. Ledeczi. Heterogeneous camera network localization using data fusion. In ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC-08), 2008.
[29] S. Oh, S. Russell, and S. Sastry. Markov chain Monte Carlo data association for general multiple-target tracking. IEEE Transactions on Automatic Control, 54(3):481–497, Mar. 2009.
[30] K. Okuma, A. Taleghani, N. de Freitas, J. Little, and D. Lowe. A boosted particle filter: Multitarget detection and tracking. In European Conference on Computer Vision, 2004.
[31] J. Piater and J. Crowley. Multi-modal tracking of interacting targets using Gaussian approximations. In IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2001.
[32] A. Poore. Multidimensional assignment and multitarget tracking. In I. J. Cox, P. Hansen, and B. Julesz, editors, Partitioning Data Sets, pages 169–196. American Mathematical Society, 1995.
[33] D. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control, 24(6):843–854, December 1979.
[34] S. Rhee, D. Seetharam, and S. Liu. Techniques for minimizing power consumption in low data-rate wireless sensor networks. 2004.
[35] J. Sallai, B. Kusy, A. Ledeczi, and P. Dutta. On the scalability of routing integrated time synchronization. In Workshop on Wireless Sensor Networks (EWSN), 2006.
[36] C. Stauffer and W. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
[37] N. Strobel, S. Spors, and R. Rabenstein. Joint audio video object localization and tracking. IEEE Signal Processing Magazine, 2001.
[38] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: Principles and practice of background maintenance. In IEEE International Conference on Computer Vision, 1999.
[39] J. Valin, F. Michaud, and J. Rouat. Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Robot. Auton. Syst., volume 55, pages 216–228, March 2007.
[40] P. Volgyesi, G. Balogh, A. Nadas, C. Nash, and A. Ledeczi. Shooter localization and weapon classification with soldier-wearable networked sensors. In MobiSys, 2007.
[41] M. Yarvis, N. Kushalnagar, H. Singh, A. Rangarajan, Y. Liu, and S. Singh. Exploiting heterogeneity in sensor networks. In IEEE INFOCOM, 2005.
[42] B. H. Yoshimi and G. S. Pingali. A multimodal speaker detection and tracking system for teleconferencing. In ACM Multimedia '02, 2002.
[43] T. Zhao and R. Nevatia. Tracking multiple humans in crowded environment. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[44] D. Zotkin, R. Duraiswami, H. Nanda, and L. S. Davis. Multimodal tracking for smart videoconferencing. In Proc. 2nd Int. Conference on Multimedia and Expo (ICME'01), 2001.
Multi-Camera Vision for Surveillance

Noriko Takemura and Hiroshi Ishiguro
1 Introduction

There is an ever-increasing demand for security monitoring systems in the modern world. Visual surveillance is one of the most promising areas in security monitoring for several reasons: it is easy to install, easy to repair, and the initial setup cost is low compared with other sensor-based monitoring systems such as audio sensors, motion detection systems, and thermal sensors. Video surveillance systems are installed in locations ranging from multinational banking organisations to public institutions to small local stores, and there is a similar disparity in the level of sophistication of the installed systems. Simple one-camera systems have been employed in shops to deter shoplifters for many years; however, there is an increasing trend towards more complicated systems in areas where the benefits inherent in the system outweigh the admittedly high cost of installation and the associated maintenance and usage overheads.

The use to which these systems are put also varies widely with the intentions of the operators and the budget available for the implementation of the system. In shops, the main use is, by and large, to observe and record vandals, shoplifters and other miscreants. In banks and public institutions, systems are used to monitor suspicious activity and, in the case of a break-in, to record the intruders. The requirements of surveillance systems vary according to their intended purposes. For example, if the system is intended to monitor the entrances and exits of a large building, it is necessary for the system to record clear footage over a wide area. However, if the intention is to record a small number of people within a more confined space, then
the system design will tend instead towards detailed observation of the subjects. In both cases it is important to be able to use the footage to track people in real time.

In recent years, with the continuing fall in the price of cameras and the increased durability of the systems, surveillance systems have been installed in more and more diverse locations. It is not uncommon, in developed nations, to find networked surveillance systems spread throughout large towns and cities, as well as along major transportation routes. Additionally, camera-based surveillance systems are increasingly within the price range of private citizens, leading to an increasing trend among security-conscious individuals to install camera systems in their homes and around their property. As camera systems decrease in price and the desired area under observation increases, it is becoming more and more important to implement automated, coordinated multi-camera systems, which can monitor areas with less and less need for the direct oversight of human operators, whose cost of employment will soon outpace the cost of installing and maintaining automated camera systems, if it has not already.

Surveillance cameras serve many purposes, with perhaps the most obvious being that the footage recorded by these cameras can be used as evidence in both criminal and civil proceedings. However, the very presence of the cameras serves as a deterrent to prospective criminals as well as assuaging the fears and worries of the general populace. Concern about security and public safety has increased manifold in recent years, due in large part to the heightened awareness of terrorism in the last decade. Video camera systems also have many uses outside of security surveillance: camera systems are used to monitor factory assembly lines and areas where direct human monitoring is unsafe (such as nuclear reactors, volcanoes and hydroelectric power plants), as well as being used in astronomy.

There are several different approaches to coordinated multi-camera systems. Examples include developing efficient coordinated PTZ (Pan-Tilt-Zoom) camera systems, optimal camera placement strategies, and role-sharing schemes for multi-camera systems. It is envisaged that efficient, cost-effective realization of such systems will further decrease the number of cameras required to implement surveillance, the amount of storage media required to record it, and the number of operators required to oversee the system.
2 Camera Systems

2.1 Fixed Camera

Due to their low cost and the ease with which they can be installed, standard monocular fixed cameras are widely used for security surveillance purposes. As the recorded location is static, it is easy for operators monitoring in real time to notice unusual
situations. The fact that the cameras are static also makes isolating the subjects from the background a relatively simple task to implement in software and, generally speaking, one which can be performed with a high degree of accuracy. However, fixed camera systems have some major disadvantages: if the system is not well designed, the monitored area may have large blind spots, and the number of cameras required grows quite large as the area under surveillance increases.

There are many types of fixed cameras in addition to standard monocular cameras. Stereoscopic cameras can generate 3D images by viewing the same scene with two cameras fixed at slightly different angles (similar to the way human vision works). Stereoscopic cameras are, however, much more expensive than standard monocular cameras. Omnidirectional cameras can generate panoramic images, either by using a standard camera and a curved mirror (in the case of hyperbolic and parabolic cameras, Fig. 1 and 2) or by using multiple cameras arranged so as to enable the creation of a panoramic image (Fig. 3). While the creation of panoramic images is certainly an advantage, as fewer cameras can be used to cover the same area as would be needed in a standard setup, there are some disadvantages to using omnidirectional cameras. When using omnidirectional cameras which need mirrors, the images must be converted before being used. Additionally, the use of curved mirrors reduces the maximum achievable resolution of the cameras. The purchase cost of these cameras is also higher because of the cost of machining the mirrors to sufficient precision. Omnidirectional cameras which rely upon multiple cameras are not as susceptible to these resolution issues, but as they feature multiple camera lenses, there is also additional cost associated with their use.
Fig. 1 Hyperbolic Omni-directional camera
Fig. 2 Parabolic ceiling-mounted omni-directional camera
Fig. 3 Omni-directional camera consisting of a number of aligned monocular cameras (Point Grey Research, LadyBug Recorder)
Fig. 4 Pan-Tilt-Zoom camera (SONY, EVI D100)
2.2 Active Camera

The most commonly used active cameras are PTZ cameras (Fig. 4), which can cover a wide area using a small number of cameras when compared with an equivalent fixed camera setup. Active camera systems are extremely flexible, and as such, control issues arise which are not seen in fixed camera setups. When using active cameras, it is necessary to address issues such as the camera role-sharing problem, the problem of handling the overlapping FOVs (fields of view) of the cameras, and the view planning problem. Through the use of view control, it is possible to realise an efficient visual surveillance system. The most obvious disadvantages of using active cameras are that the cameras are much more expensive than fixed cameras, and that the computational cost involved is significantly higher due to the need to determine the most efficient camera placement positions, the degree of role sharing which best suits the situation, and the optimal camera view, all of which can be seen as a consequence of the flexibility of active camera systems. Additionally, it is more difficult for operators monitoring such systems to spot abnormalities when the background is constantly changing.
2.3 Mixed System

Some systems, referred to as mixed systems, use a mixture of fixed and active cameras, aiming to overcome the pitfalls inherent in the two camera types and to maximise the advantages of each. In a mixed system, if the role division is successful, the overall efficiency and performance can rival that of a fixed camera system while also rivaling the flexibility of an active camera system. An example of such a system is one in which a number of omnidirectional or
wide-angle cameras are fixed in ceiling positions in order to isolate subjects from the background. Isolating subjects is, as mentioned above, a comparatively simple and computationally cheap exercise which can be performed with a high degree of accuracy. The use of fixed cameras allows the entire area to be continuously monitored, something which is difficult to implement efficiently with active cameras, and a small number of active cameras can be used to track discovered subjects in more detail. Continuous coverage of the entire area is useful because tracked subjects are persistent, i.e. they can be consistently tracked and will not disappear and reappear under normal circumstances. In an active environment it is difficult to determine whether a discovered subject is a new subject or a previously discovered one. By balancing the advantages of the two camera types, mixed systems can provide a system which is both robust and flexible.

One example of such a mixed system is presented in Zhou et al. [54], which can acquire detailed information about subjects at a distance. The system in question, called Distant Human Identification (DHID), is a mixed real-time system consisting of a stationary wide-angle camera and an active camera paired in a master-slave relationship. The stationary camera provides an undetailed view of the overall scene and, when it detects motion, compels its slave, the active camera, to provide a more detailed view of the subject in question. The central idea of this system is to allow a single unit to cover a wide area, while still allowing it to be detailed enough to perform biometric analysis of the subject under observation.
Fig. 5 An example of Zhou et al. [54]’s system. The master camera’s view can be seen above, with the slave’s view below.
2.4 Multi-modal System

Another approach to surveillance is to combine visual sensors with another type of sensor. This can be used to offset the weaknesses of a purely visual system, as well as to reduce cost. For example, security surveillance systems often augment their camera systems with laser tripwires, motion sensors, audio sensors, etc. Multi-modal systems are considered to be promising; however, their use increases the complexity of the system. It also raises new issues, for example the issue of integrating the sensor information from two or more modalities into a coherent model of the environment. In Nayak et al. [35], the authors assert that, for a variety of economic, technological or social reasons, it is often infeasible to cover the entire area being monitored with visual sensors, i.e. cameras. As such, they propose a multi-modal system consisting of both cameras and audio sensors.
2.5 Surveillance Issues

2.5.1 General Tracking of a Large Number of Subjects

In the case that the number of subjects being tracked exceeds the number of tracking cameras, the objective of the system is generally to monitor the whole area evenly and to consistently track the trajectories of as many subjects as possible using the minimum number of cameras. In such a scenario, planning the viewing directions of the cameras is important in order to maximise the system's ability to track the largest possible number of subjects.

Jung & Sukhatme [24] developed a system that coordinates multiple mobile robots to track multiple targets. A general urgency for the entire area under surveillance is calculated and distributed to the robots. In the authors' system, the evaluation of urgency is based on the current distribution of targets rather than on a prediction of future states. Miura & Shirai [34] attempt to solve a multi-camera multi-person tracking problem by parallelizing planning and action: a heuristic planning algorithm iteratively refines the assignment of persons to cameras, formulated as an anytime algorithm. Krishna et al. [31] developed a view planning algorithm for a multi-sensor surveillance system. In order to avoid a combinatorial explosion, sensors are dynamically prioritised based on the predicted coverage of targets. Coverage prediction is performed using statistical knowledge of the target distribution; however, individual subject motion is not predicted. A similar view planning algorithm is described in Takemura & Miura [44], in which the authors developed a multi-start local-search-based method built on a probability model of the subjects' motion, which allowed the system to track subjects
intermittently, a criterion which allows the cameras to have frequently shifting fixation points.
2.5.2 Detailed Tracking of a Small Number of Subjects

When the number of cameras exceeds the number of subjects being tracked, the objective of the surveillance system tends towards generating detailed profiles of the subjects under surveillance. Common goals for such systems are recognising and distinguishing individuals, recognising behaviour, and tracking subjects in full. In these systems, the method of assigning cameras to subjects is generally critical.

Karuppiah et al. [27] proposed a method of dynamically configuring multiple cameras so that a target can be tracked reliably, using a utility function that evaluates the measurement accuracy and the predictability of possible events. This work deals with the tracking of a small number of persons in a relatively small area. Horling et al. [17] dealt with cooperative vehicle monitoring by a distributed sensor network. The problem is formulated as a resource allocation problem in which the area to be monitored by each sensor and the information to be communicated are determined with consideration of sensor and communication uncertainties. Isler et al. [22] developed algorithms for assigning targets to multiple cameras so that the expected error in the target location estimate is minimized. These works treat the case where the number of cameras is relatively larger than the number of targets.
2.5.3 Tracking in Specific Environments

In some cases, coordinated camera systems are designed to be used in particular environments for specific purposes, such as soccer or ice hockey games [1][4][19]. The two main objectives of such systems are scene recognition and tracking the trajectories of all the players. When these objectives are realised, it is possible to automatically generate a highlight reel of the game and to analyse the strategies employed by the participating teams. One common method of implementing scene recognition is to make models of each scene based on the players' positions, the position of the ball and the camera angles, and to apply pattern recognition to determine what is happening in the scene. In these artificial environments the background tends to be simple, and as a result it is relatively easy to isolate subjects from the background. However, because the teams in team sports generally wear uniforms, it is very difficult to distinguish the players from each other, so their trajectories must be predicted using probability theory. Kang et al. [26] present a novel approach for continuous detection and tracking of moving objects observed by multiple stationary cameras. The tracking problem is addressed by simultaneously modeling the motion and appearance of the moving objects.
The performance of the proposed method is demonstrated on a soccer game captured by two stationary cameras.
2.6 Location and Subjects

The subjects under surveillance can greatly influence the design of the surveillance system. Generally speaking, systems which are intended to monitor large numbers of subjects tend to track in much lower detail than systems which are used with fewer subjects. However, this is not always the case, as it is easier to track a large number of subjects who can be easily distinguished than to track a smaller number of subjects in similar attire (e.g. members of a soccer team tend to wear team colours, making recognition more difficult).

Indoor locations are, generally speaking, easier to process than outdoor locations, as indoor locations tend to be controlled environments which do not experience the drastic changes in lighting conditions which occur outdoors. Factors such as the time of day, the weather and the season can greatly influence the lighting conditions in outdoor locations. These factors, plus the greater complexity usually found in outdoor backgrounds, make it much more difficult to isolate subjects from the background. Camera placement outdoors is also much more difficult than camera placement indoors. Outdoor cameras on the whole must be more robust and durable, as they must operate in adverse weather conditions and be able to withstand casual vandalism. In open outdoor environments it is often necessary to mount cameras on poles to provide good viewing angles and to protect them from vandalism; these poles incur further cost. Indoor cameras, on the other hand, are generally in controlled environments, which often have restricted access outside of normal working hours, reducing the need for vandalism prevention.

Confined locations with few entry and exit points are, for the most part, easier to monitor than open spaces. When there are only a small number of entry points, it is possible to continuously monitor all of them. This simplifies the processing greatly, as it can then be assumed that subjects cannot appear and disappear anywhere else in the system. The system can be taught awareness of "sinks" and "sources", as these entry points are referred to in [43]. In open areas the number of sinks and sources can be quite large, making this assumption difficult.
2.7 Occlusion

Occlusion is also an important issue. Occlusion, in terms of surveillance, occurs when something blocks the camera's view of a subject. There are, generally speaking, two kinds of occlusion: occlusion of a subject by a static element of the environment, and occlusion of a subject by another subject or a dynamic part of the environment. Several strategies exist for handling occlusion.
In the case of a subject being occluded by a static part of the environment, it is usually possible to prepare for and handle the occlusion. The simplest method in multi-camera setups is to attempt to minimize or remove any blind spots in the area through careful camera placement. Another approach, taken by Stauffer [43], is to define sinks and sources. These can be seen as places where subjects can exit (sinks) or enter (sources) the area under surveillance, and areas which are occluded, either by static elements of the environment or by other occlusions. Some examples of possible sinks and sources include doorways, tunnel entrances, buildings, trees, dynamic elements of the environment, areas of occlusion, or dirt on the camera lenses.
3 Case Studies

3.1 Real-time Cooperative Multi-target Tracking by Dense Communication among Active Vision Agents

In Ukita [45], the author describes a distributed system of active cameras which can track a large number of subjects; in the sense of the paper, a "large number" means that the number of subjects exceeds the number of cameras. The author's system consists of a number of cooperative "Active Vision Agents" (AVAs) which can change their behaviour dynamically depending on the situation. The AVAs communicate with each other and adapt their behaviour based on the received information. The system described is a development of an earlier system described in [46][47]; the ability to track large numbers of subjects did not exist in previous versions of the system.

Each AVA consists of a networked computer and a PTZ active camera. By distributing the AVAs throughout the environment, the author's system is able to generate 3D environmental models, allowing for continuous tracking of multiple subjects throughout a wide area. The system's distributed design necessitates that nodes communicate information about tracked objects, which leads to communication overhead. Previous versions of the system were unable to track more subjects than there were AVAs, because the implemented system required that at all times each subject be tracked by a single AVA. Also, each AVA only maintained persistent state information for the subject which it was currently tracking, which led to issues with insufficient target information.

In the newest version of the system, described in this paper, the author tackles these problems by dividing AVAs into two types: freelancer AVAs and Agencies. Freelancer AVAs search for subjects that enter the observation area and refer them to Agencies. Each Agency consists of a number of AVAs that cooperatively track referred subjects. Depending on the number of so-called member AVAs tracking a subject, an AVA may dynamically decide to track a different subject. If the subject which a member AVA is tracking leaves the observation area, depending on the
situation, the member AVA may choose to become a freelancer AVA or to join another Agency. The division into freelancer AVAs and Agencies allows the system to maintain a persistent state for a specific tracked object within the Agency, without overly increasing the network load, as each Agency shares the subject's state only within itself. The author asserts that the results of experiments demonstrate that the system is robust and can track a large number of subjects simultaneously in real time.
Fig. 6 The system described in [45]. The system can be abstracted into three distinct communication levels: Inter-Agency, Intra-Agency and Intra-AVA.
3.2 Tracking Multiple Occluding People by Localizing on Multiple Scene Planes

In Khan & Shah [28], a strategy for handling occlusion is described. The authors use a centralized system of cameras in which information from all of the cameras is collected and processed centrally. A commonly used approach to resolving occlusion is to use fully calibrated 2D camera views. In the system described by the authors, however, a planar homographic occupancy constraint was developed which merges information from several different cameras to resolve occlusions. The authors' system is designed to work in situations in which it is not possible to isolate the majority of the subjects at a given time: at any time, several of the subjects are likely to be partially or even fully occluded by other subjects.

Unlike other similar systems, Khan & Shah's approach does not rely on color or shape models of the subjects. Instead, after using standard techniques to isolate the
foreground (the subjects) from the background, a probabilistic model is used to determine the likelihood that a given pixel belongs to the foreground. Different camera views of the same scene are combined to generate a probabilistic model of which areas in the 3D scene are occupied. Areas in which the foreground has a high probability of being occupied are compared over a number of time instances in order to track subjects. The system requires that at least one reference plane be visible in all camera views in order to coordinate the different views; the authors feel that this is generally a valid assumption in typical surveillance scenarios. The authors use the Schmieder-Weathersby clutter measure to gauge scene clutter and graph cuts for detection and tracking.

The system does have the problem that there are still instances in which two subjects cannot be distinguished because of their proximity. The authors are of the opinion that the work could be extended in the future by integrating color maps of the subjects to distinguish subjects in this case. Additionally, it is stated that the inclusion of human shape models could reduce the number of false positives and increase the system's ability to localize subjects.
Fig. 7 A visual example of Khan & Shah's [28] system. As can be seen, the area is confined but crowded, yet the system can distinguish the subjects.
3.3 Applying a Form of Memory-based Attention Control to Activity Recognition at a Subway Station

MacDorman et al. [32] implemented a centralised, indoor, fixed-camera system for automated behaviour recognition in a subway station. The system used twelve ceiling-mounted wide-angle cameras, similar to those seen in Fig. 2, to generate panoramic views of the area under observation. The system was designed to recognise specific sets of behaviours and react accordingly through an automated real-time reporting system, which displayed reports on the internet. The system could recognise any combination of nine discrete behaviours, allowing for 512 possible distinct scenarios.

In the paper, the authors compare and contrast their system with another system, called ADVISOR. The authors' system is memory based and was trained, by human operators, to recognise specific behaviour types. ADVISOR, on the other hand, was model based, and MacDorman et al. are of the opinion that being model based makes ADVISOR much less flexible, citing the example that in order for it to recognise different behaviour types, the model would need to be changed. It is asserted that, in a model-based system, all behaviours must be known a priori; this requirement makes the system very inflexible and difficult to change. In contrast, the authors' system relies on online learning of target behaviours, which leads to a much more flexible system that can learn new behaviours if necessary.

One of the more interesting aspects of the paper is that the system uses an attention control mechanism to determine what information is important. An inherent limitation of the memory-based approach is that as the area being monitored increases, so does the amount of time required to learn new behaviours. The implemented attention control counters this by measuring the information gain for each pixel; the pixels with the highest gain are deemed to be worthy of attention. The authors cite examples from other domains which show the effectiveness of the method [50]. Additionally, the authors' experimental results show an improvement in all tasks for which attention control was used, with a significant improvement in certain tasks.

One particular issue that the authors cite is that the system finds it difficult to detect behaviour which is hard for the operator to define, because of the inherent ambiguity of an unclear definition. Additionally, the authors feel that human factors can have a large impact on the system's reliability. In tests of the authors' system on data from the ADVISOR system, it performed on average 5% worse than ADVISOR. However, the flexibility of a memory-based system over a model-based system may, in some circumstances, be worth the loss in accuracy.
Fig. 8 An example of the transformed composite image, generated from the cameras, that is used for processing by MacDorman et al. [32]
3.4 DMCtrac: Distributed Multi Camera Tracking Hoffmann et al. [14] developed a system, called DMCtrac (Distributed Multi Camera Tracking), which aims to build a distributed smart camera system capable of self organization. The authors define self organization to include self configuration (each camera adjusts its own FOV to search for and track objects), self optimization (each camera adjusts its own FOV to maximize the area under surveillance), self healing (if a camera becomes inactive, other cameras compensate by taking on its tasks), self protection (cameras perform functional monitoring) and anticipation (cameras detect and report critical events). DMCtrac is a development of the authors' earlier work [15] & [16], which consists of a wireless mesh network of smart cameras in which cameras connect to other cameras dynamically and spontaneously in order to allow for the completion of common goals (such as calibration, spatial partitioning, or object tracking). The authors envisage that the system would be useful for monitoring locations where a wide area needs to be covered and where it would be necessary to record high quality images of any vehicles or persons in the area. The system was designed to be robust and scalable.
DMCtrac focuses on distributed management and the specific networking issues which arise in vision networks. The algorithm used consists of a state machine with four possible camera states, MASTER, SLAVE, SEARCH and LOOK, as can be seen in Fig. 9. The initial state is the SEARCH state. When a camera is in the SEARCH state, it will search for possible targets to track. If it finds a target, it will switch to the MASTER state. Alternatively, if a neighboring MASTER node requests it to, it may switch to the LOOK state or, if all subjects are currently being tracked, it may switch to the SLAVE state. In the MASTER state, a camera will attempt to track the current subject. If it loses the subject, it will request neighboring nodes to LOOK for the object and will switch to the SLAVE state. In the SLAVE state, if it has not received messages from a MASTER node for a prolonged duration, it will switch to the SEARCH state, or if a MASTER node requests it to search for a lost subject, it will switch to the LOOK state. In the LOOK state, the camera searches for the requested subject. If the subject is found, it switches to the MASTER state; if the subject is not found after a specified time period, the node switches to the SEARCH state. Testing of the system by the authors indicates that the system is robust to packet loss, being able to handle up to 10% loss.
Fig. 9 The system described in [14]. DMCtrac consists of a state machine with four possible camera states, MASTER, SLAVE, SEARCH and LOOK
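The state transitions described above can be condensed into a small table. The sketch below is illustrative only; the event names are hypothetical and do not reflect DMCtrac's actual interface.

```python
from enum import Enum, auto

class State(Enum):
    SEARCH = auto()
    MASTER = auto()
    SLAVE = auto()
    LOOK = auto()

# (current state, event) -> next state, following the description in the text.
TRANSITIONS = {
    (State.SEARCH, "target_found"): State.MASTER,
    (State.SEARCH, "look_request"): State.LOOK,
    (State.SEARCH, "all_targets_tracked"): State.SLAVE,
    (State.MASTER, "target_lost"): State.SLAVE,   # after asking neighbours to LOOK
    (State.SLAVE, "master_timeout"): State.SEARCH,
    (State.SLAVE, "look_request"): State.LOOK,
    (State.LOOK, "target_found"): State.MASTER,
    (State.LOOK, "look_timeout"): State.SEARCH,
}

def next_state(state, event):
    """Return the next camera state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

# Example: next_state(State.SEARCH, "target_found") -> State.MASTER
```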
3.5 Abnormal behavior-detection using sequential syntactical classification in a network of clustered cameras Goshorn et al. [8] describe a system which uses a network of clustered cameras to detect abnormal behavior. The system described is based on earlier work by the authors, described in [9], [10] & [11]. In the authors' system, the area being monitored is subdivided into different clusters in order to minimize the required network bandwidth. A cluster is defined as a group of cameras which are in close physical proximity to each other and which have relatively similar FOVs. Detection of abnormal behaviors is done at both an intra cluster level (i.e. within a given cluster) and at an inter cluster level (i.e. at the network level). Normal behaviors are defined for both levels. Instead of informing the entire network when a subject leaves a cluster, each cluster informs only the adjacent clusters, which reduces the network traffic. In this system, every cluster has a number of Camera Selection Managers (CSMs) and a Camera Communication Manager (CCM). Several times a minute, each cluster chooses which of its cameras is most capable of observing a given subject, and that camera becomes a CSM; there is one CSM per subject per cluster. The selection of CSMs occurs in parallel and is distributed within the cluster, i.e. there is no central processor in the cluster. The CCM is tasked with handling handoffs between clusters, and is decided beforehand. Clusters are made aware of the CCMs of adjacent clusters, and when a subject leaves a cluster, neighboring clusters receive a descriptor for the subject from the cluster the subject has left. The adjacent clusters then determine a CSM for the subject. If a cluster observes the expected subject, it informs the other clusters that it has detected the subject, allowing them to free the created CSMs for other tasks. The system performs intra cluster behavior recognition by examining the actions of a CSM over time, generating a profile of normal behavior which can then be used to detect abnormal behaviour. Additionally, inter cluster behavioral recognition can be performed by comparing the actions of CCMs over time.
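The inter-cluster handoff can be sketched roughly as follows; the class and method names are invented for illustration and do not reflect Goshorn et al.'s implementation.

```python
class Cluster:
    """Illustrative cluster of co-located cameras with similar FOVs."""
    def __init__(self, name, neighbors=None):
        self.name = name
        self.neighbors = neighbors or []   # adjacent Cluster objects
        self.csms = {}                     # subject id -> selected camera (CSM)

    def select_best_camera(self, descriptor):
        # Placeholder for the periodic, distributed camera selection.
        return "camera-0"

    def subject_left(self, subject_id, descriptor):
        # Only adjacent clusters are informed, which keeps network traffic low.
        self.csms.pop(subject_id, None)
        for neighbor in self.neighbors:
            neighbor.expect_subject(subject_id, descriptor)

    def expect_subject(self, subject_id, descriptor):
        # Pre-assign a CSM for the expected subject (one CSM per subject).
        self.csms[subject_id] = self.select_best_camera(descriptor)

    def subject_observed(self, subject_id):
        # Once this cluster confirms the subject, adjacent clusters free
        # the CSMs they created for it.
        for neighbor in self.neighbors:
            neighbor.csms.pop(subject_id, None)
```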
4 Industry Systems One example of an installed industry system is the biometric identification system implemented in Narita and Kansai Airports in Japan in 2002, which is based on facial recognition and digital fingerprinting technology [55] developed by the Omron Corporation, similar to a simpler fingerprint system which was implemented at US borders several years ago. Another such example is the installation of surveillance cameras in the latest Shinkansen trains, the N700 series. These trains have a multitude of cameras installed, observing both the train entrances and the entrances to the drivers' compartments. Surveillance systems have also been implemented on public transport systems in other countries.
Other examples of industry systems are the Censys3D SDK (Software Development Kit) [56] produced by Point Grey, which is designed to provide users with accurate online tracking information through the use of a stereo vision camera system used to generate 3D trajectories, and IBM's S3 system. Perhaps the most prominent example of an industry surveillance system is IBM's Smart Surveillance System (S3). This system consists of the IBM Smart Surveillance Engine (IBM:SSE), which provides a software implementation of an event detection system for smart video surveillance, and the IBM Middleware for Large Scale Surveillance (IBM:MILS), which handles the data management services. The system is capable of tracking and classifying moving objects and of face tracking and detection, and can provide real time reporting as well as more complex analysis. There are unconfirmed reports that the US Government is considering implementing this system at the US/Mexican border. This system is described in great detail in [2], [3], [12], [38]-[41] & [51]. Advanced Telecommunications Research Institute International (ATR) has developed an environmental platform which provides human behavior information for various robotic services [57], [7], [25] & [36]. The structural environmental information is created based on the precise positions of humans, who are recorded moving in and out of buildings. This is thought to be an important issue in terms of the realization of robots capable of providing various services. A prototype of the environmental platform was built at UCW (Universal City Walk), Osaka.
5 Conclusion As mentioned above, visual surveillance systems still have many issues, such as camera coordination, image processing, view planning, camera direction, occlusion, subject recognition, behavior recognition, scene recognition, etc. However, the increasing market for visual surveillance, coupled with the increasing affordability and availability of ever more powerful computer equipment, is leading to a burgeoning market for automated surveillance equipment. The technology used in surveillance is becoming more and more sophisticated, which will, in all likelihood, lead to some of these issues becoming less significant in the next few years. In the interim, in most scenarios automated surveillance still requires that compromises be made. Some of these compromises are expected, such as the compromise between the price paid to implement the system and its sophistication. The research described herein attempts to tackle some of these problems using a variety of strategies; however, many of the systems are still quite simplistic and inflexible. It can be seen from the cited literature that it is very difficult to implement a flexible, robust, real time system. Surveillance is limited if it is implemented using purely visual information. Therefore it is thought that integrating other sensor modalities, for example audio sensors, laser range finders, acceleration sensors, thermal sensors, etc., would be of great value in surveillance systems.
6 Further Reading Readers interested in a more in depth overview of the development of Ukita's [45] system, and why its design decisions were made, are advised to read Ukita & Matsuyama's descriptions of earlier versions of the system, available in [46] and [47]. More detailed explanations of the system used by Khan & Shah [28] can be found in earlier work by the authors [29][30]. Systems which use similar strategies include BraMBLe [20], work by Zhao & Nevatia [52][53] and Okuma et al. [37]. The contrasts between MacDorman et al. [32] and the ADVISOR system can be understood more clearly by reading [42][5][33]. Other related work includes the VSAM system described by Collins et al. in [6]. In terms of the underlying system used by MacDorman et al., readers may find Ishiguro & Nishimura [21] of interest. The work described in Hoffmann et al. [14] is based on earlier work by the authors which is described in [15] & [16]: [15] describes the method used to implement the self organization and [16] describes a solution to the issues inherent in scaling a large network of active cameras. Goshorn et al.'s [8] system is a development of ongoing work by the authors, and more information can be found in [9], [10] & [11]. Another system which may be of interest to readers is presented in [49].
References
[1] R. Ando, K. Shinoda, S. Furui and T. Mochizuki: A robust scene recognition system for baseball broadcast using data-driven approach. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pp. 186-193 (2007)
[2] L. Brown, J. Connell, A. Hampapur, M. Lu, A. Senior, C. F. Shu and Y. Tian: IBM Smart Surveillance System (S3). In Proceedings of the International Conference on Advanced Video- and Signal-based Surveillance (2005)
[3] L. Brown, A. Senior, Y. Tian, J. Connell, A. Hampapur, C. Shu, H. Merkl and M. Lu: Performance evaluation of surveillance systems under varying conditions. IEEE Int'l Workshop on Performance Evaluation of Tracking and Surveillance (2005)
[4] Y. Cai, N. Freitas and J. J. Little: Robust visual tracking for multiple targets. European Conference on Computer Vision 2006, Part 4, LNCS 3954, pp. 107-118 (2006)
[5] N. Chleq, F. Bremond and M. Thonnat: Image understanding for prevention of vandalism in metro stations. In Advanced Video-based Surveillance Systems, C. S. Regazzoni, G. Fabri and G. Vernazza, Eds. Boston: Kluwer Academic Publishers, pp. 106-116 (1998)
[6] R. T. Collins, A. J. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, O. Hasegawa, P. Burt and L. Wixson: A system for video surveillance and monitoring: VSAM final report. CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University (2000)
[7] D. F. Glas, T. Miyashita, H. Ishiguro and N. Hagita: Laser Tracking of Human Body Motion Using Adaptive Shape Modelling. Proc. of IROS 2007 (2007)
[8] R. Goshorn, D. Goshorn, J. Goshorn and L. Goshorn: Abnormal behavior-detection using sequential syntactical classification in a network of clustered cameras. IEEE International Conference on Distributed Smart Cameras (2008)
[9] R. Goshorn, J. Goshorn, D. Goshorn and H. Aghajan: Architecture for cluster-based automated surveillance network for detecting and tracking multiple persons. IEEE International Conference on Distributed Smart Cameras (2007)
[10] R. Goshorn: Syntactical classification of extracted sequential spectral features adapted to priming selected interference cancelers. Ph.D. thesis, University of California, San Diego (2005)
[11] R. Goshorn: Sequential behavior classification using augmented grammars. M.S. thesis, University of California, San Diego (2001)
[12] A. Hampapur, L. Brown, J. Connell, A. Ekin, N. Haas, M. Lu, H. Merkl, S. Pankanti, A. Senior, C. F. Shu and Y. L. Tian: Smart Video Surveillance. IEEE Transactions on Signal Processing, Vol. 22, No. 2 (2005)
[13] I. Haritaoglu, D. Harwood and L. S. Davis: W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, pp. 809-830 (2000)
[14] M. Hoffmann, M. Wittke, Y. Bernard, R. Soleymani and J. Hahner: DMCTRAC: Distributed multi camera tracking. IEEE International Conference on Distributed Smart Cameras, pp. 1-10 (2008)
[15] M. Hoffmann, M. Wittke, J. Hahner and C. Muller-Schloer: Spatial partitioning in self organising camera systems. IEEE Journal of Selected Topics in Signal Processing, Vol. 2, No. 4 (2008)
[16] M. Hoffmann, J. Hahner and C. Muller-Schloer: Toward self-organising smart camera systems. Architecture of Computing Systems - ARCS 2008, pp. 220-231 (2008)
[17] B. Horling, R. Vincent, R. Mailler, J. Shen, R. Becker, K. Rawlins and V. Lesser: Distributed sensor network for real time tracking. In Proceedings of the 5th Int. Conf. on Autonomous Agents, pp. 49-54 (2001)
[18] T. Huang and S. Russell: Object identification in a Bayesian context. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (1997)
[19] N. Inamoto and H. Saito: Free viewpoint video synthesis and presentation from multiple sporting videos. Electronics and Communications in Japan (Part 3: Fundamental Electronic Science), Vol. 90, No. 2, pp. 40-49 (2007)
[20] M. Isard and A. Blake: Condensation - conditional density propagation for visual tracking. Int. Journal of Computer Vision, Vol. 29, No. 1, pp. 5-28 (1998)
[21] H. Ishiguro and T. Nishimura: View and motion-based aspect models for distributed omnidirectional vision systems. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1375-1380 (2001)
[22] V. Isler, S. Khanna, J. Spletzer and C. Taylor: Target tracking with distributed sensors: The focus of attention problem. Computer Vision and Image Understanding, Vol. 100, No. 1-2, pp. 225-247 (2005)
[23] Y. Ivanov, C. Stauffer, A. Bobick and W. E. L. Grimson: Video surveillance of interactions (1999)
[24] B. Jung and G. Sukhatme: A generalized region-based approach for multi-target tracking in outdoor environments. In Proceedings of the 2004 IEEE Int. Conf. on Robotics and Automation, pp. 2189-2195 (2004)
[25] T. Kanda, M. Shiomi, L. Perrin, T. Nomura, H. Ishiguro and N. Hagita: Analysis of People Trajectories with Ubiquitous Sensors in a Science Museum. Proc. of ICRA 2007, pp. 4846-4853 (2007)
[26] J. Kang, I. Cohen and G. Medioni: Tracking people in crowded scenes across multiple cameras. Asian Conference on Computer Vision (2004)
[27] D. Karuppiah, R. Grupen, A. Hanson and E. Riseman: Smart resource reconfiguration by exploiting dynamics in perceptual tasks. In Proceedings of the 2005 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 1854-1860 (2005)
[28] S. Khan and M. Shah: Tracking multiple occluding people by localizing on multiple scene planes. IEEE Transactions on Pattern Analysis and Machine Intelligence (2008)
[29] S. Khan and M. Shah: A multi-view approach to tracking people in crowded scenes using a planar homography constraint. In Proceedings of the 9th European Conference on Computer Vision, Vol. 4, pp. 133-146 (2006)
[30] S. Khan and M. Shah: Consistent labeling of tracked objects in multiple cameras with overlapping fields of view. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 10, pp. 1355-1360 (2003)
[31] K. Krishna, H. Hexmoor and S. Sogani: A t-step ahead constrained optimal target direction algorithm for a multi sensor surveillance system. In Proceedings of the 2005 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 1840-1845 (2005)
[32] K. F. MacDorman, H. Nobuta, S. Koizumi and H. Ishiguro: Memory-based attention control for activity recognition at a subway station. IEEE MultiMedia, Vol. 14, No. 2, pp. 38-49 (2007)
[33] G. Medioni, I. Cohen, F. Bremond and R. Nevatia: Event detection and analysis from video streams. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 8, pp. 873-888 (2001)
[34] J. Miura and Y. Shirai: Parallel scheduling of planning and action for realizing an efficient and reactive robotic system. In Proceedings of the 7th Int. Conf. on Control, Automation, Robotics and Vision, pp. 246-251 (2002)
[35] J. Nayak, L. G. Argueta, B. Song, A. R. Chowdhury and E. Tuncel: Multitarget tracking through opportunistic camera control in a resource constrained multimodal sensor network. IEEE International Conference on Distributed Smart Cameras (2008)
[36] H. Oike, T. Wada, T. Iizuka, H. Wu, T. Miyashita and N. Hagita: Detection and Tracking using Multi Color Target Models. Proc. of SPIE Optics East 2007 (2007)
[37] K. Okuma, A. Taleghani, N. de Freitas, J. Little and D. Lowe: A boosted particle filter: multitarget detection and tracking. In ECCV (2004)
[38] A. Senior, A. Hampapur, Y. L. Tian, L. Brown, S. Pankanti and R. Bolle: Appearance models for occlusion handling. Journal of Image and Vision Computing, Vol. 24, No. 11, pp. 1233-1243 (2006)
[39] A. Senior, S. Pankanti, A. Hampapur, L. Brown, Y. L. Tian, A. Ekin, J. Connell, C. F. Shu and M. Lu: Enabling video privacy through computer vision. IEEE Security and Privacy, Vol. 3, No. 5 (2005)
[40] A. Senior and A. Hampapur: Acquiring multi-scale images by pan-tilt-zoom control and automatic multi-camera calibration. Workshop on Applications of Computer Vision (2005)
[41] A. Senior, A. Hampapur, L. Brown, Y. L. Tian, C. F. Shu and S. Pankanti: Distributed active multi-camera networks. In Ambient Intelligence, P. Remagnino, G. L. Foresti and T. Ellis, Eds. Kluwer Publishing (2004)
[42] N. T. Siebel and S. J. Maybank: The ADVISOR visual surveillance system. In Proceedings of the ECCV Workshop on Applications of Computer Vision (2004)
[43] C. Stauffer: Estimating Tracking Sources and Sinks. Computer Vision and Pattern Recognition Workshop, Vol. 4, pp. 35-42 (2003)
[44] N. Takemura and J. Miura: View planning of multiple active cameras for wide area surveillance. International Conference on Robotics and Automation, pp. 3173-3179 (2007)
[45] N. Ukita: Real-time cooperative multi-target tracking by dense communication among Active Vision Agents. An International Journal, Vol. 5, No. 1, pp. 15-30 (2007)
[46] N. Ukita and T. Matsuyama: Real-time cooperative multi-target tracking by communicating Active Vision Agents. Computer Vision and Image Understanding, Vol. 97, No. 2, pp. 137-179 (2005)
[47] N. Ukita and T. Matsuyama: Real-time multi-target tracking by cooperative distributed active vision agents. In Proceedings of the 1st International Joint Conference on Autonomous Agents and Multi-Agent Systems, pp. 829-823 (2002)
[48] C. R. Wren, A. Azarbayejani, T. J. Darrell and A. P. Pentland: Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, pp. 780-785 (1997)
[49] T. Xiang and S. Gong: Video behaviour profiling and abnormality detection without manual labelling. IEEE International Conference on Computer Vision, Vol. 2, pp. 1238-1245 (2005)
[50] Y. Yang and J. O. Pedersen: A comparative study on feature selection in text categorization. International Conference on Machine Learning, pp. 412-420 (1997)
[51] Z. Zhang, G. Potamianos, A. Senior, S. Chu and T. Huang: A joint system for person tracking and face detection. In Proceedings of the HCI Workshop, ICCV (2005)
[52] T. Zhao and R. Nevatia: Tracking multiple humans in complex situations. IEEE TPAMI (2004)
[53] T. Zhao and R. Nevatia: Tracking multiple humans in crowded environment. In IEEE CVPR (2004)
[54] X. Zhou, R. T. Collins, T. Kanade and P. Metes: A Master-Slave system to acquire biometric imagery of humans at distance. First ACM SIGMM International Workshop on Video Surveillance, pp. 113-120 (2003)
[55] http://www.omron.com/r_d/coretech/vision/okao.html. Cited 23 Jan 2009
[56] http://www.ptgrey.com/products/censys3d/index.asp. Cited 23 Jan 2009
[57] http://www.irc.atr.jp/ptNetworkRobot/networkrobot-e.html. Cited 23 Jan 2009
Part III
Mobile and Pervasive Computing
Collaboration Support for Mobile Users in Ubiquitous Environments Babak A. Farshchian and Monica Divitini
Babak A. Farshchian, SINTEF ICT, Trondheim, Norway, e-mail: [email protected]
Monica Divitini, Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway, e-mail: [email protected]
1 Introduction The idea of supporting collaboration with computers goes back to the work done by Douglas C. Engelbart. In his seminal demonstration in 1968 [1], he introduced and demonstrated remote collaboration between two persons through sharing computer screens and using audio-visual communication channels over a network. However, it took almost twenty years before the vision introduced by Engelbart was taken further. At approximately the same time that ubiquitous computing research was taking shape in research laboratories around the world, CSCW (Computer Supported Cooperative Work) was also emerging as a research field concerned with the role of (networked) computers as mediators of the interactions taking place among people. CSCW can be described "as an endeavor to understand the nature and requirements of cooperative work with the objective of designing computer-based technologies for cooperative work arrangements." [2]. The first CSCW conference was held in 1986 in Austin, Texas, USA [3]. In the early years, the focus of research within CSCW was mainly on supporting geographically distributed actors. This effort often aimed at providing virtual spaces that could promote collaboration by promoting mutual awareness, communication, and coordination, even when lacking the everyday natural modalities of interaction. Though the designed support has been mainly intended for distributed workers, different studies of cooperative settings made apparent that collaboration is very "physical". Our actions and interactions in the physical space constitute an important resource for the smooth and seamless conduct of collaboration in small groups [4]. Other research efforts also pointed out the need to support the mobility of co-workers and the
risk of binding users to desktop applications that support distributed but not mobile work [5]. In addition, collaboration happens not only in meeting rooms but also in hallways, coffee corners, streets, libraries, etc. These properties, albeit obvious, have been shown to pose challenging requirements to support technology. Research in the field of Ambient Intelligence (AmI) and the related fields of ubiquitous and pervasive computing shares many of the goals of CSCW with respect to natural and seamless interaction with ICT. We will investigate these common goals in this chapter. Although CSCW researchers have traditionally acknowledged the role of physical embodiment in collaboration, limitations of technology have prevented researchers from creating solutions that can fully support these properties of collaboration. Ambient intelligence supports an important part of human behaviour in general, and collaboration in particular, through its focus on physical spaces, physical resources and their technological support. Our research agenda for ubiquitous collaboration is set to take advantage of research results in CSCW and AmI in order to create better solutions for supporting natural collaboration in co-located or distributed groups. In this chapter we will investigate some properties of collaborative work in small groups. We will provide a survey of some existing systems and platforms that support aspects of ubiquitous collaboration. We will later introduce the concept of a human grid, which we use to underpin our research on platforms for ubiquitous collaboration. We will outline UbiCollab, an open source platform that is being developed from the ground up with a focus on supporting distributed ubiquitous collaboration.
2 Characteristics of Collaboration Our focus in this chapter is on collaboration that happens in small groups of up to 10-15 people. Collaboration of course happens also in larger groups and organizations, but this type of small-scale collaboration has been studied extensively in CSCW, and its research problems are closely related to HCI and AmI. A number of theories have been proposed and used to analyze the subject of small groups using technology, and to propose improvements to such technological support. It is an ambitious goal to summarize and compare the various theories and we will not attempt a thorough analysis here. However, we believe it is easy to observe a set of characteristics that are identified in most collaboration theories and studies. The characteristics that we are interested in are those that lie directly in the intersection between CSCW and AmI. In this section we take a closer look at these characteristics.
2.1 Collaboration and Shared Context In day-to-day collaboration scenarios we can distinguish three types of information exchange mechanisms among collaborating parties [6]:
• Intentional communication: Information contained in explicit utterances that are exchanged between the participants. This includes both verbal and non-verbal (e.g. gesture) communication. • Feedthrough: Information that is mediated through artifacts involved in collaboration. For instance, rearranging documents on a table might give a specific signal to collaborating parties. • Consequential communication: Information that is generated due to the fact that our bodies and our body language are all visible to other parties. This information is used to create what is called awareness in CSCW literature. Awareness can be defined as "an understanding of the activities of others, which provides a context for your own activity" [7]. The quality and the effectiveness of collaboration depend on proper access to awareness information by all parties involved. Awareness information relates to what in AmI is referred to as context. Awareness information contributes to creating a shared context where sharing happens among collaborating parties. There are a number of theories that have been used to explain cooperative work and inform the design of CSCW systems (see [8] for a discussion of the role of theories in CSCW). These theories propose to structure collaboration according to different criteria. This kind of structuring means that parts of the awareness information (or shared context) become more relevant based on which theory one is considering and what the structure means to the users. Examples of such theories include: • Activity theory [9, 10]: According to this theory activities are the center of attention for collaborating individuals. An activity is an analytical collective phenomenon that consists of a subject (a person or a group with a motivation), an object (being produced or manipulated by the activity) and tools (used to mediate the relation between the subject and the object). Based on this model, shared context should emphasize the activities that the users are involved in, the users' relations to tools and objects, and the temporal development of these relations. • Social worlds [11]: Social worlds is a theory initially developed by the sociologist Anselm Strauss. The locale framework was later developed to bridge the social worlds theory of Strauss into a more informative theory for technology development [12]. According to these theories, collaboration is structured within social worlds, or locales, and is carried on in a series of actions embedded in interactions. Interactions are carried out by interactants (individuals or groups or societies) who have their own identity, biography and perspective. Collections of interactions are the basis for trajectories that guide the development of social worlds in various ways. • Distributed cognition [13]: A theory focusing on the representation of knowledge "both inside the heads of the individuals and in the world" [14]. This theory is a further development of cognitive theories which mainly focused on knowledge representation in the brain. The unit of analysis here is not an activity or a social world, but a cognitive system composed of individuals and artifacts.
In this perspective, distributed cognition treats collaboration from a systems theory perspective: all pieces in the cognitive system are equally important for successful completion of collaboration. • Situated action [15]: This theory is probably the one that imposes the least structure on shared context. The assumption is that any structuring or planning of a collaborative effort and related context is a rationalization that can be done afterwards and not a priori to the effort. The unit of analysis in situated action theory is a setting that is contained in an arena [14]. A setting is the relationship between an individual and his/her surroundings (the arena). This means that relevant context is totally dependent on the individual (or group of individuals) who is currently relating to available context. Context that is important in one setting might be of no value in other settings. The above theories are not the only ones, but illustrate the varying focus on shared context, or awareness information, that is demonstrated within CSCW. These theories, through defining a system of concepts such as activity, social world, cognitive system, and setting, put different emphasis on what is considered important context. Although the sources of shared context are the same, the way this collected information is grouped, presented and emphasized is different. In particular we can see the following differences in the theories: • Their emphasis on goal and purpose: Activity theory has the largest focus on goal and purpose. An activity cannot exist without a purposeful subject who has an agenda/goal. Social world theory has a focus on goals at a larger scale or higher level (i.e. related to culture, identity), while distributed cognition and situated action are unclear about the importance of goals [14]. Goals and purpose are similar to user intent in AmI [16] and are quite challenging to model or acquire as context. • Their flexibility in defining context structures: The situated action theory is the most flexible with respect to context. According to this theory the emphasis on the different parts of the context is relative to the individual and the setting, and cannot be defined a priori. Therefore frameworks and systems need to be highly dynamic and flexible with respect to their context models [17]. The other models provide various levels of predefined structuring of context, but none are agnostic with respect to this structuring. • Their emphasis on embodied action: This dimension will also be discussed later. Here it suffices to say that the different theories have different emphasis on how physically embodied the actions performed by individuals or groups are. Again, situated action theory seems to be completely embodied in the physical settings. This means the way the physical environment (the settings) behaves during interactions with individuals or groups is the main determinant of how well collaboration is performed. Distributed cognition is quite similar to situated action, while the other theories put increased emphasis on the non-physical, i.e. the goal, motivations, culture, identity etc. CSCW and AmI have clearly overlapping interests in studying context. In AmI, context often relates to information about an individual's situation rather than
information exchange among group members: "Context is any information that can be used to characterise the situation of an entity. An entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and applications themselves" [18]. While the information that constitutes context in AmI seems to be similar to awareness information in CSCW, the way the information is structured and used is often different. In CSCW, as demonstrated by the discussed theories, context is almost always relative to a group of people rather than individuals. Moreover, while in most AmI research context is used by the technological system to infer actions on behalf of the individual, in CSCW context is often used not only by the technological system but also by other group members [19].
2.2 Embodied Interactions and Artifacts as Resources One important characteristic of natural collaboration, similar to any type of natural interaction, is that it is embodied in 3D physical spaces. The importance of this embodying is widely acknowledged both in CSCW and in AmI. The seminal paper by Harrison and Dourish [20] discussed the importance of space (in three dimensional sense), and place as referring to how we use spaces as resources: “Space is the opportunity; place is the understood reality”. Some characteristics of 3D space according to [20] are: relational orientation and reciprocity, i.e. the fact that we share common points of reference in the same space (e.g. “down”, “up”, “left”); proximity and action, i.e. the fact that our actions are limited to our physical proximity; partitioning, i.e. the fact that distance and physical obstacles can be used to partition the physical room; and presence and awareness, the fact that physical space contains information about people and signs of their activities. Mark Weiser put equal emphasis on embodying in physical space in his definition of ubiquitous computing, and tried to differentiate it from virtual reality: “Indeed, the opposition between the notion of virtual reality and ubiquitous, invisible computing is so strong that some of us use the term ‘embodied virtuality’ to refer to the process of drawing computers out of their electronic shells. The virtuality of computer-readable data – all the different ways in which it can be altered, processed and analyzed – is brought into the physical world.” [21]. In fact, the majority of the research in AmI and ubicomp draws upon the same idea that users are located in a physical world constituted by 3D spaces. Not only the spaces we reside in, but also the artifacts contained in these spaces are essential resources for both CSCW and AmI, albeit with a different focus. In AmI, artifacts are largely regarded as specialized computer-human interaction resources that can provide greater degree of affordance and “calmness” than e.g. a desktop computer. Artifacts, such as “tabs, pads and boards” in [21] are used in AmI as a way of making technology disappear by making the interaction as natural as possible. A similar approach has been followed in HCI through e.g. tangible user interfaces [22].
In CSCW, artifacts play a widely acknowledged role as coordination mechanisms in small and large groups [23, 24], and as externalized knowledge [13], to name a few. All major CSCW theories include a notion of artifacts in different forms [14]. Affordance of physical artifacts, both at interaction and at collaboration level, has been studied. Physical artifacts are flexible, tangible, customizable, and portable [25]. In addition they improve collaboration by being predictable, supporting peripheral awareness, and enabling implicit communication [24].
2.3 Mobility of People and Resources Mobility of people and resources is a common concern for AmI and CSCW. The role that mobility of people plays in collaboration has been documented well in the CSCW research literature also before mobile technologies were widespread [5, 26, 27]. In a study of a product design team, Bellotti and Bly [26] could document that only around 10% of the studied group members’ time was spent at their desktop. The rest of the time was spent in meeting rooms, other colleagues’ offices, model shops etc. Luff and Heath observed that micro- and macro-mobility helped individuals in London Underground to “overhear and oversee the contributions of others, whilst rendering their own activities visible [to others]” [5]. The role of mobility in informal communication is also well-documented [27]. In short, people move frequently in order to meet other people or access resources that are not available locally. Being able to move freely while using technology is a prerequisite for making technology disappear. In fact, mobility research can be regarded as a precursor to AmI research [16] and mobility is regarded as one of the key enablers for AmI [28]. Mobility in AmI is related to both people and resources. A large amount of research in AmI and ubicomp is related to dynamically connecting people and resources as they move from one location to another.
2.4 Physical Distribution of People The original idea of CSCW was to provide technological support not only for colocated groups but also, and maybe most crucially, for geographically distributed groups. Physical distribution is a reality of today's businesses, and poses hard challenges to CSCW systems. The type of rich interaction that is possible in face-to-face scenarios is reduced considerably when using technology as a mediator [29]. Additionally, a number of studies show that informal communication, i.e. the type of unplanned opportunistic communication that is crucial for smooth coordination and cooperation, is severely reduced when people reside in geographically distributed spaces [26, 27].
Focus on physical distribution and related support for collaboration has been scarce in AmI research. Some research has studied how artifacts are used in shared spaces [30, 31], but this line of work does not consider how remote collaboration can be supported using shared artifacts.
2.5 Flexibility and the Need for Tailoring Flexibility generically refers to the property of a system that makes it modifiable to satisfy new or changing needs of its users. Flexibility is a core aspect of cooperation support and it has been studied in relation to different dimensions. No two groups work in the same way. There are variations in knowledge, processes, preferences, task complexity etc. Flexibility has been identified as a core issue in the adoption of media-spaces, i.e. systems providing a permanent audio and video link across two or more locations. Experiences with the adoption of an early system developed by Xerox show the importance of a system that can be tailored to adapt it to specific needs and conventions within a group of users [32]. This allows developing new functionalities that are relevant to the group, and developing and making explicit conventions about the group's accepted usage of the system, for example concerning privacy. Flexibility has also been recognized as core with respect to support for managing task interdependencies, see e.g. [23, 33]. In this perspective, flexibility allows adapting to local or global, temporary or permanent, changes in the way the work has to be coordinated. In addition there is a need to tailor to the preferences of individual users. This need is shared with many AmI applications that generally focus on the need for personalization. It is also important to point out that, especially in ubiquitous environments, systems must be flexible with respect to the technical context of use, since they operate in environments that might be very different in terms of offered services and that are highly dynamic [34]. Consider for example the technical complexity that is associated with the simple scenario of a person that is moving from one room to another, using a personal device to access services offered in the newly entered room. Given the need for flexibility along different dimensions, it is often impossible to identify a set of predefined configurations that users or groups can select from. This has led to re-thinking the way systems are constructed, supporting the development of systems as compositions of simpler services, often offering tools to empower users in the tailoring and programming of the support that they need. Seminal in this perspective in the area of CSCW is OVAL. This system allowed users to create their own configurations of components and in this way define radically tailorable groupware systems fitting their environments [35]. RAVE also supported tailorability by allowing users to re-define the behavior of buttons and share the developed code within groups of users [32]. More recently, different research efforts have focused on identifying basic concepts and services that can be composed by users to create complex applications, also addressing the specific needs
of end-user development of cooperative systems [36]. On the other hand, end user tailoring and programming has been identified as one of the major challenges in AmI as well [37, 38]. In this perspective, the interests of CSCW and AmI are rather complementary, the first focusing more on group issues and the second on flexibility with respect to the technical environment and on personalization.
3 Existing Technology in Support of Ubiquitous Collaboration The CSCW research field started when mobile and ubiquitous technologies were not commonplace. Although sociological and ethnographical studies of groups showed clearly the need for ubiquity and physicality, initial technologies were often bound to the desktop and tried to simulate natural interactions in different ways. For instance, a number of CSCW frameworks and systems tried to simulate the properties of the physical space in the form of virtual spaces [39, 40]. Additionally, the seminal work on media spaces at Xerox PARC [41] and other related work tried to use multimedia channels such as audio and video to extend these properties of physical space beyond physical distances. A number of early awareness systems used ubiquitous computing technologies to support informal impromptu communication [42]. As a result of the interest in applying sociological theories to design, a number of CSCW systems were informed by specific theories. Social worlds and the locale framework were used to implement Orbit [43], a flexible desktop-based groupware. Situated action, pointing out the problems connected to defining and formalizing plans, has inspired a number of systems that provide flexible means of interaction rather than attempting a formalization of work [44]. Key in supporting collaboration in these systems is the provision of means to promote awareness. For example, room-based groupware such as TeamRooms [45] and related shared workspace awareness systems [6] support a high level of flexibility and tailorability with respect to context modeling (albeit all in virtual rooms). Activity theory has also inspired research in CSCW and, to a more limited extent, the development of a number of systems, for example a technological framework called ABC [46] with support for ubiquity in hospital work. Within the area of AmI and ubicomp the issue of collaboration support has been addressed by a few systems. The Speakeasy platform and the Casca collaborative application built on top of it [47] supported peer-to-peer setup of group areas, and allowed users to connect to physical resources using resource discovery mechanisms. The Interactive Workspaces project [48] focused on physical resources such as shared displays and projectors in meeting rooms, and collaborative interaction with these. Another system called BEACH (Basic Environment for Active Collaboration with Hypermedia) [49] was used to implement the concept of roomware, i.e. room elements with integrated information technology such as interactive tables, walls, or chairs. A number of systems have also been developed lately focusing on the use of shared physical displays in public spaces and how these can support collocated or distributed collaboration [50, 51]. Experiences with this type
of systems pointed out that their design and deployment raise a number of challenges in terms of heterogeneity and dynamism of the environment, robustness and interaction techniques [34]. Physical interfaces have also been suggested as a way to promote embedded interaction, and phidgets (physical widgets) are provided to support their rapid development [52]. In the area of mobility we can see a recent trend toward making conventional groupware applications mobile. This can be regarded as a first step in the right direction, although most of these mobile groupware applications are merely miniaturized versions of their desktop ancestors. This means that specific requirements for mobility are yet to be supported. The use of location and the sharing of other simple context information are also supported in some of these systems [53]. The proliferation of mobile and ambient systems targeting groups shows an increasing interest in supporting ubiquitous collaboration. As outlined in the review above, there is an increasing awareness of the need to develop frameworks and platforms to provide developers and users with flexible and tailorable tools. The work described in the following sections falls within this line of research. UbiCollab is a platform for the development of ubiquitous applications to support cooperation among distributed users. The platform is based on the core notion of a Human Grid. The Human Grid is a unifying implementational concept that has been developed based on the study of existing theories and systems. This concept will be outlined in the next section, and an initial implementation of UbiCollab will be described using a simple example application.
4 Human Grid as a Unifying Concept for AmI and CSCW The concept of a human grid [54] constitutes our vision of ubiquitous collaboration. A human grid denotes a collection of (geographically distributed) users and the resources each of them has available in their physical vicinity. Figure 1 shows an instance of a typical human grid. A human grid exists for each collaborative activity that a group of users is engaged in. The individuals in a human grid are connected to each other using communication technologies. The resources each user wishes to use in the collaboration are also connected to the human grid using communication technologies. Interactions among the users are supported using the resources and the communication across physical spaces. The core part of a human grid is the collaboration instance (for short CI, shown in the middle of Figure 1) and its human participants. The CI acts as a shared context for collaboration. Information in a CI can be grouped and represented according to the needs and the habits of the users. A CI can contain shared information organized in different ways in order to support different views of collaboration. For instance, a CI can represent an activity, a social world, a locale, a cognitive system, or a setting. Participants of a CI will physically inhabit collaboration spaces (for short CS). Collaboration spaces represent physical spaces (such as offices, streets, homes) where CI participants reside in the course of their collaboration with other
participants. A CS can be used in a human grid as a resource for collaboration. A CI participant becomes an inhabitant of a CS when he/she physically enters that CS. It is up to the CI participants to decide how they want to represent and use their collaboration spaces in a human grid. Each CS can contain physical and digital resources that become available to the human grid once a CI participant inhabits a CS. Typical resources in an office environment include projectors, whiteboards, printers, and file servers. In a home environment resources might be TVs, media servers, electrical equipment such as lamps, heaters, etc. Although resources are owned by individual CI participants, a participant can allow other participants in the human grid to use and control his/her resources. For instance, in an office setting a participant can use projectors in multiple meeting rooms to show a presentation to all the CI participants. Resources in a human grid are deployed for collaboration using applications. Applications implement the collaboration logic necessary for each CI. A simple application in an office-related CI might be a slide presentation application that will use a set of resources such as projectors, microphones, speakers and file servers located in various CSs. In addition to implementing collaboration logic, applications are responsible for what we call publication and actualization in a human grid. Publication and actualization are the mechanisms that help implement shared context. Ambient information collected from various CSs (using resources as sensors) is published to a CI by applications. For instance, a presence application might publish
Fig. 1 The Human Grid
the state of a light sensor (a resource) in an office (a CS) into a CI. This published information can be used by other applications to actualize actions. In this case, the fact that the light is off might trigger an action such as locking the door. In the next section we will present a simple application that has been prototyped in order to demonstrate the human grid concept. This application is implemented on top of our experimental platform called UbiCollab. UbiCollab will be described later.
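The publication/actualization mechanism can be illustrated with a minimal sketch. The Python fragment below is not UbiCollab's actual API; all class, method and key names are hypothetical and chosen only to mirror the light-sensor example above.

```python
class CollaborationInstance:
    """Shared context for one collaborative activity (the CI)."""
    def __init__(self):
        self.context = {}        # published key -> value
        self.subscribers = []    # callbacks that actualize actions

    def publish(self, key, value):
        # An application publishes sensed context from a CS into the CI.
        self.context[key] = value
        for callback in self.subscribers:
            callback(key, value)

    def subscribe(self, callback):
        self.subscribers.append(callback)

ci = CollaborationInstance()

def lock_door_when_dark(key, value):
    # Another application actualizes an action from the shared context.
    if key == "office/light" and value == "off":
        print("Actualize: locking the office door")

ci.subscribe(lock_door_when_dark)

# A presence application publishes the state of a light sensor (a resource)
# located in the office (a CS) into the CI.
ci.publish("office/light", "off")
```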
4.1 The Example of UbiBuddy One of the most popular collaborative applications nowadays is the buddy list. A buddy list application provides a “live” list of close friends (buddies) and information about their online status (often called awareness, availability or presence information). As buddies go online and offline, the user is notified through peripheral notices. These notices can remind or motivate the user to initiate an informal, opportunistic communication with his/her buddies, or just keep an eye on their activities. Communication is often done using text messages (often called instant messages) and in recent versions more and more through audio and video channels. Each user can control his/her availability to others by personalizing the application. UbiBuddy (see Figure 2) is an enhanced buddy list application implemented using the human grid concept. UbiBuddy tries to address some shortcomings of available buddy list applications: • Limited support for collaboration context - Generally buddy list applications are built assuming flat social networks, i.e. individuals are either available to all their buddies or to none of them. UbiBuddy, on the other hand, allows individuals to define multiple buddy groups (in form of CIs) and switch among these groups. Each of these groups represents a separate human grid with specifically associated participants (buddies). Individuals can enter and exit buddy
Fig. 2 UbiBuddy integrates with physical artifacts and devices surrounding its user
groups instead of going totally online or offline. A user can e.g. decide to be online/available to a group of his/her colleagues in a specific project, and be offline/unavailable to all his/her other colleagues. • Limited support for mobility - Support for mobility in conventional buddy list applications normally means running the application on mobile devices. This in general means that the same exact functionality is provided through smaller keyboards and screens. In most cases the main advantage of providing peripheral notices about buddies is lost because the application is often not in continuous line of sight anymore (as opposed to when running on a desktop computer screen) but is in user’s pocket. Some of these applications also provide location awareness in terms of individuals’ positions fetched from their mobile device. In UbiBuddy mobility of individuals is supported by allowing participants in a buddy group to define collaboration spaces (CSs) using meaningful information. UbiBuddy users will see each other as being at “home” or in the “office” instead of as being a “dot on the map”. In addition, peripheral notices are supported through directing information in a buddy group to various resources available in a CS (see also next point). For instance changes in a user’s online status can be actualized by turning on/off a lamp in the living room (see Figure 2.C). • Lack of connection to the physical context - Conventional buddy list applications not only have a limited array of output channels for displaying presence and availability information, but also a limited set of input sources for collecting information about users’ context (often limited to keyboard and mouse activities). By using the concept of a resource in a human grid, UbiBuddy can use physical resources in users’ surroundings as both sensors and displays of presence information (Figure 2.C). For instance, turning the light off in a user’s bedroom can set the user’s availability to offline.
Fig. 3 UbiBuddy main window and context details window
UbiBuddy is implemented as an application on top of our experimental platform called UbiCollab. UbiBuddy’s GUI is very similar to a normal buddy list but with some significant differences: • The main overview window, which is shown when the user starts UbiBuddy (see left side of Figure 3) gives an overview of all the buddy groups that a user has set up, and an overview of who else is online in each group. The user can also use this window to manually control own availability in each group. • The buddy group details window, which is shown when the user clicks on a group in the main window. Once in this window, the user can see more details about the group such as a list of all participants, their location and the resources they have shared in the group. • The buddy details window (see right side of Figure 3), which allows a user to see more details about a specific buddy, e.g. what CS each buddy inhabits and which services each buddy has shared in that group. In addition, a number of generic UbiCollab tools and applications are integrated in UbiBuddy. These tools are common to all UbiCollab applications and allow individuals to create and manage CIs (see Figure 4), CSs, resources and applications. More information on these tools will be provided in the section on UC architecture. To summarize, UbiBuddy demonstrates the human grid concept by defining buddy groups as CIs, while physical resources are used as sources and displays of presence information in a manner true to AmI and ubicomp. Collaboration spaces are used to locate users and resources, and to publish and actualize presence information in the correct way. The logic to handle changes to the availability of buddies (e.g. going offline for colleagues when at home) is done by UbiBuddy itself, which is implemented in form of an application. In the following sections we will provide a short analysis, using UbiBuddy as an example, of how the concept of human grid supports the various characteristics of collaboration discussed earlier.
Fig. 4 CI Manager GUI
4.2 Shared Context in Human Grid Shared context in collaborative work can take different forms. In the case of UbiBuddy, for instance, shared context is limited to sharing of availability information about the members of a buddy group. Although the basic idea in most scenarios is to make information available to all collaborating parties, the nature of this information, the way the information is organized, presented etc. varies greatly. As we have discussed earlier, different collaboration theories (and the systems developed based on these theories) structure and emphasize shared context in different ways. The concept of a collaboration instance tries to capture this variety by providing the basic sharing mechanism without presuming specific structures. Beside some basic administrative information, applications have full control over what information is shared in a CI and how this information is structured/presented. CI in this way can be regarded as a common information space [55] in the sense that the content and the format of this content or “packaging” of the information is defined by the participants. Of course this puts greater demand on supporting the process of agreement and social construction of meaning among the participants. Additionally, flexibility is needed at the application level in order to easily manipulate such information.
4.3 Embodiment in Human Grid
Human grid supports embodiment both by defining the concept of a collaboration space (CS) and by allowing the inclusion of people and (physical) resources in these spaces. Using CSs helps promote embodiment in two ways. First, individuals are able to define meaningful representations of familiar spaces (e.g. "My home") and even create a shared meaning for a CS used in a human grid (e.g. "Babak's home"). Second, information about the whereabouts of individuals can be used to adapt system behavior, as envisioned in many AmI scenarios. For instance, in UbiBuddy, a participant's online status can automatically be set to offline whenever that participant inhabits the CS "My home". Resources in a human grid are also used to strengthen embodied action. In UbiBuddy, a participant in a buddy group might control his/her availability by turning the lights in his/her office on or off. Additionally, associating resources to collaboration spaces allows for some degree of automatic reconfiguration of available resources. This means that, in the example above, the light switch is a resource that might be available only in the office, while at home other resources are used to signal the same action (e.g. turning the TV on or off). This access to resources can be adjusted automatically, with information about the adjustment provided to the users.1
1 Note that in UbiCollab we also implement two default collaboration spaces called "Body" and "Web". The Body CS includes all resources that the user carries with her (e.g. mobile phone, portable music player) and that are normally available to the user everywhere. The Web CS includes all web services and similar online software services that are available on the internet, and thus can be used everywhere.
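To make the embodiment logic above concrete, the following is a minimal Java sketch of how an UbiBuddy-style availability rule could map the collaboration space a user currently inhabits to an online status. All class and method names are hypothetical illustrations and are not taken from the actual UbiBuddy implementation.

```java
// Hypothetical sketch: deriving availability from the current collaboration space.
import java.util.Map;
import java.util.Set;

enum Availability { ONLINE, OFFLINE }

final class AvailabilityRule {
    // Per buddy group: the collaboration spaces in which the user wants to appear offline.
    private final Map<String, Set<String>> offlineSpacesPerGroup;

    AvailabilityRule(Map<String, Set<String>> offlineSpacesPerGroup) {
        this.offlineSpacesPerGroup = offlineSpacesPerGroup;
    }

    // Called whenever the platform reports that the user has entered a new collaboration space.
    Availability availabilityFor(String buddyGroup, String currentSpace) {
        Set<String> offlineSpaces = offlineSpacesPerGroup.get(buddyGroup);
        if (offlineSpaces != null && offlineSpaces.contains(currentSpace)) {
            return Availability.OFFLINE;   // e.g. hide from "Colleagues" while in "My home"
        }
        return Availability.ONLINE;
    }
}
```

For example, a rule created with Map.of("Colleagues", Set.of("My home")) would report OFFLINE for the group "Colleagues" as soon as the platform locates the user in the CS "My home".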
4.4 Mobility in Human Grid
Mobility is supported by allowing users to move from one CS to another and continue to use their applications without any interruptions. Applications are associated with CIs. This means that, as long as individuals are participating in the same CI, their applications will have access to the proper resources in each collaboration space they enter as they move from one CS to another. In cases where CSs are predefined, such as "home" or "office", mobility in the human grid is similar to nomadic mobility, where collaboration happens in familiar spaces. In order to support more spontaneous mobility in unfamiliar spaces we have defined a special CS called "Body". The Body space is used to contain resources that the user carries with her/him, i.e. portable resources. Additionally, in our UbiCollab platform we are also working on supporting the rapid creation of temporary spaces that will further facilitate full mobility beyond nomadic scenarios.
4.5 Support for Physical Distribution in Human Grid
Geographic distribution is inherently supported in the human grid by assuming that participants in a CI inhabit collaboration spaces that are by default disjoint. Shared context contained in the CI is one way by which inhabitants of different CSs communicate with each other. Depending on the application area, the necessary shared context is distributed among CSs by applications publishing and actualizing information. Additionally, we assume that richer levels of communication will be supported by direct communication among the resources that reside in different CSs. For instance, a video camera in one CS might be connected to a TV screen in another CS. This kind of communication can potentially have a high level of dynamism and media richness. Moreover, resources in distant CSs can be controlled by remote participants. For instance, the inhabitant of one CS can use a screen in a remote CS in order to display a document. The way shared physical resources and artifacts are used in distributed collaboration is an open research topic. Empirical research and observations are needed in order to learn both how people will choose to utilize such possibilities and how the affordances of such resources need to change in order to better support these collaboration scenarios.
4.6 Flexibility and Tailoring in a Human Grid
The concept of the human grid is an acknowledgement of the fact that technology-based collaboration support needs to be flexible enough to fit into emerging and dynamic scenarios. A human grid does not make any strict assumptions about the type of collaboration that will happen. It presumes only the very basic properties
that are common to most types of collaboration and that are supported by the majority of the collaboration theories discussed earlier. A more detailed description of each of these properties is left to the specific application of the human grid in various collaboration domains. For example, the type of resources and spaces used in collaboration will vary from one human grid to another, and the information sharing needs of various groups will differ. On the other hand, the vision of ambient intelligence assumes a level of pro-activeness in order to relieve users of the routine tasks involved in everyday activities. To this end, the concept of the human grid provides a number of features that we might call "adaptability primitives" (i.e. the concepts of CI, CS, resources, and applications). These features do not implement pro-activeness per se, but allow pro-active applications to be implemented effectively. The underlying technological platform that implements the human grid concept will play a central role in providing this flexibility to application developers, while at the same time allowing the level of pro-activeness desired for each application domain. The following sections describe the overall architecture of our experimental platform, which is used to implement a proof-of-concept human grid environment.
5 Implementation of Human Grid: The UbiCollab Platform
The concept of the human grid is being implemented in the form of an experimental platform called UbiCollab.2 In UbiCollab, multiple human grids are realized inside a peer-to-peer network called a UbiNetwork (see Figure 5). Each peer in a UbiNetwork is called a UbiNode and belongs to one UbiCollab user. The UbiNode plays the role of an identifier for the user, and also contains the UbiCollab platform. UbiNodes are typically mobile devices (PDAs in the current implementation of UbiCollab). Stationary UbiNodes are also possible, and typically belong to users who are not mobile but who provide online community services to a UbiNetwork. Each UbiNode communicates directly with the resources in the physical space where its owner resides. Resources are discovered using UbiCollab's resource discovery mechanisms (see below). Resources are external to UbiCollab and use their own native communication protocols. They are therefore accessed within UbiCollab using Proxy Services (described shortly). Proxy Services are similar to drivers in an operating system (e.g. printer drivers). They run on the user's UbiNode and implement a uniform way of communicating with the various resources. UbiCollab's functionality is implemented in the form of independent services called Platform Services. Platform Services run on the user's UbiNode and communicate with their peers on other UbiNodes in the UbiNetwork.
2 www.ubicollab.org
5.1 UbiNode Overall Architecture
The UbiCollab platform is fully deployed on each UbiNode, as shown in Figure 6. The UbiCollab Platform Services are implemented as independent modules that reside in a container on the user's UbiNode (rectangles on the left side of the UbiNode in Figure 6). Independent here means that they can be installed and used on their own, without requiring the other Platform Services in order to function. Each of these Platform Services implements a part of UbiCollab's functionality. The pentagons on the right side of the UbiNode in Figure 6 denote the Proxy Services that each UbiCollab user will have to install in order to access external resources. The left side of the UbiNode is called the platform space, and the right side is called the user space. The current implementation of UbiCollab uses OSGi (Open Services Gateway initiative3) as its container. UbiCollab Platform Services are deployed as OSGi bundles (bundles are Java archive files that implement services in OSGi). OSGi supports dynamic deployment (i.e. runtime installation or upgrade of bundles) and provides an easy way to export services using web standards. Dynamic deployment means not only that new versions of Platform Services can be installed at runtime, but, more importantly, that Proxy Services can be installed on the fly as users encounter new resources in their physical surroundings. UbiCollab currently consists of the following Platform Services, each of them implementing a part of the human grid concept (a minimal bundle sketch follows the service list below):
Fig. 5 A sample UbiNetwork with four UbiNodes
3 www.osgi.org
• CI manager: Responsible for the creation and maintenance of Collaboration Instances.
• Session Manager: Responsible for implementing a runtime environment for the applications.
• CS manager: Responsible for the creation and maintenance of Collaboration Spaces.
• Service Domain (SD) manager: Responsible for maintaining and managing the user's Proxy Services.
• Resource Discovery (RD) manager: Responsible for finding external resources.
• Identity manager: Allows users to create multiple virtual identities to represent themselves in the platform and when using external resources.
The following sections briefly describe each of UbiCollab's Platform Services.
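As a minimal illustration of the OSGi-based deployment described above, the sketch below shows how a Platform Service could be published from a bundle activator. The CIManager interface and its implementation are placeholders invented for this example; only the OSGi BundleActivator mechanics are standard, and the actual UbiCollab bundles may be organized differently.

```java
// Sketch of a Platform Service packaged as an OSGi bundle (file: CIManagerActivator.java).
import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceRegistration;

// Hypothetical, heavily simplified Platform Service interface (see Section 5.2 for the described API).
interface CIManager {
    String createCI(String name);
}

// Trivial placeholder implementation.
class CIManagerImpl implements CIManager {
    public String createCI(String name) { return "ci-" + name; }
}

// Activator run by the OSGi container on the UbiNode when the bundle is started or stopped.
public class CIManagerActivator implements BundleActivator {
    private ServiceRegistration<CIManager> registration;

    @Override
    public void start(BundleContext context) {
        // Publish the service so applications and other Platform Services on this UbiNode can use it.
        registration = context.registerService(CIManager.class, new CIManagerImpl(), null);
    }

    @Override
    public void stop(BundleContext context) {
        // Dynamic deployment: bundles, and thus services, can be removed or upgraded at runtime.
        registration.unregister();
    }
}
```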
5.2 Collaboration Instance Manager
The main job of the Collaboration Instance Manager (CI Manager) is to make sure that the information in a CI is always up-to-date for all human grid members, and that only members of a CI can access this information. This is shown in Figure 7. In this figure, each UbiCollab user (Tom, Jan, Dennis, Robert and Lisa) has his or her own UbiNode with the CI Manager installed on it. Jan and Tom are members of a CI called "Golf club". The CI Manager makes sure that any information Jan or Tom publishes into the CI is synchronized with the UbiNode of the other participant. The CI Manager communicates with the CI Managers of the other participants in a peer-to-peer fashion. Information in a CI is stored as files in a standard file and folder system. This means that applications have full flexibility in storing any information they need in a CI. The CI Manager provides the following API towards UbiCollab applications:
Fig. 6 The architecture of a UbiNode
• CI Management: Allows users to create, delete, rename and annotate CIs, and to invite and remove members from a CI.
• Information publication: Allows applications or users to write information (files) into a CI. This information is then synchronized with the CI Managers of the other CI participants.
• Change notifications: Once CI information is changed by one CI participant's application, other applications are notified about the change.
• Participant lookup: The CI Manager uses peer discovery in order to find the CI participants within the UbiNetwork. Currently JXTA's4 lookup mechanisms are used for this, but we can think of scenarios where a more restrictive registry-based lookup can be implemented.
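Rendered as a Java interface, the API listed above might look roughly as follows. The method names and signatures are illustrative guesses made for this sketch, not the documented UbiCollab API.

```java
// Hypothetical Java rendering of the CI Manager API areas described above.
import java.io.InputStream;
import java.util.List;

interface CollaborationInstanceManager {
    // CI management
    String createCI(String name);                         // returns a CI identifier
    void renameCI(String ciId, String newName);
    void deleteCI(String ciId);
    void inviteMember(String ciId, String userId);
    void removeMember(String ciId, String userId);

    // Information publication: content is stored as files and synchronized
    // peer-to-peer with the CI Managers of the other participants.
    void publish(String ciId, String path, InputStream content);

    // Change notifications
    void addChangeListener(String ciId, CIChangeListener listener);

    // Participant lookup (peer discovery, e.g. via JXTA in the current implementation)
    List<String> lookupParticipants(String ciId);
}

interface CIChangeListener {
    void informationChanged(String ciId, String path);
}
```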
Fig. 7 CI Manager and synchronization of CIs
4 JXTA (www.jxta.org) is a Java-based peer-to-peer framework that is used in the current implementation of UbiCollab.
5.3 Session Manager
Applications in UbiCollab are realized as a combination of Proxy and Platform Services. The specific combination of these services is specified in the form of an Application Template. Application Templates are XML-based declarative descriptions and are handled by the Session Manager. The user runs an application by providing the Session Manager with a link to the application's template. The Session Manager then creates a Session for the application and takes care of allocating the needed resources to it. Additionally, the Session Manager is in charge of handling user mobility by switching to new resources as they become available when users move from one space to another [56].
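The chapter does not reproduce the template schema, but conceptually an Application Template could look like the hypothetical XML sketch below, declaring which Platform and Proxy Services an application needs so that the Session Manager can allocate them. All element and attribute names are invented for illustration.

```xml
<!-- Hypothetical application template; element and attribute names are illustrative only. -->
<application name="UbiBuddy">
  <platform-services>
    <service type="CIManager"/>
    <service type="CSManager"/>
  </platform-services>
  <proxy-services>
    <!-- Resolved at runtime against the user's Service Domain; optional services may be missing. -->
    <service type="Display" required="false"/>
    <service type="GPSSensor" required="false"/>
  </proxy-services>
</application>
```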
5.4 Collaboration Space Manager
Collaboration Spaces (CSs) are representations of bounded physical locations that users normally reside in. The CS Manager provides an API that allows users and applications to perform the following:
• CS Management: Define and manage CSs by tagging physical coordinates. Coordinates can be detected automatically using a position sensor, or users can enter coordinates manually. Users can then annotate these definitions based on a flexible internal CS model.
• Locating users: The CS Manager can return a user's current location. The location is identified either by providing the CS Manager with the GPS (Global Positioning System) coordinates for the user (acquired, e.g., automatically through a GPS sensor, or entered manually) or by the user manually setting an existing CS as his/her current location. The CS Manager then returns zero or more matching CSs that indicate to the applications where the user is. Automatic acquisition of location has been tried out in the form of Bluetooth-enabled GPS sensors that are discovered and installed in each user's Service Domain (see below for Service Domains).
• Location-tagging of resources: The CS Manager is used to tag newly discovered resources with a CS tag. This is useful for embedded resources such as a projector or a printer in a meeting room. CS tags for resources are used to provide context-aware access to services that are physically available to the user.
The CS Manager also defines two default CSs for its user: a Web Space for online services (such as a file sharing service) and a Body Space for wearable devices (such as MP3 players or GPS sensors carried around by the user). This set of simple location-based services allows UC applications to know where each user is and what resources each user has access to. The level of automation can be varied from full automation, using various positioning services, to fully manual, with the user entering all the information. The CS Manager has an associated management GUI to allow users to manually adjust and change CS-related information. In addition, we are working on defining a standard CS exchange format in order to allow users to
exchange their CS definitions. This can be a useful functionality in situations where a number of standard “rooms” are defined. For instance, a user entering a museum can download a predefined set of CSs with extended information about the rooms in the museum.
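A hypothetical Java rendering of the location-related part of this API is sketched below; the interface and method names are invented for illustration and may differ from the actual UbiCollab code.

```java
// Hypothetical sketch of how an application might use the CS Manager to locate its user.
import java.util.List;

interface CollaborationSpaceManager {
    String defineCS(String name, double latitude, double longitude, double radiusMeters);
    List<String> locateUser(double latitude, double longitude); // zero or more matching CSs
    void setCurrentCS(String csId);                             // manual fallback
    void tagResource(String proxyServiceId, String csId);       // e.g. projector -> meeting room
}

final class WhereAmI {
    // Prints the collaboration spaces matching a GPS reading, e.g. one from a Bluetooth GPS sensor.
    static void report(CollaborationSpaceManager csManager, double lat, double lon) {
        List<String> spaces = csManager.locateUser(lat, lon);
        if (spaces.isEmpty()) {
            System.out.println("No known collaboration space at this position.");
        } else {
            System.out.println("User is in: " + spaces);
        }
    }
}
```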
5.5 Service Domain Manager
Resources in UbiCollab denote external entities such as devices, artifacts, and web services. In UC, the process of making these resources available to the applications is called Proxy Service installation. Proxy Service installation is analogous to installing a driver (e.g. a printer driver) on a laptop computer. In Figure 6 these Proxy Services are shown as pentagons. An installed Proxy Service resides in the user's Service Domain (SD) and can be used by applications through its API. A Proxy Service might also implement additional functionality such as a local client GUI, security and logon mechanisms. The SD Manager in UC provides a set of functions to make accessing and managing the SD easy for both the user and the applications:
• Proxy Service installation: An external resource can be installed as a Proxy Service in the SD by providing the SD Manager with a URL to its Proxy Service code. The SD Manager will then download the Proxy Service code and add it to the user's SD. The URL can be entered either manually or using the Resource Discovery Manager (see below).
• SD maintenance: Proxy Services can be deleted (when they are not needed anymore), and the user can change their associated meta-data. Such meta-data includes the location of the resources (i.e. the Collaboration Space), and additional information such as human-understandable descriptions.
• Sharing of services: The SD Manager allows users to assign a user name and password to Proxy Services (see also the ID Manager below), assign different levels of access to the Proxy Services, and create a secure URL to them that can be included in a Collaboration Instance. In this way other users in a CI can access and control the shared services.
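The installation flow just described might look as follows in application code. The interface and its methods are hypothetical placeholders; only the overall idea, installing an OSGi bundle referenced by a URL into the user's Service Domain, follows the text.

```java
// Hypothetical sketch of the Proxy Service installation flow.
import java.net.URL;

interface ServiceDomainManager {
    String installProxyService(URL proxyServiceBundle);        // returns an id for the installed proxy
    void removeProxyService(String proxyId);
    void setMetaData(String proxyId, String csId, String description);
    URL shareProxyService(String proxyId, String accessLevel); // secure URL to include in a CI
}

final class InstallPrinterProxy {
    static void install(ServiceDomainManager sd, String officeCsId) throws Exception {
        // The URL is typically obtained via the Resource Discovery Manager (e.g. read from an RFID tag);
        // the address below is just a placeholder.
        URL bundleUrl = new URL("http://example.org/proxies/printer-proxy.jar");
        String proxyId = sd.installProxyService(bundleUrl);
        sd.setMetaData(proxyId, officeCsId, "Printer in my office");
    }
}
```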
5.6 Resource Discovery Manager
The Resource Discovery (RD) Manager implements user-friendly mechanisms for accessing and integrating external resources in UC. Discovery of resources can be done in many different ways, and a good number of discovery mechanisms have already been implemented by others. In UC we wanted to avoid creating yet another discovery protocol. The RD Manager therefore uses so-called discovery plug-ins to enable interaction with native discovery mechanisms. For instance, a plug-in that is already implemented uses an RFID reader pen to read an RFID tag from a resource. All plug-ins return a URL to the Proxy Service code (an OSGi bundle) for the discovered resource.
This URL can be passed to the SD Manager, which will use it to install the Proxy Service in the user's Service Domain.
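The plug-in contract implied by this description could be sketched as follows; the interface and class names are illustrative, not the actual UbiCollab types.

```java
// Hypothetical discovery plug-in contract: whatever native mechanism a plug-in wraps
// (RFID pen, UPnP, Bluetooth, ...), it ultimately yields URLs pointing to Proxy Service bundles.
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

interface DiscoveryPlugin {
    String mechanism();      // e.g. "rfid-pen", "upnp"
    List<URL> discover();    // Proxy Service bundle URLs for the resources found
}

final class ResourceDiscoveryManager {
    private final List<DiscoveryPlugin> plugins;

    ResourceDiscoveryManager(List<DiscoveryPlugin> plugins) { this.plugins = plugins; }

    // Collects Proxy Service URLs from all plug-ins; each URL can then be handed to the SD Manager.
    List<URL> discoverAll() {
        List<URL> result = new ArrayList<>();
        for (DiscoveryPlugin plugin : plugins) {
            result.addAll(plugin.discover());
        }
        return result;
    }
}
```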
5.7 Identity Manager
The Identity (ID) Manager is a recent module under development for UC. The ID Manager is in charge of facilitating personalization and preserving the privacy of UC users. It allows the user to create a personal profile. This profile contains values for various parameters that the user has set for personalizing services. For instance, the printer service will have a placeholder in the user's profile with user-set values (such as double-sided printing). The user can also use the ID Manager to create an arbitrary number of virtual IDs. Each virtual ID (VID) has a pseudonym (nickname) and a password. Each VID provides access to a subset of the personal profile data, in this way allowing for anonymous access and personalization of various services. The ID Manager is still under development and will be fully integrated into the UC platform. The goal is that all services (Proxy and Platform Services) will support VID-based authentication and personalization. The ID Manager supports the following functions:
• Profile management: Create and maintain the personal profile, with areas associated to the service types that the user normally uses.
• VID management: Create and manage VIDs, and associate VIDs with various services.
• User management: Connect to community services for verification and trust management related to VIDs.
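A toy sketch of the VID idea, exposing only a subset of the personal profile per virtual identity, is given below. The data model is invented for illustration; the real ID Manager may represent profiles and VIDs quite differently.

```java
// Hypothetical sketch: a virtual identity reveals only selected entries of the personal profile.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class VirtualIdentity {
    final String pseudonym;
    private final Set<String> exposedKeys;   // which profile entries this VID may reveal

    VirtualIdentity(String pseudonym, Set<String> exposedKeys) {
        this.pseudonym = pseudonym;
        this.exposedKeys = exposedKeys;
    }

    // The personalization data a service sees when it is accessed under this VID.
    Map<String, String> visibleProfile(Map<String, String> fullProfile) {
        Map<String, String> visible = new HashMap<>();
        for (String key : exposedKeys) {
            if (fullProfile.containsKey(key)) {
                visible.put(key, fullProfile.get(key));
            }
        }
        return visible;
    }
}
```

A VID such as "golfer42" could, for example, expose only the printing preferences while hiding the user's real name from the service being accessed.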
Fig. 8 Various steps in the resource discovery process
6 Conclusions
In this chapter we have discussed the concept of ubiquitous collaboration and have presented an overview of existing technologies to support it. We have also outlined our own research in this area in the form of the concept of the human grid and the experimental UbiCollab platform. We have deliberately focused on supporting the "mechanics" of ubiquitous collaboration in UbiCollab. UbiCollab in its current form supports the basic functions. These basic functions (and their robust operation) are our main short-term goal. In the long term we are looking into implementing additional components in UbiCollab. We believe that providing an open platform, where components can be added and removed freely, is necessary for a sustainable platform. Research in AmI provides a number of advances in technology that are highly relevant to CSCW. On the other hand, supporting social interactions in small and large groups should be considered a natural development within AmI and ubicomp. We expect Ambient Intelligence to undergo a development similar to the move from single-user systems (HCI) to multi-user systems (CSCW). In our view, future AmI will be some form of "Ambient Collective Intelligence". Some of the important questions to ask in this respect are:
• How can advances in context acquisition and processing in AmI be used in CSCW in order to implement shared context, improve awareness and support group processes? How can theories from CSCW be used to make better and more informed use of available context in AmI?
• How can connected artifacts, such as mobile and embedded devices, be used in collaboration? How can the notion of the artifact as studied in CSCW as a resource for collaboration enrich the notion of physical connected objects in AmI? How is meaning created by using shared physical artifacts over geographical distances?
• How can designs and innovations in end-user configuration from AmI be used to facilitate adaptation and tailoring of technology to the needs of groups? What does it mean to personalize a shared space? What are the best processes for tailoring shared smart spaces? How will we handle conflicts?
• How can AmI spaces be used from a distance? How can context from one space be shared with another space when groups are distributed across geographical distances?
• How will pro-active AmI technologies behave in a group context?
In our research we treat AmI as fundamentally collaborative. In real life there are very few AmI spaces that are "single user". At the same time, in order for CSCW to support anything close to natural face-to-face collaboration, we need to move away from the desktop and focus on the natural physical surroundings of the collaborating parties. Ubiquitous collaboration support is a compelling research field, and we can already see innovative technologies that aim at addressing this shared challenge.
7 Acknowledgements
We thank all the students who have been involved in the development of UbiCollab and who have contributed ideas and implementations for the platform and the UbiBuddy application. A full list of theses can be found on the UbiCollab web pages at www.ubicollab.org.
References [1] Engelbart, D.C., English, W.K.: A research center for augmenting human intellect. In: Proceedings AFIPS Fall Joint Computer Conference, San Francisco, CA, USA, AFIPS (1968) 395–410 [2] Bannon, L., Schmidt, K.: CSCW: Four characters in search of a context. In Bowers, J.M., Benford, S.D., eds.: Studies in Computer Supported Cooperative Work. Theory, Practice and Design. North-Holland, Amsterdam (1991) 3–16 [3] Krasner, H., Greif, I., eds.: Proceedings of the 1986 ACM conference on Computer-supported cooperative work. ACM, New York, N.Y., Austin, Texas, December 3-5, 1986 (1986) [4] Olson, J.S., Teasley, S.D., Covi, L., Olson, G.M.S.: The (currently) unique advantages of collocated work. In Hinds, P., Kiesler, S., eds.: Distributed Work. MIT Press, Cambridge, MA (2002) 113–135 [5] Luff, P., Heath, C.: Mobility in collaboration. In: Proceedings CSCW, ACM Press (1998) 305–314 [6] Gutwin, C., Greenberg, S.: A descriptive framework of workspace awareness for real-time groupware. CSCW 11 (2002) 411–446 [7] Dourish, P., Bellotti, V.: Awareness and coordination in shared workspaces. In: Proceedings CSCW, Toronto, Ontario, Canada, ACM (1992) 107–114 [8] Halverson, C.A.: Activity theory and distributed cognition: Or what does cscw need to do with theories? Computer Supported Cooperative Work (CSCW) 11 (2002) 243–267 [9] Kaptelinin, V., Nardi, B.A.: Acting with Technology: Activity Theory and Interaction Design. The MIT Press (2006) [10] Kuutti, K.: The concept of activity as a basic unit of analysis for CSCW research. In: Proceeding ECSCW, Amsterdam, The Netherlands, Springer (1991) [11] Fitzpatrick, G., Tolone, W.J., Kaplan, S.M.: Work, locales and distributed social worlds. In: Proceedings ECSCW, Stockholm, Sweden, Springer (1995) [12] Fitzpatrick, G.: The Locales Framework: Understanding and designing for Wicked Problems. Kluwer Academic Publishers (2003) [13] Hutchins, E.: Cognition in the Wild. The MIT Press, Cambridge, MA (1995) [14] Nardi, B.A.: Studying context: A comparison of activity theory, situated action models, and distributed cognition. In Nardi, B.A., ed.: Context and Consciousness. The MIT Press, Cambridge, Massachusetts (1995) 69–102
[15] Suchman, L.A.: Plans and Situated Actions: The Problem of Human-machine Communication. Cambridge University Press, Cambridge (1987) [16] Satyanarayanan, M.: Pervasive computing: Vision and challenges. IEEE Personal Communications 8 (2001) 10–17 [17] Greenberg, S.: Context as a dynamic construct. Human-Computer Interaction 16 (2001) 257–268 [18] Dey, A.K.: Understanding and using context. Personal and Ubiquitous Computing 5 (2001) 4–7 [19] Edwards, W.K.: Putting computing in context: An infrastructure to support extensible context-enhanced collaborative applications. ACM Transactions on Computer-Human Interaction 12 (2005) 446–474 [20] Harrison, S., Dourish, P.: Re-place-ing space: The roles of place and space in collaborative systems. In: Proceedings of CSCW, Boston, Massachusetts, USA, ACM Press (1996) 67–76 [21] Weiser, M.: The computer for the twenty-first century. Scientific American 265 (1991) 94–104 [22] Ishii, H., Ullmer, B.: Tangible bits: Towards seamless interfaces between people, bits and atoms. In: Proceedings of the SIGCHI conference on Human factors in computing systems, ACM Press (1997) 234–241 [23] Schmidt, K., Simone, C.: Coordination mechanisms: Towards a conceptual foundation for CSCW systems design. Computer Supported Cooperative Work (CSCW). An International Journal 5 (1996) 155–200 [24] Robinson, M.: Design for unanticipated use... In: Proceedings ECSCW, Amsterdam, The Netherlands, Springer (1991) 195 [25] Nomura, S., Edwin, H., Barbara, E.H.: The uses of paper in commercial airline flight operations (2006) [26] Bellotti, V., Bly, S.: Walking away from the desktop computer: Distributed collaboration and mobility in a product design team. In: Proceedings CSCW, Boston, Massachusetts, United States, ACM (1996) 209–218 [27] Kraut, R.E.e.a.: Informal communication in organizations: Form, function, and technology. In Oskamp, S., Spacapan, S., eds.: People’s reactions to technology in factories, offices, and aerospace. Sage Publications (1990) 145–199 [28] ISTAG: Scenarios for ambient intelligence in 2010. Technical report, EU (2001) [29] Covi, L.M., Olson, J.S., Rocco, E., Miller, W.J., Allie, P.: A room of your own: What do we learn about support of teamwork from assessing teams in dedicated project rooms. In: Proceedings CoBuild. (1998) [30] Brush, A., Inkpen, K.: Yours, mine and ours? Sharing and use of technology in domestic environments. In: Proceedings UbiComp. Springer (2007) 109–126 [31] Neustaedter, C., Brush, A.J.B., Greenberg, S.: ’The calendar is crucial’: Coordination and awareness through the family calendar. ACM Transactions on Computer Human Interactions In press (2009) [32] Mackay, W.: Media spaces: Environments for informal multimedia interaction. In Beaudouin-Lafon, M., ed.: Computer Supported Co-operative Work. John Wiley & Sons, Chichester, UK (1999) 55–82
[33] Abbott, K.R., Sarin, S.K.: Experiences with workflow management: Issues for the next generation. In Neuwirth, R.F., Christine, eds.: Proceedings CSCW, Chapel Hill, October 22-26, NC, USA, ACM Press (1994) 113–120 [34] Russell, D.M., Streitz, N.A., Winograd, T.: Building disappearing computers. Communications of the ACM 48 (2005) 42–48 [35] Malone, T.W., Lai, K.Y., Fry, C.: Experiments with OVAL: A radically tailorable tool for cooperative work. ACM Transactions on Information Systems 13 (1995) 177–205 [36] Lieberman, H., Paterno, F., Wulf, V., eds.: End-user Development. Springer (2006) [37] Newman, M.W., Sedivy, J.Z., Neuwirth, C.M., Edwards, W.K., Hong, J.I., Izadi, S., Marcelo, K., Smith, T.F., Sedivy, J., Newman, M.: Designing for serendipity: Supporting end-user configuration of ubiquitous computing environments. In: Proceedings DIS, London, England, ACM Press (2002) [38] Ruyter, B.d., Sluis, R.V.D.: Challenges for end-user development in intelligent environments. In Lieberman, H., Paterno, F., Wulf, V., eds.: End User Development. Springer (2006) 243–250 [39] Greenberg, S., Roseman, M.: Using a room metaphor to ease transitions in groupware. In Ackerman, M., Pipek, V., Wulf, V., eds.: Sharing Expertise: Beyond Knowledge Management. MIT Press, Cambridge, MA (2003) 203– 256 [40] Churchill, E.F., Snowdon, D.N., Munro, A.J.: Collaborative Virtual Environments. Springer (2001) [41] Bly, S.A., Harrison, S.R., Irwin, S.: Media spaces: Bringing people together in a video, audio, and computing environment. Communications of the ACM 36 (1993) 28–46 [42] Farshchian, B.A.: Presence technologies for informal collaboration. In: Being there: Concepts, effects and measurement of user presence in synthetic environments. Volume 5 of Emerging Communication: Studies on New Technologies and Practices in Communication. IOS Press, Amsterdam (2002) 209–222 [43] Mansfield, T., Kaplan, S., Fitzpatrick, G., Phelps, T., Fitzpatrick, M., Taylor, R.: Evolving Orbit: A process report on building locales. In: Proceedings Group, Phoenix, Arizona, USA, ACM Press (1997) 241–250 [44] Schmidt, K., Simone, C.: Mind the gap! towards a unified view of cscw. In: COOP, Sophia Antipolis, France (2000) [45] Roseman, M., Greenberg, S.: Teamrooms: Network places for collaboration. In: Proceedings CSCW, Boston, Massachusetts, USA, ACM Press (1996) 325– 333 [46] Bardram, J.E.: Activity-based computing: Support for mobility and collaboration in ubiquitous computing. Personal and Ubiquitous Computing 9 (2005) 312–322 [47] Edwards, W.K., Newman, M.W., Sedivy, J.Z., Smith, T.F., Balfanz, D., Smetters, D.K., Wong, H.C., Izadi, S.: Using speakeasy for ad hoc peer-to-peer collaboration. In: Proceedings CSCW, New Orleans, Louisiana, USA, ACM Press (2002) 256–265
[48] Johanson, B., Fox, A., Winograd, T.: The interactive workspaces project: Experiences with ubiquitous computing rooms. IEEE Pervasive Computing 1 (2002) 67–74 [49] Tandler, P.: The BEACH application model and software framework for synchronous collaboration in ubiquitous computing environments. Journal of Systems & Software 69 (2003) 267–296 [50] Goldie, B.T., McCrickard, D.S.: Enlightening a co-located community with a semi-public notification system (2006) [51] Divitini, M., Farshchian, B.A.: Shared displays for promoting informal cooperation: An exploratory study. In Darses, F., Dieng, R., Simone, C., Zacklad, M., eds.: Proceetings COOP, Hyôlres Les Palmiers, France, IOS Press (2004) 211–226 [52] Greenberg, S.: Collaborative physical user interfaces. In Okada, K., Hoshi, T., Inoue, T., eds.: Communication and Collaboration Support Systems. IOS Press, Amsterdam (2005) 24–42 [53] Cheverst, K., Smith, G., Mitchell, K., Friday, A., Davies, N.: The role of shared context in supporting cooperation between city visitors. Computers & Graphics 25 (2001) 555–562 [54] Farshchian, B.A., Divitini, M.: Ubicollab architecture whitepaper. Technical report, Norwegian University of Science and Technology (2007) [55] Bannon, L., Bødker, S.: Constructing common information spaces. In: Proceedings ECSCW, Kluwer Academic Publishers (1997) 81–96 [56] Olsen, K.A.G.: Distributed Session Management, Session Mobility and Transfer in UbiCollab. PhD thesis, Norwegian University of Science and Technology (NTNU) (2008)
Pervasive Computing Middleware
Gregor Schiele (Universität Mannheim, Germany), Marcus Handte (Fraunhofer IAIS and Universität Bonn, Germany) and Christian Becker (Universität Mannheim, Germany)
1 Introduction
Pervasive computing envisions applications that provide intuitive, seamless and distraction-free task support for their users. To do this, the applications combine and leverage the distinct functionality of a number of devices. Many of these devices are invisibly integrated into the environment. The devices are equipped with various sensors that enable them to perceive the state of the physical world. By means of wireless communication, the devices can share their perceptions and combine them into accurate and expressive models of their surroundings. The resulting models enable applications to reason about past, present and future states of their context and empower them to behave according to the expectations of the user. This ensures that they provide high-quality task support while putting only little cognitive load on users, as they require only minimal manual input. To provide a truly seamless and distraction-free user experience, the applications can be executed in a broad spectrum of vastly different environments. Thereby, they require only little manual configuration, since they can autonomously adapt and optimize their execution depending on the capabilities of the environment. In many cases, changes to the environment are compensated with little impact on the support provided by the application. Failures are handled transparently and the capabilities of newly available devices are integrated on-the-fly. Given this or similarly ambitious visions, pervasive applications are very attractive from a user's perspective. In essence, they simply promise to offer more sophisticated and more reliable task support for everyone, everywhere. From an
application developer's perspective, on the other hand, they are close to a nightmare come true: unprecedented device heterogeneity, unreliable wireless communication and uncertainty in sensor readings, unforeseeable execution environments that span the complete spectrum from static to highly dynamic, changing user requirements, fuzzy user preferences, and the list goes on and on. As a result, even simple applications that process a small number of user inputs can often exhibit enormously large spaces of possible execution states, and these spaces need to be considered by the application developer. Thus, without further precautions, the development of pervasive applications is a non-trivial, time-consuming and error-prone task. It is exactly at this point where pervasive computing middleware can come to the rescue. Over the last years, middleware for pervasive computing has come a long way. Given the diversity of pervasive applications and the harsh conditions of their execution environments, the design of a comprehensive and adequate middleware for pervasive applications is certainly a rewarding but also a challenging goal. The pursuit of this goal has led to a multitude of different concepts, which have resulted in an even greater number of supportive mechanisms. It is due to this variety of concepts and mechanisms that the anatomy of existing middleware differs widely. As a consequence, it is necessary to take a closer look at the influencing factors in middleware design in order to enable a meaningful in-depth discussion and comparison.
2 Design Considerations
In this section we discuss the three main influencing factors for the design of pervasive computing middleware: the organizational model targeted by the middleware, the level of abstraction provided by the middleware, and the tasks supported by the middleware.
2.1 Organizational Model
The first key influential factor of middleware design is the organizational model of the execution environment. Given a certain organization, the middleware developer may introduce assumptions to improve or simplify middleware mechanisms and application development. In addition, the organizational model also introduces quantitative thresholds for important variables such as the maximum and typical number of devices or the frequency and types of fluctuations that can be expected. At the present time, there are two predominant organizational models, as shown in Figure 1. In the past, middleware developers have primarily focused on one or the other:
• Smart environment: A large fraction of pervasive computing middleware systems, such as [19], [36], [33], [34], are geared towards the organizational model of a smart environment. Thereby, a smart environment can be defined as a spatially limited area, e.g. a meeting room or an apartment, equipped with various
Fig. 1 Predominant organizational models
sensors and actuators. As a result, many devices in smart environments are stationary. However, as users might be using mobile devices such as laptops or personal digital assistants (PDAs) as well, some devices must be dynamically integrated. Typically, the integration is performed depending on the physical location of the device, which can be determined automatically by the available sensors or can be configured manually through some user interface. Due to the fact that smart environments usually contain a number of stationary devices, middleware for smart environments typically relies on a stationary and rather powerful coordinating computer to provide basic services. Since the services are often an essential building block of the applications, the presence of the coordinating computer is required at all times and thus, its availability must be high. Clearly, demanding the availability of a powerful and highly available computer may be problematic due to technical and economic constraints. However, if it can be guaranteed, a coordinating computer can significantly ease middleware and application development, since it provides a central point of management that can enforce policies, mediate communication, distribute information, etc.
• Smart peers: In addition to middleware for smart environments, there is also a good number of middleware systems, such as [5], [6], [17], [20], that are based on the idea of dynamically formed collections of smart peers. In contrast to middleware for smart environments, which views a pervasive system as a set of computers located in a certain area, smart peers postulate a people-centric as opposed to a location-centric
perspective on pervasive applications. The key idea is to understand a pervasive system as the collection of computers that surrounds a person, independent of the physical location. As a result, the devices in such a system cannot rely on the continuous presence of any of the other devices. This prohibits centralized coordination on a powerful and highly available system. Instead, the smart peers must make use of suitable mechanisms to coordinate in a decentralized manner. In order to achieve the locality required to keep the dynamic set of devices manageable, the middleware systems typically limit the size of the set on the basis of spatial proximity. While this may seem more flexible at first glance, the smart peer model is not without issues. Specifically, the management mechanisms tend to be more complicated and, if not properly designed, they can also be less efficient.
2.2 Provided Level of Abstraction
Another influential factor of middleware design is the provided level of abstraction. Given that pervasive computing middleware may provide a great variety of services, the level of abstraction can, in general, also differ widely. However, from a high-level point of view, it is possible to identify three main levels, and in most pervasive computing middleware the services can be classified as providing one of them:
• Full transparency: From an application developer's perspective, the ultimate middleware system should provide full transparency with respect to the addressed challenges. If full transparency can be provided, the application developer can simply ignore all related issues since they are handled by the middleware. Yet, providing transparency requires a thorough understanding of the problem domain and the application domain, since the middleware designer must be able to identify a completely generic solution that provides satisfying results for all possible application scenarios. As a consequence, middleware services that provide full transparency are usually restricted to a particular type of application and to a very specific type of problem.
• Configurable automation: Given that full automation often imposes severe restrictions on the supported scenario, a frequently used trade-off is configurable automation. Instead of defining a middleware layer that abstracts from all details, the middleware designer creates an abstraction that automates the task with the help of some configuration by the application developer. As a consequence, the application developer can forget about the intrinsic problems after deciding upon the basic configuration. In cases where the number of possible solutions is reasonably small, configurable automation can greatly simplify application development. Yet, as the number of possible solutions grows, the configuration becomes more and more complicated until it outweighs the benefits of automation.
• Simplified exposure: In cases where configurable automation can no longer achieve satisfying results, pervasive computing middleware can strive for simplified exposure. So instead of offering mechanisms to automate a certain task, the middleware can restrict itself to the automated gathering of information needed to fulfill the task. Given that many tasks are not well-understood or have a huge solution space, this level of abstraction is sometimes the only reasonable option. Clearly, this might seem to be a very low level of abstraction. However, it can still be very valuable for two reasons. On the one hand, the gathering of the necessary information can be a complex task in itself. On the other hand, centralized gathering of information can often be more efficient, especially, in cases where multiple applications require the same type of information.
2.3 Supported Tasks
Besides the organizational model and the level of abstraction, the most influential factor in pervasive computing middleware design is certainly the task supported by the system. Due to the great variety of challenges, the set of tasks is equally large. Yet, several tasks must also be handled by conventional systems, and in many cases the solutions applied by pervasive computing middleware follow traditional patterns. To avoid the repetition of mainstream concepts, the in-depth discussion of pervasive computing middleware in this chapter is restricted to the following three main services that can be found in a majority of the existing systems.
• Spontaneous Interaction: Independent from the underlying organizational model, pervasive applications need to interact spontaneously with an ever-changing set of heterogeneous devices. To do this, applications not only need to communicate with each other but they also need to detect and monitor the available set of devices. As a consequence, even the most basic middleware systems provide functionality to enable this kind of spontaneous remote interaction. The details of the support, however, depend on the organizational model, and the level of abstraction can vary accordingly.
• Context Management: Besides interaction, a key difference between traditional and pervasive computing applications is that pervasive applications need to take the state of the physical world into account. In addition to writing appropriate application logic, this bears many complex challenges. First and foremost, the relevant observations need to be made. Depending on the phenomena, this may require coordinated measurements of multiple distributed sensors and the fusion of several readings. During this process, the application has to deal with varying sensor availability and uncertainties in measurements. Moreover, the application might have to deal with various types of sensors resulting in different data representations, etc. As a result, several middleware systems aim at hiding these complexities by providing integrated context management services.
• Application Adaptation: The dynamics and heterogeneity of pervasive systems, as well as the fact that pervasive applications are usually distributed, raise an inherent need for adaptation. In order to function properly despite their highly dynamic execution environment, applications need to adapt to the overall system properties, the available capabilities that can be leveraged at a certain point in time, and the context of their users. To simplify application adaptation, many middleware systems provide services that hide several aspects or that provide at least configurable automation. To do this, they introduce additional abstractions, e.g. to specify dependencies on the execution environment or to define optimization goals, and they provide supportive mechanisms.
It is noteworthy that only the most basic systems cover solely one of the services described above. In fact, most systems have building blocks for each of them. While some systems have provided basic support for all of these tasks right from the early stages of their development (e.g. [19], [36]), other systems have been extended with additional services over time (e.g. [34], [6]). Yet, due to differences in the underlying organizational model and the specifics of their application scenario, the resulting middleware solutions explore vastly different abstractions. The following sections provide a more detailed discussion of the three middleware services described above and relate their internal structure as well as the provided level of abstraction to the organizational model and the application scenario.
3 Spontaneous Interaction
Pervasive computing applications are typically distributed over a number of participating devices, e.g. a wall-mounted display for graphical output, a mobile device for user input, and a fixed infrastructure server for data storage. To support this, a suitable distribution platform is needed. We distinguish between three different requirements that such a platform or middleware must fulfill:
• Ubiquitous communication and interaction: First, the middleware should support ubiquitous and configuration-free interaction with other devices. This includes basic communication abilities, i.e., the ability to reach other devices and to communicate with them interoperably. In addition, a suitable interaction abstraction is required to access remote devices. As an example, a service-based approach can be used.
• Integration of heterogeneous devices: Second, the middleware should make it possible to integrate a wide range of different devices into the system. This includes not only devices based on different hardware and software platforms – the classical goal of interoperability – but also devices with very different resources. As an example, one device might offer a lot of CPU power and memory but no graphical output. Another device might have very scarce computational and memory resources but a large graphical display attached to it.
• Dynamic mediation: Third, the middleware should provide a dynamic mediation service that allows suitable interaction partners to be selected at runtime. This service can be realized in different ways, e.g. by letting devices interact directly with each other or by using an additional system component to mediate between devices.
In the following we discuss each of these issues in more detail and analyze how different existing middleware systems for pervasive computing address them.
3.1 Ubiquitous Communication and Interaction
Enabling devices to cooperate with each other requires interoperability between these devices. Interoperability is a multilayer issue that needs to be guaranteed from the lowest layers of the networking stack up to the application layer, where a common understanding of the semantics of shared functionalities is required. To achieve this, different syntactic or semantic descriptions have been proposed. Middleware has to provide the necessary mechanisms to achieve interoperability on these different layers, or at least has to support applications in this task.
3.1.1 Interoperability
In general there are three main possibilities to solve the issue of interoperability as shown in Figure 2.
Fig. 2 Possibilities for interoperability: a fixed standardized protocol set, a dynamically negotiated protocol set, or interaction bridges
The first possibility is to standardize a fixed common set of protocols, technologies and data formats to interact with other devices. As long as a device uses this standard set of functionality, it will be able to communicate and interact with everybody else in the system. This approach is used by most classical middleware systems, e.g., CORBA [32], Java RMI [50] or DCOM [16]. One prominent example for a pervasive computing middleware relying on a fixed set of standard protocols is UPnP [52]. It defines a set of common communication protocols, e.g., HTTP and SOAP, to allow interaction between devices. Messages are encoded with XML. The exact format of each message must be defined between the communication partners
before the interaction can take place. Clearly, the main challenge of this approach is the standardization process itself. Once a common standard is established, interaction can take place efficiently. However, it is difficult to change the standard protocols without updating or exchanging all devices. This is specifically problematic in pervasive computing systems, since (1) many devices are embedded into everyday items and thus cannot be updated easily, and (2) devices have a relatively long life time if they are embedded into long-living objects, like furniture.
The second, much more flexible possibility to achieve interoperability is to allow the middleware to negotiate the protocols used to interact with others at runtime. This additionally makes it possible to switch between different communication technologies at runtime. As users and their devices move around in a pervasive computing system, the available communication technologies may vary over time. As an example, a wireless local area network (WLAN) might be available to users in a building but become unavailable when the user leaves it. By switching to another technology, the middleware is able to offer the maximum achievable communication performance. One middleware allowing such so-called vertical handoffs is the BASE middleware [5]. Based on specific communication requirements that can be selected by an application, BASE negotiates the used protocols and technology with the communication partner dynamically. If a technology becomes unavailable during an interaction, it automatically reselects another one and continues the interaction. Another example for a middleware in this class is Jini [53]. Jini is a Java-based middleware that enables the utilization of distributed services in dynamic computer networks. It relies on mobile code to distribute service-specific proxies to clients. These proxies can include proprietary protocol stacks to access the remote service. This makes it possible to use different protocols for accessing different services, and thus enables already deployed devices to interact with new devices that use newer protocols. However, Jini does not allow switching between different protocols for the same service dynamically. A more sophisticated variant of this approach is provided by dynamically reconfigurable middleware systems (e.g. [28], [38], [4], [10], [9], [41]). These middleware systems are able to adapt their behavior at runtime to different requirements and protocols, e.g., by selecting which wireless communication technique should be used or how marshalling is done. To do so, these systems offer a reflection interface that allows the application to determine and change the current middleware behavior dynamically.
A third possible approach is to introduce interaction bridges into the system that map between different approaches. This approach is very well known on lower layers of the networking stack, e.g., to interconnect different networking technologies. In addition, it can be used on higher layers, too. As an example, the Unified Object Bus [37] used in Gaia provides interoperability between different component systems by defining a common interface to control component life-cycles.
3.1.2 Communication Abstractions
Another important issue when discussing ubiquitous interoperable interaction is the communication abstraction that is used. Conventional middleware offers various communication
abstractions to develop distributed applications. These abstractions range from network-oriented message passing primitives and publish-subscribe abstractions [18] over associative memory paradigms such as tuple spaces [11] to remote procedure calls [7] or their object-oriented counterpart, the remote method invocation (RMI). Since the latter extends the concept of a local procedure call - or a method invocation, respectively - to distributed application development, it is commonly supported by mainstream middleware like CORBA [32], Java RMI [50], and Microsoft's .NET Remoting [12]. Many pervasive computing middleware systems (e.g. [36], [5], [53], [52]) have adopted RMI as their primary communication abstraction. The main advantage of RMI is that it is well known and generally well understood by developers. However, since a remote call requires binding to a specific communication partner beforehand, RMI leads to rather tightly coupled communication partners. This may be problematic in highly dynamic environments with frequent disconnections between communication partners. Without further precautions, such a disconnection results in a communication failure that must be handled programmatically by the application developer. A possible solution is to store requests and responses in persistent queues, as done, e.g., by the Rover Toolkit [2]. This way the communicating devices can be disconnected temporarily at any point in time and can continue their interaction once they are connected again. Yet, the application of such queued remote procedure calls is only suitable in cases where very high latencies can be tolerated, and it requires that the communication partners meet eventually to finish their interaction. If this is not the case, the binding between the communication partners must be dissolved and a new binding must be determined. Note that in case of a stateful interaction that has a shared interaction state at both partners, it is necessary to restore this state at the new communication partner before the interaction can continue. This can be done, e.g., using message logs [22] and checkpointing [22] [20].
To avoid such problems, communication abstractions with loosely coupled communication partners can be used. Here, communication partners do not communicate directly but are decoupled by an intermediate communication proxy, e.g., a message queue or tuple space. To initiate an interaction, a device sends a message to the proxy, which stores the message. If another device is interested in the message, it can access the proxy and retrieve the message. This approach is used, e.g., in iRos [34]. Applications send event messages to a so-called event heap. The event heap stores events and allows other applications to retrieve them at a later time. Events are automatically removed from the event heap after a predefined time. Since there is no explicit binding between communication partners, the event sender does not need to know if its events are consumed by anyone or simply dropped. In addition, the recipient of the events can change during runtime without the sender knowing about it. Similarly to iRos, the one.world system [20] is based on loose coupling. It uses tuple spaces instead of an event heap. Each interaction results in an event data item being stored in the tuple space. An interested recipient can retrieve the data item, consuming it in the process, and can place an appropriate answer in the tuple space.
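The decoupling provided by an event heap or tuple space can be illustrated with the toy Java sketch below: senders and receivers never bind to each other and only share an intermediate store with time-limited entries. This is an illustrative in-memory model, not the actual iRos or one.world implementation.

```java
// Toy in-memory event store illustrating loose coupling with automatic expiry of events.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class EventHeap {
    private static final class Event {
        final String type;
        final String payload;
        final long expiresAtMillis;
        Event(String type, String payload, long expiresAtMillis) {
            this.type = type; this.payload = payload; this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final List<Event> events = new ArrayList<>();

    // A sender publishes without knowing who (if anyone) will consume the event.
    synchronized void publish(String type, String payload, long timeToLiveMillis) {
        events.add(new Event(type, payload, System.currentTimeMillis() + timeToLiveMillis));
    }

    // A receiver polls by type; expired events are dropped, matching events are consumed.
    synchronized List<String> consume(String type) {
        long now = System.currentTimeMillis();
        List<String> result = new ArrayList<>();
        for (Iterator<Event> it = events.iterator(); it.hasNext(); ) {
            Event e = it.next();
            if (e.expiresAtMillis < now) {
                it.remove();                       // automatic removal after the predefined time
            } else if (e.type.equals(type)) {
                result.add(e.payload);
                it.remove();                       // consumed, as in a tuple space 'take'
            }
        }
        return result;
    }
}
```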
3.2 Integration of Heterogeneous Devices
To integrate heterogeneous devices into the system, the middleware must be able to be executed on each of them. This requires the middleware to be adapted to different hardware and software platforms, e.g. by introducing a small system abstraction layer that encapsulates the specific platform properties. On top of this abstraction layer, the system is realized in a platform-independent way. Another possibility is to develop the system using a platform-independent execution environment like, e.g., the Java virtual machine. This way the middleware itself is platform independent and all platform-dependent system parts are pushed down into the underlying execution environment. Clearly, this works only for target environments for which the used execution environment is available. The challenge of developing cross-platform systems that can be executed on different heterogeneous devices is not specific to pervasive computing. It exists for all cross-platform applications. Therefore we will not go into further detail on this issue here. The reader is referred to [8] for a more elaborate discussion.
In pervasive computing the middleware faces an additional challenge when trying to integrate different devices. The used devices may differ widely in the capabilities and resources they offer. A stationary server may have a lot of resources with respect to CPU power, memory, network bandwidth and energy. However, it may not offer any user input or output capabilities. On the other hand, a mobile device might have very scarce resources but offer such user interface capabilities. The middleware must take this into account to be usable on all these devices. It should operate with few resources if necessary but utilize abundant resources if possible. On a mobile and thus energy-poor device it should offer energy saving mechanisms to achieve suitable device life times. There are different approaches to create system software with minimal resource requirements that still maintains enough flexibility to make use of additional resources. The two extremes are (1) building multiple, yet compatible systems for different classes of devices, and (2) building modular systems with a minimal yet extensible core. As an example for the first approach, the Object Management Group defined minimumCORBA [31], a subset of CORBA intended for resource-poor devices. Another example is PalmORB [39], a minimal middleware for personal digital assistants that only offers client-side functionality. Thus, the mobile device cannot offer services itself but only acts as a client for remote services. The second approach is to provide a modular and extensible system architecture. One example for this is the Universally Interoperable Core (UIC) [41]. It provides a minimal reflective middleware that is configurable to integrate various resources. UIC can be used in static and dynamic variants. The static version allows the middleware to be tailored to a given device but does not allow the middleware configuration to be changed at runtime. This is only possible in the dynamic variant. BASE [5] is another micro-broker-based middleware system for pervasive computing. It allows its minimal core to be extended by adding so-called plugins, each one abstracting a certain communication protocol or technology.
The Gaia project [36] combines both aforementioned approaches. For extremely resource-limited devices, it provides a specialized middleware called LegORB [40]. On other devices, it employs dynamicTAO [38], which relies on a small but dynamically extensible core.
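The second approach can be illustrated with the following sketch of a minimal core whose communication support is contributed entirely by plugins. It is loosely inspired by the plugin concept of BASE, but all interfaces and names are assumptions made for this example rather than the actual BASE API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A communication plugin abstracts one protocol or technology (e.g. Bluetooth, IP).
interface CommPlugin {
    String technology();
    void send(String deviceId, byte[] message);
}

// Minimal core: it only routes messages to whichever plugins are currently installed.
class MicroCore {
    private final Map<String, CommPlugin> plugins = new ConcurrentHashMap<>();

    void install(CommPlugin plugin)   { plugins.put(plugin.technology(), plugin); }
    void uninstall(String technology) { plugins.remove(technology); }

    void send(String technology, String deviceId, byte[] message) {
        CommPlugin p = plugins.get(technology);
        if (p == null) throw new IllegalStateException("no plugin for " + technology);
        p.send(deviceId, message);
    }
}

public class MicroCoreDemo {
    public static void main(String[] args) {
        MicroCore core = new MicroCore();
        // Resource-poor devices install only the plugins they actually need.
        core.install(new CommPlugin() {
            public String technology() { return "udp"; }
            public void send(String deviceId, byte[] m) { System.out.println("udp -> " + deviceId); }
        });
        core.send("udp", "lamp-42", "on".getBytes());
    }
}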
3.3 Dynamic Mediation

Applications in pervasive computing are faced with unforeseeable and often highly dynamic execution environments. The communication partners that will be available at runtime are therefore often unknown at development time and may even change dynamically. To cope with this, a suitable mediation service must be available that makes it possible to decide dynamically with which devices to communicate and cooperate. Due to the dynamics of the execution environments, this mediation must be performed continuously.
Mediation can be discussed for tightly or loosely coupled communication partners. As described before, loosely coupled communication partners do not communicate directly but via an intermediate communication proxy. In this case, mediation is rather simple as it is implicitly done by the proxy storing the data. A message is simply delivered to every device that accesses the proxy and reads the stored data. It is the recipient device's responsibility to filter out the messages that are of interest to it. Thus, mediation is done on the receiver side in this scenario. Tightly coupled communication partners require a much more sophisticated communication mediation. In such systems, the partners are first bound to each other at runtime. After that they exchange data directly. If one of the partners becomes unavailable, the communication binding breaks and must be renewed with a different partner. Typical examples of this approach are service-based systems. Therefore, the process of searching for suitable communication partners is typically referred to as service discovery. Existing service discovery approaches can be classified into peer-based (or server-less) and mediator-based (or server-based) approaches.
3.3.1 Peer-based Discovery

In peer-based approaches (e.g., [52], [29]), all nodes participate in the discovery process directly (see Figure 3). To find a service, a client broadcasts a discovery request to the whole network or a part of it. Nodes offering a suitable service answer the request. Alternatively, service providers broadcast service announcements periodically, notifying others about their services. A provider can announce only its own services [52] or all services it knows of [24]. Recipients of these announcements store them in a local cache. The cache can then be used to answer discovery requests of local [52] or both local and remote clients [24], thereby enhancing the system's efficiency.
Fig. 3 Peer-based discovery
The main advantage of peer-based approaches is their simplicity and flexibility. This makes them specifically suited for highly dynamic environments. On the downside, to detect new services continuously, devices have to send requests and/or announcements regularly, which largely limits the scalability of peer-based approaches due to the resulting communication overhead and energy consumption. A service discovery system that specifically tries to reduce this energy consumption is DEAPspace [29]. It uses synchronized time windows to broadcast service announcements. Between these windows, devices can be deactivated to save energy. Service descriptions are replicated on all devices and announced collectively. To lessen communication overhead, only one device sends its announcement in each cycle, preferably one that offers services not yet known to the others or that has much remaining energy. Still, all service descriptions must be broadcast regularly, even if no client is interested in them. In addition, to enable new devices to join in, devices have to stay active for more than 50% of their time to ensure overlapping time windows.
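The announcement-and-cache scheme can be approximated with plain IP multicast, as in the following hedged sketch: every device periodically announces its services on a well-known group address, and every device caches the announcements it overhears to answer local lookups. The group address, port and message format are arbitrary choices for this example and are not taken from any of the cited systems.

import java.net.*;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Peer-based discovery sketch: announce own services, cache announcements from others.
public class PeerDiscovery {

    private static final String GROUP_ADDRESS = "239.1.2.3";   // example multicast group
    private static final int PORT = 4446;

    private final Map<String, String> cache = new ConcurrentHashMap<>(); // service -> provider

    // Broadcast a service announcement, e.g. announce("printing", "hall-device").
    public void announce(String service, String provider) throws Exception {
        byte[] data = (service + "@" + provider).getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(data, data.length,
                    InetAddress.getByName(GROUP_ADDRESS), PORT));
        }
    }

    // Listen for announcements of other peers and keep them in the local cache.
    public void listen() throws Exception {
        InetSocketAddress group = new InetSocketAddress(InetAddress.getByName(GROUP_ADDRESS), PORT);
        NetworkInterface nif = NetworkInterface.networkInterfaces()
                .filter(PeerDiscovery::usable)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no multicast-capable interface"));
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            socket.joinGroup(group, nif);
            byte[] buffer = new byte[256];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                socket.receive(packet);
                String[] parts = new String(packet.getData(), 0, packet.getLength(),
                        StandardCharsets.UTF_8).split("@", 2);
                if (parts.length == 2) cache.put(parts[0], parts[1]);  // fill the local cache
            }
        }
    }

    // Answer a local discovery request from the cache, without any network traffic.
    public String lookup(String service) {
        return cache.get(service);
    }

    private static boolean usable(NetworkInterface nif) {
        try {
            return nif.isUp() && nif.supportsMulticast() && !nif.isLoopback();
        } catch (SocketException e) {
            return false;
        }
    }
}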
3.3.2 Mediator-based Discovery

In contrast to peer-based discovery systems, mediator-based service discovery approaches (e.g., [49], [25], [51]) delegate service discovery to a number of special devices – the mediators – that manage a possibly distributed service registry on behalf of all devices (see Figure 4). The mediators can either be independent from each other [49], coordinate their entries with each other [51], or form a hierarchy of mediators [1]. To publish services, providers register their services at the registry. Clients query it to detect services. As an optimization, clients can register their required services at the mediators, which then provide them with up-to-date information about new services via updates. Therefore, no broadcast is needed to detect or announce services, resulting in less communication overhead than with peer-based approaches. This makes mediator-based approaches especially suited to create highly scalable systems (e.g. [25], [51]).
Fig. 4 Mediator-based Discovery
Their main drawback is that without a mediator in the vicinity, no discovery is possible. This is specifically problematic in highly dynamic environments without fixed infrastructure devices, because we cannot assume continuous connectivity to a specific device that could act as mediator. SLP [25] handles missing mediators by allowing nodes to switch to peer-based discovery if they do not find a mediator. Still, mediators are predefined. As they cannot sleep, battery-operated mediators will run out of energy quickly, forcing all other nodes to switch back to peer-based discovery. In addition, there may be more than one mediator in the vicinity, e.g. if two people come together with their devices. Gaia [36] uses a heartbeat concept for service discovery. As long as a service is available in a given environment, it sends periodic heartbeat messages to the mediator, the so-called PresenceService. As a result, the PresenceService sends enter and leave messages on a service presence channel that interested clients can listen to. An example of an energy-efficient mediator-based discovery system is SANDMAN [45]. Similar to DEAPspace, SANDMAN saves energy by deactivating devices. However, it does so by grouping devices with similar movement patterns into dynamic clusters. Each cluster selects a mediator that takes over service discovery for its cluster. The mediator stays active and answers discovery requests from clients. All other devices power themselves down until they are needed. This approach is more complicated than DEAPspace but scales to larger systems and allows longer deactivation times.
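The following sketch illustrates the mediator side of such an approach: providers register services at a central registry, and clients either query it directly or subscribe to be notified about new providers. The interface is an illustrative assumption and does not correspond to SLP, Jini or the Gaia PresenceService.

import java.util.*;
import java.util.concurrent.*;
import java.util.function.Consumer;

// Mediator sketch: providers register services, clients query or subscribe for updates.
public class Mediator {
    private final Map<String, Set<String>> registry = new ConcurrentHashMap<>();      // type -> providers
    private final Map<String, List<Consumer<String>>> interests = new ConcurrentHashMap<>();

    // Called by a service provider to publish a service of the given type.
    public void register(String type, String provider) {
        registry.computeIfAbsent(type, t -> ConcurrentHashMap.newKeySet()).add(provider);
        // Push an update to every client that registered interest in this type.
        interests.getOrDefault(type, List.of()).forEach(listener -> listener.accept(provider));
    }

    // One-shot query: no broadcast needed, the mediator answers from its registry.
    public Set<String> discover(String type) {
        return registry.getOrDefault(type, Set.of());
    }

    // Standing query: the client is notified about new providers as they appear.
    public void subscribe(String type, Consumer<String> listener) {
        interests.computeIfAbsent(type, t -> new CopyOnWriteArrayList<>()).add(listener);
    }

    public static void main(String[] args) {
        Mediator mediator = new Mediator();
        mediator.subscribe("printer", p -> System.out.println("new printer: " + p));
        mediator.register("printer", "hall-printer");          // triggers the notification
        System.out.println(mediator.discover("printer"));
    }
}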
4 Context Management

As applications adapt to changes in the environment, the relevant parameters of the environment need to be modeled, and changes have to be propagated to interested or affected applications in an appropriate manner. Context management denotes the task of context modeling, acquisition and provisioning, where context relates to the relevant environmental information of an entity.
We will first take a look at context itself and established definitions. Thereafter we will discuss the three major responsibilities of context management, namely acquisition and fusion, modeling and distribution, and provisioning and access.
After more than a decade of research in context management, a number of definitions exist. We do not want to present a survey but discuss two classical definitions. Schilit, Adams and Want identify three important aspects of context [46]. These are

• where you are,
• who you are with, and
• what resources are nearby.

Thus, they classify context-aware systems according to their adaptation to the location of use, the collection of nearby people resembling the social context, and the available network and computing infrastructure. Also, the change in these three sets over time plays a crucial role in the adaptation of a context-aware application. A context-aware application must be capable of examining the available context and adapting accordingly. Dey provided one of the first approaches that integrated a framework for adaptive applications, the context toolkit [15], along with software engineering aspects; a number of applications have been investigated with it. Dey defines context as any information that can be used to characterize the situation of an entity [14]. Entities can be persons, places, or any object that is considered relevant to the interaction between a user and an application, including the user and the application themselves. This definition specifically addresses applications that are user centered. Dey further defines a system to be context-aware if it uses context to provide relevant information and/or services to the user. Relevancy depends on the user's task. A further classification leads to three classes of applications:

• Context-aware presentation: the application changes its appearance depending on context information. Examples are navigation systems that change the level of detail depending on the speed, or multi-modal applications that can select different output media, e.g., audio or video, based on context information such as environmental noise.
• Automatic execution of a service: in this class, applications are triggered depending on context information. Examples are reminder services that notify users based on configured patterns, e.g., during a shopping tour, or the automatic execution of services in an automated home.
• Tagging of context to information for later retrieval: this class of applications allows augmenting the daily environment of the user with virtual artifacts. Messages that are linked to places or objects can become visible to users due to a distinct context, e.g., their identity, their activity, or their role.

So far we have seen, from a rather abstract view, how applications can utilize context in order to change their behavior.
Fig. 5 Generic Model of Context Management
A yet open question is the management of context. Figure 5 shows a simple and generic model of context management. The physical world is situated at the bottom of the model. Sensors – which may also be users entering data about our environment into an information system – feed the context model. The context model thus contains a more or less comprehensive abstraction of a part of our physical environment. This allows applications to reason about context and take actions. The interfaces between sensors and the context model as well as between the context model and applications can be synchronous (pull) or asynchronous (push). Finally, applications can also provide context information to the model, e.g., user preferences or the tagging of context information. We will briefly look at some relevant aspects of context management in the next subsections.
4.1 Acquisition and Fusion

Since context to a large degree is related to the physical world, we need sensors to capture its state. This immediately leads to two major problems that have to be handled:

• Accuracy: sensors measure and, by doing so, exhibit a deviation with respect to the actual value of the measured entity. Sensors typically specify a mean and maximum error for their readings. A GPS reading, for example, refers to a sphere in space whose size depends on the number of visible satellites and may be improved by additional correction signals. Temperature sensors only measure with a certain precision. As a consequence, one can hardly handle context data as facts with discrete values. Intervals, possibly enriched with a distribution function of the values, can reflect this.
• Freshness: once a value is read, it ages. Depending on the characteristics of the measured entity, sensor readings may have a rather long period during which their value can be considered "up to date". Once we know how values in the physical world can change, e.g., the maximum speed of an object, we can estimate the maximum deviation over time. This can help us to reduce the rate at which sensors are read. Basically, this is a transformation from freshness into accuracy once we know the rate of change of the monitored entity.

Although we can only gather information about our environment with a certain accuracy, there are methods to gain a more accurate view. First, combinations of sensors can be used. Using different technologies can help to reduce errors. Even the same sensor technology can help to gain more accurate readings, e.g., when images are taken from different perspectives. In contrast to redundant or multiple readings from the same or similar sensor technology, one can use additional information in order to increase the accuracy or reduce the error probability. If, for example, an indoor positioning system returns two rooms as possible positions, additional information can help to narrow down the position. Such additional information could be login information at a computer or the keystroke activity of an instant messenger. Combining different sensor information in order to reduce errors or to deduce other context information is called sensor fusion. Deriving higher-level context information from low-level sensor data is a typical task of sensor fusion. As an example, consider a meeting room that allows readings of light conditions, temperature, ambient noise, etc., e.g., via the ContextCube [3]. A situation with closed blinds, artificial light (which can be determined by the wavelength), and one person talking can be inferred to be a presentation. Rules for sensor fusion and reasoning require adequate modeling of sensor information and of the relations between different sensors. We will talk briefly about context modeling in the next section.
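The meeting-room example can be turned into a small fusion rule, as sketched below. Readings carry an error margin and a timestamp so that stale values are ignored; all thresholds and sensor names are invented for this illustration and are not taken from the ContextCube.

import java.time.Duration;
import java.time.Instant;

// Fusing low-level sensor readings into a higher-level situation ("presentation").
public class SituationFusion {

    // A reading carries its value, an error margin and the time it was taken.
    public record Reading(double value, double error, Instant takenAt) {
        boolean fresh(Duration maxAge) {
            return Duration.between(takenAt, Instant.now()).compareTo(maxAge) <= 0;
        }
    }

    private static final Duration MAX_AGE = Duration.ofSeconds(30);   // freshness bound

    // Simple fusion rule for the meeting-room example: closed blinds,
    // artificial light and roughly one dominant speaker suggest a presentation.
    public static boolean isPresentation(Reading blindsClosed, Reading lightWavelengthNm,
                                         Reading speakerCount) {
        if (!blindsClosed.fresh(MAX_AGE) || !lightWavelengthNm.fresh(MAX_AGE)
                || !speakerCount.fresh(MAX_AGE)) {
            return false;                         // stale readings: refuse to conclude anything
        }
        boolean closed = blindsClosed.value() > 0.5;
        boolean artificialLight = lightWavelengthNm.value() > 570;     // invented threshold
        boolean oneSpeaker = Math.abs(speakerCount.value() - 1.0) <= speakerCount.error();
        return closed && artificialLight && oneSpeaker;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(isPresentation(
                new Reading(1.0, 0.0, now),        // blinds reported closed
                new Reading(600, 10, now),         // light spectrum around 600 nm
                new Reading(1.2, 0.5, now)));      // roughly one active speaker
    }
}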
4.2 Modeling and Distribution

Context acquisition can be a costly task. Besides simple and cheap sensors, context can also be gathered by expensive methods, e.g., surveying. In order to mitigate the resulting costs, sharing them among a number of applications can be helpful. If a single institution cannot handle the costs, sharing context information across institutions can pave the way. A popular example is map data. Although it is quite expensive to gather, the scale of its applications, from classical maps to modern navigation systems, keeps the costs affordable for the individual user. A prerequisite for sharing is standardization. This can be done by standardized interfaces: the internal representation of the data is decoupled from the applications using it by a set of functions. This eases the use of data that is distributed across a number of services. However, it still requires that the retrieved information can be interpreted consistently across a number of applications. This leads to the second prerequisite: standardized data.
Applications need to interpret the data, so they obviously need type information. There is a plethora of different context modeling schemes, which can only be covered by a brief discussion here. Simple approaches use name-value pairs. Clearly, this requires that every application knows the names of the context concepts and interprets the associated data in the same way. Name-value pairs do not allow relating concepts to each other. Thus, neither the specialization of existing concepts nor the modeling of hierarchies, e.g., the inclusion of locations, is possible. Object-oriented modeling techniques allow modeling the specialization of concepts and thus the extension of existing hierarchies. However, additional relations between the concepts cannot be handled and the modeling is restricted to classes. Ontologies offer additional means for modeling and allow reasoning over context. This is, however, a complex field of its own. The interested reader is referred to [48].
Another aspect of distribution is the placement of data in a set of distributed services. This can help to balance requests across services and thus allow scalability. It requires directory services that forward the requests to the servers hosting the information. Clearly, such directories should not become performance bottlenecks or single points of failure themselves.
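The difference between flat name-value pairs and a richer model can be made tangible with the following small sketch, in which an explicit location hierarchy answers a containment question that string-valued pairs cannot; the classes are purely illustrative.

import java.util.HashMap;
import java.util.Map;

// Flat name-value context vs. a small location hierarchy with containment.
public class ContextModels {

    // Name-value pairs: easy to store, but "Room 2.17" and "Building 2" stay unrelated.
    static Map<String, String> flatContext = new HashMap<>(Map.of(
            "user.location", "Room 2.17",
            "printer.location", "Building 2"));

    // An explicit model can capture the inclusion of locations.
    record Location(String name, Location parent) {
        boolean isWithin(Location other) {
            for (Location l = this; l != null; l = l.parent()) {
                if (l.name().equals(other.name())) return true;
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Location building2 = new Location("Building 2", null);
        Location floor2 = new Location("Floor 2", building2);
        Location room217 = new Location("Room 2.17", floor2);

        // The hierarchy answers "is the user in the same building as the printer?"
        System.out.println(room217.isWithin(building2));   // true
        // The flat model can only compare strings, which fails here.
        System.out.println(flatContext.get("user.location")
                .equals(flatContext.get("printer.location")));  // false
    }
}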
4.3 Provisioning and Access

As discussed above, the interface to a context model – realized as a context service – is crucial in order to provide suitable abstractions for applications. Dey as well as Bauer, Becker and Rothermel identify three classes of context that are relevant for context access:

• identity: as in regular databases, context information can be identified by a unique key. This allows querying for context information in a direct way.
• location: since context information is related to our physical environment, many queries will relate to objects and their spatial relations, i.e., the nearest objects (nearest-neighbor queries), objects residing in a distinct spatial area (range queries), and queries for an object's position (position queries).
• time: past context can be stored and retrieved later, allowing for, e.g., lifelogging applications. Also, some context, e.g., weather or traffic jams, can be predicted. Thus, a context model should support queries for past, current and future context.

Not every context model has to support all these queries. Also, the kind of query access can differ. Applications can poll context by issuing queries to the context model, similar to classical databases. Since many applications react to context changes, it can ease their development when they can instead register for a notification that is issued when the context changes.
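A possible shape of such an access interface is sketched below, covering identity, location and time queries as well as change notifications. The method names and signatures are assumptions made for this illustration, not the interface of any particular context service.

import java.time.Instant;
import java.util.List;
import java.util.Optional;
import java.util.function.Consumer;

// Illustrative access interface of a context service, covering the three query classes.
public interface ContextService {

    // Identity: look up the context of a known entity by its unique key.
    Optional<ContextValue> get(String entityId);

    // Location: spatial queries such as range and nearest-neighbor lookups.
    List<ContextValue> inRange(double lat, double lon, double radiusMeters);
    Optional<ContextValue> nearest(double lat, double lon);

    // Time: retrieve past context or a prediction for a given instant.
    Optional<ContextValue> at(String entityId, Instant when);

    // Push-style access: notify the application instead of forcing it to poll.
    void onChange(String entityId, Consumer<ContextValue> listener);

    record ContextValue(String entityId, String attribute, String value, Instant observedAt) {}
}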
5 Application Adaptation

In general, the distraction-free and intuitive task support envisioned by pervasive computing cannot be achieved by a single device alone. As Mark Weiser pointed out [54], the real power of pervasive computing emerges from the interaction of many devices. Given that only very simple tasks can be supported by one device, thorough task support requires some form of distributed coordination to integrate the unique capabilities of multiple devices. In pervasive computing environments, this integrating coordination of devices is specifically challenging for several reasons. First and foremost, due to the embedding of devices into everyday objects, it is rather unlikely that different environments will exhibit similar sets of devices. Secondly, due to mobility, failures and continuous evolution, the set of devices that is available in a certain environment changes over time, and many of the changes cannot be predicted reliably. Finally, even if the set of devices is known in advance and does not change, different users may require or prefer the utilization of different devices for the same task in the same environment.
As a consequence, the coordination of the devices cannot be done statically in advance. In order to support diverse settings and to provide distraction-free task support for different users, the coordination must be done on the basis of the target environment and user. Furthermore, since the environment may change at any given point in time, the coordination must be able to adapt to many changes. Without appropriate coordination support at the middleware layer, coordination must be handled by the application layer. Besides complicating the task of the application developer, this approach has a number of significant drawbacks. For instance, if some parts of the coordination logic cannot or shall not be handled by the application, e.g. to support customization for different users, it is hard to provide an application-overarching customization interface. Similarly, if multiple applications are executed at the same time, their coordination logic might interact in unforeseeable and undesirable ways.
In the past, researchers have proposed two alternative coordination approaches, which are depicted in Figure 6. Both have been supported by corresponding middleware systems and, in principle, they are complementary. However, they have not been integrated in a single unifying system so far. The first approach, inter-application adaptation, coordinates the execution of several non-distributed applications [19], [34]. The second approach, intra-application adaptation, strives for the coordination of a single application that is distributed across several devices [13], [6], [42], [33]. The following sections provide a detailed look at each of the two approaches and discuss exemplary middleware services to support the arising tasks.
Fig. 6 Application Adaptation
5.1 Inter-Application Adaptation

The goal of inter-application adaptation is to provide thorough task support by coordinating the execution of multiple applications across a set of devices. Usually, the applications themselves are not distributed and they are not aware of each other. This enables the utilization of arbitrary compositions; however, it does not necessarily imply that the applications do not interact with each other at all. They may interact either through some form of mediated communication or they may rely on classical mechanisms such as files. A charming benefit of inter-application adaptation is that it enables the reuse of traditional applications in pervasive computing environments with minor or even no modifications. Considering the amount and the value of available applications, this argument is quite convincing. In addition, inter-application adaptation can shield the application developer completely from all aspects of distribution, since the coordination does not require application support. Last but not least, as applications do not interact with each other directly, inter-application adaptation facilitates robustness and extensibility.
To provide support for inter-application adaptation, middleware can take care of multiple tasks.
First, the middleware can determine the set of applications that shall be executed. Upon execution, it can supply the applications with user-specific session state. Secondly, the middleware can detect changes and adapt the set of applications accordingly. If one application shall be replaced with another one, it can provide transparent transcoding services for user data. Finally, the middleware can also provide services to facilitate the interaction of applications. Clearly, not all existing middleware systems that target inter-application adaptation provide all of these services. The following briefly describes how two prominent middleware systems, namely Aura [19] and iRos [34], support inter-application adaptation. To understand their mechanisms, it is noteworthy that both middleware systems are targeted at smart environments; thus, they assume that some configuration can be done manually and that it is possible to implement centralized services on a device with guaranteed high availability.
To manage application compositions automatically, Aura introduces a task abstraction. As indicated by the name, a task represents a user task including the relevant data. Examples for tasks are the preparation of a certain document or presentation. Tasks can be specified by the user through a graphical user interface. When a user wants to start or continue a specified task, Aura automatically maps the task to a certain set of applications on a set of the currently available devices. This mapping is done automatically by the PRISM component of Aura. To do this, PRISM relies on a so-called environment manager, a centralized component that maintains a repository of the available applications. In order to supply the applications with the user data and in order to deal with disconnections, Aura relies on the advanced capabilities of the Coda distributed file system [44]. Thereby, Aura can also adapt the fidelity of the data using Odyssey [30] in order to optimize the execution despite fluctuating network quality. However, interaction between applications that are supporting the same task is not supported by any of the services of Aura.
While iRos does not support an automated mapping between user tasks and applications, it provides services that are specifically targeted at interaction support. To prevent a tight coupling of applications, iRos introduces a so-called event heap [26]. The event heap allows applications to post and receive notifications. The notifications can be generated and processed by applications that have been specifically developed with the event heap in mind or by scripts that extend existing applications that are not aware of iRos. A presentation application, for example, may generate an event whenever the slide changes. Some other application may then use the corresponding notifications to dim the light in the room as long as the presentation is running. In addition to pure event distribution, the event heap also stores events for a limited amount of time. This reduces timing constraints between applications and can be used to recover from application failures. For example, by processing the recent events stored in the event heap, a script may be able to restore the internal state of an application after a failure. Just like Aura, iRos also takes care of data management, but it uses a slightly different approach. Instead of reusing an existing distributed file system, iRos introduces a so-called data heap.
Besides persistent data storage, the data heap can also provide automated type conversions on the basis of the Paths system [27]. Thus, instead of selecting an application on the basis of the data type of the input files, the user can simply select an application that provides the desired feature set.
Despite its undeniable benefits, inter-application adaptation is not without issues. On the one hand, mediated communication or indirect interaction by sharing session data can ensure that individual applications are not tightly coupled. On the other hand, however, it only supports weak forms of coordination between applications well. In cases where stronger forms are required, e.g. if a certain action should take place shortly after a certain event occurred, additional middleware support can greatly simplify application development. Similarly, if a single application must be distributed to fulfill its task, e.g. because it requires resources that cannot be provided by a single device, the application does not benefit from inter-application adaptation. Mitigating this is the primary goal of intra-application adaptation.
5.2 Intra-Application Adaptation

The goal of intra-application adaptation is to provide thorough task support by coordinating the execution of a distributed application across a set of devices. The level of integration that can be achieved by this approach exceeds what multiple independent applications can achieve. Providing thorough task support without coordinating the actions of multiple devices is often simply not possible. For example, a collaboration application that runs on several mobile devices might need to share data in a tightly controlled way. Unlike inter-application adaptation, which can benefit from an unlimited degree of compositional flexibility, a distributed application usually requires the availability of certain functionality. For example, in order to print a document, a word processing application requires a printer. In addition to minimum requirements, an application may also want to achieve certain optimization goals that may depend on user preferences, e.g. to use the nearest or the cheapest printer. As a consequence, the flexibility of the composition of functionalities is usually limited by application requirements and optimization criteria. Depending on the complexity of the application, the configuration and runtime adaptation of the composition can be quite complicated. As a result, a primary goal of many pervasive computing middleware systems that support intra-application adaptation is the automation, or at least the simplification, of these aspects. A noteworthy exception is the approach taken by Speakeasy [17]. Instead of automating the composition, Speakeasy proposes the manual composition of a distributed application from components that can be recombined without restrictions. Thus, a user can create compositions that have not been foreseen at development time. However, manual composition is hardly viable for applications that consist of a larger number of components or for environments that exhibit a higher degree of dynamics.
The system-supported configuration and runtime adaptation of a composition requires a model of the application that captures the individual building blocks, their dependencies on each other, and their dependencies on the environment.
The models that are applied in practice can be classified by the supported degree of flexibility. At the lowest level, the model contains a static set of building blocks with static dependencies. As a consequence, a valid configuration can be formed on any set of devices that fulfills the requirements and runs the specified building blocks of the application. This approach is taken by the MetaGlue [13] agent platform, for example. Although the approach results in simple applications that must only deal with a fixed and known set of code, the lack of flexibility can be a significant shortcoming in practical scenarios. One reason for this is that the approach requires devices to run an application-defined building block. In many scenarios this requires that the devices are capable and willing to install additional code, or it simply restricts the set of possible devices to those that are equipped with the code already.
In order to mitigate this, it is possible to refrain from restricting a building block to a specific implementation. In order to ensure the correctness of the application despite the unknown implementation, it is necessary to model the required building blocks using a detailed functional and non-functional description. As this description must be provided by the application developer, this approach increases the modeling effort. However, in return, it becomes possible to use alternative implementations of the same functionality without risking undesirable effects on the application behavior. In addition to broadening the number of target devices, this indirect specification also facilitates the evolution of individual building blocks.
Yet, having a static set of building blocks may be limiting in cases where one building block can be replaced by a combination of building blocks. The same holds true in cases where the number of building blocks cannot be specified in advance, e.g. because it depends on the environment. To support such scenarios, it is possible to introduce flexible dependencies into the application model, and several middleware systems follow this approach. The exact implementation, however, differs depending on the middleware. The Gaia system [36], for example, relies on an application model that allows the utilization of a varying number of components to display or perceive some information. This can be used to adapt the output of an application to the number of users or to the number of available screens. The PCOM component system [6] attaches dependencies to each component and requires that all recursively occurring dependencies are satisfied by another component. As a result, applications in PCOM are essentially trees of components that are composed along their dependencies. The system proposed in [33] also relies on recursion. However, instead of attaching dependencies to components, the authors introduce an additional level of indirection called Goals [43]. Goals represent a specific user goal that can be decomposed automatically into a set of simpler sub-Goals. After the decomposition has succeeded, the Goals are mapped to components using Techniques. A Technique essentially resembles a script that defines how the components need to be connected; thus, it is possible to support more complex composition structures than in PCOM. Although a flexible dependency specification is appealing at first glance, it is noteworthy that the resulting configuration problems are often quite complex, which may limit the scalability of such approaches.
On the basis of the application model, pervasive computing middleware can then provide services to automate the configuration and runtime adaptation. The canonical approach is to construct a configuration that satisfies the requirements defined by the application model. The complexity of this depends heavily on the underlying problem and on the completeness of the construction procedure. If the construction systematically enumerates all solutions, it can often exhibit an exponential runtime [33], [21]. To mitigate this, it is possible to trade off completeness for increased performance [6]. Alternatively, if the environment is mostly static, the configuration can be computed once and reused multiple times [35]. During the construction, additional optimization goals can be introduced in the form of a utility function [33]. The concrete implementation of the utility function may be specified by the user or the application developer. In the case of runtime adaptation, the utility function may encompass cost components [23]. Usually, the cost components model the effort for changing the running configuration, which depends on the application model and the adaptation mechanisms. During the construction of the configuration, the utility value is usually optimized heuristically, as a complete optimization would lead to considerable performance penalties. Middleware systems that are based on the organizational model of a smart environment can assume the permanent presence of a powerful computer that may be used to compute configurations. As a consequence, these systems usually perform the configuration centrally [33]. In peer-based systems, the availability of a single powerful device cannot be guaranteed. As a result, the computer that is used for configuration is either selected dynamically [47] or the configuration process is distributed among the peers [21].
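As a rough illustration of such a construction procedure, the following sketch resolves the dependencies of a requested functionality recursively and greedily against the components offered by the available devices. It is inspired by tree-structured compositions such as those of PCOM but does not reproduce any actual system; cycles in the dependencies and optimization via utility functions are deliberately ignored.

import java.util.*;

// Greedy recursive configuration: resolve each dependency with the first matching
// component offered by some device; a complete (backtracking) search would be needed
// to guarantee finding a configuration whenever one exists.
public class Configurator {

    // A component offers one functionality and may depend on further functionalities.
    record Component(String device, String offers, List<String> dependencies) {}

    private final List<Component> available;

    Configurator(List<Component> available) { this.available = available; }

    // Returns the components chosen for the given functionality, or empty if unresolvable.
    Optional<List<Component>> resolve(String functionality) {
        for (Component candidate : available) {
            if (!candidate.offers().equals(functionality)) continue;
            List<Component> plan = new ArrayList<>(List.of(candidate));
            boolean ok = true;
            for (String dep : candidate.dependencies()) {
                Optional<List<Component>> sub = resolve(dep);
                if (sub.isEmpty()) { ok = false; break; }
                plan.addAll(sub.get());
            }
            if (ok) return Optional.of(plan);     // greedy: first working candidate wins
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Configurator c = new Configurator(List.of(
                new Component("tablet", "editor", List.of("display", "printing")),
                new Component("wall-screen", "display", List.of()),
                new Component("hall-printer", "printing", List.of())));
        c.resolve("editor").ifPresent(plan -> plan.forEach(
                comp -> System.out.println(comp.offers() + " on " + comp.device())));
    }
}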
6 Conclusion

The development of pervasive applications that provide seamless, intuitive and distraction-free task support is a challenging task. Pervasive computing middleware can simplify application development by providing a set of supportive services. The internal structure of these services is heavily influenced by the overall organization of the pervasive system and the targeted level of support. Three main services that can be provided by pervasive computing middleware are support for spontaneous interaction, support for context management, and support for application adaptation. As a basis for the cooperation of the heterogeneous devices found in pervasive systems, middleware support for spontaneous interaction alleviates the handling of low-level communication issues. Beyond communication, support for context management ensures that application developers do not have to deal with the intricacies of gathering information from a multitude of unreliable sensors. Finally, support for application adaptation simplifies the task of coordinating a changing set of devices to provide a seamless and distraction-free user experience. Although these services can greatly simplify many important tasks of the developer, the development of intuitive task support remains a significant challenge.
As new scenarios and devices continue to emerge, application developers have to invent and explore novel ways of supporting user tasks with the available technology. Undoubtedly, this will result in additional challenges that may at some point spawn the development of further middleware systems and services.
References

[1] Adjie-Winoto W, Schwartz E, Balakrishnan H, Lilley J (1999) The design and implementation of an intentional naming system. In: Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP '99), ACM
[2] Joseph AD, Tauber JA, Kaashoek MF (1997) Mobile computing with the Rover toolkit. IEEE Transactions on Computers 46(3):337–352
[3] Bauer M, Becker C, Hähner J, Schiele G (2003) ContextCube - providing context information ubiquitously. In: Proceedings of the 23rd International Conference on Distributed Computing Systems Workshops (ICDCS 2003)
[4] Becker C, Geihs K (2000) Generic QoS support for CORBA. In: Proceedings of the 5th IEEE Symposium on Computers and Communications (ISCC 2000)
[5] Becker C, Schiele G, Gubbels H, Rothermel K (2003) BASE - a micro-broker-based middleware for pervasive computing. In: Proceedings of the IEEE International Conference on Pervasive Computing and Communications (PerCom)
[6] Becker C, Handte M, Schiele G, Rothermel K (2004) PCOM - a component system for adaptive pervasive computing applications. In: 2nd IEEE International Conference on Pervasive Computing and Communications (PerCom 04)
[7] Birrell AD, Nelson BJ (1984) Implementing remote procedure calls. ACM Transactions on Computer Systems 2(1):39–59
[8] Bishop J, Horspool N (2006) Cross-platform development: Software that lasts. In: SEW '06: Proceedings of the 30th Annual IEEE/NASA Software Engineering Workshop, IEEE Computer Society, Washington, DC, USA, pp 119–122
[9] Blair G, Coulson G, Andersen A, Blair L, Clarke M, Costa F, Duran-Limon H, Fitzpatrick T, Johnston L, Moreira R, Parlavantzas N, Saikoski K (2001) The design and implementation of Open ORB version 2. IEEE Distributed Systems Online 2(6)
[10] Blair GS, Coulson G, Robin P, Papathomas M (2000) An architecture for next generation middleware. In: Proceedings of Middleware 2000
[11] Carriero N, Gelernter D (1986) The S/Net's Linda kernel. ACM Transactions on Computer Systems 4(2):110–129
[12] Chappell D (2006) Understanding .NET (2nd Edition). Addison-Wesley Professional
[13] Coen M, Phillips B, Warshawsky N, Weisman L, Peters S, Finin P (1999) Meeting the computational needs of intelligent environments: The Metaglue system. In: 1st International Workshop on Managing Interactions in Smart Environments (MANSE '99), pp 201–212
[14] Dey AK (2001) Understanding and using context. Personal and Ubiquitous Computing 5(1):4–7
[15] Dey AK, Abowd GD (2000) The context toolkit: Aiding the development of context-aware applications. In: Proceedings of the Workshop on Software Engineering for Wearable and Pervasive Computing
[16] Eddon G, Eddon H (1998) Inside Distributed COM. Microsoft Press
[17] Edwards KW, Newman MW, Sedivy JZ, Smith TF, Balfanz D, Smetters DK (2002) Using Speakeasy for ad hoc peer-to-peer collaboration. In: 2002 ACM Conference on Computer Supported Cooperative Work, pp 256–265
[18] Eugster PT, Felber PA, Guerraoui R, Kermarrec AM (2003) The many faces of publish/subscribe. ACM Computing Surveys 35(2):114–131
[19] Garlan D, Siewiorek D, Smailagic A, Steenkiste P (2002) Towards distraction-free pervasive computing. IEEE Pervasive Computing 1(2):22–31
[20] Grimm R, Davis J, Lemar E, MacBeth A, Swanson S, Anderson T, Bershad B, Borriello G, Gribble S, Wetherall D (2004) System support for pervasive applications. ACM Transactions on Computer Systems 22(4):421–486
[21] Handte M, Becker C, Rothermel K (2005) Peer-based automatic configuration of pervasive applications. Journal on Pervasive Computing and Communications, pp 251–264
[22] Handte M, Schiele G, Urbanski S, Becker C (2005) Adaptation support for stateful components in PCOM. In: Proceedings of Pervasive 2005, Workshop on Software Architectures for Self-Organization: Beyond Ad-Hoc Networking
[23] Handte M, Becker C, Schiele G, Herrmann K, Rothermel K (2007) Automatic reactive adaptation of pervasive applications. In: IEEE International Conference on Pervasive Services (ICPS '07)
[24] Helal S, Desai N, Verma V, Lee C (2003) Konark - a service discovery and delivery protocol for ad-hoc networks. In: Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC 2003), vol 3, pp 2107–2113
[25] Hodes TD, Czerwinski SE, Zhao BY, Joseph AD, Katz RH (2002) An architecture for secure wide-area service discovery. Wireless Networks 8(2/3):213–230
[26] Johanson B, Fox A (2002) The event heap: A coordination infrastructure for interactive workspaces. In: 4th IEEE Workshop on Mobile Computing Systems and Applications, pp 83–93
[27] Kiciman E, Fox A (2000) Using dynamic mediation to integrate COTS entities in a ubiquitous computing environment. In: 2nd International Symposium on Handheld and Ubiquitous Computing, pp 211–226
[28] Ledoux T (1999) OpenCorba: A reflective open broker. In: Proceedings of the 2nd International Conference on Reflection (Reflection '99), pp 197–214
[29] Nidd M (2001) Service discovery in DEAPspace. IEEE Personal Communications 8(4):39–45
[30] Noble B, Satyanarayanan M (1999) Experience with adaptive mobile applications in Odyssey. Mobile Networks and Applications 4(4):245–254
[31] Object Management Group (2002) Minimum CORBA specification, revision 1.0
[32] Object Management Group (2008) Common Object Request Broker Architecture (CORBA/IIOP), revision 3.1. Online publication, http://www.omg.org/spec/CORBA/3.1/
[33] Paluska J, Pham H, Saif U, Chau G, Ward S (2008) Structured decomposition of adaptive applications. In: 6th Annual IEEE International Conference on Pervasive Computing and Communications
[34] Ponnekanti SR, Johanson B, Kiciman E, Fox A (2003) Portability, extensibility and robustness in iROS. In: Proceedings of the IEEE International Conference on Pervasive Computing and Communications (PerCom 2003)
[35] Ranganathan A, Chetan S, Al-Muhtadi J, Campbell R, Mickunas D (2005) Olympus: A high-level programming model for pervasive computing environments. In: 3rd IEEE International Conference on Pervasive Computing and Communications, pp 7–16
[36] Román M, Campbell RH (2000) GAIA: Enabling active spaces. In: Proceedings of the 9th ACM SIGOPS European Workshop
[37] Román M, Campbell RH (2001) Unified object bus: Providing support for dynamic management of heterogeneous components. Technical Report UIUCDCS-R-2001-2222 UILU-ENG-2001-1729, University of Illinois at Urbana-Champaign
[38] Román M, Kon F, Campbell RH (1999) Design and implementation of runtime reflection in communication middleware: The dynamicTAO case. In: Proceedings of the 19th IEEE International Conference on Distributed Computing Systems Workshops, Workshop on Electronic Commerce and Web-Based Applications, pp 122–127
[39] Román M, Singhai A, Carvalho D, Hess C, Campbell R (1999) Integrating PDAs into distributed systems: 2K and PalmORB. In: Proceedings of the International Symposium on Handheld and Ubiquitous Computing (HUC '99)
[40] Román M, Mickunas D, Kon F, Campbell RH (2000) LegORB and ubiquitous CORBA. In: Proceedings of the IFIP/ACM Middleware 2000 Workshop on Reflective Middleware
[41] Román M, Kon F, Campbell RH (2001) Reflective middleware: From your desk to your hand. IEEE Distributed Systems Online, Special Issue on Reflective Middleware
[42] Román M, Hess C, Cerqueira R, Ranganathan A, Campbell R, Nahrstedt K (2002) Gaia: A middleware infrastructure to enable active spaces. IEEE Pervasive Computing 1(4):74–83
[43] Saif U, Pham H, Paluska J, Waterman J, Terman C, Ward S (2003) A case for goal-oriented programming semantics. In: System Support for Ubiquitous Computing Workshop, 5th Annual Conference on Ubiquitous Computing
[44] Satyanarayanan M (2002) The evolution of Coda. ACM Transactions on Computer Systems 20(2):85–124
[45] Schiele G, Becker C, Rothermel K (2004) Energy-efficient cluster-based service discovery. In: 11th ACM SIGOPS European Workshop, pp 20–22
[46] Schilit B, Adams N, Want R (1994) Context-aware computing applications. In: Proceedings of the Workshop on Mobile Computing Systems and Applications, pp 85–90
[47] Schuhmann S, Herrmann K, Rothermel K (2008) A framework for adapting the distribution of automatic application configuration. In: 2008 ACM International Conference on Pervasive Services (ICPS '08), pp 85–124
[48] Strang T, Linnhoff-Popien C (2004) A context modeling survey. In: First International Workshop on Advanced Context Modeling, Reasoning and Management (UbiComp 2004)
[49] Sun Microsystems (2001) Jini technology core platform specification, version 1.2. Online publication
[50] Sun Microsystems (2006) JDK 6 Remote Method Invocation (RMI) - related APIs and developer guides. Online publication, http://java.sun.com/javase/6/docs/technotes/guides/rmi/index.html
[51] UDDI.org (2004) UDDI spec technical committee draft, version 3.0.2. Online publication, http://uddi.org/pubs/uddi_v3.htm
[52] UPnP Forum (2008) Universal Plug and Play device architecture, version 1.0, document revision date 24 April 2008. Online publication, http://www.upnp.org/specs/arch/UPnP-arch-DeviceArchitecture-v1.020080424.pdf
[53] Waldo J (1999) The Jini architecture for network-centric computing. Communications of the ACM 42(7):76–82
[54] Weiser M (1991) The computer for the 21st century. Scientific American 265(3):66–75
Case Study of Middleware Infrastructure for Ambient Intelligence Environments

Tatsuo Nakajima
Department of Computer Science and Engineering, Waseda University, Tokyo, Japan, e-mail: [email protected]
1 Introduction

In ambient intelligence environments, computers and sensors are massively embedded in our daily lives [1]. This should not be taken in the narrow-minded sense of "a computer on every desk". Rather, computers will be embedded in everyday objects, augmenting them with information processing capabilities. This kind of embedding would be discreet and unobtrusive: the computers would disappear from our perception, leaving us free to concentrate on the task at hand—unlike today, where a majority of users perceives computers as getting in the way of their work. Although a large part of this vision is already becoming a reality, other parts of the vision, such as usability and universal interoperability, are still far away.
The popularity of the World Wide Web has exploded, and it is easy to forecast that any ambient intelligence infrastructures we will build in the future will include global Internet connectivity. Moreover, the ongoing miniaturization of electronics, together with advances in power saving and wireless communication systems, has brought us to the stage where most of us are happy to carry computing and communication devices all the time. The most popular item for the general public is by far the cellular phone, but many also enjoy PDAs and digital music players. Mobility is now another important aspect of the ambient intelligence scenario, and will be so even more as we evolve from gadgets we carry to gadgets we wear.
The next generation of home appliances is also going to be enhanced with communication and sensing capabilities. You can already buy CD players that connect to the Internet to look up the artist name and track titles for any CD you feed them; but this style of interaction will soon extend to non-electronic goods. The refrigerator will know that you are running out of milk and will send a message to your phone when you are on the way to the shop—or it might reorder it directly by contacting the supplier through the Internet.
The packaged food you wish to reheat will transmit the correct time and power settings to the microwave oven. And the washing machine will ask for confirmation before proceeding if it spots a white lace garment in a load of blue jeans.
There will be appliances enhanced by ambient intelligence technologies not only in our homes, but also in shared spaces such as offices, shops, restaurants and public transport vehicles such as trains. They will be connected to the Internet and might act as gateways (as well as direct interlocutors) for our mobile and wearable devices. We will be able to share and retrieve information wherever we are. Devices and applications will become more responsive and friendly by personalizing their behavior according to the preferences of the person who is using them. The capability of these systems to sense some aspects of their environment will make them adapt to the current situation. Location and environmental information, user identity, even user mood are some of the possible inputs to such systems [33] that will allow applications to change their behavior to fit the circumstances. New interaction techniques will become possible [12, 20, 22, 36, 38]; a variety of sensors will monitor our behavior, and improved models of the real world will provide better awareness of the context for the computing systems [2, 5, 6, 13, 14, 35].
Any physical space can be augmented by embedding computers in it. Examples include the personal space (for body-worn devices), home spaces, vehicle spaces, office spaces, and university campus spaces. As shown in Fig. 1, surrounding objects such as a mirror, cushion, table, and chair will become smart objects, and the integration of these smart objects makes it possible to offer various advanced smart services. Public spaces such as airports, stations, bus stops, and public transport will also be globally connected, and different spaces will have different characteristics. For example, a personal space may contain a cellular phone, a PDA and an MP3 player that are connected by body networks (later these may be complemented by smart clothing). Usually, appliances used in the personal space are small, battery-powered and wireless. On the other hand, in a home space, audio and video appliances are connected by wired high-speed networks and are endowed with generous resources in terms of CPU, memory and energy. As shown in Fig. 2, these smart objects can be used to observe our daily behavior [8, 16] and return immediate feedback to change a user's behavior and make his/her lifestyle more desirable [30, 31].
These attractive future applications in ambient intelligence environments cannot be developed at a reasonable price without the support of adequate middleware infrastructures. In this chapter, we describe four middleware infrastructures developed in our project for building ambient intelligence environments. Our middleware infrastructures offer high level abstractions for specific application domains to hide complexities such as distribution and context awareness. We report how to offer high level abstraction and how to implement non-functional properties hidden in the middleware infrastructure. We also discuss how to implement them on standard infrastructure software and protocols in order to facilitate the development of application services.
The remainder of this chapter is structured as follows. Section 2 presents why middleware infrastructure is important to build ambient intelligence environments.
Fig. 1 Examples of Smart Objects Embedding Small Computers and Sensors
Fig. 2 Ambient Intelligent Applications that Reflect Human Behavior to Motivate a Desirable Lifestyle
In Section 3, we describe the four middleware infrastructures that we developed in our project, and Section 4 presents six lessons learned from this work. We outline future directions in Section 5.
2 High Level Abstraction

For building complex systems, abstraction is a very powerful tool for dealing with complexity, and there are many places where using abstraction is effective. For example, home computing applications need to access a variety of home appliances such as televisions and video cassette recorders. The applications need to access the appliances without considering the differences among implementations in order to be portable [24, 25]. It is important to develop complex applications without taking into account the details of the underlying infrastructures [3, 10, 32, 40]. Also, future application services require new techniques such as spontaneous device/service integration and context awareness [34]. These techniques require good abstractions to build well-structured applications by hiding complexities. Moreover, applications that invoke such abstractions may replace the underlying infrastructure by another implementation of the same abstraction without modifying the application itself.
The behavior of application services in ambient intelligence environments should be adapted to the current situation. For example, the behavior may be personalized according to a user's preference. Also, the most suitable interaction devices for a service may change according to a user's current location. We believe that it is important to hide these changes under the abstraction provided by middleware infrastructures to make the development of application services easier. Application programmers then need not take these changes into account.
Traditional system software like an operating system provides low-level abstractions to control physical resources such as CPU, memory and devices. However, in ambient intelligence environments, application services need to access a variety of sensor devices and services. It is also important to abstract information from sensors and to support the composition of services to build practical ambient intelligence environments in an easy way. Therefore, middleware infrastructures offering high level abstractions are necessary. Developing middleware infrastructures for ambient intelligence is not easy, and we need to take into account the following issues.

• Which abstraction is appropriate?
• How can the complexities in ambient intelligence environments be hidden?
• How can the development cost of middleware infrastructure be reduced?

Middleware infrastructures should offer abstractions that facilitate the development of ambient intelligence services. The most appropriate abstraction differs between application services. In one case, middleware infrastructures should offer standard abstractions, namely if many applications already exist on top of standard interfaces. In another case, middleware infrastructures should offer new high level abstractions. These abstractions are useful to build ambient intelligence services without considering low level details.
There is a variety of complex issues in ambient intelligence environments. For example, there may be various devices that have dramatically different characteristics. This kind of ultra heterogeneity should be hidden in middleware infrastructures.
Also, ambient intelligence services need to control many devices that are connected via networks, but the services themselves should not have to deal with this distribution. In addition, the structure of services should be dynamically reconfigured according to the current situation. These complexities should be hidden from application programmers to make it easy for them to develop their applications.
Middleware infrastructures for ambient intelligence are very complicated, and it is important to reduce their development cost. If the cost is too high, there is no benefit in using them. One way to reduce the development cost is to reuse existing software as much as possible. In ambient intelligence environments, many small computers are embedded in our daily lives. Traditionally, these computers run embedded software that is written from scratch for each system. However, it is important to adopt commodity software for embedded systems in order to reduce the development cost.
3 Case Studies

This section describes four middleware infrastructures that were developed in our project. These middleware infrastructures do not offer generic services for building ambient intelligence services, but facilitate the development of applications for specific domains. The first middleware infrastructure offers a domain-specific high level abstraction for controlling audio and visual home appliances. The second helps to develop distributed augmented reality services and hides complex details behind its interface. The third enables smart objects to be extended with additional functionality incrementally. Finally, the fourth offers context retrieval and analysis for developing context-aware services. These middleware infrastructures are typical for ambient intelligence environments.
3.1 Middleware Infrastructure for Distributed Audio and Video Appliances

3.1.1 Design Issues

A high level abstraction makes it easy to implement a variety of home applications that are composed of different functionalities provided by different appliances. The abstraction also enables us to integrate home appliances with various services on the Internet. However, implementing the high level abstraction is not easy, mainly because of the diversity of the appliances. We considered the following issues while designing the middleware infrastructure.
Fig. 3 Home Appliance System Architecture
• Considering the tradeoff between portability and real-time properties.
• Developing most parts of the middleware infrastructure in a high level programming language such as Java.
• Exploiting the power of high level abstraction.

One approach to addressing these issues is to adopt widely available software platforms. If such platforms can be used on a wide range of appliances, they allow us to build complex software on various appliances in an easy way.
3.1.2 System Architecture

The middleware infrastructure provides a couple of high level abstractions for building new home computing services[24, 25, 39]. As shown in Fig. 3, the middleware infrastructure for home computing implements HAVi[11], a distributed middleware component for audio and visual home appliances. HAVi provides a standard high level API to control various A/V appliances. We chose HAVi as the core component of our distributed middleware for audio and visual home appliances because it is very well documented and real products were expected to become available in the near future, which would make it possible to use them in later experiments. The HAVi component is written in Java, and the Java virtual machine runs on the Linux operating system. Continuous media streams such as DV or MPEG2 are processed by applications written in the C language running directly on Linux.

Our system consists of five components. The first is the Linux kernel, which provides basic functionalities such as thread control, memory management, network systems, file systems and device drivers. The kernel also contains the IEEE 1394 device drivers and MPEG2 decoder drivers required to implement HAVi. The second is a continuous media management component
running on Linux and written in the C language. The component sends and receives media streams to and from IEEE 1394 networks and MPEG2 encoder/decoder devices. In addition, it sends media streams to a speaker and a window system to simulate various audio and visual home appliances, and it encodes, decodes and analyzes media elements in a timely manner. The third component is a Java virtual machine that executes middleware components such as HAVi. Java provides automatic memory management, thread management, and a large set of class libraries for building sophisticated middleware components and home computing applications easily. Java Swing and AWT also enable us to build sophisticated graphical user interfaces for home appliance applications; because of these rich class libraries, many standard middleware components for home computing assume the use of Java. The fourth component is the HAVi middleware component written in Java, described in the next section. The last component is a home computing application program that invokes the HAVi API to control home appliances. This component adds extra functionality to actual home appliances; one of its most important roles is to provide a graphical user interface for controlling them.
3.1.3 Middleware for Networked Home Appliances

A variety of audio/visual home appliances such as TVs and VCRs are currently connected by analog cables, and these appliances are usually controlled individually. Recently, however, there have been several proposals for solving this problem. Future audio/visual home appliances will be connected via the IEEE 1394 interface, and the network will be able to deliver multiple audio and video streams that may use different coding schemes.

The HAVi (Home Audio/Video Interoperability) architecture, proposed by Sony, Matsushita, Philips, Thomson, Hitachi, Toshiba, Sharp and Grundig, is a standard distributed middleware for audio and visual home appliances. It provides the ability to control these appliances in an integrated fashion by offering a standard programming interface. An application can therefore control a variety of HAVi-compliant appliances without taking into account the differences among them, even if they are developed by different vendors.

The HAVi architecture also provides a mechanism for dynamically loading Java bytecode from a target device. The bytecode knows how to control the target appliance, since it implements the protocol for communicating between a client device and that appliance. The client device therefore does not need to know how to control the target appliance, and the target appliance can use the protocol best suited to it. The bytecode may also be retrieved from a server on the Internet, which makes it easy to update the software that controls target appliances. The architecture is very flexible and has the following three advantages.

• The standardized programming interface provides a high level abstraction that hides the details of the appliances. It is therefore possible to build applications that are vendor independent, and an application can adopt future technologies
without modifying the target appliances. This feature enables us to build a new appliance by composing existing appliances.
• The software that controls target appliances can be replaced over the Internet. This feature removes the limitations of traditional home appliances, since their functionalities may be updated.
• The Java bytecode loaded from a target appliance enables the most appropriate protocol to be used. This allows state of the art technologies to be adopted without modifying existing applications when a company releases a new appliance.

In particular, the first and third advantages make it possible to develop prototypes very rapidly, since they allow future technologies to be adopted without modifying existing appliances and home application programs.

The HAVi component consists of nine modules, called software elements in the HAVi specification. The DCM (Device Control Module) and FCM (Function Control Module) are the most important software elements. Each appliance has a DCM for controlling it; the DCM is usually downloaded to a HAVi device, and an application invokes methods defined in the DCM. Each functionality of an appliance is abstracted as an FCM. Currently, tuner, VCR, clock, camera, AV disk, amplifier, display, modem and Web proxy FCMs are defined. Since these FCMs offer a standard API for controlling the underlying device, an application can control appliances without knowing the detailed protocol used to access them.
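The control flow can be illustrated with a small sketch. The interfaces below (Fcm, TunerFcm, Dcm) are simplified stand-ins invented for illustration; they are not the actual HAVi software-element API, which is considerably richer.

```java
// Hypothetical, simplified stand-ins for HAVi-style software elements.
// The real HAVi API differs; this only illustrates the control flow.
interface Fcm {                       // a Function Control Module abstraction
    String kind();                    // e.g. "tuner", "vcr", "display"
}

interface TunerFcm extends Fcm {
    void selectChannel(int channel);  // standard operation exposed by a tuner FCM
}

interface Dcm {                       // a Device Control Module groups the FCMs of one appliance
    java.util.List<Fcm> functions();
}

public class ChannelChanger {
    // The application only talks to the standard FCM interface; it never needs
    // to know the vendor-specific protocol hidden behind the downloaded DCM.
    public static void tuneTo(Dcm appliance, int channel) {
        for (Fcm fcm : appliance.functions()) {
            if (fcm instanceof TunerFcm) {
                ((TunerFcm) fcm).selectChannel(channel);
                return;
            }
        }
        throw new IllegalStateException("appliance has no tuner FCM");
    }
}
```

Whatever bytecode was downloaded for the appliance sits behind these interfaces, which is what makes the application vendor independent.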
3.1.4 Application Scenario

In a program emulating a digital TV appliance, a Linux process receives an MPEG2 stream from an IEEE 1394 device driver and displays it on the screen. The sender program, which simulates a digital TV tuner, receives the MPEG2 stream from an MPEG2 encoder and transmits it to the IEEE 1394 network. Currently, the MPEG2 encoder is connected to an analog TV to simulate a digital TV tuner, and the analog TV is controlled by a remote controller connected to a computer over a serial line; a command received from the serial line is transmitted to the analog TV through infrared.

The receiver process contains three threads. The first thread initializes and controls the program. When the thread is started, it sends a DCM containing a tuner FCM, which includes the Java bytecode, to a HAVi device; in our system, the HAVi device is a Linux based PC. When the HAVi device receives the bytecode, a HAVi application on the device shows a graphical user interface for controlling the emulated digital TV appliance. If a user enters a command through the graphical user interface, the HAVi application invokes a method of the DCM representing the tuner. The HAVi device then executes the downloaded bytecode to send a command to the emulated digital TV appliance, where the first thread receives it. After the second thread in the emulated digital TV appliance receives packets containing the MPEG2 video stream from the IEEE 1394 network, it constructs a video frame and inserts the frame into a time driven buffer. The third thread retrieves the
frame from the buffer and delivers it to an MPEG2 decoder according to the timestamp recorded in the frame. Finally, the decoder delivers the frame to an NTSC line connected to the analog TV display.
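A minimal sketch of the timestamp-driven hand-off between the second and third threads might look as follows. The FrameBuffer shape and the millisecond timestamps are illustrative assumptions only; the actual implementation is written in C on Linux.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Illustrative only: a frame becomes available to the decoder thread
// when its presentation timestamp is reached.
class VideoFrame implements Delayed {
    final byte[] payload;
    final long presentAtMillis;          // timestamp recorded in the frame

    VideoFrame(byte[] payload, long presentAtMillis) {
        this.payload = payload;
        this.presentAtMillis = presentAtMillis;
    }
    public long getDelay(TimeUnit unit) {
        return unit.convert(presentAtMillis - System.currentTimeMillis(),
                            TimeUnit.MILLISECONDS);
    }
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                            other.getDelay(TimeUnit.MILLISECONDS));
    }
}

public class TimeDrivenBuffer {
    private final DelayQueue<VideoFrame> buffer = new DelayQueue<>();

    void insert(VideoFrame f) { buffer.put(f); }            // called by the second thread
    VideoFrame takeWhenDue() throws InterruptedException {  // called by the third thread
        return buffer.take();                               // blocks until the timestamp is due
    }
}
```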
3.2 Middleware Infrastructure to Hide Complex Details in Underlying Infrastructures

3.2.1 Design Issues

The middleware infrastructure described in this section allows us to build distributed mixed reality applications in an easy way. The middleware is designed according to the following principles:

• An application programmer should not have to take complex algorithms into account.
• Distribution should be hidden in the middleware.
• Dynamic reconfiguration according to the current situation should be hidden from the programmer.

Our middleware infrastructure facilitates the development of mixed reality applications for ambient intelligence by composing existing multimedia components and automatically reconfiguring the connections among components when the user's situation changes.
3.2.2 System Architecture

Our middleware infrastructure, called MiRAGe[41], consists of the multimedia framework, the communication infrastructure and the application composer, as shown in Fig. 4. The multimedia framework is a CORBA (Common Object Request Broker Architecture) based component framework for processing continuous media streams. The framework defines CORBA interfaces for configuring multimedia components and the connections among them. In the figure, every circle represents a multimedia component that is controlled through the CORBA interface. Multimedia components supporting mixed reality can be created from the MR class library, and by composing several instances of its classes, complex mixed reality multimedia components can be constructed without taking into account the various complex algorithms that realize mixed reality.

The communication infrastructure based on CORBA consists of the situation trader and OASiS. The situation trader is a CORBA service that supports automatic reconfiguration and is co-located with the application program. It contains Adaptive Pseudo Objects (APOs), Pseudo Object Managers (POMs) and a configuration manager used for the dynamic configuration of multimedia components. Its role is to manage the configuration of connections among multimedia components when
Fig. 4 Overview of MiRAGe Architecture
the current situation changes. OASiS is a context information database that gathers context information about objects, such as their location, from sensors. OASiS also behaves like a naming and trading service for storing object references. The situation trader communicates with OASiS to detect changes in the current situation. Finally, the application composer, written by an application programmer, coordinates the entire application. The programmer needs to create several multimedia components, connect them, and specify a policy on how to reconfigure these components to reflect situation changes. By using our framework, the programmer does not need to be concerned with the detailed algorithms for processing media streams, because these algorithms are encapsulated in reusable multimedia components. Distribution is hidden by our CORBA-based communication infrastructure, and automatic reconfiguration is hidden by the situation trader service. Developing mixed reality applications therefore becomes much easier with our framework.
3.2.3 Multimedia Framework

The main building blocks of our multimedia framework are software entities that internally and externally stream multimedia data in order to accomplish a certain task. We call them multimedia components. A multimedia component consists of a CORBA interface and one or more multimedia objects. Our framework offers the abstract classes MSource, MFilter and
MSink (in Fig. 4, Cam is an instance of MSource, MRDetector and MRRenderer are instances of MFilter, and Disp is an instance of MSink). Developers extend these classes and override the appropriate methods to implement functionality. Multimedia objects need to be developed only once and can be reused in any component. For example, Fig. 4 shows three connected components: one contains a camera source object for capturing video images, one contains the MRDetector and MRRenderer filter objects that implement the mixed reality functionality, and one contains a display sink object for showing the mixed reality video images. In a typical component configuration, video or audio data are transmitted between multimedia objects, possibly contained in different multimedia components running on remote machines. Through the CORBA interface defined in MiRAGe, connections can be created to control the streaming direction of data items between multimedia objects.
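To make the structure concrete, the following is a minimal Java-flavoured sketch of such a pipeline of multimedia objects. The method names (connect, process, consume, emit) and the byte[] payload type are assumptions for illustration; the actual MiRAGe classes sit behind CORBA interfaces and differ in detail.

```java
// Illustrative sketch only; not the actual MiRAGe signatures.
abstract class MObject {
    private MObject downstream;
    void connect(MObject next) { downstream = next; }       // set the streaming direction
    protected void emit(byte[] data) {
        if (downstream != null) downstream.receive(data);
    }
    abstract void receive(byte[] data);
}

abstract class MSource extends MObject {
    void receive(byte[] data) { /* sources only emit data */ }
}

abstract class MFilter extends MObject {
    // Developers override process() to implement the functionality.
    protected abstract byte[] process(byte[] in);
    void receive(byte[] data) { emit(process(data)); }
}

abstract class MSink extends MObject {
    // Developers override consume() to present the data.
    protected abstract void consume(byte[] data);
    void receive(byte[] data) { consume(data); }
}

// A reusable filter: written once, usable in any component.
class GrayscaleFilter extends MFilter {
    protected byte[] process(byte[] frame) {
        // ... convert the frame to grayscale (omitted) ...
        return frame;
    }
}
```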
3.2.4 Communication Infrastructure

A configuration manager, owned by the situation trader, manages stream reconfiguration by updating the connections between multimedia objects. Complex issues concerning automatic reconfiguration are completely hidden from application programmers. The situation trader is linked into the application program. In our framework, a proxy object in an application composer refers to an APO managed by the situation trader. Each APO is managed by exactly one POM, which is responsible for replacing object references when it receives a notification message from OASiS upon a situation change. A reconfiguration policy needs to be set for each POM. The policy is passed to OASiS through the POM, and OASiS selects the most appropriate target object according to the policy. In the current design, a location parameter can be specified as a reconfiguration policy; it is used to select the most suitable multimedia component according to the current location of a user. A configuration manager controls the connections among multimedia components: upon a situation change, a callback handler in the configuration manager is invoked to reconfigure the affected streams by reconnecting the appropriate multimedia components.
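A hypothetical sketch of how an application composer might express such a policy is shown below. None of these names (PseudoObjectManager, ReconfigurationPolicy, onSituationChanged) are the real MiRAGe API; they only mirror the roles described above.

```java
// Hypothetical names; the real situation trader interfaces differ.
interface PseudoObjectManager {
    void setPolicy(ReconfigurationPolicy policy);
}

class ReconfigurationPolicy {
    final String parameter;   // e.g. "location"
    final String preference;  // e.g. "nearest"
    ReconfigurationPolicy(String parameter, String preference) {
        this.parameter = parameter;
        this.preference = preference;
    }
}

class ConfigurationManager {
    // Invoked by the situation trader when OASiS reports a situation change.
    void onSituationChanged(Object newTarget) {
        // Reconnect the affected streams to the newly selected component
        // (details omitted); the application composer never sees this step.
    }
}

class ComposerExample {
    void configure(PseudoObjectManager pom) {
        // Select the multimedia component closest to the user's current location.
        pom.setPolicy(new ReconfigurationPolicy("location", "nearest"));
    }
}
```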
3.2.5 Mixed Reality Class Library

The MR class library is a part of the MiRAGe framework. The library defines multimedia mixed reality objects for detecting visual markers in video frames and for superimposing graphical images on those markers. These mixed reality objects are to a large extent implemented using the ARToolkit. Application programmers can build mixed reality applications by configuring multimedia
components with the mixed reality objects and streaming data between them. In addition, the library defines data classes for the video frames that are streamed through the mixed reality objects.

MRFilter is a subclass of MFilter and is used as the base class for all mixed reality classes. The class MVideoData encapsulates raw video data. MRVideoData is a specialization of MVideoData and contains an MRMarkerInfo object that stores information about the visual markers in its video frame. Since different types of markers will be available in our framework, the format of marker information must be defined in a uniform way. The class MRDetector is a mixed reality class that inherits from MRFilter. It expects an MVideoData object as input, detects visual markers in it, creates an MRVideoData object, adds information about the detected markers in the video frame, and transmits the MRVideoData object as output. ARTkDetector is a subclass of MRDetector that implements the marker detection algorithm using the ARToolkit. The MRRenderer class is another mixed reality class derived from MRFilter. It expects an MRVideoData object as input and superimposes graphical images at the positions specified in the MRMarkerInfo object; the superimposed image is transmitted as output. OpenGLRenderer is a specialization of MRRenderer that superimposes graphical images generated by an OpenGL program.
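The class relationships above can be summarized in a compact sketch. Only the inheritance relations follow the text; the method signatures and bodies are invented, and in the real library MRFilter extends the MFilter class shown earlier rather than standing alone.

```java
// Class relations follow the text; bodies and signatures are invented.
class MVideoData { byte[] pixels; }                 // raw video frame

class MRMarkerInfo { }                              // positions of detected visual markers

class MRVideoData extends MVideoData {
    MRMarkerInfo markers = new MRMarkerInfo();      // frame plus marker information
}

abstract class MRFilter {                           // base class for all mixed reality classes
    abstract MVideoData apply(MVideoData in);
}

class MRDetector extends MRFilter {
    // Input: MVideoData. Output: MRVideoData with detected markers attached.
    MVideoData apply(MVideoData in) {
        MRVideoData out = new MRVideoData();
        out.pixels = in.pixels;
        // ARTkDetector, a subclass, would delegate detection to the ARToolkit here.
        return out;
    }
}

class MRRenderer extends MRFilter {
    // Superimposes graphical images at the positions listed in MRMarkerInfo.
    MVideoData apply(MVideoData in) {
        MRVideoData frame = (MRVideoData) in;
        // OpenGLRenderer, a subclass, would draw OpenGL output onto frame.pixels here.
        return frame;
    }
}
```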
3.2.6 Application Scenario

In a typical mobile mixed reality application, our real world is augmented with virtual information. For example, the door of a classroom might have a visual tag attached to it; a PDA or a cellular phone, equipped with a camera and an application program for capturing visual tags, superimposes today's lecture schedule on the tag. We assume that in the future our environment will deploy many mixed reality servers. In this example, the nearest server stores information about today's lecture schedule and provides a service for detecting visual tags and superimposing the schedule information on them. Other mixed reality servers, located on a street, might contain information about which shops or restaurants can be found on the street and until how late they are open.

To build the application, an application composer uses components for capturing video data, detecting visual markers, superimposing information on video frames and displaying them. The application composer contacts a situation trader service to retrieve a reference to a POM that manages references to the nearest mixed reality server. When the user moves, a location sensor component notifies OASiS of the sensed location information, and OASiS notifies the situation trader to replace the current object reference with a reference to the new nearest mixed reality server. In this way, the nearest mixed reality server is selected dynamically according to the user's location, while automatic reconfiguration remains hidden from the application programmer.
3.3 Middleware Infrastructure to Support Spontaneous Smart Object Integration

3.3.1 Design Issues

Smart object systems usually combine sensor instrumented everyday objects, mobile devices, displays and similar artifacts to provide user-centric proactive services. Our middleware infrastructure supports building this class of applications in a simple way by adopting a document centric approach. We take the following design considerations into account.

1. Providing an appropriate wrapper framework and abstraction for building smart objects without concern for the requirements of the target application.
2. Providing an appropriate abstraction for externalizing application requirements and utilizing smart objects without concern for the interfaces and management of the smart objects.
3. Utilizing a runtime intermediator that handles smart object management (bootstrapping, discovery, utilization) and provides a mapping between application and smart object services based on structural typing.
4. Providing service extension support for both the application and the smart object using the primary abstractions.

These design principles enable developers to write applications and to build smart objects in a generic way, regardless of the constraints of the target environment. The runtime infrastructure provides the pairwise mapping using structural typing, thus externalizing smart object management and keeping heterogeneity issues away from the applications, which allows developers to focus on application development only. This results in simple and rapid development of smart object systems.
3.3.2 System Architecture

Our middleware infrastructure consists of an Artefact Framework, an Application Development Model and a Runtime Intermediary Infrastructure called FedNet[17, 18]. The basic workflow of our middleware is shown in Fig. 5 (a). The artefact framework represents a smart object by encapsulating its augmented functionalities (e.g., the proactivity of a table lamp) in one or more service profiles atop a runtime, and allows profiles to be added incrementally. Applications in our approach are represented as a collection of implementation independent functional tasks; these tasks are atomic actions that correspond to the smart objects' services. An infrastructure component, FedNet, manages these applications and artifacts and maps the task specifications of the applications to the underlying artifacts' services by matching the corresponding documents (which express the applications and the artifacts). The two primary abstractions used in our system, Profile and Task, are realized by corresponding documents. Additionally, end user tools can be built atop our middleware independently to deploy, configure and manage the applications and artifacts.
Fig. 5 (a) System Architecture (b) The internal architecture of FedNet
3.3.3 Middleware Abstraction

There are two primary abstractions in the middleware infrastructure. From the smart objects' perspective, the notion of a profile handles the service implementation details and protocol issues. Since profiles are built independently following a plugin architecture, a smart object's services can be extended at any time by adding new sensors or actuators and attaching the corresponding profile to the smart object's core. Also, if a specific service needs to be updated, only the corresponding profile needs to be replaced, not the entire smart object or the applications utilizing it. Furthermore, a profile may provide services at various granularities, thereby supporting multiple applications that require services at different scales (i.e., some applications may ignore some service features).

The second abstraction is from the applications' perspective: tasks simply externalize an application's requirements, so any application can be expressed with this abstraction. Not all tasks of an application can necessarily be supported by an existing environment; however, the incremental addition of new smart objects to the environment, or porting the application to another environment with richer smart objects, may enable the application's full functionality. In addition, an application's functionality can be updated independently (application binary and document) without concern for the impact of such an update on the middleware or the smart objects. In our approach, these flexibilities are provided elegantly by expressing only the applications' task specifications in documents and keeping smart object management issues out of the application level. FedNet maps these documents to the smart objects' documents, which express their services.
3.3.4 Artefact Framework

The artefact framework provides a layered architecture in which basic smart object functionalities are combined in a generic core component that acts as the runtime. Additional augmented features can be added as plug-ins to the core; each augmented feature is called a Profile. Profiles are artefact independent and represent a generic service implementing service specific protocols. For example, sensing room temperature could be one profile, and multiple artifacts (e.g., a window, an air conditioner) can be augmented with a thermometer to support it. A profile handles service specific heterogeneity by hiding the implementation details and can be plugged into the runtime core of our artefact framework, allowing any profile to become part of a suitable artifact. The runtime core of a smart object hosts an array of its profiles and provides the basic communication facilities (e.g., service advertisement, pushing/pulling profile services to/from the respective applications). A smart object and its service profiles are externalized using structured documents expressed in XML; such a document specifies the profile details, i.e., input/output data types, methods, parameters, etc.
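The plug-in relationship between a runtime core and its profiles might be sketched as follows. The names Profile and ArtefactCore, the XML-string call convention and the fixed sensor reading are assumptions made for illustration; the actual FedNet interfaces differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names (Profile, ArtefactCore); the actual FedNet interfaces differ.
interface Profile {
    String name();                        // e.g. "proximity", "display", "temperature"
    String invoke(String xmlRequest);     // handles the service-specific protocol
}

// Generic runtime core: hosts an array of profiles and dispatches requests to them.
class ArtefactCore {
    private final List<Profile> profiles = new ArrayList<>();

    // Profiles are plug-ins: new sensors or actuators can be added incrementally.
    void plug(Profile p) { profiles.add(p); }

    String dispatch(String profileName, String xmlRequest) {
        for (Profile p : profiles) {
            if (p.name().equals(profileName)) return p.invoke(xmlRequest);
        }
        return "<error>no such profile</error>";
    }
}

// Example: augmenting a mirror artefact with a proximity profile.
class ProximityProfile implements Profile {
    public String name() { return "proximity"; }
    public String invoke(String xmlRequest) {
        // would read the infra-red sensor; a fixed answer stands in for it here
        return "<state>near</state>";
    }
}
```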
3.3.5 Task Centric Application Model

Any application is composed of several functional tasks, i.e., atomic actions. In smart object systems, these atomic actions may be "turn the air conditioner on", "sense the proximity of an object", and so on. In our application model, atomic actions are externalized as Tasks in a corresponding document expressed in XML. This document explicitly specifies the service requirements of the application, and each task in the document specifies the respective smart object profiles it needs to accomplish its goal.

The next feature of our application model is the interface definition. We consider that defining strict interfaces for applications limits their portability and adoption. Since an application only needs to manipulate data to interact with the underlying smart objects, a compatible and consistent message format is enough to enable applications to interact with smart objects. Consequently, we have addressed this concern by letting FedNet, the intermediary component of our middleware, act as the gateway to smart object services, which applications access in a RESTful manner using simple HTTP GET and POST requests with XML messages. Applications thus do not need to adhere to any middleware specific interfaces to interact with the smart objects, yet they can leverage these services via FedNet.
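The sketch below illustrates what such RESTful access could look like from an application, using the standard java.net.http client (Java 11 or later). The endpoint URLs and the XML payloads are invented for illustration; FedNet's actual resource layout is not specified here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of an application talking to a FedNet-style gateway in a RESTful way.
public class TaskClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Post a task request (actuation) to the application's access point.
        String task = "<task name=\"switch-on\"><profile>light</profile></task>";
        HttpRequest post = HttpRequest.newBuilder()
                .uri(URI.create("http://fednet.local/apps/demo/tasks"))   // invented URL
                .header("Content-Type", "application/xml")
                .POST(HttpRequest.BodyPublishers.ofString(task))
                .build();
        client.send(post, HttpResponse.BodyHandlers.ofString());

        // Pull an environment state (sensing) with a plain GET.
        HttpRequest get = HttpRequest.newBuilder()
                .uri(URI.create("http://fednet.local/apps/demo/profiles/proximity"))
                .GET()
                .build();
        String state = client.send(get, HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(state);
    }
}
```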
3.3.6 FedNet Runtime Infrastructure

In our approach, both the applications and the artifacts are infrastructure independent and expressed in high level descriptive documents (i.e., task and profile specifications). FedNet provides the runtime association among them by utilizing only the
documents of these applications and artifacts. It can contact the artefact core using the semantics described in the artefact documents for mapping application tasks; similarly, an application can contact FedNet in a RESTful manner. FedNet itself is packaged as a generic binary and is composed of four components; Fig. 5 (b) shows the interaction among them. The Application Repository hosts all the applications, whereas the Artefact Repository manages all the artifacts running in a FedNet environment. The FedNet Core provides the foundation for the runtime federation. When an application is deployed, its task specification is extracted from the application repository by the FedNet Core, which analyzes the task list by querying the artefact repository, generates an appropriate federation template containing the artifacts whose service profiles match the task specifications, and attaches it to a generic Access Point component for that application. When an application is launched, the access point sends the federated artifacts' data semantics to the application, which allows the application to know the semantics of the movable data in advance. From then on, the application delegates all its requests to the access point, which in turn forwards them to the specific artefact. The artifacts respond to these requests by providing their profile outputs, either pushing the environment state (actuation) or pulling the environment state (sensing) back to the access point, which feeds them to the application.
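The federation step can be pictured as a simple matching pass over the two repositories. The data shapes below are invented; in FedNet the same information is extracted from the XML task and profile documents rather than from in-memory lists.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of building a federation template: for every profile an
// application requires, find a deployed artefact that advertises that profile.
class ArtefactEntry {
    final String artefactId;
    final List<String> profiles;          // e.g. ["display", "proximity"]
    ArtefactEntry(String artefactId, List<String> profiles) {
        this.artefactId = artefactId;
        this.profiles = profiles;
    }
}

public class FederationBuilder {
    // Returns, for each required profile, an artefact offering a structural match.
    static Map<String, String> buildTemplate(List<String> requiredProfiles,
                                             List<ArtefactEntry> repository) {
        Map<String, String> template = new HashMap<>();
        for (String profile : requiredProfiles) {
            for (ArtefactEntry artefact : repository) {
                if (artefact.profiles.contains(profile)) {
                    template.put(profile, artefact.artefactId);
                    break;                // first structural match wins (simplified)
                }
            }
        }
        return template;                  // attached to the application's access point
    }
}
```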
3.3.7 Application Scenario

We have built several smart object systems using our middleware infrastructure; in this section we present one of them. We constructed a smart mirror[8] by augmenting a regular laptop with an acrylic magic mirror. Initially the mirror has only a display profile. We wrote an application that shows personalized information (e.g., weather, stock quotes, movie listings) on the mirror display. However, the application can proactively show information only when someone is in front of the mirror, and such proactivity requires a proximity profile. To enable this feature, we added a proximity profile to the mirror by attaching an infra-red sensor, which improved the application's interactivity. Afterwards, we built a completely separate application for the mirror in which the user's dental hygiene is reported in a persuasive way, using the metaphor of a clean or dirty aquarium[30]. We replaced the previous application running on the mirror with this new one. The new application requires a smart toothbrush that can detect its state of use; we constructed the smart toothbrush by attaching a wireless accelerometer sensor and deployed it in our environment together with the corresponding profile. The mirror thus shows an aquarium reflecting the user's brushing practice whenever the user brushes his teeth in front of the mirror. All the smart objects and applications were deployed and configured using our end user deployment tool running atop FedNet.

Both applications were built independently and deployed with corresponding documents expressing their tasks; similarly, the toothbrush and the mirror were built independently and deployed with corresponding documents. FedNet provided the
runtime association among them, thus freeing the applications from smart object management. Furthermore, this scenario highlights the service extension feature of our middleware: we added new profiles to an existing smart mirror, allowing an existing application to leverage new functionality. Importantly, the application did not have to deal with the heterogeneity introduced by the addition of the infra-red sensor, as this was handled by the proximity profile implementation.
3.4 Middleware Infrastructure for Context Analysis

3.4.1 Design Issues

One of the important technology trends in the evolution of ambient intelligence is the use of sensors. In order to enhance the interaction between humans and computers, various sensors such as GPS receivers, accelerometers and capacitive touch sliders are incorporated into smart objects and mobile terminals. Sensors were originally used only to monitor and control a device's internal state (e.g., temperature or battery condition), but the miniaturization and low cost production of sensors have diversified their purpose in recent years; several recent mobile terminals contain accelerometers to detect the rotation of the terminal. This progress in sensing is bringing the vision of context-aware computing[37] closer.

While designing a couple of smart objects that offer context-aware services, we found several issues in the extraction of context information. The validity of sensor data and analysis algorithms changes frequently; for example, biological sensor data are available only under limited conditions (e.g., "at a user's hand"). In particular, mobility diversifies the use cases of a smart object, and we have to select an available set of sensors according to the situation. If several different sensors are available simultaneously, context information can be detected in multiple ways by changing the configuration of sensors and analysis algorithms. Given these complexities, we designed a middleware infrastructure that provides a high level of abstraction in order to facilitate the development of context-aware services. While designing the middleware infrastructure, we considered the following design issues.

• A complex context analysis module is implemented by composing simple context analysis modules.
• New context analysis modules that extract more precise context information should be easy to add, in order to reduce the ambiguity in the current context information.

The first issue requires the middleware infrastructure to offer a mechanism for composing multiple small context analysis modules easily. The second issue requires that all context information be shared by all context analysis modules so that they can compensate for each other's inaccuracy. In the following subsections, we introduce the middleware infrastructure for context analysis, called Citron.
3.4.2 System Architecture

The middleware infrastructure Citron facilitates the development of complex context analysis[19, 42]. To address the design issues discussed in the previous subsection, Citron consists of the following software components.

• A context analysis module framework that specifies the relationship among multiple analysis modules.
• A common space for sharing the extracted context information among context analysis modules.

In Citron, we have adopted the blackboard architecture to coordinate context analysis modules; we call each context analysis module a Citron worker. The blackboard architecture is a data centric processing architecture that was originally developed for a speech understanding system. There is one common shared message board and multiple worker modules for collaborative context analysis. Each Citron worker reads low level context information from the shared board and writes the extracted high level context information back after processing it. Citron workers can thus collaborate without knowing about each other, and the context information in the shared board can be complemented independently because changing or adding workers is very easy.

To implement the blackboard architecture, we adopt a tuple space based programming model[4]. The tuple space is a widely used abstraction for interprocess communication among independent processes. It refines a common shared space with a very flexible data type called tuples and enables applications to search for tuples with template based queries. The significant characteristics of the tuple space model are 1) loose coupling of Citron workers, 2) information sharing among Citron workers and 3) flexible data representation.

Fig. 6 shows an overview of the Citron system architecture, which contains two components: Citron Worker and Citron Space. A Citron Worker is a sensor data analysis module. Each worker collects sensor data and retrieves context from Citron Space, and each worker is responsible for one context extraction process. For example, a worker that recognizes "held or not" observes the value of a skin resistance sensor and detects "held" or "free" through threshold analysis. Citron Space is a tuple space that stores tuples representing the context information analyzed by the Citron Workers, and it processes data management requests from Citron Workers and from applications running on Citron. Context information is represented as a combination of meta information such as subject, state and time. Programmers do not need detailed knowledge about the underlying infrastructure, such as IP addresses and ports, and context information in different applications can be represented in a unified format.
Fig. 6 Citron architecture overview
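The tuple based sharing described above can be illustrated with a minimal in-memory stand-in. The real Citron Space is richer (distributed access, full template queries); this sketch only shows the (subject, state, time) tuple shape and the write/read pattern used by workers.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// A minimal in-memory stand-in for Citron Space; not the actual implementation.
class ContextTuple {
    final String subject, state;
    final long time;
    ContextTuple(String subject, String state, long time) {
        this.subject = subject; this.state = state; this.time = time;
    }
}

class CitronSpaceSketch {
    private final List<ContextTuple> tuples = new CopyOnWriteArrayList<>();

    void write(ContextTuple t) { tuples.add(t); }

    // Template query: null fields act as wildcards.
    ContextTuple read(String subject, String state) {
        for (ContextTuple t : tuples) {
            if ((subject == null || subject.equals(t.subject))
                    && (state == null || state.equals(t.state))) return t;
        }
        return null;
    }
}

// A worker writes high level context derived from low level sensor data:
//   space.write(new ContextTuple("muffin", "held", System.currentTimeMillis()));
// Another worker, or an application, reads it back with a template:
//   ContextTuple held = space.read("muffin", null);
```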
3.4.3 Citron Worker Framework

The Citron worker framework offers a high level abstraction for programming a Citron worker. It contains common functions for implementing context analysis algorithms without requiring detailed knowledge about the underlying infrastructure, such as the connection management with Citron Space and sensor data retrieval from sensor devices. Developers who implement Citron Workers to extract context information only have to consider the implementation of the analysis algorithms. Detailed parameters such as the type of sensor, the subject of the context, the time lag of the analysis, and the polling interval are declared in the initialization profile of a Citron Worker. After initialization, the analysis function is invoked at every specified interval; this function invokes an analysis logic module to retrieve the specified sensor values and context information, and the result of the context analysis is written into Citron Space as high level context information.

Winograd[43] discusses the characteristics of, and tradeoffs between, the blackboard architecture and other frameworks for developing context-aware services, such as the Context Toolkit[6]. Other middleware infrastructures offer high level abstractions for developing context analysis modules similar to Citron, but they provide different mechanisms for communication among the modules. For example, in the Context Toolkit, context analysis modules are connected directly, and context information is passed along direct paths between them.
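A hypothetical rendering of the worker framework is sketched below. In the real framework the parameters come from the initialization profile rather than a constructor, and the skin-resistance threshold and 100 ms polling interval are assumptions for illustration only.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: analyze() is invoked at every polling interval; only
// the analysis logic is left to the worker developer.
abstract class CitronWorkerSketch {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private final long pollingIntervalMillis;

    CitronWorkerSketch(long pollingIntervalMillis) {
        this.pollingIntervalMillis = pollingIntervalMillis;
    }

    // Developers only implement the analysis logic.
    protected abstract void analyze();

    void start() {
        timer.scheduleAtFixedRate(this::analyze, 0,
                pollingIntervalMillis, TimeUnit.MILLISECONDS);
    }
}

// Example: the "holding" worker thresholds a skin-resistance reading.
class HoldingWorker extends CitronWorkerSketch {
    HoldingWorker() { super(100); }                 // poll every 100 ms (assumed)
    protected void analyze() {
        double resistance = readSkinResistance();   // sensor access, stubbed below
        String state = resistance < 50_000 ? "held" : "free";  // threshold is assumed
        // the result would be written into Citron Space here
    }
    private double readSkinResistance() { return 42_000; }
}
```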
3.4.4 Application Scenario

We have developed a sample context-aware application named "RouteTracer", which uses the user's context information extracted by Muffin, a mobile terminal equipped with various sensors. The application displays the track of a walking route together with the user's state in real time. RouteTracer shows not only the track of the route but also the walking speed and how long the user stayed at the same point. The walking speed is divided into five levels, distinguished by different colors, and a location point where the user stopped is represented as a circle whose radius grows with the length of the stop.

From the application's point of view, three kinds of context information are required: "walking state", "walking direction" and "walking speed". To draw the track of the route, at least "walking direction" and "walking state" are required. Context information representing "walking speed" is optional, but it is needed to estimate the distance and to create a more accurate map. "Walking speed" is measured from the activity level of Muffin. Based on our experiments, we found that the activity correlates with the walking speed of a user while he is walking and watching Muffin; it follows that context representing "watching" is also required.

To extract this context, six Citron Workers run in this application: "orientation", "walking", "activity", "watching", "holding" and "top side". The "orientation" worker retrieves data from a compass and determines where Muffin is headed; this orientation can be regarded as the walking direction of the user when the user is watching Muffin. The "watching" worker recognizes whether the user is watching Muffin from the results of the "holding" and "top side" workers: if the user holds Muffin and the display or the head of Muffin turns up, the "watching" worker recognizes the state as "the user is watching Muffin". The "holding" worker retrieves and analyzes data from a skin resistance sensor, and the "top side" worker analyzes the gravity acceleration. The "activity" worker also retrieves the acceleration and analyzes the motion of Muffin using an FFT analysis; it requires context generated by the "top side" worker, because the axis that is mainly measured changes according to the top side of Muffin. As described above, the analyzed activity is divided into five levels and treated as the walking speed. The "walking" worker decides whether the user is moving or resting by analyzing data from an accelerometer; this context corresponds to the "walking state" in RouteTracer.

It is also possible for RouteTracer to recognize the walking state from "walking speed" alone. However, the performance of that analysis is poor, because the "activity" worker needs about 6.4 seconds to recognize the context: it collects 128 samples, one every 50 msec, into a buffer before analyzing them. The "walking" worker, on the other hand, determines whether the user is walking by simple zero-cross detection in real time. We expect that the parallel analysis using these two Citron Workers makes RouteTracer more responsive in recognizing the walking state.
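A simplified version of the zero-cross detection mentioned above is shown below: count sign changes of the mean-removed acceleration over a short window and declare "walking" when they exceed a threshold. The window size, the sample values and the threshold are assumptions, not taken from the actual "walking" worker.

```java
// Minimal zero-cross detection sketch; thresholds and sample data are illustrative.
public class ZeroCrossDetector {
    static String classify(double[] accel) {
        double mean = 0;
        for (double a : accel) mean += a;
        mean /= accel.length;

        int crossings = 0;
        for (int i = 1; i < accel.length; i++) {
            // a sign change of the mean-removed signal counts as one zero crossing
            if ((accel[i - 1] - mean) * (accel[i] - mean) < 0) crossings++;
        }
        return crossings >= 4 ? "walking" : "resting";   // threshold is illustrative
    }

    public static void main(String[] args) {
        double[] sample = {9.6, 10.4, 9.3, 10.7, 9.5, 10.2, 9.8};
        System.out.println(classify(sample));            // prints "walking"
    }
}
```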
4 Middleware Design Issues for Ambient Intelligence

4.1 High Level Abstraction and Middleware Design

One of the most important design issues in building middleware infrastructures for ambient intelligence environments is deciding which properties the middleware infrastructures should hide. Our experience shows that hiding these complex issues makes application programming dramatically easier. However, complete distribution transparency is very hard to achieve, because different abstractions require different ways of hiding distribution, and each abstraction may have different assumptions and constraints about dynamic reconfiguration. It is not easy to hide these properties in a common infrastructure shared by various middleware, and we need to consider carefully how to hide them in each middleware infrastructure.

High level abstractions specialized for a particular application domain are very useful for developing ambient intelligence applications. Similar conclusions are reported for a middleware infrastructure supporting synchronous collaboration in an office environment[40] and a middleware infrastructure for building location-aware applications[14]. In our approach, the first middleware infrastructure focuses on supporting audio/visual home computing applications, the second supports mixed reality applications, the third facilitates the development of extensible smart objects, and the fourth makes it easy to develop context-aware services that require the analysis of complex context information. These middleware infrastructures each support only a specialized application domain, and their functionalities do not overlap.

Lesson 1: It is important to develop specialized high level abstractions that support one specific domain while being generic enough to cover a wide range of applications within that domain. Moreover, abstractions should be usable together in order to compose new abstractions.
4.2 Interface vs. Protocol

The third middleware infrastructure adopted a standard protocol, SOAP, as its underlying communication protocol. In [26, 27], we also made intensive use of Web-based protocols to coordinate multiple home appliances. This made it easy to adopt commercial products in our experiments, and the approach is very important for the incremental evolution of ambient intelligence environments. However, devices and appliances in smart environments may want to change their interfaces independently, so we need a more spontaneous approach to make them collaborate. We believe it is desirable not to fix the interface between appliances but to agree only on the message format used to communicate with each other; the format may adopt an XML-based representation. In this approach, each appliance may add extra tags that offer additional
functionalities. Suppose an appliance receives a message: if the message contains tags that the appliance cannot understand, those tags can simply be ignored. Each appliance can therefore extend the message format independently. We believe this approach is desirable in ambient intelligence environments for supporting communication among appliances or smart objects. In the first middleware infrastructure, we adopted CORBA to compose multimedia components. The middleware offers multiple interfaces for communication between an application composer and components. This approach is also useful for supporting multiple versions of an interface, and components can extend their functionality by adding a new interface.

Lesson 2: Programs should have a loosely-coupled, generic interface that allows for custom extensions that do not interfere with existing programs.
4.3 Non-Functional Properties

Middleware infrastructures for ambient intelligence need to offer various non-functional properties such as context-awareness, timeliness, reliability and security. However, it is not easy to offer a common service for supporting context-awareness, because modeling the real world for arbitrary ambient intelligence applications would require defining complete ontologies. We doubt that a generic and reusable high level context-awareness service can be implemented for use across the various middleware infrastructures for ambient intelligence. On the other hand, we believe that low level support for handling context information, as in [6], can be used uniformly in many middleware infrastructures. This means that it is desirable to hide the details of sensor behavior in a component so that middleware infrastructures can be developed easily, but the middleware infrastructure should not interpret the meaning of the values retrieved from sensors, because there is no common consensus on how to model the world in a standard way.

We also found that it is not easy to offer a common service for adding security to each middleware infrastructure: each one requires customized security support, because each application domain may have different requirements. We believe that future middleware infrastructures should address a variety of non-functional concerns such as security, timeliness, privacy protection, trust relationships among people, and reliability. Generic support for all of these concerns may make the infrastructures too big and complex; in particular, when multiple middleware infrastructures are integrated, the concerns cut across them. It is therefore important to support only the minimum and to implement support for non-functional properties using software structuring techniques such as aspect-oriented programming[23].
Lesson 3: Common generic services that offer non-functional concerns may make it difficult to develop practical ambient intelligence applications. Such services should be customized in the respective middleware infrastructures.
4.4 Portability

A standard middleware infrastructure provides a common application programming interface that enables us to build portable applications, and there are many open specifications for implementing commodity software. Although each product has a different implementation, an application that uses a common interface should run on any of them. In practice, however, we experienced several problems. Java virtual machines with different implementations behave in different ways, so the application's behavior also changes; for example, the implementation of scheduling and synchronization has a great impact on application behavior. It is important to build future environments on commodity software, yet if each implementation raises a different exception for the same internal fault, it is difficult to write a portable application. Likewise, if a location system returns location information at a different granularity (precision and accuracy), it is hard to implement portable location-aware applications, and if the respective implementations provide different levels of security, reliability and predictability, it is impossible to build portable software. Our observations show the importance of specifying the assumptions of each implementation explicitly and of specifying the behavior in an abstract way.

Lesson 4: The behavior of a middleware infrastructure implementation should be rigorously defined so that portable applications can be implemented.
4.5 Human Factors

In our middleware infrastructures, dynamic reconfiguration is hidden from the user. However, such properties are closely related to human factors; for example, if an interaction device is switched in an unexpected way, it might surprise the user. This means that middleware infrastructures that hide dynamic reconfiguration need to take human factors into account[7]. Our experience shows that automatic selection of interaction devices is not a good approach[29]. Instead, we use a token to choose the most suitable interaction device explicitly. For example, suppose a user is using a telephone in a living room. When s/he moves to the kitchen, s/he may want to change the speaker and microphone. In this case, the user brings a token attached to the telephone and puts it into a base in the kitchen; our system detects the event and changes the interaction devices for the user.

However, unexpected and automated changes may be attractive for realizing pleasurable services. For example, if an environment detects that a user and his
girlfriend are together in a room, it may be desirable to make the room's lighting more romantic automatically. Implicit changes are also desirable when a user utilizes services without being aware of them; in this case, the services should not interrupt the activity the user is currently focusing on. We believe that strategies for dynamic reconfiguration should be customized in each application, and the programming interface for controlling dynamic reconfiguration should be clearly separated from the other programming interfaces.

When representing personal information on a display, we need to consider how to protect the user's private information. On the other hand, this information is useful for offering better customized services to the user. We found that it is important to evaluate the tradeoff between the quality of services and the amount of private information disclosed. When a user cannot trust an application, s/he will not give out his/her personal information, but if s/he wants better services and trusts them, s/he may give more of it. Also, when multiple users share a display, it is desirable to abstract the information in order to protect privacy; the abstraction level of the representation is determined by the degree of secrecy of the information. Ambient displays[30, 31] and informative art[15] are a first step towards information abstraction that allows us to protect private information and to reduce information overload in our society.

Lesson 5: Middleware infrastructures should be flexible enough to implement dynamic reconfiguration and information representation according to the requirements of the respective applications.

In the near future, designers of middleware infrastructures for ambient intelligence should learn psychology, and we need to ensure that the adaptation of software does not contradict the user's mental model. We believe that modeling the real world and the user's mental model in middleware infrastructures is an important research topic for building practical middleware for ambient intelligence. For example, in [29], if the automatic selection of interaction devices does not reflect the user's mental model, it will confuse the user; if dynamic (re)configuration is not consistent with the user's mental model, it may surprise the user. However, implementing a model of the real world and of mental models requires representing ontologies in a standard way so that they can be accessed from a variety of middleware infrastructures.

We also need to consider social and cultural aspects when designing applications that interact with the real world; for example, it is important to take trust and privacy into account in future ambient intelligence environments. We believe it is important to model psychological, social, and anthropological concepts in programs that interact with the real world. In our middleware infrastructures, for example, user location is used to offer context-aware services, but whether to disclose location information depends on the user's situation. Information related to privacy should be treated very carefully, and traditional concepts of privacy and trust are too naive for practical ambient intelligence services.

We also believe that the designers of middleware infrastructures for ambient intelligence should understand aesthetics in order to provide pleasurable services[21] or to abstract information. To develop pleasurable services, we may need to take into account
emotion, peak experience, and unconsciousness. In [28], the middleware infrastructure allows us to embed tags in various objects and places, and it shapes our experience by changing the behavior of applications. We found this approach very useful for offering pleasurable services.

Lesson 6: Middleware infrastructures should explicitly incorporate models of psychological, sociological and anthropological concepts.
5 Future Directions

Middleware infrastructures are key to speeding up the development of ambient intelligence services. In this chapter, we have introduced four middleware infrastructures developed in our project and reported several experiences from their development. Of course, other middleware infrastructures have been developed by other research groups, and many programming interfaces have been defined by standards organizations. One of the most important future directions is how to use multiple middleware infrastructures at the same time to develop a single service. Using multiple middleware infrastructures usually causes architectural mismatch[9] if they are not designed to coexist at design time. Also, different middleware infrastructures that offer similar functionalities should interoperate without ad-hoc solutions. Middleware infrastructures should support various non-functional properties such as security and reliability, but at present these properties are implemented redundantly in each middleware infrastructure, and we need to investigate how to avoid this redundancy in a systematic way. To solve these problems, more research is needed on the coordination of multiple middleware infrastructures.
References

[1] E.H. Aarts, J.L. Encarnacao, True Vision: The Emergence of Ambient Intelligence, Springer-Verlag, 2006.
[2] B. Brumitt, J. Krumm, B. Meyers, S. Shafer, Ubiquitous Computing and the Role of Geometry, IEEE Personal Communications, August 2000.
[3] R. Cerqueira, C.K. Hess, M. Roman, R.H. Campbell, Gaia: A Development Infrastructure for Active Spaces, In Proceedings of the Workshop on Application Models and Programming Tools for Ubiquitous Computing, 2001.
[4] N. Carriero, D. Gelernter, Linda in Context, Communications of the ACM, 32(4): 444-458, Apr. 1989.
[5] K. Cheverst, N. Davies, K. Mitchell, A. Friday, Experiences of Developing and Deploying a Context-Aware Tourist Guide: The GUIDE Project, In Proceedings of the Sixth Annual International Conference on Mobile Computing and Networking, 2000.
[6] A. Dey, G.D. Abowd, D. Salber, A Conceptual Framework and a Toolkit for Supporting the Rapid Prototyping of Context-Aware Applications, Human-Computer Interaction (HCI) Journal, Vol.16, No.2-4, 2001.
[7] K. Edwards, V. Bellotti, A.K. Dey, M. Newman, Stuck in the Middle: The Challenges of User-Centered Design and Evaluation for Middleware, In Proceedings of CHI 2003, 2003.
[8] K. Fujinami, F. Kawsar, T. Nakajima, AwareMirror: A Personalized Display using a Mirror, International Conference on Pervasive Computing (Pervasive 2005), LNCS 3468, 2005.
[9] D. Garlan, R. Allen, J. Ockerbloom, Architectural Mismatch or Why It's Hard to Build Systems Out Of Existing Parts, In Proceedings of the 17th International Conference on Software Engineering, 1995.
[10] R. Grimm, et al., System Support for Pervasive Applications, ACM Transactions on Computer Systems, Vol.22, No.4, 2004.
[11] S. Gibbs, R. Gauba, R. Balaraman, R. Lea, HAVi Example by Example: Java Programming for Home Entertainment Devices, Prentice Hall, 2001.
[12] T. Hodes, R.H. Katz, A Document-based Framework for Internet Application Control, In Proceedings of the Second USENIX Symposium on Internet Technologies and Systems, 1999.
[13] A. Harter, A. Hopper, A Distributed Location System for the Active Office, IEEE Network Magazine, 8(1), January 1994.
[14] A. Harter, A. Hopper, P. Steggles, A. Ward, P. Webster, The Anatomy of a Context-Aware Application, In Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking, 1999.
[15] L.E. Holmquist, T. Skog, Informative Art: Information Visualization in Everyday Environments, In Proceedings of GRAPHITE 2003, 2003.
[16] F. Kawsar, K. Fujinami, T. Nakajima, Augmenting Everyday Life with Sentient Artefacts, In Proceedings of the Joint Conference on Smart Objects and Ambient Intelligence (sOc and EuSAI), pp. 141-146, 2005.
[17] F. Kawsar, T. Nakajima, K. Fujinami, Deploy Spontaneously: Supporting End-Users in Building and Enhancing a Smart Home, In Proceedings of the 10th International Conference on Ubiquitous Computing (Ubicomp 2008), 2008.
[18] F. Kawsar, K. Fujinami, T. Nakajima, A Document Centric Approach for Supporting Incremental Deployment of Pervasive Applications, In Proceedings of the Fifth Annual International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services (Mobiquitous 2008), 2008.
[19] K. Hanaoka, A. Takagi, T. Nakajima, A Software Infrastructure for Wearable Sensor Networks, In Proceedings of the 12th IEEE Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2006), 2006.
[20] H. Ishii, B. Ullmer, Tangible Bits: Towards Seamless Interfaces between People, Bits and Atoms, In Proceedings of the Conference on Human Factors in Computing Systems, 1997.
[21] P.W. Jordan, Designing Pleasurable Products, CTI, 2000.
Case Study of Middleware Infrastructure for Ambient Intelligence Environments
255
[22] N. Khotake, J. Rekimoto, Y. Anzai, InfoStick: an interaction device for InterAppliance Computing. In Proc. Workshop on Handheld and Ubiquitous Computing (HUC’99), 1999. [23] G. Kiczales, J. Lamping, A. Memdheker, C. Maeda, C.V. Lopes, JM. Loingtier, J. Irwin, Aspect-Oriented Programming, In Proceedings of ECOOP’97, 1997. [24] T. Nakajima, System Software for Audio and Visual Networked Home Appliances on Commodity Operating Systems, In Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms 2001, 2001. [25] T. Nakajima, Experiences with Building Middleware for Audio and Visual Netwoked Home Appliances on Commodity Software, In Proceedings of the ACM International Conference on Multimedia 2002, 2002. [26] T. Nakajima, D. Ueno, I. Satoh, H. Aizu, A Virtual Overlay Network for Integrating Home Appliances, In the Proceedings of the 2nd International Symposium on Applications and the Internet, 2002. [27] T. Nakajima, I. Satoh, A Software Infrastructure for Supporting Spontaneous and Personalized Interaction in Home Computing Environments, Journal of Personal and Ubiquitous Computing, Vol.10, No.6, Springer, 2005. [28] T. Nakajima, Personal Coordination Server: A System Infrastructure for Designing Pleasurable Experience, In Proceedings of the IEEE International Conference on Pervasive Services 2005, 2005. [29] T. Nakajima, How to reuse exisiting interactive applications in ubiquitous computing environments?, In Proceedings of The 2006 ACM Symposium on Applied Computing (SAC), 1127-1133, 2006. ˝ [30] T. Nakajima, V. Lehdonvirta, E. Tokunaga, H. Kimura: OÒReflecting Human ˝ Behavior to Motivate Desirable LifestyleOÓ DIS2008: The 6th ACM Conference on Designing Interactive Systems, Cape Town, 2008. [31] T. Nakajima, H. Kimura, T. Yamabe, V. Lehdonvirta, C. Takayama, M. Shiraishi, Y. Washio, Using Aesthetic and Empathetic Expression to Motivate Desirable Lifestyle, In Proceedings of 3rd European Conference on Smart Sensing and Context(EuroSSC) 2008, 2008. [32] S. Ponnekanti, B. Lee, A. Fox, P. Hanrahan, T. Winograd, ICrafter: A Service Framework for Ubiquitous Computing Environments, In Proceedings of the International Conference on Ubiquitous Computing 2001, 2001. [33] R. Picard, Affective Computing. The MIT Press, 1997. [34] K. Raatikainen, H.B. Christensen, T. Nakajima, Applications Requirements for Middleware for Mobile and Pervasive Systems, ACM Mobile Computing and Communications Review, Vol.16, No.4, 2002. [35] T.-L. Pham, G. Schneider, S. Goose, A Situated Computing Framework for Mobile and Ubiquitous Multimedia Access using Small Screen and Composite Devices. ACM Multimedia, 2000 [36] J. Rekimoto, M. Saitoh, Augmented Surfaces: A Spatially Continuous Workspace for Hybrid Computing Environments. In Proceedings of CHI’99, 1999.
256
Tatsuo Nakajima
[37] B. Schilit, N. Adams, and R. Want. Context-aware computing applications. In IEEE Workshop on Mobile Computing Systems and Applications, Santa Cruz, CA, US, 1994. [38] I. Siio, T. Masui, K. Fukuchi, Real-world Interaction using the FieldMouse. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST’99), 1999. [39] K. Soejima, M. Matsuda, T. Iino, T. Hayashi, and T. Nakajima, Building Audio and Visual Home Applications on Commodity Software, IEEE Transactions on Consumer Electronics, Vol.47, No.3, 2001. [40] P. Tandler, The BEACH Application Model and Software Framework for Synchronous Collaboration in Ubiquitous Computing Environments, Journal of Systems and Software, October, 2003. [41] E. Tokunaga, T. Nakajima, et. al., A Middleware Infrastructure for Building Mixed Reality Applications in Ubiquitous Computing Environments, In the Proceedings of Mobiquitous2004, 2004. [42] T. Yamabe, A. Takagi, T. Nakajima: Citron: A Context Information Acquisition Framework for Personal Devices. In Proceedings of the 11th IEEE Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2005), 2005. [43] T. Winograd. Architectures for context, In Journal of Human ComputerInteraction, Vol.10, No.2, Lawrence Earlbaum, 2001.
Collaborative Context Recognition for Mobile Devices
Pertti Huuskonen, Jani Mäntyjärvi and Ville Könönen
Pertti Huuskonen, Nokia Research Center, Visiokatu 1, 33720 Tampere, Finland
Jani Mäntyjärvi and Ville Könönen, VTT Technical Research Centre of Finland, Kaitoväylä 1, 90571 Oulu, Finland

1 Introduction
The next wave of mobile applications is at hand. Mobile phones, PDAs, cameras, music players, and gaming gadgets are creating a connected mobile ecosystem where it is possible to implement systems with significant embedded intelligence. Such advances will make it possible to move many functions of the current PC-centric applications to the mobile domain. Since the inherent difficulties that come with mobility—limited UIs, short attention spans, power dependency, intermittent connectivity, to name but a few—are still not going away, new solutions are needed to make mobile computing satisfactory. We are facing the paradox of cramming ever more functions into our ever more portable devices while seeking to achieve radically better usability and semi-usable automated intelligence. How could we pull this off?
In the HCI (Human-Computer Interaction) community it is almost universally acknowledged that context awareness is a key function in next-generation interfaces. To achieve more fluid interaction with the users, systems would need to develop a sense of the users' activities beyond the current application they are dealing with. The systems should have a rough idea of the users' location, social situation, tasks, activities, and many other factors that characterise human life. With this information it will be easier (though not trivial) to customise the offered information such that it is more relevant to the users' situation.
Mobile devices are following their users to all kinds of places. Therefore, they are taking part in situations that their desktop counterparts never needed to deal
with. For instance, the desktop PC never had to keep track of constantly changing locations. Some mobile devices are considered rather personal and are therefore usually carried along with the users practically everywhere, all the time. Many people consider their mobile phones to be the central hubs for their lives, expected to be available at all times. As such devices are moving with people, they become part of the social situations the users immerse themselves in. The devices, however, rarely have any idea of the situation around them. We seek to change that.
1.1 Humans: Context-aware Animals We, humans, are the most advanced context-aware entities on the planet. Large areas of our brains are connected with functions that, if implemented in machines, would fall in the domain of context awareness. We are, for instance, automatically aware of the place we are in, activities that take place around us, and other sentient beings present. We automatically follow events and developments around us and, if needed, take action (e.g. flee predators).
Fig. 1 Humans, context-aware animals, spend much of their early lives learning the proper behaviour for social situations through observing other people and imitating them.
A great deal of human context-aware behavior can be explained with primitive functions—survival, reproduction, subsistence. Over the course of time, more and more of our brain functions have been devoted to social behavior. Several of the most recently evolved brain structures have become necessary to deal with the range of complications found in human societies. Such features are far less pronounced in other animals [56]. Consider the case when you arrive late to a department meeting (or school, theatre, concert...). You know how to enter and find your place without causing too much disturbance in the event. You are aware of the proper behavior for that event. How? First, you have been educated in this throughout your life. Growing up, you have been immersed in all kinds of situations where you have learned the proper
behaviour from other people. Second, you have become an expert in interpreting the little clues in your environment that suggest how to behave: how people converse, sit or stand, flock. Their clothing, the furniture, the decoration, in fact everything in the place gives you hints on the situation. Manners, social etiquette, could be seen as rules for context-driven behaviour.¹
Humans are quite good at imitating other humans. We may consciously or subconsciously adapt our behavior to match that of other people. This may happen, for instance, to appear polite, to be safer, or to show appreciation. Or, walking with a friend, we may adapt our pace to march in rhythm. Parts of our brain appear to be associated with observing other people and simulating their behavior [46], which makes it easier for us to adapt.
Of course, human life is more complicated than that. We are deliberately simplifying the situation to make our point: people know how to deal with all kinds of situations and adapt. How could we teach machines to do the same—even to a rudimentary level? Our approach is to use a simplified model of human behavior for context awareness: when in doubt, do as others do. When lost, ask others for advice. Our focus in this chapter is to describe the principle of the proposed scheme for collaborative context understanding, some decision making algorithms, and networking issues in the domain of connected mobile devices. Further, we outline a platform for realising the scheme. We conclude with implications of the method for future systems.
2 Context From Collaboration
In signal understanding systems it is often useful to have reasonable starting hypotheses of the expected inputs. For instance, in a speech recognition system, it is beneficial to know the kind of language that the speaker will most likely utter. Such starting hypotheses, when accurate, can drastically improve the efficiency of recognition. In classic AI terms, good initial conditions can significantly reduce the search space.
Context aware systems are fundamentally doing signal understanding. Their key function is to combine sensor signals from multiple sources, both physical and informational, and process those signals such that an interpretation of the situation is formed. Therefore context aware systems will benefit from good initial hypotheses just like any signal understanding system.
A useful way to form initial hypotheses for context recognition is to ask surrounding devices what their interpretation of the current situation is, and then to abstract an internal context hypothesis out of those interpretations. This is the approach we propose in this chapter.
¹ This is a mechanistic view, and is used here only to motivate our approach. Human sciences have shown that analyzing and modeling human behavior is a vast sea of research, far from being well understood.
The nearby context-aware devices will already have tried to interpret the situation, or their users may have explicitly set some of the context clues (such as the ringtone volume). The more devices there are nearby, the more likely this information is to be available. On the other hand, if there are no nearby devices, the situation is probably less socially sensitive, so it matters less if the context is interpreted incorrectly.
This scheme—collaborative context recognition—requires communication channels between the devices. Such short-range links are already widely available, though arguably not yet ubiquitous. For instance, most smartphones contain Bluetooth transceivers, and wireless LAN support is becoming increasingly common in all kinds of devices. In general, there is a trend in consumer devices towards wireless links. Obviously the links alone are not enough: it matters what kind of data is exchanged. The devices need to share common definitions of context. Ontologies for context recognition are needed before useful context exchanges can happen between devices, or at least a shared set of context tags must be agreed upon.
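As an illustration only, such an exchange could be as simple as broadcasting a tagged context estimate together with a confidence value. The tag vocabulary, field names, and JSON encoding in the following sketch are assumptions made for this example, not part of any agreed ontology or of the platform described later in this chapter.

```python
import json
import time

# Hypothetical shared tag vocabulary agreed between devices (illustrative only).
SHARED_CONTEXT_TAGS = {"meeting", "commuting", "walking", "running", "at_home"}

def make_context_message(device_id: str, tag: str, confidence: float) -> str:
    """Serialize one device's context estimate for broadcasting to nearby devices."""
    if tag not in SHARED_CONTEXT_TAGS:
        raise ValueError(f"unknown context tag: {tag}")
    return json.dumps({
        "device": device_id,
        "context": tag,
        "confidence": round(confidence, 3),  # e.g. a weight such as Eq. 3 later in the chapter
        "timestamp": time.time(),
    })

# Example: a phone that is fairly sure its user is in a meeting.
print(make_context_message("phone-42", "meeting", 0.83))
```

Any real deployment would of course require the devices to agree on the tag set (or a richer ontology) beforehand, which is exactly the open issue noted above.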
3 Mobile Context Awareness
Context awareness has been an active field for well over a decade. In the space of computing research, it falls at the intersection of ubiquitous computing, artificial intelligence, and human-computer interaction. With the recent advances in mobile technologies, particularly mobile telephony and mobile media players, the focus of context awareness research is shifting towards mobile systems. This is quite appropriate, since context information is arguably richer when the devices are moving around with their users. In this section we summarize some typical approaches to mobile context awareness.
3.1 Context Sources
Context aware systems reported in the literature obtain their context data from a number of different context sources. The most popular form of context data today is location, since navigation systems are now commonplace in vehicles and starting to appear in pedestrian applications. These systems may not directly offer context aware features but offer excellent ground for implementing them. Various kinds of positioning systems have been applied, including GPS, cellular networks, wireless LANs, ultrasonic and RFID tags, and Bluetooth devices [32, 2, 1, 13]. The position data they produce varies significantly in terms of accuracy, noise, and intermittency. The systems reported in the literature often make use of several techniques simultaneously to improve accuracy and reliability, for instance falling back to cellular network identification where GPS is unavailable.
Environmental sensors are often part of context-aware ubicomp systems. Typical household applications measure temperatures, air pressure, gas concentrations or humidity levels [20, 4]. Such ambient parameters usually give slowly changing data that yield only general suggestions of the context, so they need to be combined with other data sources. Some smart space projects have experimented with sensored floors and furniture to recognize human presence and/or activities [19, 48, 44]. A number of experimenters have measured biometric data on humans for use in context awareness [42, 43, 25]. Signals of biological origin—heart rate, breathing rate, skin conductivity, ECG, body temperatures, to name a few—can be quite troublesome to measure reliably, particularly in the case of moving people. As a result, many reported systems have had to focus on instrumentation issues. However, with the increase of low-cost, low-power wireless sensors, we are expecting a surge in biometric context data sets. Acceleration sensors have been added to mobile devices with the general aim of studying the movements of their users. The sensors have become so cheap and accurate that they are today found in many high-end mobile phones and media players and consumer game devices. Again, the focus in consumer devices to date has not been context awareness, but the richness of acceleration data has made it a popular target for context awareness research [3, 27, 54, 11]. Besides position and acceleration sensing, various sensors have been tested with mobile devices, such as touch or proximity [17]. The internal functions of the mobile devices also give useful information on the users’ activities. Monitors for key presses, application launches, media viewing, calling and messaging are being reported [45, 26]. The multimedia functions in modern gadgets offer rich audio-visual data. Mobile phones and digital cameras today produce billions of images that can be used as sensors for various kinds of data—for instance, illumination parameters, movement detection, or presence of people or vehicles [28]. Microphones are being trialled for differentiating between music, speech and noise [26] or for measuring the properties of the aural space. In wearable computers, cameras and microphones have been used for context sensing [52]. Processing sensor measurements into analyzed signal features can be quite involved. Sometimes signal processing can be bypassed—for instance, if the data already resides in symbolic form. The informational “sensors” available in the devices and through online sources offer context data that is already easily computer readable (though not easily understood by them). Online calendars, email systems, instant messaging and blogs have been used as data sources for context aware applications [7]. The presence of people can be sensed in a number of ways, including infrared motion detectors, computer vision, tags and radio signals (e.g. cellular or Bluetooth transceivers). However, the largest use of these technologies to date has been in property monitoring and surveillance, not in context awareness.
3.2 Application Areas
To date, most context aware applications are based on some form of location sensing. Popular types include situated messaging [39], location-aware advertising [40], geotagging of images [38], and location-based searches (e.g. Google Maps). Location is clearly the context type that has received the most use both in research and in industry [15]. Driven by the success of navigation tools, the market will be bustling with location aware applications in the coming years. Suggestive evidence can be found, for instance, in the applications submitted to the Android developer contest (http://code.google.com/android/adc_gallery/). About half of those applications are based on the positioning capabilities of the upcoming mobile phone platform.
One rising trend is combining positioning data with media. Geotagging capabilities for images and video have been added to a number of online media databases [37, 53]. Such tools allow users to attach location information to the media they have created, and use location later when searching for media created by themselves—and others. Some devices equipped with GPS embed position data in the image metadata, while other tools automate later manual tagging [55]. In the near future, other forms of context data will find uses in media files [51, 31].
Context-aware messaging has been tried for text messages [21] or advertisements [40]. In those trials, location was again the most frequently applied context type, but provisions were made for future additions. Presence information has been popular in IM (instant messaging) applications for a long time. Users enter their availability (online, back soon, not reachable...) so other people know if they can be contacted. Variations of this scheme have been proposed for phonebooks [50]. More recently, microblogging applications such as Jaiku and Twitter have experimented with enhanced presence information that includes the user's current position and state of activity, as deduced by their client software.
Activity analysis of humans has been, besides location, the second most popular topic in mobile context awareness research. A number of approaches have been tried for recognizing users' physical states, for instance sitting, walking, running, and sleeping [10, 47, 16]. This state information, in turn, has been used in combination with other context data for various purposes, or for physiological analysis such as estimating one's energy consumption [22].
Context awareness is often said to be the key to intelligent user interfaces. Some successes have been reported [18], and some failures, including the Microsoft paperclip. In mobile phones, context information could be used to overcome the problem arising from users' short attention spans [41]. Rapid sensing of the use situation could help to focus the interaction on the crucial information. At the moment, context awareness research has mostly produced novel features for existing UI types, such as shaking a music player to shuffle songs in recent iPod Nanos. It could be argued that once context aware features for UIs are understood well enough, they are re-implemented with more traditional computation
techniques. In this way, context awareness echoes the development patterns found in the larger area of artificial intelligence.
The class of applications for lifelogging (or memory prosthesis, augmented memory) is closely related to context awareness [30, 12]. Such applications record a number of data items people have encountered in their lives: for instance, locations they have visited, people they have met, pictures they have taken, and web pages and messages they have read or written. In addition, multiple types of sensor data are typically recorded to the databases for later abstraction and recognition. Lifelogging usually emphasises immediate storage and later retrieval of life data, whereas context awareness often seeks to recognize context on the fly. These two approaches are complementary, and can fruitfully feed each other.
Finally, new gaming devices, both consoles (such as the Nintendo Wii) and mobile phones (such as the Apple iPhone), have started using acceleration sensors as key inputs. Also, cameras have been applied to gaming (e.g. in the Sony Eyetoy). This is introducing a new class of games where the physical world meets the game world. The approach is adopted on a larger scale in pervasive games, which take place in the real world with augmented objects. Some such games have made use of context awareness [5, 35].
3.3 Context Recognition
Mobile context recognition has fascinated researchers since the mid-1990s, when "smarter" mobile phones and PDAs appeared and positioning and miniaturised sensors emerged. The main challenges in building context recognition systems are (1) how to map data obtained from suitable context sources to meaningful situations of a mobile device user, (2) how to implement methods to extract context data on a mobile device, and (3) how to design algorithms able to adapt to users' routines and exceptions.
Research on context recognition has been conducted using various approaches: HMMs, neural networks, SOMs, dynamic programming, data mining, etc. Very attractive context recognition rates have been reported—up to 90%. However, context recognition can never be bulletproof, and approaches for tolerating uncertainty, both at the UI level and at the algorithm level, are also under development. A central element of context recognition on mobile terminals is the framework for managing context data processing on a device. Several suggestions for such frameworks have been developed [26, 45, 8]. While such frameworks provide platform-dependent solutions for fluent processing, they cannot solve the inherent problem of uncertainty. Ideally, context processing should give data that could be used up to prediction and recommendation levels. Collaborative context recognition seeks to provide some improvement to recognition accuracies.
3.4 Distributed Context Awareness A basic element of any context recognition system is a pattern classification algorithm for mapping observations to context classes. However, the classification is usually more or less inaccurate. Errors are caused by imperfect observations and the model in the pattern classification system. Collaborative context reasoning (CCR) seeks to decrease errors by providing joint context recognition based on several imperfect context recognition systems located in the same context. The objective is that recognition systems can utilize information available from each other, and thus their joint context recognition accuracy can be significantly higher than individual context recognition accuracies. In the next section, we discuss social rules and mechanisms for CCR, with an emphasis on voting techniques.
4 Making Rational Decisions
In cooperative distributed problem solving, there exists a special entity, for example the designer of the system, that defines an interaction protocol for all agents in the system and a function that stipulates how the agents select their actions. Context recognition systems differ from cooperative distributed problem solving systems in that all context recognizers act autonomously and implement their own context selection procedures. In CCR, individual context recognizers then input their context estimates to an external mechanism that combines the individual estimates to provide more accurate joint context estimates. The joint context estimate can be either enforcing, in the sense that the context recognizers should use the joint estimate even if it contradicts their own individual estimates, or more liberal: individual context recognizers utilize this information only if they are not very confident about their own estimates.
As the context recognizers make their own estimates autonomously, a reasonable assumption is that the context recognizers act rationally, i.e. they maximize their own welfare. Therefore, it is rational to utilize joint context estimates only if the context recognizers benefit from this additional information. Later in this section, when discussing communication requirements, we argue that linking confidence values to context estimates provides a way to evaluate the usefulness of joint context estimates.
We start our discussion by going inside a context recognizer and placing its functionality into the decision making framework. Then we proceed with discussing different voting mechanisms and explaining how to use voting as a CCR method.
4.1 Distributed Decision Making
In decision making, the goal is to make optimal decisions with respect to a utility function modelling the preferences of a decision maker. In distributed decision making, the decision maker also has to take into account the decisions of other active components in the system, i.e. other decision makers. In traditional game theory, a decision maker (player) knows not only its own utility function but also the utility functions of all other decision makers in the system. Based on this knowledge, all the decision makers are able to calculate an equilibrium solution, i.e. a solution from which it is not rational for any individual decision maker to deviate alone. However, in real-world problems, having such comprehensive knowledge about the utility functions of all decision makers is not possible. Usually it is not even possible to learn these utility functions by continuously interacting with other decision makers. Therefore some other opponent modeling techniques are needed, for example probabilistic opponent models. A thorough dissection of the basic game theoretic methods is [36].
In context recognition systems, context recognizers select a context class from the set of all context classes. Putting this selection task into the decision making framework, we have to associate a utility value with each context class. How this value is determined depends on the pattern classification algorithm that maps sensory input to contexts. Later in this section, we use as an example the Minimum-distance classifier, which simply calculates the distance of the current sensory input from the prototype (ideal) vectors of all context classes and selects the class with the minimum distance. With this method, the Euclidean distance between the current input and a prototype vector is used as the utility value.
4.2 Voting Protocols as Distributed Decision Making Strategies
Distributed rational decision making mechanisms such as auctions and voting protocols are extensively studied in the game theory and machine learning research fields. A good survey of different mechanisms is [49]. Using voting mechanisms as a basis for distributed pattern recognition is studied in [29]. Committee machines provide an alternative way to combine different classifiers, see e.g. [14]. Combining local context information with the information from other context recognizers is studied in [33].
In voting mechanisms, the setting includes several independent voters that vote on a common subject. In addition to the voters there is a voting protocol, i.e. a social rule that stipulates how to combine the individual votes into the final outcome. Next we will discuss several commonly used voting protocols. In each case we assume that there exist m alternatives, of which one is selected to be the final outcome of the voting.
• Borda count. Each voter arranges the alternatives into an order that portrays her preferences. The first choice gets m votes, the second one m − 1 votes, and the last one only 1 vote. The final outcome of the scheme is the alternative with the maximal number of votes. Note that this is exactly the voting scheme that is used in the Eurovision Song Contest.
• Majority voting. Each voter gives a vote to her preferred alternative. The final outcome of the scheme is the alternative with the maximal number of votes. The plain majority voting mechanism is used formally in direct elections and informally for daily practical things, for example deciding where to go to eat.
• Weighted majority voting. This scheme differs from plain majority voting in that each voter weights its vote with a confidence value. This value tells the other voters how sure the voter is about its choice. Note that weighted majority voting is actually a broad class of different voting schemes with different weighting methods. For example, paper referees are often asked also to judge their own expertise on the research area of the paper to be evaluated.
• Binary protocol. In this protocol, only two options are presented to the voters at each time step. The alternative with the maximum number of votes is then pitted against a new alternative. The outcome of this scheme depends on the order in which the alternatives are presented to the voters.
By selecting different weighting methods, weighted majority voting provides a reasonably versatile class of voting schemes for most application areas. In the remaining part of this text, we will focus on weighted majority voting and its utilization in CCR.
Another important property of the voting protocols discussed above is that they assume that all the voters act sincerely. When discussing voting as a CCR method, we also assume sincerity of the voters. However, we also discuss the consequences of insincerity in applications that utilize CCR.
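To make the mechanism concrete, the following minimal sketch implements weighted majority voting over (context, weight) pairs; the function name and the example labels are ours, chosen only for illustration.

```python
from collections import defaultdict

def weighted_majority_vote(votes):
    """votes: list of (context_label, weight) pairs, one per voter.
    With all weights equal to 1 this reduces to plain majority voting."""
    totals = defaultdict(float)
    for label, weight in votes:
        totals[label] += weight
    # Return the label with the largest accumulated weight.
    return max(totals, key=totals.get)

# Example: three devices vote on the current context with confidence weights.
print(weighted_majority_vote([("meeting", 0.9), ("walking", 0.4), ("meeting", 0.2)]))
# -> 'meeting'
```

The weights themselves can come from the confidence evaluation described in the next section.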
4.3 Voting as CCR Method
Here we view a context recognition system as a voter that votes on different contexts. Although there can be several context recognizers in the same environment, they all might have a slightly different model of their context and therefore they are able to complement each other. This hopefully leads to an improved joint recognition accuracy.
4.3.1 Confidence Evaluation
As discussed above, by using a weighted majority voting scheme, a voter reveals how confident she is about her context estimate. In other words, a context recognizer should judge how well its internal model, i.e. the parameters of the pattern recognition method and system, can capture the relationship between sensory information and the estimated context classes. Unfortunately, such confidence information is not directly available from all pattern recognition methods, such as Support Vector Machines or feed-forward neural networks [9]. Here we present a simple pattern recognition method, the Minimum-distance classifier, and discuss a related confidence estimation method.
Consider a classification task where we have to associate an N-dimensional sample s with one of C classes. For example, the dimensions can be features computed from sensory inputs. For each class j = 1, . . . , C, we have I_j example samples x_i^j, i = 1, . . . , I_j. Further, let c^j represent the ideal vector, i.e. a vector that represents the class in the best possible way. A natural way to choose the ideal vector is to use the class mean vector, i.e.:

c^j = \frac{1}{I_j} \sum_{i=1}^{I_j} x_i^j .

Now, the classification to the class j^* can be accomplished as follows:

j^* = \arg\min_{j=1,\dots,C} \| s - c^j \| ,
where \| \cdot \| is a norm, e.g. the Euclidean norm.
The linear classifier described above has several advantages. It has small computational and space requirements, it is easy to implement on various platforms, and special training algorithms are not needed—it is enough to have an ideal vector for each context class. However, to obtain a perfect classification result, the context classes should be linearly separable, which is rarely the case. Therefore, if the separation of the context classes is highly nonlinear, the Minimum-distance classifier leads to suboptimal results. A detailed analysis of the Minimum-distance classifier can be found in [9].
The Minimum-distance classifier compares the sample to be classified to the class centers, and therefore we can identify the following extreme cases:
• A sample is located very near a class mean vector. In this case, the sample almost surely belongs to that class.
• The distances between a sample and the mean vectors are almost identical. In this case, we cannot distinguish between the classes.
So it seems plausible that we can use the distance as a basis for the confidence evaluation. The distance between a sample s and the class mean vector c^i is shown in Eq. 1:

d^i = \| s - c^i \| .   (1)

It would also be desirable to limit the confidence values to a fixed interval, e.g. the unit interval. This can be accomplished by dividing all the distances by the maximal distance value. The procedure is shown in Eq. 2:

\tilde{d}^i = \frac{d^i}{\max_{j=1,\dots,C} d^j} .   (2)

In this chapter, we use the absolute difference between the maximal and the minimal scaled distance value as the weighting function. The weight w for the estimate is therefore:

w = 1 - \min_{i=1,\dots,C} \tilde{d}^i .   (3)

If the sample s is very near a mean vector, the corresponding \tilde{d}^i is small and w gets a value near 1. On the other hand, if the distances d^i are almost identical, \tilde{d}^i is near 1 and the weight w is correspondingly near 0. A detailed discussion of the Minimum-distance classifier as a CCR method can be found in [23].
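A minimal sketch of the Minimum-distance classifier and the confidence weight of Eqs. (1)–(3) could look as follows; the function and variable names are illustrative and do not come from the original implementation.

```python
import numpy as np

def train_prototypes(samples_per_class):
    """samples_per_class: dict mapping class label -> array of shape (I_j, N).
    Returns the class mean (ideal) vectors c^j."""
    return {label: np.mean(x, axis=0) for label, x in samples_per_class.items()}

def classify_with_confidence(sample, prototypes):
    """Minimum-distance classification plus the confidence weight of Eq. (3)."""
    labels = list(prototypes)
    # Eq. (1): Euclidean distances to all class mean vectors.
    d = np.array([np.linalg.norm(sample - prototypes[l]) for l in labels])
    # Eq. (2): scale distances into the unit interval.
    d_tilde = d / d.max()
    # Eq. (3): confidence weight; near 1 when one class is clearly closest.
    w = 1.0 - d_tilde.min()
    return labels[int(np.argmin(d))], w

# Toy example with two context classes in a 2-D feature space.
protos = train_prototypes({
    "walking": np.array([[1.0, 0.2], [1.2, 0.1]]),
    "sitting": np.array([[0.0, 0.0], [0.1, 0.1]]),
})
print(classify_with_confidence(np.array([1.1, 0.15]), protos))
```

The returned weight w can be fed directly into the weighted majority voting sketch shown earlier.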
4.3.2 Insincerity
Where several devices obtain information from each other and rely on that data for critical functions, it may become very inviting for malicious parties to tamper with the CCR process or misuse the information. Here, we briefly explain a couple of such cases and discuss some solutions to overcome the problems.
The CCR process consists of rounds of queries that are based on confidence in the collaboratively agreed context and on each device's own subjective idea of the current context. In certain situations the process can be highly unstable and even prone to oscillatory behaviour. Furthermore, this may ultimately lead some devices into "endless" CCR loops. Also, devices that eagerly try to obtain a proper collaborative context value may eventually drain their batteries. An example of such a situation is when several false parties publish contexts with low confidence values. Another way to cause harm to innocent mobile devices is to publish a wrong context value with a high confidence. Again, with a sufficient proportion of insincere devices interfering with the CCR process, it can converge to completely wrong context values. In the worst scenario, innocent devices may even perform serious actions and malfunction based on collaboratively agreed faulty values.
According to an evaluation with real context data [24], the situations described above are likely to occur, and thus it is worth discussing solutions to avoid them. First of all, when dealing with context-aware functionality on a mobile device, it is crucial to let the user remain in charge of executing serious actions on the device. Secondly, devices participating in a CCR process should have mechanisms to detect possible extreme cases leading to hazards, and prevent them. This can be accomplished e.g. by regulating the transceiver according to the progress of the context value's confidence or by regulating acceptance of the collaborative context value according to its similarity to one's own current context value [33].
4.3.3 Properties of communication protocols supporting voting as a CCR method
Collaborative context recognition seeks to mimic a combination of human-type perception and decision making. Due to this design philosophy, CCR is a quite local, information-hungry and query-intense, yet open and social method, and it creates high data traffic with small individual data transactions. In addition, response and setup times must be very quick, with very low power consumption. Unfortunately, such communication requirements differ quite a bit from established local-area mobile communication protocols (such as Bluetooth).
Although the amount of data to be exchanged between devices in the voting protocols is small, each time we want to make a joint context estimation, data have to be broadcast among the voting devices. In many cases the devices are battery-driven, and therefore continuous communication is quite demanding for such restricted devices. Communication requirements can be divided into two categories:
• Listening to the context estimates and the confidence values coming from other devices
• Broadcasting one's own context estimates and confidence values
However, based on the confidence information, we can significantly reduce these requirements. Let us consider the first item: if a voter continuously reports low confidence values, it is plausible to assume that the quality of its contextual information is really low. Therefore we can reduce the rate at which we observe the voter. On the other hand, if we are really confident about our own context estimates, we do not necessarily need the information coming from the other voters and we can act in isolation, especially if we are running out of battery. Here we refer back to the beginning of this section: it is rational for a context recognizer to utilize context information coming from other devices only if it can be used to improve context recognition accuracy.
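The confidence-based throttling just described might be sketched as follows; the thresholds, interval bounds, and the battery rule are assumptions chosen for illustration rather than values taken from the chapter.

```python
def listening_interval(peer_confidences, base_interval=5.0, max_interval=60.0):
    """Grow the polling interval (seconds) for a peer whose recent confidence
    values stay low, since its contextual information is probably of low quality."""
    if not peer_confidences:
        return base_interval
    avg = sum(peer_confidences) / len(peer_confidences)
    # Low average confidence -> observe the peer less often (capped at max_interval).
    return min(max_interval, base_interval / max(avg, base_interval / max_interval))

def should_act_in_isolation(own_confidence, battery_level,
                            confidence_threshold=0.9, battery_threshold=0.2):
    """Skip listening to other voters when we are already confident enough,
    or when the battery is nearly empty."""
    return own_confidence >= confidence_threshold or battery_level <= battery_threshold

print(listening_interval([0.1, 0.2, 0.15]))   # long interval for an unsure peer
print(should_act_in_isolation(0.95, 0.8))     # True: confident enough to skip CCR
```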
5 Case Study: Physical Activity Recognition In this section we discuss a realistic CCR example in which several independent context recognizers are located in the same context, doing the same physical activity in this case, and give their estimate on the current context. We start the section by describing the features of the dataset recorded from several physical activities. Then we proceed with test settings and results obtained from the test runs.
5.1 Data Set and Test Settings
In this text, we illustrate CCR methods by using a realistic data set [42]. It is a large data set comprising measurements from several sport activities as well as daily activities such as shopping and eating in a restaurant. In this text we concentrate only on sport activities.
Twelve persons took part in the data recording. Their age range was from 19 to 49 years, with an average of 27 years. The average length of the recordings was almost seven hours. They contain both sessions with pre-arranged activities and free-form sessions in which the test subjects were allowed to freely do activities they were interested in. The pre-arranged activities were annotated by a test supervisor, while the free activities were annotated by the test persons themselves. In this study we focus on only a few sport activities and do not distinguish between the two annotation schemes. The following activities were considered in our study: bicycling, soccer, lying still, nordic walking, rowing, running, sitting, standing, and walking.
5.1.1 Sensor Units
The principal sensor units for this study are two three-channel acceleration sensors and a heart rate monitor. Signals from the acceleration sensors are recorded using an Embla A10, a 19-channel recorder device. Annotations are made using an iPaq device running custom annotation software. The sensor placement is illustrated in Fig. 2.
5.1.2 Feature Space for the Activity Recognition
In addition to the raw signals from the sensor units, the data set also provides a vast number of features calculated from the raw data. These include both time-domain and frequency-domain features. In this study we use 10 features that were selected using automatic feature selection methods (SFS and SFFS). The list of the selected features and a detailed description of the methods can be found in [24].
5.1.3 Test Settings
Although data recording sessions were carried out for each test subject separately, the environments in the sessions were very similar. We therefore make the artificial assumption that all the recording sessions for the same activity were carried out at the same time. This allows us to use the data for a simulated group decision making experiment. Transformed into the same time scale, all test persons can be
Fig. 2 An image depicting sensor placements during the data recording sessions. 1: Heart rate sensor, 2: Acceleration sensor on hip, 3: Acceleration sensor on wrist.
seen acting as individual decision makers, and together they vote on the recognition of the current activity. The group size varied from an individual person to 6 persons in our test runs. The Minimum-distance classifier was trained with the remaining 6 persons. For each group a random activity was selected and the group members voted on the activity. In the test runs, the total number of groups was 10,000.
A clarifying example for a group containing three persons is shown in Fig. 3. In this example, all persons are performing the same activity and hence the datasets for all persons (implemented as Minimum-distance classifiers) are almost the same. However, small modifications took place between the recording sessions, for example small variations in the location of the exercise bike, and therefore all the persons see the activity from a slightly different "angle".
Fig. 3 An example of joint activity estimation in the case of three independent voters. Possible voting schemes include Borda count, majority voting, and weighted majority voting. Confidence values for weighted majority voting can be computed using Eq. 3. Classifiers are Minimum-distance classifiers and the data sets for different persons are from the same activity but from a slightly different "angle".
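The group experiment can be sketched roughly as shown below; the toy recognizers, group sizes, and number of rounds stand in for the real recordings and trained classifiers, so the numbers it prints are not the ones reported in Fig. 4.

```python
import random

ACTIVITIES = ["bicycling", "soccer", "lying still", "nordic walking", "rowing",
              "running", "sitting", "standing", "walking"]

def simulate_groups(recognizers, n_groups=10000, max_group_size=6):
    """recognizers: one callable per test person; each takes the true activity and
    returns (estimated_label, confidence), standing in for that person's trained
    Minimum-distance classifier applied to their own recording of the activity."""
    correct = 0
    for _ in range(n_groups):
        group = random.sample(recognizers, random.randint(1, max_group_size))
        activity = random.choice(ACTIVITIES)
        # Weighted majority vote over the group's individual estimates.
        totals = {}
        for label, weight in (r(activity) for r in group):
            totals[label] = totals.get(label, 0.0) + weight
        if max(totals, key=totals.get) == activity:
            correct += 1
    return correct / n_groups

# Toy usage: six noisy "recognizers" that are right about 70% of the time.
def noisy_recognizer(true_activity):
    if random.random() < 0.7:
        return true_activity, random.uniform(0.5, 1.0)
    return random.choice(ACTIVITIES), random.uniform(0.0, 0.5)

print(simulate_groups([noisy_recognizer] * 6, n_groups=2000))
```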
5.2 Empirical Results
Fig. 4 depicts the joint recognition accuracies. Clearly, the accuracies with all voting mechanisms generally improve when the number of voters increases. The Borda count clearly leads to the worst result. Weighted majority voting leads to a smoother curve than plain majority voting, because individual unsure voters do not have significant contributions to the overall recognition accuracy.
From Fig. 4 it can be seen that individual context recognizers can significantly benefit from using the joint context estimates. Note that in these test runs the pattern classification method was a really simple one, the Minimum-distance classifier, which can be seen in the relatively low context recognition accuracies when recognition is done in isolation. However, with 6 recognizers, the accuracy is already over 90%, which can be considered very high in real-world applications. This is significant, since our method allows the individual recognizers to be very simple (and thus computation- and energy-efficient), but together they produce useful results.
Fig. 4 Joint recognition accuracies with different voting schemes.
6 Context Recognition Platform
For testing and integrating CCR methods with applications, we have implemented a context recognition platform for Symbian S60 mobile phones. A schematic view of the platform is depicted in Fig. 5. The core of the platform is the Minimum-distance classifier, and the platform contains different modules for feature computation and for integrating sensory information from either built-in sensors or sensors connected via a Bluetooth interface.
As mobile platforms usually have quite limited resources, we offload some of the data to an external server. The information is sent over HTTP, encoded into XML messages that can include raw data, computed feature values, or context estimates with confidence values. The server side is implemented using an Apache Tomcat server. The server can have several connections open simultaneously and is therefore able to forward information to other mobile devices. It is also possible to apply more complex data mining algorithms to the data on the server. The platform has been applied in a number of tests. However, for the CCR method we do not want to rely on servers too much, as in real-world settings all the devices participating in a CCR process are likely to be mobile.
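As a rough illustration of such a message exchange (written in Python rather than the Symbian C++ or server-side Java actually used), the sketch below builds an XML context report and posts it over HTTP; the element names and the endpoint URL are invented for this example and do not reproduce the platform's actual schema.

```python
import urllib.request
import xml.etree.ElementTree as ET

def build_context_xml(device_id, context, confidence, features=None):
    """Build an XML context report; the element and attribute names are made up
    for illustration and do not correspond to the platform's real message format."""
    root = ET.Element("contextReport", device=device_id)
    est = ET.SubElement(root, "estimate", confidence=f"{confidence:.2f}")
    est.text = context
    for name, value in (features or {}).items():
        ET.SubElement(root, "feature", name=name).text = f"{value:.4f}"
    return ET.tostring(root, encoding="utf-8")

def post_report(url, payload):
    """Send the report to the server over HTTP (hypothetical endpoint URL)."""
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/xml"})
    return urllib.request.urlopen(req).status

payload = build_context_xml("phone-42", "walking", 0.87,
                            {"acc_std_hip": 0.3124, "heart_rate_mean": 96.0})
print(payload.decode("utf-8"))
# post_report("http://example.org/ccr/report", payload)  # requires a running server
```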
Fig. 5 A schematic view of the context recognizers that run the context recognition platform. The application in this small example only shows the positions of the other phones connected to the system.
7 The Way Forward
We are running experiments with the context recognition platform on a small scale. However, to test the feasibility of our approach, it will be necessary to deploy the platform on a larger number of devices that are used in everyday situations. Such larger-scale prototyping will undoubtedly give us insight into the feasibility of the CCR method, but also shed light on some of the real-world complications. Some of them are already clear from lab tests.
7.1 Energy Savings
Battery consumption is an essential factor in all mobile systems. Many devices go to considerable lengths to save power. For instance, mobile phones sleep a lot: they switch most of their key functions off when not needed, but are prepared to bring them up again at a microsecond's notice. The cellular radio transceiver is a major power drain, and therefore cellular protocols are designed so that the receiver can be switched off for relatively large fractions of use time.
Our method is inherently problematic for sleep-mode devices. The communication with nearby devices needs to go on at least periodically, if not constantly. Synchronous communication among the nearby devices would help, but the protocols will be hard to realize for a heterogeneous ad-hoc group of devices. Moreover, the background recognition algorithms may take considerable time (and therefore power) to run in complex situations with a large number of nodes.
We seek to achieve greater power efficiency through better communication protocols that require less always-on listening, and through optimizing processing such that it can be carried out incrementally and periodically, not constantly. For networked devices, a centralized server can take over much of the work from mobile devices. In a completely mobile ad-hoc system a central server can't be relied upon, however.
Improvements in short-range data links will improve the feasibility of our system. The current generation of Bluetooth transceivers still suffers from very long discovery times, which make them unsuitable for rapidly changing situations (such as walking through a crowd). Also, in some areas public opinion currently seems to be going against Bluetooth, due to risks that the press is bringing forward.
7.2 Risks
The potential risks of our approach are indeed numerous, if left unaddressed. It's easy to imagine possible ways of attacking the mobile devices. For instance, malicious nodes could deliberately set their context to something improper to cause harm to other nodes' recognition (as explained in Section 4.3.2). The context data could include spam, just like any other unregulated information channel. More seriously, the context data could be used for various kinds of attacks, including buffer overflows, which in turn could be used for intrusion. Such misdeeds will happen as soon as there are financial benefits to be had.
Such risks are by no means unique to our approach, and will have to be addressed for most mobile devices and applications. However, we get some help from the social nature of our domain. Collaborative context recognition is most useful in a shared situation, that is, where a number of people are present. This shared physical presence is a key factor in decreasing (but not eliminating) mischievous acts. It takes less nerve to attack computers over the internet than people in the same room.
The Big Brother syndrome is a larger concern that our approach shares with many other systems. CCR is fundamentally based on sharing one's context data with other devices. Some of those devices could store and relay context data for dubious purposes. For instance, it will be possible to recognize that certain people have met in a certain place at a certain time, which is something that governments are keen to know (and criminals—imagine a bomb going off when enough people are in the same place). Anonymity techniques could help somewhat, but can't protect personal data against massive data mining. It may also become mandatory to give access to one's personal context store under a judicial order. We are expecting a rich set of
protection and encryption mechanisms to be conceived for short-range interaction, perhaps based on some variation of PKI (Public-Key Infrastructure).
CCR can be used for many purposes, by many applications. Some are inherently riskier than others. For instance, the archetypal use case for context awareness is to automatically silence your phone's ringtone when you enter a meeting. A CCR-enabled phone would collaborate with other phones in the meeting and notice that most of them have already been silenced, so your phone would do the same. Yet, this application is socially risky. It is humiliating if your phone decides that a wedding is not a meeting and rings in the church. Context recognition is always approximate, and bound to fail often. It can be preferable to use context awareness for applications that are more socially appropriate ("safer"), such as situated reminders, contextual advertising, and games.
These issues—privacy, protection, social appropriateness—have so far been outside our focus. However, they are necessary requirements for real-world systems. We are expecting some solutions from the large body of research and development now going on in the domain of mobile applications in general.
7.3 Knowledge-based CCR?
Context data needs higher-level knowledge representation to be widely useful. Ultimately, mobile devices should be able to exchange contexts using a shared set of agreed symbols that convey some semantic meaning. The ongoing work on the Semantic Web [6] and related efforts promise to bring shared machine-level vocabularies for distributed systems (such as ours). Yet, generic common-sense understanding of the world [34] is still some way in the future. Context awareness deals with human life, which is a vast domain with billions of issues to be modeled. We need to confine ourselves to much more limited domains (such as media sharing) to be able to model the knowledge involved. Even then, there are no generic context awareness ontologies agreed upon that devices could use for context collaboration. Until such ontologies are developed, CCR systems need to confine themselves to tags, symbols whose meaning is shared on the application level.
In general, context awareness shares the same problem that plagues artificial intelligence and HCI all over. As soon as a system is able to demonstrate some level of intelligence (more precisely, functions that would appear intelligent if carried out by humans), the users raise their expectations of the system's capabilities, and are then disappointed the next moment when the system fails to perform intelligently. Consider again the earlier example of ringtone silencing. A phone could be using trivial rule-based reasoning in the style of "IF there is a meeting in the calendar AND there are several people in the same room with you THEN silence the ringtone". Given the right situation, this simple rule can work like a charm, and the user is happy. Then, in the church, when the calendar condition fails, the phone ringing puts the user to shame. It will not be immediately obvious to the user why this happened. This simple rule could be understandable even to ordinary users,
but any more complex reasoning will remain opaque. It will be necessary to design collaborative context applications such that they fail gracefully: provide reasonable functions even when useful knowledge is absent. If graceful failure is not possible, then at minimum the system should be able to give meaningful error messages, and possibly further explanations for those users who want to know more. Of course, this has not yet been achieved in most systems today.

Acknowledgements This work has been supported by the Academy of Finland and Tekes.
References
[1] Anastasi G, Bandelloni R, Conti M, Delmastro F, Gregori E, Mainetto G (2003) Experimenting an indoor bluetooth-based positioning service. In: Proceedings of the 23rd International Conference on Distributed Computing Systems Workshops (ICDSW 2003), pp 1067–1080
[2] Balakrishnan H, Baliga R, Curtis D, Goraczko M, Miu A, Priyantha B, Smith A, Steele K, Teller S, Wang K (2003) Lessons from developing and deploying the cricket indoor location system. Tech. rep., MIT Computer Science and AI Lab
[3] Bao L, Intille S (2004) Activity recognition from user-annotated acceleration data. In: Proceedings of the Second International Conference on Pervasive Computing, pp 1–17
[4] Beigl M, Krohn A, Zimmer T, Decker C (2004) Typical sensors needed in ubiquitous and pervasive computing. In: Proceedings of the First International Workshop on Networked Sensing Systems (INSS '04), pp 153–158
[5] Benford S, Magerkurth C, Ljungstrand P (2005) Bridging the physical and digital in pervasive gaming. Commun ACM 48(3):54–57
[6] Berners-Lee T, Hendler J, Lassila O, et al (2001) The Semantic Web. Scientific American 284(5):28–37
[7] Dey A, Abowd G (2000) CybreMinder: A context-aware system for supporting reminders. Lecture Notes in Computer Science, pp 172–186
[8] Dey A, Abowd G (2000) The Context Toolkit: Aiding the Development of Context-Aware Applications. Workshop on Software Engineering for Wearable and Pervasive Computing
[9] Duda R, Hart P, Stork D (2001) Pattern Classification, 2nd edn. John Wiley & Sons, Inc.
[10] Farringdon J, Moore AJ, Tilbury N, Church J, Biemond PD (1999) Wearable sensor badge and sensor jacket for context awareness. In: ISWC '99: Proceedings of the 3rd IEEE International Symposium on Wearable Computers, IEEE Computer Society, Washington, DC, USA, p 107
[11] Gellersen H, Schmidt A, Beigl M (2002) Multi-sensor context-awareness in mobile devices and smart artifacts. Mobile Networks and Applications 7(5):341–351
[12] Gemmell J, Bell G, Lueder R, Drucker S, Wong C (2002) Mylifebits: fulfilling the memex vision. In: MULTIMEDIA '02: Proceedings of the tenth ACM international conference on Multimedia, ACM, New York, NY, USA, pp 235–238
[13] Hallberg J, Nilsson M, Synnes K (2003) Positioning with bluetooth. In: Proceedings of the 10th International Conference on Telecommunications (ICT 2003), vol 2, pp 954–958
[14] Haykin S (1999) Neural networks: a comprehensive foundation, 2nd edn. Prentice Hall
[15] Hazas M, Scott J, Krumm J (2004) Location-aware computing comes of age. Computer 37(2):95–97
[16] Himberg J, Flanagan J, Mantyjarvi J (2003) Towards context awareness using symbol clustering map. Proceedings of the Workshop for Self-Organizing Maps, pp 249–254
[17] Hinckley K, Pierce J, Horvitz E, Sinclair M (2005) Foreground and background interaction with sensor-enhanced mobile devices. ACM Trans Comput-Hum Interact 12(1):31–52
[18] Ho J, Intille SS (2005) Using context-aware computing to reduce the perceived burden of interruptions from mobile devices. In: CHI '05: Proceedings of the SIGCHI conference on Human factors in computing systems, ACM, New York, NY, USA, pp 909–918
[19] Ito M, Iwaya A, Saito M, Nakanishi K, Matsumiya K, Nakazawa J, Nishio N, Takashio K, Tokuda H (2003) Smart furniture: Improvising ubiquitous hotspot environment. ICDCSW 00:248
[20] Jantunen I, Laine H, Huuskonen P, Trossen D, Ermolov V (2008) Smart sensor architecture for mobile-terminal-centric ambient intelligence. Sensors and Actuators A: Physical 142(1):352–360
[21] Jung Y, Persson P, Blom J (2005) DeDe: design and evaluation of a context-enhanced mobile messaging system. Proceedings of the SIGCHI conference on Human factors in computing systems, pp 351–360
[22] Knight F, Schwirtz A, Psomadelis F, Baber C, Bristow W, Arvanitis N (2005) The design of the sensvest. Personal Ubiquitous Comput 9(1):6–19
[23] Könönen V, Mäntyjärvi J (2008) Decision making strategies for mobile collaborative context recognition. In: Proceedings of the 4th International Mobile Multimedia Communications Conference (MobiMedia 2008)
[24] Könönen V, Mäntyjärvi J, Similä H, Pärkkä J, Ermes M (2008) A computationally light classification method for mobile wellness platforms. In: Proceedings of the 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'08)
[25] Korhonen I, Bardram J (2004) Guest editorial introduction to the special section on pervasive healthcare. Information Technology in Biomedicine, IEEE Transactions on 8(3):229–234
[26] Korpipää P (2005) Blackboard-based software framework and tool for mobile device context awareness. VTT Publications 579
[27] Korpipää P, Koskinen M, Peltola J, Mäkelä SM, Seppänen T (2003) Bayesian approach to sensor-based context awareness. Personal Ubiquitous Comput 7(2):113–124 [28] Kulkarni P, Ganesan D, Shenoy P, Lu Q (2005) Senseye: a multi-tier camera sensor network. In: MULTIMEDIA 2005: Proceedings of the 13th annual ACM international conference on Multimedia, ACM, New York, NY, USA, pp 229–238 [29] Lam L, Suen S (1997) Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Transaction on Systems, Man and Cybernetics, Part A 27:553–568 [30] Lamming M, Flynn M (1994) Forget-me-not: Intimate computing in support of human memory. Proceedings of FRIEND21 94:2–4 [31] Lehikoinen J, Aaltonen A, Huuskonen P, Salminen I (2007) Personal Content Experience: Managing Digital Life in the Mobile Age. Wiley-Interscience [32] Liu H, Darabi H, Banerjee P, Liu J (2007) Survey of wireless indoor positioning techniques and systems. IEEE transactions on systems, man, and cybernetics part c: applications and reviews 37(6):1067–1080 [33] Mäntyjärvi J, Himberg J, Huuskonen P (2003) Collaborative context recognition for handheld devices. In: Proceedings of the first IEEE International Conference on Pervasive Computing and Communications (PerCom 2003), pp 161–168 [34] Minsky M (2000) Commonsense-based interfaces. Communications of the ACM 43(8):66–73 [35] Mitchell K, McCaffery D, Metaxas G, Finney J, Schmid S, Scott A (2003) Six in the city: introducing real tournament - a mobile ipv6 based contextaware multiplayer game. In: NetGames ’03: Proceedings of the 2nd workshop on Network and system support for games, ACM, New York, NY, USA, pp 91–100 [36] Myerson R (1991) Game Theory: Analysis of Conflict. Harvard University Press [37] Naaman M (2006) Eyes on the world. Computer 39(10):108–111 [38] Naaman M, Harada S, Wang Q, Garcia-Molina H, Paepcke A (2004) Context data in geo-referenced digital photo collections. In: MULTIMEDIA ’04: Proceedings of the 12th annual ACM international conference on Multimedia, ACM, New York, NY, USA, pp 196–203 [39] Nakanishi Y, Tsuji T, Ohyama M, Hakozaki K (2000) Context aware messaging service: A dynamical messaging delivery using location information and schedule information. Personal and Ubiquitous Computing 4(4):221–224 [40] Ojala T, Korhonen J, Aittola M, Ollila M, Koivumäki T, Tähtinen J, Karjaluoto H (2003) SmartRotuaari-Context-aware Mobile Multimedia Services. Proceedings of the 2nd International Conference on Mobile and Ubiquitous Multimedia, Norrköping, Sweden pp 8–19 [41] Oulasvirta A, Tamminen S, Roto V, Kuorelahti J (2005) Interaction in 4-second bursts: the fragmented nature of attentional resources in mobile hci. In: CHI
280
[42]
[43] [44]
[45]
[46] [47]
[48]
[49]
[50] [51]
[52]
[53]
[54]
[55]
[56]
Pertti Huuskonen, Jani Mäntyjärvi and Ville Könönen
’05: Proceedings of the SIGCHI conference on Human factors in computing systems, ACM, New York, NY, USA, pp 919–928 Pärkkä J, Ermes M, Korpipää P, Mäntyjärvi J, Peltola J, Korhonen I (2006) Activity classification using realistic data from wearable sensors. IEEE Transactions on Information Technology in Biomedicine 10(1) Pentland AS (2004) Healthwear: Medical technology becomes wearable. Computer 37(5):42–49 Pirttikangas S, Suutala J, Riekki J, Röning J (2003) Footstep identification from pressure signals using hidden markov models. Proceedings of the Finnish Signal Processing Symposium (FINSIG 2003) pp 124–128 Raento M, Oulasvirta A, Petit R, Toivonen H (2005) Contextphone: A prototyping platform for context-aware mobile applications. IEEE Pervasive Computing 4(2) Ramachandran V, Oberman L (2006) Broken mirrors: A theory of autism. Scientific American 295(55):63–69 Randell C, Muller H (2000) Context awareness by analysing accelerometer data. The Fourth International Symposium on Wearable Computers 1(1):175– 176 Salber D, Dey A, Orr R, Abowd G (1999) Designing for ubiquitous computing: A case study in context sensing. Graphics, Visualization, and Usability Center Technical Report 99 29:99–29 Sandholm T (1999) Multiagent Systems: A Modern Introduction to Distributed Artificial Intelligence, MIT Press, chap Distributed Rational Decision Making, pp 201–258 Schmidt A, Takaluoma A, Mäntyjärvi J (2000) Context-Aware telephony over WAP. Personal and Ubiquitous Computing 4(4):225–229 Sorvari A, Jalkanen J, Jokela R, Black A, Koli K, Moberg M, Keinonen T (2004) Usability issues in utilizing context metadata in content management of mobile devices. Proceedings of the third Nordic conference on Humancomputer interaction pp 357–363 Starner T, Schiele B, Pentland A (1998) Visual contextual awareness in wearable computing. In: Proceedings of the Second International Symposium on Wearable Computers (ISWC 1998), pp 50–57 Toyama K, Logan R, Roseway A (2003) Geographic location tags on digital images. Proceedings of the eleventh ACM international conference on Multimedia pp 156–166 Van Laerhoven K, Schmidt A, Gellersen H (2002) Multi-sensor context aware clothing. Wearable Computers, 2002(ISWC 2002) Proceedings Sixth International Symposium on pp 49–56 Viana W, Filho J, Gensel J, Oliver M, Martin H (2007) PhotoMap-Automatic Spatiotemporal Annotation for Mobile Photos. Lecture notes in computer science 4857:187 Zhang J (2003) Evolution of the human ASPM gene, a major determinant of brain size. Genetics 165:2063–2070
Part III
Mobile and Pervasive Computing
Security Issues in Ubiquitous Computing∗

Frank Stajano

University of Cambridge Computer Laboratory, Cambridge, United Kingdom. e-mail: [email protected]

∗ Revision 86 of 2009-01-29 18:00:05 +0000 (Thu, 29 Jan 2009).
1 Fundamental Concepts

The manifesto of ubiquitous computing is traditionally considered to be the justly famous 1991 visionary article written for Scientific American by the late Mark Weiser of Xerox PARC [64]. But the true birth date of this revolution, perhaps hard to pinpoint precisely, precedes that publication by at least a few years: Weiser himself first spoke of “ubiquitous computing” around 1988 and other researchers around the world had also been focusing their efforts in that direction during the late Eighties. Indeed, one of the images in Weiser’s article depicts an Active Badge, an infrared-emitting tag worn by research scientists to locate their colleagues when they were not in their office (in the days before mobile phones were commonplace) and to enable audio and video phone call rerouting and other follow-me applications. That device was initially developed by Roy Want and Andy Hopper [61] at Olivetti Research Limited (ORL) in 1989. Soon afterwards other research institutions, including Xerox PARC themselves who later hired Want, acquired Active Badge installations and started to explore the wide horizons opened by the novel idea of letting your spatial position be an input that implicitly told the computer system what to do. I was fortunate enough to work at ORL as a research scientist throughout the Nineties, thereby acquiring first hand experience of these and other pioneering developments in ubiquitous computing². Looking back and putting things into historical perspective, ubiquitous computing really was a revolution, one of those major events in computing that take place once every decade or so and transform society even for ordinary non-techie people, as happened with the personal computer and the World Wide Web.

² I later told that story in my book [55] and the relevant 20+ page extract can be downloaded for free from my web site at http://www.cl.cam.ac.uk/~fms27/secubicomp/secubicomp-section-2-5.pdf.

Weiser opened his landmark paper with a powerful metaphor: writing is everywhere, but it is so deeply embedded in the fabric of society that it has become invisible, in the sense that we no longer notice it. His vision, that something similar would one day happen to computers, has undoubtedly come true: most people have no idea of the number of microprocessors they own and interact with on a daily basis; if asked, they may easily forget the ones hidden in their car, in their microwave oven and in their mobile phone—even though the latter probably offers more computing power than the 486 desktop computers that were top of the range when Weiser’s article was published.

This revolution has transformed society in many ways. The one we shall explore together in this chapter is security. A modern computer is a sufficiently complex system that even an expert is incapable of entirely understanding it at all levels—even more so when we consider a system of many networked computers instead of a standalone machine, and even more so when such networking is invisible and spontaneous, as wireless technology allows it to be. Amongst all this complexity lurk bugs and unexpected interactions that cause unpredictable and probably undesirable behaviour. Worse, such bugs and interactions can be exploited maliciously by an adversary in order to cause intentional damage. It is easy to imagine that an adversary might render the system unusable (“A virus formatted my hard disk!”) or at least cause data loss (“A virus deleted all my JPGs!”). However, as networked computer systems become more and more deeply embedded in the fabric of society, the types of damage that can be inflicted by malicious attacks on them become more varied and more pervasive. Nowadays, computer security can seriously affect even people who think they don’t use computers—because they probably use them and depend on them without even realizing that they do.

This first section of the chapter will explore the scope of “ubiquitous computing”, noting how much ground is covered, implicitly or explicitly, by this phrase and its many alternatives; it will give a brief overview of the core concepts of security; and it will then look at ubiquitous computing from the viewpoint of security, pointing out what could go wrong, what is at stake and what might require protection.
1.1 Ubiquitous (Pervasive, Sentient, Ambient…) Computing

From the late Eighties, when ubiquitous computing (or “ubicomp” for short) was the blue-sky research dream of a few visionaries, throughout the Nineties and the first decade of the new millennium, more and more research institutions have joined the exploration. Thanks in part to generous sponsorship from major long term government initiatives (especially in Korea and Japan), the ubicomp research field has become mainstream, and then quite crowded. Whether in academia or in industry, many groups have invented their own new name for the topics they explored—partly to highlight that their focus was just on a specific aspect of the phenomenon; but all too often, a cynic might say, simply for the same reason that cats leave scent marks. Be that as it may, we now have a variety of names for ubiquitous computing and each tells its own instructive little story.

“Ubiquitous computing” [64], a locution starting with a sophisticated and at the time uncommon word that literally means “present everywhere”³, places the emphasis on computers that, since they are everywhere, must be (by implication) cheap commodities that are plentiful and unobtrusive. In the Eighties the Microsoft slogan of “a computer on every desk” was still to some extent a dream, though one within reach; but “ubiquitous” pointed further down the line, to computers hidden inside everyday objects, in the same way as you have electric motors inside everyday objects, from vacuum cleaners to lawnmowers and hi-fi systems, and yet you don’t say “oh, look, we have half a dozen electric motors in the living room”. “Pervasive computing” [5] has similar connotations, particularly with respect to emphasizing embedded computers as opposed to standalone ones: computing then “pervades” the environment. Names such as “Invisible computing” [49] or “Disappearing computing” [59] also underline the distinction between the old computer as a keyboard-and-monitor machine and the new ones that are just embedded, each dedicated to its own special purpose within the object that embeds it, and not requiring one to sit in front of them or think about their existence. “Sentient computing” [31] refers to computer systems that sense some aspects of their environment in order better to serve their users—for example by taking note of who is in the room and switching to their preferences in terms of lighting, temperature and user interface. “Ambient intelligence” [20] is again similar, conjuring images of a world that automatically adapts to the users and their preferences; but this locution puts the accent, rather than on the sensing, on the deductive processing that takes place in the back-end. “Calm computing” [65], which predates most of the others, describes essentially the same scenario, this time placing the emphasis on the fact that this new generation of computers is intended to work in the background, silently taking care of issues on our behalf as opposed to requiring constant attention and intervention from the user.

Perhaps the most significant thing to learn from this multitude of names, other than laughing at these manifestations of the “Not Invented Here” syndrome, is that ubiquitous computing means different things to different people, even within the research community, and that the global revolution that has changed the world even for non-techies, putting a powerful computer in the ordinary person’s pocket in the form of a mobile phone that is also a camera, a music player and a web browser, is to some extent only fully described by the union of all of these disparate viewpoints.

³ Amusingly, some authors have recently started to use the difficult word “ubiquitous” with no regard for its original meaning, using it instead as a synecdoche for “ubiquitous computing”. We therefore hear of etymologically meaningless concepts such as “the ubiquitous society”…
1.2 Security

Security is another word that is frequently abused and overloaded to mean different things depending on the speaker. Since this is the only chapter of this volume specifically dedicated to security, I’ll take you through a two-page crash course on what I consider to be the most fundamental security ideas. If you wish to read more on system security, my favourite books on the subject are Beyond Fear [53], for an easy to read yet insightful and valuable introduction; Security Engineering [4], for an authoritative and comprehensive reference suitable for both the researcher and the practicing engineer; and the older but still valuable Fundamentals of Computer Security Technology [3], for a rigorous formal description of the principles; not to mention of course my own Security for Ubiquitous Computing [55] for a coherent and self-contained look at how all this applies specifically to ubicomp.

The viewpoint we shall adopt here, which I believe is the only one leading to robust system security engineering, is that security is essentially risk management. In the context of an adversarial situation, and from the viewpoint of the defender, we identify assets (things you want to protect, e.g. the collection of magazines under your bed), threats (bad things that might happen, e.g. someone stealing those magazines), vulnerabilities (weaknesses that might facilitate the occurrence of a threat, e.g. the fact that you rarely close the bedroom window when you go out), attacks (ways a threat can be made to happen, e.g. coming in through the open window and stealing the magazines—as well as, for good measure, that nice new four-wheel suitcase of yours to carry them away with) and risks (the expected loss caused by each attack, corresponding to the value of the asset involved times the probability that the attack will occur). Then we identify suitable safeguards (a priori defences, e.g. welding steel bars across the window to prevent break-ins) and countermeasures (a posteriori defences, e.g. welding steel bars to the window after a break-in has actually occurred⁴, or calling the police). Finally, we implement the defences that are still worth implementing after evaluating their effectiveness and comparing their (certain) cost with the (uncertain) risk they mitigate⁵.

⁴ One justification for the a-posteriori approach, as clearly explained by Amoroso, is that it allows you to spend protection money only after you see it was actually necessary to do so; that way, you don’t pay to protect against threats you only imagined. On the other hand, you can only hope to prevent reoccurrences, not the original threat, because by then you have already lost the asset; and in some cases (e.g. loss of life) this may be unacceptable. Another justification, emphasized by Schneier, is that you might not be able to safeguard against every possible attacker even if you wanted to; therefore, relying only on prevention will leave you exposed. Note that some authors, including Schneier, use the term “countermeasures” for both a-priori and a-posteriori defences.

⁵ It is also worth noting in passing that, when managing risk, simply balancing the expected loss against the cost of safeguards and countermeasures is not the whole story: in many situations a potential victim prefers to pay out a known amount in advance, perhaps even every month but with no surprises, rather than not paying anything upfront but occasionally having to face a large unexpected loss—even if the second strategy means lower disbursements in the long run. This is what keeps insurers in business. Besides, insurers can spread a very costly but very unlikely-to-occur loss across their many premium-paying customers, whereas that kind of loss would probably bankrupt individual victims if they had to bear it by themselves. Then, bearing the cost of the attack would be impossible, whereas paying the insurance premium might turn out to be cheaper and more convenient than implementing a safeguard.
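To make the risk-as-expected-loss arithmetic above concrete, here is a minimal Python sketch. The asset values, probabilities and safeguard cost are invented numbers, loosely inspired by the magazines example above and the gold ingots variant discussed just below; the sketch illustrates the reasoning, not a prescribed method.

# Minimal sketch of risk-as-expected-loss reasoning (illustrative numbers only).

def expected_loss(asset_value, attack_probability):
    """Risk = value of the asset times the probability the attack occurs."""
    return asset_value * attack_probability

def worth_deploying(safeguard_cost, risk_without, risk_with):
    """A safeguard is worth its (certain) cost only if it removes more
    (uncertain) risk than it costs."""
    return (risk_without - risk_with) > safeguard_cost

# Hypothetical scenario A: magazines under the bed.
magazines = expected_loss(asset_value=50, attack_probability=0.02)
magazines_with_bars = expected_loss(50, 0.001)
print(worth_deploying(safeguard_cost=300, risk_without=magazines,
                      risk_with=magazines_with_bars))                 # False

# Hypothetical scenario B: gold ingots, same vulnerability, very different risk.
ingots = expected_loss(asset_value=100_000, attack_probability=0.02)
ingots_with_bars = expected_loss(100_000, 0.001)
print(worth_deploying(300, ingots, ingots_with_bars))                 # True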
This choice is always a trade-off, and it is important to note that the amount of protection that it is reasonable to deploy is a function not simply of the possible threats (what bad things could happen) but crucially of the corresponding risks (how likely it is that those bad things would happen, and how much we’d lose if they did) which in turn depend on the value of the assets⁶. In the example above, if you were storing gold ingots under your bed instead of magazines, the threats and vulnerabilities and attacks would be the same but the risks would change dramatically. Steel bars at the window might be a proportionate response to protect gold ingots but they are probably a poor security choice to protect magazines, because their cost (sourcing and installation cost, but also inconvenience cost for the user of the bedroom) is disproportionate to the value of the assets they protect and, equally importantly, to the likelihood that such assets will ever be stolen. You’d be better off just by getting into the habit of closing the window (a much cheaper and weaker safeguard, but still an adequate one for that risk). Moreover, considering the relative likelihoods of burglaries and house fires, and comparing the loss of life by fire (increased by the impossibility of escaping through a window now blocked by steel bars) to that of material assets through burglary, for a residential installation the steel bars might turn out to be a poor security trade-off even for protecting gold ingots. This risk management mindset is necessary if you wish to provide effective security rather than, as all too often happens, just an appearance of security in order to appease the customer or the regulator.

⁶ Note how attacker and defender may place very different values on the same assets. If a smalltime crook enters the bank vault and steals a hundred dollars, to him the value of what he gains is a hundred dollars; but, to the bank, the value of what they lose is much greater—especially if it becomes generally known that a burglary in their vault occurred. Similarly, attacker and defender may have very different ideas on the likelihood of certain events happening. It is important to take the viewpoint of the correct actor into account when performing quantitative risk evaluations.

Traditionally, the threats to information systems are classified with reference to three fundamental security properties that they might violate: confidentiality (ensuring that only authorized principals can read the information), integrity (ensuring that only authorized principals can modify the information) and availability (ensuring that authorized principals can access and use the system as intended, without undue delays). These three categories do not necessarily exhaust all the desirable security properties of an information system but they are a good starting point to understand what’s most important, at least in the context in which the system is used. You’ll find that in many cases people automatically assume that the primary security concern is the protection of confidentiality (“unauthorized principals shouldn’t be able to see the students’ exam grades”) but that, on second look, integrity is instead more significant (“unauthorized principals must not be able to edit the student’s grades”) and the primary concern by far, especially when the system integrates several different functions, is in fact availability (“the school’s computer system cannot stop working for more than a couple of days”, as it handles not only the students’ grades but also the course timetable, the library catalogue and the staff payroll).
You may have noticed the recurrence of the phrase “authorized principals” in the previous paragraph. This appropriately suggests that the foundation upon which these three security properties rest is authentication, that is to say a means of distinguishing between authorized and unauthorized users of the system. To be more precise, this foundation usually consists of a sequence of three sub-actions: identification (understanding who the user claims to be), verification (checking the validity of this claim), authorization (granting that particular user the permission to perform certain specific actions within the system). Interestingly, in some cases there is a level of indirection such that you don’t identify a specific individual but just a role with associated capabilities—e.g. the librarian checks an authorization token to ensure that this customer is entitled to borrow up to four books, but doesn’t need to know her name and surname in order to lend her the volumes.
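As a small illustration of the identification/verification/authorization split and of the role-based indirection in the librarian example, here is a hedged Python sketch. The token fields, role name and borrowing limit are invented for illustration; identification and verification of the token itself (e.g. checking a signature) are assumed to have happened already.

# Sketch of role-based authorization: the verifier checks a capability,
# not the holder's identity. Token fields and limits are invented.

from dataclasses import dataclass

@dataclass
class BorrowerToken:
    role: str          # e.g. "member" -- no name or surname needed
    borrow_limit: int  # capability associated with the role

def may_borrow(token: BorrowerToken, books_already_out: int) -> bool:
    # Authorization step only: the token is assumed already verified.
    return token.role == "member" and books_already_out < token.borrow_limit

token = BorrowerToken(role="member", borrow_limit=4)
print(may_borrow(token, books_already_out=3))  # True: a fourth book is allowed
print(may_borrow(token, books_already_out=4))  # False: limit reached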
1.3 Security Issues in Ubiquitous Computing

The information security concepts presented in the previous section were developed for networked computer systems well before the advent of ubiquitous computing. They still represent a useful framework to reason about security in ubiquitous computing too. You should strive to recognize where the above-mentioned “boldface terms” are in any specific ubicomp scenario you may come to face, and then evaluate the corresponding security trade-offs. What are the assets? What are the risks? What safeguards and countermeasures are worth implementing to mitigate the risks? A good starting point, so long as it is not mistaken for the whole security analysis, is to think about possible threats in the various scenarios of interest.

As you imagine various specific ubicomp scenarios, try to understand what the assets are in each setting and what kind of damage they might incur. Go beyond the obvious: if you are thinking about a mobile phone, it is probably clear that the handset itself is an asset (particularly if you paid for a sophisticated model rather than accepting the cheap default one offered with your subscription plan) and that so is the ability to make phone calls that will be paid for by the person owning the phone (you) rather than by the person holding it right now (the thief). But isn’t your phone directory also an asset? Your photos? Your short messages and emails? And is the damage primarily the fact that you no longer have them (availability), or the fact that someone else can now view them (confidentiality)? If you have a phone that lets you connect to the web, as will become more and more prevalent in the future, how about the credentials that your phone uses to connect to web sites that run financial transactions, such as online stores or home banking? You go from theft of the device to theft or fraud enabled by possession of the device. What about malicious software planted into your phone that silently accepts a call from a “trigger” number to turn on the microphone, acting as a stealth bug that you unknowingly but voluntarily carry into that confidential meeting? (Why are you so sure that such software isn’t already in your phone right now, by the way? How could you tell? A web search yields several companies ready to sell your enemies a copy⁷.) Is your privacy also an asset, then? If you think so, in what other ways could it be violated? Would you also consider your current location, or the timestamped history of your past locations, as a privacy-sensitive piece of information? If so, surely you also see how it could be threatened simply by the fact that you carry a mobile phone. Privacy is a complex issue and a very relevant one for ubiquitous computing. But, even for the public-spirited technologist whose aim is to design privacy-protecting ubicomp systems, the job of protecting privacy is made harder by two factors: one, that the potential victims do not appear to care much about it; two, that there are powerful economic interests pushing in the direction of privacy violations. These factors must be addressed, as they can be stronger than any simple-minded technical protection you might put in. We’ll get back to this at the end of the chapter.

⁷ If you were paranoid you would already “know” that the spooks arm-twisted the manufacturers to add such functionality in the firmware anyway.

When looking for security issues in a given ubicomp setting, don’t just think about assets and threats from the viewpoint of the potential victim. Often, some of the most interesting insights are obtained by thinking like the attacker. Which of the victim’s assets would be most valuable to you as an attacker? Maybe these are things that the victim doesn’t even consider as assets—at least until the attacker has had a go at them. And how would you go about attacking them? Adopting the mindset of the adversary is not easy and requires a great deal of lateral thinking; however, in most adversarial endeavours, from lockpicking to cryptography to tank armour development, the wisdom of experience says that the only people capable of building solid defences are the ones who are skilled at breaking them. You need the mindset of playing outside the rules, of violating conventions, of breaking out of neatly formalized abstraction levels: you will then, for example, defeat that unbreakable cipher by flying below the mathematics and attacking at the level of physics, electric currents, noisy signals and non-ideal power supplies.
2 Application Scenarios and Technical Security Contributions

The best way to get a feeling for the practical security issues relating to ubiquitous computing is to discuss a few significant scenarios. The most interesting question to ask is: “in what way does ubicomp make a difference to security in this setting”? Sometimes there is an obvious qualitative difference; other times there is “only” a quantitative difference and yet it becomes qualitative because of the scale of the change: when things become a thousand times smaller, or a thousand times faster, or a thousand times more numerous, the way we use them changes entirely; and the new usage patterns open up new vulnerabilities as well as new opportunities. In this short book chapter it will certainly not be possible to examine in detail all the scenarios of interest, which would have to include at least smart homes, traffic management systems, smart cars, wireless sensor networks, ubicomp in sports and healthcare and so on, as well as specific wireless technologies such as Bluetooth,
NFC, Wi-Fi, Zigbee and others. We shall arbitrarily pick a few significant examples and a few general security contributions that apply to several ubicomp settings.
2.1 Wearable Computing

As computers become smaller and cheaper and more powerful, they become devices we can easily carry around wherever we go. From the 10 kg “portables” of the Eighties to modern sub-kilogram netbooks and palm-sized PDAs (personal digital assistants), the truly ubiquitous instantiation of a computing device that even ordinary people carry around all the time is nowadays, as we have already remarked, the cellular phone. As more and more of these phones acquire the ability to connect to the Internet, to run external applications and to act as email clients, their architectural similarities with regular computers bring back all the traditional threats, vulnerabilities and attack vectors for which personal computers are notorious, starting with viruses and worms. It is therefore important for the ubicomp security practitioner to keep up to date about ordinary PC security. The fact that the computer changes shape to a telephone—or, perhaps tomorrow, to augmented-reality glasses that also accept voice commands—is irrelevant. Viruses, worms, spam, “Nigerian” scams and phishing will all migrate from desktop computers to mobile phones as soon as enough users start making use of phone-based applications and web services. The transition from model-specific phone firmware to generic software platforms based on just two or three major phone OSes will help reach the critical mass level at which developing malware for phones becomes seriously lucrative.

“Augmented reality”, by the way, refers to a type of display (usually implemented with special glasses) in which a computer generated image is superimposed onto the normal scene that is actually in front of the viewer—as opposed to “virtual reality” in which everything the viewer sees is computer-generated. One of the best known early adopters of augmented reality and wearable computing was Steve Mann, who for years went around with a wearable camera over his head and studied the technical and social consequences of doing so [43], from ways of stitching together composite images as the viewer’s head moves [44] to issues of ownership of video footage containing images of other people [42], and to the relationship between the personal wearcam and the now-ubiquitous CCTV surveillance cameras [41]. Although wearable cameras are far from being a mainstream or ubiquitous item at the time of writing, they are a good catalyst for a discussion about the security and privacy implications of wearable computing in general. Mann’s WearCam was at least in part meant as a provocation and therefore serves its purpose even if it does nothing more than making you think about such issues. A later project in a similar direction, the SenseCam⁸ built by Lindsay Williams at Microsoft Research [26], integrates with the MyLifeBits project [25] to provide a visual diary of one’s life and has been used as a memory prosthesis for Alzheimer patients.

What are the consequences of transferring your memories from the wetware of your brain to some external hardware? For privacy-sensitive material, the change makes a big difference, especially in the face of the threat of coercion: it is much easier to extract your secrets by brute force out of your wearable camera than out of you, and the former carries much less of a moral stigma than the latter [56]. You might not consider the disclosure of your visual diary as a major risk if you only think of wearable cameras⁹, since you probably don’t wear one; but you might reconsider your stance in light of the fact that most of the wearable (or at least “pocketable”) technological gadgets you carry around, such as music players, digital cameras, voice recorders, USB sticks and so on, not to mention once again mobile phones, can hold multiple gigabytes of information—overall, a self-writing life diary even if you don’t mean it. Even ignoring malicious intent for the moment, it is now fairly easy to lose several gigabytes in one go when that USB stick falls out of your pocket (think public transport, conference, cinema, gym…). This affects availability, unless you have up to date backups (ha ha)¹⁰, and confidentiality, unless all your mass storage is protected by strong encryption (ha ha again). A variety of embarrassing high profile incidents [14, 7, 30] have demonstrated how easy it is, even for professionals and government officers entrusted with sensitive personal information, to lose gigabytes of it, and how unlikely it is that technical countermeasures (such as mass storage encryption) be actually used in practice¹¹. There are, however, more subtle privacy issues with wearable computing than just the relatively obvious one of losing media containing confidential information. We already hinted at the fact that the timestamped history of the locations you visited, which is more akin to metadata, can be equally sensitive. Let’s open a separate subsection to examine that aspect in greater detail.

⁸ The SenseCam is a low-resolution wide-angle still camera, worn on the chest with a necklace chain, that takes a photo every minute or so. The user never presses the shutter button or aims the camera, which works completely automatically. The user does not see what the camera records except at the end of the day when the camera is docked and its photos are downloaded.

⁹ What if the camera also featured an always-on microphone?

¹⁰ Some ubicomp devices, such as certain music players, PDAs and phones, automatically synchronize their contents with a desktop or laptop computer on connection, which provides a certain amount of redundancy and backup at little effort for the user. This is a good thing, but it is still far from a robust system solution: in the ubicomp world, the syncing desktop or laptop is not managed by a professional system administrator and may itself have its disk accidentally wiped at some point for a variety of reasons (malware, hardware failure, operator error etc).

¹¹ By the way: if and when encryption of secondary storage eventually becomes mainstream, expect many serious failures at the key management level.

2.2 Location Privacy

One of the new buzzwords of ubicomp is “location-based services”. At a bus stop, your location-aware mobile phone will not just bring up a timetable—it will also open it at the page containing the schedule of the specific buses that actually stop there.
At the time of writing, location-based services are still mostly research visions rather than a commercial reality; but their chance will come once high resolution positioning is integrated into all mobile phones. This trend is already under way, led by the US requirement that phones placing emergency calls be locatable to 100 m or less. Researchers are busy imagining creative ways of putting this new capability to other uses. An aspect of the question that has not yet received sufficient attention is security; in particular, the issue of protecting the location data of the individuals making use of such services. Without such provision, having one’s movements tracked 24 hours a day will become an unacceptable invasion of one’s privacy. Think for example of the patient who visits a psychiatrist or a clinic specializing in sexually transmitted diseases: even if the test results and the discussions with the psychiatrist are never disclosed, the very fact that she attended might cause her great distress if others heard about it, and she might legitimately want to keep that to herself.

Location privacy issues arise not just with mobile phones but to a certain extent with any mobile device that communicates with fixed infrastructure while on the move: the coffee shop, the library and the airport lounge will all have a record of the MAC address of your laptop in their logs if you use their wi-fi network. Crossing abstraction levels, your bank can track you around the country if you use your cash card in their ATM machines. So can the credit card company when you use their card in various shops around the world¹². So can the subway or train company when you use their electronic season ticket to enter and exit each station.

¹² And it is in theory possible to distinguish from the logs whether the card was present during the transaction as opposed to having been read over the phone or internet.

Historically, the ORL Active Badges were among the first wearable computing devices and, because their explicit primary purpose was to help you find where your colleagues were, they were instrumental in early explorations of location privacy. In that context, Ian Jackson [32] was one of the first to investigate location privacy in ubiquitous computing: he designed a system that, through mix-based anonymizers, would allow users to control the amount of information disclosed to applications and other users by the location tracking system. Alastair Beresford and I [8] explored fundamental location privacy principles within the experimental playground of the Active Badge and its successor the Active Bat; we introduced the mix zone, a spatial equivalent of David Chaum’s mix [13], and offered both a way to protect location privacy without entirely disallowing location-based applications, and a framework for measuring the amount of privacy (or, more precisely, anonymity) offered by a given system. Jackson’s primary way to protect location privacy is through access control, i.e. by giving you the power to grant or deny access to your location information based on the identity of the requester, as well as other features of the specific query being made. The strategy we adopt is instead to anonymize the information supplied to applications. So an application will know that someone is in that place, and will be able to provide service to that person, but won’t know who it is.

To assess whether this protection is any good, we pretend we are a malicious application and attempt to find out the identity of the person whose location is being
supplied anonymously. This is usually not hard: for each “home location” (i.e. each office, in our setting) we find out which pseudonym occupies that location more than any other; that’s usually the “owner” of that location—meaning we have already de-anonymized the pseudonym. Other simple heuristics can also be used if any ambiguity remains. We then introduce a countermeasure, the “mix zone”, an unobservable spatial region in which we prevent the hostile application from tracking us. We hope that, on exiting the mix zone, the application will have mixed us up (i.e. confused us) with someone else. Once again we then take on the role of the hostile application and try to break the anonymity, attempting to correlate those who went into the mix zone with those who came out of it. In the course of this process we develop quantitative measures of location privacy. These allow us to assess in an objective manner the effect of any privacy-protecting countermeasures we may adopt, such as degrading the spatial or temporal resolution of the data supplied to applications.

Marco Gruteser and Dirk Grunwald [27] propose an alternative safeguard, namely to degrade the spatial and temporal resolution of the answers from the location server to the applications in order to ensure a certain level of anonymity for the entities being located.

The most effective way of protecting location privacy, however, is to reverse the architecture and, instead of having mobile nodes (e.g. tags or cellphones) that transmit their location to the infrastructure, have the infrastructure tell the mobile nodes where they are (as happens in GPS) and, if appropriate, what location-based services are available nearby. For example, instead of the user asking the server which restaurants are nearby, the restaurants would broadcast messages with their location and the type of food they offer, which mobile nodes would pick up or discard based on a local filter set by the user’s query. Scalability can be achieved by adjusting the size of the broadcast cells (smaller, more local cells to accommodate more broadcasters and/or more frequent broadcasts). Note that in this reversed architecture the restaurants simply cannot know who receives their ads (hence absolute privacy protection), whereas in the former case the server knows and could tell them. This in itself may be an economic incentive against the deployment of such an architecture.

George Danezis and others [18, 17] have studied the problem from an interestingly different perspective, with the aim of highlighting how much users actually value their privacy. Their studies use techniques from experimental psychology and economics in order to extract from each user a quantitative monetary value for one month’s worth of their location data (median answer: under 50 EUR), which they then compare across national groups, gender and technical awareness.
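The “home location” heuristic described at the start of this de-anonymization exercise is simple enough to sketch in a few lines. The Python fragment below runs it over an invented list of (pseudonym, location) sightings; it is only meant to show why unlinkable pseudonyms, on their own, give rather weak anonymity.

# Sketch of the "home location" de-anonymization heuristic: for each office,
# the pseudonym seen there most often is probably its owner.
# The sightings below are invented sample data.

from collections import Counter, defaultdict

sightings = [            # (pseudonym, location) pairs from an anonymized feed
    ("p17", "office-a"), ("p17", "office-a"), ("p42", "office-a"),
    ("p42", "office-b"), ("p42", "office-b"), ("p17", "corridor"),
]

counts = defaultdict(Counter)
for pseudonym, location in sightings:
    counts[location][pseudonym] += 1

offices = {"office-a", "office-b"}           # the "home locations"
owners = {loc: counts[loc].most_common(1)[0][0] for loc in sorted(offices)}
print(owners)   # {'office-a': 'p17', 'office-b': 'p42'}: pseudonyms unmasked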
2.3 RFID

RFID (radio frequency identification) tags are a crucial development in ubicomp. For our purposes we consider them as low-cost, resource-constrained devices that communicate over a short-range radio interface and that are mainly used as machine-readable labels—to let a back-end system know that the real-world entity logically associated with the tag, be it a box of razor blades at a supermarket checkout or a passenger at a subway gate, is now in the vicinity of the reader. RFID chips can be made so small that they can be embedded in paper: possible applications include tickets, passports and even banknotes. Although tagging may look like an extremely basic application, RFID may well turn out to be the first truly ubiquitous computing technology. You might even call it “disposable computing”…
2.3.1 The Next Barcode

To a first approximation, once the price is right, there will be an RFID tag anywhere you now see a barcode. There are indeed many similarities between barcodes and the RFID tags poised to replace them—so many that it is easier just to focus on the differences. Conceptually, only two differences are significant: the code space cardinality and the transmission mechanism.

As for code space cardinality, the international EAN barcode standard only features 13 decimal digits and is therefore limited to a million manufacturers, each allowed to define up to 100,000 products. The Auto-ID Labs consortium, in contrast, defines a 96 bit code and a partitioning scheme that allows for 256 million manufacturers and 16 million products per manufacturer. More importantly, there are still enough bits left to provide 64 billion serial numbers for each individual product. While all the beer cans in the six-pack look the same to the barcode scanner, with RFID each razor blade cartridge in the whole warehouse will respond with its own distinct serial number.

As for transmission mechanism, optical acquisition means barcode and reader must be manually aligned, as anyone having ever queued at a supermarket checkout knows all too well. The RFID tag, instead, as its name indicates, uses radio frequency: it can be read without requiring line of sight.

From these two differences stem new opportunities. Because the code embeds a unique serial number, you can prove with your till receipt that this defective item was sold to you in this store; the manufacturer, in turn, when receiving the defective return from the retailer, knows exactly on what day and on which assembly line of which plant the item was made and can check for similar defects in any other items of the same batch. Because reading the code does not require a manual alignment operation, it may happen on a continuous basis rather than just at the checkout: smart shelves, both in the supermarket and in your own kitchen, can tally the products they contain and go online to reorder when they are running low.
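To make the code-space arithmetic tangible, the sketch below packs a manufacturer, product and serial number into one 96-bit integer. The field split used here (an 8-bit header, 28 bits for manufacturers, 24 for products and 36 bits for serial numbers, which yields the tens of billions of serials quoted above) is an assumed layout chosen only for illustration and should not be read as the exact Auto-ID specification.

# Illustrative packing of an EPC-style 96-bit code. Field widths are an
# assumed layout for illustration, not the authoritative specification.

FIELDS = [("header", 8), ("manufacturer", 28), ("product", 24), ("serial", 36)]
assert sum(width for _, width in FIELDS) == 96

def pack(values):
    code = 0
    for name, width in FIELDS:
        value = values[name]
        assert value < (1 << width), f"{name} does not fit in {width} bits"
        code = (code << width) | value
    return code

def unpack(code):
    values = {}
    for name, width in reversed(FIELDS):   # lowest bits hold the last field
        values[name] = code & ((1 << width) - 1)
        code >>= width
    return values

tag = pack({"header": 0x30, "manufacturer": 123456, "product": 789, "serial": 1})
print(hex(tag))
print(unpack(tag))   # recovers header, manufacturer, product and serial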
2.3.2 Machine Vision—Without the Hard Bits

Andy Hopper’s vision of Sentient Computing [31] is one of systems with the ability to sense the world that surrounds them and to respond to stimuli in useful ways without the need for explicit prompting by their users. Despite phenomenal advances in machine vision in recent years, today’s computer systems are not yet quite capable of “seeing” what surrounds them; but RFID, especially if universally deployed, makes a big difference here. Instead of fixing poor eyesight with expensive and technologically advanced solutions (contact lenses or laser surgery), we go for the cheap and cheerful alternative of wrapping a fluorescent jacket around each object to make it stand out. The myopic computer doesn’t actually see the object: it just notices the fluorescent coat. Always remember this.
2.3.3 Technology

There are many varieties of tags using different working frequencies (implying different trade-offs between penetration, range and read rate), different power options (from passive tags powered by the radio wave of the reader to semi-passive and active tags that include batteries), different amounts of memory in the tag and different packaging options. They currently spread the cost spectrum from approximately 0.10 to 20 USD when bought in bulk. Proponents and manufacturers hope to bring down the cost of the cheapest tags by another order of magnitude; this won’t be easy to accomplish but there is strong economic pressure towards that goal.

Much of the security discussion on RFID can be conducted at a relatively high level but there are a couple of technological points worth noting. The process by which a reader device finds out which tags are within range must take into account the possibility of finding several such tags and must return the individual codes of all the tags in a reasonable time (imagine an automated supermarket checkout scenario). The process is known as singulation and can be implemented in several ways. An efficient one, in common use, is binary tree walking: for each bit of the code, the reader broadcasts a request, to which tags must respond with their corresponding bit. If there is no collision, the reader moves to the next bit; otherwise it visits the two subtrees separately, first asking the tags with a 1 in that position to be quiet and then vice versa.

Concerning transmission range: passive tags are energized by the radio wave carrying the reader’s query and respond with a radio transmission of their own. This structure introduces a clear asymmetry: since the power of the transmission from the reader will necessarily be much higher than that from the tag, there will be a small spatial region near the tag in which both tag and reader can be overheard, but a much wider concentric region beyond that in which only the reader can be heard. This is significant for passive eavesdropping attacks.

There are significant technological constraints to the manufacture of tags meant to be batteryless and so cheap that they can be affixed to individual retail items without significantly increasing their cost. In practice the limits on gate count and power consumption mean that the tags’ ability to compute cryptographic primitives is severely limited when not totally absent.
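The binary tree walking singulation procedure described above can be simulated in a few lines. In the Python sketch below, tags are modelled as 4-bit ID strings and a collision is simply the event that both a 0 and a 1 are received for the queried bit position; the radio channel, timing and real tag behaviour are of course abstracted away.

# Simulation sketch of binary tree walking singulation (invented 4-bit IDs).

TAG_IDS = {"0101", "0110", "1100"}   # tags currently in range of the reader

def reply(prefix, bit_index):
    """Bits sent back by all tags whose ID starts with `prefix`."""
    return {tag[bit_index] for tag in TAG_IDS if tag.startswith(prefix)}

def singulate(prefix=""):
    """Return the set of tag IDs discovered under `prefix`."""
    if len(prefix) == len(next(iter(TAG_IDS))):
        return {prefix}
    bits = reply(prefix, len(prefix))
    if len(bits) == 1:                  # no collision: follow the single branch
        return singulate(prefix + bits.pop())
    found = set()
    for bit in "01":                    # collision: silence one half, then the other
        found |= singulate(prefix + bit)
    return found

print(sorted(singulate()))   # ['0101', '0110', '1100']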
2.3.4 Security and Privacy

What risks are brought about by RFID systems? The first book to address this issue was probably mine [55]: as an attention-grabbing joke I suggested that sexual predators would take pleasure in reading the cup size, model and colour of the bra of their victims. Can you imagine any other risks? Katherine Albrecht has since become the most outspoken critic of the privacy threats introduced by RFID [2]. Apart from the numerous privacy risks for end-users, there are other kinds of concerns. What security problems can you anticipate for a retailer who tags all his products with RFID? Can you think of any frauds that the introduction of RFID tags makes easier?

RFID tags have been proposed as electronic price tags allowing automated checkout. It is not hard to imagine ways in which one might check out without paying, or paying for a cheaper product, by tampering with the price tags. Think in systems terms: even the glue [34] that sticks the tag to the object is part of the security envelope! RFID tags might also become a vector for attacking the back-end systems: Melanie Rieback et al. [52] showed how a maliciously reprogrammed tag could attack the back-end server through SQL injection or buffer overflow and even, in theory, propagate to infect further tags.

RFID tags have been proposed as anti-counterfeiting devices. How hard is it to clone them? (It depends to some extent on whether they have to look the same as the original to a visual inspection, or only at the radio interface.) See the later subsection on PUFs.

What about RFID tags used in access control situations, such as door entry cards and subway access cards? Do you expect the key card to be a passive or an active tag? What do you imagine the authentication protocol to look like? And how could a crook attack it to gain entry illegitimately? A replay attack (record a known-good response from a genuine user, then play it back later) is easy to mount if the key card always gives the same response. But, even if it doesn’t, a relay attack (a specialized form of man-in-the-middle attack where the man-in-the-middle only forwards messages rather than attempting to alter or decrypt them) can open the door with the unsuspecting cooperation of a legitimate keyholder [28]. And the fact that regular users operate the turnstiles by dropping their handbag on the reader (rather than taking their wallet out of the bag and their card out of the wallet) makes it easier to conceal the attacking equipment—there is no need to make it resemble a genuine card.
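To see why a card that always gives the same response is trivially replayable while a fresh-challenge design is not, here is a toy Python sketch; the protocol, key handling and message formats are invented for illustration and do not correspond to any real card system. Note that, as explained above, even the challenge-response variant remains vulnerable to relay attacks, which is what motivates the distance-bounding protocols discussed later.

# Sketch contrasting a replayable static response with a nonce-based
# challenge-response. Invented toy protocol, purely for illustration.

import hmac, hashlib, os

SECRET = os.urandom(16)          # shared between door and genuine card

def static_card():
    return b"open-sesame"        # same answer every time: trivially replayable

def challenge_response_card(challenge: bytes) -> bytes:
    return hmac.new(SECRET, challenge, hashlib.sha256).digest()

# The attacker records one exchange involving the genuine user...
recorded_static = static_card()
old_challenge = os.urandom(8)
recorded_mac = challenge_response_card(old_challenge)

# ...and replays it later. The door computes the expected answer itself.
door_accepts_static = recorded_static == static_card()          # True: replay works
new_challenge = os.urandom(8)                                    # fresh nonce
door_accepts_replay = recorded_mac == challenge_response_card(new_challenge)  # False

print(door_accepts_static, door_accepts_replay)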
2.3.5 Technical Safeguards

The first significant privacy-protecting solutions (hash-based access control, randomized access control and silent tree walking) were invented by Steve Weis et al. [63] and will be examined next. Weis’s master’s thesis [62] is a great source of good ideas. Ari Juels’s extensive and authoritative survey [35] is an excellent
reference, as is the book edited by Simson Garfinkel and Beth Rosenberg that collects interesting contributions from a wide spectrum of sources [23]. Gildas Avoine currently maintains what is perhaps the web’s most extensive bibliography on RFID security and privacy at http://www.avoine.net/rfid/index.html.

Killing the tag. A simple solution to many RFID privacy problems, and the most frequently cited by proponents of the tags, is to disable the tags permanently at checkout¹³. This was the standard mode of operation originally proposed by the Auto-ID Center, with tags responding to a “kill” command protected by an 8-bit “password”. Discussion of denial of service attacks is left as an exercise for the reader. This strategy prevents subsequent tracking but also disables any potential user benefits of the technology (e.g. fridge automatically reordering regular items about to run out, sorting refuse by type at the recycling centre…).

¹³ An equivalent solution would be to affix the tag to the packaging rather than to the object.

Hash-based access control. This scheme and the two that follow are due to Weis et al. [63]. To prevent unauthorized readers from accessing tag contents, each tag T is assigned a key k_T, which is hashed to yield a meta-ID m_T = h(k_T). The tag just stores m_T, not k_T. Authorized readers are given the keys of the tags they can read. When a reader queries a tag, the tag responds with the non-invertible meta-ID m_T. The reader responds to that with the key k_T, found by reverse lookup in the reader’s table of known tags. At that point the tag unlocks and sends its true ID.

Randomized access control. One problem with the previous scheme is that the meta-ID itself is a unique identifier, even though the real ID is unreadable. So the eavesdropper can’t tell it’s a black 16 GB iPod but it can tell it’s the same item it saw yesterday and the day before. People can then be tracked by “constellations” of objects that move around with them, including their glasses, their watch, their wallet and their shoes. In this alternative scheme, therefore, the tag responds to a reader query by generating a random number r and transmitting the pair (r, h(ID||r)). The reader performs a brute-force forward search on all the IDs it knows and eventually finds the one that matches. This solution ensures that the responses from the same tag are all different. Note that an attacker who does not know the tag’s ID would have to brute-force an infeasibly large code space; however a legitimate user owning many tags will pay a high performance cost at every query, so this method is suitable for individuals but not for, say, supermarkets.
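The message flows of the hash-based and randomized access control schemes just described are easy to prototype. The Python sketch below uses SHA-256 as the hash h and invented keys and IDs; it ignores the gate-count and power constraints that, as noted earlier, make even such a hash problematic on a real low-cost tag.

# Sketch of the two hash-lock schemes described above (invented keys and IDs).

import hashlib, os

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# --- Hash-based access control ---------------------------------------------
tag_id, tag_key = b"black-16GB-player", os.urandom(16)
meta_id = h(tag_key)                          # the tag stores only meta_id
reader_table = {meta_id: (tag_key, tag_id)}   # authorized readers know the keys

def query_tag(key_from_reader):
    # The tag unlocks only if the presented key hashes to its stored meta-ID.
    return tag_id if h(key_from_reader) == meta_id else None

key, _ = reader_table[meta_id]                # reverse lookup on the meta-ID
assert query_tag(key) == tag_id               # authorized reader learns the true ID

# --- Randomized access control ----------------------------------------------
known_ids = [b"black-16GB-player", b"razor-blades", b"designer-shoes"]

def tag_response():
    r = os.urandom(8)
    return r, h(tag_id + r)                   # (r, h(ID || r)): fresh every query

def reader_identify(r, digest):
    # Brute-force forward search over all IDs the reader knows.
    for candidate in known_ids:
        if h(candidate + r) == digest:
            return candidate
    return None

assert reader_identify(*tag_response()) == tag_id
print("both schemes identified the tag")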
An equivalent solution would be to affix the tag to the packaging rather than to the object.
296
Frank Stajano
used as a one-time pad to mask off the next bit to be transmitted. Therefore the reader splits the two sides of the tree not by asking for “who’s got a 1 in the next bit” but rather “who’s got a 1 when they xor the next bit with the secret prefix?” which leaves the remote eavesdropper in the dark14. Similarly, if the reader needs to transmit a value v confidentially to a tag, the tag can generate a one-time pad value p and send it to the reader, unheard by the remote eavesdropper. The reader will then transmit v ⊕ p to the tag. The blocker tag. The blocker tag, proposed by Juels et al. [36], performs selective jamming of the singulation protocol. A subtree of the RFID code space (could just be, say, the half with the topmost bit set) is designated as a privacy zone and tags are moved into that subtree at checkout. Subsequently, any time a reader attempts to navigate the privacy protected subtree, a special RFID blocker tag, meant to be carried by the user in her shopping bag, responds to the singulation protocol pretending to be all the tags in that subtree at once. The reader is therefore stalled. In the user’s home, though, the blocker tag no longer operates and the smart home can still recognize the tags as appropriate. In a more polite variant, the blocker tag makes its presence known so that readers don’t waste time attempting to navigate the privacy-protected subtree. Anti-counterfeiting using PUFs. It has been suggested that RFID tags could be embedded in luxury goods (e.g. designer shoes) to prevent forgeries. The underlying assumption is that the technological capacity for manufacturing RFID chips is not widely available. This attitude is akin to a physical version of “security by obscurity”. Can we do better? A much stronger proposition is that of embedding a challenge-response black box inside an RFID chip [60] in such a way that the secret cannot be easily duplicated (because it is not merely a string of bits but a physical structure) and any attempt to open the chip to extract the secret by physical means will destroy the secret itself: basically, tamper-resistance at the RFID chip level. This is achieved by implementing the challenge-response box as a function keyed by some physical imperfections of the chip, such as parasitic electrical features of its coating or slight differences in the propagation delay of adjacent and nominally equal combinatory circuits. Opening up the chip to insert microprobes would alter those quantities and therefore change the “key”. This kind of construction, indicated as as “Physical One-Way Function” by Ravikanth Pappu et al. [51] who first proposed it with an implementation based on lasers and holograms, has also been called “Physical Random Function” and “Physical Unclonable Function” by Blaise Gassend et al. [24] who first described an integrated circuit implementation for it.
14 There is a bootstrapping problem for this scheme: where does the secret come from in the first round? The authors handwave it away by stating that the population of tags being examined shares a common prefix, such as a manufacturer ID, which can be transmitted from tags to reader without being overheard by the distant eavesdropper. In practice, however, one should question the plausibility of both the common prefix assumption and the implicit one that the eavesdropper wouldn't be able to guess that common prefix, if indeed it is something like a manufacturer's code.
Suitable protocols have to be designed to cope with the fact that the device will only probabilistically respond in the same way to the same input (to some extent this can be remedied with error correction circuitry, but in the RFID environment gates are expensive) and with the fact that even the manufacturer doesn't actually know "the key"—he can only sample the device on a set of inputs before releasing it, and must remember the values somewhere.
Distance bounding protocols. Challenge-response protocols were designed to defeat replay attacks. However they are still subject to more subtle relay attacks. A relay attack, sometimes called a wormhole attack, is a specialized form of man-in-the-middle attack where the attacker does not alter the messages it overhears but just transports them to another location. A door might issue a challenge to verify the presence of an authorized entry card; but a relay attack would involve relaying that challenge to a legitimate entry card that is not currently near the door, and relaying the card's response to the door, which would find it genuine and grant access. Adding strong cryptography to the exchange cannot, by itself, fix the problem. Distance-bounding protocols were introduced to defeat such relay attacks. The idea is to measure the response time and, from that and the speed of light (0.3 m/ns), infer that the responder cannot be any further than a certain distance. Then the door would ask the entry card to respond within a specified time, to be sure that the card is sufficiently close to the door. The challenge-response protocol has to be specially designed in order to be used for distance bounding: just measuring a regular challenge-response exchange would provide timing information of insufficient precision for positioning purposes once we take into account the delays introduced by buffering, error correction, retransmission, cryptographic computations and so on. The fundamental technique introduced by Stefan Brands and David Chaum [11] is to issue a single bit of challenge and expect an immediate single-bit response. This allows the verifier to measure just the round-trip time of one bit, eliminating most of the overheads. To prevent random guessing of the response by an adversary with 50% probability of success, this fast bit exchange is repeated for n rounds, reducing the probability of a correct guess to 2^-n. Gerhard Hancke and Markus Kuhn [29] developed a distance-bounding protocol optimized for the RFID environment: the tag, acting as prover, does not have to compute any expensive operations and the protocol, unlike others, works even on a channel subject to bit errors. The protocol also addresses the issue that passive RFID devices do not have an internal clock (they derive it from the radio wave of the reader) and therefore an attacker might attempt to supply the RFID token with a faster clock in order to make it respond more quickly, which would in turn allow the attacker to pretend to be closer to the verifier than he is.
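The toy sketch below shows the verifier's side of a Brands–Chaum-style fast phase, omitting the commitment and final signature stages of the real protocol; the parameter values are assumptions of ours. The distance bound it derives also illustrates why real distance bounding needs dedicated hardware: a Python function call alone already "measures" hundreds of metres.

import secrets
import time

C_M_PER_NS = 0.3   # signal propagation speed: roughly 0.3 metres per nanosecond

def rapid_bit_exchange(prover, shared_bits):
    # Verifier side of the n-round fast phase. Returns (responses_correct, distance_bound_m).
    worst_rtt_ns = 0
    for i, secret_bit in enumerate(shared_bits):
        challenge = secrets.randbits(1)
        t0 = time.perf_counter_ns()
        response = prover(i, challenge)                  # must be an immediate single-bit reply
        worst_rtt_ns = max(worst_rtt_ns, time.perf_counter_ns() - t0)
        if response != secret_bit ^ challenge:
            return False, None                           # a guessing attacker survives with probability 2^-n
    return True, worst_rtt_ns * C_M_PER_NS / 2           # upper bound on the prover's distance

# Both sides are assumed to share n random bits agreed (and committed to) in a prior setup phase.
n = 32
shared = [secrets.randbits(1) for _ in range(n)]
ok, bound_m = rapid_bit_exchange(lambda i, c: shared[i] ^ c, shared)
print(ok, round(bound_m, 1), "m")   # the bound is dominated by interpreter overhead, not by distance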
Multi-factor access control in e-passports. Modern "e-passports" incorporate an RFID chip containing a digitized photograph of the bearer and optionally other biometrics such as fingerprints and iris scans. To prevent unauthorized access to such data while the passport is in the owner's pocket or purse, the RFID chip will only release that information after a successful challenge-response in which the reader demonstrates knowledge of the data printed on the OCR strip of the main page of the passport. The intention is to ensure that a reader can only acquire the RFID chip data when the passport is open and has been optically scanned. Ari Juels, David Molnar and David Wagner [37] offer a detailed description of this and related mechanisms. Markus Kuhn [40] suggested a simpler way to achieve the same effect: incorporate an RF shield in the front cover of the passport.
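The sketch below conveys the shape of that interaction; it is a schematic of ours, not the actual ICAO Basic Access Control key derivation or cipher suite, and the MRZ line, key sizes and HMAC are placeholders.

import hashlib
import hmac
import secrets

def derive_access_key(mrz_line: str) -> bytes:
    # The access key depends only on data printed inside the passport, so a reader
    # must have optically scanned the open document in order to know it.
    return hashlib.sha256(mrz_line.encode()).digest()

class PassportChip:
    def __init__(self, mrz_line: str, photo: bytes):
        self._key = derive_access_key(mrz_line)
        self._photo = photo

    def challenge(self) -> bytes:
        self._nonce = secrets.token_bytes(8)
        return self._nonce

    def release_data(self, proof: bytes) -> bytes:
        expected = hmac.new(self._key, self._nonce, hashlib.sha256).digest()
        if not hmac.compare_digest(proof, expected):
            raise PermissionError("reader has not demonstrated knowledge of the printed data")
        return self._photo

mrz = "P<UTOERIKSSON<<ANNA<MARIA<<<<<<<<<<<"          # hypothetical machine-readable zone
chip = PassportChip(mrz, photo=b"<jpeg bytes>")
reader_key = derive_access_key(mrz)                   # obtained by scanning the open passport
nonce = chip.challenge()
photo = chip.release_data(hmac.new(reader_key, nonce, hashlib.sha256).digest())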
2.4 Authentication and Device Pairing
Most authentication methods ultimately rely on one of "what you know, what you have or what you are" (e.g. a password, a lock key, a fingerprint). But across a network, all you get are bits, and the bits of the crook smell the same as the bits of the honest user. A variety of protocols have been devised for authentication in distributed systems, the most influential being probably Needham-Schroeder [48], which later gave rise to Kerberos [38] and ultimately to the authentication system of Windows 2000 and its successors. That family of protocols relies on a central authority that knows all the principals in its security domain and can act as introducer to any two entities wishing to communicate with each other. To use such a protocol, it is necessary for the central authority to be reachable. Authentication with public key certificates may appear not to require connectivity, but it does if we take into account the need to check for revoked certificates. In the ad-hoc wireless networks that are commonplace in ubiquitous computing, where online connectivity to a global authority cannot be guaranteed, we may therefore need a more local authentication solution.
In the ubicomp context, authentication is often really just an interaction between two devices: one which offers services and one which requests them. The latter device (client) is in some way a controller for the former (server), in that it causes the server to perform certain actions: one entity wants the other to "do something". Under this alternative viewpoint, authentication morphs into a kind of master-slave pairing. The server (or "slave" or "verifier") authenticates the client (or "master" or "prover") and, if the authentication succeeds, the server accepts the client as its master for the duration of that interaction session.
2.4.1 Big Stick
In more cases than you might at first think, the Big Stick principle [55] is an appropriate access control policy: Whoever has physical control of the device is allowed to take it over.
Your fridge usually works that way. So does your DVD player. This is a strong, sound security policy because it closely matches the constraints already imposed
by reality (when your made-up rules conflict with those of reality, the latter usually win). However it may not be appropriate for devices you must leave unattended outside your control, and it does nothing to deter theft.
2.4.2 Resurrecting Duckling
The Resurrecting Duckling security policy model [57] was created to solve the above problem and in particular to implement secure transient association: you want to bind a slave device (your new flat screen TV) to a master device (your cellphone used as a universal remote controller) in a secure way, so that your neighbour can't turn your TV on and off by mistake (or to annoy you) and so that the stolen TV is useless because it doesn't respond to any other remote controller; but also in a transient way, so that you can resell it without also having to give the buyer your cellphone. It is based on four principles [55]:
1 - Two State principle. The entity that the policy protects, called the duckling, can be in one of two states: imprintable or imprinted. In the imprintable state, anyone can take it over. In the imprinted state, it only obeys its mother duck (q.v.).
2 - Imprinting principle. The transition from imprintable to imprinted, known as imprinting, happens when a principal, from then on known as the mother duck, sends an imprinting key to the duckling. This must be done using a channel whose confidentiality and integrity are adequately protected (physical contact is recommended). As part of the transaction, the mother duck must also create an appropriate backup of the imprinting key.
3 - Death principle. The transition from imprinted to imprintable is known as death. It may only occur under a very specific circumstance, defined by the particular variant of the resurrecting duckling policy model that one has chosen. Allowable circumstances, each corresponding to a different variant of the policy, include the following.
• Death by order of the mother duck (default).
• Death by old age after a predefined time interval.
• Death on completion of a specific transaction.
4 - Assassination principle. The duckling must be built in such a way that it will be uneconomical for an attacker to assassinate it, i.e. to cause the duckling’s death artificially in circumstances other than the one prescribed by the Death principle of the policy. Note that the Assassination principle implies that a duckling-compliant device must be endowed with some appropriate amount of tamper resistance. The Resurrecting Duckling policy has very general applicability. It is not patented and therefore it has been applied in a variety of commercial situations.
It has been extended [54] to allow the mother duck to delegate some or all of its powers to another designated master.
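As a toy illustration of the policy (our own sketch, using plain in-memory state: a real duckling-compliant device would also need the tamper resistance and the protected imprinting channel discussed above), the default "death by order of the mother duck" variant can be modelled as a two-state object:

class Duckling:
    """Toy model of the two-state Resurrecting Duckling life cycle."""

    def __init__(self):
        self.imprinting_key = None            # None means the imprintable state

    @property
    def imprinted(self) -> bool:
        return self.imprinting_key is not None

    def imprint(self, key: bytes) -> None:
        if self.imprinted:
            raise PermissionError("already imprinted: obeys only its mother duck")
        self.imprinting_key = key             # Imprinting principle

    def obey(self, command: str, key: bytes) -> str:
        if not self.imprinted or key != self.imprinting_key:
            raise PermissionError("command refused: not from the mother duck")
        return f"executing {command!r}"

    def die(self, key: bytes) -> None:
        # Death principle, default variant: death by order of the mother duck.
        if key != self.imprinting_key:
            raise PermissionError("assassination attempt rejected")
        self.imprinting_key = None            # back to imprintable, ready for resale

tv = Duckling()
tv.imprint(b"phone-key")                      # pairing the TV with the owner's phone
print(tv.obey("power_on", b"phone-key"))
tv.die(b"phone-key")                          # owner resets the TV before reselling it
tv.imprint(b"new-owner-key")                  # the buyer imprints it with their own controller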
2.4.3 Multi-Channel Protocols
Security protocol descriptions usually consist of sequences of messages exchanged between two or more parties. It may however be significant to explore the properties inherently provided by different channels. When you receive a radio message, for example, it is hard to figure out for sure who sent it to you: there may be a name in the header, but it might be a fake. But if you receive a message over a visual channel (e.g. you scan a barcode) then the source is obvious, and is quite difficult for a middleperson to forge; so the visual channel implicitly provides data origin authenticity. Consider other channels commonly used in ubiquitous computing and examine their useful properties: fingers pressing buttons on a keyboard; voice; infrared; direct electrical contact; optical fibres; quantum communication; and so on. Look at them in terms of confidentiality, integrity, source or destination authenticity, capacity, cost per bit, usability, whether they are point-to-point or broadcast and so on. Sometimes, the best approach is to use different channels at different stages in the protocol, sending (say) message 3 over radio but message 4 over a visual channel. This applies particularly to the ad-hoc wireless networks that are commonly found in ubiquitous computing environments.
When two principals Alice and Bob wish to establish a shared secret although they have never met before, if the channel over which they communicate does not offer confidentiality, then an eavesdropper who overhears their exchange will also learn the secret they establish15. To overcome this problem, they may use a Diffie-Hellman key exchange16: Alice thinks of a secret a and sends g^a, Bob thinks of b and sends g^b, and both of them can compute g^ab (respectively from a and g^b or from b and g^a) while Eve the eavesdropper can't (because she only has g^a and g^b and the discrete log problem is computationally hard). This defeats the eavesdropper, but the next problem is the man-in-the-middle: an active attacker sits between the two parties and pretends to be Bob to Alice and to be Alice to Bob. The problem now is that Alice, when she establishes a shared secret using Diffie-Hellman over radio, can't say for sure whether she is establishing it with Bob or with middleperson Mallory. It would be secure, although very laborious, for them to exchange their g^a and g^b on pieces of paper, because then Alice would know for sure that g^b comes from the intended Bob, and vice versa; but radio waves, instead, might have come from anywhere. Unfortunately the data-origin-authentic channels (writing a note on a piece of paper, copying a string of hex digits from a display into a keypad, acquiring a visual code with a camera...) have usability constraints that limit their capacity to a small number of bits per
15 Bluetooth pairing used to be affected by this problem [33, 67].
16 We omit all the "mod n" for brevity.
message, insufficient to carry a full g^a. The idea of multi-channel protocols [6, 46, 66] is to use a convenient, user-friendly and high capacity channel such as radio for the "long" messages, and a different, perhaps less convenient but data-origin-authentic, channel for short messages that somehow validate the ones exchanged over the other channel.
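A toy rendering of the exchange may help. The Mersenne prime, the generator and the truncated hash at the end are illustrative assumptions of ours: real systems use standardised groups or elliptic curves, and the published multi-channel protocols [6, 46, 66] validate the exchange with commitment schemes rather than with a bare truncated hash.

import hashlib
import secrets

# Toy parameters, for illustration only: a 127-bit Mersenne prime and g = 5.
p = 2**127 - 1
g = 5

a = secrets.randbelow(p - 2) + 1      # Alice's secret exponent
b = secrets.randbelow(p - 2) + 1      # Bob's secret exponent
A = pow(g, a, p)                      # g^a, sent over the high-capacity radio channel
B = pow(g, b, p)                      # g^b, likewise

shared_alice = pow(B, a, p)           # (g^b)^a
shared_bob = pow(A, b, p)             # (g^a)^b
assert shared_alice == shared_bob     # both ends hold g^(ab) mod p; Eve holds only g^a and g^b

# Multi-channel idea: a short check string, compared over a low-capacity but
# data-origin-authentic channel (display and keypad, barcode and camera...),
# to detect a man-in-the-middle who substituted his own g^a or g^b.
transcript = A.to_bytes(16, "big") + B.to_bytes(16, "big")
short_auth_string = hashlib.sha256(transcript).hexdigest()[:6]
print("compare over the authentic channel:", short_auth_string)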
2.5 Beyond Passwords
Fifteen years ago, many computer systems did not even allow you to set a password longer than 8 characters. Nowadays, 8-character passwords can be trivially cracked by brute force, because computers are now fast enough to try all possible combinations. Many longer passwords can also be cracked by trying dictionary words, song lyrics, girl names, swear words, movie lines and a number of "r&latively 0bv1ous Var1a$h0ns". We have reached the stage where many of the passwords that a non-geek can plausibly remember are within reach of a brute-force search.
The problem is compounded by the proliferation of systems that require a password: nowadays most of us own and use several computers (an uncommon circumstance fifteen years ago) but, besides that, we are requested to use login passwords in many other computer-mediated interactions. Each online shopping web site, each blog, each email server, each ISP account, each online banking account requires a password, adding to the cognitive load. And, of course, reusing the same password for all your accounts makes you globally vulnerable to a malicious insider at any of the corresponding sites—or, more generally, reduces the security of all your accounts to that of the most insecure of them all.
In the context of ubiquitous computing, where each person is expected to interact daily with tens or hundreds of devices, it should be clear that depending on passwords for authentication is not going to be a long term solution. What are the alternatives? We already mentioned the well-known taxonomy of authenticators whose categories are "something you know, something you have or something you are"17. If "something you know" is no longer suitable, let's then explore the other two options in greater detail.
2.5.1 Tokens: Something You Have
In token-based authentication, you are authenticated based on the fact that you own a physical artefact. Outside computers, this has been a common practice for centuries: rings, seals, medallions, passports and indeed (mechanical) keys are examples. An obvious problem we mentioned is that the procedure verifies the presence of the
17 Sometimes, for added security at the cost of a little extra hassle for the user, the verifier might check more than one such item, preferably from different categories (for example ownership of a card and knowledge of its PIN): we then speak of "multi-factor authentication".
token, not of the person carrying it, and as such it fails if the token is forgotten, lost or stolen and presented by someone else. Methods for binding the token to the intended person have been devised, such as the wrist or ankle straps used for attaching a radio tag to prisoners released on probation (if the strap is opened or cut in order to remove the tag, the radio signal stops, triggering an alarm at the back-end monitoring system). Their use on non-prisoners carries unpleasant overtones. Moreover, if the token gives access to anything valuable (e.g. a bank vault), the fact that it stops working when the strap is opened may be a perverse incentive to kidnap the wearer or chop off the limb18 . Physical tokens like door keys can be cloned by someone with physical access to them. More worryingly, electronic tokens that communicate via a wireless signal are subject to replay attacks, where the signal is recorded and later replayed. For that reason, all but the very worst wireless tokens use a challenge-response mechanism, where each challenge is different and an attacker is not supposed to be able to predict the response given the challenge. An important advantage of the wireless token-checking arrangement is that the verifier can interrogate the token many times, for example once a minute, to check whether the authenticated party is still there19 . With passwords, instead, if the session lasts for three hours, the verifier only knows that the authenticated principal was there at the start of the three hours but has no real guarantees that she continued to be there throughout. A more subtle problem is that of relay (not replay) attacks, already mentioned in the section on RFID: if the token is, for instance, a wireless tag that opens a door when placed on the door’s pad, an attacker will place a fake tag near the door pad and a fake door pad (actually just its electronics) near the tag and create a wormhole between them.
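As a minimal sketch of that challenge-response idea (our own illustration, not any particular token's protocol): the verifier issues a fresh random challenge and the token answers with a keyed MAC over it, so that recording yesterday's exchange gains the attacker nothing. Note that this does nothing against the relay attack described above.

import hashlib
import hmac
import secrets

# Key assumed to have been provisioned into both the token and the verifier beforehand.
shared_key = secrets.token_bytes(16)

def token_respond(challenge: bytes) -> bytes:
    return hmac.new(shared_key, challenge, hashlib.sha256).digest()

def verifier_accepts(challenge: bytes, response: bytes) -> bool:
    expected = hmac.new(shared_key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(response, expected)

challenge = secrets.token_bytes(16)        # fresh every time, so replays are useless
assert verifier_accepts(challenge, token_respond(challenge))
assert not verifier_accepts(secrets.token_bytes(16), token_respond(challenge))  # stale response rejected

Because challenges are cheap, the verifier can repeat this every minute, which is what gives the continuous-presence property mentioned above.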
2.5.2 Biometrics: Something You Are
A biometric feature which has been studied and used long before computers is the fingerprint. Some laptops and cellular phones now include basic fingerprint readers. Other biometrics used for recognition of individuals include hand geometry, voice, face (geometry of facial features), retina and iris. An unavoidable problem is that the response from the recognition system cannot be perfect: there is always a certain amount of misclassification giving rise to false positives (verifier recognizes user when it shouldn't; this allows fraud) and false negatives (verifier doesn't recognize user when it should; this insults the legitimate user). It is easy to reduce one at the expense of the other by moving the recognition threshold but it is difficult to reduce both. Technically, the system offering the best performance nowadays is iris recognition, used in airport security in several countries around the world [19].
18 The same comment applies to biometrics, examined next.
19 Imagine the case of wanting to keep a laptop locked, with all files encrypted, unless in presence of its owner [15].
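The threshold trade-off is easy to see with entirely synthetic scores (the distributions below are made up for illustration): moving the threshold exchanges false accepts for false rejects, but cannot reduce both at once.

import random

random.seed(0)
# Synthetic similarity scores (higher = more similar); the numbers are invented.
genuine = [random.gauss(0.75, 0.10) for _ in range(1000)]     # same person as enrolled
impostors = [random.gauss(0.45, 0.10) for _ in range(1000)]   # different person

def rates(threshold):
    far = sum(s >= threshold for s in impostors) / len(impostors)  # false accepts: allows fraud
    frr = sum(s < threshold for s in genuine) / len(genuine)       # false rejects: insults the legitimate user
    return far, frr

for t in (0.50, 0.60, 0.70):
    far, frr = rates(t)
    print(f"threshold={t:.2f}  FAR={far:.3f}  FRR={frr:.3f}")
# Raising the threshold lowers FAR but raises FRR; only a better sensor or
# matching algorithm lowers both at the same time.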
A problem of biometrics as opposed to other authentication methods is that your biometric features are yours for life and therefore, barring surgery, they can't be revoked. If your thumbprint opens your front door and someone clones it and raids your apartment, when you install a replacement door lock you can't go to the locksmith and get a new thumbprint. Another problem, this time from the privacy perspective, is that using the same biometric feature for authenticating to a multitude of different services makes you traceable across these various services should they share and compare such authentication data. And, on the subject of reuse of authentication credentials, if a thumbprint reader is installed not only at your front door but also at your company and on your safety deposit box at your bank, does this mean that a malicious insider at your company now has what it takes to make a gummy finger [45] and raid your house and safety deposit box? For that matter, is it rational to use fingerprints and iris codes as keys that protect anything valuable now that, under US-VISIT, you must give them to border security officers every time you travel to the United States?
A common fallacy is to assume that recognizing a valid biometric feature (e.g. recognizing that what's being presented at the reader is Joe's fingerprint or Joe's iris scan) is authentication, since at the same time you say who you are and you prove it, under the assumption that no one else has that biometric. But it's not quite authentication: it's actually closer to just identification. Although there is admittedly an element of verification in the process of checking whether what's being presented at the reader matches with Joe's fingerprint or iris as previously recorded in the verifier's database, to complete the verification the verifier would also need to check that the feature is in fact genuinely "attached" to the person who was born with it, and is not the result of applying a gummy finger or a printed contact lens. This is hard to ensure and almost impossible in unattended settings where the scanning is not supervised by a human officer. For this reason, locking a laptop or a PDA with a fingerprint does not offer strong protection—particularly when you consider that the owner may have even left fingerprints on the object itself, from which a cloned finger could be produced. It's relatively OK to assume that the biometric is unique to an individual, but it is wrong to assume that it is a secret.
One may argue that something similar happens with a token. Indeed so, and it is important to remember that what is being authenticated is the presence20 of the token, from which the presence of its intended owner does not necessarily follow. The difference, perhaps more psychological than theoretical, is that it is normal to think of the biometric as being closely associated with its owner ("something you are"), as opposed to the token ("something you have") where it is more natural to assume that the two may at some point become separated.
20 But recall the earlier discussion about relay attacks: unless a distance bounding protocol is employed, not even the presence of the token may be guaranteed.
3 Going Up a Level: the Rest of the Story
We could go on for much longer looking for threats and discussing technical security solutions for many other plausible ubicomp scenarios: how about, say, hordes of Internet-enabled home appliances being remotely "0wned" by organized crime and recruited into evil botnets in order to send spam or perform DDoS attacks? However, to provide effective security, once we have a general feeling for the issues that challenge us and the technical solutions we may use to address them, it is also important to step back and look at the field from a higher viewpoint. Often the true root of the problem is not strictly technical and therefore a technical solution will only act as a temporary and ineffective fix.
3.1 Security and Usability
Usability is probably one of the most difficult challenges for security today [16]. Even though we have a pretty good understanding of low level security mechanisms and protocols (for instance, we are able to build efficient and robust ciphers, such as AES, that even major spy agencies cannot crack by brute force), we are not very good yet at building security systems that don't get in the way of users. The problem is that, if a security mechanism is too difficult or too inconvenient to use, it will just not be used. The user will take whatever steps are necessary to bypass the security in order to get her job done. Introducing strong security mechanisms without concern for their repercussions in terms of human factors is not a good recipe for building a secure system.
As we made clear in one of the initial sections, security is not a feature or a component that can be added later: it is instead a global property of an entire system, just like sound quality in a hi-fi system. Engineering a system for security is similar to engineering it for robustness; but, while robustness is about protecting the system from accidental damage, security is about protecting it from damage that is purposefully caused by malicious attackers. Usability, just like security, is not a feature or a component but a system property. It is not about "building a user interface" but about understanding how the user interacts with the system and understanding where the system fails because it is counterintuitive, confusing or frustrating. It is about designing and refining the system so that the user's intention can easily be translated into action without ambiguity, difficulty or stress. It is about viewing the system with someone else's eyes and realising that what is obvious, intuitive and easy for the designer may not be so for the user.
In the interaction between security and usability there is often a tension: essentially, security is about limiting the user's actions whereas usability is about facilitating them. Fortunately, the situation is not as contradictory as it may sound from the above comment, because security is about limiting bad actions while usability is about facilitating good actions. However it is undeniable that there is a tension
and that security mechanisms, in their attempts to impede the bad actions of bad users, often get in the way of the honest regular users who just want to pursue their honest activities as simply as possible: witness for example wireless LANs left at their default insecure settings21 , Bluetooth pairing PINs (and cellphone PINs and any other kind of factory-supplied PINs or passwords) left at their easily-guessable default value, security-related dialog boxes that ask for a confirmation from the user but are only perceived as “Here’s a hard question you won’t understand anyway, but press OK if you want to keep working otherwise the computer will stop”; and so on. (Security has been described as a “tax on the honest” caused by the existence of the dishonest.) As we said, when security and usability fight, it’s usability that wins, in the sense that the user will probably prefer to switch off the security entirely. In a constrained situation—such as, for example, that of a military environment—it may be possible for the security designer to impose a security mechanism on the user regardless of how inconvenient it is to operate; but, in the civilian context of, say, the consumer electronics market, building a system that is too inconvenient to operate because of its poorly designed security features will only lead customers to vote with their wallets and choose a competing product. Especially since very few customers choose a product based on its security rather than its features or performance.
3.2 Understanding People
It is important to be able to view one's design through someone else's eyes. For security, we must view our systems through the eyes of an attacker.
• Where are the weaknesses?
• Where are the valuable assets?
• How can I misuse the system in order to break it?
• What implicit assumptions of the designer can I violate in order to make the system behave in an unexpected way?
• Where can I cause maximum damage?
• What are my motivations in attacking the system?
• What do I stand to gain from it?
• What attacks can I deploy on an industrial scale, replicating them with little effort on all the devices of this model?
It is not easy to acquire this mindset; but, if we cannot ask these questions ourselves, we'll leave it to the real attackers to do it for us once the product is fielded.
For usability, too, we need to adopt someone else's viewpoint; but this time we must view the system through the eyes of a user.
• How can I make the product do what I want? (I know what I want, but not how to cause this machine to do it.)
21 And sold in that configuration because "insecure by default, but just works" is much better for the manufacturer than having the customer return their wireless access point and buy a competitor's because they were incapable of setting up WPA.
• Why doesn't what I do (which seems natural to me) work?
• I find this activity too hard.
• I find this other activity not difficult, but annoying and frustrating—it insults my intelligence.
• I am not enjoying this.
• I forget how to do this.
• It's not obvious.
• The display is hard to see.
• It's too small to read at my age.
• I can't figure out what the icon represents: is it a cloud, a splatted tomato or a devil?
• This other icon I can see, but I can't understand the metaphor—what does a picture of a toilet roll mean in this context?
• I can't understand the response of the system.
• It didn't do what I thought it would.
And so forth. It takes much humility for a designer to listen to and accept all the criticism implicit in such comments. But they are like gold dust. Learning to listen to the user, to understand the user's viewpoint and to think like the user is an essential step for progress towards an intuitive and pleasant to use system that reduces the possibility of accidental mistakes. In that spirit, the well-thought-out security specification of the One Laptop Per Child project [39] makes for instructive and fascinating reading. Of particular interest are the "principles and goals" which demonstrate a laudable explicit concern towards making security usable even for very young children: for example,
"Security cannot depend upon the user's ability to read a message from the computer and act in an informed and sensible manner. [...] a machine must be secure out of the factory if given to a user who cannot yet read."
3.3 Motivations and Incentives
Most systems are actually marketed and sold on the basis of their features rather than of their security. This provides little incentive for vendors to compete on providing the best security. Often, security-related components are totally absent, or are added as an afterthought to count as extra "features" or to satisfy a regulatory requirement. Needless to say, this ad hoc approach cannot deliver a great deal of proper security at the system level. To make matters worse, customers do not usually understand security at the system level—they only worry about specific incidents—and are not usually in a position to evaluate whether the system on offer would or would not be secure for the intended purpose, meaning that there is no feedback loop to close in order to improve security at the next iteration, except if independent evaluation is provided by competent experts.
This disillusioned and somewhat cynical description applies across the board but it is certainly no less true for ubiquitous computing than for the general case. For example, when we worked with civil engineers to monitor bridges and tunnels for signs of deterioration using wireless sensor networks, we performed penetration testing on the WSN hardware that our team had chosen and we found a variety of security holes, from subtle ones to others that were quite blatant [58]. We duly reported our findings to the manufacturer, months ahead of publication, but it became clear from their responses that, at least for the time being, security was not a commercial priority. Similar examples come from the worlds of Wi-Fi and Bluetooth where products shipped with weak and easy to break authentication procedures until researchers exposed the vulnerabilities [10, 33], forcing new standards to be promulgated and adopted. The same also happened with RFID tags and cards [9, 22] and in the latter case NXP, maker of the MIFARE Classic card, sued Radboud University in an attempt to block the above publication describing the break [47]. The blame doesn’t rest exclusively with vendors, though: customers vote with their wallet, and largely get what they ask for. The generic lack of security is therefore also due to the fact that customers don’t appear to place a high value on it, as shown by their unwillingness to pay22 any substantial extra premium to obtain it [1]. You may design strong security and privacy protection features only to see that your users don’t actually care and just leave them all disabled for simplicity. As far as privacy goes, there are in fact several powerful entities with strong incentives to intrude on the privacy of ordinary citizens. A brilliant paper by Andrew Odlyzko [50] provides the counterintuitive but fundamental insight that knowing private information about you (and in particular about your ability and willingness to pay for specific goods and services) is economically valuable because it allows vendors to engage in price discrimination, selling the same goods at a different price to each buyer in order to extract the maximum price that each buyer is willing to pay—in an extreme version of what airlines have already been doing for years. Similarly, insurance companies have an obvious incentive to find out as much as possible about their prospective customers in order to steer away, or charge more, the prospects that are more likely to come up with expensive claims. Governments cite national security as their main reason for invading the privacy of citizens and travellers, though one should seriously question whether the response is proportionate to the actual risk [53]. A report by civil liberties watchdog Statewatch [12] quoted a paper from the EU’s “Future Group” [21] as saying: Every object the individual uses, every transaction they make and almost everywhere they go will create a detailed digital record. This will generate a wealth of information for public security organisations, and create huge opportunities for more effective and productive public security efforts.
The report commented:
22 Though, some argue, there will also be cynical users who are unwilling to pay not because they don't value privacy but because they don't believe that what's on offer at a premium will provide it.
When traffic data including internet usage is combined with other data held by the state or gathered from non-state sources (tax, employment, bank details, credit card usage, biometrics, criminal record, health record, use of e-government services, travel history etc) a frightening detailed picture of each individual’s everyday life and habits can be accessed at the click of a button.
None of this would have been technically possible before the age of ubiquitous computing and the associated ubiquitous back-end database servers. Ubiquitous computing is unquestionably changing society and affecting the lives of all of us. The pervasiveness of invisible computer systems in modern society has the potential for making many hard things simpler and for making many interesting things possible; but at the same time it has the potential for exposing us to many new risks to our safety and privacy that did not previously exist and which we, as individuals, are not well equipped to address. As responsible architects of the new world of ubicomp we have a duty to protect the unwary from threats they cannot face on their own, particularly as they originate in innovations that they did not request or ever fully understand. But this is a battle that public-spirited system designers cannot fight alone: systems will never be secure unless users value and demand that they be. The first step is therefore to educate the public about security and risks, in order that they can use security as one of the criteria upon which to base their choice of which systems to support or buy. The second step is to understand, by carefully and humbly listening to the users' own viewpoint, what are the assets to be protected, and the extent to which it is reasonable to go through inconvenience and expense in order to protect them. This trade-off is at the core of security and, in the world of ubiquitous computing, it is essential that everyone learn to find the appropriate balance.
Acknowledgements This chapter is based on copyrighted material from my book Security for Ubiquitous Computing (both from the first edition published by Wiley in 2002 and from the draft of a forthcoming second edition). Some of this material started as studies and reports I wrote for my consultancy clients, to whom I am grateful for the opportunity of rooting my research in real world problems. I also thank Stuart Wray, Bruce Christianson and Bernd Eckenfels for detailed and insightful feedback on an earlier draft, as well as Nick Towner, Pete Austin and a blog reader signing off as "Lil-Wayne Quotes" for shorter but equally valuable comments.
About the Author Frank Stajano, PhD, is a University Senior Lecturer (≈ associate professor) at the Computer Laboratory of the University of Cambridge, UK. His research interests cover security, ubiquitous computing and human factors, all linked by the common thread of protecting security and privacy in the modern electronic society. Before his academic appointment he worked as a research scientist in the corporate research and development laboratories of Toshiba, AT&T, Oracle and Olivetti. He was elected a Toshiba Fellow in 2000. He continues to consult for industry on a variety of topics from security to strategic research planning. Since writing Security for Ubiquitous Computing he has given about 30 keynote and invited talks in three continents. He is also a licensed martial arts instructor and runs the kendo dojo of the University of Cambridge. Visit his web site by typing his name into your favourite search engine.
References [1] Acquisti A (2004) Privacy in electronic commerce and the economics of immediate gratification. In: Proceedings of the ACM Electronic Commerce Conference (ACM EC), ACM, URL http://www.heinz.cmu.edu/ ~acquisti/papers/privacy-gratification.pdf [2] Albrecht K, McIntyre L (2005) Spychips: How Major Corporations and Government Plan to Track Your Every Move with RFID. Thomas Nelson [3] Amoroso E (1994) Fundamentals of Computer Security Technology. Prentice Hall [4] Anderson R (2008) Security Engineering (Second Edition). Wiley [5] Ark WS, Selker T (1999) A look at human interaction with pervasive computers. IBM Syst J 38(4):504–507 [6] Balfanz D, Smetters DK, Stewart P, Wong HC (2002) Talking to strangers: Authentication in ad-hoc wireless networks. In: Proceedings of Network and Distributed System Security Symposium (NDSS 2002), The Internet Society, URL http://www.isoc.org/isoc/conferences/ndss/02/ papers/balfan.pdf [7] BBC (2007) Thousands of driver details lost. URL http://news.bbc. co.uk/1/hi/northern_ireland/7138408.stm [8] Beresford A, Stajano F (2003) Location privacy in pervasive computing. IEEE Pervasive Computing 2(1):46–55, URL http://www.cl.cam.ac.uk/ ~fms27/papers/2003-BeresfordSta-location.pdf [9] Bono SC, Green M, Stubblefield A, Juels A, Rubin AD, Szydlo M (2005) Security analysis of a cryptographically-enabled rfid device. In: McDaniel P (ed) Proceedings of the 14th USENIX Security Symposium, USENIX Association, pp 1–16, URL http://www.usenix.org/events/sec05/ tech/bono/bono.pdf [10] Borisov N, Goldberg I, Wagner D (2001) Intercepting mobile communications: the insecurity of 802.11. In: MobiCom ’01: Proceedings of the 7th annual international conference on Mobile computing and networking, ACM, New York, NY, USA, pp 180–189, DOI http://doi.acm.org/10.1145/381677. 381695 [11] Brands S, Chaum D (1993) Distance-bounding protocols (extended abstract). In: Helleseth T (ed) Advances in Cryptology—EUROCRYPT 93, SpringerVerlag, 1994, Lecture Notes in Computer Science, vol 765, pp 344–359 [12] Bunyan T (2008) The shape of things to come. Tech. rep., Statewatch, URL http://www.statewatch.org/analyses/ the-shape-of-things-to-come.pdf [13] Chaum DL (1981) Untraceable electronic mail, return addresses, and digital pseudonyms. Commun ACM 24(2):84–88, DOI http://doi.acm.org/10.1145/ 358549.358563 [14] Collins T (2007) Loss of 1.3 million sensitive medical files in the us. URL http://www.computerweekly.com/blogs/tony_collins/ 2007/07/loss-of-13-million-sensitive-m.html
[15] Corner MD, Noble BD (2002) Zero-interaction authentication. In: MobiCom ’02: Proceedings of the 8th annual international conference on Mobile computing and networking, ACM, New York, NY, USA, pp 1–11, DOI http: //doi.acm.org/10.1145/570645.570647, URL http://www.sigmobile. org/awards/mobicom2002-student.pdf [16] Cranor LF, Garfinkel S (2005) Security and Usability: Designing Secure Systems that People Can Use. O’Reilly [17] Cvrcek D, Matyas V, Kumpost M, Danezis G (2006) A study on the value of location privacy. In: Proc. Fifth ACM Workshop on Privacy in the Electronic Society (WPES), ACM, pp 109–118, URL http://www.fi.muni.cz/ usr/matyas/PriceOfLocationPrivacy_proceedings.pdf [18] Danezis G, Lewis S, Anderson R (2005) How much is location privacy worth? In: Proceedings of Workshop on Economics of Information Security (WEIS), URL http://infosecon.net/workshop/pdf/ location-privacy.pdf [19] Daugman J (2004) How iris recognition works. IEEE Transactions on Circuits and Systems for Video Technology 14(1), URL http://www.cl.cam. ac.uk/users/jgd1000/csvt.pdf [20] Ducatel K, Bogdanowicz M, Scapolo F, Leijten J, Burgelman JC (2001) Scenarios for ambient intelligence in 2010. Tech. rep., European Commission, Information Society Directorate-General, URL ftp://ftp.cordis. europa.eu/pub/ist/docs/istagscenarios2010.pdf [21] Future Group (2007) Public security, privacy and technology in europe: Moving forward. Tech. rep., European Commission Informal High Level Advisory Group on the Future of European Home Affairs Policy, URL http://www.statewatch.org/news/2008/jul/ eu-futures-dec-sec-privacy-2007.pdf [22] Garcia FD, de Koning Gans G, Ruben Muijrers PvR, Verdult R, Schreur RW, Jacobs B (2008) Dismantling mifare classic. In: Jajodia S, Lopez J (eds) 13th European Symposium on Research in Computer Security (ESORICS 2008), Springer, Lecture Notes in Computer Science, vol 5283, pp 97–114 [23] Garfinkel S, Rosenberg B (eds) (2005) RFID : Applications, Security, and Privacy. Addison-Wesley [24] Gassend B, Clarke D, van Dijk M, Devadas S (2002) Controlled physical random functions. In: ACSAC ’02: Proceedings of the 18th Annual Computer Security Applications Conference, IEEE Computer Society, Washington, DC, USA, p 149 [25] Gemmell J, Bell G, Lueder R, Drucker S, Wong C (2002) Mylifebits: fulfilling the memex vision. In: MULTIMEDIA ’02: Proceedings of the tenth ACM international conference on Multimedia, ACM, New York, NY, USA, pp 235– 238, DOI http://doi.acm.org/10.1145/641007.641053 [26] Gemmell J, Williams L, Wood K, Lueder R, Bell G (2004) Passive capture and ensuing issues for a personal lifetime store. In: CARPE’04: Proceedings of the the 1st ACM workshop on Continuous archival and retrieval of personal
experiences, ACM, New York, NY, USA, pp 48–55, DOI http://doi.acm.org/10.1145/1026653.1026660
[27] Gruteser M, Grunwald D (2003) Anonymous usage of location-based services through spatial and temporal cloaking. In: Proceedings of MobiSys 2003, The Usenix Association, San Francisco, CA, USA, pp 31–42, URL http://www.winlab.rutgers.edu/~gruteser/papers/gruteser_anonymous_lbs.pdf
[28] Hancke G (2005) A practical relay attack on iso 14443 proximity cards. URL http://www.cl.cam.ac.uk/~gh275/relay.pdf
[29] Hancke GP, Kuhn MG (2005) An rfid distance bounding protocol. In: SECURECOMM '05: Proceedings of the First International Conference on Security and Privacy for Emerging Areas in Communications Networks, IEEE Computer Society, Washington, DC, USA, pp 67–73, DOI http://dx.doi.org/10.1109/SECURECOMM.2005.56, URL http://www.cl.cam.ac.uk/~mgk25/sc2005-distance.pdf
[30] Heffernan B, Kennedy E (2008) Alert as 170,000 blood donor files are stolen. URL http://www.independent.ie/national-news/alert-as-170000-blood-donor-files-are-stolen-1294079.html
[31] Hopper A (2000) The clifford paterson lecture, 1999: Sentient computing. Phil Trans R Soc Lond A 358(1773):2349–2358, URL http://www.cl.cam.ac.uk/Research/DTG/lce-pub/public/files/tr.1999.12.pdf
[32] Jackson IW (1998) Who goes here? confidentiality of location through anonymity. PhD thesis, University of Cambridge, URL http://www.chiark.greenend.org.uk/~ijackson/thesis/
[33] Jakobsson M, Wetzel S (2001) Security weaknesses in bluetooth. In: Naccache D (ed) CT-RSA, Springer, Lecture Notes in Computer Science, vol 2020, pp 176–191, URL http://link.springer.de/link/service/series/0558/bibs/2020/20200176.htm
[34] Johnston RG, Garcia AR (1997) Vulnerability assessment of security seals. Journal of Security Administration 20(1):15–27, URL http://lib-www.lanl.gov/la-pubs/00418796.pdf
[35] Juels A (2006) Rfid security and privacy: A research survey. IEEE Journal on Selected Areas in Communication 24(2), URL http://www.rsasecurity.com/rsalabs/staff/bios/ajuels/publications/pdfs/rfid_survey_28_09_05.pdf, the draft version on the author's web site is more detailed than the one eventually published in J-SAC.
[36] Juels A, Rivest RL, Szydlo M (2003) The blocker tag: Selective blocking of rfid tags for consumer privacy. In: Atluri V (ed) Proc. 8th ACM Conference on Computer and Communications Security, ACM Press, pp 103–111, URL http://www.rsasecurity.com/rsalabs/staff/bios/ajuels/publications/blocker/blocker.pdf
[37] Juels A, Molnar D, Wagner D (2005) Security and privacy issues in epassports. In: Proceedings of International Conference on Security and Privacy for Emerging Areas in Communications Networks (SecureComm 2005), IEEE Computer Society, Los Alamitos, CA, USA, pp 74–88, DOI http://doi. ieeecomputersociety.org/10.1109/SECURECOMM.2005.59 [38] Kohl J, Neuman C (1993) The kerberos network authentication service (v5). RFC 1510, IETF, URL http://www.ietf.org/rfc/rfc1510.txt [39] Krsti´c I (2007) Bitfrost. URL http://wiki.laptop.org/go/ Bitfrost, draft 19 release 1, timestamped 2007-02-07, but last edited on 2009-01-04. [40] Kuhn M (2003) Rfid friend and foe, with a note on biometric passports. The RISKS Digest 22, URL http://catless.ncl.ac.uk/Risks/ 22.98.html@subj7#subj7 [41] Mann S (1995) Privacy issues of wearable cameras versus surveillance cameras. URL http://wearcam.org/netcam_privacy_issues. html [42] Mann S (1996) ‘smart clothing’: Wearable multimedia and ‘personal imaging’ to restore the balance between people and their intelligent environments. In: Proceedings, ACM Multimedia 96, Boston, MA, pp 163–174, URL http: //wearcam.org/acm-mm96.htm [43] Mann S (1997) Wearable computing: A first step toward personal imaging. Computer 30(2):25–32, URL http://www.wearcam.org/ ieeecomputer/r2025.htm [44] Mann S, Picard RW (1997) Video orbits of the projective group: A simple approach to featureless estimation of parameters. IEEE Transactions on Image Processing 6:1281–1295 [45] Matsumoto T, Matsumoto H, Yamada K, Hoshino S (2002) Impact of artificial gummy fingers on fingerprint systems. In: Proceedings of SPIE, vol 4677, Optical Security and Counterfeit Deterrence Techniques IV, URL http: //cryptome.org/gummy.htm [46] McCune JM, Perrig A, Reiter MK (2005) Seeing-is-believing: Using camera phones for human-verifiable authentication. In: SP ’05: Proceedings of the 2005 IEEE Symposium on Security and Privacy, IEEE Computer Society, Washington, DC, USA, pp 110–124, DOI http://dx.doi.org/10.1109/SP.2005. 19 [47] Mills E (2008) Dutch chipmaker sues to silence security researchers. URL http://news.cnet.com/8301-10784_3-9985886-7.html [48] Needham RM, Schroeder MD (1978) Using encryption for authentication in large networks of computers. Communications of the ACM 21(12):993–999 [49] Norman DA (1998) The Invisible Computer: Why Good Products Can Fail, the Personal Computer Is So Complex, and Information Appliances Are the Solution. MIT Press [50] Odlyzko AM (2003) Privacy, economics, and price discrimination on the internet. In: Sadeh N (ed) ICEC2003: Fifth International Conference on
Electronic Commerce, ACM, pp 355–366, URL http://www.dtc.umn.edu/~odlyzko/doc/privacy.economics.pdf
[51] Pappu R, Recht B, Taylor J, Gershenfeld N (2002) Physical one-way functions. Science 297(5589):2026–2030, URL http://www.sciencemag.org/cgi/reprint/297/5589/2026.pdf
[52] Rieback MR, Crispo B, Tanenbaum AS (2006) Is your cat infected with a computer virus? In: PERCOM '06: Proceedings of the Fourth Annual IEEE International Conference on Pervasive Computing and Communications, IEEE Computer Society, Washington, DC, USA, pp 169–179, DOI http://dx.doi.org/10.1109/PERCOM.2006.32
[53] Schneier B (2003) Beyond Fear. Copernicus (Springer)
[54] Stajano F (2001) The resurrecting duckling—what next? In: Christianson B, Crispo B, Malcolm JA, Roe M (eds) Security Protocols, 8th International Workshop, Revised Papers, Springer-Verlag, Lecture Notes in Computer Science, vol 2133, pp 204–214, URL http://www.cl.cam.ac.uk/~fms27/papers/duckling-what-next.pdf
[55] Stajano F (2002) Security for Ubiquitous Computing. Wiley, URL http://www.cl.cam.ac.uk/~fms27/secubicomp/
[56] Stajano F (2004) Will your digital butlers betray you? In: Syverson P, di Vimercati SDC (eds) Proceedings of the 2004 Workshop on Privacy in the Electronic Society, ACM, Washington, DC, USA, pp 37–38, URL http://www.cl.cam.ac.uk/~fms27/papers/2004-Stajano-butlers.pdf
[57] Stajano F, Anderson R (1999) The resurrecting duckling: Security issues in ad-hoc wireless networks. In: Christianson B, Crispo B, Malcolm JA, Roe M (eds) Security Protocols, 7th International Workshop, Proceedings, Springer, Lecture Notes in Computer Science, vol 1796, pp 172–182, URL http://www.cl.cam.ac.uk/~fms27/papers/1999-StajanoAnd-duckling.pdf
[58] Stajano F, Cvrcek D, Lewis M (2008) Steel, cast iron and concrete: Security engineering for real world wireless sensor networks. In: Bellovin SM, Gennaro R, Keromytis AD, Yung M (eds) Proceedings of 6th Applied Cryptography and Network Security Conference (ACNS 2008), Lecture Notes in Computer Science, vol 5037, pp 460–478, DOI 10.1007/978-3-540-68914-0_28, URL http://www.cl.cam.ac.uk/~fms27/papers/2008-StajanoCvrLew-steel.pdf
[59] Streitz NA, Kameas A, Mavrommati I (eds) (2007) The Disappearing Computer, Interaction Design, System Infrastructures and Applications for Smart Environments, Lecture Notes in Computer Science, vol 4500, Springer
[60] Tuyls P, Batina L (2006) Rfid-tags for anti-counterfeiting. In: Pointcheval D (ed) Topics in Cryptology - CT-RSA 2006, The Cryptographers' Track at the RSA Conference 2006, San Jose, CA, USA, February 13-17, 2006, Proceedings, Springer, Lecture Notes in Computer Science, vol 3860, pp 115–131
[61] Want R, Hopper A, Falcao V, Gibbons J (1992) The active badge location system. ACM Transactions on Information Systems 10(1):91–102, DOI http://doi.acm.org/10.1145/128756.128759, URL http://www.cl.cam.ac.uk/Research/DTG/publications/public/files/tr.92.1.pdf
[62] Weis SA (2003) Security and Privacy in Radio-Frequency Identification Devices. Master's thesis, Massachusetts Institute of Technology, Cambridge, MA 02139
[63] Weis SA, Sarma SE, Rivest RL, Engels DW (2004) Security and Privacy Aspects of Low-Cost Radio Frequency Identification Systems. In: Security in Pervasive Computing, Lecture Notes in Computer Science, vol 2802, pp 201–212
[64] Weiser M (1991) The computer for the twenty-first century. Scientific American 265(3):94–104, URL http://www.ubiq.com/hypertext/weiser/SciAmDraft3.html
[65] Weiser M, Seely Brown J (1997) The coming age of calm technology. In: Denning PJ, Metcalfe RM (eds) Beyond Calculation: The Next Fifty Years of Computing, Springer-Verlag, pp 75–85, URL http://www.ubiq.com/hypertext/weiser/acmfuture2endnote.htm, previously appeared as "Designing Calm Technology" in PowerGrid Journal, v 1.01, July 1996, http://powergrid.electriciti.com/1.01.
[66] Wong FL, Stajano F (2005) Multi-channel protocols. In: Christianson B, Crispo B, Malcolm JA (eds) Security Protocols, 13th International Workshop Proceedings, Springer-Verlag, Lecture Notes in Computer Science, vol 4631, pp 112–127, URL http://www.cl.cam.ac.uk/~fms27/papers/2005-WongSta-multichannel.pdf, see also updated version in IEEE Pervasive Computing 6(4):31–39 (2007) at http://www.cl.cam.ac.uk/~fms27/papers/2007-WongSta-multichannel.pdf
[67] Wong FL, Stajano F, Clulow J (2005) Repairing the bluetooth pairing protocol. In: Christianson B, Crispo B, Malcolm JA (eds) Security Protocols, 13th International Workshop Proceedings, Springer-Verlag, Lecture Notes in Computer Science, vol 4631, pp 31–45, URL http://www.cl.cam.ac.uk/~fms27/papers/2005-WongStaClu-bluetooth.pdf
Pervasive Systems in Health Care
Achilles Kameas, Ioannis Calemis
Achilles Kameas, CTI, N. Kazantzaki Str., Rio, Greece, e-mail: [email protected]
Ioannis Calemis, CTI, N. Kazantzaki Str., Rio, Greece, e-mail: [email protected]
1 Introduction
An important characteristic of "Ambient Intelligence" (AmI) environments is the merging of physical and digital space (i.e. tangible objects and physical environments are acquiring a digital representation). As the computer disappears in the environments surrounding our activities, the objects therein are transformed to artifacts, that is, they are augmented with Information and Communication Technology (ICT) components (i.e. sensors, actuators, processor, memory, wireless communication modules) and can receive, store, process and transmit information. Artifacts may be totally new objects or improved versions of existing objects, which by using the ambient technology, allow people to carry out novel or traditional tasks in unobtrusive and effective ways. In addition to objects, spaces also undergo a change, towards becoming augmented AmI spaces. An AmI space embeds sensing, actuating, processing and networking infrastructure in a physical (usually closed) space and offers a set of services in the digital AmI space. Applications that run on an AmI space use the services offered by the space and the artifacts therein in order to support user tasks.
Everyday habits and activities present an important source of information and may often provide information of medical significance. For example, a person's movement patterns convey important information about the person's status; thus, screening the functional status of patients through an appropriately designed set of motion capturing devices is of great importance. Functional activities are fundamental to maintaining independence and mobility. ADLs (activities of daily living) and IADLs (instrumental activities of daily living) are measures of dependency. People who experience a loss in the ability to perform one or more ADL may
be manifesting the only recognisable sign of an underlying illness, such as cardiac disease.
In section 2 of this chapter, we survey the application of pervasive systems in supporting activities with health significance. A set of important existing healthcare systems is described, in order to clarify how pervasive system architectures have been adopted to meet the requirements of the health care application domain and its subdomains. In section 3, an abstract architecture of pervasive healthcare systems is presented. For each module of the architecture, the services it offers and the technologies it uses are described. Then, as model architecture, we shall describe and analyse in section 4 the HEARTS system, a pervasive system architecture that integrates a number of cutting edge technologies to enable and support patients with congestive heart failure (CHF). The technological framework is designed to be open to many other patient categories and can provide a general-purpose tool for remote monitoring with integrated medical inference and controlled decision making capabilities. The infrastructure is based on state-of-the-art measurement and data acquisition through communities of devices and artifacts, wireless data transfer and communications, data fusion, medical inference and decision making, and automatic planning and execution of advanced response patterns and interaction scenarios. The chapter concludes with section 5, where a summary of all the presented architectures will be given, as well as a brief overview of some more projects and technologies that are related to pervasive healthcare.
2 Overview of Pervasive Healthcare Systems
As a first step we are going to present the architectural approaches adopted by some well known Pervasive Healthcare systems. In our research we focused on nine different systems. In this section we will provide a generic overview of those systems.
2.1 @HOME Starting with the @Home system [1] which is able remotely to monitor patient’s vital parameters, like ECG, blood pressure and oxygen saturation level. The patient is equipped with ambulatory sensors, which acquire this health care data, which is subsequently transmitted to a local PC station at the patient’s home via Bluetooth or DECT communication link. The local PC runs special software to analyse the acquired data and produces alerts, according to the parameter threshold table (the doctor is able to remotely set the threshold values of the monitored parameters on the patient’s PC). In case of any abnormality the findings and the collected vital data are immediately transmitted to the clinic via GSM for further analysis (Fig. 1): The whole system works independently without requiring any explicit interaction with the patient. For instance, the patient doesn’t connect any ambulatory sensor to
Fig. 1 @HOME System Architecture
Additionally, the patient’s PC is programmed to identify the sensors and to retrieve their readings immediately, as soon as they are accessible. At the clinic, the Monitoring module is responsible for monitoring the patient’s progress. It encapsulates the rules for manipulating the @HOME database and the parametric model of the typical patient for each disease. Then, using rules related to the monitoring of the patient’s progress, the system assigns specific values to the parameters of the model, based on the readings of the patient’s sensors.
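The threshold-based alerting described above can be illustrated with a short sketch. The code below is not taken from the @HOME implementation; it is a minimal, hypothetical illustration of how a doctor-defined parameter threshold table could be checked against incoming readings, with invented parameter names and limits.

# Minimal sketch of threshold-based alerting (hypothetical values and names,
# not the actual @HOME implementation).

# Threshold table: for each monitored parameter, the doctor-defined
# acceptable range (lower bound, upper bound).
THRESHOLDS = {
    "heart_rate_bpm": (50, 110),
    "systolic_bp_mmHg": (90, 150),
    "spo2_percent": (92, 100),
}

def check_reading(parameter, value, thresholds=THRESHOLDS):
    """Return an alert message if the reading falls outside its range, else None."""
    if parameter not in thresholds:
        return None  # parameter not monitored
    low, high = thresholds[parameter]
    if value < low or value > high:
        return f"ALERT: {parameter}={value} outside [{low}, {high}]"
    return None

if __name__ == "__main__":
    readings = [("heart_rate_bpm", 128), ("spo2_percent", 96)]
    for name, value in readings:
        alert = check_reading(name, value)
        if alert:
            # In @HOME the findings would then be forwarded to the clinic via GSM.
            print(alert)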
2.2 HEARTFAID

HEARTFAID [2, 3, 4] aims at defining efficient and effective health care delivery organization and management models for the “optimal” management of care in the field of cardiovascular diseases. The system is based on a knowledge-based platform of services, able to improve early diagnosis and to make the medical-clinical management of heart diseases within the elderly population more effective. The platform provides the following services:
• an electronic health record for easy and ubiquitous access to heterogeneous patient data;
• integrated services for healthcare professionals, including patient telemonitoring, signal and image processing, and an alert and alarm system;
• clinical decision support in the heart failure domain, based on pattern recognition in historical data, knowledge discovery analysis and inferences on patient clinical data.
2.3 ALARM-NET

ALARM-NET [5] is an Assisted-Living and Residential Monitoring Network for pervasive, adaptive healthcare developed at the University of Virginia. It integrates environmental and physiological sensors in a scalable, heterogeneous architecture, which supports real-time collection and processing of sensor data. Communication is secured end-to-end to protect sensitive medical and operational information. The system includes:
• An extensible, heterogeneous network architecture that addresses the challenges of an ad hoc wide-scale deployment, and integrates embedded devices, backend systems, and user interfaces;
• Context-aware protocols informed by circadian activity rhythm analysis, which enable smart power management and dynamic alert-driven privacy tailored to the individual’s patterns of activity;
• A query protocol for streaming online sensor data to user interfaces, integrated with privacy, security, and power management;
• SecureComm, a hardware-accelerated secure messaging protocol and a TinyOS-based module that supports user-selectable security modes and multiple keys;
• A system implementation and evaluation using custom and commodity sensors, an embedded gateway, and backend database and analysis programs.
Fig. 2 shows an example deployment of the system in an assisted-living community with many residents or patients.
Fig. 2 Assisted-living deployment example, showing connections among sensors, body networks, and backbone nodes
2.4 CAALYX

CAALYX (Complete Ambient Assisted Living Experiment) [6, 7] is an FP6-funded project that aims at increasing older people’s autonomy and self-confidence by developing a lightweight wearable device capable of measuring specific vital signs of the elderly, detecting falls, and communicating automatically in real time with his/her care provider in case of an emergency, wherever the elderly person happens to be, at home or outside. Specifically, the objectives of CAALYX are:
• To identify which vital signs and patterns are most important in determining probable critical states of an elder’s health;
• To develop an electronic device able to measure vital signs and to detect falls of the older person in the domestic environment and outside. This device has a geo-location system so that the monitoring system is able to know the elder’s position in case of emergency (especially when outdoors);
• To allow for the secure monitoring of individuals organised into groups managed by a caretaker, who will decide whether to communicate events identified by the system to the emergency service (112); and
• To create social tele-assistance services that can be easily operated by the users.
2.5 TeleCARE

The IST project TeleCARE [8, 9, 10] is developing a configurable common infrastructure so that elderly communities and those responsible for providing care and assistance can get the best out of the technologies developed. The TeleCARE approach integrates the Internet with agent and mobile-agent technologies (Fig. 3).
2.6 CHRONIC

The CHRONIC (An Information Capture and Processing Environment for Chronic Patients in the Information Society) project [11] was funded by the European Community in the early years of pervasive healthcare research. The main objective of the CHRONIC project was to develop a new European model for the care of targeted chronic patients based on an integrated IT environment. This new model of care was applied to three highly prevalent chronic disorders that are predicted to become the top three causes of mortality in the year 2020 (cardio-vascular, neurologic and respiratory disorders). The model, and the role that the use of IT played, were evaluated in a series of clinical trials during the life of the project. The CHRONIC System allows the development of integrated services of home care assistance. The complexity of the services depends on the patient’s clinical requirements or the supporting organisational structure. Thus, services can range from a simple follow-up call to the acquisition of the patient’s instrumental clinical data from a distance, which are evaluated at the control centre and dispatched to the networked health care professionals.
Fig. 3 Plug-and-play approach
2.7 MyHeart

MyHeart [12, 13, 14] is an Integrated Project (IP) funded by the EU in the context of FP6, aiming at developing intelligent systems for the prevention and monitoring of cardiovascular diseases. The project develops smart electronic and textile systems and appropriate services that empower the users to take control of their own health status. The system uses wearable technology to monitor the patient’s Vital Body Signs (VBS) and then processes the measured data in order to provide recommendations to the user. As illustrated in Fig. 4, the communication loop can either consist of direct local feedback to the user or of professional help by a physician or nurse. The latter will typically be provided remotely, which is why the MyHeart system also comprises a telemedical element. Data are transmitted to a remote server, where a professional can process them and contact the patient subsequently.
Fig. 4 MyHeart disease management and prevention approach
2.8 OLDES

The OLDES project [15] implements an innovative low-cost technological platform, which provides a wider range of services to elderly people. The goal of the system is to provide both entertainment and communication services. The main actors of the OLDES system are shown in Fig. 5. Different groups connect to the proposed system: thousands of elderly people (clients) and several animators and professionals. In the central system, information about all the users is stored. A direct connection between doctors/social services and elderly people can also take place but, usually, this will happen via the central system at the explicit request of the patient. Moreover, the on-line analysis of the client’s behaviour and health status allows health and social operators to give each client specific support if needed, through the various communication channels.
Fig. 5 OLDES Main Actor Groups
2.9 SAPHIRE

In the SAPHIRE project [16, 17], patient monitoring is achieved by using agent technology complemented with intelligent decision support systems based on clinical practice guidelines. The observations received from wireless medical sensors, together with the patient’s medical history, are used in the reasoning process. The patient’s history, stored in medical information systems, is accessed through semantically enriched Web services to tackle the interoperability problem. In order to guarantee long-term patient monitoring and successful execution of clinical decision support systems, the following services are supported:
• Accessing vital signs of the patient: in order to be able to monitor the patient’s current condition, the clinical decision support systems need to access the vital signs of the patient measured by wireless medical sensors and body-worn platforms.
• Accessing Electronic Healthcare Records of the patient: the gathered vital signs of the patient can only be assessed correctly when consolidated with the Electronic Healthcare Records (EHRs) of the patient.
• Accessing the clinical workflow systems executed at healthcare institutes: while the clinical decision support system is executing the Clinical Guideline Definition, it needs to interact with several modules of the clinical workflow executed at the healthcare institutions.
The SAPHIRE Clinical Decision Support System that is responsible for deploying and executing clinical guidelines in a heterogeneous distributed environment is implemented as a multi-agent system. Its agents communicate in a reactive manner and are instantiated and eliminated dynamically based on demand.
3 Reference Architecture

In this section we present the different architectures used in pervasive healthcare systems. One could claim that these systems follow the same high-level architecture and consist of:
• Input-Output Subsystem: all these systems use sensory input, albeit of varying data types. This input is provided by sensors attached either to the space or to the user, by wearable devices, or by specific healthcare devices that retrieve specific measurements. For each system we analyse the specific input it uses and provide an overview of the technology used to collect it;
• Local Subsystem: the local subsystem locally collects sensory and other information and provides it to the healthcare system. The local subsystem may appear in the form of a single computer or as a small computing device like a PDA. In other cases, it may not be a concrete system but rather a collection of
software agents, each one specializing in a different operation. Lastly, some systems provide a middleware as a common operating platform, so that the devices have a common understanding with the rest of the sensory system. Our purpose is to analyse the specifics of each subsystem and provide a thorough overview of it;
• Remote Subsystem: all pervasive healthcare systems send the collected data to a specific service provider that either does some kind of processing on them, or stores them into health medical records. Usually these subsystems reside at healthcare service providers, capable of providing scientific monitoring of the data and checking whether the patient is in a critical condition or not, in order to provide immediate support. A minimal code sketch of this three-part decomposition is given below.
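The sketch below expresses the three subsystems as minimal Python interfaces. It is our own illustration of the reference architecture discussed in this section, not code from any of the surveyed systems, and all class and method names are invented.

# Minimal interface sketch of the three-subsystem reference architecture
# (illustrative names only).
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Optional

class InputOutputSubsystem(ABC):
    """Sensors, wearables and medical devices attached to the user or the space."""

    @abstractmethod
    def read(self) -> Iterable[Dict]:
        """Yield raw readings, e.g. {'variable': 'spo2', 'value': 97}."""

    @abstractmethod
    def notify(self, message: str) -> None:
        """Optional output channel back to the user (screen, lamp, voice)."""

class LocalSubsystem(ABC):
    """Gateway (PC, PDA or agent community) in the patient's local space."""

    @abstractmethod
    def collect(self, io: InputOutputSubsystem) -> List[Dict]:
        """Gather and pre-process readings from the input-output subsystem."""

    @abstractmethod
    def forward(self, readings: List[Dict]) -> None:
        """Send the (possibly filtered) readings to the remote subsystem."""

class RemoteSubsystem(ABC):
    """Healthcare-provider side: storage, analysis and alerting."""

    @abstractmethod
    def store(self, patient_id: str, readings: List[Dict]) -> None:
        """Keep the readings in the patient's health record."""

    @abstractmethod
    def assess(self, patient_id: str) -> Optional[str]:
        """Return an alert description if the patient's condition is critical."""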
3.1 Input-Output System

The types of input that pervasive healthcare systems collect depend on the medical variables they monitor or analyze. Sources of input are medical devices, everyday devices and objects, sensors embedded in the user’s local space, or audiovisual recording devices. In order to achieve closer and more seamless monitoring of the user’s health with the least possible intrusion in the patient’s lifestyle, these devices are sometimes adapted as wearable devices or become embedded in specific clothing. The most widely monitored medical variables are those related to Cardio Vascular Diseases (CVD); thus, the medical devices that are used are mainly:
• Electrocardiograph (ECG)
• Heart Rate Monitor (HRM)
• Blood Pressure Monitor (BPM)
• Oxygen Saturation Monitor (SpO2)
These devices are in most cases interfaced to a local data collecting station, such as a PC or a PDA. In the @HOME system, for example, the wearable sensors transmit continuous information to the patient’s computer (Fig. 6). Furthermore, all the sensors are portable, meaning that the patient is able to use them outdoors as well. The sensors are wired through serial ports to a serial communication hub, which in turn is interfaced to a Pocket PC or iPAQ that is equipped with a Bluetooth card. ALARM-NET uses a combination of a wearable pulse oximeter and several ECG sensors, all connected to MicaZ [18] devices. These collect patient vital measurements for heart rate (HR), heartbeat events, oxygen saturation (SpO2), and electrocardiogram (ECG). The MyHeart system combines input from specific medical devices together with that coming from Body Signal Sensors (BSS) embedded in a specifically designed shirt (Fig. 7). The variables monitored include electrocardiogram (ECG), respiration and motion activity. In addition to the medical devices, there is a need to take measurements of non-medical variables, which may provide information about the user’s current status. These are called Activities of Daily Living (ADL) and are described as “things we normally do in daily living including any daily activity we perform for self-care (such as feeding ourselves, bathing, dressing, grooming), work, homemaking, and leisure” [19].
Fig. 6 @HOME Sensors’ communication schema
Fig. 7 Architecture of the MyHeart Take Care System
Typical ADL measurements include body weight, movement patterns and sleeping habits. Most of the systems that monitor patients at home take measurements concerning a set of ADL variables. For example, in the ALARM-NET system, the monitoring of ADL variables is a highly important task. That is why the system includes several elements capable of measuring ADL variables, including a motion sensor used to track presence in the house, a body network that records human activities such as walking, eating and stillness and incorporates a GPS to track the outdoor location of the user, an indoor temperature and luminosity sensor, and a bed sensor (Fig. 8). All these are interfaced to a MicaZ wireless sensor node that processes the sensor data using interrupts and forwards the information through the wireless network.
Fig. 8 ALARM-NET architecture components and logical topology
While all systems gather some kind of sensory input, some of them also provide output subsystems for direct communication with the user. For example, in the CHRONIC Integrated Care Platform, the Patient Units (Fig. 9) support video conferencing, telemonitoring, messaging and educational services. The units are to be used in connection with a TV set. The telemonitoring functionality is provided in combination with a set of sensors, which measure oxygen saturation, ECG and spirometry, and an accelerometer.
Fig. 9 CHRONIC’s Patient Unit (with sensors)
In the majority of the above cases, the input-output subsystem is connected either to the local system or directly to the remote monitoring system, using one or more of the common communication protocols. Commonly used wired technologies are serial connections, X10 and Ethernet. However, with the evolution of wireless technologies, which makes the monitoring units more portable, Bluetooth [20] has become a very common choice for Body Area Networks, followed by 802.11b/g [21]. Some systems, however, follow the sensor network approach, such as ALARM-NET, which uses MicaZ motes as sensor nodes that communicate over 2.4 GHz IEEE 802.15.4 using the TinyOS Wireless Measurement System [22].
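Independently of the transport used (serial, Bluetooth, 802.11 or 802.15.4), the data handed from the input-output subsystem to the rest of the system can be viewed as a stream of timestamped readings. The record below is a hypothetical illustration of such a reading; it is not a format defined by any of the surveyed systems or by the standards cited above.

# Hypothetical record for a single reading produced by the input-output subsystem.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Reading:
    patient_id: str   # who the reading belongs to
    source: str       # producing device, e.g. "ECG", "SpO2", "bed_sensor"
    variable: str     # measured variable, e.g. "heart_rate", "oxygen_saturation"
    value: float
    unit: str         # e.g. "bpm", "%", "boolean"
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Medical variables and ADL variables fit the same record shape:
ecg = Reading("patient-42", "ECG", "heart_rate", 72.0, "bpm")
adl = Reading("patient-42", "bed_sensor", "in_bed", 1.0, "boolean")
print(ecg)
print(adl)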
3.2 Local System

The local subsystem is usually used as a collector for the sensory input that each system provides. It is usually implemented on a PC that resides in the patient’s local space and is connected to the Internet, so that it can transfer the locally collected data to a remote system. In a much simpler case, the local PC is replaced by a router routing data from the local sensors to the remote system. Such is the case of the @HOME system, which uses a desktop PC or a laptop positioned at some point in the patient’s home to provide the user end of the communication link between the clinic and the patient’s home. Its functionalities are the following:
• It receives all recordings transmitted from the medical sensors and transmits them to the clinic’s Central System Server. The transmission is realized via a wireless infrastructure: the Remote Controller establishes a TCP/IP connection over GSM/GPRS and conveys the data to the Clinic’s Central Server (CCS).
• It controls all remote sensors connected to it. It is responsible for their proper functionality: it checks the status of each sensor, identifies communicating sensors, downloads the readings and uploads schedules to the sensors.
• Additionally, the Remote Controller enables the patient and his family members to monitor his progress using an appropriate user interface. That interface offers limited functionality compared to the Clinic User Interface.
The same approach is adopted by the CHRONIC Integrated Care Platform, where the user station has the responsibility of retrieving the local data and sending them to the Patient Management Module. Similarly, in the OLDES system, a PC is used to provide user entertainment (tele-accompany) and care facilities through a series of easy-to-access thematic channels. The PC is integrated with multiple wearable and ambient sensors and devices that gather the information to be sent to a remote Central Node (CN), as shown in Fig. 10.
Some systems use a middleware layer, which manages the data exchange between the various system modules. In addition, this level ensures that all the incoming, outgoing and exchanged information, as well as all the communications taking place among the internal modules of the platform and between the platform and the
Fig. 10 OLDES’ System architecture
external applications are compliant with the standards for clinical data representation and communication. In the case of the HEARTFAID system, all biomedical devices interact with the middleware level. The TeleCARE platform consists mainly of a set of agents, whose purpose is to discover local services, offered either by existing medical devices or by ubiquitous computing devices and sensors. The Device abstraction layer of the TeleCARE platform provides the necessary interfaces to the sensors, monitoring devices and other hardware (home appliances, environment controllers, etc.). In the TeleCARE environment each service can be implemented as a set of distributed stationary and mobile agents. For instance, a monitoring service might involve a stationary agent in the care center (interacting with the care worker), a number of stationary agents in the elderly home (agents in charge of monitoring local sensors, e.g. a temperature sensor or a presence sensor), and a mobile agent sent from the care center to the elderly home (Fig. 11). This mobile agent might carry a mission to collect information from different sensors and report back to the care centre.
A different approach is used in the SAPHIRE architecture. The bottom layer represents the actual sensor hardware. Above that, the networking layer is comprised of a Bluetooth stack and a TCP/IP stack. The sensor driver layer implements the communication protocol that determines the sensor’s data structure and how it is transmitted through the network layer. If a new sensor is introduced to the system, the Data Point Abstraction layer ensures that only the proprietary sensor driver needs to be adapted. A Virtual Device, placed in the layer above the Data Point Abstraction, can, but does not necessarily have to, correspond to a physical device: a Virtual Device can also receive its data from an algorithm that derives data from other (possibly also virtual) devices (Fig. 12). The sensor data coming from the wireless sensors to the gateway computer are exposed as Web services (Fig. 13). These Web services have been published to the UDDI Registry together with their functionality semantics, which are based on the IEEE 11073-10101 standard.
Fig. 11 Example of a service implementation
Fig. 12 Sensors and Data Point Abstraction
In other cases there is no concrete local system in the form of a single unit. For example, in the case of ALARM-NET the infrastructure of the local system consists of a sensor network. Data from the body network and the sensors are transmitted to user interfaces or back-end programs. These devices form a multi-hop wireless network to the nearest AlarmGate application. AlarmGate applications run on embedded platforms, managing system operations and serving as application-level gateways between the wireless sensor and IP networks. These nodes allow user interfaces to connect, authenticate, and interact with the system. There are also systems that do not support any kind of local subsystem, such as CAALYX, where both roaming and home monitoring devices are connected directly to the Internet in order to provide their data to the healthcare systems (Fig. 14).
Fig. 13 Sensor and Medical Web Services Architecture
Fig. 14 CAALYX system diagram
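The Virtual Device concept described above for SAPHIRE, where a device may obtain its data from an algorithm over other (possibly virtual) devices rather than from physical hardware, can be sketched as follows. This is our own illustration with invented names and an invented derived quantity; it is not SAPHIRE code.

# Sketch of the Virtual Device concept: a device whose value is computed from
# other devices instead of being read from hardware. Names are illustrative only.

class PhysicalDevice:
    """Stands in for a proprietary sensor driver; here it just returns a stored value."""
    def __init__(self, name, value):
        self.name = name
        self._value = value

    def read(self):
        return self._value

class VirtualDevice:
    """Derives its reading from one or more other (physical or virtual) devices."""
    def __init__(self, name, sources, derive):
        self.name = name
        self._sources = sources
        self._derive = derive   # function combining the source readings

    def read(self):
        return self._derive([s.read() for s in self._sources])

# Example: a virtual device computing the ratio of heart rate to systolic
# blood pressure from two physical devices.
hr = PhysicalDevice("heart_rate", 95.0)       # beats per minute
sbp = PhysicalDevice("systolic_bp", 110.0)    # mmHg
ratio = VirtualDevice("hr_sbp_ratio", [hr, sbp], lambda v: v[0] / v[1])
print(ratio.read())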
3.3 Remote Subsystem

The data collected by the local subsystems are usually sent to a remote system. The remote systems are mostly sophisticated systems that run special algorithms for further processing and analysis. They combine both medical and ADL data in order to provide a holistic view of the patient’s current medical status. Furthermore, they may keep thorough patient health records in order to have a complete overview of each patient’s fact file. A typical remote system, such as the one supported by the @HOME system, consists of:
• the Clinic’s Central Server (CCS), which is responsible for monitoring the patient’s progress based on a parametric model of the typical patient;
• a health record system/database, which enables the system to keep the records of each patient in a uniform way;
• the Clinic User Interface (CUI), which provides access to each member of staff in the clinic.
Some other remote systems, such as that of ALARM-NET, provide online analysis of sensor data and long-term storage of system configuration, user information, privacy policies, and audit records. Note that, as wireless sensor networks grow stronger in their capability to collect, process, and store data, privacy and the protection of personal information become rising concerns. Systems must include a framework to protect privacy and still support the need to provide timely assistance to patients in critical health conditions.
In the CAALYX system, all data and alerts produced by both the Roaming and Home Monitoring systems are sent to the Central Care Service and Monitoring System. The caretaker will evaluate whether received alerts need to be communicated to the emergency service, in which case the geographic position and data about the likely type of emergency (fall, stroke, etc.) will be provided to the emergency service, so that a suitably equipped emergency team may be dispatched in a timely manner to the patient’s location. Besides this service, video-communication with the home environment will be provided to attend to the older person’s demands.
In the CHRONIC system, the Chronic Care Management Centre (CCMC) is the core of the whole system and is composed of two main modules: the Call-Centre and the Patient Management Module. This centre can be placed inside a hospital or another healthcare structure, or it can be part of a service centre from which requests and/or clinical data can be transmitted to the clinical reference centre of the patient.
The HEARTFAID system uses an ontology-based decision support mechanism. The knowledge level integrates symbolic knowledge and computational reasoning models. The decision support subsystem implements data processing algorithms and provides guidelines for using medical protocols and accessing the knowledge base, as well as alarms and diagnostic suggestions in case of critical situations. The architecture of the CDSS subsystem is typical of an ontology-based system, comprising the following components (Fig. 15):
Fig. 15 The general view of the HEARTFAID CDSS architecture - dashed arrows correspond to references to the ontologies, while the others denote direct communication
• Domain Knowledge Base, consisting of the domain knowledge, formalized from the European guidelines for the diagnosis and treatment of chronic HF, and the clinicians’ know-how;
• Model Base, containing the computational decision models, as well as signal and image processing methods and pattern searching procedures;
• Meta Knowledge Base, composed of the strategic knowledge about the organization of CDSS tasks;
• Brain, the system component endowed with the reasoning capability;
• Explanation Facility, providing the means to explain the conclusions reached.
As one can see, the knowledge in the system is partitioned into different modules, something that makes knowledge maintenance easier. Specifically, the Domain Knowledge Base uses a hierarchy of core and upper-level ontologies combined with rules as its knowledge representation formalism; this facilitates easy re-use and sharing of knowledge.
In the agent-based architecture adopted in the TeleCARE platform, the main purpose of the Care Center is to produce specific task agents to be sent to the local system in order to collect and report back information from different sensors. The Core MAS Platform (Fig. 16) is the main component of the basic platform. It supports the creation, launching, reception (authentication and rights verification), and execution of stationary and mobile agents, as well as their interactions.
The SAPHIRE system architecture combines multi-agent systems with ontologies. The Clinical Decision Support System is implemented as a multi-agent system, so as to cope with the heterogeneity and dynamic nature of the distributed clinical environment. Agents are autonomous components that communicate with each other in a reactive manner; some of these components can be instantiated and eliminated dynamically. Various roles are assigned to the system agents, as can be seen in Fig. 17.
Fig. 16 The TeleCARE platform architecture
Among these, the Ontology agent resolves the semantic heterogeneity that appears when the medical Web services, the sensor data, and the EHR documents use different reference information models and clinical terminologies.
Fig. 17 The SAPHIRE Multi Agent System
4 The HEARTS System

The HEARTS system [23] aims at assisting in the rehabilitation and remote monitoring of people with cardio-vascular diseases by using an ambient monitoring environment installed in the user’s own space. The system adds services to the user’s space, which help the patient in taking the necessary measurements himself, in his own local space. At the same time, it supports remote monitoring of user activities, as they can provide more information about the user’s daily status. The HEARTS system gathers information using devices and sensors in the user’s local space, filters this information in the local subsystem and forwards the formatted information to the specialized remote system, which can take decisions about the patient’s status. The above tasks are carried out without intruding on the patient’s daily activities.
The functionality of the system transcends simple monitoring and plain comparison of measurements to predefined values that could lead to alarms. When potentially negative trends are identified in the monitored health variables, the system issues effective pre-alarms or alarms, involving the automatic creation and execution of in-situ responses (such as audible and/or visible feedback through appropriately adapted familiar common devices) and external responses (engaging health service providers and other formal and informal health structures). Interaction scenarios with the user are integrated into the system’s operation either as part of a medical data acquisition schedule or as an automatically created system response.
Fig. 18 HEARTS System Architecture
The HEARTS architecture is composed of the following subsystems (Fig. 18):
• Local Devices/Services: contains all the devices and services that are deployed in the user’s local space. These are used not just for sensory input but also for actuation (such as warning messages and guidelines for device usage);
• Local Subsystem Manager (LSM): this is the gateway for the local services, where initially processed data are gathered, packaged and sent to the remote system;
• HEARTS Middleware: the purpose of the HEARTS middleware platform is
– to provide a common interface between the devices and the LSM, in order to maintain a rich communication, and
– to provide a common interface among the devices themselves, in order to form artifact communities;
• Remote Subsystem: the purpose of the remote subsystem (which resides at a health care service provider) is to collect the data for each patient from each local subsystem and to run sophisticated algorithms, which may derive pre-alarm or alarm states. In such cases the system either notifies its administrator to contact the patient or sends back to the LSM the appropriate notification messages to be presented to the user through his local devices.
A simple usage scenario of the system goes as follows. The system notifies the patient to take a measurement (e.g. an ECG). The measurement data are sent to the LSM through the middleware communication protocol. The LSM packages the data into a secure XML envelope and sends it to the remote subsystem. The remote subsystem combines the data with past measurements (e.g. those taken during the past month) and tries to assess the patient’s current status by running the appropriate algorithms. The system decides that there is a possibility of heart failure in the coming days. It sends a message back to the local system for a pre-alarm warning and communicates with the local administrator. When the LSM receives the pre-alarm message, it forwards it to a vocal notification device, which warns the user to contact his doctor because his readings are worrying.
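The following sketch illustrates the kind of decision step implied by this scenario: the remote subsystem combines the new measurement with the recent history and classifies the state as normal, pre-alarm or alarm. It is a deliberately simplified, hypothetical stand-in for the HEARTS decision algorithms, which are not detailed in this chapter; the thresholds and the trend rule are invented for illustration.

# Simplified, hypothetical stand-in for the remote subsystem's decision step
# (not the actual HEARTS algorithms).
from statistics import mean

def assess(history, latest, alarm_limit=180.0, prealarm_rise=0.15):
    """Classify a monitored variable as 'normal', 'pre-alarm' or 'alarm'.

    history       -- past measurements (e.g. the last month), oldest first
    latest        -- the newly received measurement
    alarm_limit   -- hard limit that triggers an immediate alarm (invented)
    prealarm_rise -- relative increase over the historical mean that triggers
                     a pre-alarm (invented trend rule)
    """
    if latest >= alarm_limit:
        return "alarm"
    if history and latest > (1.0 + prealarm_rise) * mean(history):
        return "pre-alarm"
    return "normal"

# Example: systolic blood pressure drifting upwards over the past month.
past_month = [128, 131, 129, 133, 135, 137]
print(assess(past_month, latest=155))   # "pre-alarm": the LSM would warn the user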
4.1 Local Devices/Services

The HEARTS system aims to provide the user-patient with a set of services that enable him to take any measurement that may be needed in order to maintain a reliable overview of his heart condition, while minimizing intrusion into his living environment. These are classified in the following categories:
• Medical Devices: all the devices used for taking the necessary measurements of medical variables. These include:
– Electrocardiograph (ECG): an ECG device that is able to analyse the patient’s heart state;
– Oximeter: the oximeter measures the amount of oxygen (SpO2) in the user’s blood;
– Blood Pressure Monitor: the BPM device is used to take regular readings of the user’s blood pressure;
– Thermometer: a simple temperature reading device that records the patient’s temperature.
• ADL Devices: a set of devices and artifacts that monitor ADL variables. Currently, two devices are used:
– Weight Scale: the measurement of everyday weight is considered important, since most people with CVD problems tend to have abnormal weight changes;
– Bed: a specifically designed bed with embedded sensors able to monitor the time that the user stays in bed. Abnormal sleeping patterns of a patient may indicate problems that may be connected with imminent cardiac arrest.
• Sensors: a set of sensors attached to the home enables the system to record the current state of the patient’s local space. These include temperature, light and location sensors.
• Actuators: artifacts (such as specially designed lamps) that are able to provide the end user with health-related information from the system are deployed in the local system. These can be used to provide warning messages to the user (e.g. a lamp that flashes may indicate to the user that he is in a pre-alarm state) or to provide an easy-living ambient environment in conjunction with the sensors of the system (e.g. a lamp that lights up if the light sensors of the system sense low ambient luminosity).
All the above devices/services are adapted to the system with the help of the HEARTS-OS middleware, which is analysed in the next section.
4.2 HEARTS-OS

The HEARTS-OS kernel is the fundamental component of the HEARTS system architecture (Fig. 19). It is the smallest set of components and functionalities needed so that any device can become an artifact and interact with the HEARTS system. The kernel itself (Fig. 20) is divided into four distinct parts: the Communication Module, the Process Manager, the Memory Manager and the State Variable Manager. The Communication Module is responsible for handling communication between different artifact nodes and the local server. This module implements algorithms and protocols for wireless, connectionless communication (using the 802.11b/g protocol), as well as mechanisms for the internal diffusion of the information exchanged.
Fig. 19 HEARTS-OS Architecture
Fig. 20 HEARTS-OS Kernel
The Process Manager is the coordinator module of HEARTS-OS. Some of its most important tasks are to manage the processing policies of HEARTS-OS, to accept and serve various tasks set by the other modules of the kernel, and to implement the artifact collaboration protocol. The State Variable Manager is a repository of the hardware environment (sensors/actuators) inside HEARTS-OS, reflecting at each particular moment the state of the hardware. Finally, the Memory Manager is a local storage module for various storage operations, like communication paths, currently available connections, etc.
The communication between artifacts, or between artifacts and the local server, is perceived as peer-to-peer. Communication between two peers is accomplished with asynchronous XML messages. In order to provide stability as well as abstraction to the communication interface, we have divided the communication system into three layers.
• Peer: A peer is any networked device that implements one or more communication protocols. Each peer operates independently and asynchronously from all other peers and is uniquely identified by a Peer ID. Peers allow other components higher in the hierarchy, like the Process Manager, to register themselves as event listeners, in order to receive notifications for incoming data. Peers are
also responsible for the implementation of communication protocols, such as the discovery protocol, message exchanging protocols, the routing protocol, etc.
• Pipe: Each peer owns two pipes: an input pipe waiting for incoming messages and an output pipe that sends either unicast or multicast messages to the network. Pipes are an asynchronous and unidirectional message transfer mechanism used for service communication. Pipes are indiscriminate; they support the transfer of any object, including binary code, data strings, and Java technology-based objects. However, as HEARTS-OS is a platform- and language-independent system, only XML messages are transferred through pipes.
• Endpoint: Each pipe is bound to one or two Endpoints, which are considered the fundamental networking units of the Communication Module. An Endpoint may be associated with specific network resources, such as a TCP port and an associated IP address. An input pipe owns two Endpoints, one for listening to multicast messages and one for listening to TCP p2p messages. An output pipe owns one Endpoint, which is configured dynamically to send either multicast messages to a default multicast group, or p2p TCP messages.
The HEARTS-OS Communication Module defines a series of XML message formats and protocols for communication between artifacts, which use these protocols to discover each other, advertise and discover network resources, and communicate and route messages. There are four basic and two complementary communication protocols:
• Gadget Discovery Protocol (GDP). This protocol is used to discover artifacts in the local space using resources as criteria. Resources are represented as advertisements, and usually contain information about a remote artifact (i.e. ID, services, relay, etc.). This information can then be used to establish a p2p communication. An artifact creates a discovery message that is sent to all listening artifacts through a multicast Endpoint. There are two kinds of discovery messages: simple ones, where a specific ID is requested, and complex ones, where a specific service from an unknown artifact is requested.
• Gadget Advertisement Protocol (GAP). This protocol is used to advertise one or more attributes of an artifact. An advertisement message is usually a response to a discovery message, and is thus formed according to the requested information.
• Gadget Information Protocol (GIP). This protocol is used for the exchange of data between artifacts. It uses a single unicast pipe to send the information to a specific peer. Each data message is folded inside an envelope that contains the recipient’s identification.
• Message Acknowledging Protocol (MAP). This underlying protocol is responsible for the robustness of the communication system. Its main task is to provide acknowledgements (ACK) to the sender for each message that arrives and NACK receipts for each lost message. Furthermore, if a message is lost, the MAP protocol is responsible for resending it to the target artifact.
Apart from this basic protocol scheme, there exist two complementary protocols that may optimize the network performance and increase the capabilities of the HEARTS system.
• Service-Resource Discovery Protocol (SRDP). This protocol is used to discover specific services and resources that particular artifacts may offer throughout the network. The services or resources may vary from low-level resources, like CPU and memory, to high-level ones, like light, display and sound capabilities.
• Routing Protocol (RP). This protocol deals with the problem of what an artifact should do when it cannot reach a synapse peer directly. A solution is to apply the relay technique.
While the Communication Module handles all the message transactions between two artifacts (or between artifacts and the server), there is an important virtual communication level that resides in the Process Manager: the Plug-Synapse communication model (Fig. 21).
Fig. 21 The Plug-Synapse Communication Model
• Plugs. A plug belongs to an artifact and is an abstraction of its properties and abilities. It is the only way other artifacts can use an artifact’s services and have access to the artifact’s properties, and it is accessible to other artifact modules via HEARTS-OS only. Each artifact has several S-Plugs and one T-Plug.
– S-Plugs. An S-Plug is the plug that describes services; there is one S-Plug per service.
– T-Plug. Each artifact has a T-Plug that contains its physical properties as well as a list of its S-Plugs. A T-Plug constitutes the HEARTS identity of a component, as it gathers all the HEARTS-specific information in one object.
• Synapses. A synapse is a connection between two S-Plugs. By collaborating with the Communication Module and the State Variable Manager, the Process Manager sets up an event-based internal messaging system that combines input from sensors and actuators with input received via synapses from other artifacts.
The Plug-Synapse model envelops a set of services that can initiate a virtual path, send data through this path and break it if necessary. The communication is done through well-defined XML messages. These services are:
• Synapse Request: one of the artifacts sends a connection request message to another artifact. The message contains information concerning the local and remote plugs.
• Synapse Response: when the target artifact receives the message, it checks the compatibility of the remote and local plugs. If the plug compatibility test is passed, an instance of the remote plug is created and a positive response is sent to the artifact that initiated the synapse. The instance of the remote plug is notified of changes by its remote counterpart plug and in fact serves as an intermediary communication level. In case of a negative plug compatibility test, a negative response message is sent to the remote plug, while no instance of the remote plug is created. When the initial artifact receives a positive response, it also creates an instance of the remote plug, for the same reason as before, and the connection is established.
• Synapse Activation: after the connection is established, the two plugs are capable of exchanging data. Output plugs use specific objects, called shared objects (SO), to encapsulate the plug data to be sent, while input plugs use specific event-based mechanisms, called shared object listeners (SOL), to become aware of incoming plug data. When the value of the shared object of the output plug is changed, the instance of the remote plug is notified and a synapse activation message is sent to the other artifact. The remote artifact receives the message and changes the shared object of the instance of the remote plug. This, in turn, notifies the target local plug, which reacts as specified.
• Synapse Disconnection: finally, if one of the two connected plugs breaks the synapse, a synapse disconnection message is sent to the remote plug in order to also terminate its end of the synapse.
The communication scheme described above supports artifact-to-artifact communication. The logical communication channel used between the artifacts and the Local Server is provided by the HEARTS-OS middleware. The Local Server and the artifacts exchange information through a specific Synapse between specially designed Plugs, the Monitoring Plugs (MPlugs). The MPlug inherits its basic properties from the Plug class and therefore has all the abilities of a single Plug. However, its functionality is enhanced in the following ways:
• Monitoring of the low-level context of the host artifact. When created, the MPlug is connected to the State Variable Manager in order to retrieve the low-level data of the host artifact. From this point on, any change in those data is reflected to the MPlug through an event-based mechanism.
• Monitoring of the high-level state. The MPlug may receive high-level monitoring data, provided by the inference engine, representing higher-level states of the artifact. These variables are processed in the same manner as the state variables.
• Monitoring of the variables of a remote artifact. The MPlug can receive the variables of a remote artifact (via a synapse) and monitor the data they represent.
The MPlug has a dual role: on one hand, it monitors the parameters of a local artifact, and on the other hand, it records changes in the synapse for visualization purposes. Two connected MPlugs are completely synchronized and any change made on one is also sent to the other (Fig. 22). A minimal sketch of the synapse life cycle is given below.
Fig. 22 MPlug Communication Mechanism
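The synapse life cycle just described (request, compatibility check, activation via shared objects, disconnection) can be sketched in a few lines of code. The sketch below mirrors the described behaviour with plain in-process Python objects rather than XML messages over the network; all identifiers are our own and it is not HEARTS-OS code.

# Minimal in-process sketch of the Plug-Synapse life cycle (request, response,
# activation, disconnection). Identifiers are illustrative, not HEARTS-OS code.

class Plug:
    def __init__(self, artifact, service, data_type):
        self.artifact = artifact        # owning artifact's name
        self.service = service          # service the plug exposes
        self.data_type = data_type      # used for the compatibility test
        self.listeners = []             # shared-object listeners (input side)

    def on_data(self, callback):
        self.listeners.append(callback)

class Synapse:
    def __init__(self, output_plug, input_plug):
        # Synapse Request/Response: only compatible plugs may be connected.
        if output_plug.data_type != input_plug.data_type:
            raise ValueError("plug compatibility test failed")
        self.output_plug = output_plug
        self.input_plug = input_plug
        self.connected = True

    def activate(self, shared_object):
        # Synapse Activation: a change of the output plug's shared object is
        # propagated to the input plug's listeners.
        if not self.connected:
            raise RuntimeError("synapse has been disconnected")
        for listener in self.input_plug.listeners:
            listener(shared_object)

    def disconnect(self):
        # Synapse Disconnection: both ends stop exchanging data.
        self.connected = False

# Example: an ECG artifact feeding readings to the LSM's monitoring plug.
ecg_out = Plug("ECG", "heart_rate_out", data_type="float")
lsm_in = Plug("LSM", "heart_rate_in", data_type="float")
lsm_in.on_data(lambda value: print("LSM received", value))
synapse = Synapse(ecg_out, lsm_in)
synapse.activate(71.5)
synapse.disconnect()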
4.3 HEARTS Local Subsystem Manager (LSM)

The HEARTS Local Server contains a series of end-user tools that support the creation of pervasive computing applications that collect measurements. These tools interact with the artifacts at the HEARTS-OS level in order to create or break connections and to monitor data through the conceptual model of HEARTS-OS. Furthermore, the Local Server is responsible for forwarding the information to a remote system, where specific algorithms run in order to make a decision about the patient’s current health state. Lastly, an important task for the Local Server is to deliver informative messages from the remote system to the user, such as an imminent warning about his condition.
The Local Server’s tools aim to (i) create the appropriate tests so that the Local Server can gather the required measurements, (ii) create home-based applications between artifacts, and (iii) trace back specific data in order to provide the end user with historical data about a medical variable.
The Scenario Creation Tool (Fig. 23) aims to discover the local devices and to provide the user with an appropriate mechanism to create scenarios of device usage. It is essentially an interface where the user can add specific tests that need to be performed at certain intervals (e.g. take ECG measurements every 6 hours). The tool provides a series of steps where the user can add the interactions needed between the system and the patient in order to take the appropriate measurement. After the execution of a scenario, the system can decide whether the scenario was executed correctly and collect the appropriate measurements, or it may notify the user to repeat the scenario.
The Artifact Editor is a complementary tool that can be used to create associations among artifacts used to monitor ADL (Fig. 24). The tool provides an overview of the artifacts that exist in the area and their services. It also supports the establishment of Synapses between the artifact Plugs, so as to facilitate data exchange.
Fig. 23 Local Subsystem Manager - Scenario Creation tool
For example, one can connect an artifact chair with a desk lamp so that the lamp is switched on whenever the user is sitting on the chair. Such applications, created by the end-users themselves, aim to ensure the user’s easy living.
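To illustrate the kind of artefacts these tools produce, the sketch below encodes a measurement scenario (“take ECG measurements every 6 hours”) and a chair-lamp association as plain data structures. The representations are hypothetical; the actual formats used by the Scenario Creation Tool and the Artifact Editor are not specified in this chapter.

# Hypothetical representations of what the LSM tools might produce;
# not the actual HEARTS file formats.

# Scenario Creation Tool: a scheduled measurement with its interaction steps.
ecg_scenario = {
    "name": "routine ECG",
    "device": "ECG",
    "every_hours": 6,
    "steps": [
        "notify patient via vocal device",
        "wait for patient to attach electrodes",
        "record 30 s ECG trace",
        "confirm recording quality or ask the patient to repeat",
    ],
}

# Artifact Editor: an association (synapse) between two artifact plugs,
# e.g. switch the desk lamp on whenever the user sits on the chair.
chair_lamp_rule = {
    "source": {"artifact": "chair", "plug": "occupied"},
    "target": {"artifact": "desk_lamp", "plug": "switch"},
    "mapping": {True: "on", False: "off"},
}

def apply_rule(rule, source_value):
    """Return the command to send to the target plug for a given source value."""
    return rule["target"]["artifact"], rule["mapping"][source_value]

print(apply_rule(chair_lamp_rule, True))   # ('desk_lamp', 'on')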
4.4 Remote Monitoring

In the HEARTS system, the data that are collected locally are sent to a remote server for further processing. The remote server is designed in such a way that it can monitor multiple local environments, and it includes decision-making mechanisms in order to extract information about the current state of the patient and to deduce whether the patient is facing an imminent alarm condition, such as heart failure (Fig. 25). The communication protocol is based on a simple XML mechanism. However, because personal data concerning a patient’s health state are sensitive, security and privacy algorithms are applied throughout the whole process. Any message that is sent to the Remote Server and contains sensitive data is packed within two different envelopes: the Privacy envelope, which ensures that the data are encoded with a user-specific key so that they can be used only by the specific user, and the Security envelope, which ensures that the information is encrypted point-to-point.
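The double-envelope idea can be sketched as follows. The example uses only the Python standard library and invented element names; the real HEARTS envelope formats, keys and ciphers are not specified in this chapter, so the "encoding" below is merely a placeholder marking where the privacy (user-specific) and security (point-to-point) layers would be applied.

# Sketch of the two-envelope packaging of a measurement (illustrative only).
import json
from base64 import b64encode

def privacy_envelope(measurement: dict, user_key: str) -> dict:
    # Placeholder for encoding the payload with a user-specific key.
    payload = b64encode(json.dumps(measurement).encode()).decode()
    return {"envelope": "privacy", "user_key_id": user_key, "payload": payload}

def security_envelope(inner: dict, channel_key: str) -> dict:
    # Placeholder for point-to-point encryption between the LSM and the remote server.
    payload = b64encode(json.dumps(inner).encode()).decode()
    return {"envelope": "security", "channel_key_id": channel_key, "payload": payload}

measurement = {"patient_id": "patient-42", "variable": "heart_rate", "value": 71.5}
message = security_envelope(privacy_envelope(measurement, "user-key-01"), "chan-key-07")
print(json.dumps(message, indent=2))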
Fig. 24 Artifact Editor
Fig. 25 Remote subsystem communication module and risk detection
The Remote Server also provides a set of tools for monitoring a patient’s medical data over a period of time, so that a doctor can be aware of the current health state of the patient. Data are encrypted in the remote system so that only the attending doctor can have access to a patient’s medical history. However, an alert that results from the decision-making algorithms becomes known to anyone who is in the interest set of the user (family, doctor, health service provider).
5 Conclusions

In this chapter, we surveyed a number of pervasive systems that provide holistic pervasive healthcare support and can be regarded as representing different architectural approaches. Three subsystems are commonly found in these systems: the input/output subsystem, the local processing subsystem and the remote subsystem. The first is used as a source of health-related data (a combination of medical variables and ADLs), as well as a communication channel between the system and the user. The local processing subsystem may simply transmit the data to the remote system, or it may apply complex processing to them before sending them. The remote subsystem stores and processes the data and creates alerts when the patient’s condition becomes critical. While all the systems presented in this chapter fall into the same architectural style, each of them presents a different set of novelties, such as:
• Wearable body electronics that are seamlessly added to the patient’s clothing (or integrated as part of the clothing, like shirts), which enable medical and ADL measurements to be recorded (@HOME, CAALYX, ALARM-NET, MyHeart);
• Middleware components specifically tailored for medical systems, in such a way that the whole system will not lose any critical data (HEARTFAID, HEARTS);
• Agent-based systems that are able to pick the appropriate devices for specific critical medical tasks (TeleCARE, SAPHIRE);
• Development of medical ontologies for providing better decision support systems (TeleCARE, HEARTFAID, SAPHIRE);
• Sophisticated algorithms for decision support that are able to provide a priori knowledge of heart failure conditions (HEARTS).
The list of pervasive healthcare systems grows constantly. Further examples include MobiHealth [25, 26], which provides new mobile health services based on the on-line, continuous monitoring of vital signs via GPRS and UMTS technologies; these services are supported by the MobiHealth Body Area Network (BAN), a wireless system that allows the simple connection of different vital sign sensors. SAPHE [27] develops a new generation of telecare networks with miniaturised wireless sensors worn on the body and integrated into homes (and ultimately offices, hospitals, etc.) to allow for intelligent, unobtrusive, continuous healthcare monitoring. The WASP project [28] (Wirelessly Accessible Sensor Populations) has made a case study on the deployment of sensor networks specifically
aimed at elderly care monitoring. SOPRANO [29] (Service-oriented Programmable Smart Environments for Older Europeans) is an Integrated Project in the European Commission’s 6th Framework Programme and aims to enable older Europeans to lead a more independent life in their chosen environment. Research within SOPRANO focuses on three pillars:
• To design the next generation of systems for ambient assisted living in Europe, including innovative context-aware, smart home environment services with natural and comfortable interfaces for older people at affordable cost.
• To set up large-scale, visible demonstrations of innovative Ambient Assisted Living systems showing their viability in European markets.
• To adapt and extend state-of-the-art Experience and Application Research methods, integrating design-for-all components and providing innovative tools to create a new, consistently user-centred design methodology.
Finally, M-POWER [30] is an IST project whose aim is to develop a middleware platform that enables rapid development of novel smart house systems in a SOA environment. The platform in particular supports: (a) integration of smart house and sensor technology; (b) interoperability between profession- and institution-specific systems (e.g. Hospital Information Systems); (c) secure and safe information management, including both social and medical information; and (d) mobile users who often change context and tools.
As a general conclusion, a lot of work has been done and is ongoing on pervasive healthcare environments. Most of this work focuses on patients with cardiovascular diseases, Alzheimer’s disease or dementia. The purpose of all these systems is to provide people with an easy-living and safely monitored environment, in such a way that they can continue to carry on with their daily activities while under the system’s “care”.
6 Acknowledgements

This chapter describes research done in various pervasive healthcare projects. Pictures describing the architectures of those systems are owned by the respective project consortia. Furthermore, the work presented for the HEARTS platform describes research carried out within the HEARTS project funded by GSRT. The authors wish to thank their fellow researchers in the HEARTS consortium.
References

[1] Sachpazidis I.: @HOME: A modular telemedicine system. In: Mobile Computing in Medicine, Second Conference on Mobile Computing in Medicine, Workshop of the Project Group MoCoMed, GMDS-Fachbereich Medizinische Informatik & GI-Fachausschuss 4, pp. 87–95 (2002), ISBN: 3-88579-344-X
[2] HEARTFAID project's web site: http://www.heartfaid.org/
[3] Domenico Conforti et al.: HEARTFAID: A Knowledge Based Platform Of Services For Supporting Medical-Clinical Management Of Heart Failure Within Elderly Population. ICMCC (2006)
[4] Colantonio, S., Martinelli, M., Moroni, D., Salvetti, O., Perticone, F., Sciacqua, A., Gualtieri, A.: An approach to decision support in heart failure. In: Semantic Web Applications and Perspectives (SWAP 2007), Proceedings, Bari, Italy, Dip. di Informatica, Universita di Bari (2007)
[5] A. Wood, G. Virone, T. Doan, Q. Cao, L. Selavo, Y. Wu, L. Fang, Z. He, S. Lin, and J. Stankovic: ALARM-NET: Wireless Sensor Networks for Assisted-Living And Residential Monitoring. Technical Report CS-2006-11, Department of Computer Science, University of Virginia (2006)
[6] CAALYX project's web site: http://www.caalyx.eu/
[7] Boulos Maged N Kamel, Rocha Artur, Martins Angelo, Vicente Manuel Escriche, Bolz Armin, Feld Robert, Tchoudovski Igor, Braecklein Martin, Nelson John, Laighin Gear, Sdogati Claudio, Cesaroni Francesca, Antomarini Marco, Jobes Angela, Kinirons Mark: CAALYX: a new generation of location-based services in healthcare. In: International Journal of Health Geographics (2007)
[8] TeleCARE project's web site: http://www.uninova.pt/~telecare/
[9] Camarinha-Matos L. M., Afsarmanesh H.: TeleCARE: Collaborative virtual elderly support communities. In: Proceedings of the 1st Workshop on Tele-Care and Collaborative Virtual Communities in Elderly Care, TELECARE, Portugal, 13 April (2004)
[10] Camarinha-Matos L. M., Castolo O., Rosas J.: A multi-agent based platform for virtual communities in elderly care. In: ETFA-2003, 9th IEEE International Conference on Emerging Technologies and Factory Automation, Lisbon, Portugal, 16-19 September (2003)
[11] CHRONIC: An Information Capture and Processing Environment for Chronic Patients in the Information Society. Final Project Report
[12] MyHeart project's web site: http://www.hitech-projects.com/euprojects/myheart/home.html
[13] Habetha J.: The MyHeart Project: Fighting cardio-vascular diseases by prevention and early diagnosis. In: Proceedings of the 28th IEEE EMBS Annual International Conference, New York City, USA, Aug 30-Sept 3 (2006)
[14] Amft O. and Habetha J.: The MyHeart project. In: L. van Langenhove (ed.), Smart Textiles for Medicine and Healthcare, pp. 275-297, Woodhead Publishing Ltd, Cambridge, England, February (2007), ISBN 1 84569 027 3
[15] OLDES project's web site: http://www.oldes.eu/press/busuoli.php
[16] SAPHIRE project's web site: http://www.srdc.metu.edu.tr/webpage/projects/saphire/index.php
[17] Gokce Banu Laleci Erturkmen, Asuman Dogac, Mehmet Olduz, Ibrahim Tasyurt, Mustafa Yuksel, Alper Okcan: SAPHIRE: A Multi-Agent System for Remote Healthcare Monitoring through Computerized Clinical Guidelines. In: Agent Technology and e-Health, Whitestein Series in Software Agent Technologies and Autonomic Computing (2008)
[18] MICAz Wireless Measurement System specification sheet, XBOW Technology Inc.
[19] Kristine Krapp (ed.): Activities of Daily Living Evaluation. Encyclopedia of Nursing & Allied Health, Gale Group, Inc. (2002)
[20] Bluetooth Core Specification 2.1, Bluetooth consortium, July (2007)
[21] IEEE Standard for Information technology: Telecommunications and information exchange between systems, Local and metropolitan area networks, Specific requirements. Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, 802.11, June (2007), Revision of IEEE Std 802.11-1999
[22] TinyOS Wireless Measurement System, www.tinyos.net
[23] HEARTS project's web site: http://www.ilsp.gr/hearts_eng.html
[24] eGadgets project's web site: http://www.extrovert-gadgets.net
[25] MobiHealth project's web site: http://www.mobihealth.org/
[26] Dimitri Konstantas et al.: MOBIHEALTH: Ambulant Patient Monitoring Over Public Wireless Networks. In: MEDICON 2004, July 31 - August 5, 2004, Island of Ischia, Naples, Italy (2004)
[27] SAPHE project's web site: http://ubimon.doc.ic.ac.uk/saphe/m338.html
[28] Atallah, L., Lo, B., Guang-Zhong Yang, Siegemund, F.: Wirelessly accessible sensor populations (WASP) for elderly care monitoring. In: Pervasive Computing Technologies for Healthcare (PervasiveHealth 2008), Jan. 30 - Feb. 1 (2008), pp. 2-7, ISBN: 978-963-9799-15-8
[29] SOPRANO project's web site: http://www.soprano-ip.org/
[30] M-POWER project's web site: http://www.sintef.no/Projectweb/MPOWER/Home/
Part IV
Human-centered Interfaces
Human-centered Computing Nicu Sebe
1 Introduction Computing is at one of its most exciting moments in history, playing an essential role in supporting many important human activities. The explosion in the availability of information in various media forms and through multiple sensors and devices means, on one hand, that the amount of data we can collect will continue to increase dramatically, and, on the other hand, that we need to develop new paradigms to search, organize, and integrate such information to support all human activities. Whether we talk about the pervasive, ubiquitous, mobile, grid, or even the social computing revolution, we can be sure that computing is impacting the way we interact with each other, the way we design and build our homes and cities, the way we learn, the way we communicate, play, or work [52][61]. Simply put, computing technologies are increasingly affecting and transforming almost every aspect of our daily lives. Unfortunately, the changes are not always positive and much of the technology we use is clunky, unfriendly, unnatural, culturally biased, and difficult to use. As a result, several aspects of daily life are becoming increasingly complex and demanding. We have access to huge amounts of information, much of which is irrelevant to our own local socio-cultural context and needs or is inaccessible because it is not available in our native language, we cannot fully utilize the existing tools to find it, or such tools are inadequate or nonexistent. Thanks to computing technologies, our options for communicating with others have increased, but that does not necessarily mean that our communications have become more efficient. Furthermore, our interactions with computers remain far from ideal and too often only literate,
Nicu Sebe University of Amsterdam, The Netherlands, e-mail:
[email protected] & University of Trento, Italy, e-mail:
[email protected] H. Nakashima et al. (eds.), Handbook of Ambient Intelligence and Smart Environments, DOI 10.1007/978-0-387-93808-0_13, © Springer Science+Business Media, LLC 2010
349
350
Nicu Sebe
educated individuals who invest significant amounts of time in using computers can take direct advantage of what computing technologies have to offer. The goal of this chapter is to discuss the most important aspects of HCC covering the existing challenges (Section 2.1), the different possible definitions of HCC reflecting the wide expertise of the researchers working in this field, the scope of HCC, and the connection with related research areas (Sections 2.2, 2.3, and 2.4, respectively). We also present the areas of research covered by HCC and we focus on three main directions: production, analysis, and interaction (Section 3). We then present the possible applications (Section 4) and we discuss the issues of integrating the existing systems and applications into real world conditions (Section 5). Finally, we propose a research agenda for HCC (Section 6).
2 What is Human-centered Computing? Human Centered Computing (HCC) is an emerging field that aims at bridging the existing gaps between the various disciplines involved with the design and implementation of computing systems that support human’s activities. HCC aims at tightly integrating human sciences (e.g. social and cognitive) and computer science (e.g. human-computer interaction (HCI), signal processing, machine learning, and ubiquitous computing) for the design of computing systems with a human focus from the beginning to the end. This focus should consider the personal, social, and cultural contexts in which such systems are deployed [57]. Beyond being a meeting place for existing disciplines, HCC also aims at radically changing computing with new methodologies to design and build systems that support and enrich people’s lives.
2.1 Gateways and Barriers We could argue that knowledge and communications are two of the main pillars of any society. As information of all kinds increasingly forms part of the digital ecosystem, the computing technologies we develop become, paradoxically, both the gateways to all kinds of resources and the barriers to access them. Therefore, in addition to having become critical in improving lives, computing is now essential to the livelihood not only of the relatively few of us who have the necessary skills and access to resources, but potentially for everyone [8][38]. For the most part, however, the computing community designs and implements computing algorithms and technologies without fully taking into account our cognitive abilities, the ways we perceive and handle information, go about our work and life, create and maintain our social relations, or use our cultural context. That is, researchers and engineers often develop computing technologies in relative isolation.
Human-centered Computing
351
Most current methodologies start with an idea that builds on or improves existing technologies or that solves problems in a particular technological domain, largely ignoring human issues. The obvious outcomes are more powerful, less expensive computers that are more difficult to use, in some cases even effectively slower and less accessible to most of the world’s population. Given the current rate of penetration of computing technologies (directly or indirectly) in almost every imaginable human activity, it is clear that the existing computing research and development models are no longer adequate. People who adapt more quickly to technology (whether the technology is good or bad) have greater opportunities to benefit at many levels. Developing technologies in relative human isolation does not contribute to alleviating the problems of wealth distribution, sustainability, and access to healthcare and education. Technologies that are difficult to use not only waste our time and make our lives worse, they make access to important resources more difficult, particularly for people who do not have the education or language skills of the minority who develop those technologies. From this perspective, the current path of development of computer technologies is not only detrimental but also dangerous because it contributes to increasing the gap between the educated and uneducated, and between the rich and the poor. The problem is even greater if we consider that the difficulties in using much of the technology available today are leading to a rapidly increasing content gap - most digital information is produced in developed countries, from particular cultural perspectives, and in only a few languages. To make things worse, access to the information itself is through technology developed without considering the local socio-cultural context of the majority of the world’s population. Given the developments in recent years in decreased computing costs, increased wireless communications, and the spread of the Internet, the time is ripe for a major shift in the computing revolution. In our opinion, the goal of human-centered computing is to focus the computing revolution on addressing human abilities and needs.
2.2 Definitions In the last few years, many definitions of HCC have emerged. In general, the term HCC is used as an umbrella to embrace a number of definitions which were intended to express a particular focus or perspective [29]. In 1997, the U.S. National Science Foundation supported a workshop on Human-Centered Systems [19], which included position papers from 51 researchers from various disciplines and application areas including electronics, psychology, medicine, and the military. Participants proposed various definitions for HCC, including the following (see [19]): • HCC is “a philosophical-humanistic position regarding the ethics and aesthetics of the workplace”; • a HCC system is “any system that enhances human performance”;
352
Nicu Sebe
• a HCC system is “any system that plays any kind of role in mediating human interactions”; • HCC is “a software design process that results in interfaces that are really userfriendly”; • HCC is “a description of what makes for a good tool - the computer does all the adapting”; • HCC is “an emerging inter-discipline requiring institutionalization and special training programs”. Other definitions of HCC have also appeared in the literature: • According to [20], HCC is “the science of designing computations and computational artifacts in support of human endeavors”; • For Canny et al. [10], HCC is “a vision for computing research that integrates technical studies with the broad implications of computing in a task-directed way. HCC spans computer science and several engineering disciplines, cognitive science, economics and social sciences.” So, what is really HCC? HCC research focuses on all aspects of humanmachine integration: humans with software, humans with hardware, humans with workspaces, humans with humans, as well as aspects of machine-machine interaction (e.g., software agents) if they impact the total performance of a system intended for human use. The HCC vision inherits all of the complexity of software engineering and systems integration with the human in the loop, plus the additional complexity of understanding and modeling human-human and human-computer interaction in the context of the working environment [13]. HCC recognizes the fact that the design and the use of new information processing tools must be couched in system terms [19]. That is, the humans and the information processing devices are regarded as a coupled, co-adapting system nested within a context in which they are functional. Furthermore, the design of systems must regard the human as just one aspect of a larger and dynamic context, including the team, organization, work environment, etc. This means that the fundamental unit of analysis is not the machine, nor the human, nor the context and domain of work, but the triple including all three [19]. In this context, according to Hoffman [28], HCC can be defined as “the development, evaluation, and dissemination of technology that is intended to amplify and extend the human capabilities to: • perceive, understand, reason, decide, and collaborate; • conduct cognitive work; • achieve, maintain, and exercise expertise.” Inherently, HCC research is regarded as being interdisciplinary, as illustrated by the participation of experts from a wide range of fields, including computer, cognitive, and social sciences in defining HCC and its scope. Based on these observations we adopt the following definition: Human-Centered Computing, more than being a field of study, is a set of methodologies that apply to any field that uses computers, in any form, in applications in which humans directly interact with devices or systems that use computer technologies.
Human-centered Computing
353
2.3 Scope of HCC One way to identify the scope of HCC is to examine new and existing initiatives in research and education. Given the trends that indicate that computing is permeating practically all areas of human activity, HCC has received increasing attention in academia as a response to a clear need to train professionals and scholars in this domain. A number of initiatives have appeared in the past years. Examples of the growing interest in HCC are the HCC doctoral program at Georgia Tech [27], the HCC consortium at the University of California, Berkeley [26], and the Institute of Human and Machine Cognition, Florida [30], to mention just a few. The goals are ambitious. For instance, Georgia Tech’s interdisciplinary Ph.D. program, created in 2004, “aims to bridge the gap between technology and humans by integrating concepts from anthropology, cognitive sciences, HCI, learning sciences and technology, psychology and sociology with computing and computer sciences [. . .] with the explicit goal of developing theory and experimentation linking human concerns and computing in all areas of computing, ranging from the technical-use focus of programming languages and API designs and software engineering tools and methodologies to the impacts of computing technology on individuals, groups, organizations, and entire societies” [20]. The program emphasizes, on one hand, a deep focus on “theoretical, methodological, and conceptual issues associated with humans (cognitive, biological, socio-cultural); design; ethics; and analysis and evaluation”, and, on the other hand, “design, prototyping, and implementation of systems for HCC”, with a focus on building interactive system prototypes. Conceiving computing as the “infrastructure around a human activity”, the UC, Berkeley HCC consortium is “a multidisciplinary program designed to guide the future development of computing so as to maximize its value to society” [10]. The initiative added a socio-economic dimension to the domain: “the great economic [computing] advances we have seen are undermined because only a fraction of the population can fully use them. Better understanding of human cognition is certainly needed, but also of the social and economic forces that ubiquitous computing entails.” While “extremely broad from a disciplinary perspective”, the program is “tightly focused on specific applications.” Issues covered in this initiative include “understanding humans as individuals, understanding humans as societies, computational models of behavior, social and cultural issues (diversity, culture, group dynamics, and technological change), economic impacts of IT, human-centered interfaces, human-centered applications, and human- centered systems.” The following is a list of HCC topics including the ones listed by the U.S. National Science Foundation in their HCC computing cluster (see [48]): • Systems for problem-solving by people interacting in distributed environments. For example, in Internet-based information systems, in sensor-based information networks, and mobile and wearable information appliances;
354
Nicu Sebe
• Multimedia and multimodal interfaces in which combinations of images and video, speech, text, graphics, gesture, touch, sound, etc. are used by people to communicate with one another; • Intelligent interfaces and user modeling, information visualization, and adaptation of content to accommodate different display capabilities, modalities, bandwidth and latency; • multi-agent systems that control and coordinate actions and solve complex problems in distributed environments in a wide variety of domains, such as ecommerce, medicine, or education; • Models for effective computer-mediated human-human interaction under a variety of constraints, (e.g., video conferencing, collaboration across high vs. low bandwidth networks, etc.); • Collaborative systems that enable knowledge-intensive and dynamic interactions for innovation and knowledge generation across organizational boundaries, national borders, and professional fields; • Methods to support and enhance social interaction, including innovative ideas like social orthotics, affective computing, and experience capture; • Social dynamics modeling and socially aware systems. In terms of applications, the possibilities are endless if we wish to design and deploy computers using HCC methodologies. If we view computing as a large space with the human in the center, we can certainly identify some applications/fields that are closer to the human and some that are further away. Using an extreme example, packet switching is very important in communications, but its distance from a human is much larger, than, for instance, human-computer interaction. As computers become more pervasive, the areas that are closer to humans increase in number. If we view this historically, it is clear that computers are increasingly getting - physically, conceptually, and functionally - closer to humans (think of the first computers, those used in the 1950s and the current mobile devices), motivating the need for well-defined methodologies and philosophies for HCC. Both HCC’s scope and its possibilities are wide, but in our view, three factors form the core of HCC system and algorithm design processes: • augmenting or taking into account individual human abilities and limitations, • social and cultural awareness, and • adaptability across individuals and specific situations. If we consider these factors in the design of systems and algorithms, HCC applications should exhibit the following qualities: • integrate input from different types of sensors and communicate through a combination of media as output, • act according to the social and cultural context in which they are deployed, and • be useful to diverse individuals in their daily life.
Human-centered Computing
355
2.4 HCC, HCI, CSCW, User-centered Design, and Human Computation Many important contributions to HCC have been made in fields such as HCI, computer-supported cooperative work (CSCW), user-centered design, cognitive psychology, sociology, anthropology, and others. User-centered design can be characterized as a multistage problem-solving process that requires designers not only to analyze and foresee how users are likely to use an interface but also to test the validity of their assumptions with regard to user behavior in the real world [41]. Traditional design approaches often include heuristic evaluations and measurements of productivity and efficiency. CSCW combines the understanding of the way people work in groups with the enabling technologies of computer networking and associated hardware, software, services, and techniques [25]. In contrast, HCC covers more than the traditional areas of usability engineering, HCI, and human factors, which are primarily concerned with user interfaces or user interaction. HCC “incorporates the learning, social, and cognitive sciences and intelligent systems areas more closely than traditional HCI” [20]. According to [26], compared to HCI, the shift in perspective with HCC is twofold: • HCC is “conceived as a theme that is important for all computer-related research, not as a field which overlaps or is a sub-discipline of computer science”; • The HCC view acknowledges that “computing connotes both concrete technologies (that facilitates various tasks) and a major social and economic force.” Interactive evolutionary computation, an interesting technique first developed in the early 1990s and more recently known as human-based computation [22][3][2], also puts the human at the center, but in a different way. In traditional computation, a human provides a formalized problem description to a computer and receives a solution to interpret. In human-based computation, the roles are reversed: The computer asks a person or a large number of people to solve a problem, then collects, interprets, and integrates their solutions. In other words, the computer “asks” the human to do the work it cannot do. This is done as a way to overcome technical difficulties: Instead of trying to get computers to solve problems that are too complex, use human beings. In human computation, the human is helping the computer solve difficult problems, but in HCC, the computer helps humans maximize their abilities regardless of the situation - the human need not be in front of a computer performing a computing task when dealing with the computer. Although HCC and human-based computation approach computing from two different perspectives, they both try to maximize the synergy between human abilities and computing resources. Work in human-based computation can therefore be of significant importance to HCC. On one hand, data collected through human-based computation systems can be valuable for developing machine-learning models. On the other hand, it can help us better understand human behavior and abilities, again of direct use in HCC algorithm development and system design.
356
Nicu Sebe
HCC researchers also bring a diverse array of conceptual and research tools to traditional computing areas. HCC is not just about the interaction, the interface, or the design process. It is about knowledge, people, technology, and everything that ties them together. This includes how the technology is actually used and in what context. Furthermore, HCC involves both the creation of theoretical frameworks and the design and implementation of technical approaches and systems in many areas. Additionally, Dertouzos [16] points out that HCC goes well beyond user-friendly interfaces. This is because HCC uses five technologies in a synergistic way: natural interaction, automation, individualized information access, collaboration, and customization. In considering the scope, application areas, and qualities of HCC systems, much can be learned from fields and disciplines related to HCC. As we will see in the next section, this leads us directly to focus on three core areas for Human-Centered Computing: production, analysis, and interaction.
3 Areas of Human-centered Computing The distinctions between computing and multimedia computing are blurring very quickly. In spite of this, we can identify the main human-centered activities as follows [32]: media production, annotation,organization, archival, retrieval, sharing, analysis, and communication. The above activities can in turn be clustered into three large activity areas: production, analysis, and interaction. These three areas are proposed here as a means to facilitate the discussion of the scope of HCC in the remainder of the paper. However, it must be evident that other ways of describing HCC are possible, and that the three proposed areas are interdependent in more than one way. Consider for example two typical multimedia scenarios which are representative to illustrate the HCC aspects. In the first one, post-production techniques of non-edited home video normally use interaction (via manual composition of scenes), and increasingly rely on automatic analysis (e.g. shot boundary detection). In the second scenario, analysis techniques (e.g. automatic annotation of image collections) increasingly use metadata generated at production time (both automatic like time, date, camera type, etc. and human-produced via ‘live’ descriptions, commentaries, etc.), while performance can be clearly improved through various forms of interaction (e.g., via partial manual annotation, through active learning, etc.). In addition to this, further knowledge about humans could be introduced both at the individual level (in the first scenario, by adapting “film rules” to personalizing the algorithms, e.g. preferred modes of shooting a scene), and at the social level (in the second scenario, by using social context, provided for instance from a photo sharing website to improve annotation prediction for better indexing). Each of the three main areas is discussed in the following sections, emphasizing social and cultural issues, and the integration of sensors and multiple media for
Human-centered Computing
357
system design, deployment, and access. Note that we use here interchangeably HCC and Human-center Multimedia (HCM) as the latter stresses the multimodal aspect of the former.
3.1 Multimedia Production The first activity area in HCM is multimedia production, i.e. the human task of creating media (e.g., photographing, recording audio, combining, remixing, etc.). Although media can be produced automatically without human intervention once a system is set up (e.g., video from surveillance cameras), in HCM we are concerned with all aspects of media production which directly involve humans. Without a doubt, social and cultural differences result in differences in content at every level of the content production chain and at every level of the content itself (e.g., see discussions on the content pyramid in [32][33][17]). This occurs from low-level features (e.g., colors have strong cultural interpretations) to high-level semantics (e.g., consider the differences in communication styles between Japanese and American business people). A good example of how cultural differences determine the characteristics of multimedia content is the TREC Video Retrieval Evaluation (TRECVID) set [47]. In news programs in some Middle Eastern countries there are mini-soap segments between news stories. Furthermore, the direction of text banners differs depending on language, and the structure of the news itself varies from country to country. The cultural differences are even greater in movies: colors, music, and all kinds of social and cultural signals convey the elements of a story (consider the differences between Bollywood and Hollywood movies in terms of colors, music, story structure, and so on). Content is knowledge and vice versa: in HCM systems, cultural and social factors should ideally be (implicitly or explicitly) embedded in media at production time. In addition, HCM production systems should consider cultural differences and be designed according to the cultural context in which they will be deployed. As a simple example, a system for news editing in the Middle East might, by default, animate banners from left to right, or have special functions to distribute mini-soap segments across a newscast. HCM production systems should also consider human abilities. For example, many of the systems for continuous recording of personal experiences [11] are meant to function as memory prosthesis, so one of their main goals is to record events with levels of detail (e.g., video, audio) that humans are incapable of recording (or rather, recalling). Interestingly, for these types of applications, integration of different types of sensors is also important, as is their adaptability, to each individual and particular context (a “user” of such system would probably not want everything to be recorded). Unfortunately, in terms of culture, the majority of tools for content production follow a standard Western model, catering to a small percentage of the world’s popu-
358
Nicu Sebe
lation and ignoring the content gap (see the World Summit Award - http://www.wsisaward.org - which is an initiative to create an awareness of this gap) [8][38]. In terms of integration of multiple sensors, there has been some progress (e.g., digital cameras with GPS information or audio annotation, among others), but the issue of adaptability in content production systems has been seldom addressed. The result is that in spite of the tremendous growth in the availability of systems for media production, the field is in its infancy if we think of the population as a whole (relatively speaking, few people have access to computers, and out of those that do, even fewer can easily produce structured multimedia content with the current available tools).
3.2 Multimedia Analysis A second activity area of great importance in HCM is automatic analysis of multimedia content. As described above, automatic analysis can be integrated in production systems (e.g., scene cut detection in video editing software). Interestingly, it can also alleviate some of the limitations in multimedia production because automatic analysis can be used to give content structure (e.g., by annotating it), increasing its accessibility. This has application in many Human-Centered areas (e.g., broadcast video, home video, data mining of social media, web search, etc.). An interesting HCM application that has emerged in recent years is the automatic analysis of human activities and social behavior. Automatic analysis of social interaction finds a number of potentially relevant uses, from facilitating and enhancing human communication (on-line), to allowing for improved information access and retrieval (off-line), in the professional, entertainment, and personal domains. Social interaction is inherently multimodal, and often recorded in multimedia form (e.g., video and information from other sensors). Unlike the traditional HCI view, which emphasizes communication between a person and a computer, the emphasis of an emerging body of research has shifted towards the study of computational models of human-to-human communication in natural situations. Such research has appeared in various communities under different names (social computing, socially-aware computing, computers-in-the-human-interaction-loop, etc.) [61]. Such interest has been boosted by the increasing capacity to acquire media with both fixed and mobile sensors and devices, and also to the ability to record and analyze largescale social activities through the internet (media sharing sites, blogs, etc). Social context can be provided through the understanding of patterns that emerge from human interaction at various temporal, spatial, and social scales, ranging from short-duration, face-to-face social signals and behaviors exchanged by peers or groups involving a few people, including interest, attraction [23], to mid-duration relations and roles that people play within groups, like influence and dominance [64], to group dynamics and trends that often emerge over extended periods of times, including degree of group membership, social network roles, group alliances, etc. [52].
Human-centered Computing
359
There is no doubt of the importance of considering social and cultural differences in the design and application of algorithms and systems for multimedia analysis of human activities and social behavior. In turn, culture-specific knowledge should also be used in designing automatic analysis algorithms of multimedia in other domains in order to improve performance. In the news example from Section 3.1, an automatic technique designed for the US news style is likely to yield low performance when applied to a Middle Eastern newscast. Augmenting or considering human abilities is also clearly beneficial because as argued earlier, there is a tight integration between the three activity areas we are considering, thus, what analysis algorithms are designed to do has a direct impact on how humans use multimedia data. The benefit of integrating multiple sensors is clear in the analysis of human activities (e.g., using input from RFID tags gives us information not easily attainable from video), as is the adaptability of HCM analysis systems to specific collections, needs, or particular tasks.
3.3 Multimedia Interaction In addition to understanding the subjacent human tasks, the understanding of the multiplicity of forms that interaction can take is of particular importance for multimedia research within the HCM paradigm. In other words: it is paramount to understand both how humans interact with each other and why, so that we can build systems to facilitate such communication and so that people can interact with computers (or whatever devices embed them) in natural ways. We illustrate this point with three cases. In face-to-face communication, interaction is physically located and real-time. Concrete examples include professional settings like interviews, group meetings, and lectures, but also informal settings, including peer conversations, social gatherings, traveling, etc. The media produced in many of these situations might be in multiple modalities (voice, images, text, data from location, proximity, and other sensors), be potentially very rich, and often unedited (raw content). In a second case, live computer-mediated communication - ranging from the traditional teleconferencing and remote collaboration paradigms to emerging ubiquitous approaches based on wearable devices - is physically remote but remains real-time. In this case, the type of associated media will often be more limited or pre-filtered compared to the face-to-face case, due to bandwidth constraints. A final case corresponds to non-real time computer-mediated communication - including for instance SMS, mobile picture sharing, e-mail, blogging, media sharing sites, etc. - where, due to their own nature, media will often be edited, and interaction will potentially target larger, physically disjoint groups. Unlike in traditional HCI applications (a single user facing a computer and interacting with it via a mouse or a keyboard), in the new applications (e.g., intelligent homes [1], remote collaboration, arts, etc.), interactions are not always explicit commands, and often involve multiple users. This is due in part to the remarkable progress in the last few years in computer processor speed, memory, and storage
360
Nicu Sebe
capabilities, matched by the availability of many new input and output devices that are making ubiquitous computing [63] a reality. Devices include phones, embedded systems, PDAs, laptops, wall size displays, and many others. The wide range of computing devices available, with differing computational power and input/output capabilities, means that the future of computing is likely to include novel ways of interaction and for the most part, that interaction is likely to be multimodal. Some of the modes of communication include gestures [50], speech [53], haptics [5], eye blinks [24], and many others. Glove mounted devices [7] and graspable user interfaces [18], for example, seem now ripe for exploration. Pointing devices with haptic feedback, eye tracking, and gaze detection [31] are also currently emerging. As in human-human communication, however, effective communication is likely to take place when different input devices are used in combination. Given these trends, we view the interaction activity area of HCM as Multimodal interaction (see [34] and [35] for recent reviews). Clearly, one of the main goals of a HCC approach to interaction is to achieve natural interaction, not only with computers as we think of them today (i.e., machines on a desk), but rather with our environment, and with other people. Inevitably, this implies that we must consider culture because the way we generate signals and interpret symbols depends entirely on our cultural background. Multimedia systems should therefore use cultural cues during interaction [32] (such as a cartoon character bowing when a user initiates a transaction at an ATM). Although intuitively this makes sense, the majority of work in multimedia interaction assumes a one-size-fits-all model, in which the only difference between systems deployed in different parts of the world (or using different input data) is language. The spread of computing under the language-only difference model means people are expected to adapt to the technologies imposed arbitrarily using Western thought models. Clearly, these unfortunate trends are also due to social and economic factors, but as computing spreads beyond the desktop, researchers and developers are recognizing the importance of rethinking what we could call the “neutral culture syndrome” where it is erroneously believed that current computing systems are not culture specific. In order to succeed, HCM interaction systems must be designed considering cultural differences and social context so that natural interaction can take place. This will inevitably mean that most HCM systems should embrace multimodal interaction, because multimodal systems open the doors to natural communication and to the possibility of adapting to particular users. Of course, integration of multiple sensors and adaptability are essential in HCM interaction. We describe some examples in the next section.
4 Applications The range of application areas for HCC touches on many aspects of computing, and as computing becomes more ubiquitous, practically every aspect of interaction with objects, and the environment, as well as human-human interaction (e.g., remote
Human-centered Computing
361
collaboration, etc.) will make use of HCC techniques. In the following sections, we describe briefly specific application areas in which interesting progress has been made (for more details see [35]).
4.1 Human Spaces Computing is expanding beyond the desktop, integrating with everyday objects in a variety of scenarios. As our discussions show, this implies that the model of user interface in which a person sits in front of a computer is no longer the only model. One of the implications of this is that the actions or events to be recognized by the “interface” are not necessarily explicit commands. In smart conference room applications, for instance, multimodal analysis has been applied mostly for video indexing [45] (see [52] for a social analysis application). Although such approaches are not meant to be used in real-time, they are useful in investigating how multiple modalities can be fused in interpreting communication. It is easy to foresee applications in which “smart meeting rooms” actually react to multimodal actions in the same way intelligent homes should [1]. Projects in the video domain include MVIEWS [12], a system for annotating, indexing, extracting, and disseminating information from video streams for surveillance and intelligence applications. An analyst watching one or more live video feeds is able to use pen and voice to annotate the events taking place. The annotation streams are indexed by speech and gesture recognition technologies for later retrieval, and can be quickly scanned using a timeline interface, then played back during review of the film. Pen and speech can also be used to command various aspects of the system, with multimodal utterances such as “Track this” or “If any object enters this area, notify me immediately.”
4.2 Ubiquitous Devices The recent drop in costs of hardware has led to an explosion in the availability of mobile computing devices. One of the major challenges is that while devices such as PDAs and mobile phones have become smaller and more powerful, there has been little progress in developing effective interfaces to access the increased computational and media resources available in such devices. mobile devices, as well as wearable devices, constitute a very important area of opportunity for research in HCC because natural interaction with such devices can be crucial in overcoming the limitations of current interfaces. Several researchers have recognized this, and many projects exist on mobile and wearable HCC [9][21][51].
362
Nicu Sebe
4.3 Users with Disabilities People with disabilities can benefit greatly from HCC technologies [42]. Various authors have proposed approaches for smart wheel-chair systems which integrate different types of sensors. The authors of [55] introduce a system for presenting digital pictures non-visually (multimodal output), and the techniques in [24] can be used for interaction using only eye blinks and eye brow movements. Some of the approaches in other application areas (e.g., [9][60]) could also be beneficial for people with disabilities.
4.4 Public and Private Spaces In this category we place applications implemented to access devices used in public or private spaces. One example of implementation in public spaces is the use of HCC in information kiosks [40][54]. These are challenging applications for natural multimodal interaction: the kiosks are often intended to be used by a wide audience, thus there may be few assumptions about the types of users of the system. On the other hand, there are applications in private spaces. One interesting area is that of implementation in vehicles [39][58][59]. This is an interesting application area due to the constraints: since the driver must focus on the driving task, traditional interfaces (e.g., GUIs) are not so suitable. Thus, it is an important area of opportunity for HCC research, particularly because depending on the particular deployment, vehicle interfaces can be considered safety-critical.
4.5 Virtual Environments Virtual and augmented reality has been a very active research area at the crossroads of computer graphics, computer vision, and human-computer interaction. One of the major difficulties of VR systems is the interaction component, and many researchers are currently exploring the use of interaction analysis techniques to enhance the user experience. One reason this is very attractive in VR environments is that it helps disambiguate communication between users and machines (in some cases virtual characters, the virtual environment, or even other users represented by virtual characters [46]).
4.6 Art Perhaps one of the most exciting application areas of HCC is art. Vision techniques can be used to allow audience participation [44] and influence a performance. In
Human-centered Computing
363
[62], the authors use multiple modalities (video, audio, pressure sensors) to output different “emotional states” for Ada, an intelligent space that responds to multimodal input from its visitors. In [43], a wearable camera pointing at the wearer’s mouth interprets mouth gestures to generate MIDI sounds (so a musician can play other instruments while generating sounds by moving his mouth). In [49], limb movements are tracked to generate music. HCC can also be used in museums to augment exhibitions [49].
4.7 Other Other applications include education, remote collaboration, entertainment, robotics, surveillance, or biometrics. HCC can also play an important role in safety-critical applications (e.g., medicine, military, etc.) and in situations in which a lot of information from multiple sources has to be viewed in short periods of time. A good example of this is crisis management.
5 Integrating HCC into a Real (Human) World Human-Centered Computing systems and applications should ultimately be integrated in a world that is complex and rapidly evolving. For instance, computing is migrating from the desktop, at the same time as the span of users is expanding dramatically to include people who would not normally access computers. This is important because although in industrialized nations almost everyone has a computer, a small percentage of the world’s population owns a multimedia device (millions still do not have phones). The future of multimedia, therefore, lies outside the desktop, and multimedia will become the main access mechanism to information and services across the globe. Integration of modalities and media, of access mechanisms, and of resources constitute three key-issues for the creation of future HCC systems. We discuss each of these issues in the following subsections (for more details see [32]).
5.1 Integrating Modalities and Media Despite great efforts in the multimedia research community, integrating multiple media (in production, analysis, and interaction) is still in its infancy. Our ability to communicate and interpret meanings depends entirely on how multiple media are combined (such as body pose, gestures, tone of voice, and choice of words), but most research on multimedia focuses on a single medium model. In the past, interaction concerns have been left to researchers in HCI and the scope of work on interaction
364
Nicu Sebe
within the multimedia community has focused mainly on image and video browsing. Multimedia, however, includes many types of media and, as evidenced by many projects developed in the arts, multimedia content is no longer limited to audiovisual materials. Thus, we see interaction with multimedia data not just as an HCI problem, but as a multimedia problem. Our ability to interact with a multimedia collection depends on how the collection is indexed, so there is a tight integration between analysis and interaction. In fact, in many multimedia systems we actually interact with multimedia information and want to do it in a multimodal way. Two major research challenges are modeling the integration of multiple media in analysis, production, and multimodal interaction. Statistical techniques for modeling are a promising approach for certain types of problems. For instance, Dynamic Bayesian Networks have been successfully applied in a wide range of problems that have a time component, while sensor fusion and classifier integration in the artificial intelligence community have also been active areas of research. In terms of content production, we do not have a good understanding of the human interpretation of the messages that a system sends when multiple media are fused and there is much we can learn from the arts and communication psychologists. Because of this lack of integration, existing approaches suit only a small subset of the problems and more research is needed, not only on the technical side, but also on understanding how humans actually fuse information for communication. This means making stronger links between fields like neuroscience, cognitive science, and multimedia development. For instance, exploring the application of Bayesian frameworks to integration [4], investigating different modality fusion hypothesis [15] (discontinuity, appropriateness, information reliability, directed attention, and so on), or investigating stages of sensory integration [56] can potentially give us new insights that lead to new technical approaches. Without theoretical frameworks on integrating multiple sensors and media, we are likely to continue working on each modality separately and ignoring the integration problem, which should be at the core of multimedia research.
5.2 Integrating Access Everyone seems to own a mobile device. As a consequence, there is a new wave of portable computing, where a cell phone is a fully functional computer that we can use to communicate, record, and access a wealth of information (such as locationbased, images, video, personal finances, and contacts). Although important progress has been made, particularly in ambient intelligence applications [1] and in the use of metadata from mobile devices [6][14], much work needs to be done and one of the technical challenges is dealing with large amounts of information effectively in real time. Developing effective interaction techniques for small devices is one of our biggest challenges because strong physical limitations are in place. In the past, we assumed the desktop screen was the only output channel, so advances in mobile devices are completely redefining multimedia applications. But mobile devices are
Human-centered Computing
365
used for the entire range of human activities: production, annotation, organization, retrieval, sharing, communication, and content analysis.
5.3 Integrating Resources While mobile phone sales are breaking all records, it is increasingly common for people to share computational resources across time and space. Public multimedia devices are becoming increasingly common. In addition, it is important to recog˚ particularly in developing countries U ˚ sharing of resources is often the nize that U only option. Many projects for sharing community resources exist, particularly for rural areas, in education and other important activities. One of the main technical research challenges here is constructing scalable methods of multimodal interaction that can quickly adapt to different types of users, irrespective of their particular communication abilities. The technical challenges in these two cases seem significantly different: mobile devices should be personalized, while public systems should be general enough to be effective for many different kinds of users. Interestingly, however, they both fall under the umbrella of ubiquitous multimedia access: the ability to access information anywhere, anytime, on any device. Clearly, for these systems to succeed we need to consider cultural factors (for example, text messaging is widespread in Japan, but less popular in the US), integration of multiple sensors, and multimodal interaction techniques. In either case, it is clear that new access paradigms will dominate the future of computing and ubiquitous multimedia will play a major role. Ubiquitous multimedia systems are the key in letting everyone access a wide range of resources critical to economic and social development.
6 Research Agenda for HCC To summarize the major points that we have presented so far, human-centered computing systems should be multimodal (inputs and outputs in more than one modality or communication channel), they must be proactive (understand cultural and social contexts and respond accordingly), and be easily accessible outside the desktop to a wide range of users (i.e., adaptable) (see Section 2.3 and [32]). A human-centered approach to multimedia will consider how humans understand and interpret multimedia signals (feature, cognitive, and affective levels), and how humans interact naturally (the cultural and social contexts as well as personal factors such as emotion, mood, attitude, and attention). Inevitably, this means considering some of the work in fields such as neuroscience, psychology, cognitive science, and others, and incorporating what is known in those fields within computational frameworks that integrate different media.
366
Nicu Sebe
Research on machine learning integrated with domain knowledge, automatic analysis of social networks, data mining, sensor fusion research, and multimodal interaction will play a special role [52]. Further research into quantifying humanrelated knowledge is necessary, which means developing new theories (and mathematical models) of multimedia integration at multiple levels. We believe that a research agenda on HCC will involve the following nonexhaustive list of goals [36][37]: • New human-centered methodologies for the design of models and algorithms and the development of systems in each of the areas discussed previously. • Focused research on the integration of multiple sensors, media, and human sciences that have people as the central point. • New interdisciplinary academic and industrial programs, initiatives, and meeting opportunities. • Discussions on the impact of multimedia technology that include the social, economic, and cultural contexts in which such technology is or might be deployed. • Research data that reflect the human-centered approach, e.g., data collected from real social situations, data that is rich - multisensorial and culturally diverse. • Common computing resources on HCM (e.g. software tools and platforms). • Evaluation metrics for theories, design processes, implementations, and systems from a human-centered perspective; and • Methodologies for privacy protection and the consideration of ethical and cultural issues. Human-centered approaches have been the concern of several disciplines [19] but, as pointed out in Section 1, they have been often undertaken in separate fields. The challenges and opportunities in the field of multimedia are great not only because so many of the activities in multimedia are human-centered, but also because multimedia data itself is used to record and convey human activities and experiences. It is only natural, therefore, for the field to converge in this direction and play a key role in the transformation of technology to truly support people’s activities.
7 Conclusions In this chapter, we gave an overview of HCC from a several perspectives. We described the three main areas of Human-Centered Computing emphasizing social and cultural issues, and the integration of sensors and multiple media for system design, deployment, and access. A research agenda for HCC (see also [32][37]) was presented. Many technical challenges lie ahead and in some areas progress has been slow. With the cost of hardware continuing to drop and the increase in computational power, however, there have been many recent efforts to use HCC technology in en-
Human-centered Computing
367
tirely new ways. One particular area of interest is new media art. Many universities around the world are creating new joint art and computer science programs in which technical researchers and artists create art that combines new technical approaches or novel uses of existing technology with artistic concepts. In many new media art projects, technical novelty is introduced while many HCC issues are considered: cultural and social context, integration of sensors, migration outside the desktop, and access. Technical researchers need not venture into the arts to develop human-centered computing systems. In fact, in recent years many human-centered multimedia applications have been developed within the multimedia domain (such as smart homes and offices, medical informatics, computerguided surgery, education, multimedia for visualization in biomedical applications, education, and so on). However, more efforts are needed and the realization that multimedia research, except in specific applications, is meaningless if the user is not the starting point. The question is whether multimedia research will drive computing (with all its social impacts) in synergy with human needs, or it will be driven by technical developments alone.
References [1] Aarts E (2004) Ambient intelligence: A multimedia perspective. IEEE Multimedia 11(1):12–19 [2] von Ahn L (2006) Games with a purpose. IEEE Computer pp 96–98 [3] von Ahn L, Dabbish L (2008) General techniques for designing games with a purpose. Communications of the ACM pp 58–67 [4] Andersen T, Tiippana K, Sams M (2004) Factors influencing audiovisual fission and fusion illusions. Cognitive Brain Research 21:301–308 [5] Benali-Khoudja M, Hafez M, Alexandre JM, Kheddar A (2004) Tactile interfaces: A state-of-the-art survey. In: Int. Symposium on Robotics [6] Boll S (2005) Image and video retrieval from a user-centered mobile multimedia perspective. In: Int’l Conf. Image and Video Retrieval [7] Borst C, Volz R (2005) Evaluation of a haptic mixed reality system for interactions with a virtual control panel. Presence: Teleoperators and Virtual Environments 14(6) [8] Brewer E (2005) The case for technology for developing regions. IEEE Computer 38(6):25–38 [9] Brewster S, Lumsden J, Bell M, Hall M, Tasker S (2003) Multimodal ‘eyesfree’ interaction techniques for wearable devices. In: ACM CHI [10] Canny J (2001) Human-center computing. Tech. rep., UC Berkeley HCC Retreat [11] CARPE (2005) ACM workshop on capture, archival, and retrieval of personal experiences (carpe) [12] Cheyer A, Julia L (1998) MVIEWS: Multimodal tools for the video analyst. In: Intelligent User Interfaces
368
Nicu Sebe
End-user Customisation of Intelligent Environments
Jeannette Chin, Victor Callaghan and Graham Clarke
1 Introduction

One of the striking aspects of the World Wide Web is how it has empowered ordinary non-technical people to participate in a digital revolution by transforming the way services such as shopping, education and entertainment are offered and consumed. The proliferation of networked appliances, sensors and actuators, such as those found in digital homes, heralds a similar 'sea change' in the capabilities of ordinary people to customise and utilise the electronic spaces they inhabit. By coordinating the actions of networked devices or services, it is possible for the environment to behave in a holistic and reactive manner to satisfy the occupants' needs, creating an intelligent environment. Further, by deconstructing traditional home appliances into sets of more elemental network-accessible services, it is possible to reconstruct either the original appliance or new user-defined appliances by combining basic network services in novel ways, creating a so-called virtual appliance. This principle can be extended to decompose and re-compose software applications, allowing users to create their own bespoke applications. Collectively, such user-created entities are referred to as Meta-Appliances or Meta-Applications, generally abbreviated to MAps. Deconstruction and user-customised MAps raise exciting possibilities for occupants of future intelligent environments and set significant research challenges [14]. For example, how can MAps be constructed and managed by ordinary, non-expert home occupants? At one extreme it is possible to use artificial intelligence (AI) techniques and equipment, such as autonomous intelligent agents.

Jeannette Chin, University of Essex, e-mail: [email protected]
Victor Callaghan, University of Essex, e-mail: [email protected]
Graham Clarke, University of Essex, e-mail: [email protected]
These agents monitor an occupant's habitual behaviour, modelling their behaviour and creating rule-based profiles (self-programming) so that they can pre-emptively set the environment to what they anticipate the user would like [1]. However, some people have privacy concerns about what is being recorded, when it is being recorded and to whom (or what) any information is communicated. These concerns are particularly acute with autonomous agents, over which people have little direct control. Such matters are especially sensitive when the technology is used in the private space of someone's home. Frequently, end-users are given very little choice in setting up digital home technology and are obliged to accept whatever is offered [10].

Apart from the issues of privacy and trust, we argue that creativity is an essential and distinctive human quality, and that many people would enjoy the process of creating their own novel networked appliances and personalising their 'electronic spaces', provided they can be shielded from unnecessary technical complexity. This has parallels with the common practice of people decorating their own homes with paintings, wall hangings, pictures, colour schemes and furniture. This rationale has led many researchers to investigate what is termed 'end-user programming', a methodology aimed at allowing non-technical people to personalise their own digital spaces with network-enabled, embedded-computer based devices. Historically, programming has only been accessible to well-qualified professionals, such as computer scientists, or has been the outcome of self-programming (learning) using autonomous intelligent agents. The challenge for achieving an end-user programming vision is to devise programming methodologies that are usable by non-technical people.

In this chapter we begin by reviewing current research into end-user programming systems, especially those for the home. We describe approaches that range from transposing conventional programming constructs into graphical or physical iconic objects, to those that adopt radically new programming metaphors. By way of an example of these new approaches, we describe a novel end-user programming approach developed at the University of Essex called Pervasive interactive Programming (PiP) (UK patent No. GB0523246.7) and a service coordination model known as Meta-Appliances/Applications (MAps). We report on an evaluation of user experiences using PiP in a digital home, the University of Essex iSpace. We conclude the chapter by reflecting on the main findings of our work.
2 The Home of the Future – A Vision

One vision for the home of the future is that it will be a technology-rich environment containing tens or even hundreds of pervasive network-based services, some provided by physical appliances within the home, others by external service providers (see Figure 1). It will be centred on the concept of aggregating network services to satisfy particular needs. A supporting middleware will make these services discoverable and accessible to the user in the environment in which they reside. Services might range from simple video entertainment streams through to complex home energy management packages.
Fig. 1 A Pool of Service Providing Appliances
In the current market, appliances are designed, packaged and marketed by commercial companies who bundle together pre-formed packages of functions that they anticipate customers will want (e.g. a TV). However, networked technology enables the development of alternative approaches. For example, from a customer's perspective, it might be possible to create network-based appliance or environment behaviour by aggregating and coordinating sets of networked services. Formal descriptions of these composite services and their behaviours would form 'soft-objects' which could migrate with people as they move across differing environments (e.g. via the network, or carried in mobile phones), instantiating these functions from local resources wherever possible. The current appliance market suggests that there is a basic set of devices that people commonly want (e.g. telephones, TVs, heating, etc.) and these could form default MAps in all homes. However, other, more novel MAps – composite service descriptions – created by lay-people could be developed for personal use or even traded. Clearly, the deconstructed model represents a radically different way of providing appliances and it is not yet clear that the market would accept such a proposition. However, even if this vision for 'virtual appliances' does not find favour in the market, it remains an alternative way for end-users to personalise the functionality of their digital homes.
3 Contemporary Work

Various approaches to customising digital homes have been reported by researchers, ranging from intelligent agents through to user-driven methods.
A common approach to endowing networked appliances with coordinated behaviour is via rules [8]. Rules are fundamental to determining how a given appliance or service interacts with other appliances or services [50]. Chin [13] has proposed a taxonomy based on 'rule formation' as a way of describing the different rule-based approaches to customising digital homes. Chin's taxonomy comprises the following categories:
1. Pre-programmed rules – usually created by the developers or manufacturer;
2. Agent-programmed rules – created by intelligent agents, artificial intelligence or machine learning mechanisms; and
3. User-programmed rules – created through end-user programming.
3.1 Pre-programmed Rules

This grouping describes systems where a developer or manufacturer fixes the rules for coordination before the system is supplied to end-users. Here, professional developers try to anticipate the future use of the system, e.g. which devices will work together, what functionality the users will require, etc. If there is a reliable link between the needs of the user and those anticipated by the developer, this scheme can work well. Approaches range from systems with static functionality (that always present the same functionality to users) to systems that have some capacity to adapt to user needs. Thus, hard-coding does not mean that the systems are not adaptable, rather that the flexibility is constrained to pre-designed options. The defining characteristic of work in this category is that there is no local autonomy: the system is not able to invent its own rules but rather utilises existing ones. Contemporary examples include task-based computing systems and some forms of context-aware system. These can adapt to the user's context but are built from pre-programmed rules, switching between a pool of options according to circumstances or context, and thus can be regarded as belonging to this category.

In more detail, task-based computing was pioneered by Wang and Garlan of CMU [49] and Fujitsu [33]. In this approach, users interact with network-enabled services by commanding high-level tasks (e.g. watch a movie on the largest available display). Task computing can be described as a method that enables users to discover, combine and execute coordinated contextual actions or tasks [41]. Such tasks can be regarded as composites of lower-level actions; for example, the task "watch a movie on the largest available display" could be decomposed into a series of smaller steps which would need to be combined to carry out the task. A Task Computing interface generally presents the user with a fixed set of pre-specified tasks which, when activated, result in the system identifying and instantiating the best match of aggregated services to satisfy the requested task. Special tools such as the GUI-based 'Semantic Task Execution Editor' (STEER) developed by Fujitsu have been used by developers to pre-programme tasks. A disadvantage of pre-programming tasks is that, to some extent, it involves developers guessing at the users' needs. As will be described in Section 4.2, the Pervasive interactive Programming (PiP) tool
developed by Chin provides an alternative approach, allowing end-users, rather than developers, to compose tasks. Context-aware computing was introduced by Schilit et al. in 1994 [44]. There are numerous definitions of what context-aware systems are, many based on the type of interactivity the system uses. Here we adopt the definition proposed by Chen and Kotz of active and passive context-awareness [12]. Active context-awareness describes applications that, on the basis of sensor data, change their interface or content autonomously, whereas passive context-aware systems prompt the user to make the changes. A simple illustration of active context-awareness might be a mobile phone automatically switching to silent on entering a library; if it instead prompted the user to switch to silent, it would be an example of passive context-awareness. Other examples of active context-aware computing are Georgia Tech's Aware Home [30] and Microsoft's EasyLiving project, in which systems adapt by switching between pre-programmed routines or states [7]. In principle, it would be possible to employ deliberative agents to learn and evolve new rules to manage the context-aware system but, currently, most systems are relatively simple and are built using pre-specified operational rules.
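The distinction between active and passive context-awareness can be illustrated with a minimal sketch. The class and method names below are hypothetical and do not belong to any of the systems cited above; the sketch merely contrasts an application that acts on a context change autonomously with one that asks the user first.

import java.util.Scanner;

// Minimal illustration of active vs. passive context-awareness.
// All names here are hypothetical; a real system would obtain the
// context (e.g. "library") from sensors or location services.
public class ContextAwarePhone {

    private boolean silent = false;

    /** Active context-awareness: the device adapts autonomously. */
    public void onContextChangeActive(String context) {
        if ("library".equals(context)) {
            silent = true;                       // switch to silent without asking
            System.out.println("Entered library: phone set to silent.");
        }
    }

    /** Passive context-awareness: the device only prompts the user. */
    public void onContextChangePassive(String context, Scanner in) {
        if ("library".equals(context)) {
            System.out.print("You appear to be in a library. Switch to silent? (y/n) ");
            if (in.nextLine().trim().equalsIgnoreCase("y")) {
                silent = true;
            }
        }
    }

    public static void main(String[] args) {
        ContextAwarePhone phone = new ContextAwarePhone();
        phone.onContextChangeActive("library");                         // no user involvement
        phone.onContextChangePassive("library", new Scanner(System.in)); // user decides
    }
}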
3.2 Agent-programmed Rules

This group describes systems where the rules for coordination are formed through the use of intelligent agents, artificial intelligence or machine learning mechanisms. In general these are pre-emptive decision-making systems that utilise models derived from monitoring past behaviour to predict the user's next action. A distinguishing aspect of work in this group is that the agents are self-governing, by which we mean they learn new rules based on past behaviour. The performance of these systems depends upon there being a good match between past actions and future wants and needs. The general focus of research in this area is to create agents that can model and predict a person's behaviour, with less attention being paid to how the agents are interconnected or grouped into functional sets. Notable test-beds and concepts for such systems have been built, such as the MavHome [16], Neural Home [39], Hive [38] and the iSpace (described later in this chapter). The issue of how to aggregate services or devices, in the form of agents, remains a particularly difficult problem and a number of approaches have been proposed to solve it. One approach concerns the semantic web [4, 34], in which devices and their services are tagged with attributes and semantic descriptions that allow similarity searches and service aggregation to be accomplished. Another project, ANS [36], takes the form of smart middleware which uses the OWL ontology to find appropriate devices, in addition to providing an adaptive agent-like capability which can learn the user's preferences. This ability is also used to make decisions on device replacement. This use of OWL is somewhat similar to Chin's dComp work, described later in this chapter. The multi-agent community is also deeply immersed in
the control of smart environments [17], including intelligently modelling and managing associations [36, 21]. This work utilises task-specific predefined policies to enable the agents to associate with relevant devices in their search space, providing some adaptation capability to deal with devices joining or leaving the network and with people's mobility. The networking community is also undertaking research to endow network components with a degree of autonomy, generally aimed at improving throughput and reliability [2, 21]. Most of these systems rely on semantic descriptions in some form. A notable exception is the work of Duman [22], who has developed a function/semantic-free exploration algorithm based on fuzzy logic to discover and model the most relevant associations between devices and services operating in a digital home. Finally, a number of wider issues are involved in the successful deployment of autonomous agent-based systems. For example, all automatically constructed systems of coordinating distributed agents are vulnerable to cyclic instabilities [50]. In addition, the risk to personal privacy due to continual monitoring is an issue for some people [10]. Collectively, these issues form a fertile and challenging ground for research.
3.3 User-programmed Rules

This group describes systems where the rules of coordination are formed via explicit guidance from the end-user. This approach bypasses the manufacturer or professional developer by allowing the end-user to create the system functionality directly. The methodology is commonly referred to as end-user programming and is characterised by the use of techniques that allow non-technical people to create programs [19]. Examples of end-user programming approaches include Humble [27], which uses a jigsaw metaphor, enabling users to snap together puzzle-like graphical programming representations as a way of building applications. The HYP system [5] enables users to create applications for context-aware homes using a mobile-phone-based graphical interface. Media Cubes [26] provides a tangible interface for programming an environment in which each face of a cube represents a programming operation. Programming is achieved by turning the appropriate face of the cube towards the target device. Truong's CAMP project [47] uses a fridge-magnet metaphor together with a pseudo-natural-language interface to enable non-technical people to realise context-aware ubiquitous applications in their homes. Programming-by-example (PBE) aims to simplify the process of programming by replacing programming abstractions with examples, conveyed by the user demonstrating the required actions to the system [45]. Later, Henry Lieberman described PBE as "a software agent that records the interactions between the user and a conventional direct-manipulation interface and writes a program that corresponds to the user's actions", where "the agent can then generalise the program so that it can work in other situations similar to, but not necessarily exactly the same as, the example on which it is taught" [32]. Hence, PBE can
be viewed as an approach that reduces the gap between user requirements and the delivered program functionality by merging the two tasks. To date, most PBE work has focused on graphical user interfaces running on single PCs. For instance, PBE has been applied to computer application development [40, 24], Computer-Aided Design (CAD) systems [55], children's programs [57, 52, 58] and World Wide Web related technologies [46, 3, 37, 31, 6]. The MIT Alfred project and the Essex University Pervasive interactive Programming (PiP) project have extended the principles of PBE into the area of pervasive computing and digital homes. The Alfred project, developed at MIT in 2002, employed the concepts of 'goals' and 'plans' to allow users to compose a program via teaching-by-example. The system was intended to utilise a macro-programming approach in which macros could be created by the user via verbal or physical interaction. However, according to the developer of the system, Gajos, no formal studies were completed and the work appears to have been cut short when he moved from MIT to the University of Washington [23]. Whilst macros are widely used, their dependence on strict ordering can make them fragile and susceptible to failure, especially in applications where the order of events is irregular or unpredictable. PiP, described later in this chapter, avoids this problem by using sequence-independent behaviour descriptions known as Meta-Appliances (MAps), which are akin to non-terminating processes. PiP utilises an event-based architecture and functions by mimicking examples of the required behaviour.
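The essence of the PBE idea described above can be conveyed by a small sketch in which user actions are recorded as they are demonstrated and then replayed against a similar target. All names are hypothetical; this illustrates the general record-and-generalise idea, not any of the systems cited.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal illustration of programming-by-example: user actions are
// recorded while they are demonstrated, and the recording is then
// replayed as a program against a (possibly different) target.
public class MacroRecorder {

    private final List<Consumer<String>> recording = new ArrayList<>();

    /** Record an action while also performing it on the demonstration target. */
    public void demonstrate(Consumer<String> action, String target) {
        recording.add(action);
        action.accept(target);
    }

    /** Generalise: replay the recorded actions against another target. */
    public void replay(String newTarget) {
        recording.forEach(a -> a.accept(newTarget));
    }

    public static void main(String[] args) {
        MacroRecorder recorder = new MacroRecorder();
        recorder.demonstrate(t -> System.out.println("open "   + t), "holiday-photos");
        recorder.demonstrate(t -> System.out.println("resize " + t), "holiday-photos");

        recorder.replay("work-photos");   // the same steps applied to a similar case
    }
}

As the surrounding text notes, such sequential macros are fragile when the order of events varies, which is precisely the limitation that PiP's sequence-independent MAps are designed to avoid.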
3.4 Supporting Studies

Some insight into users' wishes has been provided by a number of significant studies of digital home requirements. The Samsung Corporation, in cooperation with the American Institutes for Research, conducted a study aimed at identifying smart home requirements by interviewing and monitoring people in South Korea and the USA [15]. Their findings included the need for harmonious cooperation between appliances and for ease of use. Most importantly for the work reported in this chapter, they also discovered a requirement for people to be able to customise the functionality of smart homes. The same need was reported in a three-year study of Finnish users undertaken by the Tampere University Hypermedia Laboratory [35].
3.5 Discussion

Rules underpin the operation of numerous smart homes and intelligent environments. As such, rule formation provides a useful taxonomy for describing the differing approaches to programming the coordinated behaviour of services available in digital homes. We have argued that pre-programmed approaches can perform well where
the user's needs and the system components can be anticipated correctly and remain relatively static. However, in more dynamic environments, where system components and people's needs can change quickly, they are less suitable, and adaptable agent-based programming approaches become attractive. In these approaches, the use of automated rule generation offers the advantage of reducing the cognitive load on people by undertaking all the configuration and programming on behalf of the user. However, these approaches have drawbacks: some researchers contend that there can never be a perfect match between past actions and future needs, because users never behave in exactly the same way from day to day, month to month and year to year. Because of this, it is argued that the gap between user expectations and agent actions will remain a source of frustration, requiring agent-based programming systems to be overridden frequently and thereby annoying users. In addition, the continual monitoring of people in their homes that agent-programming approaches employ to maintain effective predictive behaviour models is a cause of concern for anyone who worries about privacy. Finally, two significant studies have found that people need to be able to customise their smart home functionality, an important finding that further motivates the work described in this chapter.
4 Pervasive interactive Programming (PiP) – An Example of End-User Programming

The users of an appliance, or the occupants of an environment, usually know the behaviour they would like from a system but do not always have the means to communicate their wishes to it. End-user programming approaches have the advantage of not requiring the system to guess a person's wishes, as people are able to describe their needs explicitly, although this is achieved at the cost of increased effort on the part of the user. In addition, end-user programming goes some way towards countering the fears that some people have about privacy and trust by providing more transparent operation and more user control [10]. Thus the aim of the work described in this chapter is to develop systems that maximise users' control and operational transparency, engendering a sense of trust, and enabling people to customise the functionality of their virtual appliances and digital homes without requiring any detailed technical knowledge, thereby empowering user creativity. In the following sections we explain how we have achieved this by describing how an end-user programming paradigm, programming-by-example, originally developed for the desktop computing environment, was applied to the distributed computing systems that underpin intelligent environments. We call this approach Pervasive interactive Programming (PiP).

PiP is an end-user programming paradigm proposed and developed by Chin in 2003 as a means of addressing the issues of privacy and creativity in digital homes. It can be viewed as a methodology for enabling non-technical people to compose and orchestrate the behaviour of collections
of network services or devices so as to produce the desired collective functionality that characterises the smart home or a virtual appliance. PiP provides a platform that utilises the physical user environment, with appropriate GUI support, to create a novel programming environment which enables people to customise the functionality of their virtual appliances or digital homes to suit their individual needs. Thus, a resident of a digital home who has no technical expertise is able to customise the functionality of coordinating network devices using natural physical interaction to demonstrate the required behaviour, a task that could previously only be achieved by conventional computer programming.
4.1 PiP Terminology

A fundamental concept underpinning PiP is the notion of deconstruction and reconstruction of appliances and applications. This gives rise to the following terms:
• Virtual Appliances – Typical home appliances are made up of numerous services, e.g. a TV is composed of a video display, audio transducer, media gateway, control interface, etc. Thus, when appliances are connected to a network it may become possible to access these sub-functions, effectively deconstructing the appliance by making these basic services available to other network users. These users may in turn combine them with sub-functions from other networked appliances to form novel composite services or, as we choose to describe them, virtual appliances. This model changes the understanding of what an appliance is, bringing with it the potential to disrupt the current appliance market by creating new types of producers and consumers. The concept is not limited to physical appliances but can, potentially, include any network service. Thus, we refer to such communities or virtual appliances by the more generic name of Meta-Appliances/Applications (MAps) and to the conceptual approach as the deconstructed appliance model.
• Atomic & Nuclear Functions – The virtual appliance principle depends critically on the ability to deconstruct monolithic appliance functionality, referred to as nuclear functions, into sub-functions which are offered to the network, referred to as atomic functions.
• Meta-Appliances/Applications (MAps) – A MAp is a description of a virtual appliance or application. MAps can be viewed as soft-objects that define the membership and behaviour of the collection of services that constitute a virtual appliance. MAps can be designed, owned, copied, carried or traded. From a logical perspective, a MAp contains some basic properties and a collection of rules that determine the behaviour of a community of coordinating services. Rules are essentially a marriage of two different types of actions, namely antecedents (conditions) and consequents (actions). The antecedent of a rule can be described as the "if" part, while the consequent can be described as the "then" part. A rule can contain 0-n antecedents and 1-n consequents, and a MAp can legally
contain 0-n rules, as rules can be added later by the end-user. A UML representation of the MAp relationship with rules is shown in Figure 2. From the end-user's viewpoint, a MAp is just a behaviour description that creates the sort of virtual appliance or environmental functionality they want. As individual end-users, the owners of virtual appliances or applications, have their own unique preferences and particular needs, it makes sense to let users define their own MAps, i.e. their own virtual appliance behaviour. MAps are created under the direction of end-users to provide the functionality they desire. MAps are akin to non-terminating processes and require no specific user expertise for their formation. They can be created, stored, retrieved, shared, executed or removed on demand. Until a MAp is terminated, it retains the behaviour that the user originally created, i.e. it is a continually running process.
Fig. 2 The UML representation of the MAp object structure and its Rules
The following simple scenario is offered to illustrate the operation of MAps: Janet is watching the news on broadcast TV in the lounge when an interesting news item starts. She calls up to John to tune in. John is currently in the office upstairs, working on a document and listening to a podcast. He starts up the TV application, which uses the same audio channel as his podcast for output, and tunes in to the news item. Once the news
item has finished, John closes the TV application and continues with his work. The podcast resumes (from the point at which it was interrupted).
In the above scenario, John's MAp consists of three generic virtual devices: a TV device, a radio device and a podcast device, along with three virtual services: TV-control-service, radio-control-service and podcast-control-service. The event "when John's TV application is on" is the condition of a given set of actions in the MAp. The sequence "his podcast system halts its current operation and switches to play the TV audio channel" comprises the actions that need to be performed if the conditions are met; in this case there is only one condition. A partial definition for John's MAp is shown in Figure 11 (in the dComp ontology section). Having described the nature of a MAp, it may be useful to understand the differences between a task and a MAp. A task refers to a set of functions (actions) that are performed via a specific command (normally a one-shot sequence) and requires expertise to define, whereas a MAp, although it may provide the same functionality as a task, is an ongoing (non-terminating) process that requires no specific expertise for its formation. Until a MAp is terminated, it will always provide the functionality that the user originally created, i.e. it is a continually running process. Thus MAps are designed by users to create virtual entities that, for example, provide functionality to customise a digital home.
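To make the MAp structure of Figure 2 concrete, the sketch below models a MAp as a named collection of rules, each rule holding a set of antecedents and a set of consequents, and instantiates the single-rule MAp from John's scenario. The class and field names are illustrative only and do not correspond to the actual PiP implementation.

import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative model of a MAp: a named, continually running behaviour
// description made up of rules, each pairing a set of antecedents
// (conditions) with a set of consequents (actions).
class Rule {
    final Set<String> antecedents = new LinkedHashSet<>(); // "if" part, 0..n
    final Set<String> consequents = new LinkedHashSet<>(); // "then" part, 1..n
}

class MAp {
    final String name;
    final Set<Rule> rules = new LinkedHashSet<>();          // 0..n rules
    boolean running = false;                                 // non-terminating once started

    MAp(String name) { this.name = name; }

    void activate()  { running = true; }                     // "play" in PiPView
    void terminate() { running = false; }                    // "stop" in PiPView
}

public class JohnsMap {
    public static void main(String[] args) {
        Rule rule = new Rule();
        rule.antecedents.add("John's TV application is on");
        rule.consequents.add("halt the podcast");
        rule.consequents.add("switch the audio channel to the TV stream");

        MAp johnsMap = new MAp("JohnMAp");
        johnsMap.rules.add(rule);
        johnsMap.activate();   // the MAp now behaves as demonstrated until terminated
    }
}

Holding antecedents and consequents as sets, rather than sequences, reflects the point made above that a MAp describes an ongoing behaviour rather than a one-shot command sequence.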
4.2 PiP System Architecture

PiP was designed to work in real time within a digital home environment. It uses an event-based architecture, currently based around UPnP, to facilitate communication between components. It is built on a modular framework comprising the following six core components (see Figure 3):
1. Interaction Execution Engine (IEE) – this module manages communication between PiP components. It is responsible for device discovery, service event subscription and performing network action requests. It is built around a UPnP control point and interfaces to the low-level network layer via UPnP protocols. It features a two-way function that, for in-bound traffic, passes network events to the Knowledge Engine and, for out-bound traffic, passes network actions and requests to the network. This module also maintains and manages an event-subscription list, which it uses to store details of the devices and services that are utilised in MAps. The information required for the event-subscription list is provided by the Event Handler, which, in turn, is driven by requests from the MAp Associator component, triggered by the user's interactions (e.g. from "PiPView" or networked devices).
2. Event Handler (EH) – this module manages the passing of events between interested parties (a minimal sketch of this registration-and-dispatch pattern follows the list). Such parties need to register with the Event Handler in order to receive the appropriate events and their data. Examples of events include low-level network events (e.g. device discovery), device service events (e.g. service state
Fig. 3 PiP Architecture
changes) and high-level events caused by user interaction. Table 1 illustrates the types of events and the data handled by the EH.
3. Knowledge Engine (KE) – this module manages, assembles and instantiates a MAp. It also updates the record of device status as well as maintaining the content of the knowledge bank and keeping it up to date. It notifies the Event Handler when it should send a "DataModel Event" (see Table 1) to interested parties as a result of changes to the Knowledge Bank.
4. Real-time MAp Maintenance Engine (RTMM) – this module maintains records of all MAps. The MAp Associator registers with the Event Handler in order to receive GUI-sourced user interaction events. For example, when the user "drags & drops" devices into a "programming formation palette", the GUI component notifies the Event Handler which, in turn, generates a related event that is passed to the MAp component. Likewise, the MAp Associator component notifies the Event Handler of any changes in the user's MAp which, in turn, generates a MAp Event that is propagated to interested parties.
5. Real-Time Rule Formation Engine (RTRF) – this module assembles rules formed by monitoring user interactions during "demonstration" mode (started and stopped by the user clicking the "ShowMe" and "Done!" buttons, respectively, on the PiP editor). It registers with the Event Handler to get events from the Knowledge Engine, the MAp, the GUI and the user's activities. This module also assembles rule fragments based on user actions. At the end of a "demonstration mode" session, the aggregation of rule fragments becomes a MAp.
6. GUI (PiPView) – this module provides a means for the user to interact with the system. It enables the PiP user to inspect the network environment, view online devices and services, control the physical devices and compose/delete MAps and rules (an alternative to the use of physical devices for programming). To enhance the intuitive nature of PiP, the editor was designed to convey the familiar "look and feel" of current 'Windows' applications. Thus the GUI utilises "drag, drop & clicking" for interaction. The software is built around a multi-threaded scheme with the 'editor class' being the main class and the user's entry point to the system. Other classes include 'pop-up dialogs', 'device control panel' and 'help'. PiPView differs from the other modules in that it is not contained in every instance of PiP, but only in the editor device (a tablet, in the current implementation).
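The registration-and-dispatch role of the Event Handler referred to in item 2 can be approximated by a small publish/subscribe registry. The sketch below is an illustrative simplification, not the actual PiP code; the event type string is taken from Table 1, while the class and method names are hypothetical.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal publish/subscribe registry approximating the Event Handler role:
// interested parties register for an event type and are notified when a
// component (IEE, Knowledge Engine, GUI, ...) publishes an event of that type.
public class EventHandler {

    public interface Listener { void onEvent(String type, Object data); }

    private final Map<String, List<Listener>> listeners = new HashMap<>();

    public void register(String eventType, Listener l) {
        listeners.computeIfAbsent(eventType, k -> new ArrayList<>()).add(l);
    }

    public void publish(String eventType, Object data) {
        for (Listener l : listeners.getOrDefault(eventType, List.of())) {
            l.onEvent(eventType, data);
        }
    }

    public static void main(String[] args) {
        EventHandler eh = new EventHandler();

        // e.g. the Real-Time Rule Formation Engine listens for state changes
        eh.register("STATE_CHANGED",
            (type, data) -> System.out.println("RTRF saw " + type + ": " + data));

        // e.g. the Knowledge Engine reports a remote device state change
        eh.publish("STATE_CHANGED", "MP3 player -> paused");
    }
}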
4.3 How the System Works

A PiP user can configure and program the functionality of a MAp using either graphical or direct physical interaction with the appliances. In PiP, all networked items that exist in the ubiquitous environment can be regarded as user interfaces, as users can choose to interact with them as an alternative to a GUI during the demonstration process. The network appliances used in our evaluation are shown in Figure 4. The benefit of interacting with real appliances, rather than a GUI, is that it is more natural and intuitive. Thus the PiP metaphor for configuring digital home functionality is very simple. In more detail, to create a MAp, a person needs to log into the system through PiPView. An instance of the Interaction Execution Engine (IEE) module then completes a network discovery cycle and reports available devices to the Knowledge Engine (KE) which, via the Event Handler (EH), informs all registered devices (registered at start-up time). For example, PiPView receives a notification and then an
Fig. 4 The PiP Evaluation Environment
Fig. 5 PiP View – user’s MAp
Fig. 6 PiP View – Rules
instance of the interpreter transforms and renders descriptions of the devices on the GUI, allowing the user to decide which devices to use for creating or modifying a MAp. MAps are created by the user, who first assembles a community by dragging and dropping service and device representations into a formation palette. This defines the set of devices that, via the Event Handler and the Interaction Execution Engine, share events and become a virtual appliance or MAp. When devices are formed into assemblies, the user is free to save or remove the MAp at any time during the demonstration cycle. Using PiPView, the user informs PiP when the process of demonstrating the MAp or virtual appliance behaviour will begin, which, in turn, activates the Real-Time Rule Formation (RTRF) Engine. Next, the user demonstrates how the MAp should behave by providing examples using any one of three methods:
Fig. 7 PiP on tablet
Fig. 8 End-user programming via physical interaction
1. Physically interacting with the devices themselves;
2. Using the GUI;
3. A combination of both.
Using their preferred interaction method, the user demonstrates the desired behaviour, which results in the generation of related events on the network. In the background, the Real-Time Rule Formation Engine "listens" for the user's activities, which are communicated to it via the Event Handler, which, in turn, is informed by the Knowledge Engine when it receives a notification that the status of a remote device has changed. This behaviour is then encoded as a set of rules, as described earlier. In terms of sequencing, PiP aims to give the user as much freedom as possible, allowing antecedents and consequents to be formed in any quantity and order (i.e. the user is not required to follow a rigid logical sequence). Thus, unlike macro languages, where the sequence of instructions or actions is important, PiP places no significance on logical sequence. PiP employs a rule policy to maintain a MAp execution process in which "a set of conditions is satisfied if the conditions defined within the context of this set are all satisfied". For example, the rule statement "if the telephone is ringing and if the audio is playing then stop the audio and raise the light level" has the same logical meaning as any of the following rule statements:
• "if the audio is playing and if the telephone is ringing then raise the light level and stop the audio"
• "if the telephone is ringing and if the audio is playing then raise the light level and stop the audio"
• "if the audio is playing and if the telephone is ringing then stop the audio and raise the light level"
This is achieved by using a Rule Disassembler, which separates antecedents and consequents. The rules are then shaped by a Real-Time Ambience Relation Resolver, which is responsible for resolving the relationships of the information received from the network's internal system and the user.
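The order-independence of this rule policy can be illustrated with a small sketch in which antecedents and consequents are held as sets, so that two demonstrations given in different orders denote the same rule. The record and field names are hypothetical and are not part of the PiP implementation.

import java.util.Set;

// The same rule expressed in two different demonstration orders.
// Because antecedents and consequents are treated as sets rather than
// sequences, both demonstrations denote the same behaviour.
public class RuleEquivalence {

    record Rule(Set<String> antecedents, Set<String> consequents) {}

    public static void main(String[] args) {
        Rule first = new Rule(
            Set.of("telephone is ringing", "audio is playing"),
            Set.of("stop the audio", "raise the light level"));

        Rule second = new Rule(
            Set.of("audio is playing", "telephone is ringing"),
            Set.of("raise the light level", "stop the audio"));

        // Records compare field by field and sets ignore ordering,
        // so the two demonstrations are recognised as the same rule.
        System.out.println(first.equals(second));   // prints: true
    }
}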
Event                 Type/Method                                                  Data
DataModel Event       DEVICE_UPDATED, REMOVE_DEVICE, STATE_CHANGED,                Object
                      SERVICE_UPDATED
PiPUIEvent            SET_CURRENT_MAp, MAp_DELETED, SET_CURRENT_RULE,              Object
                      RULE_DELETED
RuleModelEvent        ANTECEDENT_ADDED, ANTECEDENT_REMOVED, CONSEQUENT_ADDED,      Object
                      CONSEQUENT_REMOVED, ANTECEDENT_STATE_CHANGED,
                      CONSEQUENT_STATE_CHANGED, RULE_CHANGED
CCFC Event            DUPLICATES_CONFLICT, CCFCACTIONS_CONFLICT,                   Object
                      DUPLICATE_CONSEQUENT
MAp Event             MAp_REGISTERED, MAp_REMOVED, REGISTER_DEVICE,                Object
                      REMOVE_DEVICE, REMOVE_ALL_DEVICES, RULE_ADDED,
                      RULE_REMOVED
PiPUPnPService Event  stateChanged(StateVariable variable)                         StateVariable

Table 1 PiP Event attributes table
To avoid rule conflicts (e.g. contradictory condition-action pairs), a process called the Contextual Consequence-Focus Conflict (CCFC) Detection Mechanism tests whether the user's current actions conflict with the rules of other MAps, which could result in unwanted system behaviours. If the CCFC detects no conflicts, the Rule Assembler constructs the rule. If there are conflicts, the CCFC Entanglement Handler deals with the situation by (1) gathering the conflict information, (2) isolating the conflicting actions from those in the current MAp and (3) presenting the conflicts to the user so that they may alter their actions to avoid the conflict. Having created a MAp, the user activates it by dragging the MAp's graphical representation from the user's home program area (represented in a sub-view) and dropping it on the "play" button located at the top of PiPView. To terminate a MAp, the user simply clicks the "stop" button.
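One simplified, hypothetical reading of the consequent-focused part of this check is sketched below: a conflict is flagged when a newly demonstrated consequent targets the same device and service as a consequent in an already registered MAp but demands a different value. The class, method and service names are illustrative and are not taken from the PiP implementation.

import java.util.HashMap;
import java.util.Map;

// Simplified illustration of a consequent-focused conflict check:
// a conflict is flagged when two MAps command the same device/service
// to take different values.
public class ConflictCheck {

    // key: "deviceUUID/serviceID", value: commanded value
    private final Map<String, String> registered = new HashMap<>();

    /** Returns true if the new consequent contradicts an existing one. */
    public boolean conflicts(String deviceService, String value) {
        String existing = registered.get(deviceService);
        return existing != null && !existing.equals(value);
    }

    public void register(String deviceService, String value) {
        registered.put(deviceService, value);
    }

    public static void main(String[] args) {
        ConflictCheck ccfc = new ConflictCheck();
        ccfc.register("MP3Player/AudioService", "play");          // from an existing MAp

        // a newly demonstrated consequent asks the same service to stop
        System.out.println(ccfc.conflicts("MP3Player/AudioService", "stop")); // true
    }
}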
5 The dComp Ontology

To maximise the usability of the system, data needs to be expressed in a suitably semantic and standardised manner so that it can be understood and reasoned over by all parties on the network. To support this, we have devised an ontology for PiP called dComp, which supports information interoperability between applications and provides a common knowledge framework. PiP leverages the dComp ontology semantics as a core vocabulary for its information space. dComp allows information to be conceptually specified in an explicit way by providing definitions associated with the names of entities in a universe of discourse (e.g. classes, relations, functions or other objects) in both machine- and human-usable formats. In more practical terms, the dComp ontology describes, for instance, what device names mean, and provides formal axioms that constrain the form and interpretation of terms.
Table 2 dComp ontology
5.1 dComp Rationale

The dComp ontology was built on the OWL language, as it is more expressive than RDF or RDF-S; that is, it provides additional formal semantic vocabulary, allowing PiP to embed more information into the ontology. In addition, OWL is widely used, especially for the Semantic Web, with numerous supporting tools such as Jena [53] and inference engines such as RACER [25], F-OWL [51] and Construct [54]. In order to realise our vision, a set of explicitly well-defined vocabularies (i.e. an ontology) was needed to model not just the basic concept of decomposed devices
but also the communities they form, the services they provide, the rules and policies they follow, the resultant actions that they take and, of course, the people who inhabit the environment along with their individual preferences; dComp provides these properties. In creating dComp we have sought, wherever possible, to build on existing work. One notable ontology is SOUPA [56] from Ubicomp which, while aimed at pervasive computing, lacks support for key PiP mechanisms such as community, decomposed functions and the coordinating actions needed to produce higher-level meta functionality [11]. In addition, at the time dComp was developed the SOUPA standard had only limited support for the concept of UPnP-based devices, on which PiP depends. However, SOUPA has a well-defined method of supporting the notions of Action, Person, Policy and Time, which dComp adopted. Thus, most of the innovation in dComp relates to the ontology of decomposition and community, leading to the name "Decomposed Community Programming" (dComp).
5.2 The dComp Ontology – An Overview

The following description takes the form of a summarised walk-through of dComp; the full specification is available online [59].
5.2.1 The Device Class

The main class is called "DCOMPDevice" and provides a generic description of all PiP devices. Currently DCOMPDevice has 10 sub-classes (see Table 2), including both nuclear (traditional appliance) and atomic (decomposed) devices, and remains the subject of ongoing development. The roles of most sub-classes are obvious from their names. Those which might not be obvious include "DeviceInfo", which is used for devices sharing some UPnP descriptions, and "Characteristic", which captures different mobility characteristics. Relationships are defined using OWL object properties and are: (1) hasDeviceInfo, (2) hasHardwareProperty, (3) hasDCOMPService and (4) hasCharacteristic. The main elements of a typical DCOMPDevice expression are shown in Figure 9.
5.2.2 Hardware Class

An abstract class, DCOMPHardware, generalises all PiP hardware that exists in a DCOMPDevice and, in the current version, has 8 sub-classes along with associated properties: CPU, Memory, DisplayOutput, DisplayInput, AudioOutput, AudioInput, Amplifier and Tuner. In order for PiP DCOMPDevices to work together, every DCOMPDevice on the dComp network offers services. These services are modelled by a class called "DCOMPService", which currently contains three sub-classes, namely PiPService, LightsAndFittingsService and EntertainmentService. Each
contains sub-services; for example, the EntertainmentService class includes AudioService, VideoService, FileRepositoryService, SetTopBoxService and FollowMeService. The LightsAndFittingsService and EntertainmentService are mutually distinct (i.e. in mathematical terms, they do not belong to the same set). These characteristics are modelled by declaring the classes to be disjointWith each other. Every service in the dComp environment is identified by a property called "serviceID" and a class called "StateVariable" (to represent UPnP values). The StateVariable class has three properties, namely "name", "value" and "evented". The relationship between a DCOMPService and a StateVariable is linked by an object property called "hasStateVariable". The relationship between a DCOMPDevice and a DCOMPService is coupled by an object property called hasDCOMPService.

<device:AudioDevice rdf:ID="TestDevice12">
  <device:hasDeviceInfo>
    <device:DeviceInfo>
      <device:friendlyName>TestDevice12</device:friendlyName>
      <device:DeviceUUID>0</device:DeviceUUID>
      <device:DeviceType>urn:schemas-upnp-org:TestDevice12:1</device:DeviceType>
      <device:DeviceModelURL>http://TestDevice12URL/</device:DeviceModelURL>
      <device:DeviceModelNumber rdf:datatype="&xsd;double">0.0</device:DeviceModelNumber>
    </device:DeviceInfo>
  </device:hasDeviceInfo>
  <serv:hasDCOMPService>
    <serv:AudioService rdf:about="#JCAudioService01"/>
  </serv:hasDCOMPService>
  <serv:hasDCOMPService>
    <serv:AudioService rdf:about="#JCAudioService02"/>
  </serv:hasDCOMPService>
  <serv:hasDCOMPService>
    <serv:AudioService rdf:about="#JCAudioService03"/>
  </serv:hasDCOMPService>
</device:AudioDevice>
Fig. 9 Typical Display Device Expression
Fig. 10 Typical TV community definition, with communityID "Tran-JCTV", communityName "JC TV", communityDescription "The first JC testing TV" and timestamp 2004-09-06T19:43:08+01:00
5.2.3 Community Class

In order to support the notion of community (a MAp), dComp uses a class called DCOMPCommunity. In the current implementation, three types of community are used, namely: (1) SoloCommunity (for devices not yet part of a community), (2) PersistentCommunity (for communities with a degree of permanency) and (3) TransitoryCommunity (for communities with a short lifetime). A dComp device (DCOMPDevice) can join one or more communities, and a community must have at least one device. The relationship between a dComp device (DCOMPDevice) and a dComp community (DCOMPCommunity) is described using an object TransitiveProperty called "inTheCommunityOf". A class called "CommunityDevice" is introduced to represent all the devices in a community; devices are identified by a property, deviceUUID. The relationship between a community and a community device (CommunityDevice) is linked by another object TransitiveProperty called "hasCommunityDevice". A user forms communities in dComp; thus, each community has an owner. The properties of communities are: communityID, communityName, communityDescription and timestamp. The relationship between a community and its owner is linked by an object-type property called "hasOwner". An example of the main elements in a dComp TV community is given in Figure 10.
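As an illustration of how a community description with these properties might be assembled programmatically, the sketch below uses Apache Jena to build an RDF resource carrying the community properties and a hasCommunityDevice link. The namespace URI, and indeed the use of Jena here, are assumptions made for the purpose of the example; they are not taken from the dComp specification.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;

// Builds an RDF description of a dComp-style community.
// The namespace URI below is hypothetical.
public class CommunityExample {
    public static void main(String[] args) {
        String NS = "http://example.org/dcomp#";   // assumed namespace
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("comm", NS);

        Resource community = model.createResource(NS + "Tran-JCTV")
            .addProperty(model.createProperty(NS, "communityID"), "Tran-JCTV")
            .addProperty(model.createProperty(NS, "communityName"), "JC TV")
            .addProperty(model.createProperty(NS, "communityDescription"), "The first JC testing TV")
            .addProperty(model.createProperty(NS, "timestamp"), "2004-09-06T19:43:08+01:00");

        // link a member device by its UUID via hasCommunityDevice
        Resource device = model.createResource(NS + "Device-iPHLDigitalTV17")
            .addProperty(model.createProperty(NS, "deviceUUID"), "UUID:iPHLDigitalTV17");
        community.addProperty(model.createProperty(NS, "hasCommunityDevice"), device);

        model.write(System.out, "RDF/XML-ABBREV");  // print the resulting description
    }
}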
Fig. 11 John's Meta-TV Appliance: a community definition with communityID "Tran-JohnMAp", communityName "JohnMAp", communityDescription "John testing virtual MAp", timestamp 2004-09-06T19:43:08+01:00 and owner John ("Johnny"); its member devices are identified by the deviceUUIDs UUID:iPHLDigitalTV17, UUID:PHLhifiMMS223 and UUID:WonderInternetRadio42, together with the identifiers 3e4edfa8-055e-4ef0-8581-70156c156288 and ce4edfa8-c55c-4ef9-8581-40156c156258
5.2.4 Rules Class

PiP uses rules for coordinating community actions. These are supported by a class called "Rules", which models three types of rule: (1) UnchangeableRules (rules that cannot be changed), (2) PersistentRules (rules that change infrequently) and (3) NonPersistentRules (rules that change frequently). These rules are mutually distinct and are declared to be complementOf each other. Rules have the properties ruleID and ruleDescription, and an object property called "hasRuleOwner" links to the owner (the rule and community owners may be different people). A class called "Preceding" is used to represent the set of triggers that cause the coordinating actions to be executed. The devices in the Preceding class are identified by their deviceUUID and the service they offer. Lastly, an object property called "hasAction" binds the relationship between Rules and Actions. The main elements of a Rule definition are given in Figure 12.
Fig. 12 Main elements of Rule Definition
Fig. 13 Main elements of a Situated Condition
5.2.5 Action, Person, Policy and Time Class

As mentioned earlier, wherever possible dComp builds on existing ontology work. As SOUPA provides suitable Person, Policy and Time ontologies, these have been adopted in dComp. The dComp Action ontology document has, to some extent, been influenced by the SOUPA Action ontology. The class "Action" represents the set of actions defined by the rules. As with SOUPA, dComp provides two types of action, namely the PermittedAction and ForbiddenAction classes. The Action class in dComp is the union of these two action classes; every coordinating action has its target devices. A class called "Recipient" models target devices; it represents the set of target devices where actions take place. The members of Recipient
are identified by their deviceUUID and serviceID. Actions for the recipient are modelled by a class called "TargetAction", which has two properties, namely actionName (the name of the action) and targetValue (the value for the action to be taken). A typical statement, "when the phone rings, mute the TV", could be expressed as in Figure 14.
Fig. 14 Main elements of an Action (muting the TV): a test action whose Recipient is identified by deviceUUID UUID:PHLAudioMMS223 and serviceID AudioMMS223, with a TargetAction whose actionName is "Mute" and whose targetValue is "Mute"
5.2.6 Preference Class

A person's preferences are described in dComp by DCOMPPreference. In dComp, preferences are referred to as "situated preferences", similar to Vastenburg's "situated profile" concept, in which situation is used as a framework for the user profile so that the values of the profile are relative to situations [48]. The "Preference" class represents the set of situated preferences of a person for their community. This Preference class has a subclass called "CommunityPreference" and an associated property called "communityID". To model "person A prefers X, depending on the situation conditions of Y", another class called "SituatedConditions" is defined, which represents the set of situated conditions that the person's preferences depend on. Although users are allowed to define their own "SituatedConditions", dComp explicitly defines a list of pre-set situated conditions forming a default template that a person can use. The Preference class has a close relationship to the Person class. To bind this relationship, an object property called "hasPreference" is used, which links the domain of Person to the range of Preference. The relationship between the Preference class and the SituatedConditions class is linked by another object property called "hasCondition". The main elements of a Situated Condition are given in Figure 13.
6 Evaluation

PiP was designed to be used by people and thus, to evaluate it, we devised a small "proof of concept" trial involving real users who were asked to compose bespoke MAps within an experimental digital home, the iSpace. The primary purpose of the evaluation was to determine whether users were able to use PiP in a creative manner to construct MAps of their own design and to gain an insight into the participants' post-trial views of PiP's usability. In addition, we tested the performance of dComp, as ontology processing can be computationally demanding. We summarise this work in the following sections.
6.1 PiP Testbed
Fig. 15 The Essex iSpace
PiP was evaluated in the Essex iSpace, a test-bed which takes the form of a two-bedroom domestic apartment (see Figure 15). The iSpace was built from the ground up to support digital home research and has many special structural features, such as cavity walls and ceilings containing power and network outlets, together with provision for internal wall-based sensors and processors. In addition, the iSpace has been populated with in excess of a quarter of a million pounds' worth of networked equipment, ranging from appliances, sensors and actuators through to special-purpose equipment to support user trials. There are numerous networks in place, ranging from wired, power-line, wireless and broadband networks to high-bandwidth multi-mode fibre connections to the outside world. The network and middleware infrastructure is illustrated in Figure 16. All the basic services are electrically controlled wherever possible (e.g. heating, water, doors, telephones, MP3 players, lights, etc.).
6.2 dComp Performance Evaluation

Ontology brings many benefits to PiP, such as the ability to employ reasoning about service selection and aggregation, but its downside is that it can be computationally demanding.
Fig. 16 The iSpace Network Structure
In our case there was concern that one of the special features of PiP, the MAp-based decomposed descriptions, might not perform well in large domains because of increased link following. To evaluate the performance of dComp we compared two sets of device descriptions: the first was structured in a typical XML-based "all-in-one" format, while the second was decomposed into smaller segments (i.e. broken up into hardware and service information), each segment being "linked" back to the device. Both descriptions were written in OWL. For each set, we used two different quantities of devices in the test (3 and 32). A common query with five conditions was used, with each test being run fifty times. The tests were conducted on a Windows XP machine with a 2.08 GHz processor and 512 MB of RAM. Four sets of tests were completed: (1) 3 device descriptions in "all-in-one" format, (2) 3 device descriptions in "decomposed" format, (3) 32 device descriptions in "all-in-one" format and (4) 32 device descriptions in "decomposed" format.
Fig. 17 A typical dComp performance test
A representative example of our tests is shown in Figure 17. As can be seen, we found that the decomposed device descriptions out-performed the compact device descriptions for smaller domains with fewer devices. On average, queries over the decomposed descriptions took half the time of queries over the "all-in-one" descriptions. Although we had been concerned that decomposed descriptions might not fare as well in larger domains, we found that this was not the case: the system performed as well as with the "all-in-one" descriptions, whilst offering the advantages of decomposition described earlier. We attribute this to the additional link-processing being counterbalanced by the processing benefits of smaller, better-focused descriptions. For larger domains, the performance of the decomposed versus the compact descriptions remained roughly the same.
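A query-timing experiment of this kind could be reproduced along the lines of the sketch below, here using Apache Jena and SPARQL; the file name, the query and the choice of Jena are illustrative assumptions rather than the setup actually used in the tests above.

import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

// Loads an OWL device description and times a SPARQL query over it,
// repeating the query fifty times as in the test described above.
public class QueryTiming {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("devices-decomposed.owl");        // hypothetical file name

        String sparql =
            "PREFIX dcomp: <http://example.org/dcomp#> " +   // assumed namespace
            "SELECT ?d WHERE { ?d dcomp:hasDCOMPService ?s . }";
        Query query = QueryFactory.create(sparql);

        long start = System.nanoTime();
        for (int i = 0; i < 50; i++) {
            try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
                qe.execSelect().forEachRemaining(r -> { /* consume results */ });
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("50 queries took " + elapsedMs + " ms");
    }
}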
6.3 The PiP Evaluation

To assess the participants' subjective views on the usability of PiP, an evaluation methodology was developed with the assistance of a socio-technical research unit, Chimera, based at the BT Research Labs in Suffolk, England [20]. The evaluation comprised both observations and a questionnaire measuring attitudes over the six usability dimensions shown in Table 3 (a higher rating score on a dimension indicates greater usability). The questionnaire was developed to assess the participants' subjective judgements about the usability of PiP. It consisted of a set of seventeen statements, measuring attitudes over the following six usability dimensions: the overall concept, user controls, cognitive load, information retrieval/visualisation, affective experience and future potential. The questionnaire was based on a five-point Likert scale with responses from "Strongly Agree" through to "Strongly Disagree". Each dimension consisted of a series of statements (from 2 to 4), with each statement offering a range of ratings (from 1 to 5). In the research community there is some discussion as to how best to construct this type of test, with, for example, some researchers worrying that there is no metric interval measure or that the data would be best treated as ordinal [18]. However, there is a widely accepted consensus that the Likert scale can be used with interval procedures, provided the scale item has at least 5 and preferably 7 categories [42]. Thus, as we are using 6 categories, the questionnaire rating data was treated as interval data in this study. The questionnaire was piloted on 3 users and the feedback used to refine the procedures and questionnaire for the main trial. In terms of the structure of the trials, our strategy was to set up as open an arrangement as possible, with minimal constraints placed on time, methods and tasks, so that we could get a better idea of how participants would like to use the system and how the system coped with different users.
Fig. 18 Participants’ Computing Experience
Fig. 19 Participants’ Programming Experience
6.3.1 Participants

The PiP evaluation comprised eighteen participants drawn from a diverse set of backgrounds (e.g. housewives, students, secretaries and teachers); see Figure 20. The gender mix was 10 females and 8 males, with ages ranging from 22 to 65. The participants also formed a multicultural group including Asians, Europeans, Americans, Latin-Americans and Australians. All trial participants had some minimal computing experience (i.e. they knew how to use a mouse and keyboard); see Figure 18. 60% of the participants had no programming experience, whilst 20% had a very good knowledge of programming; see Figure 19. For the PiP evaluation sessions, participants were given a set of five devices, drawn from lights, a telephone, a smart sofa and an MP3 player, and asked to create a collective behaviour of their own design. In addition, PiPview (the PiP GUI) was set up to run on a Windows XP tablet PC (HP) that connected to the iSpace network via a Linksys 802.11g WiFi access point. As will be explained in greater detail below, the evaluation was preceded by a 20-minute training session to show the participants how to use the PiP technology. With only five devices, the possibilities for the users to create interesting designs were a little limited. However, despite that limitation, the users created a number of interesting virtual appliances such as, for example, Telight (telephone linked to lighting), LightSof (reading lights linked to sitting on the sofa) and Telight3 (lights and MP3 player responding to the telephone).
6.3.2 Procedures

The University enforces stringent ethical regulations pertaining to research involving people and animals. Thus, prior to the evaluation, a consent form was prepared and completed by all participants before commencing their sessions, to ensure that the participants were fully aware of the type of data that would be collected during the session and of its use after the session was completed.
Each PiP evaluation participant was given a 20-minute training session to familiarise them with PiP. This training session included a briefing on the PiP concept, a walk-through of the PiP UI, a quick demonstration of how to compose a MAp and an introduction to the trial environment. The task for the trials was deliberately open: participants were asked to use PiP to customise the functionality of the five devices they were given in any way they wanted; thus the participants were free to create one or more MAps of their own design. After creating MAps, participants were encouraged to switch between the MAps they had created.
Fig. 20 Trial Participants
No time limit was set for the participants to customise the space. Assistance was provided where needed. Following completion of the evaluations, a questionnaire was administered to measure the participants’ subjective judgements of PiP. Participants rated a total of seventeen statements covering the six dimensions mentioned above. To support the evaluation, a “user-action” module was created and installed in PiP to collect system data. A digital video recorder was used to record participants’ interactions and verbal comments. The data were analysed using SPSS.
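The statistical analysis itself was done in SPSS. For readers who prefer a scriptable equivalent, the fragment below sketches the same descriptives and one-way ANOVA using pandas and SciPy, assuming a hypothetical CSV export with one row per statement rating and columns named dimension and rating; it is an illustration, not the analysis actually run.

```python
# Rough, hypothetical equivalent of the SPSS analysis; the CSV file and its
# column names are assumptions made for illustration only.
import pandas as pd
from scipy import stats

df = pd.read_csv("pip_questionnaire.csv")    # one row per rated statement

# Descriptive statistics per usability dimension (cf. the table of ratings)
print(df.groupby("dimension")["rating"].agg(["count", "mean", "std"]))

# One-way ANOVA: do mean ratings differ across the six dimensions?
groups = [g["rating"].to_numpy() for _, g in df.groupby("dimension")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```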
6.4 Results

6.4.1 Performance

An analysis of the evaluation data showed that, in general, all the dimensions rated well (scoring above 4), indicating the users were generally well satisfied with the system. At the outset of the work, two suppositions that we wished to confirm were that people would enjoy the experience of using PiP to create MAps and would find the process relatively easy. Both of these suppositions were supported by the evaluation results: ‘enjoying the experience’ (the mean of the affective dimension) scored 4.6, the highest rating, whilst the cognitive load dimension achieved an overall average of 4.3, indicating people found the process relatively simple. In fact, 88.9% of participants reported that they used the controls with ease and, after only a brief (20-minute) training session, 83% were able to use the system to create their desired environments with little or no assistance. The time taken to accomplish these tasks varied from participant to participant, but our evaluation objectives did not include measuring the time taken (although it was typically of the order of 2–5 minutes, depending on the complexity of the behaviour being designed). Concerning the two methods available for demonstrating examples, 11% of the participants chose to customise their personal space wholly via the GUI controls, 72% did so by physical interaction with the environment, and the rest used a mixture of both. Trial participants showed no sign of distress during or after the evaluations. Although PiP does not demand a logical sequence when composing a MAp, 33% of the participants expressed the view that they found it mentally easier to use a logical sequence and decided to conduct their trials that way; the remainder of the participants focused on the task (i.e. creating the behaviour of the environment) rather than on logical sequence. The study also revealed that none of the participants found it difficult to understand the basic principles of the system.
6.4.2 Questionnaire Rating

A variety of tests were completed to analyse the questionnaire ratings using the SPSS software package. Table 2 summarises the overall rating scale for the six dimensions evaluated. The results revealed that the “Affective Experience” dimension received the highest rating: 148 of its 240 cases (61.7%) received the top rating. The “Information Retrieval” dimension – information presentation – had the lowest recorded rating (2), whereas in all other dimensions 3 was the lowest recorded. Tests also revealed that the overall difference between the lowest (4.1) and highest (4.6) mean ratings was not great (see Table 2). The highest mean rating was scored by the “Affective Experience” dimension, suggesting participants were enjoying the experience of programming using PiP. The cognitive load dimension had an overall average score of 4.3, indicating participants found the process relatively simple. From the individual tests we observed that overall 83.4% of all participants found PiP intuitive to use and 94.4% of all participants stated they felt the experience rewarding. Thus these results supported the original supposition that people would both enjoy using PiP and find the process easy. In addition, we completed a cross analysis based on the participants’ computing and programming experience, ranging from average, through good, to very good (Figure 19). Due to space limitations, only the results of the group with average computing and programming experience are reported here. For this group of participants, 4 out of 199 cases evaluated had negative responses (2%). However, the overall results showed that they rated all six dimensions highly (Figure 21).
Table 3 One-Way ANOVA test on dimension vs qRating

Dimension              N    Mean    Std. Deviation  Std. Error  95% CI Lower  95% CI Upper  Min.  Max.
Conceptual            113   4.3186  .53894          .05070      4.2181        4.4190        3.00  5.00
UserControl           191   4.1990  .59134          .04279      4.1146        4.2834        3.00  5.00
CognitiveLoad         155   4.2710  .57332          .04605      4.1800        4.3619        3.00  5.00
InformationRetrieval  112   4.4107  .54613          .05160      4.3085        4.5130        2.00  5.00
AffectiveExperience   240   4.6083  .50596          .03266      4.5440        4.6727        3.00  5.00
FutureThoughts         83   4.1687  .76221          .08366      4.0022        4.3351        3.00  5.00
Total                 894   4.3602  .59489          .01990      4.3211        4.3992        2.00  5.00

(95% CI = 95% confidence interval for the mean.)
Examples of remarks recorded from this group include: “I just feel like right now I want to sit down for a lot longer and try out all sorts of MAps that I could possibly create!” and “I can really get quite keen on it”.
Fig. 21 Mean ratings for each individual group of participants.
In general, the “Information Retrieval” dimension (how well information was conveyed to the user) scored lowest, indicating this was the weakest aspect of the PiP design. Given that the interface was a rather crude prototype, not a final product, it was not surprising to find that it could be improved. Other useful findings were that we found no significant variation across cultures, but some minor variation in cognitive loading across age groups, with younger participants finding the system slightly easier to use. In terms of general observations, none of the participants appeared to find the principles difficult to understand. By way of an example that was typical of many users, one participant stated “I thought the basic principles themselves are very simple and straight forward” and “I felt I could easily grasp the basic principles”. This particular comment came from the group with no programming skills at all, a key target group for PiP. Overall, 83.4% of all participants found PiP intuitive to use and 94.4% stated they found it rewarding to use. Thus these initial results support the original hypothesis of the work: that it is possible to produce a system that empowers non-specialists to program customised MAps, and to enjoy doing so. As with all disruptive technologies, user demand is significant but ultimately just one of many factors in determining whether a given technology will be adopted by the market. This evaluation stops short of examining the issues that drive companies to adopt particular products although, for those interested in understanding those processes, an illuminating paper which takes PiP and the concept of MAps as its focus has been written by a member of Intel’s User Experience Group [28].
7 Concluding Discussion

In this chapter we have described a vision for customising intelligent environments based on a form of programming-by-example. In this approach, non-technical occupants of intelligent environments are given end-user tools that enable them to create their own bespoke functionalities for networked environments. In order to achieve this vision we introduced a number of new concepts and methodologies, in particular the ‘deconstructed appliance model’, meta-appliances/applications (MAps), the dComp ontology and Pervasive interactive Programming (PiP). In order to situate our work we presented a ‘rule formation’ taxonomy based on the way rules in networked devices are constructed: pre-programmed rules, agent-programmed rules and user-programmed rules. The work in this chapter lies firmly within the last class, user-programmed rule-based systems. This allowed us to contrast our work with other significant approaches, such as autonomous self-learning agents. The ensuing review revealed that all approaches have strengths and weaknesses, meaning that the choice of approach depends on the specific needs of a given application. For instance, self-programming agents do extremely well where there is a need to reduce the cognitive load on an individual, as they can manage the whole process without intervention from the user. However, in cases where the home occupant needs to exercise greater control, or to participate intimately in the creative design process, end-user programming has advantages. Of particular relevance to this chapter is “programming-by-example”, a well established and successful end-user programming paradigm. We traced programming-by-example from its roots in the mid-seventies, when Smith introduced it, through the inspirational work of Lieberman in the 1990s, to its use by Chin in pervasive computing environments in the new millennium. Over this time, programming-by-example has evolved from a means of programming applications on single platforms to a means of programming the behaviour of embedded systems on distributed computing platforms. As part of this review we reported on a number of significant studies into smart-home requirements, which found that a particularly important requirement was the need for people to be able to customise the functionality of smart homes. This finding underpins the motivation behind this work. By way of an example of end-user programming we presented Chin’s work on Pervasive interactive Programming (PiP). Chin’s work is novel in that it translated the principles of programming-by-example from single stand-alone computing platforms to distributed computing platforms. Before PiP, programming-by-example had not been applied to programming tangible physical objects, distributed embedded computers, or any other aspect of pervasive computing. PiP also introduced a number of innovative concepts such as the deconstructed appliance model, virtual appliances and meta-appliances/applications (MAps). These concepts represent a radically new way to view the nature of home appliances: by breaking monolithic appliance functionality into its elemental or atomic components, they allow users to recombine those components so as to customise the functionality of their appliances or environment. Currently, for historical reasons (ease of use, economies of production, technology limitations etc), appliances have been largely monolithic units (e.g. televisions, telephones), a paradigm that ‘deconstruction’ challenges. Whilst an outright replacement of monolithic appliances by more elemental services in the home would require an abrupt change to markets, and is therefore perhaps unlikely, a more gradual evolution is possible through the gentle augmentation of appliances with network capabilities. Thus, even if future homes come to have elemental services (for example displays, audio transducers, sensors, actuators, tuners and streamers) installed as standard, much as heating and lighting services are now, the road to that future is more likely to be a gradual evolution. This chapter also revealed another important feature of MAps: they are portable soft-objects, able to move with people and, where possible, to configure the environment to reflect a person’s preferences wherever they are. In addition, beyond the home, MAps have the potential to alter business models as, being soft-objects created by people with no technical skills yet having value, they could be traded. The soft nature of MAps makes them ideal for online, web-based trading, which could be undertaken by individuals, new types of business coalition, or traditional companies. We described our prototype ‘proof of concept’ architectural implementation of PiP. The core principles included the use of eventing, to capture user interaction; virtual engines, to execute user-generated PiP rules; and an ontology, to allow better resource sharing and allocation. We provided an overview of the dComp ontology, which built on both SOUPA and OWL principles to provide support for the PiP deconstructed model, especially MAps. In particular, dComp differs from other ontologies by providing representations for community, decomposed functions and coordinating actions, which are fundamental to the PiP model for end-user customisation. The PiP evaluation was conducted in a purpose-built intelligent environment testbed known as the iSpace, a domestic apartment containing numerous networked appliances, sensors and actuators. Initial testing of the dComp ontology showed that even the most computationally intensive queries relating to PiP decomposition were returned in less than a second, which yielded acceptable system performance.
The main evaluation concerned assessing the ease or difficulty people had in using our prototype PiP system to programme MAps. Whilst we were only able to undertake a comparatively small-scale evaluation with 18 users, the initial findings were most encouraging, as they showed that it is possible to produce an end-user programming system that empowers non-specialists to program, and to enjoy programming, coordinated actions of distributed embedded computer systems in a digital home. Remarks such as “I felt I could easily grasp the basic principles”, from participants with no programming skills at all, were particularly encouraging, as such people were a key target of our work on end-user programming of digital homes. Finally, this chapter has set out to explain what end-user programming is, how it can be implemented and the benefits it can offer occupants of future high-tech homes. Whilst this chapter has argued strongly in favour of end-user programming and, especially, programming-by-example, the future is rarely so ‘black and white’ and it is likely that solutions will be hybrids of many ideas. However, we feel passionately that whatever final solutions emerge, they should recognise the human condition by addressing the fundamental needs of people to be creative and to protect their privacy.
Acknowledgements We wish to thank Martin Colley (Essex University), Hani Hagras (Essex University), Malcolm Lear (Essex University), Phil Bull (BT), Rowan Limb (BT), Brian Johnson (Intel) for their role in encouraging and motivating this work in various ways, not least by their stimulating discussions and papers. In addition we are pleased to acknowledge the DTI (Next Wave Technologies and Markets programme) who funded the original phase of the work, Chimera (Institute for Socio-Technical Research) who advised on the evaluation and, finally, Essex University who funded the PhD scholarship.
References

[1] Augusto, Juan Carlos; Nugent, Chris D. (Eds.) Designing Smart Homes: The Role of Artificial Intelligence, Lecture Notes in Computer Science (http://www.springer.com/series/558), Vol. 4008, 2006, ISBN: 978-3-540-35994-4
[2] O. Babaoglu et al., Anthill: A Framework for the Development of Agent-based Peer-to-Peer Systems, 22nd International Conference on Distributed Computing Systems, 2003.
[3] Mathias Bauer, Dietmar Dengler, Gabriele Paul, Markus Meyer, Programming by example: programming by demonstration for information agents, Communications of the ACM, Volume 43, Issue 3 (March 2000), pp. 98–103
[4] Berners-Lee T, Hendler J, Lassila O. The Semantic Web, Scientific American, May 2001
[5] Barkhuus, L., Vallgårda, A: Smart Home in Your Pocket, Adjunct Proceedings of UbiComp 2003 (2003) 165–166
[6] Alan F. Blackwell, Rob Hague, Designing a Programming Language for Home Automation, in G. Kadoda (Ed.) Proc PPIG 2001, pp. 85–103.
[7] Barry Brumitt, Brian Meyers, John Krumm, Amanda Kern and Steven A. Shafer, EasyLiving: Technologies for Intelligent Environments, Proc of the 2nd International Symposium on Handheld and Ubiquitous Computing, 2000, 12–29
[8] Callaghan V, Clarke G, Colley M, Hagras H, Chin JSY, Doctor F, Intelligent Inhabited Environments, BT Technology Journal, Vol. 22, No. 3, Kluwer Academic Publishers, Dordrecht, Netherlands, July 2004
[9] Callaghan, V., Chin, J.S.Y., Shahi, A., Zamudio, V., Clarke, G.S., Gardner, M., ‘Domestic Pervasive Information Systems: End-user programming of digital homes’, Journal of Management Information Systems, Special Edition on Pervasive Information Systems (volume editors Panos Kourouthanassis; George Giaglis), 24:01 (December 2007) pp. 129–149, M.E. Sharpe (New York), ISBN: 978-0-7656-1689-0
[10] Callaghan V, Clarke G, Chin J, Some Socio-Technical Aspects Of Intelligent Buildings and Pervasive Computing Research, Intelligent Buildings International Journal, Vol 1, No 1, Autumn 2008, ISSN: 1750-8975, E-ISSN: 1756-6932
[11] Chen H, Finin T, Joshi A. SOUPA: Standard Ontology for Ubiquitous and Pervasive Applications, Int’l Conference on Mobile & Ubiquitous Systems: Networking & Services (MobiQuitous 2004), Boston, Massachusetts, USA, August 22–26, 2004
[12] G. Chen and D. Kotz. A survey of context-aware mobile computing research, Paper TR2000-381, Department of Computer Science, Dartmouth College, November 2000.
[13] Chin J, Callaghan V, Clarke G, A Programming-By-Example Approach To Customising Digital Homes, IET International Conference on Intelligent Environments 2008, Seattle, 21–22 July 2008
[14] Chin J, Callaghan V, Clarke G, Soft-appliances: A vision for user created networked appliances in digital homes, Journal of Ambient Intelligence and Smart Environments 1 (2009) 65–71, DOI 10.3233/AIS-2009-0010, IOS Press, 2009
[15] Kook Hyun Chung, Kyoung Soon Oh, Cheong Hyun Lee, Jae Hyun Park, Sunae Kim, Soon Hee Kim, Beth Loring, Chris Hass, A User-Centric Approach to Designing Home Network Devices, CHI ’03 Extended Abstracts on Human Factors in Computing Systems, 2003, 648–649
[16] D.J. Cook, M. Huber, K. Gopalratnam and M. Youngblood, Learning to Control a Smart Home Environment, Innovative Applications of Artificial Intelligence, 2003
[17] Diane J. Cook, Michael Youngblood and Sajal K. Das, A Multi-agent Approach to Controlling a Smart Environment, pp. 165–182, in Designing Smart Homes: The Role of Artificial Intelligence, ISBN 978-3-540-35994-4, July 2006
[18] Coolican H. Research Methods and Statistics in Psychology (2nd Edition), Hodder and Stoughton, 1994
[19] Cypher A, Halbert DC, Kurlander D, Lieberman H, Maulsby D, Myers BA, and Turransky A, Watch What I Do: Programming by Demonstration, The MIT Press, Cambridge, Massachusetts, London, England, 1993
[20] DiDuca D and Van Helvert J. User Experience of Intelligent Buildings; A User-Centered Research Framework, Intelligent Environments 2005, Essex, 28–29th June 2005
[21] N. Dulay, S. Heeps, E. Lupu, R. Mathur, O. Sharma, M. Sloman and J. Sventek, AMUSE: Autonomic Management of Ubiquitous e-Health Systems, Proc. UK e-Science All Hands Meeting, Nottingham, Sept. 2005
[22] Duman H, Callaghan V, Hagras H, Intelligent Association Selection of Embedded Agents in Intelligent Inhabited Environments, Journal of Pervasive and Mobile Computing, vol. 3, issue 2, 2007, 117–157
[23] Gajos K., Fox H., Shrobe H., End User Empowerment in Human Centred Pervasive Computing, Pervasive 2002 (http://www.pervasive2002.org/), Zurich, Switzerland, 2002
[24] Halbert D.C. SmallStar: Programming by Demonstration in the Desktop Metaphor, in Watch What I Do, MIT Press, 1993
[25] V. Haarslev and R. Moller, Description of the RACER system and its application, In proceedings International Workshop on Description Logics (DL2001), 2001
[26] Hague, R., et al: Towards Pervasive End-user Programming. In: Adjunct Proceedings of UbiComp 2003 (2003) 169–170
[27] Humble, J. et al, Playing with the Bits: User-Configuration of Ubiquitous Domestic Environments, Proceedings of UbiComp 2003, Springer-Verlag, Berlin Heidelberg New York (2003), pp. 256–263
[28] Johnson B, Callaghan V, Gardner G, Bespoke Appliances for the Digital Home, IET International Conference on Intelligent Environments 2008, Seattle, 21–22 July 2008
[29] Kainulainen L, Reasoning in The Smart Home, AIOME, 2006
[30] C. Kidd, R. Gregory, A. Christopher, T. Starner. The Aware Home: A Living Laboratory for Ubiquitous Computing Research, In Proceedings of the Second International Workshop on Cooperative Buildings (CoBuild’99), October 1999
[31] Lieberman H., Bonnie A. Nardi and David J. Wright, Training Agents to Recognize Text by Example, Proceedings of the Third Annual Conference on Autonomous Agents, Seattle, Washington, United States, 1999, pp. 116–122
[32] Lieberman H, Your Wish is My Command: Programming by Example, Morgan Kaufmann, 2001.
[33] Masuoka R, Parsia B, Labrou Y. Task Computing – the Semantic Web meets Pervasive Computing, 2nd Int’l Semantic Web Conf (ISWC2003), 20–23 Oct 2003, Florida, USA
[34] M. Luck et al., Agent Technology: Enabling Next Generation Computing, AgentLink II, 2003
[35] Mäyrä F, Soronen A, Vanhala J, Mikkonen J, Zakrzewski M, Koskinen I, Kuusela K, Probing a Proactive Home: Challenges in Researching and Designing Everyday Smart Environments, Human Technology Journal, Volume 2 (2), October 2006, 158–186
[36] J. McCann, P. Kristoffersson and E. Alonso, Building Ambient Intelligence into a Ubiquitous Computing Management System, Proc. International Symposium of Santa Caterina on Challenges in the Internet and Interdisciplinary Research (SSCCII-2004), Amalfi, Italy, Jan. 2004
[37] McDaniel R., Demonstrating the hidden features that make an application work, in Your Wish is My Command: Programming by Example, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001, pp. 163–174
[38] N. Minar, M. Gray, P. Maes. Hive: Distributed Agents for Networking Things, In Proceedings of ASA/MA’99, the First International Symposium on Agents Systems and Applications and Third International Symposium on Mobile Agents, 1999
[39] Mozer, M. C. The Neural Network House: An Environment that Adapts to its Inhabitants, In M. Coen (Ed.), Proceedings of the American Association for Artificial Intelligence Spring Symposium on Intelligent Environments (pp. 110–114), Menlo Park, CA: AAAI Press, 1998
[40] Myers B.A., Creating user interfaces using programming by example, visual programming, and constraints, ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 12, Issue 2 (April 1990), pp. 143–177
[41] O’Neill, E. and Johnson, P. (2004) Participatory Task Modelling: users and developers modelling users’ tasks and domains, 3rd International Workshop on Task Models and Diagrams for User Interface Design (TAMODIA 04), Prague, Czech Republic, 15–16th November 2004
[42] Oppenheim, A N. Questionnaire Design, Interviewing and Attitude Measurement, Pinter Publishers Ltd, 1992
[43] Carsten Röcker, Maddy D. Janse, Nathalie Portolan and Norbert Streitz, User Requirements for Intelligent Home Environments: A Scenario-Driven Approach and Empirical Cross-Cultural Study, Joint sOc-EUSAI Conference, 2004, 111–116
[44] B. Schilit, N. Adams, and R. Want. Context-aware computing applications, In Proceedings of the 1st International Workshop on Mobile Computing Systems and Applications, 1994
[45] Smith, D. C., Pygmalion: A Computer Program to Model and Stimulate Creative Thought (1975 Stanford PhD thesis), Birkhauser Verlag, 1977.
[46] Sugiura A, Koseki Y. Internet Scrapbook: automating Web browsing tasks by demonstration, Proceedings of the 11th Annual ACM Symposium on User Interface Software and Technology, San Francisco, California, United States, 1998, pp. 9–18
[47] Truong, K.N. et al, CAMP: A Magnetic Poetry Interface for End-User Programming of Capture Applications for the Home, Proceedings of UbiComp 2004, pp. 143–160.
[48] Vastenburg M, SitMod: a tool for modelling and communicating situations, Second International Conference, PERVASIVE 2004, Vienna, Austria, April 21–23, 2004, ISBN: 3-540-21835-1
[49] Wang Z, Garlan D. Task-Driven Computing, Technical Report CMU-CS-00-154, Computer Science, Carnegie Mellon University, May 2000
[50] Zamudio V and Callaghan V, Facilitating the Ambient Intelligent Vision: A Theorem, Representation and Solution for Instability in Rule-Based Multi-Agent Systems, International Transactions on Systems Science and Applications, Volume 4, Number 2, July 2008
[51] Young Zou, Harry Chen, and Tim Finin, F-OWL: an Inference Engine for Semantic Web, Proceedings of the Third NASA-Goddard/IEEE Workshop on Formal Approaches to Agent-Based Systems, 26 April 2004
[52] http://agentsheets.com/products/index.html
[53] http://jena.sourceforge.net/
[54] http://www.networkinference.com/Products/Construct.html
[55] http://www.lisi.ensma.fr/ihm/index-en.html
[56] http://pervasive.semanticweb.org/soupa-2004-06.html
[57] http://www.stagecast.com/creator.html
[58] http://www.toontalk.com/
[59] http://iieg.essex.ac.uk/dcomp/ont/dev/2004/05/
[60] http://www.w3.org/Submission/2004/07/
Intelligent Interfaces to Empower People with Disabilities

Margrit Betke
Department of Computer Science, Boston University, Boston, MA 02215, USA
e-mail: betke@cs.bu.edu, www.cs.bu.edu/faculty/betke
1 Introduction

Severe motion impairments can result from non-progressive disorders, such as cerebral palsy, or degenerative neurological diseases, such as Amyotrophic Lateral Sclerosis (ALS), Multiple Sclerosis (MS), or muscular dystrophy (MD). They can also be due to traumatic brain injuries, for example from a traffic accident, or to brainstem strokes [9, 84]. Worldwide, these disorders affect millions of individuals of all races and ethnic backgrounds [4, 75, 52]. Because disease onset of MS and ALS typically occurs in adulthood, afflicted people are usually computer literate. Intelligent interfaces can immensely improve their daily lives by allowing them to communicate and participate in the information society, for example, by browsing the web, posting messages, or emailing friends. However, people with advanced ALS, MS, or MD may reach a point when they cannot control the keyboard and mouse anymore and also cannot rely on automated voice recognition because their speech has become slurred. People with severe cerebral palsy or traumatic brain injury are often quadriplegic and nonverbal and cannot use the standard keyboard and mouse, or a voice recognition system. In particular, children born with severe cerebral palsy can greatly benefit from assistive technology. For them, access to computer technology is empowering. It means access to communication and education, regardless of whether their disabilities are purely physical or also cognitive. To help children and adults with severe disabilities, we initiated the Camera Mouse project several years ago [22]. The project is now a joint effort of researchers at Boston University and Boston College in the United States and has resulted in several camera-based mouse-replacement systems, as well as a system that enables users to create their own gesture-based communication environments.
Fig. 1 A Camera Mouse user who is quadriplegic due to cerebral palsy. The user moves the mouse pointer to the different regions on the screen by slightly moving his head. In the left upper region of the screen, a window with the Camera Mouse interface can be seen that contains the currently processed video frame showing the user’s face. The camera that captured this frame is attached to the top of the monitor (black).
We here present an overview of these systems (Table 1 and Sections 3, 4 and 5) and the assistive software applications that users can access with these systems (Section 6). Among our interface systems, the generic mouse-replacement system Camera Mouse made the transition from a basic research platform to a popular assistive technology several years ago and is now used worldwide (Figure 1 and Section 2). We also give an overview of smart environments and interface systems for people with disabilities that others have developed (Section 3) and conclude with a discussion of the strengths and limitations of current approaches and the need for future smart environments (Section 7).

Table 1 Camera-based Interface Systems for Users with Severe Motion Impairments

System            Action Mechanism                   Main References
Camera Mouse      Move body part, e.g., head,        Betke, Gips, and Fleming [13], Betke [10]
                  foot, or finger
BlinkLink         Control blink length               Grauman, Betke, Lombardi, Gips, and Bradski [44]
EyebrowRaiser     Raise eyebrows                     Lombardi and Betke [66]
EyeKeys           Control horizontal gaze direction  Magee, Betke, Gips, Scott, and Waber [70]
FingerCounter     Move fingers                       Crampton and Betke [27]
HeadTiltDetector  Tilt head left or right            Waber, Magee, and Betke [107]
SymbolDesign      Move body part                     Betke, Gusyatin, and Urinson [14]
2 The Camera Mouse

The Camera Mouse is used by children and adults of all ages in schools for children with physical or learning disabilities, long-term care facilities for people with advanced neurological diseases, hospitals, and private homes. The Camera Mouse is a camera-based mouse-replacement system that tracks the computer user’s movements, typically his or her head or facial movements, and translates them into movements of the mouse pointer on the screen [13]. To mimic the functionality of a left mouse click, the Camera Mouse evaluates the length of time that the pointer dwells over an icon, button, or menu item (or its surrounding region). If the dwell time is measured to exceed the threshold set by the user, e.g., 0.5 seconds, a selection command is issued. The Camera Mouse project has worked with various approaches to tracking, e.g., optical flow, template matching with the normalized correlation coefficient as the matching function, and Kalman filtering [13, 33, 26], and with different video cameras [24]. The current version uses optical flow and works with many standard USB video cameras (webcams) from, for example, Logitech, Creative Labs, and Microsoft [22]. Early experiences of users, ages 2–58, working with the Camera Mouse were described by Gips and Gips [38], Gips, Betke, and DiMattia [39], and Cloud, Betke, and Gips [25]. At that time, a couple of dozen systems were in use in the northeastern United States and a school in England. Akram, Tiberii, and Betke [3] described more recent experiences with a version of the Camera Mouse that calibrates the mapping from feature movement to pointer movement and thus enables users to navigate to all areas of the screen with very limited physical movement. Our research team has built partnerships with care facilities, such as schools for children with physical or learning disabilities [17], long-term care facilities for people with Multiple Sclerosis and ALS [18], hospitals [23], and private homes [99]. In 2003, 26 schools in Northern Ireland obtained Camera Mouse systems. A breakthrough increase in the number of people who could be helped with the Camera Mouse occurred when our team decided to offer the software for free on the internet. Since it was launched in April 2007, the website www.cameramouse.org has had more than 2,500 downloads a month. The technology’s broad impact on the lives of people with disabilities is perhaps best illustrated by testimonials, for example, the words of Camera Mouse users from Turkey and Australia (Figure 2).
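The chapter does not give the Camera Mouse source code, but the dwell-click rule and the correlation-based tracking described above can be illustrated with a short, hypothetical OpenCV sketch. The initial template location, the dwell radius, and the use of a console message in place of a real mouse click are assumptions; an actual mouse-replacement system would also map the tracked feature to screen coordinates and drive the operating-system pointer.

```python
# Illustrative sketch only (not the Camera Mouse implementation): track a
# small facial template by normalised cross-correlation and report a "click"
# when the tracked point dwells in a small neighbourhood for 0.5 seconds.
import time
import cv2
import numpy as np

DWELL_SECONDS = 0.5     # dwell threshold mentioned in the text
DWELL_RADIUS = 15       # pixels; assumed value

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
template = gray[200:240, 300:340]        # assumed initial feature (e.g. nose tip)

anchor, dwell_start = None, None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    scores = cv2.matchTemplate(gray, template, cv2.TM_CCOEFF_NORMED)
    _, _, _, best = cv2.minMaxLoc(scores)            # best-match location
    pointer = np.array(best, dtype=float)            # stands in for the pointer

    if anchor is None or np.linalg.norm(pointer - anchor) > DWELL_RADIUS:
        anchor, dwell_start = pointer, time.time()   # pointer moved: restart dwell
    elif time.time() - dwell_start > DWELL_SECONDS:
        print("dwell click at", pointer)             # selection command issued
        anchor, dwell_start = None, None
```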
Sent: Thursday, July 27, 2006 1:46 AM
Subject: I’m grateful to you
I’m so grateful to you. Because I’m a MS (Multiple Sclerosis) patient since 20 years. I can’t move my finger. However I’m in internet fr 10 hour at this time. Thank you very much for everything. Sincerely Dr. M. U.

Sent: Thursday, 4 September 2008 16:53:58 +1000
G’day, [..] a great big thank you from all the way over here in Australia. I have found your fantastic program while looking for an alternative to the expensive trakir and other type head tracking software. [..] I already have a degree of lack of movement in my right arm, shoulder, through to my elbow which also incorporates my ability to grip tightly with my right hand. I am right handed so this is a real pain [..] as well as my condition giving me great pain in movement. I do not take analgesics, I try and keep surfing and swimming and try to keep moving the parts as much as possible. What I am finding is that the program works very well, I am using my daughter’s ASUS lap top with built in camera, I can control the cursur OK and feel with more practice will become very proficient. I thank you and your team for your outstanding work in this area. Best Regards to you all Over there. L. P.

Fig. 2 Emails from Camera Mouse users with motion impairments.

3 Intelligent Devices for People with Motion Impairments

The most basic interface for people with severe motion impairments is a binary switch. This can be a mechanical touch switch, such as a hit plate, a wobble stick, a grip handle, a pinch plate, a sip-and-puff switch, or a pull string. It can also be an electrical switch, such as a voice-activated switch if the person can control making a sound, a tongue-movement detector, or a motion-sensitive photocell or infrared
switch. These switches allow users with some motor control to operate wheelchairs or activate selection commands in assistive software [104, 30, 1, 64, 116, 7]. People with cerebral palsy who have some control of their hand movements may be able to use the standard computer mouse as an input device. However, they often have difficulties in positioning the mouse pointer accurately [111, 114]. In an empirical study involving six children with cerebral palsy, Wu and Chen [114] showed that a joystick was a more effective input device than a trackball or standard mouse. In another study involving four people with muscular dystrophy, Myers et al [73] showed the effectiveness of a stylus-based hand-held personal digital assistant (PDA or Palm) combined with the assistive software Pebbles. Severe paralysis may leave the eyes as the only muscles that a person can control. Such an individual may be able to control the computer via interfaces that detect eye movements, for example via head-mounted cameras, which typically process infrared signals [76, 32, 65], or via electro-oculography. Systems based on electro-oculography use electrodes placed on the face to detect the movements of the eyes [7, 29, 119]. Eye-movement detection systems are typically more powerful communication devices than binary switches because they can replace some of the functionality of a standard computer mouse. Another mouse alternative for users with some ability to move their head is an infrared head-pointing device [31, 49, 101, 93]. Many eye trackers require computer users to wear head-mounted equipment, for example, glasses with attached infrared transmitters, or to place a reflective “dot” on their foreheads that the infrared system can track. This is often perceived as intrusive. Electro-oculographic systems can be even more uncomfortable, because electrodes must be attached to the user’s skin. Users with severe disabilities prefer non-intrusive communication interfaces that passively observe them and do not require any equipment attached to the body. Camera-based systems are therefore particularly suited as intelligent interfaces for people with severe disabilities. Many current camera-based interfaces make use of active infrared illumination [50, 55, 57, 72, 74, 80, 87, 118]. Typically, the infrared light reflects off the retina and causes the pupil to appear extremely bright in the image, similarly to the “red-eye effect” that occurs in flash photography. The light is synchronized with the camera to illuminate the eyes in alternate frames. The eyes are then located by comparing a frame with this bright-pupil effect with a subsequent frame without the illuminated pupils. Typically, the relative gaze direction of an eye is found by analyzing the difference between the center of the bright-pupil pixel area and the reflection of the light source off the surface of the eye. Some people are concerned about the high costs of infrared eye trackers, the cognitive challenge posed by such systems, especially for small children or people with physical and cognitive disabilities, and the long-term effect of infrared illumination on the eye. Illuminating the environment instead of the user with infrared light is the strategy of the wearable “gesture pendant,” a small wearable camera ringed with infrared light-emitting diodes [37]. The system was designed to control household devices with a hand gesture and may eventually help the elderly or people with disabilities achieve more independence in their homes. Many other human-computer interaction systems have been proposed that gather data on human motion via video cameras, body suits or gloves, infrared or magnetic markers, and head and eye trackers [61]. The focus of the computer vision community has been on the development of methods for detecting and tracking the motion of people without disabilities and analyzing their gestures [86, 54, 103]. These methods, however, may eventually also be used to help people with disabilities. Particularly useful are shape, appearance, and motion models of the human body, in particular for the head and face [e.g. 117, 63], the hands and fingers [e.g. 108, 6, 45], and the eyes [e.g. 15, 46, 115, 53, 91, 100, 109]. We have focused on smart environments that interpret the user’s movements as communication messages, in particular, movements of the user’s eyes [70, 89, 12, 11], eyebrows [44, 66], mouth [89], head [107] or fingers [27, 41]. Others have developed camera-based human-computer interaction systems to track the user’s nose [e.g. 42], eyes [e.g. 58, 28, 59], or hands [e.g. 19, 113, 36].
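The bright-pupil scheme just described amounts to a frame difference followed by blob detection. The fragment below is a minimal sketch under the assumption that synchronised bright-pupil and dark-pupil grayscale frames are already available; the threshold value is an arbitrary choice, not a value from any of the cited systems.

```python
# Sketch of bright-pupil / dark-pupil differencing; the threshold value and
# the availability of synchronised frames are assumptions.
import cv2

def locate_pupils(bright_frame, dark_frame, threshold=60):
    """Return centroids of candidate pupil blobs in the difference image."""
    diff = cv2.subtract(bright_frame, dark_frame)        # bright pupils remain
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for contour in contours:
        m = cv2.moments(contour)
        if m["m00"] > 0:
            centroids.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return centroids
```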
If users have the ability to make voluntary blinks or winks, they may be assisted by smart environments that can detect intentional eye closures and interpret them as selection commands [e.g. 98, 85, 43, 44].
4 Camera-Based Music Therapy for People with Motion Impairments

Moving with music can have an important therapeutic effect on people; playing or creating music while performing movement exercises seems to have an even stronger therapeutic benefit [e.g. 81, 78]. For people with cognitive or severe physical disabilities, however, traditional music instruments may be difficult to play. We therefore developed a smart environment, called Music Maker, in which people can create music while performing therapeutic exercises [41]. In the Music Maker environment, a video camera observes the individual’s gestures, for example, a reach-and-grip movement of his or her hand. The system interprets the spatial-temporal characteristics of the movement and provides musical and visual feedback. The user shown in Figure 3, for example, is performing a hand-opening-and-closing exercise and receives visual feedback on a projection screen. Music plays during the exercise and is interrupted whenever the user is not opening and closing her fists at a certain minimal speed. By analyzing the motion pattern and trajectory that a body part follows during an exercise, the Music Maker environment provides a quantitative tool for monitoring a person’s potential recovery process and assessing therapeutic outcomes. In a small empirical study involving subjects without disabilities, we tested the potential of Music Maker as a rehabilitation tool. The subjects responded to or created music in various movement exercises, such as “keep the music playing,” “change the volume of the music” and “play a melody.” In these proof-of-concept experiments, Music Maker performed reliably and showed its promise as a therapeutic environment.

Fig. 3 Music Maker: A smart environment for music therapy. The user obtains visual and musical feedback about the movement of her hand (Fig. courtesy of Gorman, Lahav, Saltzman, and Betke [41]).
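As a rough illustration of the “keep the music playing” exercise, the sketch below gates feedback on a crude frame-difference motion-energy measure. The threshold and the print statements standing in for the audio and visual feedback are assumptions; this is not the Music Maker implementation, which analyses motion patterns and trajectories in much more detail.

```python
# Hedged sketch: pause the (placeholder) music whenever the measured motion
# energy drops below an assumed threshold, as in "keep the music playing".
import cv2
import numpy as np

MOTION_THRESHOLD = 4.0          # mean absolute pixel change; assumed value

cap = cv2.VideoCapture(0)
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    motion = float(np.mean(cv2.absdiff(gray, prev)))   # crude motion energy
    prev = gray
    if motion >= MOTION_THRESHOLD:
        print("movement fast enough: keep music playing")   # placeholder feedback
    else:
        print("movement too slow: pause music")
```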
5 Enabling Users to Create Their Own Smart Environments

We developed a system, called SymbolDesign, that allows people to design their own gestures, for example, single-stroke movement patterns they can make by moving their nose tip. With SymbolDesign, the user can match these gestures to a set of commands that control the computer [14]. For example, the user may decide to choose a zigzag pattern to indicate deletion of a word or a circling gesture to represent the selection command. SymbolDesign can be used to design interfaces for pen- or stylus-based input devices [e.g. 71] or for mouse-replacement systems such as the Camera Mouse. We focused on a user-centered design that is simple and intuitive, tries to minimize user input errors and recognition errors of the system, allows for flexibility, and includes redundancy. Instead of requiring users to memorize and practice predefined symbols, they can create their own interfaces with SymbolDesign. This is beneficial because with other symbol or shorthand recognition systems [e.g. 40, 62, 68] the user has to use predefined symbols, whose meanings are often not intuitive and which take time to learn. The ability of SymbolDesign to process user-defined gestures is particularly important for people with severe motion impairments, who can choose gestures based on their movement abilities. With SymbolDesign, users can create smart environments that “understand” the users’ spatio-temporal movement patterns and “learn” how the users typically perform or draw these patterns. The fact that the system does not have a priori information about the alphabet that a person will create makes the pattern classification problem challenging. SymbolDesign uses a dynamically created classifier, implemented as an artificial neural network, for the machine learning component of the environment. The architecture of the neural network automatically adjusts according to the complexity of the classification task. SymbolDesign thus creates environments in which an a priori unknown number of movement patterns of unknown appearance and structure can be distinguished efficiently. The issue of gesture similarity arises for anybody who designs a gesture alphabet [67]. An advantage of SymbolDesign is that users find out during training of the interface if they have inadvertently proposed movement patterns that are too similar for the classifier to distinguish reliably. If at least one motion pattern can be reliably recognized, the interface can control assistive software by serving as a binary switch.
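SymbolDesign builds its own dynamically sized neural network; the sketch below is a much simplified stand-in that conveys the general idea only: resample each recorded single-stroke gesture to a fixed-length, translation- and scale-normalised feature vector and train a small off-the-shelf network on the user’s own examples. The data layout and the fixed network size are assumptions.

```python
# Simplified stand-in for the SymbolDesign idea, not the original classifier.
# `examples` is assumed to be a list of (stroke, command_label) pairs, where a
# stroke is a sequence of (x, y) points recorded from the tracked body part.
import numpy as np
from sklearn.neural_network import MLPClassifier

def normalise(stroke, n_points=32):
    """Resample a stroke to n_points and remove translation and scale."""
    stroke = np.asarray(stroke, dtype=float)
    t = np.linspace(0.0, 1.0, len(stroke))
    t_new = np.linspace(0.0, 1.0, n_points)
    pts = np.stack([np.interp(t_new, t, stroke[:, 0]),
                    np.interp(t_new, t, stroke[:, 1])], axis=1)
    pts -= pts.mean(axis=0)                    # translation invariance
    scale = np.abs(pts).max()
    return (pts / scale if scale > 0 else pts).ravel()

def train_gesture_classifier(examples):
    """Fit a small neural network on the user's own gesture examples."""
    X = np.array([normalise(stroke) for stroke, _ in examples])
    y = [label for _, label in examples]
    return MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000).fit(X, y)

# Recognition at run time: clf.predict([normalise(new_stroke)]) returns the
# command label that the new movement pattern most resembles.
```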
6 Assistive Software for Users with Severe Motion Impairments

Assistive software may be categorized by the level of control of the pointer or cursor on the screen that it requires (see Table 2). Some assistive software only works with traditional input devices (mouse, joystick, trackball, stylus) that move the pointer and select items, or with pointer-replacement devices that provide the same functionality (e.g., the Camera Mouse). Other assistive software is specifically designed for alternative input devices that provide less functionality.
Table 2 Categorization of Assistive Software

Functionality          Input Devices                          Application Software
Pointing and clicking  Mouse, joystick, trackball, stylus,    General software
                       Camera Mouse, head tracker
Pointing only          Joystick, stylus, trackball, mouse,    Stylus-based text entry, e.g., using commands
                       Camera Mouse, head tracker             defined with SymbolDesign, games, etc.
Left click only        Binary switches                        Scan-based software, e.g. text entry, games, etc.
For example, binary switches can be used to select items in scan-based software, i.e., software that automatically scans through the choices, e.g., the letters of the alphabet. Scan-based applications can be created by programming tools called “wifsids,” short for “widgets for single-switch input devices” [96], and can adapt to the user’s needs by modifying the scan step size and the time interval between choices [16]. We engaged in discussions with computer users at various care facilities, for example, The Boston College Campus School [17], which serves students aged 3 to 21 with multiple disabilities, to learn about their most urgent computing needs. We also worked with computer users at The Boston Home [18], a not-for-profit specialized care residence for adults with progressive neurological diseases. Nonverbal people with severe motion impairments desire assistive software that enables them to have access to communication and information, i.e., text entry, email, and web browsing [10, 3]. They also enjoy playing various computer games [10]. A list of software that we and others have developed for users with severe motion impairments is given in Table 3. Our most popular text entry program is shown in Figure 4. With this software, computer users select letters in a two-step process – first the group of letters and then the desired letter within the group. The Camera Mouse is used regularly by students at The Boston College Campus School [17], e.g., by 13 children in Fall 2008. They play with educational games and use the ZAC Browser [120] to access the internet. Some children have cognitive disabilities in addition to severe physical disabilities. For them, the Camera Mouse is a tool to learn the basic principle of cause and effect. They learn that their own physical action (when recognized by the Camera Mouse) can initiate a change in the state of the world (here, the computer software). For these children, a Camera Mouse setup that uses a large projection screen to display the child’s interaction with application software is beneficial. The Camera Mouse control settings are operated by a caregiver on a separate computer monitor and can be changed during the computer session. The caregiver typically selects the tip of the user’s nose as the feature to be tracked because the nose is a convenient and robust feature for tracking with the Camera Mouse [25].
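To make the scanning and two-level selection ideas concrete, the sketch below simulates single-switch, two-level text entry: the program steps through letter groups at a fixed interval, one switch activation picks a group, and a second picks a letter within it. The group layout, the scan interval, and the simulated switch (specified here as the scan step at which it fires, instead of a real asynchronous switch signal) are all assumptions for illustration.

```python
# Minimal simulation of single-switch, two-level scanning text entry; the
# switch is simulated by the scan step at which it "fires".
import itertools
import time

GROUPS = ["ABCDE", "FGHIJ", "KLMNO", "PQRST", "UVWXYZ"]
SCAN_INTERVAL = 0.2      # seconds per highlighted choice; adapted per user [16]

def scan(choices, switch_fires_at_step):
    """Cycle through the choices, returning the one highlighted at selection."""
    for step, choice in enumerate(itertools.cycle(choices)):
        print("highlighting:", choice)
        time.sleep(SCAN_INTERVAL)
        if step == switch_fires_at_step:
            return choice

# Two switch activations are enough to enter a letter: the user lets the scan
# reach the third group ("KLMNO"), then the fourth letter within it ("N").
group = scan(GROUPS, switch_fires_at_step=2)
letter = scan(list(group), switch_fires_at_step=3)
print("entered letter:", letter)     # -> N
```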
Table 3 Assistive Software for Users of Our Intelligent Interfaces

Assistive Software             Interface System              Task                        References
Navigate Tasks                 Camera Mouse, binary switch   navigation framework        Akram et al [3]
                                                             (application mediator)
Midas Touch                    Camera Mouse                  text entry                  Betke et al [13]
Staggered Speech,              Camera Mouse                  two-level text entry        Betke [10], Gips and Gips [38]
Rick Hoydt Speller
Dasher                         Camera Mouse                  predictive text entry       Ward et al [110]
                                                             with language model
Web mediator, IWebExplorer     Camera Mouse                  web browsing                Waber et al [106], Paquette [79]
ZAC Browser                    Camera Mouse                  web browsing                ZAC Browser [120]
Scan-based on-screen keyboard  BlinkLink                     text entry                  Grauman et al [44]
Aliens                         Camera Mouse                  movement game               Betke et al [13]
BlockEscape                    EyeKeys, Camera Mouse         movement game               Magee et al [70]
Frog'n Fly                     EyebrowRaiser, BlinkLink      movement game               Grauman et al [44]
Matching images                EyebrowRaiser, BlinkLink      memory game                 Grauman et al [44]
Paint                          FingerCounter, Camera Mouse   painting                    Crampton and Betke [27]
Camera Canvas                  Camera Mouse                  image editing               Kim et al [60]
Animate!                       Camera Mouse                  creating animations         Betke [10]
We found that adults who use the Camera Mouse with various application programs like to be able to control the selection of the applications themselves. They prefer to work with the computer as independently from a caregiver as possible. We therefore developed an application mediator system, Navigate Tasks [2]. It allows Camera Mouse users to navigate between different activities such as browsing the internet, posting emails, or listening to music. For adult computer users at The Boston Home [18], we developed the web browser IWebExplorer [79]. It enables a Camera Mouse user to type a web address via an on-screen keyboard (Figure 5). It also asks the user for confirmation whether he or she indeed wanted to follow a selected link (Figure 5). We added this confirmation mechanism to the browser after noticing that users of standard web browsers often followed links unintentionally. Nearby links were falsely selected because they were located so closely together on a web page that they fell within the same area in which the mouse pointer had to dwell to initiate a “left-click” command. Based on our experience with the IWebExplorer, we developed a web mediator that alters the display of a web page so that the fonts of links become larger [106]. Spatially close web links are grouped to allow users to quickly tab through the groups of links and then the links within these groups. In experiments, we saw a significant reduction in the number of commands needed on average to select a link, which greatly improved the rate of communication. Guidelines for accessible web pages, such as those developed by the European project on world-wide augmentative and alternative communication [83], may help encourage web developers to design pages that are easily convertible by web mediators.

Fig. 4 A user of the two-level text entry program Staggered Speech. The first level of the on-screen keyboard is shown in both images. The user, a computer scientist who suffered from advanced ALS, took advantage of the Camera Mouse to be on the internet for many hours a day. He composed emails with Staggered Speech and thereby had a tool to keep in touch with his friends.

Fig. 5 Assistive windows of the web browser IWebExplorer include an on-screen keyboard with special keys for browsing and a window to confirm the selection of a web link.

For quadriplegic people, being able to dance may be an unrealizable dream. However, although they cannot control the movement of their own hands and feet, they may enjoy using Animate! [5] to control the hands and feet of an anthropomorphic figure. By guiding how the arm, leg, torso, and head segments of the avatar should move from frame to frame, they can, for example, choreograph a dance for the avatar (Figure 6). Animate! thus provides people with severe physical disabilities a tool to explore their creative imagination.

Fig. 6 Animate! is a program that enables people with severe motion impairments to create video animations of an anthropomorphic figure. Its interface has large selection buttons suitable for Camera Mouse use to select commands such as “open a frame,” “create a new frame,” “save the current frame,” “show the previous frame,” “show the next frame,” “rotate the current frame,” and “play all stored frames consecutively.” The two image frames shown on the right are part of a video of a dancing figure created with the animation program.

Kim, Kwan, Fedyuk, and Betke [60] recently developed Camera Canvas, an image editing tool for users with severe motion impairments. They studied the solutions for image editing and content creation found in commercial software to design a system that provides many of the same functionalities, such as cropping subimages and pasting them into other images, zooming, rotating, adjusting the image brightness or contrast, and drawing with a colored pen; see Fig. 7. Experiments with 20 subjects without disabilities, each testing Camera Canvas with the Camera Mouse as the input mechanism, showed that users found the software easy to understand and operate. In recent experiments with a user with severe cerebral palsy, we found that he was able to select and enlarge a picture. The user, a young man whose family maintains a website for him, was excited about the potential opportunity to manipulate and create imagery. However, for this user, the procedure to select an item, which involves up or down movement of the menu bar at the left of the graphical interface (see Fig. 7), was too fast and needed to be slowed down. A change of the interface layout, in particular a horizontal instead of a vertical menu bar, may also be beneficial for this user, because he was able to control his side-to-side head movements much better than his up-and-down movements.
7 Discussion We have given an overview of the intelligent devices and software applications that we have developed in the past years to empower people with severe disabilities to participate in the information society. Based on our experience in developing these environments and working with people who use them, we suggest that truly multidisciplinary approaches must be taken to build smarter, more powerful environments for people with motion impairments. There are interesting research goals in each of the four disciplines that collectively contribute to the knowledge base of intelligent
420
Margrit Betke
Fig. 7 Image-editing tool C AMERA C ANVAS . Left: A Camera Mouse user selected the 400%zoom option while working with a sunflower picture. Right: To crop and copy an image region, the user must select the position of the upper left corner of the region and the size (i.e., position of lower right corner) of the region by moving the mouse pointer onto one of the arrows.
interface design (Figure 8); significant progress, however, may only be possible if a researcher addressing one task takes into account how the solutions affect and depend on the other tasks. Here we discuss some of the relevant state-of-the-art approaches within the four disciplines and their strengths and limitations.
Computer vision: Develop methods to reliably interpret videos of people with disabilities and their voluntary and involuntary movements.
Human-computer interaction: Invent intelligent devices, develop multimodal interfaces, create viable, potentially user-defined gesture alphabets.
Assistive software: Increase the rate of communication of text entry, improve universal access to the web, create software that automatically adapts to the specific needs of a user.
Social sciences: Bring smart environments to the people who need them – people with disabilities and their caregivers and teachers. Develop methods to teach with smart environments.
Fig. 8 Disciplines involved in design of intelligent interfaces for people with severe disabilities and main research tasks.
7.1 Successful Design of Camera-based Interfaces for People with Disabilities Requires a User-oriented Approach to Computer Vision Research

The advent of fast processors and inexpensive webcams has contributed to significant advances in computer vision. There has been explosive growth in the number of camera-based systems that analyze human motion, but formal evaluation of these computer-vision systems based on user studies remains insufficient [20, 82]. Empirical user studies, if conducted at all, sometimes simply recruit subjects from the local population of computer science or engineering students [e.g. 92]. Studies of computer-vision systems that involve users with motion impairments are rare and usually narrow in scope. Such studies can be very valuable, because they can bring out the limitations of face trackers, eye detectors, human movement models, etc., when people with motion impairments must rely on them. Magee et al [70], for example, reported that a computer-vision system failed to track the face of a woman with advanced multiple sclerosis, who could not keep her head in an upright position. Another subject had difficulty keeping one eyelid open, which resulted in failure of a gaze-detection system. In scenarios like these, it must be stressed that the source of the problem was not the users’ inability to conform to the camera-based interface but, conversely, the inability of the interface to adapt to the needs of the users.
Camera-based smart environments that are based on human motion models must be able to automatically interpret the stiff and jerky movements, the involuntary grimacing and tongue thrusting, and the tremors to which some people with severe disabilities are prone and that may increase during periods of emotional stress. We have observed children with spastic cerebral palsy, for example, who became so excited about a computer game they were playing with the Camera Mouse that they involuntarily thrust their heads over their shoulders and then could not move for a while.
7.2 Designing Interfaces for People with Disabilities – Research Efforts in Human-Computer Interaction

In the area of human-computer interaction, research communities, for example those created by the European Research Consortium for Informatics and Mathematics “User Interfaces for All” (ERCIM UI4All) [UI4All] or the ACM Special Interest Group on Accessible Computing [90], have been developing new assistive devices and methods for universal access to these devices. The new generation of assistive technologies need not involve customized, expensive electro-mechanical devices [21]. Even multi-modal systems, i.e., interfaces involving more than one of a user’s senses (typically sight, hearing, and touch, but also smell), need not be expensive
[e.g. 102]. Paradigms have been formulated that provide theoretical frameworks for the principles, assumptions, and practices of accessible and universal design [95, 83]. Human-centered computing has become an important research area in computer science [77]. It includes efforts to build smart environments that users can control with gestures or symbols [e.g. 113, 37, 36]. Environments that enable users to create their own gesture alphabets have been shown to be particularly appreciated by users [67, 14, 97].
7.3 Progress in Assistive Software Development

Central goals of assistive software development have been to increase the rate of communication for people with disabilities and to improve universal access to the web. Being able to interact with computers to send emails and browse the web can improve the quality of life of these people significantly.
A great advance in increasing the rate of communication of text entry has been made by predictive text-entry systems. While a user is typing, such systems offer word completions based on a language model. Dasher, for example, a well-known predictive text-entry interface [110], is driven by continuous pointing gestures that can be made with mouse-replacement interfaces such as the Camera Mouse. Hansen et al [47] developed GazeTalk, a system that creates dynamic keyboards for several target languages, including Danish, English, German, Italian, and Japanese, and is used with head or eye trackers. Another predictive text-entry software is UKO-II by Harbush and Kühn [48]. Letter and word prediction was also used by MacKenzie and Zhang [69] in a system that enabled users to enter text by fixating their eyes on highlighted on-screen keyboard regions. The system detected gaze direction via the head-fixed eye tracker ViewPoint [105].
A comparison of the mouse-pointer trajectories of users with and without movement impairments revealed that some users with impairments pause the pointer more often and require up to five times as many submovements to complete the same task as users without impairments [51]. Wobbrock and Gajos [111] stressed the difficulty that people with motion impairments have in positioning the mouse pointer within a confined area to execute a click command. An important research focus in the area of assistive software has therefore been the development of strategies to support or alter mouse-pointer movement on the screen. Wobbrock and Gajos [111] suggested that “goal crossing,” i.e., selecting something by passing over a target line, is a mechanism that provides higher throughput for people with disabilities. Bates and Istance [8] created an interface for eye-based pointing devices that increases the size of a target on the screen, such as a button or menu item. Betke, Gusyatin, and Urinson [14] proposed to discretize user-defined pointer-movement gestures in order to extract “pivot points,” i.e., screen regions that the pointer travels to and dwells in. A related mechanism is the “gravity well,” which draws the mouse pointer into a target on the screen once the pointer is in proximity of the target [e.g. 16].
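To make the idea of language-model-driven word completion concrete, the following minimal Python sketch ranks completions of a typed prefix by corpus frequency. It is an illustration only; the function name, the toy frequency table, and the simple unigram model are assumptions and not part of any of the systems cited above, which use far richer language models.

    def complete(prefix, frequencies, k=3):
        # Return the k most frequent known words that start with the typed prefix.
        candidates = [w for w in frequencies if w.startswith(prefix)]
        return sorted(candidates, key=frequencies.get, reverse=True)[:k]

    # Toy unigram "language model": word -> corpus count (illustrative values).
    frequencies = {"hello": 120, "help": 95, "helmet": 12, "hero": 40}
    print(complete("hel", frequencies))   # ['hello', 'help', 'helmet']

A user of a mouse-replacement interface can then select one of the k suggestions with a single command instead of entering the remaining letters one by one.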
Empirical studies that quantitatively evaluate assistive software which supports or alters mouse-pointer movement on the screen typically measure the time it takes a user to issue a command (“target acquisition”) and describe the communication bandwidth using Fitts’ law [94, 112, 3]. Qualitative evaluation methods, including interviews with users and caregivers, are also beneficial [3]. We suggest that users perform a cost/benefit analysis when they encounter a new software system. Frankish et al [35], for example, found that for the task of entering a name and phone number into a virtual fax form, users accepted the costs associated with errors of a handwriting recognition system, but for the task of entering text into a virtual appointment book, they did not. Users may be most inclined to work in smart environments when their physical conditions prevent them from using traditional input devices or when the environments adapt to match their physical abilities.
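For readers unfamiliar with the Fitts’ law evaluations mentioned above, the law models the time to acquire a target of width W at distance D as MT = a + b log2(D/W + 1), where a and b are empirically fitted constants, and throughput is the index of difficulty divided by the measured movement time. The sketch below is a generic illustration of this calculation; the constants are placeholders and do not come from any of the cited studies.

    import math

    def index_of_difficulty(distance, width):
        # Shannon formulation of Fitts' index of difficulty, in bits.
        return math.log2(distance / width + 1.0)

    def predicted_movement_time(distance, width, a=0.1, b=0.15):
        # Predicted acquisition time in seconds; a and b are device- and user-specific.
        return a + b * index_of_difficulty(distance, width)

    def throughput(distance, width, measured_time):
        # Throughput in bits per second for one observed target acquisition.
        return index_of_difficulty(distance, width) / measured_time

    # Example: a 20-pixel-wide button 400 pixels away, acquired in 1.8 seconds.
    print(index_of_difficulty(400, 20))   # about 4.39 bits
    print(throughput(400, 20, 1.8))       # about 2.4 bits/s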
7.4 The Importance of Computational Social Science to Intelligent Interface Design

The study of the social implications and applications of computer science has been advocated for decades [e.g. 88]. In recent years, computational social science has emerged as a research field. It may bring about a much-needed bridge between the fields of computer science and the social sciences, in particular special-needs pedagogy and elderly care.
A study by Forrester Research [34] for Microsoft Corporation predicted that the need for accessibility devices would grow in the future due to the increase in computer users above the age of 65 and the increase in the average age of computer users. The study also estimated that about 22.6 million computer users are currently very likely to benefit from accessible technology. The internet has made it easier for interface developers to inform people with disabilities and their caregivers and teachers about smart environments that they may benefit from. Nonetheless, much effort is still needed to ease the acceptance and use of smart environments in schools, care facilities, and homes, and to develop methods to teach in smart environments.
Research on rehabilitation and disability has led to a new conceptual foundation for interpreting the phenomenon of disability, a “New Paradigm” of disability [56]. “The ‘old’ paradigm has presented disability as the result of a deficit in an individual that prevented the individual from performing certain functions or activities.” We have a “cultural habit of regarding the condition of the person, not the built environment [..] as the source of the problem.” The new paradigm recognizes the contextual aspect of disability: it is “the dynamic interaction between individual and environment over the lifespan that constitutes disability” [56]. A proliferation of smart human-computer environments that can adapt to the physical abilities of every person may help shift our societies’ models of disability to the new paradigm. We expect that such environments will cause us to be more inclusive and to accept the notion that physical differences among people are normal.
Acknowledgements The author would like to thank her collaborators and current and former students who have contributed to the development of the Camera Mouse and the assistive software described in this paper, in particular, Prof. James Gips at Boston College and Wajeeha Akram, Esra Cansizoglu, Michael Chau, Robyn Cloud, Caitlin Connor, Sam Epstein, Igor Fedyuk, Mikhail Gorman, Oleg Gusyatin, Won-Beom Kim, Chris Kwan, John Magee, Michelle Paquette, Matthew Scott, Maria Shugrina, Laura Tiberii, Mikhail Urinson, Ben Waber, and Emily Yu at Boston University. The author would also like to thank user Rick Hoyt and caregivers Maureen Gates and David Young Hong for their feedback on the Camera Mouse, and computer scientist Donald Green for his invaluable implementations. The published material is based upon work supported by the National Science Foundation under Grant IIS-0713229. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
References

[1] AbleNet Switches (2008) Roseville, MN, USA. http://www.ablenetinc.com
[2] Akram W, Tiberii L, Betke M (2006) A customizable camera-based human computer interaction system allowing people with disabilities autonomous hands free navigation of multiple computing tasks. In: Stephanidis C, Pieper M (eds) Universal Access in Ambient Intelligence Environments – 9th International ERCIM Workshop “User Interfaces For All” UI4ALL 2006, Königswinter, Germany, September 2006, Revised Papers. LNCS 4397, Springer-Verlag, pp 28–42
[3] Akram W, Tiberii L, Betke M (2008) Designing and evaluating video-based interfaces for users with motion impairments. Universal Access in the Information Society, in review
[4] ALS Association (2008) http://www.alsa.org
[5] Animate! (2008) Software for Camera Mouse users to create video animations of an anthropomorphic figure. http://csr.bu.edu/visiongraphics/CameraMouse/animate.html
[6] Athitsos V, Wang J, Sclaroff S, Betke M (2006) Detecting instances of shape classes that exhibit variable structure. In: Computer Vision – ECCV 2006, 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006, Proceedings, Part 1, LNCS, Vol. 3951, Springer Verlag, pp 121–134
[7] Barea R, Boquete L, Mazo M, López E (2002) System for assisted mobility using eye movements based on electrooculography. IEEE Transactions on Neural Systems and Rehabilitation Engineering 10(4):209–218
[8] Bates R, Istance H (2002) Zooming interfaces! Enhancing the performance of eye controlled pointing devices. In: Proceedings of the Fifth International ACM Conference on Assistive Technologies (Assets ’02), ACM, New York, NY, USA, pp 119–126
[9] Bauby JD (1997) The Diving Bell and the Butterfly. Vintage Books
[10] Betke M (2008) Camera-based interfaces and assistive software for people with severe motion impairments. In: Augusto J, Shapiro D, Aghajan H (eds)
Proceedings of the 3rd Workshop on “Artificial Intelligence Techniques for Ambient Intelligence” (AITAmI’08), Patras, Greece, 21st-22nd of July 2008. Co-located event of ECAI 2008, Springer-Verlag
[11] Betke M, Kawai J (1999) Gaze detection via self-organizing gray-scale units. In: Proceedings of the International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, IEEE, Kerkyra, Greece, pp 70–76
[12] Betke M, Mullally WJ, Magee J (2000) Active detection of eye scleras in real time. In: Proceedings of the IEEE Workshop on Human Modeling, Analysis and Synthesis, Hilton Head Island, SC
[13] Betke M, Gips J, Fleming P (2002) The Camera Mouse: Visual tracking of body features to provide computer access for people with severe disabilities. IEEE Transactions on Neural Systems and Rehabilitation Engineering 10(1):1–10
[14] Betke M, Gusyatin O, Urinson M (2006) SymbolDesign: A user-centered method to design pen-based interfaces and extend the functionality of pointer input devices. Universal Access in the Information Society 4(3):223–236
[15] Beymer D, Flickner M (2003) Eye gaze tracking using an active stereo head. In: Proceedings of the 2003 Conference on Computer Vision and Pattern Recognition (CVPR’03) - Volume II, Madison, Wisconsin, pp 451–458
[16] Biswas P, Samanta D (2007) Designing computer interface for physically challenged persons. In: Proceedings of the International Information Technology Conference (ICIT 2007), Rourkella, India, pp 161–166
[17] Boston College Campus School (2008) http://www.bc.edu/schools/lsoe/campsch
[18] Boston Home (2008) The Boston Home is a specialized care residence for adults with advanced multiple sclerosis and other progressive neurological diseases. http://thebostonhome.org
[19] Bouchard B, Roy P, Bouzouane A, Giroux S, Mihailidis A (2008) Towards an extension of the COACH task guidance system: Activity recognition of Alzheimer’s patients. In: Augusto J, Shapiro D, Aghajan H (eds) Proceedings of the 3rd Workshop on “Artificial Intelligence Techniques for Ambient Intelligence” (AITAmI’08), Patras, Greece, 21st-22nd of July 2008. Co-located event of ECAI 2008, Springer-Verlag
[20] Bowyer K, Phillips PJ (1998) Empirical Evaluation Techniques in Computer Vision. IEEE Computer Society Press, Los Alamitos, CA, USA
[21] Brown C (1992) Assistive technology computers and persons with disabilities. Communications of the ACM 35(5):36–45
[22] Camera Mouse (2008) A video-based mouse-replacement interface for people with severe motion impairments. http://www.cameramouse.org
[23] Center of Communication Disorders (2008) Children’s Hospital, Boston, USA. http://www.childrenshospital.org/clinicalservices/Site2016/mainpage/S2016P0.html
[24] Chau M, Betke M (2005) Real time eye tracking and blink detection with USB cameras. Tech. Rep. 2005-012, Computer Science Department, Boston University, http://www.cs.bu.edu/techreports/pdf/2005-012-blink-detection.pdf
[25] Cloud RL, Betke M, Gips J (2002) Experiments with a camera-based human-computer interface system. In: 7th ERCIM Workshop on User Interfaces for All, Paris, France, pp 103–110
[26] Connor C, Yu E, Magee J, Cansizoglu E, Epstein S, Betke M (2009) Movement and recovery analysis of a mouse-replacement interface for users with severe disabilities. In: Proceedings of the 13th International Conference on Human-Computer Interaction (HCI International 2009), San Diego, CA, in press
[27] Crampton S, Betke M (2003) Counting fingers in real time: A webcam-based human-computer interface with game applications. In: Proceedings of the Conference on Universal Access in Human-Computer Interaction (UA-HCI), Crete, Greece, pp 1357–1361
[28] Darrell T, Essa IA, Pentland A (1996) Task-specific gesture analysis in real time using interpolated views. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(12):1236–1242
[29] DiMattia P, Curran FX, Gips J (2001) An Eye Control Teaching Device for Students without Language Expressive Capacity – EagleEyes. The Edwin Mellen Press, see also http://www.bc.edu/eagleeyes
[30] Don Johnston Switches (2008) Volo, IL, USA. http://www.donjohnston.com
[31] Evans DG, Drew R, Blenkhorn P (2000) Controlling mouse pointer position using an infrared head-operated joystick. IEEE Transactions on Rehabilitation Engineering 8(1):107–117
[32] Eye Tracking System, Applied Science Laboratories (2008) Bedford, MA, USA. http://www.a-s-l.com
[33] Fagiani C, Betke M, Gips J (2002) Evaluation of tracking methods for human-computer interaction. In: IEEE Workshop on Applications in Computer Vision, Orlando, Florida, pp 121–126
[34] Forrester Research (2003) The Aging of the US Population and Its Impact on Computer Use. A Research Report commissioned by Microsoft Corporation. http://www.microsoft.com/enable/research/computerusers.aspx
[35] Frankish C, Hull R, Morgan P (1995) Recognition accuracy and user acceptance of pen interfaces. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’95), pp 503–510
[36] Freeman WT, Beardsley PA, Kage H, Tanaka K, Kyuman C, Weissman C (2000) Computer vision for computer interaction. SIGGRAPH Computer Graphics 33(4):65–68
[37] Gandy M, Starner T, Auxier J, Ashbrook D (2000) The Gesture Pendant: A self-illuminating, wearable, infrared computer vision system for home automation control and medical monitoring. In: Fourth International Symposium on Wearable Computers (ISWC’00), p 87
[38] Gips J, Gips J (2000) A computer program based on Rick Hoyt’s spelling method for people with profound special needs. In: Proceedings of the International Conference on Computers Helping People with Special Needs (ICCHP), Karlsruhe, Germany, pp 245–250 [39] Gips J, Betke M, DiMattia PA (2001) Early experiences using visual tracking for computer access by people with profound physical disabilities. In: Stephanidis C (ed) Universal Acccess In HCI: Towards an Information Society for All, Volume 3, Proceedings of the 1st International Conference on Universal Access in Human-Computer Interaction (UA-HCI), Lawrence Erlbaum Associates, Mahwah, NJ, pp 914–918 [40] Goldberg D, Richardson C (1993) Touch-typing with a stylus. In: Proceedings of the INTERCHI ’93 Conference on Human Factors in Computing Systems, IOS Press, Amsterdam, The Netherlands, pp 80–87 [41] Gorman M, Lahav A, Saltzman E, Betke M (2007) A camera-based music making tool for physical rehabilitation. Computer Music Journal 31(2):39– 53 [42] Gorodnichy DO, Roth G (2004) Nouse ‘use your nose as a mouse’ perceptual vision technology for hands-free games and interfaces. Image and Vision Computing 22(12):931–942 [43] Grauman K, Betke M, Gips J, Bradski GR (2001) Communication via eye blinks - detection and duration analysis in real time. In: Proceedings of the IEEE Computer Vision and Pattern Recognition Conference (CVPR), Kauai, Hawaii, vol 2, pp 1010–1017 [44] Grauman K, Betke M, Lombardi J, Gips J, Bradski GR (2003) Communication via eye blinks and eyebrow raises: Video-based human-computer interfaces. International Journal Universal Access in the Information Society 2(4):359–373 [45] Guan H, Chang J, Chen L, Feris R, Turk M (2006) Multi-view appearancebased 3D hand pose estimation. In: IEEE Workshop on Vision for Human Computer Interaction, New York, NY, pp 1–6 [46] Hansen DW, Pece AEC (2005) Eye tracking in the wild. Computer Vision and Image Understanding 98(1):155–181 [47] Hansen JP, Tørning K, Johansen AS, Itoh K, Aoki H (2004) Gaze typing compared with input by head and hand. In: ETRA ’04: Proceedings of the 2004 Symposium on Eye Tracking Research & Applications, ACM, New York, NY, USA, pp 131–138 [48] Harbush K, Kühn M (2003) Towards an adaptive communication aid with text input from ambiguous keyboards. In: 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003) [49] HeadWay (2008) Infrared head-mounted mouse alternative, Penny & Giles, Don Johnston, Inc. http://www.synapseadaptive.com/ donjohnston/pengild.htm [50] Hutchinson T, JR KPW, Martin WN, Reichert KC, Frey LA (1989) Humancomputer interaction using eye-gaze input. IEEE Transactions on Systems, Man and Cybernetics 19(6):1527–1533
[51] Hwang F, Keates S, Langdon P, Clarkson J (2004) Mouse movements of motion-impaired users: a submovement analysis. In: Proceedings of the 6th International ACM SIGACCESS Conference on Computers and Accessibility (Assets ’04), pp 102–109 [52] International Myotonic Dystrophy Organization (2008) http://www. myotonic\-dystrophy.org [53] Ivins JP, Porril J (1998) A deformable model of the human iris for measuring small three-dimensional eye movements. Machine Vision and Applications 11(1):42–51 [54] Jaimes A, Sebe N (2005) Multimodal human computer interaction: A survey. In: Sebe N, Lew M, Huang T (eds) Computer Vision in Human-Computer Interaction, ICCV 2005 Workshop on HCI, Beijing, China, October 21, 2005, Proceedings. Lecture Notes in Computer Science, Volume 3766, SpringerVerlag, Berlin Heidelberg, pp 1–15 [55] Ji Q, Zhu Z (2004) Eye and gaze tracking for interactive graphic display. Machine Vision and Applications 15(3):139–148 [56] Kaplan (2008) World Institute on Disability. http://www. accessiblesociety.org [57] Kapoor A, Picard RW (2002) Real-time, fully automatic upper facial feature tracking. In: Proceedings of the Fifth IEEE International Conference on Automatic Face Gesture Recognition, Washington, D.C., pp 10–15 [58] Kawato S, Tetsutani N (2004) Detection and tracking of eyes for gaze-camera control. Image and Vision Computing 22(12):1031–1038 [59] Kim KN, Ramakrishna RS (1999) Vision-based eye-gaze tracking for human computer interface. In: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Vol. 2, Tokyo, Japan, pp 324–329 [60] Kim WB, Kwan C, Fedyuk I, Betke M (2008) Camera canvas: Image editor for people with severe disabilities. Tech. Rep. 2008-010, Computer Science Department, Boston University, http://www.cs.bu. edu/techreports/pdf/2008-010-camera-canvas.pdf [61] Kollios G, Sclaroff S, Betke M (2001) Motion mining: Discovering spatiotemporal patterns in databases of human motion. In: Proceedings of the 2001 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2001), Santa Barbara, CA, pp 25–32 [62] Kristensson PO, Zhai S (2004) SHARK2 : a large vocabulary shorthand writing system for pen-based computers. In: Proceedings of the 17th Annual ACM Symposium on User Interface Software and Technology (UIST ’04), ACM, New York, NY, USA, pp 43–52 [63] LaCascia M, Sclaroff S, Athitsos V (2000) Fast, reliable head tracking under varying illumination: An approach based on robust registration of texturemapped 3D models. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(4):322–336 [64] Lankenau A, Röfer T (2000) Smart wheelchairs - state of the art in an emerging market. Künstliche Intelligenz, Schwerpunkt Autonome Mobile Systeme 4:37–39
[65] LC Technologies Eyegaze System (2008) http://www.lctinc.com [66] Lombardi J, Betke M (2002) A camera-based eyebrow tracker for handsfree computer control via a binary switch. In: 7th ERCIM Workshop on User Interfaces for All, Paris, France, pp 199–200 [67] Long AC, Landay JA, Rowe LA, Michiels J (2000) Visual similarity of pen gestures. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’00), ACM, New York, NY, USA, pp 360–367 [68] MacKenzie IS, Zhang SX (1997) The immediate usability of graffiti. In: Proceedings of the Conference on Graphics Interface ’97, Canadian Information Processing Society, Toronto, Ont., Canada, Canada, pp 129–137 [69] MacKenzie IS, Zhang X (2008) Eye typing using word and letter prediction and a fixation algorithm. In: Proceedings of the 2008 Symposium on Eye Tracking Research & Applications (ETRA ’08), ACM, New York, NY, USA, pp 55–58 [70] Magee JJ, Betke M, Gips J, Scott MR, Waber BN (2008) A human-computer interface using symmetry between eyes to detect gaze direction. Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 38(6):1–15 [71] Meyer A (1995) Pen computing: a technology overview and a vision. ACM SIGCHI Bulletin 27(3):46–90 [72] Morimoto CH, Koons D, Amir A, Flickner M (2000) Pupil detection and tracking using multiple light sources. Image and Vision Computing 18(4):331–335 [73] Myers BA, Wobbrock JO, Yang S, Yeung B, Nichols J, Miller R (2002) Using handhelds to help people with motor impairments. In: Proceedings of the Fifth International ACM Conference on Assistive Technologies (Assets ’02), ACM, New York, NY, USA, pp 89–96 [74] Myers GA, Sherman KR, Stark L (1991) Eye monitor: microcomputer-based instrument uses an internal mode to track the eye. Computer 24(3):14–21 [75] National Multiple Sclerosis Society (2008) http://www. nationalmssociety.org [76] Nonaka H (2003) Communication interface with eye-gaze and head gesture using successive DP matching and fuzzy inference. Journal of Intelligent Information Systems 21(2):105–112 [77] NSF HCC (2008) Human-Centered Computing, Division of Information & Intelligent Systems, National Science Foundation. http://www.nsf. gov [78] Pacchetti C, Mancini F, Aglieri R, Fundaro C, Martignoni E, Nappi G (2000) Active music therapy in Parkinson’s disease: an integrative method for motor and emotional rehabilitation. Psychosomatic Medicine 62(3):386–393 [79] Paquette M (2005) IWebExplorer, project report, Computer Science Department, Boston University [80] Park KR (2007) A real-time gaze position estimation method based on a 3-d eye model. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics 37(1):199–212
[81] Paul S, Ramsey D (2000) Music therapy in physical medicine and rehabilitation. Australian Occupational Therapy Journal 47:111–118 [82] Ponweiser W, Vincze M (2007) Task and context aware performance evaluation of computer vision algorithms. In: International Conference on Computer Vision Systems: Vision Systems in the Real World: Adaptation, Learning, Evaluation, Bielefeld, Germany (ICVS 2007) [83] Poulson D, Nicolle C (2004) Making the internet accessible for people with cognitive and communication impairments. Universal Access in the Information Society 3(1):48–56 [84] Schnabel (2007) Director of the film “The Diving Bell and the Butterfly,” France: Pathé Renn Productions [85] Schwerdt K, Crowley JL (2000) Robust face tracking using color. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France [86] Sclaroff S, Betke M, Kollios G, Alon J, Athitsos V, Li R, Magee J, Tian T (2005) Tracking, analysis, recognition of human gestures in video. In: Proceedings of the 8th International Conference on Document Analysis and Recognition, Seoul, Korea, pp 806–810 [87] Shih SW, Liu J (2004) A novel approach to 3-D gaze tracking using stereo cameras. IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics 34(1):234–245 [88] Shneiderman B (1971) Computer science education and social relevance. ACM SIGCSE Bulletin 3(1):21–24 [89] Shugrina M, Betke M, Collomosse J (2006) Empathic painting: Interactive stylization through observed emotional state. In: Proceedings of the 4th International Symposium on Non-Photorealistic Animation and Rendering (NPAR 2006), Annecy, France, 8 pp. [90] SIGACCESS (2008) ACM special interest group on accessible computing. http://www.sigaccess.org [91] Sirohey S, Rosenfeld A, Duric Z (2002) A method of detecting and tracking irises and eyelids in video. Pattern Recognition 35(5):1389–1401 [92] Sirovich L, Kirby M (1987) Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America A 4(3):519523 [93] Smart Nav Head Tracker (2008) Natural Point, Eye Control Technologies, Inc., Corvallis, OR, USA. http://www.naturalpoint.com/ smartnav [94] Soukoreff RW, MacKenzie IS (2004) Towards a standard for pointing device evaluation, perspectives on 27 years of Fitts’ law research in HCI. International Journal on Human-Computer Studies 61(6):751–789 [95] Stary C (2006) Special UAIS issue on user-centered interaction paradigms for universal access in the information society. Universal Access in the Information Society 4(3):175–176
[96] Steriadis CE, Constantinou P (2003) Designing human-computer interfaces for quadriplegic people. ACM Transactions on Computer-Human Interaction 10(2):87–118 [97] StrokeIt (2008) A mouse gesture recognition engine and command processor, software by Jeff Doozan. http://tcbmi.com/strokeit [98] Takami O, Morimoto K, Ochiai T, Ishimatsu T (1995) Computer interface to use head and eyeball movement for handicapped people. In: IEEE International Conference on Systems, Man and Cybernetics. Intelligent Systems for the 21st Century, vol 2, pp 1119–1123 [99] Team Hoydt Website (2008) Racing towards inclusion. http://www. teamhoyt.com [100] Tian Y, Kanade T, Cohn J (2000) Dual-state parametric eye tracking. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, pp 110–115 [101] Tracker Pro (2008) Infrared-based head tracker, Madentec, Ltd., Edmonton, Alberta, Canada. http://www.madentec.com [102] Tsugawa S, Aoki M, Hosaka A, Seki K (1994) Recent Japanese projects of AVCS-related systems. In: Proceedings of the Symposium on Intelligent Vehicles, pp 125–130 [103] Turk M (2005) RTV4HCI: A historical overview. In: Kisacanin B, Pavlovic V, Huang T (eds) Real-Time Vision for Human-Computer Interaction, SpringerVerlag [UI4All] ERCIM working group “User Interfaces for All”. http://www. ui4all.gr/index.html [104] Vaidyanathan R, Chung B, Gupta L, Kook H, Kota S, West JD (2007) Tongue-movement communication and control concept for hands-free human-machine interfaces. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 37(4):533–546 [105] V IEW P OINT (2008) Eye tracking system. Arrington Research. http:// www.arring\-tonresearch.com [106] Waber B, Magee JJ, Betke M (2006) Web mediators for accessible browsing. In: Stephanidis C, Pieper M (eds) Universal Access in Ambient Intelligence Environments – 9th International ERCIM Workshop “User Interfaces For All” UI4ALL 2006, Königswinter, Germany, September 2006, Revised Papers. LNCS 4397, Springer-Verlag, pp 447–466 [107] Waber BN, Magee JJ, Betke M (2005) Fast head tilt detection for humancomputer interaction. In: Sebe N, Lew M, Huang T (eds) Computer Vision in Human-Computer Interaction, ICCV 2005 Workshop on HCI, Beijing, China, October 21, 2005, Proceedings. LNCS, Vol. 3766, Springer-Verlag, pp 90–99 [108] Wang J, Athitsos V, Sclaroff S, Betke M (2008) Detecting objects of variable shape structure with hidden state shape models. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(3):477–492 [109] Wang JG, Sung E (2002) Study on eye gaze estimation. IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics 32(3):332–350
[110] Ward DJ, Blackwell AF, MacKay DJC (2000) Dasher - a data entry interface using continuous gestures and language models. In: Proceedings UIST 2000: The 13th Annual ACM Symposium on User Interface Software and Technology, http://www.inference.phy.cam.ac.uk/dasher [111] Wobbrock JO, Gajos KZ (2008) Goal crossing with mice and trackballs for people with motor impairments: Performance, submovements, and design directions. ACM Transactions on Accessible Computing 1(1):1–37 [112] Wobbrock JO, Cutrell E, Harada S, MacKenzie IS (2008) An error model for pointing based on Fitts’ law. In: Proceeding of the Twenty-sixth Annual SIGCHI Conference on Human Factors in Computing Systems (CHI ’08), pp 1613–1622 [113] Wu C, Aghajan H (2008) Context-aware gesture analysis for speaker HCI. In: Augusto J, Shapiro D, Aghajan H (eds) Proceedings of the 3rd Workshop on “Artificial Intelligence Techniques for Ambient Intelligence” (AITAmI’08), Patras, Greece. 21st-22nd of July 2008. Co-located event of ECAI 2008, Springer-Verlag [114] Wu TF, Chen MC (2007) Performance of different pointing devices on children with cerebral palsy. In: Stephanidis C (ed) Universal Access in HumanComputer Interaction: Applications and Services, Vol. 4556, Springer-Verlag, Berlin Heidelberg, pp 462–469 [115] Xie X, Sudhakar R, Zhuang H (1995) Real-time eye feature tracking from a video image sequence using Kalman filter. IEEE Transactions on Systems, Man, and Cybernetics 25(12):1568–1577 [116] Yanco HA, Gips J (1998) Driver performance using single switch scanning with a powered wheelchair: Robotic assisted control versus traditional control. In: Proceedings of the Rehabilitation Engineering and Assistive Technology Society of North America Annual Conference (RESNA ’98), RESNA Press, pp 298–300 [117] Yang M, Kriegman D, Ahuja N (2002) Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(1):34– 58 [118] Yoo DH, Chung MJ (2004) Non-intrusive eye gaze estimation without knowledge of eye pose. In: Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition, Seoul, Korea, pp 785–790 [119] Young L, Sheena D (1975) Survey of eye movement recording methods. Behavior Research Methods and Instrumentation 7(5):397–429 [120] ZAC Browser (2008) Web browser designed for autistic children. http: //www.alsa.org
Visual Attention, Speaking Activity, and Group Conversational Analysis in Multi-Sensor Environments

Daniel Gatica-Perez and Jean-Marc Odobez
Idiap Research Institute and Ecole Polytechnique Fédérale de Lausanne, Switzerland
1 Introduction

Among the many possibilities of automation enabled by multi-sensor environments (several of which are discussed in this Handbook), one particularly relevant possibility is the analysis of social interaction in the workplace, and more specifically, of conversational group interaction. Group conversations are ubiquitous, and represent a fundamental means through which ideas are discussed, progress is reported, and knowledge is created and disseminated. It is well known that spoken words represent a fundamental communication channel. However, it is also known from research in social and cognitive science, and experienced by all of us every day, that nonverbal communication is also a key aspect of social interaction, through which a wealth of personal information, ranging from instantaneous internal states to personality traits, gets expressed, interpreted, and weaved, along with speech, into the fabric of daily interaction with peers and colleagues [34]. In this view, although it is clear that further work on speech and language processing is needed for important modules of "smart" environments, the computational analysis of nonverbal communication is also a relevant domain and has received increasing attention [22].
The coordination of speaking activity (including turn-taking and prosody) and gaze (i.e., visual attention) is a key aspect of nonverbal communication. Gaze is an important nonverbal cue with functions such as establishing relationships (through mutual gaze), expressing intimacy, and exercising social control. Furthermore, the role of gaze as a cue to regulate the course of interaction, via turn holding, taking,
or yielding, has been established in social psychology [32, 24]. Speakers use their gaze to indicate whom they address and to secure their attention [31], while listeners use their gaze towards speakers to show their attention and find appropriate times to request the speaking floor [16, 42]. It has also been documented that the coordination of speaking and looking acts correlates with the social perception of personality traits, like dominance, or of situational power, like status [18]. As we will discuss later, a multi-sensor environment could make use of all this knowledge to infer a number of important facets of a group conversation, with the goal of supporting communication even if some of the participants of the conversation were not physically present.
This chapter reviews the role of these two nonverbal cues for the automatic analysis of group conversations. Our goal is to introduce the reader to some of the current technologies and envisioned applications within environments equipped with multiple cameras and microphones. We discuss the state of affairs regarding the automatic extraction of these cues in real conversations, their initial application towards characterizing social interaction patterns, and some of the challenges that lie ahead.
The structure of the chapter is as follows. In Section 2, we motivate the work in this field by discussing two application scenarios in the workplace, briefly reviewing existing work in these directions. In Section 3, we discuss some of the current approaches (with emphasis on our own work) to automatically measure speaking activity and visual attention. In Section 4, we illustrate the application of these technologies to the problem of social inference using a specific case study (dominance). Section 5 offers some concluding thoughts, pointing to open questions in the field.
2 Applications

2.1 On-line Regulation of Group Interaction

The automatic modeling of nonverbal behavior has clear value for building systems that support face-to-face interaction at work. As described in the introduction, speaking activity and visual attention play key roles in the establishment and evolution of conversations, but also in their outcomes. As one example, the social psychology literature has documented that dominant and high-status people often talk more and more often, interrupt others more, and are looked at by others more ([7, 15]; see more in Section 4). The effect of this type of behavior at work can be significant. Through dominant behavior, for instance, people might over-control the floor and negatively affect a group conversation in which the ideas of others might be important but overlooked, for example in a brainstorming meeting or as part of a learning process. It has been documented that people who hold the floor too much are perceived as over-controlling [8].
Using findings in social psychology and communication research about group interaction, recent work has begun to investigate the automatic recognition of speaking
activity and attentional cues as part of on-line systems that actively regulate and influence the dynamics of a group conversation by providing feedback to, and increasing the awareness of, the group members, in situations where a more balanced participation is key for effective teamwork. Some representative examples are summarized below [22].
DiMicco et al [12] proposed an approach that estimates, on an on-line basis, the speaking time of each participant from headset microphones and visualizes this information publicly, projecting the speaking-time proportions on a wall or other large shared surface (Fig. 1(a)). The authors found that this simple feedback mechanism, which represents a minor cognitive-load overhead, tends to promote a more even participation of the group members. DiMicco et al have also explored other effective ways to display information about the relative participation of team members.
Bachour et al [5] have explored a similar idea, but employ a translucent table where the conversation unfolds as both the sensing platform and the display device. Speaking activity for each participant is inferred through a three-microphone array embedded in the table. A rough estimate of the proportion of speaking time of each participant is then displayed via LEDs located under the surface of the table that cover the participant’s portion of the table (Fig. 1(b)). In a preliminary study, the authors found that people actually pay attention to the table, and that the display strategy can influence the behavior of group members who are willing to accept the table display, both in the case of participants who might not contribute much and in the case of people who want to regulate their talking time with respect to the others’, that is, who are interested in a more egalitarian use of the speaking floor.
Kim et al [33] opted for a highly portable solution for both sensing and display of interaction (Fig. 1(c)). Group members wear a device (dubbed ’sociometric badge’) that extracts a number of nonverbal cues, including speaking time and prosody, and body movement through accelerometers. These cues are used to generate a display of the group interaction on each person’s cell phone. Evaluating their system in both collocated and remote meetings, the authors reported that this setting improved interactivity by reducing the differences between dominant and non-dominant people, and between co-located and remote participants, without being distracting.
The previous cases rely on speaking time to infer a need for regulation of interaction. Taking one step further, Sturm et al [57] combined speaking activity and visual attention, building on previous work [36], and developed a system that automatically estimates speaking time from headset microphones and visual attention from headbands with reflective pieces tracked by infrared cameras. The system builds on the assumption that both cues are strong indicators of the state of a group conversation, and visualizes these cues for each participant, aggregated over time, on the conversation table (Fig. 1(d)). The system aims at regulating the flow of the conversation by facilitating individual participation.
The above research works are examples of the growing interest in developing systems that can promote better group communication through the automatic understanding of nonverbal behavior. Clearly, comprehensive studies about the efficiency and usability of these ideas and prototypes in the real world remain as open research
issues. Nevertheless, they suggest that human-centered technology that influences group behavior is starting to become feasible, and that such systems could be socially acceptable in multiple group interaction situations.
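As a minimal sketch of the kind of aggregation these feedback displays rely on, the following Python fragment turns per-participant voice-activity labels into speaking-time shares. The function name, frame length, and data layout are illustrative assumptions, not details of the systems cited above.

    def speaking_proportions(vad, frame_s=0.03):
        # vad maps participant -> list of 0/1 frame labels (1 = speaking).
        seconds = {p: sum(labels) * frame_s for p, labels in vad.items()}
        total = sum(seconds.values()) or 1.0   # avoid division by zero in silence
        return {p: s / total for p, s in seconds.items()}

    # Example: participant A spoke twice as many frames as B.
    print(speaking_proportions({"A": [1, 1, 0, 1, 1], "B": [0, 1, 0, 1, 0]}))

The resulting shares can then be mapped to bar lengths, LED counts, or avatar sizes on whatever display a particular system uses.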
2.2 Assistance of Remote Group Interaction

Meetings involving remote participants are another area where a better understanding of nonverbal communication and social dynamics can help enhance the quality of interaction. Traditionally, research in this domain has focused on the enabling technological aspects of communication: increasing communication bandwidth and/or signal compression, improving audio and video quality, or designing more efficient hardware solutions for tele-conferencing systems. However, in many settings, increasing the perception of presence of remote participants by enhancing the social aspects of communication could be useful (e.g. as is done with emoticons in emails).
Let us consider, for instance, the following scenario: Bob is engaged in a distant meeting with a group of collocated people. He has an audio connection. During the discussion, Bob does not perceive well the nonverbal communicative cues indicating the reactions of the other meeting participants to propositions and comments. As he has no perception of the social behavior related to attention, addressing, and floor control:
• Bob cannot easily tell whether somebody is addressing him unless he is called by name;
Fig. 1 Systems for on-line regulation of group interaction from speaking activity and/or visual attention: (a) DiMicco et al [12]; (b) Bachour et al [5]; (c) Kim et al [33]; (d) Sturm et al [57]. All pictures are reproduced with permission of the corresponding authors.
• When trying to request the floor, Bob has no idea whether other people are doing the same.
After a few failed attempts, such as interrupting others at a wrong place or not answering promptly, Bob tends to disengage himself from the meeting. This scenario exemplifies that, as in face-to-face meetings, taking or giving the floor must follow social protocols which to a large extent rely on the appropriate perception and use of nonverbal behavior. Gaze, body postures, and facial expressions are codes by which interactors show whether or not they are attentive or ready to speak, and are also used to communicate reactions to propositions and comments [35, 50]. In remote meetings involving negotiations, the lack of perception of these codes was shown to lead to a misunderstanding of the other people’s priorities and to delay the achievement of an agreement [50]. Thus, in video conferencing situations, investigations have been conducted towards the enhancement of gaze perception, even at a weak level [44], to favor user engagement.
There are several ways to improve remote communication through the analysis of gaze and turn-taking patterns. In Bob’s example, such an analysis would be useful to build remote assistants that would improve his participation and satisfaction in the meeting, by providing a better perception of presence and of the social behavior of the other participants in the main room:
• Addressee assistant, which indicates whether the current speaker (or any other person) is looking at Bob;
• Interruptibility assistant, which would indicate when Bob can safely interrupt the meeting in a non-disruptive way.
More generally, the goal would be to allow Bob to have a better perception of the conversation dynamics developing in the meeting room. Matena et al [40] investigated such issues by proposing a graphical interface for cell phones which represents the current communication situation in an instrumented meeting room using the output of multimodal processing tools (Fig. 2). In their study, the authors found that users were actually interested in the concept and would use such a display mainly for work meetings. Knowing who speaks to whom was the most liked feature.
When distant people have video access, another way to improve social perception is through automatic video editing. Many video conference and meeting recording systems are nowadays equipped with multiple cameras capturing several aspects of the meeting place, from close-up views of the participants to a panorama view of the whole space. The goal of an automatic video editing method is to combine all these streams into a single one that conveys the dynamics of the conversations and attracts the attention of the viewers. This is performed through video cutting, i.e. by defining camera switching rules. Most of the current approaches rely on voice-switching schemes and shot-transition probabilistic models estimated from TV programs, i.e. they mainly show the main speaker and avoid too short or too long shots to ensure variability. Such models, however, usually fail to provide viewers with
Fig. 2 Example of a cell phone display for a remote participant, showing the current slide being displayed in the main meeting room and conveying information about the discussion dynamics in the meeting room through visualisation of automatically extracted cues. The head orientation of participants in the main room is represented by a sector in front of their avatar, and this sector is colored in red when a person is speaking (e.g. Maurice in the above example). The person who is the current focus of the participant selected by the cellphone user for display (Maurice in the example) is marked by a green circle. From [40], reproduced with permission of the authors.
contextual information about the other attendees. To address these issues, Takemae et al [58] proposed alternative gaze-based switching schemes. They showed that in three-person meetings, selecting the view of the person who is looked at most was the most effective way of conveying the conversation, and particularly of conveying addressee responses to a speaker’s utterance. This work followed the idea that people direct their gaze towards others to capture their emotions, reactions, and intentions, and that, as such, people receiving more visual attention (even when not speaking) might be more important than others with respect to the conversation. As another example, the study by Ranjan et al [49] showed that using the speaker’s gaze as the switching cue when the speaker holds a long turn, or the majority gaze when multiple people speak, was also useful in conveying the conversational aspects of meetings.
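As a minimal sketch of the voice-switching idea described above, the fragment below switches to the close-up camera of the current speaker while enforcing a minimum shot length. The function name, one-second granularity, and threshold are illustrative assumptions and do not reproduce any of the cited editing systems.

    def select_shots(speaker_per_second, min_shot_s=2):
        # speaker_per_second: dominant speaker id for each second, or None for silence.
        # Returns (start_time, camera) cuts; the camera id equals the speaker id.
        cuts, current, shot_start = [], None, 0
        for t, speaker in enumerate(speaker_per_second):
            target = speaker if speaker is not None else current  # hold shot in silence
            if target != current and (current is None or t - shot_start >= min_shot_s):
                cuts.append((t, target))
                current, shot_start = target, t
        return cuts

    # Example: the cut to C is delayed until B's shot is at least min_shot_s long.
    print(select_shots(["A", "A", "A", "B", "C", "C", "C"]))  # [(0, 'A'), (3, 'B'), (5, 'C')]

Gaze-based schemes such as those of Takemae et al would replace the speaker id with, for instance, the identity of the person receiving the most visual attention at each instant.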
3 Perceptual Components: Speaking Turns and Visual Attention

3.1 Estimating Speaking Turns

The extraction of speaking turns answers the question “who spoke when”, i.e. it determines for each speaker the periods of time when he is silent and the periods when he is speaking, as illustrated in Fig. 3. This is one of the most fundamental tasks in many speech-related applications. This segmentation problem is also called diarization when one has to automatically infer the number of people and their speaking
Fig. 3 Speaker turn segmentation of four people in a meeting. The x-axis denotes time. On the y-axis, the speaking status (0 or 1) of each speaker (sp. 1 to sp. 4) is displayed.
turns in an unsupervised manner. Different techniques have been proposed in the past, depending on the type of audio signal that is processed, such as broadcast news, telephone speech, or dictation systems, to name a few examples. The segmentation of spontaneous speech in meetings is challenging for several reasons. Speech utterances can be very short (i.e. they are sporadic), which makes temporal smoothing techniques difficult. Furthermore, people tend to talk over each other, which generates overlapped speech signals. In [54], it was shown that such overlap could represent up to 15% of the data. In addition, concurrent audio signals, such as body noise from the participants (body motion, breathing, chair motion) as well as machine noise (fans, projectors, laptops), will be mixed with the speech signal. This affects the acoustic features used to identify a given speaker.
One approach to speaker segmentation is to use intrusive solutions, i.e. to request each participant to wear a close-talk microphone (a lapel microphone near the throat, or a headset near the mouth). This makes it possible to know precisely when each speaker is talking, because there is no need to identify who is speaking, only to detect when the person wearing the microphone speaks. The problem reduces to a Voice Activity Detection (VAD) problem [48], which is simplified because the speech signal of the person is much cleaner than other audio signals due to the short distance to the mouth. Simple thresholding of the short-term energy can provide good results. Nevertheless, in lively conversations, there might still be a non-negligible presence of high-energy cross-talk from nearby participants (especially when using lapel microphones), as well as breathing and contact noises. To address this issue, Wrigley et al [61] used learned Gaussian Mixture Model (GMM) classifiers to identify different audio configurations (e.g. speaker alone or speaker plus cross-talk) for each person in the meeting, and performed a joint multi-channel speech segmentation of all people. They obtained segmentation accuracies of up to 96% on a large database of meetings. Dines et al [13] proposed an approach using a discriminant multi-layer perceptron for phoneme-plus-silence classification, and reported even better results (error rates of less than 2%) on similar headset data.
To reduce the constraints on meeting participants and allow more freedom of movement, microphone arrays (also known as multiple distant microphones, MDM) can be employed. MDM signals contain spatio-temporal redundancy which can be used
for speaker turn extraction. One common way to exploit this redundancy is to produce a single audio stream out of the multiple channels using beamforming techniques, with the benefit of enhancing the audio signal coming from the most active spatial location in the room (in principle, the current speaker’s position). Diarization techniques that have been developed for other contexts, e.g. broadcast news [59], can be applied to this generated stream. However, adapting such techniques to meetings can be difficult due to the short utterances encountered in meetings, the high level of overlapped speech, which makes it difficult to build precise acoustic models of the speakers, and the difficulty of detecting and identifying the presence of several speakers in a single audio stream.
When the geometry of the array is known, the location of the speakers can be extracted. Under the assumption that people remain seated at the same place, the detection of speech activity coming from specific spatial regions can be used to solve the diarization problem and provide more accurate results, especially on overlapped speech, as demonstrated by Lathoud and McCowan [39]. More generally, when moving people are involved, MDM can be used to perform multi-speaker tracking. Fig. 4 presents a possible approach to this problem. Alternatively, the time difference of arrival (TDOA) between channel pairs can be exploited directly as additional features alongside the acoustic ones. Vijayasenan et al [60] demonstrated, within an information bottleneck diarization framework, that this could reduce the speaker classification error from around 16% to less than 10%. In addition, the direct use of TDOA features has the advantage of being exploitable even if the array geometry is unknown.
In summary, speaker turns can be extracted quite reliably and precisely in meeting rooms, where the precision level mainly depends on the intrusiveness of the solution. The required level of precision depends on the application. For conversational analysis, current solutions are usually good enough. Still, the analysis of social interactions and behaviors, which can rely on subtle cues such as short backchannels or patterns of interruptions, will benefit from improvements in speaker turn extraction.
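Two of the building blocks discussed in this subsection, energy-based voice activity detection on a close-talk channel and TDOA estimation between a pair of distant microphones, can be sketched in a few lines of Python with numpy. This is a generic illustration under simplifying assumptions (a fixed frame length, a hand-set energy threshold, and GCC-PHAT for the delay estimate); it is not the implementation used in the cited systems.

    import numpy as np

    def energy_vad(signal, sample_rate, frame_ms=30, threshold_db=-40.0):
        # Label each frame of a close-talk channel as speech (1) or silence (0)
        # by thresholding the short-term log energy relative to the loudest frame.
        frame_len = int(sample_rate * frame_ms / 1000)
        n_frames = len(signal) // frame_len
        frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
        energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
        return (energy_db - energy_db.max() > threshold_db).astype(int)

    def gcc_phat_tdoa(x, y, sample_rate, max_delay_s=0.001):
        # Estimate the time difference of arrival between two microphone channels
        # with the phase transform (GCC-PHAT): whiten the cross-spectrum, then
        # pick the lag of the strongest cross-correlation peak.
        n = len(x) + len(y)
        cross = np.fft.rfft(x, n=n) * np.conj(np.fft.rfft(y, n=n))
        cross /= np.abs(cross) + 1e-12
        cc = np.fft.irfft(cross, n=n)
        max_shift = min(n // 2, int(max_delay_s * sample_rate))
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        delay = np.argmax(np.abs(cc)) - max_shift
        return delay / sample_rate

Running energy_vad on every headset channel yields the kind of binary speaking-status traces shown in Fig. 3, while delay values such as those returned by gcc_phat_tdoa can serve as the location-related features discussed above.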
3.2 Estimating Visual Focus of Attention

Recognizing the gaze or visual focus of attention (VFOA) of people in meetings answers the question “who looks at whom or what”, i.e. which visual target the gaze of a person is oriented at. Traditionally, gaze has been measured using Human-Computer Interaction (HCI) techniques. Usually, such systems require either invasive head-mounted sensors or accurate estimates of the eye features obtained from high-resolution iris images [41]. These systems would be very difficult to set up for several people in a room, and would interfere with natural conversation by restricting head mobility and orientation. As an alternative, head pose can be used as a proxy for gaze, as supported by psychovisual evidence showing that people do exploit head orientation to infer other people’s gaze [37]. This was empirically demonstrated
Fig. 4 Example of a processing chain for turn tracking using a microphone array. First line: acoustic sources are first detected and localized in space. These are then grouped by spatio-temporal proximity. Second line: non-speech sources are discarded. Finally, segments are automatically clustered into individual speakers. The figure is taken from [38].
by Stiefelhagen et al [56] in a simple setting with 4 people having short conversations. Social findings on gaze/speaking-turn relationships were also exploited for VFOA estimation. Using people's speaking status as another information source to predict a person's VFOA, Stiefelhagen et al [56] showed that VFOA recognition could be slightly improved w.r.t. using the head pose alone. Moving a step further, Otsuka et al [46] proposed an interesting framework by introducing the notion of conversation regimes jointly driving the sequence of utterances and the VFOA patterns of all people (for example, a 'convergence' regime towards a speaker). Regimes were estimated along with people's VFOA, and their experiments with short informal 4-participant conversations showed that people's VFOA could be reliably estimated and that the conversational regimes matched well some addressing structures and dialog acts such as question/answer.
At work, most meetings involve the use of documents or laptops placed on tables, as well as projection screens for presentations. Studying conversations and gazing behaviours in these conditions is more complex than in meetings where people are only conversing [56, 45, 46]. The addition of new targets, such as the table or the slide screen, leads to an increase of head pose ambiguities (i.e. the same head pose can be used to focus on different VFOA targets), making the recognition of VFOA from head pose a difficult task [3]. Introducing conversation context, like the regimes in Otsuka et al [46], is thus necessary to reduce ambiguities. However, the presence of artifacts (laptops, slide screen) needs to be taken into account, as it significantly affects the gaze patterns. This is the situational attractor hypothesis of Argyle and
Fig. 5 Meeting recording setup. The first image shows the central view that is used for slide change detection. The second image shows a top view of the meeting room to illustrate seats and focus target positions. Seat numbers will be used to report the VFOA recognition results of people seated at these locations. The third and fourth images show the side cameras that are used for tracking people’s head pose.
Graham [1]: objects which play a central role in the task that people are doing attract their visual attention, thereby overruling the eye-gaze behaviour observed in 'direct' human-human conversations. Such an impact of artifacts on gaze behaviour was reported in task-based meetings [10]. For instance, when a new slide is displayed, the 'convergence' regime towards the speaker is no longer observed, since people look at the slide, not the speaker.
To further ground the discussion on VFOA recognition issues, in the following we describe our own work in more detail. More precisely, we address the joint recognition of people's VFOA in task-based meetings involving the table and slide screen as potential targets, using meeting contextual models in a DBN framework. We first introduce the setting and the different variables involved in our modeling (VFOA, observations, conversational events) (Subsection 3.2.1), and then present our DBN model, detailing qualitatively some of its main components (Subsection 3.2.2). Finally, we report some recognition results, comparing with the existing literature and discussing issues and ideas for future work.
Fig. 6 Head pose extraction. The head localization (position, scale, in-plane rotation, cf. left image) and head pose (discrete index identifying a specific out-of-plane rotation, cf. right image) are jointly tracked (taken from [3]).
3.2.1 Experimental Context and Cue Extraction
In this section, we describe the experimental setting and then introduce the different variables involved in the modeling of the conversation patterns.
Set-up and task: Our source of data is the AMI corpus (www.idiap.ch/mmm/corpora/ami). In this corpus, the recorded meetings followed a general task-oriented scenario: four persons with different roles were involved in the creation and design of a new television remote control. Fig. 5 shows the physical set-up of the meeting room. Amongst the different sensor recordings available in the corpus, we used the following: the video streams from the two cameras facing the participants (Fig. 5, bottom images) and from a third camera capturing a general view of the meeting room (Fig. 5, top left image). As audio sensors, we used the close-talk microphones.
Our main objective is to estimate people's VFOA. Thus, we have defined the set of interesting visual targets for a given participant seated at seat k, denoted F_k, as comprising 6 VFOA targets: the 3 other participants, P_k (for example, for seat 1, P_1 = {seat2, seat3, seat4}), as well as the other targets O = {Table, Slide Screen, Unfocused}. The latter target (Unfocused) is used when the person is not visually focusing on any of the previously cited targets. Below, we describe the three types of features we used as input to our model: audio features describing people's speaking status, people's head orientations, and a projection screen activity feature.
Speaking status of participants (1 if the person speaks, 0 otherwise) was obtained by thresholding the energy signal of each participant's close-talk microphone. To obtain more stable features, we smoothed the instantaneous speaking status by averaging it over a temporal window of 5 seconds, leading to the audio features \tilde{s}_t^k describing the proportion of time participant k speaks during this interval.
Head pose is the most important cue for recognizing people's VFOA. The head pose of a person was estimated by applying an improved version of the tracking method
Fig. 7 Dynamic Bayesian Network graphical model. Squares represent discrete variables, circles represent continuous variables. Observations are shaded and hidden variables are unshaded.
described in [4] to our mid-resolution head video sequences (center images of Fig. 5). The approach is quite robust, especially because head tracking and pose estimation are treated as two coupled problems in a Bayesian probabilistic framework. More precisely, the head location (position, scale, in-plane rotation) and pose (represented by a discrete index denoting an element of the out-of-plane head pose set, as shown in Fig. 6) were jointly tracked. Texture (output of one Gaussian and two Gabor filters) and skin-color head appearance models were built from a training database and used to evaluate the likelihood of the observed features. An additional pose-independent likelihood model based on a background subtraction feature was used to provide better localization information and reduce tracking failures. More details on the models and the estimation procedure can be found in [4].
A slide screen activity feature a_t characterizes the degree to which a slide-based presentation influences the participants' gaze patterns. Intuitively, the effect is stronger when a slide has just been displayed, and decreases as time goes by. Thus, we used as slide activity feature a_t the time that elapsed since the last slide change occurred2. To detect the slide changes, we applied a very fast compressed-domain method proposed by Yeo and Ramchandran [62] to the video stream from the central view camera (see Fig. 5).
Conversational events: to model the interactions between people's VFOA and their speaking statuses, we follow an approach similar to Otsuka et al [45]. We introduce a set E = {E_i, i ∈ 1...16} of 'conversational events' uniquely characterized by the set of people currently holding the floor. Since we are dealing with meetings of 4 people, the set comprises 16 events, which can be divided into 4 types: silence, monologue, dialogue, and discussion.
2 A slide change is defined as anything that modifies the slide-screen display. This can correspond to full slide transitions, but also to partial slide changes, or to a switch between a presentation and some other content (e.g. the computer desktop).
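To make the audio cues of this subsection concrete, the sketch below derives the binary speaking status from close-talk channel energy, smooths it over 5 seconds into the proportion-of-speaking feature \tilde{s}_t^k, and maps the current floor holders to one of the four event types. The frame rate and energy threshold are illustrative assumptions, not the values used in our system.

```python
# Illustrative extraction of speaking-status cues and conversational event types.
import numpy as np

FPS = 5                      # assumed feature rate (frames per second)
WIN = 5 * FPS                # 5-second smoothing window

def speaking_status(frame_energy, threshold=0.1):
    """Binary speaking status per frame, from the close-talk channel energy."""
    return (np.asarray(frame_energy) > threshold).astype(int)

def speaking_proportion(status, win=WIN):
    """Proportion of time the participant speaks within a centered window."""
    kernel = np.ones(win) / win
    return np.convolve(status, kernel, mode="same")

def event_type(status_of_all_participants_at_t):
    """Map the number of current floor holders to an event type (4 participants)."""
    n_speakers = int(np.sum(status_of_all_participants_at_t))
    return {0: "silence", 1: "monologue", 2: "dialogue"}.get(n_speakers, "discussion")
```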
3.2.2 Joint VFOA and Conversational Event DBN Model
To estimate the joint VFOA of people, we rely on the DBN model displayed in Fig. 7, according to which the joint distribution of all variables is given by:

p(f_{1:T}, e_{1:T}, \lambda, o_{1:T}, \tilde{s}_{1:T}, a_{1:T}) \propto p(\lambda) \prod_{t=1}^{T} p(o_t | f_t) \, p(\tilde{s}_t | e_t) \, p(f_t | f_{t-1}, e_t, a_t) \, p(e_t | e_{t-1}, a_t),    (1)
where f_t = (f_t^1, f_t^2, f_t^3, f_t^4), o_t = (o_t^1, o_t^2, o_t^3, o_t^4) and \tilde{s}_t = (\tilde{s}_t^1, \tilde{s}_t^2, \tilde{s}_t^3, \tilde{s}_t^4) denote respectively the joint VFOA of all participants, their head poses and their speaking proportions, and \lambda represents a set of model parameters. This DBN expresses the probabilistic relationships between our random variables and reflects the stochastic dependencies we have assumed. In the current case, one important assumption is that the conversational event sequence is the primary hidden process governing the dynamics of gaze and speaking patterns, i.e. gaze patterns and people's utterances are independent given the conversational events. Indeed, given a conversational event, who speaks is clearly defined. At the same time, since people tend to look at the current speakers, the conversational event also directly influences the estimation of people's gaze through the term p(f_t | f_{t-1}, e_t, a_t), which increases the prior probability of looking at the VFOA targets corresponding to the people who are actively involved in the current conversational event. This prior, however, is modulated by the duration a_t since the last slide change, to account for the impact of the slide presentation on gaze patterns. In the following, we introduce all the terms appearing in Eq. 1 and describe qualitatively their impact on the model.
The conversational event dynamics is assumed to factorize as p(e_t | e_{t-1}, a_t) = p(e_t | e_{t-1}) p(e_t | a_t). The first term was modeled to introduce temporal smoothness on the event sequence. The second term, learned from training data, introduces prior knowledge about the type of conversational event given the slide activity. Essentially, it was observed that in the seconds following a slide change, monologues occur slightly more frequently than in a normal conversation.
The VFOA dynamics is assumed to have the following factorized form:

p(f_t | f_{t-1}, e_t, a_t) \propto \Phi(f_t) \, p(f_t | f_{t-1}) \, p(f_t | a_t, e_t),    (2)

where the three terms are described below.
The shared focus prior \Phi(f_t) is a group prior which favors joint VFOA states in which a large number of people look at the same VFOA target, thereby reflecting statistics collected from our data, which showed that people look at the same focus more often than if they behaved independently.
The joint VFOA temporal transition term assumes conditional independence of people's VFOA given their previous focus, i.e. p(f_t | f_{t-1}) = \prod_{k=1}^{4} p(f_t^k | f_{t-1}^k).
Individual VFOA transitions were defined to enforce temporal smoothness, with a high probability of keeping the same focus and a lower probability of transiting to another focus.
The VFOA meeting contextual prior p(f_t | a_t, e_t) is the most interesting part. It models the prior probability of focusing on the VFOA targets given the meeting context (conversational event, slide activity). We first assumed the conditional independence of people's VFOA given the context, i.e. p(f_t | a_t, e_t) \propto \prod_{k=1}^{4} p(f_t^k | a_t, e_t), and then defined the prior model p(f_t^k = l | a_t = a, e_t = e_i) of any participant by following the intuition that this term should establish a compromise between the known properties that: (i) people tend to look at the speaker(s); (ii) during presentations, people look at the projection screen; (iii) the speaker's gazing behaviour might differ from that of the listeners. This was done by defining this term as:

p(f_t^k = l | a_t = a, e_t = e_i) \approx g_{l, ivs, e_i}(a) = \vartheta_1 e^{-\vartheta_2 a} + \vartheta_3,

where ivs denotes the involvement status of person k (speaker or listener) in the conversational event e_i, and the set of parameters \vartheta_j, j = 1, ..., 3 (one set for each configuration (l, ivs, e_i)) was obtained by fitting the training data. Fig. 8 illustrates the estimated model when the conversational event is a dialog. For the 'table' target, we can see that its probability is always high and not strongly dependent on the slide context a. However, whether or not the person is involved in the dialog, the probability of looking at the projection screen right after a slide change is very high, and steadily decreases as the time since the last slide change (a_t = a) increases. A reverse effect is observed for looking at people: right after a slide change the probability is low, but it keeps increasing as the time a increases. As could be expected, the probability of looking at the people involved in the dialog is much higher than that of looking at the side participants. For these latter targets, we can notice a different gazing behaviour depending on whether the person is involved in the dialog or not: people involved in the dialog do look at the side participants, whereas a side participant almost never looks at the other side participant.
Observation models relate the hidden variables to the speaking status and head orientation observed data. The speaking observation model assumes that, given the conversational event, the proportions of time that people speak are independent and can be set a priori. For instance, if E_j represents a monologue by person 2, we can define that person 2 speaks around 95% of the time during the predefined window, while the utterances of the other participants are assumed to correspond to backchannels, so that they speak only 5% of the time.
The head pose observation model is the most important term for gaze estimation. Assuming again conditional independence (p(o_t | f_t) = \prod_{k=1}^{4} p(o_t^k | f_t^k)), the head pose distribution when person k looks at target i was modelled as a Gaussian:

p(o_t^k | f_t^k = i) = N(o_t^k; \mu_{k,i}, \Sigma_{k,i}),    (3)
Fig. 8 Fitted contextual prior probabilities of focusing on a given target (occurrence probability), as a function of the time a since the last slide change, when the conversational event is a dialog, and for: Left: a person involved in the dialog. Right: a person not involved in the dialog. Taken from [2].
Fig. 9 Head pose-VFOA relationship. Assuming that the participant has a preferred reference gazing direction (usually in front of him), the gaze rotation towards the VFOA target is made partly by the head and partly by the eyes.
with mean \mu_{k,i} and covariance \Sigma_{k,i}. Setting these parameters is not a trivial task, especially for the means. While researchers have shown that there exists some correlation between the gazing direction and the head pose, this does not mean that the head is usually oriented in the gaze direction. Indeed, relying on findings in cognitive science about gaze shift dynamics [20, 26], we defined a gazing model [43] which exploits the fact that, to achieve a gaze rotation towards a visual target from a reference position, only a constant proportion \kappa of the rotation is made by the head, while the remaining part is made by the eyes. This is illustrated in Fig. 9. In other words, if \mu_k^{ref} denotes the reference direction of person k, and \mu_{k,i}^{gaze} the gaze direction associated with the VFOA target i, the head pose mean \mu_{k,i} can be defined by:

\mu_{k,i} - \mu_k^{ref} = \kappa (\mu_{k,i}^{gaze} - \mu_k^{ref}),    (4)

where the proportion \kappa is usually around 0.5. Roughly speaking, the reference direction can be defined as looking straight in front of the person. This also corresponds to the direction which lies approximately at the middle between the VFOA target extremes, as people orient themselves to be able to visually attend to all the VFOA targets of interest.
Model                                            seat 1   seat 2   seat 3   seat 4   average
Contextual                                         61.7     56.1     48.5     55.9     55.6
Independent                                        41.3     45.5     29.7     36.4     38.2
Contextual: no adaptation                          53.0     52.5     44.3     53.9     50.9
Contextual: speech/conversational events only      62.5     48.8     43.6     48.1     50.7

Table 1 VFOA recognition results (FRR, %) for different modeling factors. Taken from [2].
Model parameter adaptation and Bayesian inference. In the Bayesian framework, we can specify some prior knowledge about the model by defining a distribution p(\lambda) on the model parameters. Then, given the observations, the optimal parameters of the model are estimated during inference jointly with the optimal VFOA and conversational event sequences. In our work, we showed that this is important for parameters involving the VFOA state variables, and in particular for \mu_{k,i}, since we observed large variations between the head pose means computed empirically from the VFOA ground truth of different people. These variations can be due either to person-specific ways of looking at a given target (people turn their head more or less to look at targets), or to a bias in the estimated head pose introduced by the head pose tracking system.
The inference problem consists of maximizing p(f_{1:T}, e_{1:T} | a_{1:T}, s_{1:T}, o_{1:T}), the posterior distribution of the conversational events e_{1:T}, the joint VFOA f_{1:T}, and the model parameters \lambda, given the observation variables (see Eq. 1). Given the model complexity, direct and optimal estimation is intractable. Thus, we exploited the hierarchical structure of our model to define an approximate inference method which iterates two main steps: first, estimation of the conversational events given the VFOA states, which can be conducted using the Viterbi algorithm; second, estimation of the VFOA states and model parameters given the conversational events, using a Maximum A Posteriori (MAP) procedure [23]. Each iteration is guaranteed to increase the joint posterior probability. More details can be found in [2].
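As a concrete illustration of the head pose observation model, the sketch below builds the head-pose mean of each VFOA target from the gaze model of Eq. (4) and evaluates the Gaussian likelihood of Eq. (3), reduced to the pan angle for readability. The reference direction, target gaze directions, \kappa and the standard deviation are illustrative assumptions, not the learned or adapted values of our system.

```python
# Head pose observation model sketch (pan angle only), following Eqs. (3)-(4).
import numpy as np

KAPPA = 0.5   # fraction of the gaze rotation performed by the head (Eq. 4)

def head_pose_means(mu_ref, mu_gaze_targets, kappa=KAPPA):
    """Predicted head pan for each VFOA target, from Eq. (4)."""
    return {t: mu_ref + kappa * (g - mu_ref) for t, g in mu_gaze_targets.items()}

def pose_likelihoods(observed_pan, means, sigma=15.0):
    """Gaussian likelihood of the observed pan under each VFOA target, Eq. (3)."""
    norm = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return {t: norm * np.exp(-0.5 * ((observed_pan - m) / sigma) ** 2)
            for t, m in means.items()}

# Example: reference direction straight ahead (0 degrees) and assumed gaze
# directions towards the other seats, the table and the slide screen.
gaze = {"seat2": -60.0, "seat3": 40.0, "table": 10.0, "slide": 70.0}
means = head_pose_means(mu_ref=0.0, mu_gaze_targets=gaze)
print(pose_likelihoods(observed_pan=30.0, means=means))
```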
3.2.3 Results
Dataset and protocol. To evaluate the above model, we used 4 meetings of the AMI corpus, with durations ranging from 17 to 30 minutes, for a total of 1h30. The performance of our algorithms is measured in terms of the frame-based VFOA recognition rate (FRR), i.e. the percentage of frames which are correctly classified, and a leave-one-out protocol was adopted to allow the learning of some of the model parameters (mainly p(f_t^k | a_t, e_t)).
Main results. Table 1 provides the results obtained with our contextual model. Results with an 'independent' approach, i.e. with VFOA recognition done independently for each person solely from his head pose (in this case, the VFOA dynamics only contains the term p(f_t^k | f_{t-1}^k)), are also given. From the performance of this latter model, we see that our task and data are quite challenging, due mainly to our complex scenario, the meeting length, the behaviour of people, and the errors in the
head pose estimation (some heads are rather difficult to track given our image resolutions). We can also notice that the recognition task is more challenging for people at seats 3 and 4, with on average an FRR 8 to 10% lower than for seats 1 and 2. This is mainly explained by the fact that the VFOA targets span a pan angle of only about 90 degrees for seats 3-4, versus 180 degrees for seats 1-2, thus increasing the level of VFOA ambiguity associated with each head pose.
When using the contextual model, the recognition improves by a large amount, from 38.2% to 55.6%. The influence of the speaking activities (through the conversational events) and of the slide context on the VFOA estimation and adaptation process is beneficial, especially for seats 3 and 4, by helping to reduce some of the inherent ambiguities in focus. Indeed, the improvements can be attributed to two main interconnected aspects: first, the estimated conversational events and slide activity variables introduce a direct informative prior on the VFOA targets when inferring the VFOA states; second, the head pose model parameters for each VFOA target are adapted during inference, which accounts for the specificities of each person's gazing behaviour.
To evaluate the impact of the adaptation process, we removed it during inference. This led to a recognition rate drop of almost 5%, as shown in Table 1, demonstrating the benefit of adaptation. Notice that the removal of adaptation in the independent case actually led to a slight increase of 1.8%. The reason is that in this case adaptation is 'blind', i.e. the pose model is adapted to better represent the observed head pose distribution. In practice, as the head poses for some VFOA targets can be close, we noticed that the head pose mean of one VFOA target could drift towards the head poses of another VFOA target, resulting in significant errors. In the contextual approach, observed head poses are somehow 'weighted' by the context (e.g. when somebody speaks), which makes it possible to better associate head poses with the right VFOA target, leading to better adapted head pose means \mu_{k,i} and avoiding drift. Finally, Table 1 also shows the results when only the conversational events are used to model the utterance and gaze pattern sequence, i.e. when we remove the slide variable from the DBN model. The performance drop of around 5% clearly indicates the necessity of modeling the group activity in order to recognize the VFOA more accurately.
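The adaptation step itself is not spelled out above; the following sketch shows the kind of conjugate MAP update of a head-pose mean that a procedure in the spirit of [23] relies on. The prior strength tau and the data are illustrative assumptions; in the real system the update is interleaved with the VFOA inference.

```python
# MAP adaptation sketch: pull a VFOA target's head-pose mean towards the poses
# of the frames currently assigned to that target, anchored by the prior mean.
import numpy as np

def map_adapt_mean(prior_mean, assigned_poses, tau=10.0):
    """MAP estimate of the mean under a conjugate Gaussian prior of strength tau."""
    poses = np.asarray(assigned_poses, dtype=float)
    if poses.size == 0:
        return prior_mean                      # no evidence: keep the prior mean
    return (tau * prior_mean + poses.sum()) / (tau + poses.size)

# Example: the prior predicts 30 degrees, but the frames labelled with this
# target during inference cluster around 38 degrees.
print(map_adapt_mean(30.0, [37.0, 39.5, 38.0, 38.5]))   # moves towards the data
```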
3.2.4 Discussion
The above results (55%) show that recognizing the VFOA of people is not an easy task in general. According to our experience and the literature, several factors influence the accuracy that can be obtained. In the following, we highlight the main ones and propose some directions for improvement.
Physical setup and scenario. Our recognition results are not as high as those reported in the literature. Stiefelhagen et al [56] reported a recognition rate of 75% using audio-visual features, while Otsuka et al [46] reported results of 66%. In those two cases, meetings with 4 participants were considered. They were sitting in a circular fashion, at almost equally spaced positions, and only the other participants were considered
as VFOA targets (note that a 52% recognition rate is reported by Stiefelhagen [55] when using the same framework as in [56] but applied to a 5-person meeting). In addition, participants were only involved in discussions, and the meetings were much shorter (less than 8 minutes), which usually implies a more active use of the head to gaze at people. Overall, these meetings were quite different from our longer task-based meetings involving slide presentations, laptops and artefact manipulation, and thus 'more complex' gaze behaviour. Even in our setting, results range from 48% to 62% across seats, showing that good VFOA recognition can only be achieved if the visual targets of a person are well separated in the head pose angular space. Overall, our results indicate that obtaining in more complex meetings the same performance as published in earlier work on VFOA recognition [56] is definitely problematic, and that there are limitations in using the head pose as an approximation of the gaze.
Head pose estimation. Accurate head pose estimation is important for good results. Ba and Odobez [3] addressed VFOA recognition from head pose alone on 8 meetings recorded in the same setting described in this chapter. Head pose was either measured using a flock-of-birds (FOB) magnetic sensor or estimated with a visual tracker. This made it possible to separate the recognition errors due to using head pose as a proxy for gaze from the errors due to pose estimation. The results showed that VFOA recognition errors and head pose errors were correlated, and that there was an overall drop of around 13% when using the estimates instead of the FOB measurements. Note that Otsuka et al [46] report a drop of only 2% when using estimates rather than magnetic sensor measurements. This smaller drop can be explained by three facts: first, their setting and scenario are simpler. Second, each participant's face was recorded at high resolution by a camera looking directly at him, which resulted in lower tracking errors. Finally, they used an utterance/gaze interaction model similar to the one presented in this chapter (this was not the case in Ba and Odobez [3]), which may already correct some of the pose errors made by the tracker.
Exploitation of contextual cues. Our results showed that information on the slide activity improves VFOA recognition. A closer look at the recognition rates by target class showed that the joint use of conversational events and slide context further increases the slide screen recognition accuracy w.r.t. the sole use of the slide context (77% vs 70%). This indicates that the model is not simply adding priors on the slide or people targets according to the slide and conversational variables; rather, it is the temporal interplay between these variables, the VFOA and the adaptation process that makes the strength of the model. Of course, during a meeting, the slide presentation is not the only activity that attracts gaze. People use their laptops, or manipulate objects on the table. Any information on these activities could be used as further contextual cues in the model, and would probably lead to a performance increase. The issues are how, and with what accuracy, such contextual cues can be extracted. When considering laptops, for instance, we could imagine obtaining activity information directly from the device itself. Alternatively, we can rely on video analysis, but this will probably result in an accuracy
loss (e.g. it might be difficult to distinguish hand fidgeting or simple motion from actual laptop usage).
Improving behavioral models. There are several ways to improve the behavioral modeling. One way consists of introducing timing or duration models in the DBN to account for phenomena reported by psychologists, like the timing relationships between speaking turns and VFOA patterns: for instance, people avert their gaze at the beginning of a turn, or look for mutual gaze at the end of a turn. In another direction, since we show in the next section that people's VFOA is related to traits like dominance, one may investigate whether knowledge of these traits (e.g. introversion vs extroversion) could be used to define a more precise gaze model for a given person. The overall question when considering such modeling opportunities is whether the exploited properties are recurrent enough (i.e. sufficiently predictable) to yield a statistical improvement of the VFOA recognition.
4 From Speaking Activity and Visual Attention to Social Inference: Dominance Modeling
As an example of the potential uses of the nonverbal cues presented in this chapter to infer social behavior, we now focus on dominance, one fundamental construct of social interaction [7]. Dominance is just one example of the myriad of complex phenomena that emerge in group interaction, in which nonverbal communication plays a major role [34], and whose study represents a relatively recent domain in computing [21, 22]. In this section, we first summarize basic findings in social psychology about nonverbal displays of dominance, briefly review existing work in computing in this field, and then focus on a particular line of research that investigates the joint use of speaking activity and visual attention for dominance estimation, motivated by work in social psychology. Most of the material presented here has been discussed elsewhere [27, 28, 29, 30].
4.1 Dominance in Social Psychology
In social psychology and nonverbal communication, dominance is often seen in two ways, both "as a personality characteristic (trait)" and "to indicate a person's hierarchical position within a group (state)" [53] (pp. 421). Although dominance and related terms like power are often used interchangeably, a distinguishing approach taken in [15] defines power as "the capacity to produce intended effects, and in particular, the ability to influence the behavior of another person" (pp. 208), and dominance as a set of "expressive, relationally based communicative acts by which power is exerted and influence achieved," "one behavioral manifestation of the relational construct of power", and "necessarily manifest" (pp. 208-209).
Nonverbal displays of dominance involve voice and body activity. Cues related to the voice include the amount of speaking time, energy, pitch, rate, and interruptions [15]. Among these, speaking time has been shown to be a particularly robust (and intuitively logical) cue for perceiving dominance: dominant people tend to talk more [53]. Kinesic cues include body movement, posture, elevation, gestures, facial expressions, and visual attention [15, 7].
Among kinesic behaviors, visual attention represents a particularly strong cue for the nonverbal display and interpretation of dominance. Efran documented that high-status persons receive more visual attention than low-status people [17]. Cook et al. found that people who very rarely look at others in conversations are perceived as weak [11]. Furthermore, there is solid evidence that joint patterns of visual attention and speaking activity are correlated with dominance [18, 14]. More specifically, Exline et al. found that in dyadic conversations, high-status interaction partners displayed a higher looking-while-speaking to looking-while-listening ratio (i.e., the proportion between the time they would gaze at the other while talking and the time they would gaze at the other while listening) than low-status partners [18]. This ratio is known as the visual dominance ratio. Furthermore, [14] found that patterns of high and low dominance expressed by the visual dominance ratio could be reliably decoded by external observers. Overall, this cue has been found to be correlated with dominance across a variety of social factors including gender and personality [25].
4.2 Dominance in Computing
Automatic estimation of dominance in small groups from sensor data (cameras and microphones) is a relatively recent research area. In the following, we present a summary of existing work. A more detailed review can be found in [22].
Some methods have investigated the relation between cues derived from speaking activity and dominance. Rienks and Heylen [51] proposed an approach based on manually extracted cues (speaking time, number of speaking turns, and number of successful floor-grabbing events) and Support Vector Machines (SVMs) to classify the level of a person's dominance as high, normal, or low. Rienks et al [52] later compared supervised and unsupervised approaches for the same task. Other methods have investigated the use of both speaking and kinesic activity cues. Basu et al [6] used speaking turns, speaking energy, and body activity estimated from skin-color blobs to estimate the most influential participant in a debate, using a generative model called the influence model (IM). As part of our own research, we addressed the estimation of the most-dominant person using automatically extracted speaking activity (speaking time, energy), kinesic activity (coarse visual activity computed from compressed-domain video), and simple estimation models [27]. We later conducted a more comprehensive analysis in which the fusion of activity cues was studied [30]. The results suggested that, while speaking activity is the more discriminant modality, visual activity also carries information, and that cue fusion in the supervised setting can be useful.
Visual attention represents an alternative to the normally coarser measures that can be obtained to characterize kinesic activity with tractable methods. In the following section we describe in more detail work that explicitly fuses visual attention and speaking activity to predict the dominant people in group conversations.
4.3 Dominance Estimation from Joint Visual Attention and Speaking Activity
The application of findings in nonverbal communication about the discriminative power of joint visual attention and speaking activity patterns to the automatic modeling of group interaction is attractive. In addition to the technical challenges related to the extraction of these nonverbal cues, as discussed in previous sections, there are other relevant issues related to the application of nonverbal communication findings to multi-sensor environments. In real life, group interaction takes place in the presence of multiple visual targets in addition to people, including common areas like the table, whiteboards, and projector screens, and personal artifacts like computers and notebooks. In contrast, most nonverbal communication work has investigated cases where people are the main focus of the interaction.
Otsuka et al. proposed to automate the estimation of influence from visual attention and speaking activity [47], computing a number of intuitive measures of influence from conversational patterns and gaze patterns. While the proposed features are conceptually appealing, the authors presented neither an objective performance evaluation nor a comparison to previous methods. In contrast, [29] recently proposed the automation of two cues discussed in Section 4.1, namely received visual attention [17] and the visual dominance ratio [18], and studied their applicability in small group meetings on a systematic basis. For a meeting with N_P participants, lasting N_T time frames, and where people can look at N_F different focus targets (including each person, the slide screen, the whiteboard, and the table), the total received visual attention for each person i, denoted by rva_i, is given by

rva_i = \sum_{t=1}^{N_T} \sum_{j=1, j \neq i}^{N_P} \delta(f_t^j - i),    (5)

where f_t^j denotes the focus target of person j at time t, and \delta(\cdot) is the standard delta function.
The visual dominance ratio was originally defined for dyadic conversations. [29] proposed a simple extension of the concept to the multi-person case, revisiting the 'looking-while-speaking' definition to include all people whom a person looks at when she/he talks, and the 'looking-while-listening' case to include all cases when a person does not talk and looks at any speaker (approximated in this way given the difficulty of automatically estimating listening events). The resulting multi-party visual dominance ratio for person i, denoted by mvdr_i, is then given by

mvdr_i = \frac{mvdr_N^i}{mvdr_D^i},    (6)

where the numerator (looking-while-speaking) and the denominator (looking-while-listening) are respectively defined as

mvdr_N^i = \sum_{t=1}^{N_T} s_t^i \sum_{j=1, j \neq i}^{N_P} \delta(f_t^i - j),    (7)

mvdr_D^i = \sum_{t=1}^{N_T} (1 - s_t^i) \sum_{j=1, j \neq i}^{N_P} \delta(f_t^i - j) s_t^j,    (8)

where s_t^i denotes the binary speaking status (0/1: silence/speaking) of person i. For both rva and mvdr, the most dominant person is assumed to be the one with the highest value of the corresponding cue. We also studied the performance of the numerator and denominator terms of the multi-party visual dominance ratio taken in isolation, estimating the most dominant person as the one having the highest (resp. the lowest) value.
Speaking activity is estimated from close-talk microphones attached to each meeting participant, and visual attention is estimated through the DBN approach, following the techniques described earlier in this chapter. This approach was tested on two subsets of the AMI corpus [9]. The first one consists of 34 five-minute meeting segments for which three human observers agreed on the perceived most dominant person (referred to as having full agreement in the following discussion). The second set is a superset of the first one, and consists of 57 five-minute meeting segments for which at least two of the three observers agreed on the perceived most dominant person (referred to as having majority agreement). The two subsets model different conditions of variability of human judgment of dominance. The results are shown in Table 2.

Method     Full agreement (%)   Majority agreement (%)
rva              67.6                  61.4
mvdr             79.4                  71.9
mvdr_N           70.6                  63.2
mvdr_D           50.0                  45.6
Random           25.0                  25.0

Table 2 Meeting classification accuracy for the most-dominant person task, for both the full and majority agreement data sets (taken from [29]).
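As an illustration of Eqs. (5)-(8), the sketch below computes the two cues from per-frame focus and speaking-status arrays. The data layout is an assumption made for the example; this is not the code used for the experiments of Table 2.

```python
# Received visual attention (Eq. 5) and multi-party visual dominance ratio
# (Eqs. 6-8). focus[i][t] is the VFOA target of person i at frame t (a person
# identifier or a non-person label); speak[i][t] is the binary speaking status.

def received_visual_attention(focus, person_i):
    """rva_i: number of frames in which another participant looks at person i."""
    return sum(int(focus[j][t] == person_i)
               for j in focus if j != person_i
               for t in range(len(focus[j])))

def multiparty_vdr(focus, speak, person_i):
    """mvdr_i: looking-while-speaking over looking-while-listening."""
    others = [j for j in speak if j != person_i]
    frames = range(len(focus[person_i]))
    num = sum(speak[person_i][t] and focus[person_i][t] == j
              for t in frames for j in others)
    den = sum((not speak[person_i][t]) and focus[person_i][t] == j and speak[j][t]
              for t in frames for j in others)
    return num / den if den else float("inf")

# The most dominant person is then predicted as the participant maximizing
# rva or mvdr over the meeting segment, as in the experiments reported above.
```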
Considering the performance of these four cues on the full-agreement data set, it is clear that all the tested cues perform better than random, with the visual dominance ratio performing best. This suggests that there is a benefit in combining visual attention and speaking activity (79.4%) compared to using only received visual attention (67.6%) as a predictor of the most dominant person. Nevertheless, it is remarkable that by using only visual cues (i.e., not using what is being said at
all), relatively good performance is obtained, even more so if we consider that the automatically extracted visual attention is far from perfect, with a VFOA recognition performance of around 50% as described earlier in the chapter. It is important to note that these two cues are applied to the group setting, which differs from their original use in psychology not only due to the presence of multiple people but also due to the presence of meeting artifacts (table, screen) that significantly affect the group dynamics. Analyzing the results for the components of the MVDR, it appears that the numerator is more discriminant than the denominator taken in isolation, but also that their combination is beneficial. Finally, similar trends are observed for the performance on the majority-agreement data set, with a noticeable degradation of performance in all cases, which seems to reflect the fact that this data set is inherently more challenging given the larger variation in human perception of dominance. Current work is examining other measures that combine visual attention and speaking activity (see Fig. 10). Overall, the work discussed in this section can be seen as complementing the work oriented towards on-line applications presented in Section 2.1, providing further insights into the value of nonverbal cue fusion for the automatic estimation of social constructs in conversations that take place in multi-sensor environments.
5 Concluding Remarks
In this chapter, we have shown that there is a growing interest in exploiting conversation analysis for building applications in the workplace. We have presented an overview and discussed some key issues on extracting speaking activity and visual attention from multi-sensor data, including through the modeling of their interaction in conversation. Finally, we have described their combined use for estimating social dominance.
Despite the recent advances in conversation scene analysis, several inter-related issues still need to be investigated to move towards easy processing in real-life situations or to address the inference of other behavioral and social cues. A first key issue is the robustness of the cue extraction process with respect to the sensor setting, and the constraints that this setting may impose on people's mobility or behaviour. Regarding audio, most existing work on social interaction analysis in meetings relies on good-quality audio signals extracted from close-talk microphones. A simpler and more desirable setting would involve non-obtrusive, possibly single, distant microphones, but this comes at the cost of larger speaker segmentation errors. VFOA extraction faces several challenges as well, as discussed in Section 3.2. While cameras are not intrusive, they need to continuously monitor all people's heads and faces, which clearly requires a multiple-camera setting in scenarios involving moving people. In addition, relating head orientation to VFOA targets depends on both people and VFOA target locations. Estimating this relationship can be done using learning sequences, but this quickly becomes burdensome when
Fig. 10 Estimating dominance over time from multiple cues. A distinct color is used to indicate all information related to each of the four meeting participants (identified by a circle and a letter ID drawn over the person’s body). The top panels (left and right) show the two camera views from which location (colored bounding boxes), head pose (colored arrows) and visual focus (colored circles above people’s heads) are estimated for each person. Notice that person B is recognized as being unfocused, while person C is recognized as looking at the table. The current speaker (person B) is highlighted by a light gray square. The bottom left panel shows speaking activity cues (speaking time, speaking turns, and speaking interruptions) and visual attention cues (received visual focus) estimated over time. The bottom right panel shows the dominance level estimated over time for each participant. In this case, two people (B and C) dominate the conversation.
multiple VFOA target models for multiple locations have to be estimated. The use of gaze models, as presented in this chapter, is more versatile, but still requires coarse camera calibration as well as estimates of people's 3D head positions and of the VFOA target locations.
Still, not all applications of interest may need accurate cue extraction for all people. For instance, in the Remote Participant application, one might only be interested in detecting when the majority of collocated participants (or even only the chairman) look at the screen display of the remote participant, to detect whether the latter is addressed. Also, some applications rely on measurement statistics rather than instantaneous cues, and may therefore tolerate substantial individual errors. As an example, the use of the recognized VFOA for dominance estimation, as reported in Subsection 4.3, leads to a performance decrease w.r.t. the use of the ground truth [29], but not as much as could be expected given the obtained 55% VFOA recognition rate. Similarly, Hung et al. [28], who studied the problem of predicting the most-dominant person in the single distant microphone case, using a fast speaker diarization
algorithm and speaking time as the only cue, reported a similar behaviour: performance was lower than with the cleaner close-talk microphone signals, but the larger decrease in diarization performance did not have the same degree of negative impact on the estimation of the most-dominant person.
Another important issue which has to be taken into account is the group size, which is known to impact the social dynamics of the conversation. Floor control, gaze, and other nonverbal cue patterns may differ significantly in large groups compared to small groups. Fay et al [19] report that people's speaking behaviour is quite different in 10-person groups, where communication evolves as a series of monologues, than in 5-person meetings, where communication is a dialogue, and that dominance is perceived differently in the two situations. Most of the work so far on automated conversation analysis has been done on small groups of 2 to 5 people. Indeed, for larger group sizes, automatic cue extraction becomes very difficult, especially when using non-obtrusive techniques. This is the case for audio when using distant microphones, and even more for VFOA extraction, where the approximation of gaze by the head pose will no longer be sufficient, since gazing might be achieved by eye motion only. From this perspective, the investigation of automated conversation and social analysis in large group meetings is very challenging and remains relatively unexplored.
Acknowledgements Our work has been supported by the EC project AMIDA (Augmented Multi-Party Interaction with Distant Access), the Swiss National Center for Competence in Research (NCCR) on Interactive Multimodal Information Management (IM2), and the US research program VACE (Video Analysis and Content Extraction). We thank Sileye Ba, Hayley Hung, and Dinesh Jayagopi (members of our research groups at Idiap) for their contributions to some of the research described here. We also thank Pierre Dillenbourg, Taemie Kim, Joan Morris DiMicco, Andrei Popescu-Belis, and Janienke Sturm for giving permission to reproduce the pictures presented in Figures 1 and 2.
References
[1] Argyle M, Graham J (1977) The Central Europe experiment: looking at persons and looking at things. Journal of Environmental Psychology and Nonverbal Behaviour 1:6–16
[2] Ba S, Odobez JM (2008) Multi-person visual focus of attention from head pose and meeting contextual cues. Tech. Rep. 47, Idiap Research Institute
[3] Ba S, Odobez JM (2008) Recognizing human visual focus of attention from head pose in meetings. IEEE Trans. on Systems, Man, and Cybernetics, Part B, Vol. 39, No. 1, pp. 16-34, Feb 2009
[4] Ba SO, Odobez JM (2005) A Rao-Blackwellized mixed state particle filter for head pose tracking. In: Proc. ACM-ICMI-MMMP, pp 9–16
[5] Bachour K, Kaplan F, Dillenbourg P (Sept, 2008) An interactive table for regulating face-to-face collaborative learning. In: Proc. European Conf. on Technology-Enhanced Learning (ECTEL), Maastricht
[6] Basu S, Choudhury T, Clarkson B, Pentland A (Dec. 2001) Towards measuring human interactions in conversational settings. In: Proc. IEEE CVPR Int. Workshop on Cues in Communication (CVPR-CUES), Kauai [7] Burgoon JK, Dunbar NE (2006) The Sage Handbook of Nonverbal Communication, Sage, chap Nonverbal expressions of dominance and power in human relationships [8] Cappella J (1985) Multichannel integrations of nonverbal behavior, Erlbaum, chap Controlling the floor in conversation [9] Carletta J, Ashby S, Bourban S, Flynn M, Guillemot M, T Hain JK, Karaiskos V, Kraaij W, Kronenthal M, Lathoud G, Lincoln M, A Lisowska IM, Post W, Reidsma D, Wellner P (2005) The AMI meeting corpus: A pre-announcement. In: Proc. Workshop on Machine Learning for Multimodal Interaction (MLMI), Edinburgh [10] Chen L, Harper M, Franklin A, Rose T, Kimbara I (2005) A Multimodal Analysis of Floor Control in Meetings. In: Proc. Workshop on Machine Learning for Multimodal Interaction (MLMI) [11] Cook M, Smith JMC (1975) The role of gaze in impression formation. British Journal of Social and Clinical Psychology [12] DiMicco JM, Pandolfo A, Bender W (2004) Influencing group participation with a shared display. In: Proc. ACM Conf. on Computer Supported Cooperative Work (CSCW), Chicago [13] Dines J, Vepa J, Hain T (2006) The segmentation of multi-channel meeting recordings for automatic speech recognition. In: Int. Conf. on Spoken Language Processing (Interspeech ICSLP) [14] Dovidio JF, Ellyson SL (1982) Decoding visual dominance: atributions of power based on relative percentages of looking while speaking and looking while listening. Social Psychology Quarterly 45(2):106–113 [15] Dunbar NE, Burgoon JK (2005) Perceptions of power and interactional dominance in interpersonal relationships. Journal of Social and Personal Relationships 22(2):207–233 [16] Duncan Jr S (1972) Some signals and rules for taking speaking turns in conversations. Journal of Personality and Social Psychology 23(2):283–292 [17] Efran JS (1968) Looking for approval: effects of visual behavior of approbation from persons differing in importance. Journal of Personality and Social Psychology 10(1):21–25 [18] Exline RV, Ellyson SL, Long B (1975) Advances in the study of communication and affect, Plenum Press, chap Visual behavior as an aspect of power role relationships [19] Fay N, Garod S, Carletta J (2000) Group discussion as interactive dialogue or serial monologue: the influence of group size. Psychological Science 11(6):487–492 [20] Freedman EG, Sparks DL (1997) Eye-head coordination during headunrestrained gaze shifts in rhesus monkeys. Journal of Neurophysiology 77:2328–2348
[21] Gatica-Perez D (2006) Analyzing human interaction in conversations: a review. In: Proc. IEEE Int. Conf. on Multisensor Fusion and Integration for Intelligent Systems (MFI), Heidelberg [22] Gatica-Perez D (2009) Automatic Nonverbal Analysis of Social Interaction in Small Groups: a Review, Image and Vision Computing, Special Issue on Human Naturalistic Behavior [23] Gauvain J, Lee CH (1992) Bayesian learning for hidden Markov model with Gaussian mixture state observation densities. Speech Communication 11:205– 213 [24] Goodwin C, Heritage J (1990) Conversation analysis. Annual Review of Anthropology pp 981–987 [25] Hall JA, Coats EJ, LeBeau LS (2005) Nonverbal behavior and the vertical dimension of social relations: A meta-analysis. Psychological Bulletin 131(6):898–924 [26] Hayhoe M, Ballard D (2005) Eye movements in natural behavior. TRENDS in Cognitive Sciences 9(4):188–194 [27] Hung H, Jayagopi D, Yeo C, Friedland G, Ba SO, Odobez JM, Ramchandran K, Mirghafori N, Gatica-Perez D (2007) Using audio and video features to classify the most dominant person in a group meeting. In: Proc. of ACM Multimedia [28] Hung H, Huang Y, Friedland G, Gatica-Perez D (2008) Estimating the dominant person in multi-party conversations using speaker diarization strategies. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), Las Vegas [29] Hung H, Jayagopi D, Ba S, Odobez JM, Gatica-Perez D (2008) Investigating automatic dominance estimation in groups from visual attention and speaking activity. in Proc. Int. Conf. on Multimodal Interfaces (ICMI), Chania, October. [30] Jayagopi D, Hung H, Yeo C, Gatica-Perez D (2009) Modeling dominance in group conversations using nonverbal activity cues. IEEE Trans. on Audio, Speech, and Language Processing, Special Issue on Multimodal Processing for Speech-based Interactions, Vol. 17, No. 3, pp. 501-513. March [31] Jovanovic N, Op den Akker H (2004) Towards automatic addressee identification in multi-party dialogues. In: 5th SIGdial Workshop on Discourse and Dialogue [32] Kendon A (1967) Some functions of gaze-direction in social interaction. Acta Psychologica 26:22–63 [33] Kim T, Chang A, Holland L, Pentland A (2008) Meeting mediator: Enhancing group collaboration with sociometric feedback. In: Proc. ACM Conf. on Computer Supported Cooperative Work (CSCW), San Diego [34] Knapp ML, Hall JA (2005) Nonverbal Communication in Human Interaction. Wadsworth Publishing [35] Kouadio M, Pooch U (2002) Technology on social issues of videoconferencing on the internet: a survey. Journal of Network and Computer Applications 25:37–56
[36] Kulyk O, Wang J, Terken J (2006) Real-time feedback on nonverbal behaviour to enhance social dynamics in small group meetings. In: Proc. Workshop on Machine Learning for Multimodal Interaction (MLMI) [37] Langton S, Watt R, Bruce V (2000) Do the eyes have it ? cues to the direction of social attention. Trends in Cognitive Sciences 4(2):50–58 [38] Lathoud G (2006) Spatio-temporal analysis of spontaneous speech with microphone arrays. PhD thesis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland [39] Lathoud G, McCowan I (2003) Location Based Speaker Segmentation. In: Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-03), Hong Kong [40] Matena L, Jaimes A, Popescu-Belis A (2008) Graphical representation of meetings on mobile devices. In: MobileHCI conference, Amsterdam, The Netherlands [41] Morimoto C, Mimica M (2005) Eye gaze tracking techniques for interactive applications. Computer Vision and Image Understanding 98:4–24 [42] Novick D, Hansen B, Ward K (1996) Coordinating turn taking with gaze. In: International Conference on Spoken Language Processing [43] Odobez JM, Ba S (2007) A Cognitive and Unsupervised MAP Adaptation Approach to the Recognition of Focus of Attention from Head pose. In: Proc. of ICME [44] Ohno T (2005) Weak gaze awareness in video-mediated communication. In: Proceedings of Conference on Human Factors in Computing Systems, pp 1709–1712 [45] Otsuka K, Takemae Y, Yamato J, Murase H (2005) A probabilistic inference of multiparty-conversation structure based on markov-switching models of gaze patterns, head directions, and utterances. In: Proc. of ICMI, pp 191–198 [46] Otsuka K, Yamato J, Takemae Y, Murase H (2006) Conversation scene analysis with dynamic bayesian network based on visual head tracking. In: Proc. of ICME [47] Otsuka K, Yamato J, Takemae Y, Murase H (2006) Quantifying interpersonal influence in face-to-face conversations based on visual attention patterns. In: Proc. ACM CHI Extended Abstract, Montreal [48] Ramírez J, Górriz J, Segura J (2007) Robust speech recognition and understanding, I-Tech, I-Tech Education and Publishing, Vienna, chap Voice activity detection: Fundamentals and speech recognition system robustness [49] Ranjan A, Birnholtz J, Balakrishnan R (2008) Improving meeting capture by applying television production principles with audio and motion detection. In: CHI ’08: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, ACM, New York, NY, USA, pp 227–236, DOI http://doi.acm.org/10.1145/1357054.1357095 [50] Rhee HS, Pirkul H, Jacob V, Barhki R (1995) Effects of computer-mediated communication on group negotiation: Au empirical study. In: Proceedings of the 28th Annual Hawaii International Conference on System Sciences, pp 981– 987
[51] Rienks R, Heylen D (2005) Automatic dominance detection in meetings using easily detectable features. In: Proc. Workshop on Machine Learning for Multimodal Interaction (MLMI), Edinburgh [52] Rienks R, Zhang D, Gatica-Perez D, Post W (2006) Detection and application of influence rankings in small-group meetings. In: Proc. Int. Conf. on Multimodal Interfaces (ICMI), Banff [53] Schmid Mast M (2002) Dominance as expressed and inferred through speaking time: A meta-analysis. Human Communication Research 28(3):420–450 [54] Shriberg E, Stolcke A, Baron D (2001) Can prosody aid the automatic processing of multi-party meetings? evidence from predicting punctuation, disfluencies, and overlapping speech. In: ISCA Tutorial and Research Workshop (ITRW) on Prosody in Speech Recognition and Understanding (Prosody 2001) [55] Stiefelhagen R (2002) Tracking and modeling focus of attention. PhD thesis, University of Karlsruhe [56] Stiefelhagen R, Yang J, Waibel A (2002) Modeling focus of attention for meeting indexing based on multiple cues. IEEE Trans on Neural Networks 13(4):928–938 [57] Sturm J, Herwijnen OHV, Eyck A, Terken J (2007) Influencing social dynamics in meetings through a peripheral display. In: Proc. Int. Conf. on Multimodal Interfaces (ICMI), Nagoya [58] Takemae Y, Otsuka K, Yamato J (2005) Automatic video editing system using stereo-based head tracking for multiparty conversation. In: ACM Conference on Human Factors in Computing Systems, pp 1817–1820 [59] Valente F (2006) Infinite models for speaker clustering. In: Int. Conf. on Spoken Language Processing (Interspeech ICSLP) [60] Vijayasenan D, Valente F, Bourlard H (2008) Integration of tdoa features in information bottleneck framework for fast speaker diarization. In: Interspeech 2008 [61] Wrigley SJ, Brown GJ, Wan V, Renals S (2005) Speech and crosstalk detection in multi-channel audio. IEEE Trans on Speech and Audio Processing 13:84–91 [62] Yeo C, Ramchandran K (2008) Compressed domain video processing of meetings for activity estimation in dominance classification and slide transition detection. Tech. Rep. UCB/EECS-2008-79, EECS Department, University of California, Berkeley
Using Multi-modal Sensing for Human Activity Modeling in the Real World
Beverly L. Harrison, Sunny Consolvo, and Tanzeem Choudhury
1 Introduction
Traditionally, smart environments have been understood to represent those (often physical) spaces where computation is embedded into the users' surrounding infrastructure, buildings, homes, and workplaces. Users of this "smartness" move in and out of these spaces. Ambient intelligence assumes that users are automatically and seamlessly provided with context-aware, adaptive information, applications and even sensing, though this remains a significant challenge even when limited to these specialized, instrumented locales. Since not all environments are "smart", the experience is not a pervasive one; rather, users move between these intelligent islands of computationally enhanced space while we still aspire to achieve a more ideal anytime, anywhere experience. Two key technological trends are helping to bridge the gap between these smart environments and make the associated experience more persistent and pervasive. Smaller and more computationally sophisticated mobile devices allow sensing, communication, and services to be more directly and continuously experienced by users. Improved infrastructure and the availability of uninterrupted data streams, for instance location-based data, enable new services and applications to persist across environments.
Previous research from our labs has investigated location-awareness [8, 20, 21] and instrumented objects and environments [6, 19, 11]. In this chapter, we focus on wearable technologies that sense user behavior, applications that leverage such sensing, and the challenges in deploying these types of intelligent systems in real
Beverly Harrison and Sunny Consolvo
Intel Research, 1100 NE 45th Street, Seattle, WA 98103, USA, e-mail: {beverly.harrison, sunny.consolvo}@intel.com
Tanzeem Choudhury
Dartmouth College, Department of Computer Science, 6211 Sudikoff Lab, Hanover, NH 03755, USA, e-mail: [email protected]
In particular, we discuss technology and applications that continuously sense and infer physical activities. In this chapter, we discuss our efforts over five years to implement, iterate, and deploy a wearable mobile sensing platform for human activity recognition. The goal of this system was to encourage individuals to be physically active. We use technology both to automatically infer physical activities and to employ persuasive strategies [7] to motivate individuals to be more active. This general area is of growing interest to the human-computer interaction and ubiquitous computing research communities, as well as the commercial marketplace. Based on our experiences, we highlight some of the key issues that must be considered if such systems are to be successfully integrated into real-world human activity detection.
2 Technologies for Tracking Human Activities

We are interested in building mobile platforms to reliably sense real-world human actions and in developing machine learning algorithms to automatically infer high-level human behaviors from low-level sensor data. In Section 2.1, we briefly discuss several common systems and approaches used to collect such data. In Section 2.2, we outline our particular implementation of the Mobile Sensing Platform (MSP) and an application built using this MSP, UbiFit Garden.
2.1 Methods for Logging Physical Activity

Several technologies used to sense human physical activity employ a usage model where the technology is used only while performing the target activity. These technologies include Dance Dance Revolution, the Nintendo Wii Fit, the Nike+ system, Garmin's Forerunner, Bones in Motion's Active Mobile & Active Online, bike computers, heart rate monitors, MPTrain, Jogging over a distance, shadowboxing over a distance, and mixed- and virtual-reality sports games [17, 16, 14, 15, 10, 12]. Perhaps the most common commercial device that detects physical activity throughout the day is the pedometer: an on-body sensing device that detects the number of "steps" the user takes. The usage model of the pedometer is that the user clips the device to his or her waistband above the hip, where the pedometer's simple "inference model" counts alternating ascending and descending accelerations as steps. This means that any manipulation of the device that activates the sensor is interpreted as a step, which often leads to errors. Another, more sophisticated commercial on-body sensing device that infers physical activity is BodyMedia's SenseWear Weight Management Solution. The SenseWear system's armband monitor senses skin temperature, galvanic skin response, heat flux, and a 2-d accelerometer to infer energy expenditure (i.e., calories burned), physical activity duration and intensity, step count, sleep duration, and sleep efficiency. SenseWear's inference model
calculates calories burned and exercise intensity; aside from step count, it does not infer specific physical activities (e.g., running or cycling). However, as we have learned from our own prior research [3], a common problem when designing systems based on commercial equipment is that the systems are often closed. That is, in our prior work that used pedometers, users had to manually enter the step count readings from the pedometer into our system, as our system could not automatically read the pedometer's step count. Several researchers have recognized this problem, which has led to a new generation of experimental sensing and inference systems. One approach is to infer physical activity from devices the user already carries or wears, such as Sohn et al.'s [20] software for GSM smart phones that uses the rate of change in cell tower observations to approximate the user's daily step count. Shakra [12] also uses the mobile phone's travels to infer total "active" minutes per day and the states of being stationary, walking, and driving. A different approach, used to detect a wider range of physical activities such as walking, running, and resistance training, is to wear multiple accelerometers simultaneously on different parts of the body (e.g., wrist, ankle, thigh, elbow, and hip) [1]. While this approach has yielded high accuracy rates, it is not a practical form factor when considering all-day, everyday use. Another approach uses multiple types of sensors (e.g., accelerometer, barometer) worn at a single location on the body (e.g., hip, shoulder, or wrist) [9, 2]. Such multi-sensor devices are more practical for daily use while still being capable of detecting a range of activities, and are thus the approach we have chosen for our work.
2.2 The Mobile Sensing Platform (MSP) and UbiFit Garden Application

Unlike many of the above technologies, our work focuses on detecting physical activities throughout the day, rather than during single, planned workout sessions. This requires that the sensing technology be something that the individual can wear throughout the day and that it disambiguates the target activity(ies) from the other activities performed throughout daily life. The UbiFit Garden system uses the Mobile Sensing Platform (MSP) [2] to automatically infer and communicate information about particular types of physical activities (e.g., walking, running, cycling, using the elliptical trainer, and using the stair machine) in real time to a glanceable display and interactive application on the phone. The MSP is a pager-sized, battery-powered computer with sensors chosen to facilitate a wide range of mobile sensing applications (Fig. 1). The sensors on the MSP can sense: motion (accelerometer), barometric pressure, humidity, visible and infrared light, temperature, sound (microphone), and direction (digital compass). It includes a 416 MHz XScale microprocessor, 32 MB of RAM, and 2 GB of flash memory (bound by the storage size of a removable miniSD card) for storing programs and logging data (Fig. 2). The MSP's Bluetooth
networking allows it to communicate with Bluetooth-enabled devices such as mobile phones. The MSP runs a set of boosted decision stump classifiers [18, 22] that have been trained to infer walking, running, cycling, using an elliptical trainer, and using a stair machine. The individual does not have to do anything to alert the MSP or the interactive application that she is starting or stopping an activity, provided that the MSP is powered on, communicating with the phone, and being worn on the individual's waistband above her right hip (the common usage model for pedometers). The MSP automatically distinguishes those five trained activities from the other activities that the individual performs throughout her day (Fig. 3). The inferences for those five types of physical activities are derived from only two of the MSP's sensors: the 3-axis accelerometer (measuring motion) and the barometer (measuring air pressure). The barometric pressure sensor can be used to detect elevation changes and helps to differentiate between activities such as walking on level ground and walking up or down. The sensor data is processed and the activity classification is performed on the MSP itself; the inference results are then communicated via Bluetooth to a mobile phone that runs the interactive application and glanceable display (Fig. 4). The MSP communicates a list of activities and their predicted likelihood values to the phone four times per second. This means that the MSP and phone
Fig. 1 The Mobile Sensing Platform (MSP) prototypes. (a) The MSP in its original gray case; (b) the gray case MSP as worn by a woman while using the elliptical trainer; (c) the gray case MSP worn by the same woman in casual attire; (d) the MSP in its redesigned black case; and (e) the black case MSP as worn by the same woman while using the elliptical trainer.
must be within Bluetooth range at all times during the performance of the activity. The interactive application then aggregates and "smoothes" these fine-grain, noisy data, resulting in "human scale" activities such as an 18-minute walk or a 45-minute run. The activity's duration and start time appear in the interactive application about six to eight minutes after the individual has completed the activity (Fig. 4). "Tolerances" have been built into the definitions of activities to allow the individual to take short breaks during the activity (e.g., to stop at a traffic light before crossing the street during a run, walk, or bike ride). UbiFit defines the minimum durations and tolerances for activities to appear in the interactive application's list. Walks of five or more minutes automatically post to the interactive application, as do cardio activities of eight or more minutes. However, to receive a reward for the activities performed (i.e., a new flower appearing on the glanceable display), each instance of an activity must be at least 10 minutes in duration. This means that activities may appear in the individual's list that do not map to flowers on the glanceable display (e.g., an inferred 8-minute walk would
Fig. 2 The Mobile Sensing Platform (MSP) consists of seven types of sensors, an XScale processor, a Bluetooth radio, and flash memory in a pager-sized casing.
Fig. 3 Inferring activities from sensor readings
appear in the interactive application, but would not have a corresponding flower in the glanceable display). If the MSP is reasonably confident that the individual performed an activity, but cannot determine which activity, the interactive application will pop up a question asking the individual if she performed an activity that should be added to her journal.
2.3 User Studies

Two field trials of UbiFit Garden (with 12 and 28 participants, respectively) helped illustrate how activity monitoring technologies [5, 4] fit into everyday experiences. Participants were recruited from the Seattle metropolitan area and were regular mobile phone users who wanted to increase their physical activity. They received the phone, the fitness device, and instructions on how to use the equipment. The participants agreed to put their SIM cards in the study phone and use it as their personal phone throughout the study. They set a weekly physical activity goal of their own choosing, which had to consist of at least one session per week of cardio, walking, resistance training, or flexibility training. The participants were interviewed about their experiences in the study and also given the opportunity to revise their weekly physical activity goal. The intent of the second experiment was to get beyond potential novelty effects that may have been present in the three-week field trial and to systematically explore the effectiveness of the glanceable display and fitness device components through
Fig. 4 (a) Screen shot of UbiFit Glanceable Display; (b) UbiFit Garden’s interactive application showing two automatically inferred walks.
the use of experimental conditions. Additionally, whereas our three-week study focused on reactions to activity inference and the overall concept of UbiFit Garden, the second field experiment specifically investigated the effectiveness of the glanceable display and fitness device as a means of encouraging awareness of, and engagement in, physical activity.
3 Usability, Adaptability, and Credibility

Based on the results of our field trials, we present in this section three fundamental aspects of intelligent systems that we believe critically impact the success of mobile activity monitoring systems. The overall usability of the system from the end-user perspective includes not only the user interface itself, through which the data is communicated, modified, or monitored; usability of such systems also encompasses the underlying infrastructure to support data communications and power management (in particular, when users need to diagnose, or detect and correct, errors or failures). The adaptability of the system includes learning particular user patterns and adapting across usage contexts (noisy environments, outdoor vs. indoor, etc.). Finally, the system performance, usability, adaptability, and accuracy combine to directly impact the overall credibility. Credibility means that end users build a mental model of the system that enables them to feel it is reliable and produces expected and accurate data, and that when errors do occur, users can mentally "explain" the causes of these errors and take appropriate corrective actions. There is an assumption that such errors (and corrective actions) improve the system over time. A lack of credibility is typically reflected in abandonment of the technology. We discuss specific examples of each of these from our experiences with the MSP and UbiFit Garden deployments.
3.1 Usability of Mobile Inference Technology

When creating smart technologies such as the one described above, a number of significant basic challenges must be addressed. The system must provide value to motivate its use, and this value must outweigh the costs of using such a system. Fundamental usability issues include not only how wearable the technology is, but how to handle any variations in connectivity smoothly, how long the system can run between charges, how accurate the underlying inference is, the extent of personalization or user-specified training data required, and the design of the user interface that provides these intelligent services or applications to consumers. We briefly discuss our experiences with these issues below.
3.1.1 Form Factor and Design

Despite many design iterations over two years and pilot testing within our research lab, participants in the two field studies of UbiFit Garden complained about the form factor of the MSP prototypes. This result was not surprising given the early nature of the prototypes and the fact that several members of the research team had used the full UbiFit Garden system for up to two years and were well aware of the MSP's limitations (perhaps too well informed about the technical constraints). Both the original gray case and redesigned black case versions of the MSP were large, still slightly too heavy at 115 g (though lighter than many other comparable devices), bulky, somewhat uncomfortable (e.g., the prototypes poked several participants in the hip), drew attention from others (particularly the bright LED), occasionally pulled participants' waistbands down (particularly for females during high-intensity activities), and did not last long enough on a single charge (the battery life was approximately 11.5 hours for the gray case MSPs and 16 hours for the black case MSPs) (see Fig. 1). (The main difference between the two cases was that the second version contained two batteries instead of one, thereby increasing the time between charges.) A participant explained: "It was a lot of machinery to carry around, to be perfectly honest. Like it's just trying to remember to plug in the one [MSP] when I also had to remember to plug in my phone and I mean, I'm just never on top of that stuff and so then after I plugged it in and I'd grab my phone, I'd forget, you know, some days forget the device [the MSP]. And it [the MSP] would pull my pants down when I would be running or doing something more vigorous like that." Some participants in the 3-month field experiment joked that when they transitioned from the gray case to the black case MSP, they upgraded from looking as if they were wearing an industrial tape measure on their waist to looking like a medical doctor from the 1990s. In spite of its "dated pager" appearance, all participants preferred the black case, and despite the complaints, most participants liked the idea of the fitness device and understood that they were using an early-stage prototype, not a commercial-quality product.
3.1.2 Power and Connectivity Issues

Many intelligent applications or services assume users have continuous network connectivity. While our deployments ran in urban environments with extensive GSM/cell phone, WiFi, and WiMax network coverage, and we continuously tested Bluetooth connectivity between the sensor platform and the communication platform (i.e., MSP to cell phone), there were nevertheless times when some portion of the user-perceived communications failed. Note that most end users are not experienced at diagnosing and troubleshooting such problems; the system simply appears broken. If network connectivity was temporarily lost, our system continued running in a local mode and would re-synch when able, without user intervention. Similarly, if the Bluetooth connection failed, we could again continue running continuously (in this
case sensing motion data and inferring activity), and we stored the data locally on the MSP storage card until Bluetooth connectivity was restored (often by prompting the user on the cell phone to power the devices on and off or to verify that they were both in range). If this connectivity break occurred while a participant was performing an activity, the activity often either did not appear on the phone at all or appeared with errors (e.g., appearing as two activities instead of one, having an error in start time, and/or having an error in duration). When this work started in 2003, we did anticipate that cell phones would eventually have integrated sensors, and thus having a separate sensor platform was a means of prototyping and testing such systems until sensor-equipped cell phones became more prevalent. Clearly such integration would solve one communication issue. When designing systems that rely upon several battery-powered components (in our case the MSP and the cell phone), the usability of the overall system depends upon its weakest link (or, in this case, its most power-hungry link). Cell phone design has evolved to optimize battery life and power consumption, but this typically assumes far less data-intensive applications and less data communication. Early versions of this system sent raw data from the sensor platform to the phone (a power-greedy use of Bluetooth). After migrating the inference algorithms to the embedded MSP platform, we could send inferred activity data and timestamps, and we did some data packet optimization to reduce the amount of Bluetooth communication required, thereby improving the power performance to about 8 hours. However, for smart services and applications that are expected to run in real-world scenarios, this 8-hour limit results in an almost unusable system. We initially gave participants in our studies two chargers (one for work and one for home) to facilitate mid-day recharging, but this is not a realistic expectation. While this works well for pilot testing and research, real-world deployments cannot rely on 8-hour recharging strategies. Realistically these technologies must run for between 12 and 16 hours (i.e., during waking hours) with overnight recharges. We went through a number of code optimizations, altered data packet sizes, and ultimately built a second version of the MSP that could hold two cell phone batteries to attain a 12-hour battery life. In our experience, even with integrated sensor platforms we still believe that the data-intensive nature of the communications will have a significant impact on usability, since current devices have not been optimized for this. This is particularly an issue for intelligent systems that require real-time inference and continuously sensed data (even at low data rates). These are the types of problems that are inherent in field studies of early-stage, novel technologies. Despite the problems with the MSP prototypes, the two field studies provided invaluable insights into our UbiFit Garden system and on-body sensing and activity inference in general. See [5, 4] for more details.
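The store-and-forward behavior described above (keep sensing and classifying locally, buffer results while the Bluetooth link is down, and flush the backlog once the phone is reachable again) can be sketched roughly as follows. This Python sketch is an illustrative reconstruction under our own assumptions, not the MSP firmware; the send callback, queue bound, and record format are hypothetical.

import collections
import time

# Hypothetical sketch of an MSP-side store-and-forward loop: inference records
# are buffered locally whenever the Bluetooth link is down and flushed once the
# link is available again.
class InferenceUplink:
    def __init__(self, send_fn, max_buffered=100000):
        self.send_fn = send_fn                                  # e.g., a Bluetooth send callback
        self.backlog = collections.deque(maxlen=max_buffered)   # bounded local store

    def push(self, timestamp, likelihoods):
        """Queue one inference record (activity -> likelihood) for delivery."""
        self.backlog.append((timestamp, likelihoods))
        self.flush()

    def flush(self):
        """Try to deliver everything buffered; stop at the first failure."""
        while self.backlog:
            record = self.backlog[0]
            try:
                self.send_fn(record)            # raises if the link is down
            except ConnectionError:
                return                          # keep the data; retry on the next push
            self.backlog.popleft()

# Example: a flaky link that drops every third transmission attempt.
sent = []
def flaky_send(record, _state={"n": 0}):
    _state["n"] += 1
    if _state["n"] % 3 == 0:
        raise ConnectionError("Bluetooth out of range")
    sent.append(record)

uplink = InferenceUplink(flaky_send)
for t in range(10):
    uplink.push(time.time() + t, {"walking": 0.8, "cycling": 0.1})
print(len(sent), "records delivered,", len(uplink.backlog), "still buffered")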
3.1.3 Accuracy and Generalizability

We wanted our system to be based on models general enough that it would work for most people "out of the box," without users having to supply personalized training data. In previous work, the MSP has been shown to detect physical activities with about 85% accuracy [9]. For the field trials mentioned in the previous section, we retrained and tuned the activity inference models to increase the accuracy of detecting walking and sitting, using labeled data from 12 individuals of varying ages, heights, and genders (none were study participants). A wide range of participant types was chosen to ensure that our model parameters did not overfit to a specific sub-type of user. A Naïve Bayes model took several samples of the output from the boosted decision stump classifier to produce the final classification output. This resulted in a smoother temporal classification and was simpler to implement on the embedded device, with results comparable to an HMM-based model (e.g., [9]). In trying to iterate and improve system accuracy, we found it helpful to take each target activity (for instance, walking) and create a list of as many sub-classes of the target activity as we could. This enabled us to label our training data more precisely by sub-class first (e.g., walking uphill, walking downhill, walking fast, walking leisurely, walking in high heels, etc.); these sub-classes were then merged into the higher-level activity of walking. Even with variations in personal interpretations of what "walking fast" actually means, we found this strategy produced better models while requiring much shorter training data samples. The value of these cascading two-level models is best illustrated by the issues we had classifying cycling when we used a single-level model. Our initial training data samples were of "bike rides" (one label), which comprised very different sub-classes (i.e., coasting without pedaling, stopping and standing at a street light for a moment, standing out of the seat to pedal uphill, etc.). We did not separate out these sub-classes, and therefore some instances of biking appeared to be confused with sitting or standing. We believe that using a two-level model with sub-classes of training data more precisely labeled would have reduced this ambiguity; this is not yet implemented in our system.
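The two-stage classification described above (noisy per-frame predictions from boosted decision stumps, smoothed by combining several consecutive outputs under a Naïve-Bayes-style independence assumption) can be illustrated with the following scikit-learn sketch. It is not the embedded MSP implementation: the synthetic feature vectors stand in for accelerometer and barometer features, and the window length is an arbitrary choice.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier  # default base learner is a decision stump

rng = np.random.default_rng(0)
ACTIVITIES = ["walking", "running", "cycling", "elliptical", "stairs"]

# Synthetic training data: five activity classes in a 20-dimensional feature space.
X_train = rng.normal(size=(1000, 20)) + np.repeat(np.arange(5), 200)[:, None]
y_train = np.repeat(np.arange(5), 200)

stumps = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

def smooth_window(frame_features, window=8):
    """Combine per-frame class probabilities over a sliding window by summing
    log-probabilities (a Naive-Bayes-style independence assumption)."""
    probs = stumps.predict_proba(frame_features)          # (n_frames, n_classes)
    log_probs = np.log(np.clip(probs, 1e-9, 1.0))
    smoothed = []
    for i in range(len(log_probs)):
        chunk = log_probs[max(0, i - window + 1): i + 1]
        smoothed.append(int(np.argmax(chunk.sum(axis=0))))
    return [ACTIVITIES[k] for k in smoothed]

test_frames = rng.normal(size=(20, 20)) + 2.0  # frames resembling class 2 ("cycling")
print(smooth_window(test_frames)[-5:])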
3.2 Adaptability

Progressively more applications reflect some degree of end-user adaptation over time as devices and systems become more personalized and more intelligent about user context. Recent menu item selections may appear immediately, while seldom-used menu items remain temporarily hidden unless the user makes a more deliberate and prolonged open-menu request. Recent documents, links, phone numbers called, or locations visited can be short-listed for convenience. In this section, we briefly discuss some of the issues we encountered in trying to create a flexible and adaptive system based on user context and real-world needs.
3.2.1 Facilitating a Flexible Activity Log

Our system was designed to automatically track six different physical activities; however, people's daily routines often include other forms of activity that they would also like to account for but for which we have no models (e.g., swimming). While we could continue building and adding new models for each new type of activity, it is more immediately practical to allow users to manually journal activities that are out of scope of what we can automatically detect. This enables us to create a single time-stamped database for any physical activity, regardless of whether or not we have automated models for it. When users enter these new activities, they can incorporate them into the glanceable display and "get credit" for them as part of a single user experience. This combination of manual data entry and automatic journaling seemed to give users the feeling that the overall system was more adaptable to their needs. In addition, because this was a wearable system and the feedback was incorporated into the users' own cell phone background screen, the system covered both planned physical workouts (e.g., going for a 30-minute run) and unplanned physical activities (e.g., walking to a lunch restaurant). This flexibility meant that users were more motivated to carry the device all the time in order to get credit for all the walking and incidental physical activity they did. Finally, we included a deliberately ambiguous "active" item for cases when the system detected some form of physical activity but confidence margins were too low to speculate upon an exact activity type. In this way, the system could provide a time-stamped indicator to the user that something had been detected (rather than leaving out the item entirely). For our 3-week field trial, half of the participants received a generic "active" flower for these events; they had the option either to edit these items and provide more precise labels or to leave them as generic "active" flowers. The other half received a questionnaire on their cell phones asking them if they had just performed an activity that should be added to their journal. Based on experiences in the 3-week field trial, we employed only the latter strategy (i.e., the questionnaire) in the 3-month field trial. Again, this reinforced the notion that the system was adaptable to varying degrees of uncertainty and communicated this uncertainty to the end user in a reasonable and interpretable way.
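A unified journal of the kind described above might be represented as in the following sketch. The record fields and example entries are hypothetical and purely illustrative; they are not the schema of the deployed UbiFit Garden system.

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class JournalEntry:
    start: datetime
    duration: timedelta
    label: str                  # e.g., "walking", "swimming", or the generic "active"
    source: str                 # "inferred" or "manual"
    confidence: Optional[float] = None   # only meaningful for inferred entries

# Inferred, low-confidence ("active"), and manually journaled activities share
# one time-stamped record format, so the glanceable display can credit them all.
journal = [
    JournalEntry(datetime(2008, 3, 4, 8, 10), timedelta(minutes=18), "walking", "inferred", 0.92),
    JournalEntry(datetime(2008, 3, 4, 12, 30), timedelta(minutes=11), "active", "inferred", 0.55),
    JournalEntry(datetime(2008, 3, 4, 19, 0), timedelta(minutes=45), "swimming", "manual"),
]

def relabel(entry: JournalEntry, new_label: str) -> None:
    """Let the user replace a generic or misclassified label with a precise one."""
    entry.label = new_label
    entry.source = "manual"

relabel(journal[1], "gardening")
print([(e.label, e.source) for e in journal])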
3.2.2 Improving Accuracy During Deployment

Semi-supervised and active-learning techniques can be particularly useful for personalizing activity models to a specific user's behavior. The system can adapt the model parameters according to the user's unlabeled data, even after the system is deployed. We have seen significant performance improvement using semi-supervised methods [13]. (For the deployment system described above, this research was still underway and thus these methods have not yet been incorporated into the embedded system.) Based on the results thus far, we believe that active learning
methods can be used to query the users for new labels that could further improve the performance of the recognition module.
3.2.3 Flexible Application-Specific Heuristics

As we described earlier, we used cascading models to predict the moment-to-moment physical activity of the user wearing the MSP device. However, for most intelligent systems, the final application goal will determine the precision, data types, and sampling rates necessary to infer meaningful contexts. We created a number of software-configurable parameters for application developers that would allow us to improve accuracy while preserving application-meaningful activity distinctions. Even though more sophisticated modeling techniques could be used to group activity episodes, we found from our studies that the notion of an episode, and how large a gap in activity should be tolerated, is very subjective. So we developed a more easily configurable mechanism to facilitate this. In fact, the parameters are read in from a simple text file, so, in theory, a knowledgeable end user could set them. While we sample and infer activity four times per second, for detecting physical activities the second-to-second variations over the course of a day clearly do not tend to map meaningfully to what users might consider "episodes". In fact, if the system indicates a 2-second bike ride preceded and followed by 5 minutes of walking, it is almost certain that this resulted from noise in the data rather than an actual 2-second bike ride. The duration, or the number of confidence margins to be considered as a group, can be set. In an application such as the one described earlier, we typically group the activity inferences and their associated confidence margins. Within this group, we provide the option to set a threshold that indicates the number of items that must be above a specified confidence margin in order to successfully categorize that activity. For instance, 18 of 20 activities must have the same label and be above a confidence margin of 80%. In this way, we can use one general platform for a variety of applications and easily configure it to minimize both noise in the data and the amount of precision required. Finally, we have a "human scale" activity smoothing parameter which lets us easily specify how long an activity needs to be sustained to yield a valid "episode" and what permissible gaps can occur within this episode. For instance, if the application goal is to track sustained physical activity, we might want a minimum duration of 10 minutes per activity type, but we may allow a gap of up to 60 seconds to accommodate things like jogging, pausing at a street light, and then resuming the same jogging activity. This layered approach to data smoothing, and the ability for either application developers or even end users to tune and configure these parameters, is a critical element if the system is to be flexible enough to support more than one potential target application.
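The layered, parameter-driven smoothing described in this section could look roughly like the following sketch. The parameter names and the in-memory configuration stand in for the simple text file mentioned above and are our own assumptions; the example values (18 of 20 inferences above an 80% confidence margin, 10-minute minimum episodes, 60-second permissible gaps, inference at 4 Hz) come from the text.

# Hypothetical configuration; in the deployed system these values are read
# from a simple text file.
CONFIG = {
    "group_size": 20,          # inferences per group (4 Hz -> 5 seconds of data)
    "group_min_agree": 18,     # inferences that must share the winning label
    "group_min_conf": 0.80,    # confidence threshold for those inferences
    "min_episode_s": 600,      # minimum sustained duration for a valid episode
    "max_gap_s": 60,           # permissible gap inside an episode
}

def label_group(inferences, cfg=CONFIG):
    """inferences: list of (label, confidence) pairs. Return a label or None."""
    confident = [lab for lab, conf in inferences if conf >= cfg["group_min_conf"]]
    if not confident:
        return None
    top = max(set(confident), key=confident.count)
    return top if confident.count(top) >= cfg["group_min_agree"] else None

def episodes(group_labels, seconds_per_group=5, cfg=CONFIG):
    """Merge consecutive group labels into human-scale episodes."""
    out, current, gap = [], None, 0

    def close():
        nonlocal current, gap
        if current and current["dur"] >= cfg["min_episode_s"]:
            out.append(current)
        current, gap = None, 0

    for lab in group_labels:
        if current and lab == current["label"]:
            current["dur"] += seconds_per_group + gap
            gap = 0
        elif current and lab is None and gap + seconds_per_group <= cfg["max_gap_s"]:
            gap += seconds_per_group       # tolerate a short break (e.g., a traffic light)
        else:
            close()
            if lab is not None:
                current = {"label": lab, "dur": seconds_per_group}
    close()
    return out

# Example: about 13 minutes of walking with a 50-second pause in the middle.
groups = ["walking"] * 70 + [None] * 10 + ["walking"] * 80
print(episodes(groups))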
3.3 Credibility

A user's level of confidence in the system is influenced by the reliability of the technology and by the accuracy of the inferred activities. This is especially noticeable to end users if the activity inference results are visible in real time. At the heart of most inference-based systems, the inferred activities (or locations, contexts, or identities of users) are probabilistic in nature. This uncertainty is often not reflected in the user interface. Such systems can also introduce false positives (where events of interest did not really occur) or false negatives (where events did occur but were not reported). The frequency of such errors, the ability to correct them, and the ability of users to diagnose or explain these errors are critical in establishing a credible system. We propose that intelligent systems may inherently require a different style of user interface design that reflects these characteristic ambiguities.
3.3.1 Correcting or Cheating

One of the current debates in automatically logging physical exercise data is whether or not users should be allowed to edit the resulting data files. Since these data sets are often shared as part of a larger multi-user system, a number of commercial systems do not allow users to add or edit items, presumably in the interest of preserving data integrity and reducing the ability to cheat. However, our experience indicates that users are more often frustrated when they cannot correct erroneous data or add missing data. We made a deliberate decision in our system to allow users to add new items, delete items, or change the labels on automatically detected items (in the event of misclassification). However, we limited the time span during which users could alter data to the last 2 days of history only. This seemed a reasonable compromise: it allows users to manually improve the data accuracy and completeness (thereby improving their perceptions of system credibility) while limiting edits to a time span in which they would have more accurate recall of their actual activities. Additionally, the system was not used in the context of a multi-user application (i.e., there were no workout "buddies" or competition/collaboration mechanisms built in), so users had less of an inclination to cheat.
3.3.2 Use of Ambiguity in User Interfaces

Using ambiguity in the graphic representations and/or in the "labels" presented to users can help maintain system credibility when algorithm confidence margins for any one activity are low and hence uncertainty is high. For instance, we provided users with feedback using a generic "active" label (and a special flower) when some form of physical activity was detected but the algorithm could not determine specifically which type of activity it was. Rather than incorrectly selecting a lower-probability activity or presenting no feedback at all, this generic but higher-level
activity class (active versus inactive or sedentary) indicated that the system had detected some user movement. As a consequence, users reported feeling better that the system had at least detected something, and they could later edit this vague activity to make it more precise (or not). This was preferred over not indicating any activity. We have additionally considered an option to display the top n activities (based on confidence margins), where users can pick from the list or else indicate "other" and specify the activity if it is not in the list. This feature was not implemented in the system used in our reported field trials. Previous experience with mobile phone user interfaces [e.g., Froehlich et al, 2006] suggests that such lists would need to fit within one screen (i.e., fewer than 10 items). This remains an area to be explored, weighing user effort against data accuracy.
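As a purely illustrative sketch (the text notes this option was not implemented in the deployed system), selecting the top-n candidates by confidence and appending an "other" choice, while capping the list so it fits on a single phone screen, might look like this:

def candidate_list(likelihoods, n=4, max_items=9):
    """Return the n most likely activity labels plus an 'other' option,
    keeping the list short enough to fit on one phone screen."""
    ranked = sorted(likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    shown = [label for label, _ in ranked[: min(n, max_items - 1)]]
    return shown + ["other"]

print(candidate_list({"walking": 0.46, "cycling": 0.31, "stairs": 0.12,
                      "running": 0.07, "elliptical": 0.04}))
# -> ['walking', 'cycling', 'stairs', 'running', 'other']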
3.3.3 Learning from User Corrections

We anticipate that using an initial model and then learning from user corrections will impact overall system credibility. Users experience frustration if they must repeatedly correct the same types of errors. By applying active learning techniques, we can take advantage of the additional data supplied by these user corrections to personalize the models of physical activity to better match those of a particular user. Over time, such corrections should become less frequent, thereby improving the perceived accuracy and credibility of the system. Such personalization assumes these systems are not multi-user or shared devices, unless the models can be distinguished per user based on some other criterion, for instance login information or biometrics. Our experiences suggest that when systems appear to be "good" at automatically categorizing some types of activities, users may assume the system has the ability to learn to be "smarter" in other situations or for other categories. Thus, there is an expectation that error corrections will improve system accuracy over time even when users are told there is no learning capability.
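One simple way to fold user corrections back into a per-user model is sketched below using scikit-learn's incrementally trainable linear classifier. This is our own illustrative stand-in, not the active-learning or semi-supervised CRF techniques the authors refer to; the feature vectors, class count, and the idea of up-weighting corrected examples are all assumptions.

import numpy as np
from sklearn.linear_model import SGDClassifier

CLASSES = np.arange(5)                      # five trained activity classes
model = SGDClassifier(random_state=0)       # incrementally trainable linear classifier

# Bootstrap on generic (population-level) training data.
rng = np.random.default_rng(1)
X0 = rng.normal(size=(500, 20)) + np.repeat(CLASSES, 100)[:, None]
y0 = np.repeat(CLASSES, 100)
model.partial_fit(X0, y0, classes=CLASSES)

def apply_correction(features, corrected_label, weight=5):
    """When the user relabels a misclassified episode, repeat that example so
    the personalized model shifts toward this particular user's data."""
    X = np.tile(features, (weight, 1))
    y = np.full(weight, corrected_label)
    model.partial_fit(X, y)

# Example: the user corrects an episode that the generic model got wrong.
episode_features = rng.normal(size=(1, 20)) + 3.0
apply_correction(episode_features, corrected_label=3)
print(model.predict(episode_features))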
4 Conclusions

The task of recognizing activities from wearable sensors has received considerable attention in recent years. There is also a growing demand for activity recognition in a wide variety of health care applications. This chapter provides an overview of a few key design principles that we believe are important for the successful adoption of mobile activity recognition systems. Two real-world field trials allowed us to systematically evaluate the effectiveness of the different design decisions. Our findings suggest that the main design challenges are related to the usability, adaptability, and credibility of the system being deployed. Furthermore, we believe that some of the design considerations are applicable to scenarios beyond mobile activity tracking
and will be relevant for designing any context-aware system that provides real-time user feedback and relies on continuous sensing and classification.

Acknowledgements Thanks to Gaetano Borriello, Mike Chen, Cherie Collins, Kieran Del Pasqua, Jon Froehlich, Pedja Klasnja, Anthony LaMarca, James Landay, Louis LeGrand, Jonathan Lester, Ryan Libby, David McDonald, Keith Mosher, Adam Rea, Ian Smith, Tammy Toscos, David Wetherall, Alex Wilkie, and many other family, friends, colleagues, and study participants who have contributed to this work.
References

[1] Bao, L. & Intille, S.S., Activity Recognition from User-Annotated Acceleration Data, In Proceedings of Pervasive '04, (2004), 1-17.
[2] Choudhury, T., Borriello, G., Consolvo, S., Haehnel, D., Harrison, B., Hemingway, B., Hightower, J., Klasnja, P., Koscher, K., LaMarca, A., Landay, J.A., LeGrand, L., Lester, J., Rahimi, A., Rea, A., & Wyatt, D. (Apr-Jun 2008). The Mobile Sensing Platform: An Embedded Activity Recognition System, IEEE Pervasive Computing Magazine Special Issue on Activity-Based Computing, 7(2), 32-41.
[3] Consolvo, S., Everitt, K., Smith, I., & Landay, J.A. (2006). Design Requirements for Technologies that Encourage Physical Activity, Proceedings of the Conference on Human Factors and Computing Systems: CHI 2006, (Montreal, Canada), New York, NY, USA: ACM Press, 457-66.
[4] Consolvo, S., Klasnja, P., McDonald, D., Avrahami, D., Froehlich, J., LeGrand, L., Libby, R., Mosher, K., & Landay, J.A. (Sept 2008). Flowers or a Robot Army? Encouraging Awareness & Activity with Personal, Mobile Displays, Proceedings of the 10th International Conference on Ubiquitous Computing: UbiComp 2008, (Seoul, Korea), New York, NY, USA: ACM Press.
[5] Consolvo, S., McDonald, D.W., Toscos, T., Chen, M.Y., Froehlich, J., Harrison, B., Klasnja, P., LaMarca, A., LeGrand, L., Libby, R., Smith, I., & Landay, J.A. (Apr 2008). Activity Sensing in the Wild: A Field Trial of UbiFit Garden, Proceedings of the Conference on Human Factors and Computing Systems: CHI 2008, (Florence, Italy), New York, NY, USA: ACM Press, 1797-806.
[6] Fishkin, K.P., Jiang, B., Philipose, M., & Roy, S. I Sense a Disturbance in the Force: Long-range Detection of Interactions with RFID-tagged Objects. Proceedings of the 6th International Conference on Ubiquitous Computing: UbiComp 2004, pp. 268-282.
[7] Fogg, B.J. (2003). Persuasive Technology: Using Computers to Change What We Think and Do. San Francisco, CA, USA: Morgan Kaufmann Publishers.
[8] Hightower, J., Consolvo, S., LaMarca, A., Smith, I.E., & Hughes, J. Learning and Recognizing the Places We Go. Proceedings of the 7th International Conference on Ubiquitous Computing: UbiComp 2005: 159-176.
[9] Lester, J., Choudhury, T., & Borriello, G. (2006). A Practical Approach to Recognizing Physical Activities, In Proceedings of the 4th International Conference on Pervasive Computing: Pervasive '06, Dublin, Ireland, 1-16.
[10] Lin, J.J., Mamykina, L., Lindtner, S., Delajoux, G., & Strub, H.B. (2006). Fish'n'Steps: Encouraging Physical Activity with an Interactive Computer Game, In Proceedings of the 8th International Conference on Ubiquitous Computing: UbiComp '06, Orange County, CA, USA, 261-78.
[11] Logan, B., Healey, J., Philipose, M., Munguia-Tapia, E., & Intille, S. Long-Term Evaluation of Sensing Modalities for Activity Recognition. In Proceedings of UbiComp 2007, Innsbruck, Austria, September 2007.
[12] Maitland, J., Sherwood, S., Barkhuus, L., Anderson, I., Hall, M., Brown, B., Chalmers, M., & Muller, H. Increasing the Awareness of Daily Activity Levels with Pervasive Computing, Proceedings of the 1st International Conference on Pervasive Computing Technologies for Healthcare 2006: Pervasive Health '06, Innsbruck, Austria.
[13] Mahdaviani, M. & Choudhury, T. Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition. In Proceedings of NIPS 2007, December 2007.
[14] Mueller, F., Agamanolis, S., Gibbs, M.R., & Vetere, F. (Apr 2008). Remote Impact: Shadowboxing over a Distance, CHI '08 Extended Abstracts on Human Factors in Computing Systems, Florence, Italy, 2291-6.
[15] Mueller, F., Agamanolis, S., & Picard, R. (Apr 2003). Exertion Interfaces: Sports Over a Distance for Social Bonding and Fun, Proceedings of CHI '03, pp. 561-8.
[16] Mueller, F., O'Brien, S., & Thorogood, A. (Apr 2007). Jogging Over a Distance, In CHI '07 Extended Abstracts, 1989-94.
[17] Oliver, N. & Flores-Mangas, F. (Sep 2006). MPTrain: A Mobile, Music and Physiology-Based Personal Trainer, In Proceedings of MobileHCI '06.
[18] Schapire, R.E., A Brief Introduction to Boosting, Proc. 16th Int'l Joint Conf. Artificial Intelligence (IJCAI 99), Morgan Kaufmann, 1999, pp. 1401-1406.
[19] Smith, J.R., Fishkin, K., Jiang, B., Mamishev, A., Philipose, M., Rea, A., Roy, S., & Sundara-Rajan, K., RFID-Based Techniques for Human Activity Recognition. Communications of the ACM, v48, no. 9, Sep 2005.
[20] Sohn, T., et al., Mobility Detection Using Everyday GSM Traces, In Proceedings of the 8th International Conference on Ubiquitous Computing: UbiComp 2006, Orange County, CA, USA.
[21] Sohn, T., Griswold, W., Scott, J., LaMarca, A., Chawathe, Y., Smith, I.E., & Chen, M.Y., Experiences with Place Lab: An Open Source Toolkit for Location-Aware Computing. ICSE 2006: 462-476.
[22] Viola, P. & Jones, M.: Rapid Object Detection using a Boosted Cascade of Simple Features. In: Proc. Computer Vision and Pattern Recognition (2001).
Recognizing Facial Expressions Automatically from Video
Caifeng Shan and Ralph Braspenning
1 Introduction

Facial expressions, resulting from movements of the facial muscles, are the facial changes that occur in response to a person's internal emotional states, intentions, or social communications. There is a considerable history associated with the study of facial expressions. Darwin [22] was the first to describe in detail the specific facial expressions associated with emotions in animals and humans, arguing that all mammals show emotions reliably in their faces. Since then, facial expression analysis has been an area of great research interest for behavioral scientists [27]. Psychological studies [48, 3] suggest that facial expressions, as the main mode of nonverbal communication, play a vital role in human face-to-face communication. For illustration, we show some examples of facial expressions in Fig. 1. Computer recognition of facial expressions has many important applications in intelligent human-computer interaction, computer animation, surveillance and security, medical diagnosis, law enforcement, and awareness systems [68]. Therefore, it has been an active research topic in multiple disciplines such as psychology, cognitive science, human-computer interaction, and pattern recognition. Meanwhile, as a promising unobtrusive solution, automatic facial expression analysis from video or images has received much attention in the last two decades [57, 29, 84, 55]. This chapter introduces recent advances in computer recognition of facial expressions. Firstly, we describe the problem space, which includes multiple dimensions: level of description, static versus dynamic expression, facial feature extraction and representation, expression subspace learning, facial expression recognition, posed versus spontaneous expression, expression intensity, controlled versus uncontrolled data acquisition, correlation with bodily expression, and databases.

Caifeng Shan
Philips Research, Eindhoven, The Netherlands, e-mail: [email protected]
Ralph Braspenning
Philips Research, Eindhoven, The Netherlands, e-mail: [email protected]
Fig. 1 Facial expressions of Tony Blair.
Meanwhile, the state of the art of facial expression recognition is surveyed along these different aspects. In the second part, we present our recent work on recognizing facial expressions using discriminative local statistical features; specifically, Local Binary Patterns are investigated for facial representation. Finally, the current status, open problems, and research directions are discussed. The remainder of this chapter is organized as follows. We first review the problem space and the state of the art in Section 2. Our recent work on facial expression recognition using Local Binary Patterns is then described in Section 3. Finally, Section 4 concludes the chapter with a discussion of challenges and future opportunities.
2 Problem Space and State of the Art

2.1 Level of Description

Facial expressions can be described at different levels [84]. Two mainstream description methods are facial affect (emotion) and facial muscle action (action unit) [55]. Psychologists suggest that some basic emotions are universally displayed and recognized from facial expressions [25], and the most commonly used descriptors are the six basic emotions, which include anger, disgust, fear, joy, surprise, and sadness (see Fig. 2 for examples). This is also reflected by the research on automatic facial expression analysis; most facial expression analysis systems developed so far target facial affect analysis and attempt to recognize a set of prototypic emotional facial expressions [57, 59]. There have also been some tentative efforts to detect cognitive and psychological states such as interest [38], fatigue [32], and pain [45]. To describe subtle facial changes, the Facial Action Coding System (FACS) [27] has
been widely used for manual labeling of facial actions in behavioral science. FACS associates facial changes with the actions of the muscles that produce them. See Fig. 3 for some examples of action units (AUs). It is possible to map AUs onto the basic emotions using a finite number of rules [27]. Automatic AU detection has been studied recently [23, 82, 60, 95, 44]. The major problem of AU-related research is the need for highly trained experts to manually perform FACS coding frame by frame. Approximately 300 hours of training are required to achieve minimal competency in FACS, and each minute of video takes around two hours to code comprehensively [12]. Another possible description is in terms of the bipolar dimensions of Valence and Arousal [65]. Valence describes the pleasantness, with positive (pleasant) on one end (e.g., happiness) and negative (unpleasant) on the other (e.g., disgust). The other dimension is arousal or activation: for example, sadness has low arousal, whereas surprise has a high arousal level. Different emotional labels can be plotted at various positions on a two-dimensional plane spanned by these two axes (as shown in Fig. 4).
Fig. 2 Prototypic emotional facial expressions: Anger, Disgust, Fear, Joy, Sadness, and Surprise (from left to right). From the Cohn-Kanade database [39].
Fig. 3 Examples of facial action units and their combination defined in FACS [55].
Fig. 4 Facial expressions labeled with Valence and Arousal [21].
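As a small illustration of this dimensional description, the sketch below places the six basic emotions at rough valence-arousal coordinates and maps a continuous estimate to the nearest label. The coordinate values are approximate assumptions for illustration only; they are not taken from [65] or from Fig. 4.

# Approximate, illustrative placements of the six basic emotions on the
# valence-arousal plane (coordinates in [-1, 1] are rough assumptions).
EMOTION_VA = {
    "joy":      (+0.8, +0.5),
    "surprise": ( 0.0, +0.8),
    "anger":    (-0.6, +0.7),
    "fear":     (-0.7, +0.6),
    "disgust":  (-0.7, +0.3),
    "sadness":  (-0.7, -0.4),
}

def nearest_emotion(valence, arousal):
    """Map a continuous (valence, arousal) estimate to the closest basic emotion."""
    return min(EMOTION_VA, key=lambda e: (EMOTION_VA[e][0] - valence) ** 2
                                         + (EMOTION_VA[e][1] - arousal) ** 2)

print(nearest_emotion(0.6, 0.4))   # -> 'joy'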
2.2 Static Versus Dynamic Expression

Facial expressions can be detected and recognized in a single snapshot, so many existing works attempt to analyze facial expressions in each image [46, 23, 60, 44]. For example, Cohen et al [16] adopted Bayesian network classifiers to classify each frame of a video sequence as one of the basic emotional expressions. However, psychological experiments [10] suggest that the dynamics of facial expressions are crucial for their successful interpretation. The differences between facial expressions are often conveyed more powerfully by dynamic transitions between different stages of an expression than by any single state represented by a still image. This is especially true for spontaneous facial expressions without any deliberate exaggerated posing [2]. Therefore, capturing and analyzing the temporal dynamics of facial expressions in videos or image sequences is important for facial expression analysis. Recently many approaches have been introduced to model dynamic facial expressions in videos or image sequences [53, 95, 56]. For example, Cohen et al [16] presented a multi-level Hidden Markov Model (HMM) classifier to model the temporal behaviors exhibited by facial expressions, which not only performs expression classification in a video segment, but also automatically segments an arbitrarily long video sequence into different expression segments. Facial expression dynamics can also be captured in low-dimensional manifolds embedded in the input image space [15]. Shan et al [70, 75] proposed to model facial expression dynamics by discovering the underlying low-dimensional manifold. An example of expression manifold learning is shown in Fig. 5. Lee and Elgammal [42] recently introduced a framework
to learn decomposable generative models for the dynamic appearance of facial expressions, where facial motion is constrained to one-dimensional closed manifolds.
Fig. 5 (Best viewed in color) Six image sequences of basic expressions of a person are mapped into the 3D embedding space [70]. Representative faces are shown next to circled points in different parts of the space. Different expressions are color coded as: Anger (red), Disgust (yellow), Fear (blue), Joy (magenta), Sadness (cyan), and Surprise (green).
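The idea of embedding expression image sequences into a low-dimensional space can be illustrated generically with scikit-learn's Isomap, as in the sketch below. The synthetic frames stand in for vectorized face images, and this is not the specific manifold learning used in [70, 75] or [42].

import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
n_frames, img_dim = 120, 400          # e.g., 20x20 face crops flattened

# Simulate a neutral-to-apex-to-neutral expression trajectory plus pixel noise.
t = np.linspace(0, np.pi, n_frames)
trajectory = np.sin(t)[:, None] * rng.normal(size=(1, img_dim))
frames = trajectory + 0.05 * rng.normal(size=(n_frames, img_dim))

# Embed the sequence of high-dimensional frames into a 3D manifold coordinate space.
embedding = Isomap(n_neighbors=10, n_components=3).fit_transform(frames)
print(embedding.shape)                # -> (120, 3)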
2.3 Facial Feature Extraction and Representation

A vital step for successful facial expression analysis is deriving an effective facial representation from the original face images. Two types of features [84], geometric features and appearance features, are usually considered for facial representation. Geometric features capture the shape and locations of facial components (such as the mouth, eyes, brows, and nose), which are extracted to represent the face geometry. Appearance features represent the appearance changes (skin texture) of the face (such as wrinkles, bulges, and furrows), which are extracted by applying image filters to either the whole face or specific facial regions. The geometric-feature-based approaches commonly require accurate and reliable facial feature detection and tracking, which is difficult to accommodate in real-world unconstrained scenarios, e.g.,
under head pose variation. In contrast, appearance features suffer less from initialization and tracking errors, and can encode changes in skin texture that are critical for modeling. However, most of the existing appearance-based facial representations still require face registration based on facial feature detection, e.g., eye detection.
Fig. 6 Geometric features [96]: 34 fiducial points for representing the facial geometry.
Fiducial facial feature points have been widely used in facial representation. As shown in Fig. 6, Zhang et al [96] utilized the geometric positions of 34 fiducial points as facial features. A shape model defined by 58 facial landmarks was adopted in [14, 15]. Pantic and her colleagues [60, 56, 87] used a set of facial characteristic points to describe facial actions (as shown in Fig. 7). Tian et al [83] considered both location features and shape features for facial representation. Specifically, as shown in Fig. 8, six location features (eye centers, eyebrow inner endpoints, and corners of the mouth) are extracted and transformed into 5 parameters, and the mouth shape features are computed from zonal shape histograms of the edges in the mouth region. In image sequences, facial movements can be measured by the geometrical displacement of facial feature points between the current frame and the initial frame [58, 38]. Tian et al [82] developed multi-state facial component models to track and extract the geometric facial features. Cohen et al [16] referred to the tracked motions of facial features (such as the eyebrows, eyelids, and mouth) at each frame as Motion-Units (MUs) (as shown in Fig. 9). The MUs represent not only the activation of a facial region, but also the direction and intensity of the motion. Optical flow analysis has also been used to model muscle activities or to estimate the displacements of feature points [90, 28, 91]. Although it is effective to obtain facial motion information by computing dense flow between successive image frames, flow estimation has its disadvantages, such as being easily disturbed by lighting variation and non-rigid motion, and being sensitive to inaccuracies in image registration and to motion discontinuities [95]. Most of the existing appearance-based methods have adopted Gabor-wavelet features [46, 35, 44, 85]. Gabor filters are obtained by modulating a 2D sine wave with a Gaussian envelope. Representations based on the outputs of Gabor filters at
Fig. 7 The facial points used to describe facial actions [55].
Fig. 8 Geometric features [83]: (Left) location features; (Right) normalized face and zones of the edge map of the normalized face.
Fig. 9 The Motion-Units introduced in [16].
multiple spatial scales, orientations, and locations have proven successful for facial image analysis. Zhang et al [96] computed a set of Gabor-filter coefficients at fiducial points (as shown in Fig. 10). Lyons et al [46] represented each face using a set of Gabor filters at facial feature points sampled from a sparse grid covering the face. Tian [80] compared geometric features and Gabor-wavelet features at different image resolutions, and her experiments show that Gabor-wavelet features work better for low-resolution face images. Although widely adopted, it is computationally expensive to convolve face images with a set of Gabor filters, and it is inefficient in both time and memory due to the high redundancy of Gabor-wavelet features. For example, in [9], the Gabor-wavelet representation derived from each 48×48 face image has a high dimensionality, on the order of 10^5. Another kind of widely used appearance feature is statistical subspace analysis, which will be discussed in Section 2.4.
Fig. 10 Gabor-wavelet representation [96]: two examples with three Gabor kernels.
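A self-contained sketch of this kind of Gabor-wavelet representation is given below: a small filter bank (two scales, four orientations) whose real-part responses are sampled at a few fiducial points. The kernel parameters and the fiducial coordinates are illustrative assumptions, not the settings of [96] or [46].

import numpy as np

def gabor_kernel(size, sigma, theta, lambd, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor filter: a sine carrier under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * x_t / lambd + psi)

def gabor_features(image, points, scales=(4.0, 8.0), orientations=4):
    """Sample the response of every kernel in the bank at each fiducial point."""
    feats = []
    for sigma in scales:
        for k in range(orientations):
            kern = gabor_kernel(size=int(6 * sigma) | 1, sigma=sigma,
                                theta=k * np.pi / orientations, lambd=2.0 * sigma)
            half = kern.shape[0] // 2
            for (r, c) in points:
                patch = image[r - half:r + half + 1, c - half:c + half + 1]
                feats.append(float(np.abs(np.sum(patch * kern))))
    return np.array(feats)

face = np.random.default_rng(0).random((128, 128))       # stand-in for a face image
fiducials = [(40, 45), (40, 82), (80, 64), (100, 64)]     # e.g., eyes, nose tip, mouth
print(gabor_features(face, fiducials).shape)              # 2 scales x 4 orientations x 4 points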
The appearance-based approaches are believed to contain more information than those based on the relative positions of a finite set of facial features [7], and so to provide superior performance. For example, the Gabor-wavelet representation outperforms the performance upper bound computed based on manual feature tracking [44]. However, some recent studies indicate that this claim does not always hold. For example, the AU detection method based on tracked facial points [87] achieved similar or higher detection rates than other methods. It seems that using both geometric and appearance features might be the best choice [55]. Recently, feature selection methods have been exploited to select the most effective facial features. Guo and Dyer [35] introduced a linear programming technique that jointly performs feature
selection and classifier training so that a subset of features is optimally selected together with the classifier. Bartlett et al [9] selected a subset of Gabor filters using Adaboost.
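In the spirit of the boosting-based feature selection just described (rank features with Adaboost, then train a classifier on the selected subset), a scikit-learn sketch on synthetic data might look as follows. It is not the implementation of [35] or [9]; the synthetic features simply stand in for Gabor-filter outputs.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d, informative = 600, 200, 15
X = rng.normal(size=(n, d))
y = (X[:, :informative].sum(axis=1) > 0).astype(int)     # only a few features matter

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# AdaBoost (default base learner: a decision stump) ranks features by importance.
booster = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
selected = np.argsort(booster.feature_importances_)[::-1][:informative]

# Train an SVM on the selected subset only.
svm = SVC(kernel="linear").fit(X_tr[:, selected], y_tr)
print("selected features:", np.sort(selected)[:10], "...")
print("SVM accuracy on held-out data:", round(svm.score(X_te[:, selected], y_te), 3))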
2.4 Expression Subspace Learning

Appearance-based statistical subspace learning has been shown to be an effective approach for facial representation. This is because, although the facial image space is commonly of very high dimension, the underlying facial expression space is a sub-manifold of much lower dimensionality embedded in the ambient space. Subspace learning is a natural approach to this problem. Traditionally, linear subspace methods including Principal Component Analysis (PCA) [86], Linear Discriminant Analysis (LDA) [11], and Independent Component Analysis (ICA) [8] have been used to discover both facial identity and expression manifold structures. For example, Lyons et al [46] adopted PCA-based LDA with Gabor-wavelet features to classify facial images. Donato et al [23] explored PCA, LDA, ICA, and the Gabor-wavelet representation for facial action classification; the best performances were obtained using the Gabor-wavelet features and ICA. Recently, a number of graph-based linear subspace techniques have been proposed in the literature, e.g., Locality Preserving Projections (LPP) [36]. Shan et al [74] comprehensively investigated and evaluated these linear subspace methods for facial expression analysis. An example of a 2D embedding subspace of emotional facial expressions, derived using supervised LPP, is shown in Fig. 11. As facial muscles are contracted in unison to display facial expressions, different facial parts have strong correlations. Shan et al [77] further employed Canonical Correlation Analysis, a statistical technique that is well suited to relating two sets of signals, to model the correlations among facial parts.
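The classical linear pipeline mentioned above (PCA followed by LDA, in the spirit of [46]) can be sketched with scikit-learn as follows; LPP and the other graph-based methods are not part of scikit-learn and are omitted. The synthetic vectors stand in for Gabor-wavelet responses of expression images.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_per_class, n_classes, dim = 60, 6, 300       # six basic emotions
X = np.vstack([rng.normal(loc=c, scale=3.0, size=(n_per_class, dim))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

# PCA reduces dimensionality first; LDA then finds at most (n_classes - 1)
# discriminative directions, here used as a 2D embedding.
subspace = make_pipeline(PCA(n_components=50),
                         LinearDiscriminantAnalysis(n_components=2))
embedded = subspace.fit_transform(X, y)
print(embedded.shape)                           # -> (360, 2)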
2.5 Facial Expression Classification Different techniques have been proposed for facial expression recognition, such as Neural Networks [82], Support Vector Machines (SVM) [9], Bayesian Networks [16], HMMs [91], Dynamic Bayesian Networks [95], and rule-based classifiers [58]. Facial expression recognition methods can be divided into image-based and sequence-based approaches. The image-based approaches use features extracted from a single image to recognize the expression of that image, while the sequence-based methods aim to capture the temporal pattern in a sequence to recognize the expression for one or more images. Yacoob and Davis [90] used local motions of facial features to construct a mid-level description of facial motions, which was classified into one of six facial expressions using a set of heuristic rules. Using dual-view (frontal and profile) geometric features, Pantic and Rothkrantz [58] performed facial expression recognition
Fig. 11 (Best viewed in color) The 2D embedding subspace of emotional facial expressions [74]. Different expressions are color coded as: Anger (red), Disgust (yellow), Fear (blue), Joy (magenta), Sadness (cyan), and Surprise (green).
by comparing the AU-coded description of an observed expression against rule descriptors of basic emotions. Recently, they further adopted rule-based reasoning to recognize action units and their combinations [60]. Tian et al [82] used a three-layer Neural Network with one hidden layer, trained with standard back-propagation, to recognize AUs. Lyons et al [46] adopted a nearest-neighbor classifier to recognize facial images using discriminant Gabor-wavelet features. As a powerful discriminative machine learning technique, SVM has been widely adopted for facial expression recognition. More recently, Bartlett et al [9] compared Adaboost, SVM, and LDA; the best results were obtained by selecting a subset of Gabor filters using Adaboost and then training an SVM on the outputs of the selected filters. This strategy is also adopted in [85, 87]. For example, Valstar and Pantic [87] recognized AU temporal segments using a subset of the most informative spatio-temporal features selected by Adaboost. As one of the basic probabilistic tools for time-series modeling, the HMM has been exploited to capture the temporal behaviors exhibited by facial expressions. For example, Oliver et al [53] associated each of the mouth-based expressions, e.g. sad and smile, with an HMM trained using mouth features, and facial expressions were identified by computing the maximum likelihood of the input sequence with respect to all trained HMMs. Yeasin et al [91] presented a two-stage approach to classify the six basic emotions and derive the level of interest using psychological evidence. First, a bank of linear classifiers was applied at the frame level and the outputs were coalesced to produce a temporal signature for each observation. Second, temporal signatures computed from the training data set were used to train discrete HMMs to learn the
underlying models for each expression. Dynamic Bayesian Networks (DBNs) are graphical probabilistic models which encode dependencies among sets of random variables evolving in time. DBNs are capable of accounting for uncertainty in facial expression recognition, representing probabilistic relationships among different actions and modeling the dynamics in facial action development. Zhang and Ji [95] explored the multisensory information fusion technique with DBNs for modeling and understanding the temporal behaviors of facial expressions in image sequences. Kaliouby and Robinson [38] proposed a system for inferring complex mental states from videos of facial expressions and head gestures in real-time. Their system was built on a multi-level DBN classifier which models complex mental states as a number of interacting facial and head displays. Recently Tong et al [85] proposed to exploit the relationships among different AUs and their temporal dynamics using DBN.
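To make the per-expression HMM scheme described above concrete, here is a hedged sketch using the hmmlearn package: one Gaussian HMM per expression class is trained on per-frame feature sequences, and a new sequence is assigned to the class whose model gives the highest log-likelihood. The number of hidden states, the diagonal covariance, and the feature format are assumptions for illustration only.

```python
import numpy as np
from hmmlearn import hmm

def train_expression_hmms(sequences_by_class, n_states=3):
    # sequences_by_class: {label: list of (T_i, d) arrays of per-frame features}
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.concatenate(seqs)                  # stack all frames
        lengths = [len(s) for s in seqs]          # sequence boundaries
        m = hmm.GaussianHMM(n_components=n_states,
                            covariance_type='diag', n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify_sequence(seq, models):
    # Maximum-likelihood decision over the trained class models.
    return max(models, key=lambda label: models[label].score(seq))
```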
2.6 Spontaneous Versus Posed Expression Most of the existing work has been carried out on expression data collected by asking subjects to deliberately pose facial expressions. However, such exaggerated facial expressions rarely occur in real-life situations. Spontaneous facial expressions induced in natural environments are more subtle and fleeting, such as a tightening of the lips in anger or a lowering of the lip corners in sadness [84]. Recently, research attention has started to shift to spontaneous facial expression analysis. Sebe et al [67] collected a database containing spontaneous facial expressions and compared a wide range of classifiers, including Bayesian Networks, Decision Trees, SVM, k-Nearest-Neighbor, Bagging, and Boosting, for expression recognition. Surprisingly, the best classification results were obtained with the k-NN classifier; it seems that none of the models tried was able to entirely capture the complex decision boundary that separates different spontaneous expressions. Cohn et al [18] developed an automated system to recognize brow actions in spontaneous facial behavior captured in interview situations; their recognition accuracy was lower than that for posed facial behavior. Zeng et al [93] treated the detection of emotional expressions in a realistic conversation setting as a one-class classification problem and adopted Support Vector Data Description to distinguish emotional expressions from non-emotional ones. Bartlett et al [7] recently presented preliminary results on facial action detection in spontaneous facial expressions by adopting their AU recognition approach [44]. Spontaneous facial expressions differ from posed expressions both in terms of which muscles move and how they move dynamically. Cohn and Schmidt [17] observed that posed smiles were of larger amplitude and had a less consistent relationship between amplitude and duration than spontaneous smiles. A psychological study [26] has indicated that posed expressions may differ in appearance and timing from spontaneous ones. For example, spontaneous expressions have fast and smooth onsets, with distinct facial actions peaking simultaneously, while posed
expressions tend to have slow and jerky onsets, and the actions typically do not peak simultaneously [7]. Facial dynamics are therefore a key parameter for differentiating posed expressions from spontaneous ones [88]. Recently, Valstar et al [88] showed experimentally that the temporal dynamics of spontaneous and posed brow actions differ. They built a system to automatically discern spontaneous brow actions from deliberately posed ones, based on parameters of temporal dynamics such as speed, duration, intensity, and occurrence order, and achieved a classification rate of 90.7% on 189 samples taken from three different databases.
2.7 Expression Intensity Facial expressions vary in intensity. For example, manual FACS coding uses a 3-point or, more recently, a 5-point intensity scale to describe the intensity variation of action units [84]. A distinct emotional state of an expresser cannot be correctly perceived unless the expression intensity exceeds a certain level. Methods that work for intense expressions may generalize poorly to subtle expressions of low intensity. It has been shown experimentally that the expression decoding accuracy and the perceived intensity of the underlying affective state vary linearly with the physical intensity of a facial display [37]. Explicit analysis of expression intensity variation is also essential for distinguishing between spontaneous and posed facial behaviors [55]. Expression intensity estimation is therefore necessary and helpful for an accurate assessment of facial expressions, and it is desirable to estimate the expression intensity from the face data without human labeling [4]. Some methods have been presented to automatically quantify intensity variation in emotional expressions [41] and action units [43]. Essa and Pentland [28] represented intensity variation in smiling using optical flow. Tian et al [81] compared manual and automatic coding of intensity variation; using Gabor wavelets and a Neural Network, their approach discriminated intensity variation in eye closure as reliably as human coders did. Adopting a 3-grade intensity scale (Low, Moderate, and High), Shan [68] applied Fuzzy K-Means to derive fuzzy clusters in the embedded space. An example of expression intensity estimation is shown in Fig. 12, where the cluster memberships were mapped to expression degrees.
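The fuzzy clustering step can be sketched in a few lines of NumPy; the membership update below is the standard fuzzy c-means rule, and the choice of three clusters (Low/Moderate/High), the fuzzifier m = 2, and the assumption that X already holds low-dimensional embedded expression features are illustrative, not the exact settings of [68].

```python
import numpy as np

def fuzzy_kmeans(X, n_clusters=3, m=2.0, n_iter=100, seed=0):
    # Returns soft memberships U (n_samples x n_clusters) that can be read
    # as degrees of Low / Moderate / High expression intensity.
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]        # weighted means
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        U = 1.0 / np.maximum(dist, 1e-12) ** (2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)                     # normalise rows
    return U, centres

# memberships, _ = fuzzy_kmeans(embedded_features)  # e.g. an LPP embedding
```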
2.8 Controlled Versus Uncontrolled Data Acquisition Most of the existing facial expression recognition systems attempt to recognize facial expressions from data collected in a highly controlled environment with very high-resolution frontal faces (e.g. 200×200 pixels). However, in real-life uncontrolled environments such as smart meetings and visual surveillance, only low-resolution, compressed video input is available. Moreover, head pose variation is always present due to head movements. Fig. 13 shows a real-world image recorded
Fig. 12 (Best viewed in color) Expression intensity estimation on an example image sequence [68].
in a smart meeting scenario. There have been few studies that address facial expression recognition from data collected in real-world, unconstrained scenarios. Tian et al [83] made the first attempt to recognize facial expressions at lower resolutions (e.g. 50×70 pixels). In their real-time system, to handle the full range of head motion, the head rather than the face was first detected, and the head pose was then estimated. For faces in frontal and near-frontal views, geometric features were extracted for facial expression recognition using a Neural Network. Tian [80] further explored the effects of different image resolutions on each step of facial expression analysis. Similarly, Shan et al [72] studied recognizing facial expressions at low resolution. In the existing research, the recognition performance is usually assessed on faces normalized using manually labeled eye positions, i.e., without considering face registration errors. However, in uncontrolled environments, face registration errors are always present and difficult to avoid due to face/eye detection errors. Therefore, it is necessary to investigate facial expression analysis in the presence of face registration errors. In a more recent work [31], local-feature-based facial expression recognition approaches were studied under face registration errors, where Gaussian noise was added to the manually annotated eye positions to simulate registration errors.
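A hedged sketch of the perturbation used in such robustness studies: Gaussian noise is added to the annotated eye coordinates before the face is re-registered. The noise level (2 pixels) and the coordinate format are assumed values for illustration.

```python
import numpy as np

def perturb_eye_positions(left_eye, right_eye, sigma=2.0, rng=None):
    # Simulate face registration errors by jittering the labeled eye
    # positions; the perturbed points are then used for normalisation.
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, sigma, size=4)
    return (np.asarray(left_eye, dtype=float) + noise[:2],
            np.asarray(right_eye, dtype=float) + noise[2:])
```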
2.9 Correlation with Bodily Expression The human face is usually perceived not as an isolated object but as an integrated part of the whole body. The human body configuration and movement also reveal and enhance emotions [13]. For example, an angry face is more menacing when accompanied by a fist. When we see a bodily expression of emotion, we immediately know what specific action is associated with a particular emotion, as is the case for
Fig. 13 An example of low-resolution facial expressions recorded in real-world environments. (from PETS 2003 data set).
facial expressions [30]. See Fig. 14 for examples of affective body gestures. Psychological studies [3, 47] suggest that the visual channel combining facial and bodily expressions is most informative, and the recognition of facial expression is strongly influenced by the concurrently presented emotional bodily expression. Therefore, analyzing bodily expressions could provide helpful complementary cues for facial expression analysis. However, in affective computing, a great deal of attention has been focused on how emotions are communicated through facial expressions and voice. Little attention has been placed on the perception of emotion from bodily expressions.
Fig. 14 Examples of affective body gestures, from the FABO database [34]. From top to bottom: Fear, Joy, Uncertainty, and Surprise.
Affective bodily expression analysis is an unresolved area in psychology and nonverbal communication. Coulson [20] presented experiments on attributing six universal emotions to static body postures using computer-generated mannequin figures; his experiments suggest that recognition from body posture is comparable to recognition from voice, and that some postures are recognized as well as facial expressions. Observers tend to be accurate in decoding some negative emotions, like anger and sadness, from static body postures, and gestures like head inclination and face touching often accompany affective states like shame and embarrassment [19]. Neagle et al [50] reported a qualitative analysis of affective motion features of virtual ballet dancers, and their results show that human observers are highly accurate in assigning an emotion label to each dance exercise. Recently, Ravindra De Silva et al [64] presented an affective gesture recognition system that recognizes a child's emotion and its intensity from body gestures in the context of a game. Nayak and Turk [49] developed a system that allows virtual agents to express their mental state through body language (see Fig. 15 for some examples). In computer vision, the existing studies on gesture recognition primarily deal with non-affective gestures such as sign language [63]. There have been few investigations into affective body posture and gesture analysis, probably due to the high variability of the possible postures and gestures that can be displayed. Recently, some tentative attempts have been made [6, 33]. Shan et al [76] investigated affective body gesture analysis on a large dataset by exploiting spatio-temporal features.
Fig. 15 Examples of body language displayed by the virtual agent in [49]. From left to right: anger, defensiveness, and headache.
Human emotional and interpersonal states are conveyed not by a single indicator but rather by a set of cues. It is the combination of movements of the face, arms, legs, and other body parts, as well as voice and touching behaviors, that yields an overall display [5]. Meeren et al [47] showed that the recognition of facial expressions is strongly influenced by concurrently presented emotional body language, that the affective information from the face and the body starts to interact rapidly, and that the integration is a mandatory, automatic process occurring early in the processing stream. To more accurately simulate the human ability to assess affect, an automatic affect recognition system should make use of multimodal data, which potentially
can accomplish more effective affect analysis. However, most of the existing work has relied on a single modality. Existing work combining different modalities has investigated mainly the combination of facial and vocal signals [59], and there has been little effort on human affect analysis combining face and body gestures [61]. Kapoor and Picard [40] presented a multi-sensor affect recognition system for classifying interest (or disinterest) in children trying to solve a puzzle on the computer. The multimodal information from facial expressions and postures is sensed through a camera and a chair, respectively, and is combined with the state of the puzzle; the multimodal method is shown to outperform classification using the individual modalities. Balomenos et al [6] and Gunes and Piccardi [33] made tentative attempts to analyze emotions from facial expressions and body gestures. Recently, Shan et al [76] exploited statistical techniques to fuse the two modalities at the feature level by deriving a semantic joint feature space.
2.10 Databases To carry out research on facial expression analysis that deals with the different dimensions of the problem space discussed above, large data sets are needed for evaluation and benchmarking. We summarize the main databases of facial expressions in Table 1. The JAFFE database [46] is one of the widely used databases; it consists of 213 images of 10 Japanese females displaying posed expressions of the six basic emotions and neutral faces. The Cohn-Kanade database [39] is the most widely used database; it contains image sequences of 100 subjects posing a set of 23 facial displays. The MMI database [62] is another comprehensive database, which contains both deliberately displayed and spontaneous facial expressions of AUs and emotions. Recently, a 3D facial expression database has been built [92], which contains 3D range data for emotional expressions at a variety of intensities. Although several databases containing spontaneous facial expressions have been reported recently [67, 7, 79], most of them are currently not available to the public due to ethical and copyright issues. In addition, manual labeling of spontaneous expressions is very time-consuming and error-prone due to subjectivity. One of the available databases containing spontaneous facial expressions was collected at UT Dallas [54]; it contains videos of more than 200 subjects. Videos of spontaneous expressions (including happiness, sadness, disgust, puzzlement, laughter, surprise, and boredom) were captured while the subjects watched videos intended to elicit different emotions. Some other databases [24] were recorded from TV talk shows and contain speech-related spontaneous facial expressions.
Database                              Subjects   Expressions   Type                 Labeled
JAFFE Database [46]                   10         7 classes     Posed                Yes
Cohn-Kanade Database [39]             100        wide range    Posed                Yes
MMI Database [62]                     53         wide range    Posed/Spontaneous    Yes
FGNet Database [89]                   18         7 classes     Spontaneous          Yes
Authentic Expression Database [67]    28         4 classes     Spontaneous          Yes
UTDallas-HIT [54]                     284        11 classes    Spontaneous          No
RU-FACS [7]                           100        wide range    Spontaneous          Yes
MIT-CBCL [79]                         12         9 classes     Spontaneous          Yes
BU-3DFE Database [92]                 100        7 classes     Posed                Yes
FABO Database [34]                    23         wide range    Posed                Yes
Table 1 Summary of the existing databases of facial expressions.
3 Facial Expression Recognition with Discriminative Local Features In this section, we discuss our recent contribution to the field. We investigate discriminative local statistical features for facial expression recognition.
3.1 Local Binary Patterns Local Binary Patterns (LBP), an efficient non-parametric method summarizing the local structure of an image, has been introduced for facial representation recently [1, 73]. The most important properties of LBP features are their tolerance against monotonic illumination changes and their computational simplicity. The original LBP operator [51] labels the pixels of an image by thresholding a 3 × 3 neighborhood of each pixel with the center value and considering the results as a binary number. Formally, given a pixel at (x_c, y_c), the resulting LBP can be expressed in decimal form as

LBP(x_c, y_c) = \sum_{n=0}^{7} s(i_n - i_c) \, 2^n    (1)
where n runs over the 8 neighbors of the central pixel, i_c and i_n are the gray-level values of the central pixel and the surrounding pixel, and s(x) is 1 if x ≥ 0 and 0 otherwise. Ojala et al [52] later made two extensions of the original operator. Firstly, the operator was extended to use neighborhoods of different sizes, to capture dominant features at different scales. Using circular neighborhoods and bilinearly interpolating the pixel values allows any radius and number of pixels in the neighborhood. The notation (P, R) denotes a neighborhood of P equally spaced sampling points on a circle of radius R. Secondly, they proposed to use a small subset of the 2^P patterns produced by the operator LBP(P, R) to describe the texture of images. These patterns, called uniform patterns, contain at most two bitwise transitions from 0 to 1
or vice versa when considered as a circular binary string. For example, 00111000 and 11100001 are uniform patterns. The uniform patterns represent local primitives such as edges and corners. It was observed that most of the texture information is contained in the uniform patterns. Labeling the patterns which have more than 2 transitions with a single label yields an LBP operator, denoted LBP(P, R, u2), which produces far fewer patterns without losing too much information. After labeling an image with an LBP operator, a histogram of the labeled image f_l(x, y) can be defined as

H_i = \sum_{x,y} I(f_l(x, y) = i), \quad i = 0, \ldots, L - 1    (2)
where L is the number of different labels produced by the LBP operator and I(A) is 1 if A is true and 0 otherwise. The LBP histogram contains information about the distribution of local micro-patterns over the whole image, and so can be used to statistically describe image texture characteristics. LBP features have been exploited in many applications (see a comprehensive bibliography related to the LBP methodology online1).
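A direct NumPy sketch of Eqs. (1)-(2), i.e. the basic 3×3 operator and its label histogram, is given below; the neighbour ordering is an arbitrary but fixed choice, and real systems would normally use an optimised library implementation instead.

```python
import numpy as np

def lbp_3x3(img):
    # Eq. (1): threshold the 8 neighbours of every interior pixel against
    # the centre value and read the comparison bits as an 8-bit code.
    img = img.astype(np.int32)
    h, w = img.shape
    centre = img[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.int32)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]   # fixed neighbour order
    for n, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += ((neighbour - centre) >= 0).astype(np.int32) << n
    return codes

def lbp_histogram(codes, n_labels=256):
    # Eq. (2): distribution of the LBP labels over the whole image.
    hist, _ = np.histogram(codes, bins=np.arange(n_labels + 1))
    return hist
```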
3.2 Learning Discriminative LBP-Histogram Bins Each face image can be seen as a composition of micro-patterns that can be effectively detected by the LBP operator. Ahonen et al [1] introduced LBP for face recognition. To take the shape information of faces into account, face images are divided into M small non-overlapping regions R_0, R_1, . . . , R_{M-1} (as shown in Fig. 16). The LBP histograms extracted from each sub-region are then concatenated into a single, spatially enhanced feature histogram defined as

H_{i,j} = \sum_{x,y} I(f_l(x, y) = i) \, I((x, y) \in R_j)    (3)
where i = 0, . . . , L − 1 and j = 0, . . . , M − 1. The extracted feature histogram describes the local texture and global shape of face images. This face representation has also proved effective for facial expression recognition [73, 78]. Possible criticisms of this method are that dividing the face into a grid of sub-regions is somewhat arbitrary, as sub-regions are not necessarily well aligned with facial features, and that the resulting facial representation suffers from the fixed size and position of the sub-regions. To address these limitations, the optimal sub-regions (in terms of LBP histograms) were selected from a large pool of sub-regions generated by shifting and scaling a sub-window over face images [94, 71]. Some examples of selected sub-regions for facial expressions are shown in Fig. 17. In the above methods, LBP histograms are extracted from local facial regions as the region-level description, where the n-bin histogram is utilized as a whole.
http://www.ee.oulu.fi/mvg/page/lbp_bibliography
Fig. 16 A face image is divided into sub-regions from which LBP histograms are extracted and concatenated into a single, spatially enhanced feature histogram.
Fig. 17 The sub-regions selected by Adaboost for each facial expression. From left to right: Anger, Disgust, Fear, Joy, Sadness, and Surprise.
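As a sketch of Eq. (3), the snippet below uses scikit-image's LBP implementation and a simple rectangular grid. The 7×6 grid (42 sub-regions), the 59-bin uniform-pattern labelling, and the (8, 2) neighbourhood mirror the settings quoted later in the experiments, but the exact grid geometry is an assumption.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def spatially_enhanced_histogram(face, grid=(7, 6), P=8, R=2):
    # 'nri_uniform' gives the 59-label LBP(8, 2, u2) operator: 58 uniform
    # patterns plus one shared label for all non-uniform patterns.
    labels = local_binary_pattern(face, P, R, method='nri_uniform')
    n_bins, (rows, cols), (h, w) = 59, grid, labels.shape
    hists = []
    for r in range(rows):
        for c in range(cols):
            region = labels[r * h // rows:(r + 1) * h // rows,
                            c * w // cols:(c + 1) * w // cols]
            hist, _ = np.histogram(region, bins=n_bins, range=(0, n_bins))
            hists.append(hist)
    return np.concatenate(hists)   # 42 x 59 = 2478 bins per face
```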
However, not all bins in the LBP histogram necessarily contain useful information for facial expression recognition. It is helpful and interesting to take a closer look at the local LBP histogram at the bin level, to identify the discriminative LBP-Histogram (LBPH) bins for a better facial representation [69]. We adopt Adaboost to learn the discriminative LBPH bins. Adaboost [66] learns a small number of weak classifiers whose performance need only be better than random guessing, and boosts them iteratively into a strong classifier of higher accuracy. The Adaboost process maintains a distribution over the training samples. At each iteration, the weak classifier that minimizes the weighted error rate is selected, and the distribution is updated to increase the weights of the misclassified samples and reduce the importance of the others. As the traditional Adaboost works on two-class problems, the multi-class problem here is handled with the one-against-rest technique, which trains an Adaboost learner for one expression against all others. For each Adaboost learner, the images of one expression were positive samples, while the images of all other expressions were negative samples. The weak classifier h_j(x) is designed to select the single LBPH bin that best separates the positive and negative examples; it consists of a feature f_j (corresponding to an LBPH bin), a threshold \theta_j, and a parity p_j indicating the direction of the inequality sign:

h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) \le p_j \theta_j \\ 0 & \text{otherwise} \end{cases}    (4)
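The following NumPy sketch spells out this one-against-rest boosting of single-bin decision stumps; the brute-force threshold search and the number of rounds are illustrative choices, not the exact implementation used in the experiments.

```python
import numpy as np

def best_stump(X, y, w):
    # Eq. (4): exhaustively pick the bin j, threshold theta and parity p
    # that minimise the weighted error (slow but explicit).
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            for p in (1, -1):
                pred = (p * X[:, j] <= p * theta).astype(int)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, j, theta, p)
    return best

def adaboost_select_bins(X, y, n_rounds=200):
    # Discrete Adaboost with one-against-rest labels y in {0, 1}; returns
    # the indices of the LBPH bins chosen by the weak classifiers.
    n = len(y)
    w = np.full(n, 1.0 / n)
    selected = []
    for _ in range(n_rounds):
        err, j, theta, p = best_stump(X, y, w)
        err = np.clip(err, 1e-10, 1.0 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)
        pred = (p * X[:, j] <= p * theta).astype(int)
        # standard reweighting: up-weight mistakes, down-weight correct ones
        w *= np.exp(-alpha * (2 * y - 1) * (2 * pred - 1))
        w /= w.sum()
        selected.append(j)
    return selected
```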
3.3 Experiments We carried out experiments on the Cohn-Kanade database. Fig. 18 shows some sample images from the database. For our experiments, we selected 320 image
sequences of basic emotions, which come from 96 subjects, with 1 to 6 emotions per subject. For each sequence, the neutral face and three frames of the expression at its apex were used for prototypic emotional expression recognition, resulting in 1,280 images (108 Anger, 120 Disgust, 99 Fear, 282 Joy, 126 Sadness, 225 Surprise, and 320 Neutral). To test the algorithms' generalization performance, we adopted a 10-fold cross-validation scheme in our experiments. Following existing work [80], we scaled the faces to a fixed distance between the two eyes. Facial images of 110×150 pixels were then cropped from the original frames based on the locations of the two eyes.
Fig. 18 The sample face expression images from the Cohn-Kanade database.
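A hedged sketch of this eye-based normalisation step is shown below; the target inter-ocular distance of 55 pixels and the vertical placement of the eyes inside the 110×150 crop are assumed values, and no bounds checking is done at the image borders.

```python
import cv2
import numpy as np

def crop_face(gray, left_eye, right_eye, eye_dist=55, size=(110, 150)):
    # Scale the frame so the eyes are eye_dist pixels apart, then cut a
    # fixed-size patch positioned relative to the eye midpoint.
    (lx, ly), (rx, ry) = left_eye, right_eye
    scale = eye_dist / float(np.hypot(rx - lx, ry - ly))
    resized = cv2.resize(gray, None, fx=scale, fy=scale)
    cx, cy = (lx + rx) / 2.0 * scale, (ly + ry) / 2.0 * scale
    w, h = size
    x0, y0 = int(round(cx - w / 2.0)), int(round(cy - 0.35 * h))
    return resized[y0:y0 + h, x0:x0 + w]
```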
Experiment: Limited Sub-regions – As shown in Fig. 16, we divided face images into 42 sub-regions and applied the 59-label LBP(8, 2, u2) operator, resulting in an LBP histogram of 2,478 (42×59) bins for each face image. These parameter settings were suggested in [1]. We adopted Adaboost to learn discriminative LBPH bins and boost a strong classifier. We plot in the left side of Fig. 19 the recognition performance of the boosted strong classifier as a function of the number of features selected. With the 200 selected LBPH bins, the boosted strong classifier achieves a recognition rate of 85.3%. • Uniform patterns are usually adopted to reduce the length of LBP histograms. Here we verify the validity of uniform patterns for facial representation from a machine learning point of view. Using the LBP(8, 2) operator, each face image was represented by an LBP histogram of 10,752 (42×256) bins. We plot in the left side of Fig. 19 the recognition performance of the boosted strong classifier. We can see that the boosted strong classifier of LBP(8, 2) performs similarly to that of LBP(8, 2, u2), which illustrates that the non-uniform patterns do not provide additional discriminative information for facial expression recognition.
We took a closer look at the learned LBPH bins of LBP(8, 2), and found that 91.1% of them are uniform patterns. Therefore, most of the discriminative information for facial expression recognition is contained in the uniform patterns. • Fig. 20 shows the spatial distribution of the top 200 features selected in the 10-fold cross-validation experiments. As can be observed, different facial expressions have different distribution patterns. For example, for “Disgust”, most discriminative features are located in the inner eye corners, while most discriminative features for “Joy” are distributed around the mouth corners. Overall, discriminative features for facial expression classification are mostly distributed in the eye and mouth regions. • By varying the sampling radius R, LBP of different resolutions can be obtained. Here we also investigate multiscale LBP for facial expression recognition. We applied LBP(8, R, u2) (R = 1, · · · , 8) to extract multiscale LBP features, resulting in an LBP histogram of 19,824 (42×59×8) bins for each face image. We then ran Adaboost to learn discriminative LBPH bins from the multiscale feature pool. We plot in the right side of Fig. 19 the recognition performance of the boosted strong classifier. As can be observed, the boosted strong classifier of multiscale LBP(8, R, u2) (R = 1, · · · , 8) produces consistently better performance than that of single-scale LBP(8, 2, u2), providing a recognition rate of 88.6% with the 200 selected LBPH bins. Thus the multiscale LBP brings more discriminative information and should be considered for facial expression recognition.
Fig. 19 Average recognition rate of the boosted strong classifiers, as a function of the number of features selected. Left: LBP(8, 2, u2) vs LBP(8, 2); Right: LBP(8, 2, u2) vs LBP(8, Multiscale, u2).
Experiment: More Sub-regions – By shifting and scaling a sub-window over face images, we can obtain many more sub-regions, which potentially contain more complete and discriminative information about the face images. We shifted the sub-window with a step of 14 pixels vertically and 12 pixels horizontally.
Fig. 20 Spatial distribution of the selected LBPH bins for each facial expression (Anger, Disgust, Fear, Joy, Sadness, Surprise, Neutral) and for all expressions combined (an example face image divided into sub-regions is included in the bottom right corner for illustration).
The sub-window was scaled to 14, 21, or 28 pixels in height and 12, 18, or 24 pixels in width, respectively. In total, 725 sub-regions were obtained. Using multiscale LBP(8, R, u2) (R = 1, · · · , 8), a histogram of 342,200 (725×59×8) bins was extracted from each face image. To improve computational efficiency, we adopted a coarse-to-fine feature selection scheme: we first ran Adaboost to select LBPH bins from each single-scale LBP(8, R, u2), and then applied Adaboost to the selected LBPH bins of the different scales to obtain the final feature selection results. We plot in the left side of Fig. 21 the recognition performance of the boosted strong classifiers as a function of the number of features selected. We can see that the final boosted strong classifier of multiscale LBP provides better performance than that of each single scale. Among the strong classifiers of single scales, it seems that scales R = 3, 4, 5, 6 perform better, while the performance of scales R = 1, 8 is poor.
• The scale distribution of the finally selected multiscale LBPH bins is shown in the right side of Fig. 21. We can observe that most discriminative LBPH bins come from scales R = 3, 4, 5, 8. • Finally, we adopted an SVM to recognize facial expressions using the selected LBPH bins. An SVM using Gabor features selected by Adaboost (AdaSVM) achieves the best performance (93.3%) reported so far on the Cohn-Kanade database [44]. We compare our LBP-based methods with the Gabor-based methods in Table 2. We used the SVM implementation in the library SPIDER1, and the multi-class classification was accomplished using the one-against-rest technique. It can be observed that the boosted LBPH bins produce results comparable to those of the boosted Gabor features.
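A hedged scikit-learn sketch of this final classification stage (one-against-rest SVM on the selected bins, evaluated with 10-fold cross-validation) is given below; it stands in for the SPIDER implementation actually used, and the RBF kernel settings are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

def evaluate_selected_bins(X, y, selected_bins):
    # X: full LBPH feature matrix, y: expression labels,
    # selected_bins: indices returned by the Adaboost selection step.
    clf = OneVsRestClassifier(SVC(kernel='rbf', gamma='scale', C=1.0))
    scores = cross_val_score(clf, X[:, selected_bins], y, cv=10)
    return scores.mean()
```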
Fig. 21 Left: Average recognition rate of the boosted strong classifiers, as a function of the number of features selected. Right: Scale distribution of the selected LBPH bins.
Features      Adaboost   AdaSVM (Linear)   AdaSVM (RBF)
LBP           91.3%      93.0%             93.1%
Gabor [44]    90.1%      93.3%             93.3%
Table 2 Comparison of recognition rates between the boosted LBPH bins and Gabor-wavelet features.
1 Publicly available at http://www.kyb.tuebingen.mpg.de/bs/people/spider/index.html
4 Discussion Automating the recognition of facial expressions is a key step towards natural and effective human-computer interaction, and towards applications in many fields such as medicine, security, and education. This chapter describes the problem domain and the state of the art of automatic facial expression analysis, and summarizes our recent work on learning discriminative local features for recognizing expressions. In summary, although the human cognitive process appears to detect and interpret facial expressions with little or no effort, designing and developing an automated system that accomplishes this task is rather difficult. In the last two decades, computer recognition of facial expressions from video or images has been widely studied, and much progress has been made. However, it is still far from the stage of realizing real-life applications in natural environments. Most of the existing systems attempt to recognize a small set of prototypic emotional facial expressions deliberately posed in controlled environments, although recently some progress has been made beyond that, for example in spontaneous facial action analysis. For future research, many problems remain open, for which answers must be found. Some major challenges are considered here: • How to make use of temporal information? Psychological studies indicate that facial dynamics are crucial for the successful interpretation of facial expressions. This is especially true for spontaneous facial expressions without any deliberate, exaggerated posing. Spontaneous facial expressions also differ from posed expressions in terms of which muscles move and how they move dynamically. • How to recognize spontaneous facial expressions in real life? Real-life facial expression recognition is much more difficult than recognizing posed expressions captured in controlled environments. Spontaneous facial expressions induced in natural environments are more subtle and fleeting, and head pose variations and low-resolution input make the task more complex. • How to combine facial expressions with other modalities (e.g., the body)? Facial expression is one mode of non-verbal communication. Human emotional and interpersonal states are conveyed not by a single indicator but by a set of cues, which include the movements of the face, arms, legs, and other body parts, as well as voice and touching behaviors. Therefore, integrating multiple modalities could potentially accomplish better performance.
Acknowledgments We sincerely thank Prof. Jeffery Cohn for granting access to the Cohn-Kanade database.
References [1] Ahonen T, Hadid A, Pietikäinen M (2004) Face recognition with local binary patterns. In: European Conference on Computer Vision (ECCV), pp 469–481 [2] Ambadar Z, Schooler JW, Cohn JF (2005) Deciphering the enigmatic face: The importance of facial dynamics in interpreting subtle facial expressions. Psychological Science 16(5):403–410 [3] Ambady N, Rosenthal R (1992) Thin slices of expressive behaviour as predictors of interpersonal consequences: A meta-analysis. Psychological Bulletin 111(2):256–274 [4] Amin MA, Afzulpurkar NV, Dailey MN, Esichaikul VE, Batanov DN (2005) Fuzzy-c-mean determines the principle component pairs to estimate the degree of emotion from facial expressions. In: International Conference on Natural Computation and International Conference on Fuzzy Systems and Knowledge Discovery, pp 484–493 [5] Argyle M (1988) Bodily Communication (2nd ed.). Methuen & Co. Ltd., New York [6] Balomenos T, Raouzaiou A, Ioannou S, Drosopoulos A, Karpouzis K, Kollias S (2005) Emotion analysis in man-machine interaction systems. In: Machine Learning for Multimodal Interaction, LNCS 3361, pp 318–328 [7] Bartlett M, Littlewort G, Frank M, Lainscseck C, Fasel I, Movellan J (2006) Automatic recognition of facial actions in spontaneous expressions. Journal of Multimedia 1(6):22–35 [8] Bartlett MS, Movellan JR, Sejnowski TJ (2002) Face recognition by independent component analysis. IEEE Transactions on Neural Networks 13(6):1450– 1464 [9] Bartlett MS, Littlewort G, Frank M, Lainscsek C, Fasel I, Movellan J (2005) Recognizing facial expression: Machine learning and application to spotaneous behavior. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 568–573 [10] Bassili JN (1979) Emotion recognition: The role of facial movement and the relative importance of upper and lower area of the face. Journal of Personality and Social Psychology 37(11):2049–2058 [11] Belhumeur PN, Hespanha JP, Kriegman DJ (1997) Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7):711–720 [12] Braathen B, Bartlett M, Littlewort G, Smith E, Movellan JR (2002) An approach to automatic recognition of spontaneous facial actions. In: IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp 231–235 [13] Burgoon JK, Buller DB, Woodall WG (1996) Nonverbal Communication: The Unspoken Dialogue. McGraw-Hill, New York [14] Chang Y, Hu C, Turk M (2003) Mainfold of facial expression. In: IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG), pp 28–35
[15] Chang Y, Hu C, Turk M (2004) Probabilistic expression analysis on manifolds. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 520–527 [16] Cohen I, Sebe N, Garg A, Chen L, Huang TS (2003) Facial expression recognition from video sequences: Temporal and static modeling. Computer Vision and Image Understanding 91:160–187 [17] Cohn JF, Schmidt KL (2004) The timing of facial motion in posed and spontaneous smiles. International Journal of Wavelets, Multiresolution and Information Processing 2:1–12 [18] Cohn JF, Reed LI, Ambadar Z, Xiao J, Moriyama T (2004) Automatic analysis and recognition of brow actions in spontaneous facial behavior. In: IEEE International Conference on Systems, Man, and Cybernetics, p 610 [19] Costa M, Dinsbach W, Manstead ASR, Bitti PER (2001) Social presence, embarrassment, and nonverbal behavior. Journal of Nonverbal Behavior 25(4):225–240 [20] Coulson M (2004) Attributing emotion to static body postures: Recognition accuracy, confusions, and viewpoint dependence. Journal of Nonverbal Behavior 28(2):117–139 [21] Cowie R, Douglas-Cowie E, Savvidou S, McMahon E, Sawey M, Schroder M (2000) ’feeltrace’: An instrument for recording perceived emotion in real time. In: Proceeding of the ISCA Workshop on Speech and Emotion, pp 19–24 [22] Darwin C (1872) The Expression of the Emotions in Man and Animals. John Murray, London [23] Donato G, Bartlett M, Hager J, Ekman P, Sejnowski T (1999) Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(10):974–989 [24] Douglas-Cowie E, Cowie R, Schroder M (2000) A new emotion database: Considerations, sources and scope. In: Proceeding of the ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research, pp 39–44 [25] Ekman P, Friesen W (1971) Constants across cultures in the face and emotion. J Personality Social Psychol 17(2):124–129 [26] Ekman P, Rosenberg E (1997) What the Face Reveals: Basic and Applied Studies of Spontaneous Expression using the Facial Action Coding System (FACS). New York: Oxford Univ. Press [27] Ekman P, Friesen WV, Hager JC (2002) The Facial Action Coding System: A Technique for the Measurement of Facial Movement. San Francisco: Consulting Psychologist [28] Essa I, Pentland A (1997) Coding, analysis, interpretation, and recognition of facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7):757–763 [29] Fasel B, Luettin J (2003) Automatic facial expression analysis: a survey. Pattern Recognition 36:259–275 [30] de Gelder B (2006) Towards the neurobiology of emotional body language. Nature Reviews Neuroscience 7:242–249
[31] Gritti T, Shan C, Jeanne V, Braspenning R (2008) Local features based facial expression recognition with face registration errors. In: IEEE International Conference on Automatic Face & Gesture Recognition (FG), Amsterdam, The Netherlands [32] Gu H, Ji Q (2005) Information extraction from image sequences of real-world facial expressions. Machine Vision and Applications 16:105–115 [33] Gunes H, Piccardi M (2006) Bi-modal emotion recognition from expressive face and body gestures. Journal of Network and Computer Applications (In press) [34] Gunes H, Piccardi M (2006) A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior. In: International Conference on Pattern Recognition (ICPR), vol 1, pp 1148–1153 [35] Guo G, Dyer CR (2003) Simultaneous feature selection and classifier training via linear programming: A case study for face expression recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 346–352 [36] He X, Niyogi P (2003) Locality preserving projections. In: Advances in Neural Information Processing Systems (NIPS) [37] Hess U, Blairy S, Kleck RE (1997) The intensity of emotional facial expression and decoding accuracy. Journal of Nonverbal Behavior 21(4):241–257 [38] Kaliouby RE, Robinson P (2004) Real-time inference of complex mental states from facial expressions and head gestures. In: IEEE Conference on Computer Vision and Pattern Recognition Workshop, pp 154–154 [39] Kanade T, Cohn J, Tian Y (2000) Comprehensive database for facial expression analysis. In: IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp 46–53 [40] Kapoor A, Picard RW (2005) Multimodal affect recognition in learning environments. In: ACM International Conference on Multimedia, pp 677–682 [41] Kimura S, Yachida M (1997) Facial expression recognition and its degree estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 295–300 [42] Lee CS, Elgammal A (2005) Facial expression analysis using nonlinear decomposable generative models. In: IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG), pp 17–31 [43] Lien JJ, Kanade T, Cohn JF, Li C (1998) Subtly different facial expression recognition and expression intensity estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 853–859 [44] Littlewort G, Bartlett M, Fasel I, Susskind J, Movellan J (2006) Dynamics of facial expression extracted automatically from video. Image and Vision Computing 24(6):615–625 [45] Littlewort G, Bartlett MS, Lee K (2006) Faces of pain: Automated measurement of spontaneous facial expressions of genuine and posed pain. In: Joint Symposium on Neural Computation, pp 1–1 [46] Lyons MJ, Budynek J, Akamatsu S (1999) Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(12):1357–1362
[47] Meeren H, Heijnsbergen C, Gelder B (2005) Rapid perceptual integration of facial expression and emotional body language. Procedings of the National Academy of Sciences of USA 102(45):16,518–16,523 [48] Mehrabian A (1968) Communication without words. Psychology Today 2(4):53–56 [49] Nayak V, Turk M (2005) Emotional expression in virtual agents through body language. In: International Symposium on Visual Computing, pp 313–320 [50] Neagle RJ, Ng K, Ruddle RA (2003) Studying the fidelity requirements for a virtual ballet dancer. In: Vision, Video and Graphics, pp 181–188 [51] Ojala T, Pietikäinen M, Harwood D (1996) A comparative study of texture measures with classification based on featured distribution. Pattern Recognition 29(1):51–59 [52] Ojala T, Pietikäinen M, Mäenpää T (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7):971–987 [53] Oliver N, Pentland A, Berard F (2000) Lafter: a real-time face and lips tracker with facial expression recognition. Pattern Recognition 33:1369–1382 [54] O’Toole AJ, Harms J, Snow SL, Hurst DR, Pappas MR, Ayyad JH, Abdi H (2005) A video database of moving faces and people. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5):812–816 [55] Pantic M, Bartlett MS (2007) Machine analysis of facial expressions. In: Kurihara K (ed) Face Recognition, Advanced Robotics Systems, Vienna, Austria, pp 377–416 [56] Pantic M, Patras I (2006) Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences. IEEE Transactions on Systems, Man, and Cybernetics 36(2):433–449 [57] Pantic M, Rothkrantz L (2000) Automatic analysis of facial expressions: the state of art. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12):1424–1445 [58] Pantic M, Rothkrantz L (2000) Expert system for automatic analysis of facial expression. Image and Vision Computing 18(11):881–905 [59] Pantic M, Rothkrantz L (2003) Toward an affect-sensitive multimodal humancomputer interaction. In: Proceeding of the IEEE, vol 91, pp 1370–1390 [60] Pantic M, Rothkrantz LJM (2004) Facial action recognition for facial expression analysis from static face images. IEEE Transactions on Systems, Man, and Cybernetics 34(3):1449–1461 [61] Pantic M, Sebe N, Cohn J, Huang T (2005) Affective multimodal humancomputer interaction. In: ACM International Conference on Multimedia, pp 669–676 [62] Pantic M, Valstar M, Rademaker R, Maat L (2005) Web-based database for facial expression analysis. In: IEEE International Conference on Multimedia and Expo (ICME), pp 317–321 [63] Pavlovic VI, Sharma R, Huang TS (1997) Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7):677–695
[64] Ravindra De Silva P, Osano M, Marasinghe A (2006) Towards recognizing emotion with affective dimensions through body gestures. In: IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp 269– 274 [65] Russell JA (1994) Is there univeral recognition of emotion from facial expression. Psychological Bulletin 115(1):102–141 [66] Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Maching Learning 37(3):297–336 [67] Sebe N, Lew MS, Cohen I, Sun Y, Gevers T, Huang TS (2004) Authentic facial expression analysis. In: IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp 517–522 [68] Shan C (2007) Inferring facial and body language. PhD thesis, Queen Mary, University of London [69] Shan C, Gritti T (2008) Learning discriminative lbp-histogram bins for facial expression recognition. In: British Machine Vision Conference (BMVC), Leeds, UK [70] Shan C, Gong S, McOwan PW (2005) Appearance manifold of facial expression. In: Sebe N, Lew MS, Huang TS (eds) Computer Vision in HumanComputer Interaction, Lecture Notes in Computer Science, vol 3723, Springer, pp 221–230 [71] Shan C, Gong S, McOwan PW (2005) Conditional mutual information based boosting for facial expression recognition. In: British Machine Vision Conference (BMVC), Oxford, UK, vol 1, pp 399–408 [72] Shan C, Gong S, McOwan PW (2005) Recognizing facial expressions at low resolution. In: IEEE International Conference on Advanced Video and SignalBased Surveillance (AVSS), Como, Italy, pp 330–335 [73] Shan C, Gong S, McOwan PW (2005) Robust facial expression recognition using local binary patterns. In: IEEE International Conference on Image Processing (ICIP), Genoa, Italy, vol 2, pp 370–373 [74] Shan C, Gong S, McOwan PW (2006) A comprehensive empirical study on linear subspace methods for facial expression analysis. In: IEEE Conference on Computer Vision and Pattern Recognition Workshop, New York, USA, pp 153–158 [75] Shan C, Gong S, McOwan PW (2006) Dynamic facial expression recognition using a bayesian temporal manifold model. In: British Machine Vision Conference (BMVC), Edinburgh, UK, vol 1, pp 297–306 [76] Shan C, Gong S, McOwan PW (2007) Beyond facial expressions: Learning human emotion from body gestures. In: British Machine Vision Conference (BMVC), Warwick, UK [77] Shan C, Gong S, McOwan PW (2007) Capturing correlations among facial parts for facial expression analysis. In: British Machine Vision Conference (BMVC), Warwick, UK [78] Shan C, Gong S, McOwan PW (2008) Facial expression recognition based on local binary patterns: A comprehensive study. Image and Vision Computing
[79] Skelley J, Fischer R, Sarma A, Heisele B (2006) Recognizing expressions in a new database containing played and natural expressions. In: International Conference on Pattern Recognition (ICPR), pp 1220–1225 [80] Tian Y (2004) Evaluation of face resolution for expression analysis. In: International Workshop on Face Processing in Video, pp 82–82 [81] Tian Y, Kanade T, Cohn J (2000) Eye-state action unit detection by gabor wavelets. In: International Conference on Multimodal Interfaces (ICMI), pp 143–150 [82] Tian Y, Kanade T, Cohn J (2001) Recognizing action units for facial expression analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(2):97–115 [83] Tian Y, Brown L, Hampapur A, Pankanti S, Senior A, Bolle R (2003) Real world real-time automatic recognition of facial expression. In: IEEE workshop on performance evaluation of tracking and surveillance (PETS), Australia [84] Tian Y, Kanade T, Cohn J (2005) Handbook of Face Recognition, Springer, chap 11. Facial Expression Analysis [85] Tong Y, Liao W, Ji Q (2006) Inferring facial action units with causal relations. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1623–1630 [86] Turk M, Pentland AP (1991) Face recognition using eigenfaces. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [87] Valstar M, Pantic M (2006) Fully automatic facial action unit detection and temporal analysis. In: IEEE Conference on Computer Vision and Pattern Recognition Workshop, p 149 [88] Valstar M, Pantic M, Ambadar Z, Cohn JF (2006) Spontaneous vs. posed facial behavior: Automatic analysis of brow actions. In: International Conference on Multimodal Interfaces (ICMI), pp 162–170 [89] Wallhoff F (2006) Facial expressions and emotion database. http://www.mmk.ei.tum.de/ waf/fgnet/feedtum.html [90] Yacoob Y, Davis LS (1996) Recognizing human facial expression from long image sequences using optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(6):636–642 [91] Yeasin M, Bullot B, Sharma R (2004) From facial expression to level of interests: A spatio-temporal approach. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 922–927 [92] Yin L, Wei X, Sun Y, Wang J, Rosato M (2006) A 3d facial expression database for facial behavior research. In: IEEE International Conference on Automatic Face & Gesture Recognition (FG) [93] Zeng Z, Fu Y, Roisman GI, Wen Z, Hu Y, Huang TS (2006) Spontaneous emotional facial expression detection. Journal of Multimedia (JMM) 1(5):1–8 [94] Zhang G, Huang X, Li SZ, Wang Y, Wu X (2004) Boosting local binary pattern (lbp)-based face recognition. In: Chinese Conference on Biometric Recognition, pp 179–186
[95] Zhang Y, Ji Q (2005) Active and dynamic information fusion for facial expression understanding from image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5):1–16 [96] Zhang Z, Lyons MJ, Schuster M, Akamatsu S (1998) Comparison between geometry-based and gabor-wavelets-based facial expression recognition using multi-layer perceptron. In: IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp 454–461
Sharing Content and Experiences in Smart Environments Johan Plomp, Juhani Heinilä, Veikko Ikonen, Eija Kaasinen and Pasi Välkkynen
1 Introduction Once upon a time... Stories used to be the only way to pass on a message. The storyteller would take his audience through the events by mere oration. Here and there he would hesitate, whisper, or gesticulate to emphasise his story or induce the right emotions in his audience. No doubt troubadours were loved; they brought both news of the world and entertainment. The urge to share the highlights of our lives with other people has not changed. The means we have available to do so, however, have changed radically. From photo albums we have arrived at an age where anyone with a few computer skills is able to create his or her own multimedia show. Sharing is a matter of pressing the send button in your e-mail system or publishing via web-based publishing systems like Flickr or YouTube. In this chapter we will examine how sharing experiences can be supported by new developments in smart environments. As web-based sharing is reviewed in various other sources, we will limit ourselves to the particularities of sharing in “smart environments”, which we will interpret as including home and public environments as well as nomadic use of services. In order to better understand the nature of our quest, we will first try to elaborate what experiences are and how they could be shared. Subsequently we will review the changing role of end-users of digital services - no longer are they just consumers of ready-made content, but active producers of experiences. In order to introduce smart environments we will identify three stages of ubiquitous computing research and briefly review how they have supported experience sharing. From our own research we have selected some projects to highlight different aspects of experience sharing. Finally we will conclude this chapter with a discussion and some indications for future research. authors VTT, PO Box 1300, 33101, Tampere, Finland, e-mail:
[email protected]
2 What is Experience, Can You Share It? The greatest motivator for people to produce and carry digital content with them is their willingness to share such content with other people. Digital content, and the sharing of such content, is easily understood to concern videos, photos, music, etc. In contrast, the concept of ‘experience sharing’ is more difficult to define, explain and understand, because the term ‘experience’ cannot be defined unambiguously without some semantic analysis.
2.1 Defining Experience The word experience has different nuances in different languages [29]. The meaning of the term can be understood at least in three different ways: ‘Primary Experience’, ‘Secondary Experience’ and ‘Emotional Experience’. The primary experience focuses on receiving and processing a certain personal knowledge of experience. The secondary experience consists of individual experience on the context and objects in the context with their subjective meaning and significance to a person. The emotional experience is a personal emotional feeling. The primary experience is clearly not serving our purpose in this context. This kind of experience deals with accumulated experience of skills, e.g. working experience and routines gathered during a career. Secondary experience can be used to refer to a more common and everyday experience, which refers to the past and to something that has happened to a person before, and which is familiar to him or her. A person might have a feeling, an expectation or an attitude towards the thing to be experienced, which is based on something already known to her/him from the past. Finally, the term experience may refer to a deep and multi sensorial emotional experience, enabling changes to one’s behaviour. This type of experiencing is stronger and might more deeply influence a person compared to the previous experience definitions. An experience according to the second definition above can lead to this kind of deep and meaningful experience, but not necessarily vice versa. A meaningful and deep emotional experience has a nuance of novelty and uniqueness [28, 29]. All experiences (regardless of their type) that people have gained and which they accumulate, have some impact on the person’s future, and former experiences either shut down or open up access to future experiences.
2.2 Experience Design Modeling Is it possible to design, produce and share experiences? If so, what kinds of experiences, and at what level?
Deep emotional experiences, which may lead to personal change, are very subjective. Hence, an experience cannot be produced with absolute certainty, but the settings that are ideal for producing this kind of experience can be objectively created [24]. It has been stated [26] that "The elements that contribute to superior experiences are knowable and reproducible, which make them designable". And further: "What these solutions require first and foremost is an understanding by their developers of what makes a good experience; then to translate these principles, as well as possible, into the desired media without the technology dictating the form of the experience." If we accept that experiences can be designed into products and services, we need to know the elements of experience design, based on experience modelling. What are the factors of user experience in smart environments? As examples of experience modelling efforts, we will mention two models.
Fig. 1 Experience pyramid (redrawn from [28])
A model suitable for designing experiences into products and services, the so-called experience pyramid, is reported in [28]. The model has two kinds of levels: vertical levels of subjective experience (individuality, authenticity, story, multi-sensory perception, contrast, interaction) and horizontal levels of experience product/service elements (e.g. interest on the motivational level, sense perception on the physical level, learning on the intellectual level, experience on the emotional level, change on the mental level). A good experience product includes all elements of the experience at each level of subjective experiencing. Still, whether a 'meaningful experience' is actually reached by
a person depends on his or her social and cultural background, expectations towards the service or product, and so on [29]. User experience is the experience that a user obtains when interacting with a product under particular conditions. Based on this definition, an experience model has been suggested which classifies the different factors of experience into five groups [2]:
• User internal factors (values, emotions, expectations, prior experiences, physical characteristics, motor functions, personality, motivation, skills, age, etc.)
• Social factors (time pressure, pressure of success and failure, explicit and implicit requirements, etc.)
• Cultural factors (sex, fashion, habits, language, symbols, religion, etc.)
• Context of use (time, place, accompanying persons, temperature, etc.)
• Product aspects (usability, functions, size, weight, language, symbols, aesthetic characteristics, usefulness, reputation, adaptivity, mobility, etc.)
These factors are integrated through user interaction, which produces the user experience in the case at hand.
2.3 On Sharing Experiences To clarify the concepts of 'experience design' and 'user experience design', it has been stated [23] that 'experience design' uses the interactions of customers with products and services to optimise the overall impression these leave behind. 'User experience design' applies a similar approach to a specific set of products - computer-related ones. Hence, user experience design covers traditional Human-Computer Interaction (HCI) design and extends it by addressing all aspects of a product or service as perceived by users. The 'user experience' refers to the overall impression, feelings and interactions that a person has while interacting with these systems. In practice, this applies to almost all electronic and mechanical products we own or use, since they often rely on computers for their functioning [23]. User experience research focuses on the interactions between people and computer-based products or services, and on the user's experience resulting from that interaction. The user experience is without doubt subjective in nature, and although it can, it does not necessarily lead to deep emotional feelings and behavioural changes for the user (emotional experiences). Still, user experience design should be extended to cover all aspects of experiencing the product or service, including physical, sensory, cognitive, emotional and aesthetic relations [14]. The success of sharing one's experience with other people is rooted in the conscious production of experiences. Sharing an experience with other people may give additional content and meaning to an experience and thus deepen one's personal emotional experience. 'Co-experience' is user experience that is created in social interaction, driven by social needs of communication, and in which the
participants together contribute to the shared experience [3]. Even if we accept that user experiences can be designed and produced (which is not self-evident), sharing or transferring experiences between people is in many cases also a technical challenge. Although it seems possible to create favourable circumstances for certain experiences, and hence to design, produce and share experiences, one cannot guarantee that the other person's experience is as intended. An emotional experience, at least, is very subjective, being a complex mixture of a person's earlier experiences, personality, values, behavioural norms, preferences and expectations - all issues treated in the vertical levels of the experience pyramid. Probably no one is able to experience another person's emotional experiences in exactly the same way. Nevertheless, while we may create circumstances that make individual experiences converge, the subjective user experiences will hardly ever be exactly the same.
3 The Changing Role of the User in Content Provision (Prosumers) Currently, many of the most popular web applications are characterised by increasing user participation. Users create and import their own content for others to share in public virtual spaces, and add value by commenting on, recommending or tagging existing content. This phenomenon is called social media or Web 2.0. Users themselves produce topical and personal content, and recommendations from other users help in finding relevant content. User communities created around shared interests are efficient tools for sharing information. Social media started as entertainment, but it also has huge potential in everyday and working-life applications that involve people interacting with information, content and each other. Another change that is happening concerns the way content is accessed. Technologies such as RSS feeds enable push-type solutions: it is possible to keep up to date on almost anything, which changes the way people interact with content and media. To allow ordinary people to get the most out of the opportunities of social media, we need to be aware of the potential bottlenecks of adopting social media in different user groups. Those bottlenecks include security and trust, digital identities, everyman's rights in the digital domain, usable tools for content creation, sharing and storage, as well as efficient tools to manage and benefit from information flows from the outside world. Increasing user participation also challenges traditional application and service design paradigms. Our digital environment is evolving into an ecosystem of services, where new services have to compete for their share of user attention with already existing services. Designers cannot design new services in a vacuum: in addition to designing the service itself, the designer also has to design how the new service fits into the service ecosystem. In fitting the service to the constantly evolving service ecosystem, users again get an important role. The design does not
end at the designer's drawing board: it is continued by the users for as long as they use the system or service. As users adopt usage practices and create content for the service, they actually finalise the design. Lead users, who are the first to adopt new services, also have an important role in creating the first content and in helping other users to adopt the service by showing them the way. Design and usage are becoming integrated and are no longer as separate as they used to be. More information on the changing faces of innovation can be found in [12]. Another paradigm shift caused by social media is that services do not really exist without user-generated content, as they have no value without it. Before a service is ready to be put onto the market, it has to have some forerunner user groups who create content and encourage other users to join. These forerunner users should have a strong role in the design. The design should also focus on user communities and their practices for sharing information. If design and usage can be combined as early as possible, new social media services have better prospects of success. In smart environments, too, the design continues in actual use, where users shape the technology by producing content as well as by creating and adopting usage practices. Smart environments need to be flexible enough to allow this kind of "local design" [15]. Social media is also gaining an important role in smart environments that include content created and shared by the users. Users of smart environments are not just passive objects receiving information and guidance but active producers and consumers of media (prosumers). Users should be given an active role in the design and setup of smart environments. In addition to individual users, we have to focus on user communities and their usage practices.
4 Smart Environments and Experience Sharing Features 4.1 From Ubiquitous Computing to Digital Ecologies As we have seen in the previous section, the smartness of an environment will, in our view, depend heavily on the users in that environment. Whereas the ubiquitous computing vision promoted by Weiser [31] emphasised the increasing interoperability and computational power of our environment and the increased intelligence of services built on top of it, we prefer to view the smartness of the environment as the result of the cooperation of humans and technology in a given environment. Weiser recognised the importance of focussing on people, and the Ambient Intelligence vision (see e.g. [1]) made this aspect more concrete by explicitly stating that interaction is an inherent part of ambient intelligence. Our approach views the environment as an ecology in which people and technologies are in constant change. This ecosystem is shaped by both design and usage activities.
When reviewing our previous research, as will be done in the next section, we can see the same evolution of the smart environment concept: initial research followed the ubiquitous computing vision (UBI), soon succeeded by the more interaction-oriented Ambient Intelligence notion (AmI), and now culminating in the participatory digital ecology (ECO). The examples provided mainly represent the AmI approach, but some (e.g. ExpeShare and m:Ciudad) already point forward to the digital ecology paradigm. We should also mention a wide national research project on Smart Environments in which a great number of Finnish research institutes cooperated on systematic methodologies for the creation of smart environments or ecologies; the results of this research are available in Finnish [9].
UBI: Our initial research focussed on the interconnectivity of autonomous components in the environment. We perceived these components as performing specific tasks: sensing the environment, actuating changes in it, providing means for user input and feedback, and carrying out the computational tasks that add intelligence to the environment. We called these environmental components "smartlets". The main research issues included wired and wireless (ad hoc) communication, abstraction of sensor input, sensor fusion, and decision-making support. Context awareness was aimed at detecting the device environment or the environmental conditions of the user.
AmI: Soon the user was put at the centre and research started to focus on adapting services to the user by using information about the user's context or preferences. New interaction paradigms were researched with the aim of facilitating mobile interaction, the main focus in the Northern European countries. Experiments were done with less-used modalities such as gestures. User feedback and usability became more important, and user-centred design methods were applied to the development of new services. Ontologies were applied not only to reflect hidden features of devices, but also to domain-specific concepts, to tagging content and to setting up user profiles.
ECO: Recently our research has started to focus on designing the environment as a system consisting of people and the digital services provided by the environment. The research has drawn from developments on the web, where end users participate more and more in building services and providing content. Our research has focussed on how to provide affordances: enhancements in the environment that aim to form a natural bridge between the digital and physical worlds. Design methodologies now aim to design services not just as stand-alone components, but as parts to be incorporated into services orchestrated by the users themselves. Interoperability is now considered more on a semantic level than on a physical communication level. Interaction research aims to empower users to tweak their environments with a minimum of effort, instead of having the environment adapt automatically to a best guess of the user's wishes.
4.2 Features for Sharing Content and Experiences Our survey of the term experience resulted in the somewhat critical note that experiences cannot actually be transferred from one person to another. It is possible to induce a similar experience, however, by triggering the right memories, providing
content and encouraging emotions. For the remainder of this chapter we will take a pragmatic approach and limit ourselves to supporting the sharing of (past) experiences and the creation of shared experiences. Here we assume that sharing (past) experiences is possible by presenting multimedia content and conveying relevant details about the original context. Creating shared experiences refers to the collaborative creation of a collection of such experiences. Note that reviewing past experiences in a new context or with other people can also create a new experience. In all cases we assume that proper interaction methods will be instrumental to the success of the solutions. Furthermore, we note that managing experiences must be addressed as well.
4.2.1 Multimedia Storage and Retrieval Before the digital age, sharing multimedia content was limited to showing paper photographs or slides to one another and exchanging records. The content was intrinsically associated with the carrying medium. Sharing content meant sharing media, and there were very limited possibilities to copy the content; only photographs were relatively easy to reproduce from their negatives. Storage happened on shelves and in albums, much as books had been kept for generations. Libraries and other owners of large quantities of content usually used a card index and a basic classification system to keep track of the content items. Tape recorders and copying machines brought a change. The possibility of recording content made it possible to separate the content from the original carrying medium, and a number of people could enjoy the same content. Copying was still a slow task - it took as long as the record lasted - but it was sufficiently easy to make the record companies start worrying. As a physical medium, the tape, was still necessary to store the content, there was no fundamental difference in storage methods. The compact disc was the first commercial digital medium for storing music. While the first systems only allowed CDs to be played, in time the recordable disc was developed, making copying even easier and eventually also faster. Still, CDs and DVDs are physical media, and no change in storage methods was necessary. Only when the memory capacity of PCs, portable players and digital cameras had grown sufficiently to comfortably hold the contents of a number of CDs or DVDs did content management make great progress. The development of efficient compression algorithms like MP3 was also important in this process. Now content can be copied at will and stored in large "containers". The content is fully separated from the medium and exists only in digital format, as a file in a device with sufficient memory. Exchanging content no longer requires exchanging media and, what is more, it does not necessarily occupy shelf space. The absence of a physical medium requires new ways to archive our content, and there is more to it than just reproducing the card index in electronic form. The digital age has also seen a vast increase in end-user created content. Taking a picture is no longer associated with high development and printing costs, so one can shoot as much as one likes. Users of digital cameras typically take thousands of
pictures, probably more than ten times the amount taken with a film camera. While some users make the effort to collect the best shots, combine them into a (digital) album and provide them with some annotations, most will rely on the system to store the shooting date and try to retrieve their pictures that way. Content, whether pictures, videos, music, games or something else, is shared by copying files. If any physical medium is used at all, it is usually temporary and the same for all content types. All content can be published on the web, allowing anyone to copy it. Sharing is instantaneous and universal, without the transfer of physical carriers. While it seems that the digital age has opened the way for the infinite exchange of multimedia content, it has simultaneously created a new dilemma. With the vast increase in our own content and in content obtained from elsewhere, how do we manage it all? Categorising is not sufficient, as few people find the time to annotate the content manually. And while this might still be feasible for pictures, videos hardly even offer the means for it.
4.2.2 The Role of Context and Metadata The semantic web relies on enhancing content with semantic information to help find and associate web content. It emphasises the use of ontologies for this purpose - structured vocabularies defining how real-world terms are associated with each other. Metadata consists of terms from these ontologies categorising the content. Users of the content can set up profiles consisting of ontology terms and their personal relationship to them (usually how much they like content categorised with that term). These profiles can be used to personalise content provision and content search results, but also the way the user interacts with the system. Ontology construction is a surprisingly difficult and laborious task. While efforts have been made to provide an ontology of everything [16], practical ontologies are usually limited to a certain application domain. Finding the right level of abstraction, limiting the domain and agreeing on the meaning of the terms and their associations are among the challenges of ontology creation. Several web-based services have completely abandoned the idea of basing metadata on strictly specified ontologies and allow users to create their own keywords. The vocabularies that develop in this way are called folksonomies [6]. They serve their purpose, but provide no way of associating related terms with each other. The metadata associated with the content can provide more than just a description of the content. Information about the context - i.e. where, when and in what circumstances the content was created - can also be very informative and may be added to the metadata. Furthermore, the content description itself can be made at several levels serving different purposes. The first description that comes to mind is "what is in the picture (or music)?". While this is very interesting metadata, it is also one of the most difficult to provide, as it often relies on the willingness of the creator to annotate the content. Descriptions may also tell something about the characteristics of the data,
such as colour histograms, beats per minute, etc. This kind of metadata can be added by analysing the content with suitable algorithms and requires less effort from the user. Advanced algorithms can even categorise the content, recognise faces, etc., and in this way come close to user annotations. For experience sharing, metadata is important both as a help in managing and selecting content and as a means of conveying more of the experience of the moment when the content was created. Cues in the metadata provided with the content may be used to reconstruct part of the experience - for example, the music that played when a picture was taken, or the people present at a meeting. In addition to metadata, explicitly associating content items with each other may also be a way to convey the experience. Combining content created by several people at an event, together with their personal annotations, may provide a better reconstruction than content from a single user could.
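To make the interplay between user-assigned tags, context metadata, automatically derived features and user profiles more concrete, the following minimal sketch models it with plain Python data structures. The item name, tag vocabulary, profile weights and scoring rule are illustrative assumptions made for this example only; none of the systems discussed in this chapter works exactly this way.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ContentItem:
    """A media file with user tags, context metadata and machine-derived features."""
    name: str
    tags: List[str]                 # folksonomy-style keywords chosen by the user
    context: Dict[str, str]         # where/when the item was created
    features: Dict[str, float] = field(default_factory=dict)  # from content analysis

def score(item: ContentItem, profile: Dict[str, float]) -> float:
    """Sum the profile weights of the tags attached to an item."""
    return sum(profile.get(tag, 0.0) for tag in item.tags)

# Hypothetical item and profile, for illustration only.
photo = ContentItem(
    name="concert.jpg",
    tags=["concert", "helsinki", "friends"],
    context={"location": "Helsinki", "date": "2008-06-21"},
    features={"brightness": 0.35},   # e.g. derived from a colour histogram
)
profile = {"concert": 1.0, "friends": 0.8, "work": -0.5}

ranked = sorted([photo], key=lambda it: score(it, profile), reverse=True)
print(ranked[0].name, score(photo, profile))
```

In a real system the profile weights would typically be learnt from usage, and the derived features would come from dedicated content analysis algorithms rather than being entered by hand.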
4.2.3 Interaction to Enhance Experiences The viewing and creation of experiences naturally relies on proper means of rendering and manipulating content. We will not go into detail about the best means of showing multimedia content; suffice it to observe that current A/V equipment does a good job and that mobile solutions are developing rapidly, although display limitations will continue to hamper the proper viewing of images and videos. Neither will we look at immersive environments or virtual reality as an option, but we focus instead on home, public and mobile environments in everyday life. In these domains, interaction technologies can enhance the management, annotation and provision of content, and the linking of devices for exchanging or rendering content. In the next section we will give several examples of such interactive systems.
5 Examples of Research in Sharing Content and Experiences This section reviews some of our own research and how it has contributed to solutions for sharing content and experiences in smart environments. While our early work also included ubiquitous computing research in the sense described in the previous section, that work mainly provided the infrastructure for the communication and exchange of content and context, and the hardware platforms for our interaction research.
5.1 Video Content Management in Candela The Candela project (Eureka/ITEA, 2003-2005, see [4, 7]) identified the challenges related to the growing amount of content. Solutions were sought by enhancing the content with metadata describing it. A related problem identified in the project was the delivery of content over slow networks, for example to mobile clients. The project worked on various ways to allow content and user interfaces (mostly web-based) to be scaled to the available network and rendering client. The main content type considered in the project was video, and effort was also dedicated to the automatic analysis of this content.
Fig. 2 Candela project vision to facilitate video retrieval
The use of metadata allowed videos to be enhanced with information about their contents, the situation in which the video was made, and other relevant information. This in turn allowed for the efficient retrieval of videos, or parts of them, in retrieval tasks. MPEG-7 was used as the format for storing the metadata with the videos. Some metadata was added manually: video creators were provided with tools to annotate their videos using keywords from a predefined ontology. The domain was restricted to keep the ontology simple, and some experiments were also done in which users could extend the ontology themselves. Other metadata was generated automatically by analysing the videos and their audio tracks. Provisions for enhancing the metadata with context information, for example shooting location and author, were made but not utilised during the project (the information was entered manually) [25].
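As a rough illustration of the kind of per-take annotation described above, the sketch below builds a small, MPEG-7-inspired record with Python's standard XML library. The element names, attribute names and keywords are simplified assumptions made for this example; the project itself stored its metadata in MPEG-7 proper.

```python
import xml.etree.ElementTree as ET

# Build a simplified annotation for one take of a home video.
# Element and attribute names are illustrative, not the MPEG-7 schema.
video = ET.Element("VideoAnnotation", attrib={"file": "holiday_2004.mpg"})
take = ET.SubElement(video, "Take", attrib={"start": "00:01:20", "end": "00:02:05"})
ET.SubElement(take, "Keyword").text = "beach"        # keyword from the project ontology
ET.SubElement(take, "Keyword").text = "children"
ET.SubElement(take, "Context", attrib={"type": "location"}).text = "Oulu"
ET.SubElement(take, "Context", attrib={"type": "author"}).text = "family member"

# Serialise the record; a retrieval service could index it by keyword and context.
print(ET.tostring(video, encoding="unicode"))
```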
The system built in the project was capable of delivering videos to fixed and mobile web clients via direct internet connections and wireless connections. When a mobile phone client was detected, the user interface and the content were scaled according to the characteristics of the client. The scaled video and user interface made it possible to use the system even over very limited connections, naturally at the cost of video quality. The user interface allowed the user to specify criteria for searching video clips. The search criteria could be given using terms from the same ontology used for annotating the video clips. The clips found could be complete videos or individual takes, as annotation could be done for each separate take of a video. The video clips and their metadata were stored in a relational database to allow for easy extension. As the project focussed on technical issues, no extensive user tests were done. From a user perspective the system was state-of-the-art at the time but very cumbersome to use: the creation, annotation and addition of videos to the system required various tools and nearly professional skills. The retrieval of the videos was rather simple and implemented as a regular web service, with all scaling details hidden from the user. Another lesson learnt from the project was that producing ontologies, even for the limited domain at hand, was rather cumbersome, and the resulting ontology seemed insufficient for the home videos produced by the test users. From an experience sharing point of view, the content delivery system developed in the project catered for mobile users. Annotating and automatically analysing videos for sharing was experimented with, but the results were not yet feasible for the end user. The use of ontologies was helpful in the retrieval of content, but may be of only limited use for user annotation of content, due to the necessarily limited (and often unavailable) ontologies and the laborious annotation task.
5.2 Context Aware Delivery and Interaction in Nomadic Media The Nomadic Media project (Eureka/ITEA, 2003-2005, [22]) was concerned with developing ways to make it easier for people to access and manage mobile content and services independently of time, location or device. The project particularly focussed on the use of context information for adapting services to the user's needs, but interaction issues related to managing content were also addressed. From the point of view of a service provider, the variety of mobile platforms that could possibly be used to access their services is a real problem. Due to the differences in interaction capabilities and screen size, each device would require its own version of the service, and maintaining all of these would be practically impossible. In the project, a set of tools was developed to alleviate this problem. The tools allowed flexible services to be built that adapted themselves to the features of the client device or even to user preferences. They were based on an extensive modelling system that started from tasks and automatically generated web pages, or pages adapted to mobile devices of different sizes. Besides the devices, user preferences could also be modelled and taken into account when rendering the services on the client device. The tools were
based on a UIDL (User Interface Description Language) developed in the project [20]. Another branch of the project worked on interaction technologies. In particular, the use of gestures to enhance mobile interaction and to control the environment was researched [11]. The gestures were recognised by means of motion sensors, and during the project a vocabulary of semantic gestures was developed. A particularly interesting application used a throwing gesture to transfer pictures from a mobile phone to a TV with a set-top box.
Fig. 3 Throwing pictures at the TV set
Besides the technical work, the project also extended the user-centred design methodology so that it could be applied to the kind of adaptive services developed in the project, and applied it to the pilot services. The early prototypes and the components developed by the participating companies were evaluated with end users. The results showed that prototypes of home consumer and entertainment electronics, and mobile services for sharing digital content (e.g. photos and music) between users from device to device using gesture UIs and NFC-based touching or pointing paradigms, were found useful, enjoyable and usable. The services included, for example, personal guidance and navigation systems and entertainment facilities, such as mobile multi-user gaming using either personal displays or public screens in public spaces (e.g. an airport) and while moving from one location to another [27]. As in the case of Candela, the creation of content was only marginally addressed by the project. Content storage and provision were handled by servers, which were able to serve mobile users in the way best suited to the client device and user situation. Creating a collaborative experience was best exemplified by the set-top box allowing users to "throw" pictures into an existing presentation. In this way the resulting picture gallery represents the joint experience of those participating in the session.
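The adaptation idea behind such tools can be illustrated with a toy rule that selects a presentation variant from a device profile. The profile fields, thresholds and returned options below are assumptions made purely for illustration; the project's actual UIDL-based tooling was far richer.

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    screen_width: int       # pixels
    screen_height: int
    touch: bool
    bandwidth_kbps: int

def choose_presentation(device: DeviceProfile) -> dict:
    """Pick layout, media quality and interaction style; thresholds are illustrative."""
    layout = "single-column" if device.screen_width < 480 else "two-column"
    video_quality = "low" if device.bandwidth_kbps < 256 else "high"
    interaction = "touch" if device.touch else "keypad"
    return {"layout": layout, "video": video_quality, "interaction": interaction}

phone = DeviceProfile(screen_width=240, screen_height=320, touch=False, bandwidth_kbps=128)
print(choose_presentation(phone))
# {'layout': 'single-column', 'video': 'low', 'interaction': 'keypad'}
```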
5.3 Kontti - Context Aware Services for Mobile Users Mobile services and applications are used in varying contexts and surroundings. When the services are made context-aware, they can offer contextually relevant information to the user. This facilitates the finding of data and creates new purposes of use. The Kontti (Context-aware services for mobile users) project (January 2002 - December 2003) designed and implemented a context-aware service platform and services. The platform enabled the management and sharing of contexts, presence information and contextual content, and it provided content adaptation and context-aware messaging. Personalisation and context tools were also implemented. The goal of the project was to develop concepts and tools for offering context-aware mobile services. Methods for adaptation and personalisation for managing these services were also created. Development goals included example services that adapt to the terminal, network, context of use and user preferences. The match between the services and the requirements of users and service providers was ensured by using human-centred design methods. The design process was started with a series of interviews, and a separate research track in Japan charted upcoming technology. Throughout the project, user requirements were gathered to ensure that user-friendly implementations and concepts of the technology would be used. The developed solutions were evaluated in the laboratory and in field trials. Design improvements could be made "on the fly" as well as in larger, distinct phases, and the contact between the developers, the evaluation team and the users benefited all parties. The resulting context-aware service platform provides personal management and sharing of contexts and presence information, content adaptation and context-aware messaging; contextual information can be viewed and managed within the system. Several user experience studies were performed during the project. The evaluated applications included everyday services, event services, tourist services and friend services, and some parts of the concept were also evaluated with mock-ups. Generally, in the evaluations the implemented system proved very promising and already usable. In addition, we received a lot of feedback from different user groups and views from service providers on the business aspects of context-aware services. One result of the context study was the specification of frame-based context ontologies. The implemented basic types included time, weekly, location and user-defined contexts. In addition, context operations were specified that could be applied to form context object hierarchies of any depth. The platform manages several kinds of content objects (links, notes, media files, directories) as well as friends and groups. In particular, contextual content provides context-aware views of object repositories and messages. The adaptation module performs content adaptation and media conversions for the user's terminal according to the current delivery context. The user can create and edit contexts, monitor the state of contexts, and publish contexts
to be visible to other users and groups. The friend and group objects were also used to specify the targets of context-aware messages. The platform enables a user to send messages that are delivered according to the specified target contexts of the recipients. Furthermore, it is possible to attach to a message any kind of object handled by the platform (contexts, content objects). This forms the basis of the implemented content sharing, which is one of the novel aspects of the developed system.
Fig. 4 With the Kontti service the users could themselves define contexts and share them with other users
Fig. 5 In Kontti field trials the users were very creative in defining their zones
In the Kontti system users could define their own contexts, called "zones" in the system. Users could choose which contexts were also visible to others, whom they were visible to and what kind of description was shown. In the Kontti field trials, communicating one's own situation to others turned out to be the primary reason for
creating a context of one's own, a "zone" in the system. The users found defining the contexts fun and easy. The verbal description gave an unlimited degree of freedom to describe and control the information they conveyed to others. Manually maintaining the currently active zone, however, required too much effort in mobile use. Contexts themselves were found to be a nice, informal way to communicate, but context-aware messages tied only to certain contexts were not considered very important. In the Kontti project, user-defined contexts were often related to the user's state of mind rather than to physical elements of context such as location. This provided evidence that users should have a stronger role in defining and sharing contexts. Sharing context information can be a very efficient way to share experience. The results of the Kontti project indicated that the most promising applications for context-aware services are event guides and professional use. Conveying context information also proved to be very interesting to the users. Furthermore, contexts can be used to open new communication channels for messaging: a context can act as a mediator from which any recipient can pick up a public message.
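The sketch below is a deliberately simplified, assumed model of two of the Kontti ideas described above: user-defined zones with controlled visibility, and messages targeted at a context so that they reach a recipient only while that context is active. The class and field names are inventions for this illustration and do not mirror the actual platform.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Zone:
    """A user-defined context ('zone' in Kontti terms) with a free-form description."""
    name: str
    description: str
    visible_to: List[str] = field(default_factory=list)   # users or groups allowed to see it

@dataclass
class User:
    name: str
    zones: Dict[str, Zone] = field(default_factory=dict)
    active_zone: str = ""
    inbox: List[str] = field(default_factory=list)

def send_context_aware(sender: str, recipient: User, target_zone: str, text: str) -> bool:
    """Deliver the message only while the recipient's active zone matches the target.
    A real platform would instead queue it until the target context becomes active."""
    if recipient.active_zone == target_zone:
        recipient.inbox.append(f"{sender}: {text}")
        return True
    return False

alice = User("alice")
alice.zones["at work"] = Zone("at work", "busy with a project", visible_to=["friends"])
alice.active_zone = "at work"

send_context_aware("bob", alice, "at work", "Lunch at 12?")
print(alice.inbox)
```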
5.4 m:Ciudad - A Metropolis of Ubiquitous Services The starting point of m:Ciudad (EU-FP7/ICT, 2007-, [17]) is the vision that the mobile devices carried by people and used for service access should allow for service provision as well. Due to the limited resources of mobile phones, the user-generated services in m:Ciudad should be relatively simple to create and use, and should carry a limited (but relevant) amount of information for a user group defined case by case. These have been called microservices. Hence, providing users with easy tools to become service providers as well is one of the challenges of the m:Ciudad project. Such tools would enable the combination of the two roles of mobile service user and service provider (the users would become 'prosumers'). Furthermore, this approach will lead to more user-centric, targeted, relevant and usable services to be shared within the defined user community. In addition to the services and the content they offer, the user interfaces, usability and the entire user experience play a central role in the creation, manipulation and management of services by individual mobile device users [18]. User evaluations for researching the user experience of microservices will be carried out later in the project. They will cover the entire microservice life cycle, e.g. the tools for creating and sharing the services within the community and the use of the services themselves.
5.5 Sharing with My Peer - ExpeShare The ExpeShare project (Eureka/ITEA, 2007-, [5]) took on the challenge of developing solutions for "experience sharing in mobile peer communities". It builds on previous projects like Candela, Nomadic Media and MINAmI and aims to go beyond content and focus particularly on (mobile) peer communication. For network solutions, it researches the feasibility of using peer networks between mobile devices for sharing the content. On top of these networks, services to manage communities are being provided. How to capture and share experiences, i.e. what extensions need to be made to the shared content, is also a research issue. Last but not least, interaction solutions that help mobile users to exchange content and experiences, e.g. via public screens or touch-based solutions, are also being researched.
Fig. 6 Vision of the use of ExpeShare services at a concert setting
The ExpeShare project is still ongoing. Here we report some considerations related to peer-to-peer communication. The term peer-to-peer communication is frequently used without being further defined. In the project we identified several levels of peer-to-peer communication, which pose entirely different challenges and only partly support each other. On a physical level, peer-to-peer is used to indicate that two devices communicate with each other directly. The term has become necessary to differentiate this from communication via servers or base stations, as is usually the case in web-based or mobile phone communication. Solutions include short-range wireless technologies such as the widespread Bluetooth, and the initiation of this communication can be accomplished through physical browsing technologies. On a network level, peer-to-peer communication is associated with the sharing of content via network technologies such as BitTorrent. These technologies use all participants in the network as hubs to pass on the content. All participants are thus peers, each acting as receiver and sender of content from time to time. The technologies are particularly advantageous in delivering content used by a large number
of users and help to avoid congestion at servers. As the content is spread throughout the network, it is difficult to stop or trace once it has been released. For this reason, the technology has been used for the illegal distribution of copyrighted content and has therefore obtained an undeserved connotation of illegality. The project intends to look for peer-to-peer networking solutions on wireless networks. This is not just the rather trivial issue of deploying P2P networking platforms as an overlay on, say, WLAN, but also needs to take into account features of mobile devices such as battery life and ad-hoc networking. From a service provider's point of view, peer-to-peer communication can be interpreted as allowing end-users of the service to have one-on-one contact with each other. In the project we extend this to peer communities - i.e. the communication of a community member with the other members of that community (the peers of that member). This differs from publishing on the web, for example by means of a blog, because the audience is limited. Research issues to consider here include how communities are managed - for example, how one joins and how an ad-hoc community is set up - and how to manage content and separate its visibility in the different communities one belongs to. As the project focuses on sharing experiences, it also addresses how context and presence information is distributed among the peers and how live information (e.g. video streams) can be delivered, even over mobile networks. Peer-to-peer services (or peer community services) can be provided over peer-to-peer networks, but this is not a necessity. Currently it seems that P2P services are rather independent of the underlying network technologies, and the choice of a P2P network technology should be based on the nature of the content to be distributed and on its performance compared with server-based solutions, rather than on its benefits for P2P services.
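The community-level notion of peer-to-peer sharing, in which an item is visible only to members of the communities it was shared with, can be made concrete with the following minimal sketch. The community names, user names and file names are hypothetical and serve only to illustrate the visibility rule.

```python
from typing import Dict, Set

# Hypothetical community memberships and per-item sharing lists.
communities: Dict[str, Set[str]] = {
    "climbing-club": {"anna", "ben", "carla"},
    "family":        {"anna", "david"},
}
shared_with: Dict[str, Set[str]] = {
    "trip_video.mp4": {"climbing-club"},            # shared only within one community
    "birthday.jpg":   {"climbing-club", "family"},
}

def can_see(user: str, item: str) -> bool:
    """A peer sees an item if they belong to at least one community it was shared with."""
    return any(user in communities.get(c, set()) for c in shared_with.get(item, set()))

print(can_see("david", "trip_video.mp4"))   # False: david is not in the climbing club
print(can_see("david", "birthday.jpg"))     # True: shared with the family community
```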
5.6 Ubimedia Based on Memory Tags Mobile devices are increasingly evolving into tools for orientating oneself in, and interacting with, the environment, thus introducing a user-centric approach to ambient intelligence (AmI). With a mobile device as a medium of communication, the user can interact with everyday objects and surroundings and obtain information and services from and related to his or her local environment. MINAmI (MIcro-Nano integrated platform for transverse Ambient Intelligence applications, 2006-2009, EU-FP6, [19]) is developing a mobile phone platform that can access sensors and tags embedded in the environment and in different everyday objects, as well as specific sensor and tag solutions. This mobile architecture facilitates different Ambient Intelligence applications. One part of the project is the development of memory tags. Conventional RFID tags, e.g. Near Field Communication (NFC) tags, normally have very limited memory capacity, and the same is true for one- and two-dimensional barcodes [10]. The memory tags being developed in the MINAmI project will be
ultra-low-power, small, low-cost devices that can be attached to all kinds of everyday objects. Their memory capacity will be several megabytes and their communication speed up to 50 Mb/s. The project is developing both readable and writeable tags. The memory tags will be passive, i.e. they do not have an energy source of their own but draw their energy from the mobile phone during read and write operations. In parallel with the technical development of memory tags and the related mobile platform, the MINAmI project has studied usage possibilities for memory tags as well as user acceptance of those usages. The aim has been to recognise usage and application requirements and to analyse their implications for the technical development of both the mobile terminal and the memory tags [8]. One of the most promising usage possibilities of memory tags is ubimedia. Ubimedia is a concept in which media files can be embedded in everyday objects and the environment. The user can access those files with his or her personal mobile phone simply by touching the physical objects. The concept facilitates easy access to media related to physical objects: for instance, music files can be provided as a bonus with a concert ticket, a movie trailer can be downloaded from a movie poster, or assembly instructions can be found on a piece of furniture as a video. The user can even write media files onto a tag with his or her mobile phone. The interaction paradigm in which the user can access digital information and services by physically selecting common objects or surroundings is called physical browsing [13]. Välkkynen et al. [30] have studied a physical browsing paradigm in which RFID tags embedded in the environment are accessed with a mobile phone. In their studies, touching and pointing turned out to be useful and complementary methods for selecting an object for interaction. In earlier physical browsing studies, tags have been used to store links to web services, so that a network connection is required to get to the actual information. The MINAmI approach is different, as memory tags facilitate the storage of large amounts of data, which can then be downloaded without a network connection. This will facilitate easier, faster and more cost-effective access to embedded media. Mäkelä et al. [21] conducted a field trial in which they studied initial user perceptions of interacting with a tagged poster. Their results showed that users who had not had any instruction about the technology expected the tag itself to store the digital information instead of providing a link to information in a network. This indicates that users may consider the memory tag concept quite natural. In the MINAmI project, the ubimedia concept has been studied by implementing proof-of-concept prototypes using existing technology, i.e. NFC tags and phones. The aim of the proofs of concept has been to illustrate the user interaction as concretely as possible, and their user evaluation has given concrete feedback on implementing physical browsing of ubimedia. The proofs of concept are presented in Fig. 7. The readable tag proof of concept illustrates downloading a movie trailer from a memory tag attached to a poster. The mobile phone screen shows a progress bar displaying how much of the video clip has already been downloaded; finally the trailer starts playing on the screen. The writeable tag proof of concept (Fig. 7) illustrates sharing museum experiences with a photo frame equipped with a writeable memory tag. The user can view photos
and messages left by other museum visitors by touching the photo frame with his or her phone. To open one of the messages, the user selects it on the phone screen and touches the tag again to download it. The user can also take a photo at the museum and store it in the frame, together with a message, to share with other visitors.
Fig. 7 Proof of concept of readable memory tag (left) and writeable memory tag (right)
The proofs of concept were evaluated with users: the readable tag in a usability laboratory and the writeable tag in a real museum. The users clearly preferred ubimedia downloads over web downloads on cellular networks, mainly because of speed, cost, reliability and simplicity. The test users were very interested in the possibility of producing their own content for memory tags to be shared locally, and the writing interaction was also found easy. The main user concerns were related to control over the interaction between the mobile phone and the memory tag, and to issues regarding security and reliability. The users were concerned about knowing exactly what was being downloaded and about protection against viruses and unwanted content. Other worries included possible compatibility problems with older phone models and ease of use for elderly and disabled people. Ethical challenges included effortless interaction that may lead to involuntary or accidental actions such as buying products, protection against viruses and other harmful content, as well as accessibility for disabled and elderly people. However, ubimedia also facilitates alternative content formats, which could be utilised for the benefit of disabled users [10]. Social media is currently a common phenomenon on the web, where people interact with each other and share content in virtual communities such as Facebook and YouTube. Memory tags embedded in the physical environment and in physical objects will extend social media to social ubimedia: people can then interact with each other and share content in actual physical environments. This will facilitate different local
applications. However, writeable memory tags also introduce many challenges. How should the content be managed, and what kind of metadata would it require? Should users have personalised access to the contents? Should the content be moderated, and if so, by whom?
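A toy, purely in-memory model of the touch-to-read and touch-to-write interaction with a writeable memory tag is sketched below. It uses no real NFC or memory tag API; the capacity figure and file names are assumptions chosen only to make the museum photo frame scenario concrete.

```python
class MemoryTag:
    """In-memory stand-in for a writeable memory tag attached to a physical object."""

    def __init__(self, capacity_bytes: int = 4 * 1024 * 1024):   # illustrative capacity
        self.capacity = capacity_bytes
        self.files = {}            # file name -> raw bytes

    def write(self, name: str, data: bytes) -> bool:
        used = sum(len(d) for d in self.files.values())
        if used + len(data) > self.capacity:
            return False           # tag full; a real device would report an error
        self.files[name] = data
        return True

    def read(self, name: str) -> bytes:
        return self.files[name]

# A museum photo frame: one visitor leaves a photo and a note, another reads them.
frame_tag = MemoryTag()
frame_tag.write("visit.jpg", b"...jpeg bytes...")
frame_tag.write("note.txt", "Loved the exhibition!".encode("utf-8"))
print(sorted(frame_tag.files))                       # what a touching phone would list
print(frame_tag.read("note.txt").decode("utf-8"))    # a later visitor opens the note
```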
6 Discussion This chapter has attempted to provide an overview of the methodologies and technologies needed to share content and experiences in smart environments. We have elaborated the meaning of experiences and shown that sharing experiences should be understood as an attempt to induce similar experiences in other people. In practice this is done by sharing content related to the experience, together with contextual information and personal annotations. We have also examined the way in which content provision is changing and how the separation between consumers and producers is becoming blurred. The reasons for this change can be traced back to technological developments allowing anyone to publish content, and to the social and semantic developments on the web that are often captured by the term Web 2.0. We have tried to show how these new paradigms from the internet world might also carry over into our real world, into what we call smart environments. We reviewed the change in research approach as exemplified by the ubiquitous computing and ambient intelligence paradigms, and suggested digital ecologies as the next step in smart environment research. We then gave an overview of some research issues and related projects that are relevant from this perspective. Digital ecologies may prove to be a major step forward in the provision of feasible technical solutions to the real needs of people. The main emphasis of the research will not be on technical developments per se, but on empowering people to create an environment they want to live in, with the technical support they appreciate. Products developed in the future will need to take into account that they are part of an ecology, not just stand-alone, single-function devices. A significant part of the functionality provided by the digital ecology will be related to experience management, enjoyment and sharing. Digital ecologies thrive because they support people in their activities and in communicating with their communities, but mostly because they give people the place they deserve: in the driver's seat.
7 Acknowledgements This chapter summarises results accomplished in the various projects mentioned. The authors would like to thank the researchers, partners and funding organisations for their support of this work.
References
[1] Aarts, E., Marzano, S. (eds): The New Everyday - Views on Ambient Intelligence. 010 Publishers, Rotterdam, The Netherlands, 352 p. (2003)
[2] Arhippainen, L., Tähti, M.: Empirical Evaluation of User Experience in Two Adaptive Mobile Application Prototypes. In: Proc. of the 2nd International Conference on Mobile and Ubiquitous Multimedia (MUM2003), pp. 27-34, Norrköping (2003)
[3] Battarbee, K.: Co-Experience - Understanding User Experiences in Social Interaction. Academic Dissertation. Publication series of the University of Art and Design, A51, Helsinki (2004)
[4] Candela project web pages: http://www.hitech-projects.com/euprojects/candela/
[5] ExpeShare project web pages: http://www.vtt.fi/virtual/expeshare
[6] Hotho, A., Jäschke, R., Schmitz, C., Stumme, G.: Information Retrieval in Folksonomies: Search and Ranking. In: The Semantic Web: Research and Applications, LNCS 4011, Springer, Heidelberg, pp. 411-426 (2006)
[7] Jaspers, E., Wijnhoven, R., Albers, R., Nesvadba, J., Lukkien, J., Sinitsyn, A., Desurmont, X., Pietarila, P., Palo, J., Truyen, R.: CANDELA - Storage, Analysis and Retrieval of Video Content in Distributed Systems. In: Lecture Notes in Computer Science, Vol. 3877, pp. 112-127. Springer-Verlag (2006) doi: 10.1007/1167083410
[8] Kaasinen, E., Ermolov, V., Niemelä, M., Tuomisto, T., Välkkynen, P.: Identifying User Requirements for a Mobile Terminal Centric Ubiquitous Computing Architecture. In: Proc. International Workshop on System Support for Future Mobile Computing Applications (FUMCA'06). IEEE (2006)
[9] Kaasinen, E., Norros, L. (eds): Älykkäiden ympäristöjen suunnittelu - kohti ekologista systeemiajattelua (Design of Smart Environments - Towards an Ecological and Systemic Approach). Teknologiateollisuus ry, Helsinki, 320 p. (In Finnish) (2007)
[10] Kaasinen, E., Jantunen, I., Sierra, J., Tuomisto, T., Välkkynen, P.: Ubimedia Based on Memory Tags. In: MindTrek 2008 Conference Proceedings. ACM (2008)
[11] Kela, J., Korpipää, P., Mäntyjärvi, J., Kallio, S., Savino, G., Jozzo, L., Di Marca, S.: Accelerometer-Based Gesture Control for a Design Environment. Personal and Ubiquitous Computing, Vol. 10, No. 5, Springer, pp. 285-299 (2006)
[12] Kelley, T.: The Ten Faces of Innovation. Doubleday, New York (2005)
[13] Kindberg, T.: Implementing Physical Hyperlinks Using Ubiquitous Identifier Resolution. In: Proc. Eleventh International Conference on World Wide Web, pp. 191-199. ACM Press (2002)
[14] Kuniavsky, M.: Observing the User Experience: A Practitioner's Guide to User Research. Morgan Kaufmann, San Francisco (2003)
[15] Kuutti, K., Keinonen, T., Norros, L., Kaasinen, E.: Älykäs ympäristö suunnittelun haasteena (Smart Environment as a Design Challenge). In: Kaasinen, E., Norros, L. (eds.) Älykkäiden ympäristöjen ekologinen suunnittelu (Ecological Approach to the Design of Intelligent Environments), pp. 33-51. Teknologiateollisuus (In Finnish) (2007)
[16] Lenat, D.B.: Cyc: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM 38, No. 11, November (1995)
[17] m:Ciudad project web site: http://www.mciudad-fp7.org
[18] m:Ciudad - A Metropolis of Ubiquitous Services (brochure): http://www.mciudad-fp7.org/pdf/mCiudad-FP7_Brochure.pdf
[19] MINAmI project web site: http://www.minami-fp6.org
[20] Mueller, W., Schaefer, R., Bleul, S.: Interactive Multimodal User Interfaces for Mobile Devices. In: Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS'04), Track 9 (2004)
[21] Mäkelä, K., Belt, S., Greenblatt, D., Häkkilä, J.: Mobile Interaction with Visual and RFID Tags - a Field Study on User Perceptions. In: Proc. CHI 2007, pp. 991-994. ACM (2007)
[22] Nomadic Media project web pages: http://www.hitech-projects.com/euprojects/nomadic-media/
[23] Paluch, K.: What is User Experience Design. http://www.paradymesolutions.com/articles/what-is-user-experience-design/
[24] Janhila, L.: Elämyskolmiomalli tuotekehityksessä (In Finnish). LCEEI Lapland Centre of Expertise for the Experience Industry, http://www.elamystuotanto.org/?deptid=21890
[25] Sachinopoulou, A., Mäkelä, S.-M., Järvinen, S., Westermann, U., Peltola, J., Pietarila, P.: Personal Video Retrieval and Browsing for Mobile Users. In: Proc. of the SPIE Conf. on Multimedia on Mobile Devices, Vol. 5684, pp. 219-230. SPIE - The International Society for Optical Engineering, USA (2005) doi: 10.1117/12.596699
[26] Shedroff, N.: Experience Design 1. New Riders Publishing, Indianapolis (2001), http://www.nathan.com/ed/index.html
[27] Strömberg, H., Leikas, J., Suomela, R., Ikonen, V., Heinilä, J.: Multiplayer Gaming with Mobile Phones - Enhancing User Experience with a Public Screen. In: Maybury, M. et al. (eds.) INTETAIN 2005, Lecture Notes in Artificial Intelligence 3814, pp. 183-192. Springer-Verlag, Berlin Heidelberg (2005)
[28] Tarssanen, S., Kylänen, M.: A Theoretical Model for Producing Experiences - a Touristic Perspective. In: Kylänen, M. (ed.) Articles on Experiences 2, pp. 134-154. Lapland Centre of Expertise for the Experience Industry, Rovaniemi (2006)
[29] Tarssanen, S., Kylänen, M.: Entä jos elämyksiä tuotetaan? (In Finnish). In: Karppinen, S., Latomaa, T. (eds.) Elämys ja seikkailu (2007)
[30] Välkkynen, P., Niemelä, M., Tuomisto, T.: Evaluating Touching and Pointing with a Mobile Terminal for Physical Browsing. In: Proc. 4th Nordic Conference on Human-Computer Interaction, pp. 28-37. ACM (2006)
[31] Weiser, M.: The Computer for the 21st Century. Scientific American, 265, pp. 94-104 (1991)
User Interfaces and HCI for Ambient Intelligence and Smart Environments
Andreas Butz
1 Input/Output Devices
As this book clearly demonstrates, there are many ways to create smart environments and to realize the vision of ambient intelligence. But whatever constitutes this smartness or intelligence has to manifest itself to the human user through the human senses. Interaction with the environment can only take place through phenomena which can be perceived through these senses and through physical actions executed by the human. Therefore, the devices which create these phenomena (e.g., light, sound, force, ...) or sense these actions are the user's contact point with the underlying smartness or intelligence. In the PC paradigm, there is a fixed set of established input/output devices and their use is largely standardized. In the novel field of Ambient Intelligence and Smart Environments, there is a much wider variety of devices, and the interaction vocabulary they enable is far less standardized or even explored. In order to structure the following discussion, we will look at input and output devices separately, even if they sometimes coincide and mix.
1.1 Displays
The visual sense is the most widely used in traditional human-computer interaction, and this situation persists in smart environments. Displays hence constitute the vast majority of output devices. They can be categorized along several dimensions, among which are their size and form factor, their position and orientation in space, and their degree of mobility.
Andreas Butz, University of Munich, Germany, e-mail: [email protected]
1.1.1 Form Factor of Displays
The importance of the size and form factor of display devices was already recognized in the pioneering projects of the field, such as the ParcTab project [29]. In this project, the notion of tabs, pads and boards was established in analogy to classical form factors and usage scenarios of information artifacts. The tabs correspond to small hand-sized notepads and provide very personal access to information. Figure 1 shows such a device. The pads correspond to larger notepads or books and enable sharing of information between individual users, as well as collaboration. The boards correspond to classical chalkboards or whiteboards and enable presentation and sharing of information with a group of people.
Fig. 1 The Xerox ParcTab device as presented in [29]
This distinction has survived until today and is reflected in the device classes of PDA or smart phone, laptop or tablet PC, and commercially available interactive whiteboards or tables. It hence seems to provide a useful classification along the size dimension.
1.1.2 Position and Orientation of Displays
The next important property of any stationary display is its position and orientation in space. Board-sized displays, for example, exist in a vertical form as proper boards, or in a horizontal form as interactive tables. The usage scenarios and interaction techniques afforded by these two variants differ strongly.
Because standing in front of a board for a longer time quickly becomes tiring, vertical large displays are often used as pure output channels, or interaction only happens from a distance, where users can be conveniently seated. An interactive table, on the other hand, invites users to gather around it and interact on its surface for almost arbitrary amounts of time. This immediately brings about another important distinction between vertical and horizontal displays: on a vertical display, all users share the same viewing direction, and since there is a well-defined up direction, the orientation of text and UI elements is clearly defined. If multiple users gather around an interactive table, there is no clearly defined up direction, and hence it is highly questionable which orientation is right for text or UI elements. Some projects evade this problem by using interactive tables in a writing-desk manner, i.e., for a single user and with a clear orientation, but if taken seriously, the orientation problem on horizontal interactive surfaces is a quite substantial design problem. To date, there are basically three ways of dealing with it: researchers have tried to a) remove any orientation-specific content, such as text, b) keep orientation-specific content moving, so that it can be recognized from any position for part of the time, or c) orient such content in the (sometimes assumed) reference frame of the user, resulting, for example, in text which is always readable from the closest edge of the table. For a good discussion of spatiality issues in collaboration see [8].
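The third strategy can be illustrated with a few lines of geometry. The following sketch is purely illustrative and not taken from any of the systems cited here; it assumes a rectangular tabletop with its origin in one corner and returns the rotation that makes a label readable from its closest edge.

```python
def label_rotation(x, y, table_width, table_height):
    """Rotation (in degrees) that orients a label at (x, y) towards the
    closest edge of a table_width x table_height tabletop (origin top-left).
    The mapping of edges to angles is an assumption of this sketch."""
    distance_to_edge = {
        0:   y,                   # top edge is closest: no rotation
        90:  table_width - x,     # right edge is closest
        180: table_height - y,    # bottom edge is closest
        270: x,                   # left edge is closest
    }
    return min(distance_to_edge, key=distance_to_edge.get)

# A label at (10, 50) on a 120 x 80 table is closest to the left edge:
print(label_rotation(10, 50, 120, 80))   # -> 270
```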
1.1.3 Mobility of Displays
For non-stationary displays, their degree of mobility is an important factor for their usage. The smaller a display is, the more likely it is to be used in unknown contexts or even in motion. Smart phones, for example, enable a quick look while walking, whereas tablet PCs require some unpacking and laptops even require a flat surface for interaction. The smartness of the environment has to account for these different usage scenarios and for the changing positions and physical and social contexts of mobile displays. On the other hand, display mobility can even be actively used as an input modality, as described in [7]. The Chameleon is a (within limits) mobile display which shows a 3D model as if seen from the device's current position and orientation. The device thereby creates a kind of porthole through which the user can look into a virtual world that is otherwise hidden from the user. This is a very compelling and rather early example of using device mobility for the inspection of an information space which is overlaid on the physical space.
Fig. 2 The Boom Chameleon, whose base concept is described in [7]
1.2 Interactive Display Surfaces
Over the last two years, interactive surfaces have become a research focus of many HCI groups around the world. They combine visual output of sorts with immediate
input on the same surface. Simple examples are the commercially available interactive whiteboards, or any touchscreen of sufficiently large size. The majority of interactive surfaces currently under investigation are either interactive tabletops or interactive walls. The rapidly growing number of submissions to the IEEE Tabletops and Interactive Surfaces Workshop documents this trend. One can claim that the field started with projects such as the MetaDesk [25] or the InteracTable [24], which is shown in Figure 3. For a long time, input was restricted to a single contact point, as in early touch screen technologies, or relied on rather flaky and unreliable camera-based input. Recently, a number of input technologies for interactive surfaces have appeared which enable so-called multi-touch input, i.e., the recognition of many contact points. The FTIR technology [12] is one of the most widely used and is readily available to researchers as an open source implementation. Commercially, interactive multi-touch tables such as the Microsoft Surface table are starting to appear on the market.
Fig. 3 The InteracTable, an early commercial interactive tabletop [24]
Interactive tabletops enable a number of collaborative scenarios, since they lend themselves very naturally to multi-user interaction. Since these multiple users will often approach the table from different directions (and thereby benefit from true face-to-face collaboration), the interface design needs to account for different usage directions, and hence directionality becomes an important design problem and research topic here. Interactive walls do not raise this directionality issue, but because of their inherently large scale they raise other challenges related to large displays, such as the positioning of interactive elements, focus and context of human perception [3], and fatigue of their users. It can be expected that interactive surfaces will become much more widespread as the technologies above mature and novel, even more robust technologies appear.
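To give an impression of how camera-based multi-touch input works in principle, the following sketch detects bright blobs in an infrared camera frame, roughly in the spirit of FTIR setups. It is not the open source reference implementation mentioned above; it assumes a grayscale frame and OpenCV 4.x, and the threshold values are illustrative.

```python
import cv2

def detect_touch_points(ir_frame, brightness_threshold=60, min_area=30):
    """Treat bright spots in an infrared camera frame as finger contacts
    and return their centroids (a much simplified FTIR-style pipeline)."""
    blurred = cv2.GaussianBlur(ir_frame, (5, 5), 0)
    _, binary = cv2.threshold(blurred, brightness_threshold, 255,
                              cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    touches = []
    for contour in contours:
        m = cv2.moments(contour)
        if m["m00"] > min_area:                       # skip small noise blobs
            touches.append((m["m10"] / m["m00"],      # centroid x
                            m["m01"] / m["m00"]))     # centroid y
    return touches
```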
1.3 Tangible/Physical Interfaces
Another concept which is sometimes used together with interactive display surfaces is that of Tangible User Interfaces, or TUIs, as described in [9]. A TUI provides physical objects for the user to grasp. Early projects in this field used physical objects either as placeholders for information [26] (see also Fig. 4), as handles to digital objects [27], or as actual tools which act upon digital data [25]. Feedback is mostly provided on separate screens or on the interactive surface on which the TUI is used. The main advantage of a TUI is its physicality and the fact that humans have learned to manipulate physical objects all their life. Manipulating a physical placeholder, handle or tool hence borrows from human life experience and provides haptic feedback in addition to visual or acoustic feedback. It thereby uses an additional sensory channel for the interaction between human and machine.
Fig. 4 The MediaBlocks Tangible User Interface described in [26]
1.4 Adapting Traditional Input Devices
One popular route to interaction in Smart Environments leads through existing interaction devices from other computing paradigms, such as mice, keyboards, data gloves, or tracking systems normally used for Virtual or Augmented Reality.
1.4.1 Hardware Hacking
Since many of these devices are rarely suitable for interaction in their original form, but are easily accessible for programming, there is a strong incentive to modify or "hack" devices such as mice, or even consumer hardware such as remote controls. The rope interface described in [4] and the Photohelix [13] both use the hardware of a commercially available, cheap, and easily programmable mouse to achieve entirely different forms of input. Remote controls have been used as simple infrared beacons or have been hacked to control the consumer applications they were built for through a very different physical interface [6] (see also Fig. 5).
1.4.2 Mobile Phones
The mobile phone has assumed a special role in the context of HCI for Ambient Intelligence and Smart Environments. The fact that it is so widespread and well-known, as well as its increasing technical capabilities, enables a wide range of usage scenarios for mobile phones beyond making phone calls. The fact that mobile phones nowadays provide a rather powerful computing platform with built-in networking capabilities makes them an ideal platform for prototyping ultra-mobile interaction devices. If the phone is just used as a mobile display with a keyboard, its small screen and limited input capabilities have to be accounted for by a suitable interface design. If the phone includes additional sensors, such as acceleration sensors, NFC readers or simply a camera, this enables other and more physical types of interaction. For a detailed discussion, see [22].
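As a hedged illustration of such sensor-based interaction, the sketch below maps a single accelerometer reading to a coarse tilt gesture. Axis conventions, units and thresholds vary between phones and are assumptions here, not part of any particular platform API.

```python
import math

def classify_tilt(ax, ay, az, threshold_deg=30.0):
    """Classify a (gravity-only) accelerometer reading into a coarse tilt
    gesture. The axis conventions and the 30 degree threshold are assumptions."""
    pitch = math.degrees(math.atan2(ax, math.sqrt(ay * ay + az * az)))
    roll = math.degrees(math.atan2(ay, math.sqrt(ax * ax + az * az)))
    if pitch > threshold_deg:
        return "tilt_left"
    if pitch < -threshold_deg:
        return "tilt_right"
    if roll > threshold_deg:
        return "tilt_forward"
    if roll < -threshold_deg:
        return "tilt_backward"
    return "flat"

print(classify_tilt(0.0, 6.0, 7.8))   # -> tilt_forward (roll of about 37 degrees)
```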
Fig. 5 A TUI for media control, using a hacked remote control to actually control a hi-fi unit, described in [6]
1.5 Multi-Device User Interfaces
Many interactions in Smart Environments involve multiple devices. In fact, in many cases, the actual user interface is distributed over different devices, taking input from one device and feeding output through another. Examples were already given in the ParcTab project [29], where the ParcTab device could be used to remotely control the pointer on a board device. More generally, one can speak of device ensembles [23], which aggregate single devices into a coherent combination. It is a major challenge to structure, develop and run such multi-device user interfaces in a consistent and stable way, and the coordination of multiple devices is therefore often managed in a middleware layer.
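A minimal sketch of such a middleware layer is a publish/subscribe event bus that routes input events from one device to handlers registered by another. This is a toy stand-in for real device-ensemble middleware, not a description of any particular system.

```python
class DeviceEnsemble:
    """Toy coordination layer: devices publish input events, and other
    devices subscribe to the event types they want to render or act on."""

    def __init__(self):
        self.handlers = {}            # event type -> list of callbacks

    def subscribe(self, event_type, callback):
        self.handlers.setdefault(event_type, []).append(callback)

    def publish(self, event_type, payload):
        for callback in self.handlers.get(event_type, []):
            callback(payload)

# Usage sketch: a handheld remote-controls the pointer on a board display.
ensemble = DeviceEnsemble()
ensemble.subscribe("pointer_move", lambda pos: print("board cursor at", pos))
ensemble.publish("pointer_move", (0.4, 0.7))   # normalized position sent by the handheld
```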
2 Humans Interacting with the Smart Environment
In Ambient Intelligence and Smart Environments, the computing power of the environment pervades the physical space in which the user actually lives. It hence becomes extremely important to investigate the user's perspective of these interactive environments. Today, it is state of the art to use human-centered engineering methods for classical PC-style graphical interfaces, and in this domain the strategies and methods are reasonably well understood. A survey of current practices is given in [28]. The methods of user-centered design have been extended to other interaction paradigms as well, as described in [10]. While in PC-style interfaces the interaction bandwidth between the human and the computer is relatively low (limited by the use of mouse, keyboard and screen), this bandwidth is much higher for interaction with smart environments. The human user can interact using her or his entire body and multiple senses. Units of information can occupy the same space the user actually lives in. This situation makes the transfer of classical user-centered design and development methods, but also of evaluation methods, much more difficult, since interaction becomes very rich and highly context-dependent.
2.1 Metaphors and Conceptual Models
When computers first appeared, their operators had to write their own programs to solve a problem. When computers became more widespread, non-programmers also wanted to use them, and the user was no longer necessarily the same person as the programmer. Users had to understand how to operate a computer, and with the increasing complexity of computing systems came the need to provide an understandable user interface. One promising strategy for keeping the operation of a novel device understandable is to employ a metaphor of a known device or part of the world. In the mainframe world, human-computer interaction was structured along the metaphor of a typewriter.
Text was typed on a typewriter keyboard, and results came out of a printer or, later on, were scrolled on a CRT monitor. In the PC world with its graphical user interfaces, the dominant metaphor for operating a computer has become the desktop metaphor, which employs the analogies of files, folders, documents and a trash can, for example. It forms the basis for our conceptual understanding of the computer, i.e., for our conceptual model. In Ambient Intelligence and Smart Environments, there is no widely accepted common metaphor or other conceptual model (yet).
2.2 Physicality as a Particularly Strong Conceptual Model
One conceptual model that seems very natural for ambient intelligence is physicality. Since computing moves into the real world, and everything else in this world follows the laws of physics, it seems to make sense that digital objects, or at least their (visual, haptic, acoustic, ...) manifestations, should also follow these laws. Just as the desktop metaphor breaks down in places (the most prominent example being the trash can on the desk instead of under it), physicality too cannot be taken too literally. Otherwise, digital objects would slide down vertical displays, and other absurd things would happen. It therefore makes sense to deviate from perfect physicality where needed, but to keep the general direction nevertheless.
2.2.1 Types and Degrees of Physicality
In [14], three variations on the physicality theme are discussed, as they appear in the user interface of a brainstorming application for a Smart Environment. Figure 6 shows the user interface of this application, which does not contain any conventional graphical widgets and incorporates physical behavior of the displayed artifacts in various forms.
Fig. 6 A multi-device, multi-user UI using various forms of physicality as a guiding principle [14]
Pseudo-physicality describes an apparently physical behavior which in fact is far from physical reality, but still close enough to be recognized as physical, thereby enabling the assumed benefits of physicality, such as understandability. Sticky notes, for example, could be grouped together on the wall display in order to form clusters of related ideas. The actual lines people drew around the notes would then be beautified into smooth shapes, as if a rubber band had been put around them. This was a relatively abstract use of a physical analog. Hyper-physicality describes a situation in which a physical behavior is simulated quite accurately, but is then applied to an object which does not actually exhibit this behavior in the real world. Sticky notes on the table could be flicked across the table to other users in order to simulate the act of handing over a physical note to somebody else. In order to do this, users had to touch the note with the pen or finger, accelerate it in the intended direction, and let go of it while in motion. The note would then continue to slide across the table with the physical behavior of a billiard ball and eventually stop due to simulated friction. This was a realistic simulation of physics, but this type of physics can never be seen with physical sticky notes, since they are much too light to skid across a table. Meta-physicality describes the situation in which a physical analogy is only the starting point for a design, but functionality is then added which goes substantially beyond this analogy. One corner of the digital sticky notes seemed to be bent upwards, which - when found on a physical sticky note - makes it very easy to lift the note at this corner and flip it over. This was a very direct analogy to the physical world and was used both as a visual cue to turn the note and as the actual sensitive area for turning it. On the back side users would find an icon which, when touched, would copy the note. This cannot be done with physical sticky notes at all. There are, of course, many other examples where physicality is employed as the guiding principle of a user interface. One of the most prominent systems in this field is the Bumptop interface as described in [1]. It relies on a physical simulation and makes all graphical objects in the interaction behave in a very physical way, but then adds stylus gestures and menus on top of this in order to handle complex interactions. An even more thoroughgoing application of physicality in the interface can be found in [31], where the physical simulation is the only application logic involved. This creates an interface that can be appropriated by the user: users can try out the interface and develop their own strategies for solving tasks in it.
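The hyper-physical flick described above boils down to a very small simulation: the note keeps the velocity it had when released and decelerates under constant simulated friction. The following sketch, with illustrative constants, shows the idea; it is not the code of the system in [14].

```python
def flick(x, y, vx, vy, friction=200.0, dt=0.01):
    """Simulate a flicked note: constant deceleration until it stops.
    Units and the friction constant are illustrative."""
    trajectory = [(x, y)]
    speed = (vx ** 2 + vy ** 2) ** 0.5
    while speed > 1e-3:
        x += vx * dt
        y += vy * dt
        decel = min(friction * dt, speed)    # friction can only slow the note down
        scale = (speed - decel) / speed
        vx, vy, speed = vx * scale, vy * scale, speed - decel
        trajectory.append((x, y))
    return trajectory                        # positions until the note comes to rest
```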
Fig. 7 The Bumptop interface, using a physical simulation [1]
2.3 Another Example: the Peephole Metaphor
Another example of a conceptual model borrowing from known things in the real world is the Peephole Metaphor described in [5]. The peephole metaphor's core idea is a virtual layer superimposed on a physical environment. While normally imperceptible, this layer can become visible, audible, or otherwise sensible when we open a peephole from our physical world into the virtual layer. Such a peephole might open as a display (visual peephole), a loudspeaker (acoustic peephole), or some other device. In a living room, for example, a display on a coffee table's surface might show users a collection of personal photographs. The virtual layer has three basic properties: spatial continuity, temporal persistence, and consistency across peepholes. If two different portable displays are moved to the same position in the environment, they display the same information (though not necessarily using the same graphical representation). Peepholes can be implemented through a tracked, head-mounted display (HMD), making the virtual layer a virtual world that is superimposed on the real world wherever we look, producing the general paradigm of spatial augmented reality (AR). Another option is to implement the peephole using a position-tracked handheld computer. This adequately describes the Chameleon (see Fig. 2), which implements a spatially continuous virtual layer that is visible only through the peephole opened by the tracked display. Peepholes in a Smart Environment work much like magic lenses on a 2D screen. When interactivity is added, they behave like toolglasses. Although the peepholes are not as general and immediate as physicality, they allow a number of other analogies to be used (filters, wormholes, ...) on top of them, and thus provide a basis for understanding the behavior of the smart environment by using analogies to known objects.
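The core of the peephole idea can be sketched in a few lines: the tracked position of a display selects which part of the virtual layer is rendered. The 2D, cell-based layer below is a deliberate simplification (real peepholes use full 3D tracking), and all names are hypothetical.

```python
def peephole_view(device_pos, device_size, virtual_layer):
    """Return the portion of a 2D virtual layer visible through a tracked
    display. The layer is modelled as a dict mapping (x, y) cells to content."""
    x0, y0 = device_pos
    w, h = device_size
    return {(x, y): item for (x, y), item in virtual_layer.items()
            if x0 <= x < x0 + w and y0 <= y < y0 + h}

layer = {(2, 3): "photo_holiday.jpg", (7, 1): "shopping_list"}
print(peephole_view((0, 0), (5, 5), layer))   # -> {(2, 3): 'photo_holiday.jpg'}
```

Because the returned content depends only on the tracked position, two displays moved to the same spot show the same information, which is exactly the consistency property described above.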
2.4 Human-centric UIs
The fact that the user inhabits the same physical space which is now entered by computing power in various forms makes this novel form of computing much more accessible, but also potentially much more threatening or tedious for the user. It is absolutely indispensable that these technologies are designed around the human and her or his physical and mental capabilities.
2.4.1 Bimanuality
While interaction in the PC world is reduced to the keyboard and mouse, interaction with a smart environment can be much richer. If we only look at the hands, they can already do much more on an interactive surface than on a keyboard. UI design for interactive surfaces needs to account for known mechanisms such as bimanuality. In [11], the kinematic chain provides a model for asymmetric bimanual interaction, which is the standard case when humans interact in the real world. This suggests that interfaces which distribute functionality to both hands according to this model will be more convenient or natural than symmetrically bimanual interfaces.
2.4.2 Body-centric Interaction
Another effect of computing being spread throughout the environment is that a UI can assume much larger scales than in the PC world. If the user interacts with a wall-sized display, for example, an interactive element in a fixed position, such as the start menu on a Windows desktop, becomes infeasible. The user would have to walk back and forth to this interaction element, because it is fixed in the reference frame of the environment. If we put the user in the center of the UI design, interaction elements should rather live in the reference frame of the user, i.e., move along with the user. Finally, the entire body can become part of the interaction, leading to entirely novel interface concepts, such as the exertion interfaces described in [17].
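A body-centric placement rule can be as simple as the following sketch: an interaction element on a wall-sized display stays at a fixed offset from the tracked user position and is clamped to the display bounds. All names and values are illustrative assumptions.

```python
def menu_position(user_x, wall_width, menu_width, offset=0.2):
    """Keep a menu in the user's reference frame on a wall display:
    a fixed offset (in metres) beside the tracked user, clamped to the wall."""
    desired = user_x + offset
    return max(0.0, min(desired, wall_width - menu_width))

# As the user walks from 1.0 m to 4.5 m along a 5 m wall, the menu follows:
for x in (1.0, 2.5, 4.5):
    print(menu_position(x, wall_width=5.0, menu_width=0.6))
```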
3 Designing User Interfaces for Ambient Intelligence and Smart Environments
The design process for user interfaces in Ambient Intelligence and Smart Environments is subject to ongoing research and speculation. While there is a rather established set of tools for user-centered design of interactive systems in the PC world, there is no established development policy for ambient intelligence yet. Research
currently tries to transfer human-centered engineering practices to this new computing paradigm, to clarify the differences, and account for them.
3.1 The User-centered Process
Many elements of current user-centered development practice can be used in the design of Ambient Intelligence interfaces. The authors of [19] give a good overview of the various stages of user involvement in the design process, in the concept and requirements phase as well as in the evaluation phase. They discuss prototyping at different levels and recommend a looped process which continues to involve users to evaluate intermediate design or prototype stages. Additional insights about the nature of physical design of interactive everyday objects are provided in [18].
3.2 Novel Challenges
The fact that interactive systems in a smart environment invade the user's actual living environment much more makes their use much less predictable. There is no well-defined interaction situation or context anymore, and interaction can happen serendipitously or casually. As the physical user interface devices may move out of the direct focus of the user, so will the interaction process itself. Users can interact with their environment implicitly or on the side, during other activities. Interaction thus needs to be interruptible, and in many cases the entire interface needs to remain rather unobtrusive in order not to overload the space which also serves as a living environment for the human. Another challenge in the development of user interfaces for Ambient Intelligence and Smart Environments is the fact that they might employ novel input devices, or even consist of a mixture of physical and digital elements. The computer scientist, who is experienced in developing software and hence purely digital user interfaces, must also learn the skill set of, or be supported by, an industrial designer. The latter knows about the design of physical objects and interface elements. If a user interface involves both physical and digital parts, its design therefore requires a much wider skill set. To make things worse, there is even a clash of cultures between software design and physical design. Most classical approaches to software design assume a relatively clear idea of the intended result after an initial concept phase and then work iteratively towards this goal. This process is mainly convergent once the initial concept is done. Product designers, on the other hand, work in a much more divergent way. In the beginning, they create a very wide range of possible designs, which are only sketched. After selecting a few of these designs and discarding others, the selected designs are taken to the next level of detail, for example by carving them from styrofoam to get an impression of their spatiality. Eventually, this process also
converges on one final design, which is then built in the form of a functional prototype, but the entire process is much broader than that of a software designer.
4 Evaluating User Interfaces for Ambient Intelligence and Smart Environments
Evaluating user interfaces for Ambient Intelligence and Smart Environments faces the same challenges as designing and developing them. Since the usage context is mostly unknown beforehand, it is hard to make predictions, and an evaluation that assumes one particular context may be entirely worthless in other situations.
4.1 Adaptation of Existing Evaluation Techniques
Just as for the design methods, a first step is the transfer of known techniques from the PC world. These include early evaluation techniques, such as expert evaluation or cognitive walkthrough, but also detailed qualitative and quantitative user studies. Again, [19] gives a good overview of the respective processes. When trying to transfer predictive models of human-computer interaction, these have to be adapted to the increased expressivity and interaction bandwidth in Smart Environments. The authors of [15], for example, extend the keystroke-level model from the desktop PC world to advanced mobile phone interactions. Other studies have shown that the well-known Fitts' law, which predicts selection speed with a pen or mouse on PC screens, does not hold in its original form on large displays. While PC interfaces can be evaluated in the very controlled and convenient setting of a usability lab, mobile or ubiquitous user interfaces can hardly be evaluated there with regard to the full spectrum of their potential usage contexts. In response, the concept of a living lab has emerged, in which users actually inhabit a Smart Environment, and their interactions with it can be studied and to some extent also evaluated [16].
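As an example of such a predictive model, the Shannon formulation of Fitts' law estimates movement time from target distance and width. The constants below are illustrative; as noted above, on large displays the model has to be re-fitted or extended rather than reused unchanged.

```python
import math

def fitts_movement_time(distance, width, a=0.1, b=0.15):
    """Shannon formulation of Fitts' law: MT = a + b * log2(D/W + 1).
    a and b are device-specific regression constants (illustrative here)."""
    return a + b * math.log2(distance / width + 1)

# A 60 cm reach to a 3 cm wide target:
print(round(fitts_movement_time(60, 3), 2))   # -> 0.76 seconds with these constants
```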
4.2 Novel Evaluation Criteria
Finally, as this novel type of interactive application pervades the life of its users much more, effectiveness and efficiency, which are by far the dominant criteria for evaluating PC interfaces, become less important. Issues of unobtrusiveness, integration into the environment and its processes, playfulness and enjoyability are increasingly important for these interfaces. What exactly the criteria are is the topic of ongoing research.
Many aspects of interaction in Smart Environments are more social than technical in nature. They might have to do with social acceptance, social protocols and etiquette, but also with stability and fault-tolerance as a basis for user acceptance. Systems that support everyday life are difficult to evaluate in the lab. Their evaluation therefore only becomes possible once they are integrated into the actual everyday environment. This somewhat invalidates the classical approach of evaluating prototypes of an interactive system in the lab before fully implementing and commercializing it.
5 Case Studies
The following is a collection of examples of different interaction concepts in smart environments. This collection neither claims to be complete nor to represent every group which has worked in this area. In order to gain a historical perspective, the chosen examples are presented roughly in chronological order, and while the beginning is relatively general, the list becomes more and more specific towards the end, focusing on physical interaction as a basis for HCI in smart environments, which is, of course, only one of several possible directions.
5.1 Applications in the ParcTab Project
Since the ParcTab project [29] is one of the fundamental projects in ubiquitous computing, it is well worth having a closer look at the applications that were written for the ParcTab device (see Figure 1) and the other displays in that environment, which were discussed in Section 1. On the input side, the ParcTab project used (from today's perspective) rather conventional technologies and techniques, such as pushbuttons, lists, pen input, and a predecessor of the Graffiti handwriting recognition called Unistroke. At the time, though, these were novel technologies, and the novelty consisted in the mobility and context sensitivity of the devices, as well as in the coordination across multiple devices. The applications on the Tab included (nowadays familiar) PIM applications, such as a calendar, notepad and address book, but also a number of applications that made use of the networking and location infrastructure, such as a file browser, a web browser, an email application, and a location-dependent pager. Finally, a number of applications worked across devices: with a remote pointing and annotation tool, multiple people in the audience could control their own cursor on a large presentation display. A voting tool allowed quick polls to be taken in a meeting, and a remote control application was built to control light and heat in the room. In terms of the underlying concepts, proximal selection and ordering was a novel feature: in a printer list, for example, it always preselected the nearest printer, depending on the mobile device's location.
Many of these applications seem commonplace today, but we have to realize that they were a novelty at the time of their publication and hence the ParcTab deserves credit for many elements later found in electronic organizers and for some basic multi device interaction techniques.
5.2 The MIT Intelligent Room
Fig. 8 The Intelligent Room at MIT (Picture taken from www.csail.mit.edu)
The Computer Science and Artificial Intelligence Lab (CSAIL) presented the concept of "intelligent spaces" and set up an intelligent room in the early 1990s. The room (see Figure 8) contained mostly conventional PCs of that time in different form factors (projection displays on the wall and table, PDAs as mobile devices and laptops on the table) and a sensing infrastructure consisting of cameras and microphones. This made it possible to use speech input and computer vision in addition to the conventional PC interface devices. Applications in this room included an intelligent meeting recorder, which would follow meetings and record their content, as well as a multimodal sketching application. For a number of informative videos, please refer to the lab's web page at www.csail.mit.edu.
5.3 Implicit interaction: the MediaCup
Fig. 9 The MediaCup, a device for implicit interaction [2]
In 1999, a group at TECO in Karlsruhe, Germany presented the MediaCup [2] (see http://www.mediacup.de/). This was an instrumented everyday artifact which was able to sense a number of physical properties and communicate them to the environment. While this can be described as context sensing, it is also one of the early examples of a concept called "embedded interaction". This term actually has two different interpretations, both of which are correct: on the one hand, it describes human interaction with devices into which digital technology is embedded, and on the other hand, it describes the fact that interaction with the computer is embedded into everyday actions. The user does not explicitly interact with a computer by pushing buttons or controlling a pointer; instead, interaction is sensed from activities with mundane objects. The MediaCup was able to sense, for example, the temperature of its contents and communicate this value to computers in the environment. Its presence in a room was
simply detected by its connection to an RF access point with limited range, of which there was one in every room. This simple technical setup allows interesting inferences and applications, for example the following: if several cups are detected in the same room and their contents are all hot, this hints at the situation that a (potentially informal) meeting is taking place. In order to protect this assumed meeting, an electronic door sign could then change its status and keep other people from entering the room.
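The inference itself is a simple rule over sensed context. The sketch below illustrates it; the data format and thresholds are assumptions and are not taken from the original MediaCup software.

```python
def meeting_likely(cups_in_room, hot_threshold_c=40.0, min_cups=2):
    """Several cups with hot contents in one room suggest an informal meeting."""
    hot_cups = [c for c in cups_in_room if c["temperature_c"] >= hot_threshold_c]
    return len(hot_cups) >= min_cups

cups = [{"id": "cup-1", "temperature_c": 62.0},
        {"id": "cup-2", "temperature_c": 55.5}]
if meeting_likely(cups):
    print("door sign: please do not disturb")
```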
5.4 Tangible Interaction: The MediaBlocks
Arguably one of the first tangible user interface prototypes was the MediaBlocks system [26], in which simple wooden blocks act as placeholders for information bins. The information is not physically stored in these blocks, of course, but it is conceptually associated with them by simple, largely physical operations: a MediaBlock can be inserted into a slot on a screen, and the screen's content is then (conceptually) transferred to the block. In the other direction, when a block is inserted into a printer slot, the contents of that block are printed. On a block browser device, the block can be put onto trays of different forms, and all of its contents can be browsed by scrolling a dial and viewing the contents on the screen (see also Fig. 4). These operations make good use of our sense of physical interaction, and they appear very intuitive once the concept of the block as a placeholder for information is understood by the user.
5.5 Multi-Device Interaction: Sony CSL Interactive Workspaces
One of the earlier projects to integrate multiple input and output devices into one coherent workspace is Sony CSL's Augmented Surfaces [21]. The setup basically consists of a large top-projection table with two cameras watching its surface. On the table, physical objects can be used, as well as further computing devices, for example laptops. This immediately raises the question of how information is transferred between the laptop and the surrounding table area. The authors propose several interaction techniques for this, which bridge the gap between the digital and the physical world in relatively direct ways: the Hyperdragging technique allows the mobile device's mouse cursor to leave the display and move on in the surrounding table area. In this way, digital objects can be picked up on the screen and dropped onto the table and vice versa without switching interaction devices. The desk surface appears as a shared extension of the workspaces of all mobile device screens, and it therefore also provides a place for exchanging data between them. The cameras overhead can detect optical markers on physical objects and thereby associate digital data with them (e.g., a trailer clip with a physical video tape). The cameras can also be used to digitize actual physical objects, e.g., scanning a business card and then sharing its digital copy with the mobile devices.
Fig. 10 The augmented surfaces project at Sony CSL [21]
The latter, on the other hand, conceptually dates back to Wellner's DigitalDesk as described in [30].
Fig. 11 The pick and drop interaction for moving information between displays [20]
Another interesting interaction technique from the same group, which also borrows from our understanding of interaction with the physical world, and also can
be seen as a straightforward extension of single-screen interaction, is "pick and drop" [20]. A digital piece of information is picked up from one screen with a pen and then dropped onto another screen with the same pen. This clearly corresponds to the way we would move a physical object from one container to another, or how a painter would dip a brush into paint in order to paint on the canvas.
5.6 Hybrid Interfaces: The PhotoHelix
Fig. 12 The Photohelix: A hybrid physical-digital interface [13]
Finally, the PhotoHelix [13] is an example of a category of interfaces which lies between tangible UIs and purely graphical UIs on an interactive surface. Imagine a graphical widget on an interactive tabletop with interactive elements, such as buttons, sliders, lists, etc. Now imagine one of its interactive elements not being digital but physical, e.g., a wooden knob on the surface instead of a slider to adjust a specific value. This physical object is part of the entire widget (which thereby becomes a hybrid widget), and the graphical and the physical part are inseparable. The presence of a physical part in this hybrid widget has a number of advantages: for one thing, real physical effects, such as inertia, can be used in the interface. It also becomes very simple to identify multiple users around a digital tabletop: if each user owns their own physical input object, it can unfold into the full hybrid widget as soon as it is put down on the table, and is then clearly assigned to the owner of the physical part. This solves the identification problem of collaborative input on a desk surface, and also makes it possible to orient the interface towards the respective user.
The PhotoHelix is exactly such a hybrid interface. It unfolds into a spiral-shaped calendar, on which the photos of a digital photo collection are arranged by capture time. The physical part is a knob which is meant to be held in the non-dominant hand; when it is rotated, the spiral rotates with it and different time intervals of the photo collection are selected. Scrolling over long distances can be achieved by setting the knob in fast rotation and then letting it spin freely, thereby harnessing a true physical effect in this hybrid interface.
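The interaction can be sketched as two small pieces of logic: a mapping from accumulated knob rotation to a point on an Archimedean spiral, and a free-spin phase in which the angular velocity decays under simulated friction. This is only an illustration of the idea, not the Photohelix implementation; all constants are assumptions.

```python
import math

def spiral_point(angle_rad, spacing=20.0, cx=0.0, cy=0.0):
    """Map accumulated knob rotation to a point on an Archimedean spiral
    (the radius grows by 'spacing' units per full turn)."""
    r = spacing * angle_rad / (2 * math.pi)
    return (cx + r * math.cos(angle_rad), cy + r * math.sin(angle_rad))

def free_spin(angular_velocity, friction=2.0, dt=0.05):
    """Angle travelled after the knob is released: the velocity decays under
    constant simulated friction, giving the long-distance scrolling effect."""
    angle = 0.0
    while abs(angular_velocity) > 1e-3:
        angle += angular_velocity * dt
        decel = min(friction * dt, abs(angular_velocity))
        angular_velocity -= math.copysign(decel, angular_velocity)
    return angle

print(spiral_point(free_spin(12.0)))   # where a fast spin ends up on the spiral
```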
5.7 Summary and Perspective
The field of Human-Computer Interaction in Smart Environments is much wider than the contents of this chapter. Nevertheless, the preceding sections have tried to structure this field along the capabilities of the participating sides (computer and human), to briefly touch on the design and evaluation of such interfaces, and to showcase a few important areas and trends in this field. With the steady increase in computing power and the stepwise appearance of new sensing technologies, entirely novel input technologies and concepts will inevitably appear. From the human point of view, it will always be important to be able to form a coherent mental model of all the interaction possibilities. The understanding of physicality is deeply rooted in human life from early childhood on. It is therefore a very promising candidate for skill transfer and a basis for such a coherent mental model. The question of what will become the "WIMP interface" of Ambient Intelligence and Smart Environments is, however, still open.
References
[1] Agarawala A, Balakrishnan R (2006) Keepin' it real: pushing the desktop metaphor with physics, piles and the pen. In: CHI '06: Proceedings of the SIGCHI conference on Human Factors in computing systems, ACM Press, New York, NY, USA, pp 1283–1292, DOI http://doi.acm.org/10.1145/1124772.1124965
[2] Beigl M, Gellersen HW, Schmidt A (2001) MediaCups: Experience with design and use of computer-augmented everyday objects. Computer Networks, Special Issue on Pervasive Computing 35:401–409
[3] Boring S, Hilliges O, Butz A (2007) A wall-sized focus plus context display. In: Proceedings of the Fifth Annual IEEE Conference on Pervasive Computing and Communications (PerCom)
[4] Burleson W, Selker T (2003) Canopy climb: a rope interface. In: SIGGRAPH '03: ACM SIGGRAPH 2003 Sketches & Applications, ACM, New York, NY, USA, pp 1–1, DOI http://doi.acm.org/10.1145/965400.965549
[5] Butz A, Krüger A (2006) Applying the peephole metaphor in a mixed-reality room. IEEE Computer Graphics and Applications
[6] Butz A, Schmitz M, Krüger A, Hullmann H (2005) Tangible UIs for media control - probes into the design space. In: Proceedings of ACM CHI 2005 (Design Expo)
[7] Buxton W, Fitzmaurice G (1998) HMDs, caves and chameleon: A human-centric analysis of interaction in virtual space. Computer Graphics: The SIGGRAPH Quarterly 32(4):64–68
[8] Dourish P, Bellotti V (1992) Awareness and coordination in shared workspaces. In: CSCW '92: Proceedings of the 1992 ACM conference on Computer-supported cooperative work, ACM Press, New York, NY, USA, pp 107–114, DOI http://doi.acm.org/10.1145/143457.143468
[9] Fitzmaurice GW, Ishii H, Buxton WAS (1995) Bricks: laying the foundations for graspable user interfaces. In: CHI '95: Proceedings of the SIGCHI conference on Human factors in computing systems, ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, pp 442–449, DOI http://doi.acm.org/10.1145/223904.223964
[10] Gabbard JL, Hix D, Swan JE (1999) User-centered design and evaluation of virtual environments. IEEE Computer Graphics and Applications 19(6):51–59, DOI http://doi.ieeecomputersociety.org/10.1109/38.799740
[11] Guiard Y (1987) Asymmetric division of labor in human skilled bimanual action: The kinematic chain as a model. The Journal of Motor Behavior 19(4):485–517
[12] Han JY (2005) Low-cost multi-touch sensing through frustrated total internal reflection. In: UIST '05: Proceedings of the 18th annual ACM symposium on User interface software and technology, ACM Press, New York, NY, USA, pp 115–118, DOI http://doi.acm.org/10.1145/1095034.1095054
[13] Hilliges O, Baur D, Butz A (2007) Photohelix: Browsing, Sorting and Sharing Digital Photo Collections. In: To appear in Proceedings of the 2nd IEEE Tabletop Workshop, Newport, RI, USA
[14] Hilliges O, Terrenghi L, Boring S, Kim D, Richter H, Butz A (2007) Designing for collaborative creative problem solving. In: C&C '07: Proceedings of the 6th ACM SIGCHI conference on Creativity & cognition, ACM Press, New York, NY, USA, pp 137–146, DOI http://doi.acm.org/10.1145/1254960.1254980
[15] Holleis P, Otto F, Hussmann H, Schmidt A (2007) Keystroke-level model for advanced mobile phone interaction. In: CHI '07: Proceedings of the SIGCHI conference on Human factors in computing systems, ACM, New York, NY, USA, pp 1505–1514, DOI http://doi.acm.org/10.1145/1240624.1240851
[16] Intille SS, Larson K, Tapia EM, Beaudin J, Kaushik P, Nawyn J, Rockinson R (2006) Using a live-in laboratory for ubiquitous computing research. In: Proceedings of PERVASIVE 2006, vol. LNCS 3968, Springer-Verlag, pp 349–365
[17] Mueller F, Agamanolis S, Picard R (2003) Exertion interfaces: sports over a distance for social bonding and fun. ACM Press New York, NY, USA
[18] Norman D, Collyer B (2002) The design of everyday things. Basic Books New York
[19] Preece J, Rogers Y, Sharp H (2001) Interaction Design: Beyond Human-Computer Interaction. John Wiley & Sons, Inc. New York, NY, USA
[20] Rekimoto J (1997) Pick-and-drop: a direct manipulation technique for multiple computer environments. In: UIST '97: Proceedings of the 10th annual ACM symposium on User interface software and technology, ACM Press, New York, NY, USA, pp 31–39, DOI http://doi.acm.org/10.1145/263407.263505
[21] Rekimoto J, Saitoh M (1999) Augmented surfaces: a spatially continuous work space for hybrid computing environments. In: CHI '99: Proceedings of the SIGCHI conference on Human factors in computing systems, ACM Press, New York, NY, USA, pp 378–385, DOI http://doi.acm.org/10.1145/302979.303113
[22] Rukzio E (2007) Physical mobile interactions: Mobile devices as pervasive mediators for interactions with the real world. PhD thesis, Faculty for Mathematics, Computer Science and Statistics, University of Munich
[23] Schilit BN, Sengupta U (2004) Device ensembles. Computer 37(12):56–64
[24] Streitz NA, Geißler J, Holmer T, Konomi S, Müller-Tomfelde C, Reischl W, Rexroth P, Seitz P, Steinmetz R (1999) i-land: An interactive landscape for creativity and innovation. In: Proceedings of the CHI '99 conference on Human factors in computing systems (CHI '99), ACM Press, New York, NY, pp 120–127, URL http://ipsi.fraunhofer.de/ambiente/publications
[25] Ullmer B, Ishii H (1997) The metadesk: models and prototypes for tangible user interfaces. In: UIST '97: Proceedings of the 10th annual ACM symposium on User interface software and technology, ACM Press, New York, NY, USA, pp 223–232, DOI http://doi.acm.org/10.1145/263407.263551
[26] Ullmer B, Ishii H, Glas D (1998) mediablocks: physical containers, transports, and controls for online media. In: SIGGRAPH '98: Proceedings of the 25th annual conference on Computer graphics and interactive techniques, ACM Press, New York, NY, USA, pp 379–386, DOI http://doi.acm.org/10.1145/280814.280940
[27] Underkoffler J, Ishii H (1999) Urp: a luminous-tangible workbench for urban planning and design. In: CHI '99: Proceedings of the SIGCHI conference on Human factors in computing systems, ACM Press, New York, NY, USA, pp 386–393, DOI http://doi.acm.org/10.1145/302979.303114
[28] Vredenburg K, Mao JY, Smith PW, Carey T (2002) A survey of user-centered design practice. In: CHI '02: Proceedings of the SIGCHI conference on Human factors in computing systems, ACM, New York, NY, USA, pp 471–478, DOI http://doi.acm.org/10.1145/503376.503460
[29] Want R, Schilit B, Adams N, Gold R, Petersen K, Goldberg D, Ellis J, Weiser M (1995) The PARCTAB ubiquitous computing experiment. Tech. Rep. CSL-95-1, Xerox Palo Alto Research Center
[30] Wellner P (1993) Interacting with paper on the DigitalDesk. Communications of the ACM 36(7):87–96, DOI http://doi.acm.org/10.1145/159544.159630
[31] Wilson A, Izadi S, Hilliges O, Garcia-Mendoza A, Kirk D (2008) Bringing Physics to the Surface. In: Proceedings of the 21st ACM Symposium on User Interface Software and Technology (ACM UIST), Monterey, CA, USA
Part IV
Human-centered Interfaces
Multimodal Dialogue for Ambient Intelligence and Smart Environments
Ramón López-Cózar and Zoraida Callejas
1 Introduction
Ambient Intelligence (AmI) and Smart Environments (SmE) are based on three foundations: ubiquitous computing, ubiquitous communication and intelligent adaptive interfaces [41]. This type of system consists of a series of interconnected computing and sensing devices which surround the user pervasively in his environment and are invisible to him, providing a service that is dynamically adapted to the interaction context, so that users can interact naturally with the system and thus perceive it as intelligent. To ensure such a natural and intelligent interaction, it is necessary to provide an effective, easy, safe and transparent interaction between the user and the system. With this objective, and as an attempt to enhance and ease human-to-computer interaction, in recent years there has been an increasing interest in simulating human-to-human communication, employing so-called multimodal dialogue systems [46]. These systems go beyond both the desktop metaphor and traditional speech-only interfaces by incorporating several communication modalities, such as speech, gaze, gestures or facial expressions. Multimodal dialogue systems offer several advantages. Firstly, they can make use of automatic recognition techniques to sense the environment, allowing the user to employ different input modalities; some of these technologies are automatic speech recognition [62], natural language processing [12], face location and tracking [77], gaze tracking [58], lipreading recognition [13], gesture recognition [39], and handwriting recognition [78].
Ramón López-Cózar, Dept. of Languages and Computer Systems, Faculty of Computer Science and Telecommunications, University of Granada, Spain, e-mail: [email protected]
Zoraida Callejas, Dept. of Languages and Computer Systems, Faculty of Computer Science and Telecommunications, University of Granada, Spain, e-mail: [email protected]
Secondly, these systems typically employ several output modalities to interact with the user, which makes it possible to stimulate several of his senses simultaneously and thus enhance the understanding of the messages generated by the system. These modalities are implemented using technologies such as graphic generation, natural language generation [33], speech synthesis [17], sound generation [24], and tactile/haptic generation [70]. Thirdly, the combination of modalities in the input and output makes it possible to obtain more meaningful and reliable interpretations of the interaction context. This is, on the one hand, because complementary input modalities provide non-redundant information which helps to create a richer model of the interaction. On the other hand, redundant input modalities increase the accuracy and reduce the uncertainty of the information [11]. Besides, both the system and the user can choose adequate interaction modalities to carry out the communication, thus enabling a better adaptation to environmental conditions such as light/acoustic conditions or privacy. Furthermore, the possibility to choose alternative ways of providing and receiving information allows disabled people to communicate with this type of system using the interaction modalities that best suit their needs. Researchers have developed multimodal dialogue systems for a number of applications, for example interaction with mobile robots [44] and information retrieval [26]. These systems have also been applied to the main topic of this book, for example to enhance the user-system interaction in homes [54, 23], academic centers [47], hospitals [8] and theme parks [56]. After this brief introduction, the remainder of this chapter is organised as follows. Section 2 deals with context awareness and user modelling. Section 3 addresses the handling of input and contextual information and how to build abstractions of the environment to acquire it. Section 4 centres on dialogue management, i.e. on how the system responds to inputs and to changes in the environment, focusing on interaction and confirmation strategies. Section 5 addresses response generation, discussing the fission of multimodal information by means of synthesised speech, sounds, graphical objects and animated agents. Section 6 addresses system evaluation, describing the peculiarities of the evaluation of multimodal interfaces for AmI and SmE applications. Section 7 presents the conclusions.
2 Context Awareness
Although there is no complete agreement on the definition of context, the most widely accepted definition is the one proposed in [15]: "Any information that can be used to characterize the situation of an entity (...) relevant to the interaction between a user and an application, including the user and the application themselves". As can be observed from this definition, any information source can be considered context as long as it provides knowledge relevant to handling the communication between the user and the system. In addition, the user is also considered to be part of the contextual information.
Kang et al. [38] differentiate two types of context: internal and external. The former describes the user state (e.g. communication context and emotional state), whereas the latter refers to the environment state (e.g. location and temporal context). Most studies in the literature focus on the external context. However, although external information, such as location, can be a good indicator of the user's intentions in some domains, in many applications it is necessary to take into account more complex information sources about the user state, such as emotional status [9] or social information [48]. External and internal context are intimately related. One of the most representative examples is service discovery. As described in [16], service context (i.e. location of services and providers) can be matched against the user context as a filtering mechanism to provide only the services that are suitable given the user's current state. Another example is that of so-called proactive systems, which decide not only what information must be provided to the user, but also when to do so in order to avoid interrupting him. The factors that must be taken into account to decide in which cases the user should be interrupted comprise both the user context (e.g. the activities in which he is involved, his emotional state, social engagement or social expectations) and the context related to the environment, such as the location of the user and the utility of the message [32]. For example, [69] presents a proactive shopping assistant integrated into supermarket trolleys. The system observes the shopper's actions and tries to infer his goals. With this information, it proactively offers support tailored to the current context, for example displaying information about products when the user holds them for a very long time, or comparing different products when the user is deciding between two items.
2.1 Context Detection and Processing
Context information can be gathered from a wide variety of sources, which produces heterogeneity in terms of quality and persistence [30]. As described in [31], static context deals with invariant features, whereas dynamic context is able to cope with information that changes. The frequency of such changes is very variable and can deeply influence the way in which context is obtained. It is reasonable to obtain largely static context directly from users, and frequently changing context from indirect means such as sensors. Once the context information has been obtained, it must be internally represented within the dialogue system so that it can be handled in combination with the information about the user interaction and the environment. In dynamic environments, the information describing context is highly variable and sometimes very unreliable and contradictory. Besides, it is also important to keep track of previous context during the interactions. This is why it is necessary to build consistent context models.
A number of methods have been proposed to create these models1. The simplest approach for modelling context is the key-value method, in which a variable (key) contains the actual context (value). However, key-value approaches are not suitable for capturing sophisticated contextual information. To overcome this drawback, tagged encoding can be employed. It is based on context profiles using a markup scheme, which makes it possible to model and process context recursively and to employ efficient context retrieval algorithms. However, this approach still lacks the capability to represent complex contextual relationships, and sometimes the number of contextual aspects that can be described is limited. It was adopted, for example, in the Stick-e Notes system, which was based on the SGML standard language [59]. More sophisticated mechanisms are object-oriented models (e.g. in [7]), which have the benefits of encapsulation and reusability. With them, it is possible to manage a great variety and complexity of contextual information while maintaining scalability. Following the philosophy of encapsulation, Kwon and Sadeh [42] presented a multi-agent architecture in which a negotiator agent encapsulates all the knowledge about context. Recently, Henricksen and Indulska [30] have presented a context modelling methodology which takes into account different levels of abstraction to cover the spectrum between the sensor output and the abstract representation that is useful for the application. This is an improvement over the first toolkits developed to represent context information, such as "The Context Toolkit" [14], which allowed the creation of abstract objects that could be related to the output of sensors. In addition, graphical representations [16] and ontology-based modelling are currently being employed for pervasive and mobile applications [25]. The context representation can be stored locally, in a computer network, or in both. For example, the context-awareness system developed in the CASY project [79] enables users to define new context events, which can be sent along with multimedia messages over the Internet. For example, a child can produce a context event "I am going to bed" and receive a "goodnight message" video which his grandmother had previously recorded, thus allowing geographically distributed families to communicate asynchronously. The processing of context is essential in dialogue systems to cope with the ambiguities derived from the use of natural language. For instance, it can be used to resolve anaphoric references (e.g. decide the noun referred to by a pronoun). However, the linguistic context sometimes is not enough to resolve a reference or disambiguate a sentence, and additional context sources must be considered. Porzel and Gurevych [60] identified four such sources, namely situational, discourse, interlocutor and domain. These sources were employed in a mobile tourist information system to distinguish different types of where clauses. For example, for the user question "Where is the castle?" the system could give two types of response: a path to the castle, or a description of the setting. The system selected the appropriate answer considering the proximity of the user to the castle, the intentions of the user (whether he wanted to go there or view it), and the user preferences (whether he wanted the shortest, fastest or nicest path).
A comprehensive review of context modelling approaches can be found in [71].
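To make the contrast between these modelling approaches more concrete, the following Python sketch is purely illustrative (the class and attribute names are assumptions, not taken from any of the systems cited above); it contrasts a flat key-value store with a minimal structured model that records the source, quality and history of each contextual attribute.

```python
import time

# Flat key-value context: simple, but cannot express relationships,
# quality or history of the stored values.
key_value_context = {"location": "kitchen", "temperature": 21.5}

class ContextAttribute:
    """One contextual aspect, keeping its history, source and quality."""
    def __init__(self, name):
        self.name = name
        self.history = []  # list of (timestamp, value, source, confidence)

    def update(self, value, source, confidence=1.0):
        self.history.append((time.time(), value, source, confidence))

    def current(self):
        return self.history[-1] if self.history else None

class ContextModel:
    """Structured context model: static attributes are set once (e.g. by
    the user), dynamic ones are refreshed from sensors."""
    def __init__(self):
        self.attributes = {}

    def update(self, name, value, source, confidence=1.0):
        self.attributes.setdefault(name, ContextAttribute(name)).update(
            value, source, confidence)

model = ContextModel()
model.update("language", "Spanish", source="user profile")          # static
model.update("location", "kitchen", source="RFID", confidence=0.8)  # dynamic
print(model.attributes["location"].current())
```

Even this small step beyond key-value pairs makes it possible to keep previous context and to mark sensor-derived values with a confidence, which the plain key-value method cannot express.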
Other important topics concerned with the processing of context are: relevant data structures and algorithms to be used, how to reduce the overhead caused by taking context into account, and possible recovery strategies to be employed when the environment is not providing the necessary services [67]. Additionally, privacy issues are gaining more attention from the scientific community [29].
2.2 User Modelling As previously discussed, the user is also considered to be part of the context. Hence, it is very important to employ user models in AmI and SmE applications to differentiate user types and adapt the performance of the dialogue system accordingly. In addition, user models are essential for conversational interactions, as they make it possible to deal with interpersonal variations, which range from a low level (e.g. acoustic features) to a higher level, such as syntactic, lexical, pragmatic and speaking-style features. For example, these models make it possible to enhance speech recognition by selecting the most appropriate acoustic models for the current user. They are also very useful for speech generation, as system prompts can be created according to the level of experience of the user. For instance, pauses can be placed at specific points of the sentences to be synthesised in order to encourage experienced users to speak, even while the system is still talking (barge-in) [45]. The adaptation is especially useful for users of devices such as mobile phones, PDAs or cars, as they can benefit from the functionality built on the knowledge about them as individual users. For example, this is the case of the in-car system presented in [4], which adapts the spoken interaction to the user as he books a hotel room while driving a car. In these cases, simple user models such as user profiles can be employed, which typically contain data concerned with user needs, preferences and/or abilities. For example, the profiles employed by the DS-UCAT system [47] contain students’ personal data and preferences for the interaction with the system, such as language (English or Spanish), spoken modality (enabled/disabled), gender for the system voice (male/female) and acceptance of messages incoming from the environment (enabled/disabled). Considering these profiles, the dialogue manager of the system employs interaction strategies (to be discussed in Sect. 4) which are adaptive to the different user types. For example, system-directed initiative is used for inexperienced users, whereas mixed initiative is employed for experienced users. A drawback of user profiles is that they require time on the part of the users to fill in the required data. An alternative is to classify users into preference groups, considering that sometimes the information about preferences can be common to different applications. In order to facilitate the sharing of this information, a cross-application data management framework must be used, such as the one presented in [40]. This framework has the ability to infer data, so that applications can use cross-profile data mining algorithms. For example, the system can find users that are similar to the one who is currently interacting with it and whose preferences might coincide.
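As a purely illustrative sketch of how such a profile can be represented and used to parametrise the interface, the following Python fragment defines fields similar in spirit to those of the DS-UCAT profiles described above; the field names, default values and derivation rules are assumptions, not the actual DS-UCAT implementation.

```python
from dataclasses import dataclass

@dataclass
class UserProfile:
    """Minimal profile, loosely inspired by the DS-UCAT fields above."""
    language: str = "English"          # "English" or "Spanish"
    spoken_modality: bool = True       # spoken input/output enabled
    system_voice: str = "female"       # gender of the synthesised voice
    accept_env_messages: bool = True   # accept messages from the environment
    experience: str = "novice"         # "novice" or "experienced"

def interface_settings(p: UserProfile) -> dict:
    """Derive interface parameters from the profile."""
    return {
        "tts_voice": f"{p.language.lower()}-{p.system_voice}",
        # experienced users may barge in while the system is still talking
        "allow_barge_in": p.experience == "experienced",
        "environment_notifications": p.accept_env_messages,
    }

print(interface_settings(UserProfile(language="Spanish", experience="experienced")))
```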
General user profiles are suitable for classifying users’ habits and preferences in order to respond with the appropriate actions. However, in some application domains it is also necessary to model more complex and variable parameters such as the user’s intentions. In this case, an alternative to general user profiles is so-called online user modelling, which can be defined as the ability of a system to model users at run-time [18]. It has been used, for example, in health promotion dialogue systems, in which it is necessary to model the attitude of users towards following the therapist’s advice (e.g. whether the user is confident or demoralised) [64].
3 Handling of Input Information In AmI and SmE applications, devices such as lamps, TVs, washing machines or ovens, and sensors such as microphones, cameras or RFID readers, generate information which is the input to the dialogue system. To enable the interaction between the user and a multimodal dialogue system, as well as the interaction between the system and the devices in the environment, all the generated information must be properly represented so that it can be handled to produce the system responses.
3.1 Abstracting the Environment To implement the interaction between the dialogue system and the environment, the former typically accesses a middleware layer that represents characteristics of the environment, e.g. interfaces and status of devices. Using this layer, the dialogue system does not need to interact directly with the devices in the physical world, thus abstracting away their specific peculiarities. The implementation details of the entities in the middleware are hidden from the system, and thus it uses the same standard communication procedure for any entity in the environment. Moreover, this layer eases the setting up of dynamic environments where devices can be added or removed at run-time. The communication between the entities in the middleware and the corresponding physical devices can be implemented using standard access and control mechanisms, such as EIB (European Installation Bus) [54, 55] or SNMP (Simple Network Management Protocol) [49]. Researchers have implemented this layer in different ways. For example, Encarnaçao and Kirste [19] proposed a middleware called SodaPop (Self-Organizing Data-flow Architectures suPorting Ontology-based problem decomPosition), which supports the self-organisation of devices into appliance ensembles. To do this, their proposal considers goal-based interaction of devices, and distributed event processing pipelines. Sachetti et al [65] proposed a middleware called WSAMI (Web Services for Ambient Intelligence) to support seamless access to mobile services for the mobile user,
whether a pedestrian, in a car or on public transport. This middleware builds on the Web services architecture, whose pervasiveness enables service availability in most environments. In addition, the proposal deals with the dynamic composition of applications in a way that integrates services deployed on mobile and wireless networks and on the Internet. The authors developed three key WSAMI-compliant Web services to offer intelligent-aware multimodal interfaces to mobile users: i) linguistic analysis and dialogue management, ii) multimodality and adaptive speech recognition, and iii) context awareness. Berre et al [5] proposed a middleware called SAMBA (Systems for AMBient intelligence enabled by Agents) to support the interaction and interoperability of various elements by encapsulating and representing them as agents acting as members of an Ambient Intelligence Elements Society, and by using executable models at run-time in support of interoperability. Montoro et al [54] implemented a middleware using a blackboard [20] created from the parsing of an XML document that represents ontological information about the environment. To access or change the information in the blackboard, applications and interfaces employ a simple communication mechanism using XML-compliant messages which are delivered via HTTP. An interesting feature of this proposal is that it allows linguistic information to be attached to each entity in the middleware in order to automatically create a spoken interface associated with the corresponding device. Basically, this linguistic information comprises a verb part (actions that can be performed with the device), an object part (name that can be given to the device), a modifier part (that refers to the kind of entity), a location part (that refers to the location of the device in the environment), and an additional information part (which contains additional information about the entity in the middleware).
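The idea of attaching linguistic information to middleware entities can be illustrated with a small sketch. The entity description and the phrase-generation rule below are hypothetical examples in the spirit of this proposal, not the actual data format used in [54].

```python
from itertools import product

# Hypothetical middleware entity for a living-room lamp, with the
# linguistic parts described above attached to it.
lamp_entity = {
    "id": "lamp-01",
    "status": {"power": "off"},
    "linguistic": {
        "verbs": ["turn on", "turn off", "dim"],
        "object": "lamp",
        "modifiers": ["reading"],
        "location": "living room",
    },
}

def spoken_commands(entity: dict) -> list:
    """Generate candidate spoken commands for the entity's grammar."""
    ling = entity["linguistic"]
    phrases = []
    for verb, modifier in product(ling["verbs"], ling["modifiers"] + [""]):
        noun = f"{modifier} {ling['object']}".strip()
        phrases.append(f"{verb} the {noun} in the {ling['location']}")
    return phrases

for phrase in spoken_commands(lamp_entity):
    print(phrase)   # e.g. "turn on the reading lamp in the living room"
```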
3.2 Processing Multimodal Input A number of formalisms have been defined to represent the information about the user interaction captured by the sensors in the environment, for example, typed feature structures [10] and XML-based languages such as M3L (Multimodal Markup Language) [72] or MMIL (Multimodal Interface Language) [63, 65]. Many multimodal dialogue systems employ so-called frames, which can be defined as abstract representations of real-world entities and their relationships. Each entity is constituted of a set of attributes called slots, which are filled with data extracted by the sensors. The filling of the slots can be carried out either incrementally or by means of inheritance relationships between frames. Slots can contain a diversity of data types, for example, words provided by a speech recogniser [54, 66], screen coordinates provided by an electronic pen, and location information obtained via RFID [55, 56] or GPS. They can also contain time stamps regarding the moment at which the frames are created. Frames can be classified into three types: input, integration and output frames. Input frames are employed to store the information generated by the devices or
captured by the sensors in the environment. Each input frame may contain empty slots if the obtained information is partial. Integration frames are created by the dialogue system as it handles the user interaction. The slots of these frames are filled by combining the slots in the input frames. The output frames are used to generate system responses (as will be discussed in Sect. 5). The dialogue system can use these frames to change the status of devices in the environment (e.g. turn on a lamp), or to transmit information to the user employing e.g. synthesised speech and graphics on a PDA screen. For example, [56] used frames to represent contextual information in a theme park application, where users move 3D puzzle pieces to assemble a car. Nigay and Coutaz [57] proposed to use a formalism similar to frames called melting pots, which encapsulate multimodal information with time stamps. These stamps are very important for deciding whether the information provided by devices and sensors should be considered complementary or independent. Taking into account temporal information, the combination (or fusion) of the information captured from the user can be microtemporal, macrotemporal or contextual. Microtemporal fusion is carried out when the information captured from the user, e.g. stored in melting pots, is complementary and overlaps in time. Macrotemporal fusion is carried out when the information is complementary and the time intervals of the captured information do not overlap, but belong to the same analysis window. Finally, contextual fusion combines information without considering time restrictions. The fusion of the information can be carried out at the signal or semantic levels. In the former case, only temporal issues are involved (e.g. data synchronisation) without taking into consideration any kind of high-level interpretation. This fusion type is usually employed for audiovisual speech recognition, which combines acoustic and lip movement signals [51]. In contrast, fusion at the semantic level deals with multimodal information that is interpreted separately and then combined according to its meaning. Among others, Sachetti et al [65] employed fusion at the semantic level to combine information obtained from speech and 2-D gestures captured by an electronic pen.
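A minimal sketch of how time stamps can drive the choice between microtemporal, macrotemporal and contextual fusion is shown below; the interval representation and the analysis-window threshold are illustrative assumptions rather than the melting-pot formalism of [57].

```python
def fusion_type(frame_a, frame_b, window=2.0):
    """Classify how two timestamped inputs should be fused.

    Each frame carries a (start, end) interval in seconds. Overlapping
    intervals suggest microtemporal fusion, intervals within the same
    analysis window suggest macrotemporal fusion, and anything else is
    combined contextually (no time restriction).
    """
    a_start, a_end = frame_a["interval"]
    b_start, b_end = frame_b["interval"]
    if a_start <= b_end and b_start <= a_end:
        return "microtemporal"
    gap = max(a_start, b_start) - min(a_end, b_end)
    if gap <= window:
        return "macrotemporal"
    return "contextual"

speech = {"slots": {"action": "turn on"}, "interval": (10.0, 11.2)}
pen = {"slots": {"object": "lamp-01"}, "interval": (10.8, 11.0)}
print(fusion_type(speech, pen))             # -> "microtemporal"
print({**speech["slots"], **pen["slots"]})  # complementary slots combined
```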
4 Dialogue Management Using the information generated by the devices, captured by the sensors in the environment and provided by the context, the dialogue system manages the interaction with the user and provides services. As the dialogue between the system and the user is multimodal, the dialogue manager controls the execution flow from various input and output communication channels, e.g. speech and/or icon selection on a PDA screen. The modalities to be used may be determined by the user profile and the contextual information about the environment conditions. If speech is suitable as an input modality, the user voice is captured by a microphone and is the input to a speech recogniser [54, 23, 66] which provides one or more recognition hypotheses. If speech is enabled as an output modality, the system can generate spoken
responses via synthesised speech, typically using text-to-speech (TTS) conversion [17]. For example, a spoken dialogue between the user and the GENIO system [23], developed for home automation and entertainment, is as follows (for illustration purposes, this dialogue has been slightly changed from the one shown in [23]):
(1) User: Majordomo.
(2) System: Tell me.
(3) User: Wash my clothes by the time when I come back home.
(4) System: When are you coming back?
(5) User: At eight in the evening.
(6) System: Eight in the morning or in the evening?
(7) User: In the evening, I said in the evening.
(8) System: OK, eight in the evening. Which washing program?
(9) User: Soft, 30 degrees centigrade.
(10) System: Did you say 13 degrees?
(11) User: No, I said 30 degrees.
(12) System: OK, washing machine programmed for soft, 30 degrees centigrade. Anything else?
(13) User: Is the dishwasher working at the moment?
(14) System: No, it is not.
To enable this dialogue, the system employs predefined dialogue management strategies to decide what to do after processing the information obtained from the user, the environment and the context. Typical actions are: provide information to the user extracted from a database query, check or change the status of devices in the environment, confirm data, and ask for missing data necessary to perform a specific action. Information can be missing either because a data item has not been provided by the user, or due to processing errors. For instance, a processing error can be observed in the dialogue shown above, as the data item ’in the evening’ provided by the user in turn (5) was not understood by the system. This might happen because the words ’in the evening’ uttered by the user were not recognised, or were recognised with a very low confidence score (as will be discussed in Sect. 4.2) and thus were rejected by the system. As a result, the system needed to employ an additional dialogue turn (6) to clarify the time, thus making the dialogue less efficient. In order to enhance the system performance and provide a friendly interaction, the dialogue management strategies must be adaptive to user characteristics such as knowledge, expertise, preferences and needs, which are stored in the user profile [27, 21]. Specifically, the adaptation to special needs is receiving much attention in terms of how to make systems usable by handicapped people [28], children [1] and the elderly [43]. Despite their complexity, these characteristics are to some extent rather static. Jokinen [37] identified another degree of adaptation in which the dialogue management strategies are not only adapted to the explicit message conveyed during the interaction, but also to the user’s intentions and state. Following
this guideline, affective computing focuses on how to recognize the user’s emotional state and dynamically adapt the conversation with the system to it. For example, many breakdowns in man-machine communication could be avoided if the machine were able to recognize the emotional state of the user and respond to it more sensitively [50]. Earlier experiments showed that an empathetic computer agent can indeed contribute to a more positive perception of the interaction [61]. Related work has also examined how social factors, such as status, influence the semantic content, the syntactic form and the acoustic realization of conversations [74].
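As a purely illustrative sketch (not the approach of any of the systems cited above), the following fragment shows one way in which a dialogue manager might adapt its next prompt once an external classifier has labelled the user's state; the state labels and wording rules are assumptions.

```python
def adapt_prompt(base_prompt: str, user_state: str) -> str:
    """Rephrase a system prompt according to a coarse user-state label.

    `user_state` is assumed to come from an external emotion classifier
    (e.g. "neutral", "irritated", "confused"); the wording rules are
    illustrative only.
    """
    if user_state == "irritated":
        # apologise and keep the prompt short instead of repeating it verbatim
        return "Sorry about that. " + base_prompt
    if user_state == "confused":
        # add explicit guidance about what the system can do
        return base_prompt + " For example, you can say 'wash at 30 degrees'."
    return base_prompt

print(adapt_prompt("Which washing program?", "confused"))
```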
4.1 Interaction Strategies To implement the user-system interaction by means of a multimodal dialogue, the system developers must design interaction strategies to control the way in which user and system interact with each other employing dialogue turns. For example, one issue to decide is whether the user must provide just the data items requested by the system, and in the order specified by the system, or alternatively is allowed to provide the data in the order that he wishes and be over-informative (provide data not requested by the system). Another important issue to decide is who (user or system) has the initiative in the dialogue. In accordance with this criterion, three types of interaction strategies are often distinguished in the literature: user-directed, system-directed and mixed. When the former strategy is used, the user always has the initiative in the dialogue, and the system just responds to his queries and orders. This is the case in the sample dialogue shown above. The main problem with this strategy is that the user may think that he is free to say whatever he wants, which tends to cause speech recognition and understanding errors. When the system-directed strategy is used, the system always has the initiative in the dialogue, and the user just answers its queries. The advantage of this strategy is that it tends to reduce the possible user input, which typically leads to very efficient dialogues. The disadvantage is the lack of flexibility on the part of the user, who is restricted to behaving as the system expects, providing the necessary data to perform some action in the order specified by the system. When the mixed interaction strategy is used, both the user and the system can take the initiative in the dialogue, consequently influencing the conversation flow. The advantage is that the system can guide the user in the tasks that he can perform. Moreover, the user can take the initiative and be over-informative.
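The practical difference between the three strategies can be sketched as follows; the grammar-restriction flags and prompt texts are illustrative assumptions.

```python
def next_prompt(strategy: str, missing_slots: list) -> dict:
    """Sketch of how the dialogue initiative changes the system's turn.

    `missing_slots` lists the data items still needed for the current task.
    """
    if strategy == "user-directed":
        # the system only reacts; the user may say anything at any time
        return {"prompt": "", "active_grammar": "open"}
    if strategy == "system-directed":
        # the system asks for exactly one item and restricts the grammar to it
        slot = missing_slots[0]
        return {"prompt": f"Please tell me the {slot}.",
                "active_grammar": slot}
    # mixed initiative: the system guides, but accepts extra items too
    return {"prompt": f"What {', '.join(missing_slots)} would you like?",
            "active_grammar": "task"}

print(next_prompt("system-directed", ["washing program", "temperature"]))
```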
4.2 Confirmation Strategies Considering the limitations of the state-of-the-art recognition technologies employed to build multimodal dialogue systems, it is necessary to assume that the
information captured by sensors about the user interaction may be uncertain or ambiguous. To deal with the uncertainty problem, these systems typically employ confidence scores attached to the frame slots, for example, real numbers in the range between 0 and 1. A confidence score under a threshold indicates that the data item in the slot must be either confirmed or rejected by the system. Two types of confirmation strategies are often employed: explicit and implicit. When the former is used, the system generates an additional dialogue turn to confirm the data item obtained from the previous user turn. Turn (10) in the sample dialogue above shows an example of this confirmation type. The disadvantage of this confirmation method is that the dialogue tends to be lengthy due to the additional confirmation turns, which makes the interaction less effective and even excessively repetitive if all the data items provided by the user must be confirmed. When the implicit confirmation strategy is used, no additional turns are necessary since the system includes the data items to be confirmed in the next dialogue turn used to get other data items from the user. In this case, it is the responsibility of the user to make a correction if he observes a data item wrongly obtained by the system. Turn (12) in the sample dialogue above shows an example of this confirmation strategy, where the system tries to get a confirmation for the data item ’Soft washing programme’. As the user did not make any correction in his next dialogue turn (13), the system assumed that this data item was confirmed. The input information can be ambiguous if it can be interpreted in different ways by the system. For example, the input made with a pen on a touch-sensitive screen can serve three different purposes: pointing (as a substitute for a mouse), handwriting and drawing. To address this problem, the dialogue system must employ some method to try to automatically decide the mode in which the pen is being used, and/or employ an additional dialogue turn to get a confirmation from the user about the intended mode. The strategies discussed above are useful to avoid misunderstandings and deal with the uncertainty of data obtained from the user. A related but different situation is non-understanding, which occurs when the system does not get any data from the user interaction. In this case, two typical strategies for handling the error are to ask the user to repeat the input, and to ask him to rephrase the sentence (in the case of spoken interaction).
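The combination of missing data, confidence scores and confirmation strategies can be expressed as a simple action-selection rule. The sketch below, with assumed threshold values, is one possible realisation of the behaviour described above and reproduces the pattern of the sample dialogue: ask for missing data, confirm uncertain data explicitly, and otherwise confirm implicitly while acting.

```python
REJECT_THRESHOLD = 0.3    # below this, the slot value is discarded
CONFIRM_THRESHOLD = 0.7   # below this, the slot value must be confirmed

def next_action(frame: dict, required: list) -> tuple:
    """Choose the next dialogue move from an integration frame.

    `frame` maps slot names to (value, confidence) pairs.
    """
    for slot, (value, conf) in list(frame.items()):
        if conf < REJECT_THRESHOLD:
            del frame[slot]                       # treat as not understood
    for slot in required:
        if slot not in frame:
            return ("ask", slot)                  # request missing data
    for slot, (value, conf) in frame.items():
        if conf < CONFIRM_THRESHOLD:
            return ("confirm_explicitly", slot)   # e.g. "Did you say 13 degrees?"
    return ("execute_and_confirm_implicitly", None)

frame = {"program": ("soft", 0.9), "temperature": ("30", 0.5)}
print(next_action(frame, ["program", "temperature", "time"]))
# -> ('ask', 'time')
```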
5 Response Generation After the analysis of the input information captured from the user and obtained from the environment, the dialogue system must generate a response for the user taking into account the context information and the user profile. To do this, it can employ the output frames discussed in Sect. 3.2. These frames are analysed by a module of the system that carries out the so-called fission of multimodal information, i.e., a decision about the devices and interaction modalities to be used for generating the
response. For example, the system can generate a multimodal response using synthesised speech, sounds, graphical objects and an animated agent on a PDA screen. The use of several modalities to generate system responses allows a friendlier interaction that better adapts to the user preferences and needs. Moreover, it enables the user to create a proper mental model of the performance of the system and the dialogue status, which facilitates the interaction and decreases the number of interaction errors. Response generation may also involve accessing the middleware to change the status of a device in the environment, e.g. to switch on a specific lamp in the living room [54]. Speech synthesis is typically carried out employing text-to-speech (TTS) conversion [17], which transforms into speech any sentence in text format. The sentence can be created employing natural language generation techniques, which typically carry out two sequential phases, known as deep and syntactic (or surface) generation [33]. The deep generation can be divided into two phases: selection and planning. In the selection phase the system selects the information to be provided to the user (typically obtained from a database query), while in the planning phase it organises the information into phrase-like structures to achieve clarity and avoid redundancy. An important issue of this phase is lexical selection, i.e. choosing the most appropriate words for each sentence. The lexical selection depends on the information previously provided by the system, the information available in the context, and specific stylistic considerations. The syntactic (or surface) generation takes the structures created by the deep generation and builds the sentences expressed in natural language, ensuring grammatical correctness. In addition to speech, the system may generate sounds to provide a friendly interaction, for example, using auditory icons [24], which associate everyday sounds with particular events of the dialogue system. The display of the communication device employed by the user may include graphical objects concerned with the task carried out by the system, e.g. pictures of a microwave oven, washing machine, dishwasher or lamps [55]. Additionally, the graphical objects can be used to provide visual feedback about the key data obtained from the user, for example, the temperature for the washing programme. The display can also include so-called animated agents, which are computer-animated characters, typically with the appearance of a human being, that produce facial expressions (e.g. by lip, eye or eyebrow movements) and body movements in synchronisation with the synthesised speech. The goal of these agents is to provide a friendlier interaction, and to ease the understanding of the messages generated by the system. For instance, the GENIO system [23], developed for home automation and entertainment, employs one of these agents to show a human-like majordomo on a PDA screen. Another example of an animated agent is the robotic interface called iCat (Interactive Cat) [66], which is connected to an in-home network to control devices (e.g. light, VCR, TV, radio) and to access the Internet.
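A minimal sketch of such a fission step, deciding which modalities render an output frame, is given below; the modality names and selection rules are assumptions for illustration only.

```python
def fission(output_frame: dict, profile: dict, environment: dict) -> list:
    """Map an output frame to a list of (modality, content) render actions."""
    actions = []
    # device actions go straight to the middleware layer
    for device, command in output_frame.get("device_commands", {}).items():
        actions.append(("middleware", f"{device}:{command}"))
    message = output_frame.get("message", "")
    if message:
        if profile.get("spoken_output", True) and not environment.get("noisy", False):
            actions.append(("tts", message))        # synthesised speech
        actions.append(("screen_text", message))    # visual feedback on the PDA
        if profile.get("animated_agent", False):
            actions.append(("animated_agent", message))
    return actions

frame = {"message": "Washing machine programmed for soft, 30 degrees.",
         "device_commands": {"washing_machine": "program(soft,30,20:00)"}}
print(fission(frame, {"spoken_output": True, "animated_agent": True},
              {"noisy": False}))
```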
6 Evaluation There are no generally accepted criteria for evaluating AmI and SmE systems. As it is a highly multidisciplinary area, authors usually employ standard methods to evaluate particular aspects of each system, rather than evaluating the system as a whole. This is the case for the evaluation of multimodal interfaces, which are usually assessed by employing the standard performance and satisfaction metrics of traditional dialogue systems [18]. Furthermore, the implementation of these systems is usually very costly. Therefore, regardless of the approach employed, it is desirable to assess their usability at design time, prior to implementation. Wizard of Oz techniques [22, 76] are typically employed to do this. Using these techniques, a human plays the role of some of the system's functionalities and the users are made to believe that they are actually interacting with a machine. Once the validity of the design is checked, the next step is to create a prototype system to evaluate the system performance in real conditions. The results obtained from this evaluation are very useful for identifying weaknesses of the system that must be improved before it is made available to users.
6.1 Contextual Evaluation As argued in [68], most context-aware systems provide a functionality which is only available or useful in a certain context. Thus, the evaluation must be carried out in a given context as well. In this way, there is more control over the data obtained, given that the situation is restricted. However, the evaluation results might not be representative of the full spectrum of natural interactions. Also, simulating the type of data that would be acquired from the sensor system in a real scenario may be difficult. As an attempt to solve these problems, so-called smart rooms are usually employed. However, these environments fail to reproduce the variety of behaviours, movements and tasks that the users would perform on a day-to-day basis, especially when these rooms are located in laboratories. An alternative is so-called living labs, i.e., houses built with ubiquitous sensing capabilities which provide naturalistic user behaviours in real life. However, as the user must interact in an environment which is not his real home, his behaviour may still be altered. In any case, user behaviour in living labs is still more natural than the one that could be obtained from short laboratory visits [35]. An example of a living lab is the PlaceLab [34], a real house located near the MIT campus in a residential area, occupied by volunteers who can live there for periods of days, weeks and up to months depending on the studies being carried out. It was designed to be able to sense the opening and closing of doors and windows, electrical current, water and gas flow; and is equipped with cameras and microphones to track people,
so that routine activities of everyday home life could be observed, recorded and manipulated. In the case of mobile AmI or SmE systems, the evaluation requires considering several contexts in order to study the system response in different geographical locations. For example, the MATCH system [36], which provides multimodal access to city help on hand-held devices, was evaluated both in laboratory and mobile settings while walking in the city of New York. Another example is the exercise counselor presented in [6], a mobile conversational agent running on a PDA that motivates users to walk more. It was evaluated both in the laboratory and in two open conditions: first with the test user walking on a treadmill while wearing the PDA, and second with the user wearing the PDA continuously for eight days.
6.2 Evaluation from the Perspective of Human-computer Interaction It is very important to minimize the number of variables that influence the evaluation of systems in order to obtain representative results. An approach to achieve this goal is to employ a single domain focus [68]. That is, the system is evaluated from a single perspective, keeping everything else constant. A typical usage of this approach is to evaluate the system from the perspective of human-computer interaction. This is the case for the evaluation of some context-aware dialogue systems, such as SmartKom [73] and INSPIRE [52]. The SmartKom project provides an intelligent, multimodal human-computer interface for three main scenarios. The first is SmartKom Home/Office, which allows communication and operation of home appliances. The second is SmartKom Public, which allows access to public services. The third is SmartKom Mobile, which is a mobile assistant. The authors differentiate between what developers and users need to evaluate the system [3]. For the developers, the objective of evaluation is to deliver reliable results of the performance under realistic application conditions, whereas from the users' perspective the goal is to provide different services retrieved in a multimodal manner. An outcome of this project was a new framework specifically devoted to the evaluation of multimodal systems called PROMISE (Procedure for Multimodal Interactive System Evaluation) [2, 3]. This new framework is an extended version of a widely adopted framework for the evaluation of spoken dialogue systems, called PARADISE [75]. The main idea behind PROMISE is to define quality and quantity measures that can be computed during the system processing. Quality measures include system cooperativity, semantics, helps, recognition (speech, gestures and facial expressions), transaction success and ways of interaction. Quantity measures include barge-in, elapsed time, and users/system turns. A distinct aspect of PROMISE, not considered in PARADISE, is a weighting scheme for different recognition components, which makes it possible to resolve contradictory inputs from the different modalities.
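In the spirit of PARADISE-style evaluation (and without reproducing the exact PROMISE formulation), a performance score can be computed as a weighted combination of normalised task success and dialogue costs. The measures, weights and normalisation in the sketch below are illustrative assumptions.

```python
from statistics import mean, stdev

def znorm(value, sample):
    """Z-score normalisation, as used in PARADISE-style performance functions."""
    s = stdev(sample)
    return 0.0 if s == 0 else (value - mean(sample)) / s

def performance(dialogue, corpus, alpha=0.5, cost_weights=None):
    """Weighted combination of normalised task success minus weighted costs.

    This is a simplified, PARADISE-like sketch; PROMISE additionally weights
    the recognition components of the individual modalities.
    """
    cost_weights = cost_weights or {"elapsed_time": 0.3, "system_turns": 0.2}
    success = alpha * znorm(dialogue["task_success"],
                            [d["task_success"] for d in corpus])
    costs = sum(w * znorm(dialogue[c], [d[c] for d in corpus])
                for c, w in cost_weights.items())
    return success - costs

corpus = [
    {"task_success": 1.0, "elapsed_time": 95, "system_turns": 7},
    {"task_success": 0.5, "elapsed_time": 140, "system_turns": 12},
    {"task_success": 1.0, "elapsed_time": 110, "system_turns": 9},
]
print(performance(corpus[0], corpus))
```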
The PROMISE framework focuses mainly on the calculation of "objective" evaluation measures. However, in order to obtain a meaningful interpretation of user satisfaction, "subjective" measures of the perceived quality of the system should also be gathered. Möller et al [53] identified some of the most important factors that influence the perceived quality of the service provided by a dialogue interface: user, agent, environment, task and contextual factors. User factors include attitude, emotions, experiences and knowledge, which affect the users' judgments. Agent factors are related to the characteristics of the machine counterpart. Environment factors relate to the physical context of the interaction. Task and contextual factors relate to the non-physical context. The INSPIRE system, designed to provide a multimodal interface for home, office and in-car conversational interactions, was evaluated taking into account these factors to assess the impact of the speech output component on the quality judgments of the users.
7 Conclusions This chapter has presented a review of some important issues concerned with the design, implementation, performance and evaluation of multimodal dialogue systems for AmI and SmE environments. It has discussed the conceptual components of these systems, which deal with context awareness, multimodal input processing, dialogue management, and response generation. The first part of the chapter has described context awareness approaches. It has introduced different context types (internal and external) and discussed context detection and processing. As the user is considered to be part of the context, the chapter has also focused on user modelling, discussing user profiles and on-line user modelling. Next, the chapter has addressed the handling of input information. On the one hand, it has considered the representation of the environment, which is typically modelled by a middleware layer, discussing different alternatives in the literature for implementing it. On the other hand, the chapter has discussed formalisms typically employed by the research community to represent multimodal input information. It has also addressed the fusion types employed to combine the internal representations of the multimodal information captured from the user or generated by the environment. The third part of the chapter has addressed dialogue management, which enables the intelligent, adaptive, and natural multimodal interaction between the user and the environment. The chapter has discussed the kinds of strategies employed to decide the system's responses and actions, taking into account predefined dialogue management strategies and user models. These approaches have been further discussed in terms of dialogue initiative and confirmation strategies. The chapter has also focused on response generation, discussing the procedure called fission of multimodal information, which is a decision about the devices and interaction modalities to be used for generating a system's response.
The last section of the chapter has addressed evaluation. It has initially discussed standard methods for the evaluation of dialogue systems, such as Wizard of Oz and prototyping. Then, it has addressed scenarios for the contextual evaluation of AmI and SmE applications, such as smart rooms and living labs. Finally, it has discussed evaluation from the perspective of human-computer interaction, focusing on the proposals of the SmartKom and INSPIRE projects. Acknowledgements This work has been funded by the research project HADA - Adaptive Hypermedia for the Attention to the Diversity in Ambient Intelligence Environments (TIN2007-64718), supported by the Spanish Ministry of Education and Science.
References [1] Batliner A, Hacker C, Steidl S, Nöth E, D’Arcy S, Russel M, Wong M (2004) Towards multilingual speech recognition using data driven source/target acoustical units association. In: Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’04), Montreal, Quebec, Canada, pp 521–524 [2] Beringer N, Karal U, Louka K, Schiel F, Türk U (2002) PROMISE A procedure for multimodal interactive system evaluation. In: Proc. of the LREC Workshop on Multimodal Resources and Multimodal Systems Evaluation, Las Palmas, Spain, pp 77–80 [3] Beringer N, Louka K, Penide-Lopez V, Türk U (2002) End-to-end evaluation of multimodal dialogue systems - can we transfer established methods? In: Proc. of the LREC Workshop on Multimodal Resources and Multimodal Systems Evaluation, Las Palmas, Spain, pp 558–563 [4] Bernsen N (2003) User modelling in the car. Lecture Notes in Artificial Intelligence pp 378–382 [5] Berre AJ, Marzo GD, Khadraoui D, Charoy F, Athanasopoulos G, Pantazoglou M, Morin JH, Moraitis P, Spanoudakis N (2007) SAMBA - an agent architecture for ambient intelligence elements interoperability. In: Proc. of Third International Conference on Interoperability of Enterprise Software and Applications, Funchal, Madeira, Portugal [6] Bickmore T, Mauer D, Brown T (2008) Context awareness in a handheld exercise agent. Pervasive and Mobile Computing Doi:10.1016/j.pmcj.2008.05.004. In press [7] Bouzy B, Cazenave T (1997) Using the object oriented paradigm to model context in computer Go. In: Proc. of Context’97, Rio, Brazil [8] Bricon-Souf N, Newman CR (2007) Context awareness in health care: A review. International journal of medical informatics 76:2–12 [9] Callejas Z, López-Cózar R (2008) Influence of contextual information in emotion annotation for spoken dialogue systems. Speech Communication 50(5):416–433
[10] Carpenter R (1992) The logic of typed feature structures. Cambridge University Press, Cambridge, England [11] Corradini A, Mehta M, Bernsen N, Martin J, Abrilian S (2003) Multimodal input fusion in human-computer interaction. In: Proc. of the NATO-ASI Conference on Data Fusion for Situation Monitoring, Incident Detection, Alert and Response Management, Yerevan, Armenia [12] Dale R, Moisl H, Somers H (eds) (2000) Handbook of natural language processing. Dekker Publishers [13] Daubias P, Deléglise P (2002) Lip-reading based on a fully automatic statistical model. In: Proc. of International Conference on Speech and Language Processing, Denver, Colorado, US, pp 209–212 [14] Dey A, Abowd G (1999) The context toolkit: Aiding the development of context-enabled applications. In: Proc. of the SIGCHI conference on Human factors in computing systems (CHI 99), Pittsburgh, Pennsylvania, US, pp 434–441 [15] Dey A, Abowd G (2000) Towards a better understanding of context and context-awareness. In: Proc. of the 2000 Conference on Human Factors in Computer Systems (CHI’00), pp 304–307 [16] Doulkeridis C, Vazirgiannis M (2008) CASD: Management of a context-aware service directory. Pervasive and mobile computing Doi:10.1016/j.pmcj.2008.05.001. In press [17] Dutoit T (1996) An introduction to text-to-speech synthesis. Kluwer Academic Publishers [18] Dybkjaer L, Bernsen N, Minker W (2004) Evaluation and usability of multimodal spoken language dialogue systems. Speech Communication 43:33–54 [19] Encarnaçao J, Kirste T (2005) Ambient intelligence: Towards smart appliance ensembles. In: From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments, pp 261–270 [20] Engelmore R, Morgan T (1988) Blackboard systems. Addison-Wesley [21] Forbes-Riley K, Litman D (2004) Modelling user satisfaction and student learning in a spoken dialogue tutoring system with generic, tutoring, and user affect parameters. In: Proc. of the Human Language Technology Conference - North American chapter of the Association for Computational Linguistics annual meeting (HLT-NAACL’06), New York, US, pp 264–271 [22] Fraser M, Gilbert G (1991) Simulating speech systems. Computer Speech and Language 5:81–99 [23] Gárate A, Herrasti N, López A (2005) Genio: An ambient intelligence application in home automation and entertainment environment. In: Proc. of Joint soc-EUSI Conference, pp 241–245 [24] Gaver WW (1992) Using and creating auditory icons. SFI studies in the sciences of complexity, Addison Wesley Longman, URL Proceedings/1992/Gaver1992.pdf [25] Georgalas N, Ou S, Azmoodeh M, Yang K (2007) Towards a model-driven approach for ontology-based context-aware application development: a case study. In: Proc. of the fourth International Workshop on Model-Based Methodologies for Pervasive and Embedded Software (MOMPES ’07), Braga, Portugal, pp 21–32
[26] Gustafson J, Bell L, Beskow J, Boye J, Carlson R, Edlund J, Granstrom B, House D, Wirén M (2000) Adapt - a multimodal conversational dialogue system in an apartment domain. In: Proc. of International Conference on Speech and Language Processing, Beijing, China, pp 134–137 [27] Haseel L, Hagen E (2005) Adaptation of an automotive dialogue system to users’ expertise. In: Proc. of 9th International Conference on Spoken Language Processing (Interspeech’05-Eurospeech), Lisbon, Portugal, pp 222–226 [28] Heim J, Nilsson E, Havard J (2007) User Profiles for Adapting Speech Support in the Opera Web Browser to Disabled Users. Lecture Notes in Computer Science 4397:154–172 [29] Hengartner U, Steenkiste P (2006) Avoiding privacy violations caused by context-sensitive services. Pervasive and mobile computing 2:427–452 [30] Henricksen K, Indulska J (2006) Developing context-aware pervasive computing applications: models and approach. Pervasive and mobile computing 2:37–64 [31] Henricksen K, Indulska J, Rakotonirainy A (2002) Modeling context information in pervasive computing systems. In: Proc. of the First International Conference on Pervasive Computing, pp 167–180 [32] Ho J, Intille S (2005) Using context-aware computing to reduce the perceived burden of interruptions from mobile devices. In: Proc. of the 2005 Conference on Human Factors in Computer Systems (CHI’05), Portland, US, pp 909–918 [33] Hovy EH (1993) Automated discourse generation using discourse relations. Artificial Intelligence, Special Issue on Natural Language Processing 63:341–385 [34] Intille S, Larson K, Munguia E (2003) Designing and evaluating technology for independent aging in the home. In: Proc. of the International Conference on Aging, Disability and Independence [35] Intille S, Larson K, Beaudin J, Nawyn J, Tapia EM, Kaushik P (2005) A living laboratory for the design and evaluation of ubiquitous computing technologies. In: Proc. of the 2005 Conference on Human Factors in Computer Systems (CHI’05), Portland, Oregon, US, pp 1941–1944 [36] Johnston M, Bangalore S, Vasireddy G, Stent A, Ehlen P, Walker M, Whittaker S, Maloor P (2002) Match: An architecture for multimodal dialogue systems. In: Proc. of Association for Computational Linguistics, Pennsylvania, Philadelphia, US, pp 376–383 [37] Jokinen K (2003) Natural interaction in spoken dialogue systems. In: Proc. of the Workshop Ontologies and Multilinguality in User Interfaces, Crete, Greece, pp 730–734 [38] Kang H, Suh E, Yoo K (2008) Packet-based context aware system to determine information system user’s context. Expert systems with applications 35:286–300 [39] Kettebekov S, Sharma R (2000) Understanding gestures in multimodal human computer interaction. Int Journal on Artificial Intelligence Tools 9(2):205–223
[40] Korth A, Plumbaum T (2007) A framework for ubiquitous user modelling. In: Proc. of IEEE International Conference on Information Reuse and Integration, Las Vegas, Nevada, US, pp 291–297 [41] Kovács GL, Kopácsi S (2006) Some aspects of ambient intelligence. Acta Polytechnica Hungarica 3(1):35–60 [42] Kwon O, Sadeh N (2004) Applying case-based reasoning and multi-agent intelligent system to context-aware comparative shopping. Decision Support Systems 37:199–213 [43] Langner B, Black A (2005) Using speech in noise to improve understandability for elderly listeners. In: Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’05), San Juan, Puerto Rico, pp 392– 396 [44] Lemon O, Bracy A, Gruenstein A, Peters S (2001) The WITAS Multi-Modal Dialogue System I. In: Proc. of Interspeech, Aalborg, Denmark, pp 1559–1562 [45] Levin E, Levin A (2006) Dialog design for user adaptation. In: Proc. of the International Conference on Acoustics Speech Processing, Toulouse, France, pp 57–60 [46] López-Cózar R, Araki M (2005) Spoken, Multilingual and Multimodal Dialogue Systems: Development and Assessment. John Wiley Sons [47] López-Cózar R, Callejas Z, Montoro G (2006) DS-UCAT: A new multimodal dialogue system for an academic application. In: Proc. of Interspeech ’06 Satellite Workshop Dialogue on Dialogues, Multidisciplinary Evaluation of Advanced Speech-Based Interactive Systems, Pittsburgh, Pennsylvania, US, pp 47–50 [48] Markopoulos P, de Ruyter B, Privender S, van Breemen A (2005) Case study: bringing social intelligence into home dialogue systems. Interactions 12(4):37–44 [49] Martínez AE, Cabello R, Gómez FJ, Martínez J (2003) INTERACT-DM. A solution for the integration of domestic devices on network management platforms. In: Proc. of IFIP/IEEE International Symposium on Integrated Network Management, Colorado Springs, Colorado, US, pp 360–370 [50] Martinovski B, Traum D (2003) Breakdown in human-machine interaction: the error is the clue. In: Proc. of the ISCA Tutorial and Research Workshop on Error Handling in Dialogue Systems, Chateau d’Oex, Vaud, Switzerland, pp 11–16 [51] McAllister D, Rodman R, Bitzer D, Freeman A (1997) Lip synchronization of speech. In: Proc. of ESCA Workshop on Audio-Visual Speech Processing (AVSP’97), Kasteel Groenendael, Hilvarenbeek, The Netherlands, pp 133–136 [52] Möller S, Krebber J, Raake A, Smeele P, Rajman M, Melichar M, Pallotta V, Tsakou G, Kladis B, Vovos A, Hoonhout J, Schuchardt D, Fakotakis N, Ganchev T, Potamitis I (2004) INSPIRE: Evaluation of a Smart-Home System for Infotainment Management and Device Control. In: Proc. of the International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal, pp 1603–1606
[53] Möller S, Krebber J, Smeele P (2006) Evaluating the speech output component of a smart-home system. Speech Communication 48:1–27 [54] Montoro G, Alamán X, Haya P (2004) A plug and play spoken dialogue interface for smart environments. In: Proc. of Fifth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’04), Seoul, South Korea, pp 360–370 [55] Nazari AA (2005) A Generic UPnP Architecture for Ambient Intelligence Meeting Rooms and a Control Point allowing for Integrated 2D and 3D Interaction. In: Proc. of Joint Conference on Smart Objects and Ambient Intelligence: Innovative Context-Aware Services, USges and Technologies, pp 207–212 [56] Ndiaye A, Gebhard P, Kipp M, Klessen M, Schneider M, Wahlster W (2005) Ambient intelligence in edutainment: Tangible interaction with life-like exhibit guides. Lecture Notes in Artificial Intelligence 3814:104–113 [57] Nigay L, Coutaz J (1995) A generic platform for addressing the multimodal challenge. In: Proc. of the SIGCHI Conference on Human Factors in Computing Systems, ACM, Denver, Colorado, US, pp 98–105 [58] Ohno T, Mukawa N, Kawato S (2003) Just blink your eyes: A head-free gaze tracking system. In: Proc. of Computer-Human Interaction, Fort Lauderdale, Florida, pp 950–951 [59] Pascoe J (1997) The Stick-e note architecture: Extending the interface beyond the user. In: Proc. of the International Conference on Intelligent User Interfaces, Orlando, Florida, US, pp 261–264 [60] Porzel R, Gurevych I (2002) Towards context-sensitive utterance interpretation. In: Proc. of the 3rd SIGdial Workshop on Discourse and Dialogue, Philadelphia, US, pp 154–161 [61] Prendinger H, Mayer S, Mori J, Ishizuka M (2003) Persona effect revisited. using bio-signals to measure and reflect the impact of character-based interfaces. In: Proc. of the 4th International Working Conference on Intelligent Virtual Agents (IVA’03), Kloster Irsee, Germany, pp 283–291 [62] Rabiner LR, Juang BH (1993) Fundamentals of Speech Recognition. PrenticeHall [63] Reithinger N, Lauer C, Romary L (2002) MIAMM - Multidimensional information access using multiple modalities. In: Proc. of International CLASS workshop on natural intelligent and effective interaction in multimodal dialogue systems [64] de Rosis F, Novielli N, Carofiglio V, Cavalluzzi A, de Carolis B (2006) User modeling and adaptation in health promotion dialogs with an animated character. Journal of Biomedical Informatics 39:514–531 [65] Sachetti D, Chibout R, Issarny V, Cerisara C, Landragin F (2004) Seamless access to mobile services for the mobile user. In: Proc. of IEEE Int. Conference on Software Engineering, Beijing, China, pp 801–804 [66] Saini P, de Ruyter B, Markopoulos P, Breemen AV (2005) Benefits of social intelligence in home dialogue systems. In: Proc. of 11th International Conference on Human-Computer Interaction, Las Vegas, Nevada, US, pp 510–521
[67] Satyanarayanan M (2002) Challenges in implementing a context-aware system. IEEE Distributed Systems Online 3(9) [68] Schmidt A (2002) Ubiquitous computing - computing in context. PhD thesis, Lancaster University [69] Schneider M (2004) Towards a Transparent Proactive User Interface for a Shopping Assistant. In: Proc. of Workshop on Multi-User and Ubiquitous User Interfaces (MU3I), Funchal, Madeira, Portugal, vol 3, pp 10–15 [70] Shimoga KB (1993) A survey of perceptual feedback issues in Dexterous telemanipulation: Part II. Finger Touch Feedback. In: Proc. of the IEEE Virtual Reality Annual International Symposium, Piscataway, NJ, IEEE Service Center [71] Strang T, Linnhoff-popien C (2004) A context modeling survey. In: Proc. of Workshop on Advanced Context Modelling, Reasoning and Management, UbiComp 2004, Nottingham, England [72] Wahlster W (2002) Smartkom: Fusion and fission of speech, gestures, and facial expressions. In: Proc. of First International Workshop on Man-Machine Symbiotic Systems, pp 213–225 [73] Wahlster W (ed) (2006) SmartKom: Foundations of Multimodal Dialogue Systems. Springer [74] Walker M, Cahn J, Whittaker S (1997) Improvising linguistic style: Social and affective bases of agent personality. In: Proc. of the 1st International Conference on Autonomous Agents (Agents’97), Marina del Rey, CA, US, pp 96–105 [75] Walker M, Litman D, Kamm C, Abella A (1998) Evaluating spoken dialogue agents with PARADISE: Two case studies. Computer Speech and Language 12(4):317–347 [76] Whittaker S, Walker M (2005) Evaluating dialogue strategies in multimodal dialogue systems. In: Minker W, Buehler D, Dybkjaer L (eds) Spoken Multimodal Human-Computer Dialogue in Mobile Environments, Kluwer [77] Yang J, Stiefelhagen R, Meier U, Waibel A (1998) Real-time face and facial feature tracking and applications. In: Proc. of Workshop on audio-visual speech processing, pp 79–84 [78] Yasuda H, Takahashi K, Matsumoto T (2000) A discrete HMM for online handwriting recognition. Pattern Recognition and Artificial Intelligence 14(5):675–689 [79] Zuckerman O, Maes P (2005) Awareness system for children in distributed families. In: Proc. of the 2005 International Conference on Interaction design for children (IDC 2005), Boulder, Colorado, US
Part V
Artificial Intelligence and Robotics
Smart Monitoring for Physical Infrastructures Florian Fuchs, Michael Berger, and Claudia Linnhoff-Popien
1 Introduction Infrastructures are the backbone of today’s economies. Physical Infrastructures such as transport and energy networks are vital for ensuring the functioning of a society. They are strained by the ever increasing demand for capacity, the need for stronger integration with other infrastructures, and the relentless push for higher cost efficiency. Strategies for coping with these challenges are primarily developed on an international level: The European Commission has formulated so-called Strategic Research Agendas for both the rail network and the power grid domain [7, 5]. As main drivers for realizing infrastructures of the future they identify the improvement of information exchange among stakeholders and the introduction of condition-based maintenance.
• Many different organizations collect data about infrastructure operation. In the railway domain, for example, there are the rolling stock manufacturers, rolling stock maintainers, infrastructure operators, infrastructure maintainers, and traffic managers. Each of them collects data in isolation. However, all of them would benefit from having a comprehensive view on the infrastructure that takes into account all available data.
• Maintenance is still primarily carried out as preventive maintenance based on fixed time intervals. This is very costly and accounts in railway networks for about half of the total cost of providing the infrastructure [7]. Replacing it with condition-based maintenance, which is only carried out when required, will increase the efficiency of maintenance procedures. The key to putting these approaches into practice is precise knowledge about the current condition of the infrastructure. Today, such knowledge is not available. This is not due to a lack of measurements—there is an abundance of different sensing systems currently installed: Examples from the railway domain are wheel impact load and hot axle box detectors at the track-side as well as hot axle box and wheel slip detectors onboard. In the power grid domain, there are cooling temperature sensors at transformers and circuit breaker sensors at substations. Instead, the ignorance about the current condition of the infrastructure results from the lack of an integrated interpretation of the available information. This chapter describes an Ambient Intelligence approach for tackling this issue: Today’s infrastructure networks can actually be considered Smart Environments. They are enriched with a large amount of technology and are in particular equipped with networked sensing systems that gather information about local conditions. What is missing today, however, is the intelligence for making use of the available data. In this sense, infrastructure monitoring can be considered an application of Ambient Intelligence. As a consequence, this chapter investigates how infrastructure monitoring can benefit from Ambient Intelligence techniques in order to transform it into Smart Monitoring. The chapter is organized as follows: Section 2 introduces the field of infrastructure monitoring together with its major challenges for the future in order to analyze its relation to Ambient Intelligence. As one foundation for the knowledge-based approach, Sect. 3 develops patterns for engineering infrastructure monitoring ontologies. As the other foundation, Sect. 4 designs a layered system architecture. Section 5 goes into detail about reasoning in a Smart Monitoring System and covers reasoning over distributed knowledge bases. The realization of the overall approach is illustrated in Sect. 6 with a case study on the European railway infrastructure. Section 7 concludes the chapter.
2 Infrastructure Monitoring in the Context of Ambient Intelligence This section introduces the challenges and the state-of-the-art of infrastructure monitoring. It motivates the use of Ambient Intelligence techniques for tackling them. Physical infrastructures include infrastructures as diverse as railway networks and power grids. Still, all of them share the same fundamental characteristics (see Table 1 for examples from railway networks and power grids): Physical infrastructures exhibit a network structure and span large geographical areas. A large number of different stakeholders is involved in the provisioning of the infrastructure. Several dedicated monitoring systems are already in place. Furthermore, there are important overall infrastructure conditions, which can only be assessed by interpreting the available data in an integrated manner.
Table 1 Examples for infrastructure domains and their characteristics.
Network structure. Railway: Infrastructure includes all stationary parts such as tracks, switches, and catenary; rolling stock encompasses all vehicles moving along rails such as locomotives and cars. Power: Transmission networks deliver electricity from the power plant to the substation, distribution networks move it from the substation to the consumer. They differ in their voltage levels.
Stakeholders. Railway: Train operating companies operate train connections. They lease the rolling stock from rolling stock operating companies, which buy it from rolling stock manufacturers. Infrastructure managers maintain and operate the rail infrastructure. Power: Electric utilities are usually different companies, which generate, transmit, and distribute electricity to consumers.
Monitoring systems. Railway: Onboard monitoring systems observe the status of the pantograph, air conditioners, up to the toilets. Track geometry and rail profile are monitored, too. Track-side systems detect floodings, but also measure wheel impact loads and hot axle boxes. In the UK, more than 3000 remote monitoring systems are currently installed [19]. Power: In substations, transformer temperature, cooling equipment, and seal integrity are monitored. Other monitoring systems observe circuit breaker insulating gas density, close time, and tank heater status [14]. For feeders, load and voltage profiles are monitored.
Infrastructure conditions. Railway: Several phenomena are observed about the wheel/rail interface. They include axle load, wheel temperature, wheel impact load, and wheel slip. Their combination allows distinguishing between, for example, dragging brakes and damaged wheels. Power: Assessing the current condition of a transformer requires, for example, combining information about its current load and temperature, the state of the cooling equipment and its isolation, among others.
586
Florian Fuchs, Michael Berger, and Claudia Linnhoff-Popien
2.1 State-of-the-Art in Infrastructure Monitoring

Today, however, the monitoring of infrastructure networks is almost exclusively handled by stand-alone systems. Several systems for monitoring the wheel/rail interface in railway networks are described in [16]. An example from the power grid domain is [13]. Still, several authors have acknowledged the need for an integrated interpretation of this information [e. g. 19, 17, 14, 25, 24]. The problems here arise from the distribution of the relevant information across different organizations, the heterogeneity of the monitoring systems, and the lack of a model for the complex interdependencies.

The Checkpoint research project realized a train monitoring system, which combines data from dynamic scales as well as hot axle box, loading gauge, and flat wheel detectors [17]. A common data format has been developed for these monitoring systems, and rules have been specified for interpreting the obtained data in an integrated manner. Extending the Checkpoint system with additional monitoring systems, however, requires both modifying the data format and adding new rules. Due to this tight coupling, it is not clear how scenarios with multiple stakeholders and large numbers of monitoring systems can be handled.

On the other hand, several approaches in the power grid domain particularly investigate the interoperability of monitoring data among stakeholders [8, 21]: This is achieved by using formal ontologies for representing observations and states. Still, the focus is on simply representing this information without considering the automated derivation of overall conditions. Furthermore, the feasibility of implementing these approaches in today's power grids is not investigated.
2.2 Context-Awareness in Ambient Intelligence

Ambient Intelligence (AmI) envisions systems that assist users in their real-world tasks by taking into account their current situation. To this end, they use sensor-enriched environments, so-called Smart Environments, for gathering information about the user's real-world context. This definition shows that the use of context awareness is a vital property of every Ambient Intelligence system. Dey proposed the most cited definition of context and context-awareness [4]:

Context is any information that can be used to characterize the situation of an entity. An entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and applications themselves. A system is context-aware if it uses context to provide relevant information and/or services to the user, where relevancy depends on the user's task.
The fundamental elements of this definition are illustrated on the left-hand side of Fig. 1: A Context-Aware System aims at assisting its users in their tasks. To this end it uses knowledge about the situation of entities, which are relevant to these tasks.
Fig. 1 Context-Awareness in general (left) and applied to infrastructure monitoring (right).
The situation of entities is derived from context information, which is delivered by a Smart Environment. In infrastructure monitoring, the setting is identical (see the right-hand side of Fig. 1): A Smart Monitoring System aims at assisting its users in their monitoring tasks. To this end it uses knowledge about the condition of the relevant monitoring subjects. The condition of a monitoring subject is derived from context information, which is delivered by monitoring systems.

While Context-Aware and Smart Monitoring Systems share a common goal, they pursue different approaches for achieving this goal: Recent Context-Aware Systems apply formal knowledge representation and reasoning for processing context information [e. g. 15]. This enables them to take into account a broad range of context types and to automate the derivation of higher-level context and situations. In contrast, traditional monitoring systems rely on ad-hoc modeling and ad-hoc processing. As a consequence, they suffer from lack of integration and extensibility as described in the previous section.

Motivated by the understanding that both Context-Aware and Smart Monitoring Systems share the same characteristics, this work aims to apply proven techniques from Context-Aware Systems to Smart Monitoring Systems. So it investigates how formal knowledge representation and reasoning benefits the realization of integrated monitoring systems.
2.3 Knowledge Representation and Reasoning in the Semantic Web

Knowledge Representation and Reasoning (KRR) has long been a key discipline of Artificial Intelligence (AI). Theoretical advances in recent years led to the development of so-called Description Logics (DL), which exhibit high expressivity for formalizing semantics as well as decidable inference procedures for reasoning on them [1]. Knowledge models based on DL are usually called ontologies. The Semantic Web initiative of the World Wide Web Consortium (W3C) adopted these formalisms and added a standard for representing them: the Web Ontology Language (OWL). Unlike before, this combination led to the emergence of a rich software ecosystem (editors, reasoners, repositories) for working with Semantic Web ontologies. The availability of an established standard and a solid software foundation is the key enabler for adopting a technology for an inter-organizational, long-lived system. As a consequence, this work adopts the W3C standard OWL DL [2] and the underlying SHIN formalism (without nominals) [12] for applying the Ambient Intelligence technique of knowledge representation and reasoning to a Smart Monitoring System.
3 Pattern-based Modeling of Context Information and Infrastructure Conditions

Applying Description Logics-based knowledge representation and reasoning to infrastructure monitoring requires the development of an appropriate ontology. This ontology has to formalize the semantics of context information and infrastructure conditions so that it can be used for automated reasoning. Engineering such an ontology is not an easy task: While only monitoring domain experts know what has to be modeled, they usually lack the required modeling expertise and knowledge on how this can appropriately be reflected in an ontology. This section therefore reviews methods for supporting the modeler in engineering high-quality ontologies. A pattern-based approach is selected as it increases both transparency and reason-ability of the resulting ontologies. Finally, an ontology design pattern for context information and infrastructure conditions is developed.

Svatek compared different methods for engineering ontologies with respect to the accuracy, transparency, and reason-ability of the resulting ontologies [23]: An accurate ontology reflects states of affairs that hold in reality without any tacit assumptions; a transparent ontology is comprehensible for people other than just its designers; a reason-able ontology is usable for non-trivial inference by existing reasoners. The results of his assessment are presented in Table 2. Smart infrastructure monitoring requires transparent ontologies as different stakeholders are involved and need to share a common information model. It also requires reason-able ontologies as infrastructure conditions should be derived automatically. Both requirements suggest adopting a pattern-based modeling approach.
Table 2 Assessment of different ontology engineering methods [23]. The methods compared are methodologies, upper ontologies, meta-properties, language guides, collected hints, and pre-fabricated patterns; each is rated (on a scale from * to **) with respect to accuracy, transparency, and reasoning-ability.
For maximizing their reusability, ontologies for smart infrastructure monitoring also have to be accurate. To this end, international standards on condition monitoring are analyzed in the following in order to derive an appropriate ontology design pattern.
3.1 Key Terms and Concepts of Condition Monitoring

The ISO standard ISO 17359:2003 Condition monitoring and diagnostics of machines defines terms and concepts in the field of condition monitoring. The definitions of the key terms are presented in the following¹:

Monitoring Subject. Item under consideration whose condition should be determined through monitoring or diagnostics.

Feature. Numerical value of a signal characteristic gained from processing a measured signal, or numerical result of an observation. [..] Condition monitoring uses features that depend on the condition and describe symptoms quantitatively. They also include quantitative expressions for the domains of qualitative symptoms such as "very hot", "hot", "warm", "cool", "cold", and "very cold".

Symptom. Measurable or observable physical phenomenon, which reflects a behavior deviation and thereby indicates with a certain probability the existence of one or more faults or failures. [..] For condition monitoring purposes, the symptom is quantified by one or more features.

Fault. Deviation of the actual value of the state, functionality, or behavior of a monitoring subject from the desired value.

State. Property of a monitoring subject, which expresses the functional integrity and its restriction or endangerment by faults or its cessation by failure, respectively. [..] A monitoring subject can show multiple faults and defects at the same time. These are described by different types of state parameters or classes.

In addition to these basic terms, the process of condition monitoring has been described as a classification problem with four transformation steps by [26]. It clarifies the relations and interdependencies between the previously identified key terms:

¹ They are taken from DIN ISO 17359 Beiblatt 1:2007-08, the supplement to the German version of the ISO standard.
Fig. 2 Concepts and interdependencies of the ontology design pattern for infrastructure conditions.
1. The measurement space is a space of measurements. They are independent from any a-priori problem knowledge and are identical to the measurements obtained by sensors.
2. The feature space is a space of features. Features are extracted from measurements by utilizing a-priori problem knowledge.
3. The decision space is a space of decision variable vectors. They are obtained by suitable transformations of the feature space based on a-priori problem knowledge. As a consequence, they represent symptoms.
4. The class space is a set of categories for a given measurement. With respect to the previous definition, these represent fault classes. The transformation from decision to class space is again performed using a suitable function based on a-priori problem knowledge.

In the following, the presented conceptualization of the field of condition monitoring will be encoded in an ontology design pattern.
3.2 Ontology Design Pattern for Infrastructure Conditions

The ontology design pattern for infrastructure conditions has to enable the derivation of a monitoring subject condition from its features. It therefore has to capture both the key concepts and the interdependencies between them. As the ontology uses the SHIN Description Logic, the design pattern is formulated using the SHIN syntax and semantics as defined in [12]. Fig. 2 illustrates the ontology design pattern. It introduces the concepts MonitoringSubject, Feature, Symptom, Fault, and State. For every monitoring domain to be modeled, such as wheel impact loads in the railway domain or transformer temperatures in the power domain, the design pattern is utilized as follows:
• The Feature concept represents features. For every feature type produced by a context source a corresponding concept definition is introduced as a specialization of Feature. Instances of these concepts are associated with an instance of a monitoring subject using the functional role refersToMonitoringSubject. As an example, the definitions of feature concepts for wheel impact, axle box temperature, axle differential temperature, and wheel temperature are presented (all being defined as pairwise disjoint):

AxleBoxTemp ≡ LoAxleBoxTemp ⊔ MeAxleBoxTemp ⊔ HiAxleBoxTemp
AxleDiffTemp ≡ LoAxleDiffTemp ⊔ MeAxleDiffTemp ⊔ HiAxleDiffTemp
WheelImpact ≡ IncreasedWheelImpact ⊔ NormalWheelImpact
WheelTemp ≡ LoWheelTemp ⊔ HiWheelTemp
So a feature instance of AxleBoxTemp, for example, is defined to be either an instance of LoAxleBoxTemp, MeAxleBoxTemp, or HiAxleBoxTemp. These correspond to measurement results of an axle box sensor.

• The Symptom concept represents symptoms. A domain expert identifies relevant symptom types in the domain to be modeled and introduces a corresponding concept definition as a specialization of Symptom. As a symptom is defined as a deviation from expected observations, it is defined with respect to feature concepts using the functional role refersToFeature. The examples show the symptoms HotWheel, HotAxleBox, and ExcessiveWheelImpact.

HotWheel ≡ ∃ refersToFeature.HiWheelTemp ⊓ ∀ refersToFeature.HiWheelTemp
HotAxleBox ≡ ∃ refersToFeature.(HiAxleBoxTemp ⊔ MeAxleDiffTemp ⊔ HiAxleDiffTemp) ⊓ ∀ refersToFeature.(HiAxleBoxTemp ⊔ MeAxleDiffTemp ⊔ HiAxleDiffTemp)
ExcessiveWheelImpact ≡ ∃ refersToFeature.IncreasedWheelImpact ⊓ ∀ refersToFeature.IncreasedWheelImpact

This means that a symptom instance of HotWheel, for example, refers to at least one and only feature instances of HiWheelTemp.

• The Fault concept represents faults. It is defined as a deviation of a property of a monitoring subject. So a domain expert again identifies relevant fault types and introduces corresponding specializations of Fault. They are in turn defined with respect to symptom concepts and use the functional role refersToSymptom. See the following examples for the faults WornAxle, DraggingBrake, and DamagedWheel.

WornAxle ≡ ∃ refersToSymptom.(HotAxleBox ⊓ ¬HotWheel) ⊓ ∀ refersToSymptom.(HotAxleBox ⊓ ¬HotWheel)
DraggingBrake ≡ ∃ refersToSymptom.(¬ExcessiveWheelImpact ⊓ HotWheel) ⊓ ∀ refersToSymptom.(¬ExcessiveWheelImpact ⊓ HotWheel)
DamagedWheel ≡ ∃ refersToSymptom.(¬HotAxleBox ⊓ ExcessiveWheelImpact) ⊓ ∀ refersToSymptom.(¬HotAxleBox ⊓ ExcessiveWheelImpact)
Here, fault instances of WornAxle are defined as referring to at least one and only symptom instances of HotAxleBox, which are at the same time no instances of HotWheel.

• The State concept represents technical states. States characterize the capability of a monitoring subject to provide its functionalities. State types are introduced by domain experts as specializations of State. They are defined with respect to fault concepts using the functional role refersToFault. Here, the states CriticalAxleBoxState and CriticalWheelState are defined.

CriticalAxleBoxState ≡ ∃ refersToFault.WornAxle ⊓ ∀ refersToFault.WornAxle
CriticalWheelState ≡ ∃ refersToFault.(DamagedWheel ⊔ DraggingBrake) ⊓ ∀ refersToFault.(DamagedWheel ⊔ DraggingBrake)

For example, a state instance of CriticalWheelState refers to at least one and only fault instances of DamagedWheel or DraggingBrake.

• The MonitoringSubject concept represents monitoring subjects. On the one hand, it is specialized for monitoring subjects that are relevant in the modeled infrastructure network, such as trains, switches, and brakes in railway networks or transformers, feeders, and isolators in power grids. On the other hand, these monitoring subjects are further specialized with respect to the different conditions they can have. Conditions are defined with respect to state concepts using the role hasState. As a consequence, the condition of an instance representing a monitoring subject can automatically be derived by identifying the most specific concept it belongs to. The examples present definitions for PriorityBogieMaintenanceTrain, RegularBogieMaintenanceTrain, and PriorityWheelsetMaintenanceTrain.

PriorityBogieMaintenanceTrain ≡ ∃ hasState.CriticalWheelState ⊓ ∃ hasState.CriticalAxleBoxState
RegularBogieMaintenanceTrain ≡ ∃ hasState.CriticalAxleBoxState ⊓ ∀ hasState.(¬CriticalWheelState)
PriorityWheelsetMaintenanceTrain ≡ ∃ hasState.CriticalWheelState
So a monitoring subject instance of RegularBogieMaintenanceTrain is defined as having a state instance of CriticalAxleBoxState, but no state instance of CriticalWheelState.

The proposed design pattern enables domain experts to engineer ontologies for context information and infrastructure conditions. As all domain ontologies follow the same pattern, they can easily be integrated into a consistent single ontology. Using this design pattern at design time, the relevant semantics are formalized in the concept definitions. At runtime, they can be utilized by reasoners as follows: Monitoring systems produce context information about monitoring subjects, e. g. the measurement of an increased wheel impact load about a particular train. These are interpreted as features, and an instance of the corresponding feature concept is
generated, e. g. an instance of IncreasedWheelImpact. It is associated with the (usually already existing) instance that represents the monitoring subject, i. e. an instance of Train. Additionally, placeholder instances for symptom, fault, and state concepts are generated according to the design pattern. Note that these are only asserted to be instances of the generic concepts Symptom, Fault, and State, respectively, as it is not known which specialized concept they belong to. Given the concept definitions in the ontology, however, a reasoner automatically derives the most specific concept each of these instances belongs to. (This reasoning task is called realization [1].) In particular, a reasoner can also be queried for the most specific concept of the monitoring subject instance. The result will be a concept name such as PriorityWheelsetMaintenanceTrain, which indicates the condition of the monitoring subject with respect to the known features. Based on such reasoning, the condition monitoring process can be automated.
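The realization step can be illustrated with a small, self-contained sketch. The chapter's implementation builds on Jena with reasoners such as Pellet or RacerPro (see Sect. 5.2); the sketch below instead uses the OWL API together with the HermiT reasoner purely for illustration, and the ontology IRI, file name, and individual names are hypothetical. It asserts an IncreasedWheelImpact feature for a train, links it via refersToMonitoringSubject, and asks the reasoner for the most specific (direct) concepts of the train individual.

```java
import java.io.File;
import org.semanticweb.HermiT.Reasoner;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public class RealizationSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical namespace and ontology file containing the design-pattern definitions.
        String ns = "http://example.org/railway-monitoring#";
        OWLOntologyManager m = OWLManager.createOWLOntologyManager();
        OWLOntology onto = m.loadOntologyFromOntologyDocument(new File("railway-monitoring.owl"));
        OWLDataFactory f = m.getOWLDataFactory();

        // Individuals: a monitored train and a freshly observed feature instance.
        OWLNamedIndividual train = f.getOWLNamedIndividual(IRI.create(ns + "train4711"));
        OWLNamedIndividual feature = f.getOWLNamedIndividual(IRI.create(ns + "feature42"));

        // Assert the new observation as an IncreasedWheelImpact feature of the train.
        m.addAxiom(onto, f.getOWLClassAssertionAxiom(
                f.getOWLClass(IRI.create(ns + "IncreasedWheelImpact")), feature));
        m.addAxiom(onto, f.getOWLObjectPropertyAssertionAxiom(
                f.getOWLObjectProperty(IRI.create(ns + "refersToMonitoringSubject")),
                feature, train));
        // In the full pattern, placeholder Symptom, Fault, and State individuals linked to
        // the train would be generated here as well; they are omitted for brevity.

        // Realization: ask a DL reasoner for the most specific (direct) concepts of the train.
        OWLReasoner reasoner = new Reasoner.ReasonerFactory().createReasoner(onto);
        for (OWLClass c : reasoner.getTypes(train, true).getFlattened()) {
            System.out.println("Most specific concept: " + c);
        }
    }
}
```

With enough features asserted, the direct types printed for the train would include a condition concept such as PriorityWheelsetMaintenanceTrain.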
4 Smart Monitoring Architecture

In addition to the previously described ontology for context information and infrastructure conditions, realizing a Smart Monitoring System requires an appropriate system architecture. This section proposes a layered system architecture due to its flexibility and extensibility: Each layer provides a particular functionality and depends only on the next layer down. This enables replacing individual layers without affecting the overall system. The proposed system architecture is illustrated in Fig. 3. It distinguishes the following four layers:

Layer 1: Acquisition of Context Information. Components on this layer acquire and provide context information. They represent context sources such as existing monitoring systems, but also databases that collect reports from manual inspection. The core functionality of this layer is to provide context information in a representation based on the system ontology. This way, syntactic and semantic interoperability with other components in the system is achieved. To this end, ContextProvider-components on Layer 1 transform context information from the proprietary format of the context source into the ontology-based representation. ID- and geo-referenced context sources have to be distinguished:

• ID-referenced context sources produce context information with respect to the ID of the monitoring subject they refer to. They implement the IdReferencedContextRetrieval interface. Examples are wheel impact load detectors, which obtain the context type "wheel impact load" together with the ID of the train it was measured for.
• Geo-referenced context sources deliver context information with a geographical reference, i. e. a coordinate in a spatial reference system. They therefore implement the GeoReferencedContextRetrieval interface. An example is a weather service, which obtains the context type "environmental temperature" with reference to a latitude and longitude value.
Fig. 3 Functional layers of the proposed smart monitoring architecture.
Layer 2: Management of Context Information. Components on Layer 2 manage context information and any other relevant information in the Smart Monitoring System. This includes acquisition, archiving, and provisioning of the information.

Context information. ContextRepository-components on Layer 2 obtain context information from ContextProvider-components on Layer 1. They provide it to other components through the IdReferencedContextRetrieval interface. To this end, Layer 2-components have to discover ContextProvider-components on Layer 1. Appropriate discovery mechanisms are required for handling the volatility of context sources that are mobile or provided by different organizations: For ID-referenced context sources there are approaches based on structured peer-to-peer networks [10]. Corresponding schemes can be developed for geo-referenced context sources [27]. For further processing, geo-referenced context information is transformed into ID-referenced context information: The ContextAssigner-component uses the infrastructure topology to assign a piece of geo-referenced context information to the IDs of the geographically relevant elements of the infrastructure topology [18]. In addition, ContextRepository-components are responsible for managing and archiving the acquired context information and their history. This requires efficient handling of large amounts of data as it is provided by traditional databases.
Components for managing ontology-based data are called RDF repositories or triple stores. Several systems are already available². Additionally, existing database systems such as Oracle 11g and IBM DB2 are currently being extended with support for ontology-based data.

Models of the infrastructure network. The model of the infrastructure network is another core type of information handled by components on Layer 2. Using the NetworkRetrieval interface, all components have access to a common, ontology-based representation of the network topology and geometry.

Other relevant information. Apart from context information and network models, a Smart Monitoring System could require other types of information. These will also be handled by defining appropriate components and interfaces on Layer 2.

Layer 3: Reasoning on Context Information. Components on Layer 3 provide functionality for reasoning on context information. Like ContextRepository-components on Layer 2, they provide the IdReferencedContextRetrieval interface (a sketch of these interfaces is given below). By performing reasoning, however, they do not only return known (i. e. explicitly asserted) facts, but are furthermore able to derive implicit answers. Different reasoning techniques will be required for different purposes: Reasoning over distributed context information is introduced in Sect. 5 and is realized as a system of ReasoningNode-components on Layer 3 that access ContextRepository-components on Layer 2. Other reasoning techniques are reasoning with respect to the infrastructure topology [see 9], data mining on historical context information, and analytical techniques based on heuristics. Additional reasoning techniques are easily integrated by realizing appropriate Layer 3-components, which depend only on Layer 2-components. This enables different reasoning techniques to reuse the ontology-based representation of context information. They can be used concurrently and independently from each other.

Layer 4: Condition-based Decision Support. Components on Layer 4 realize applications, which support infrastructure operators in their decisions based on infrastructure conditions. Naturally, the application logic of Layer 4-components is highly application specific. Typical application areas are condition-based and predictive maintenance.

Not shown in Fig. 3 is an additional ontology management component. This component provides the system ontology and handles its versioning. As it is used by all components on the functional layers, it is logically orthogonal and provides the basis for semantic interoperability among the functional layers.
² Examples for RDF repositories are Sesame (http://www.openrdf.org/), OWLIM (http://www.ontotext.com/owlim/), Joseki (http://www.joseki.org/), and D2RQ (http://www4.wiwiss.fu-berlin.de/bizer/d2rq/).
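To make the role of the retrieval interfaces more concrete, the following is a minimal Java sketch of how IdReferencedContextRetrieval, GeoReferencedContextRetrieval, and NetworkRetrieval might be declared. The chapter does not specify their signatures, so the method names, parameter types, and the ContextStatement placeholder are assumptions for illustration only.

```java
import java.util.List;

/** Placeholder for a piece of ontology-based context information (e.g. an RDF statement). */
interface ContextStatement { }

/** Layer 1-3 components that serve context information keyed by monitoring subject ID. */
interface IdReferencedContextRetrieval {
    // Return all known (or, on Layer 3, also derived) context statements about a subject.
    List<ContextStatement> retrieveContext(String monitoringSubjectId, String contextType);
}

/** Layer 1 components that serve context information keyed by a geographic coordinate. */
interface GeoReferencedContextRetrieval {
    // Coordinates are given in some spatial reference system, e.g. WGS84 latitude/longitude.
    List<ContextStatement> retrieveContext(double latitude, double longitude, String contextType);
}

/** Layer 2 access to the ontology-based model of the infrastructure network. */
interface NetworkRetrieval {
    // Return the topology elements (by ID) that are geographically relevant for a coordinate.
    List<String> elementsAt(double latitude, double longitude);
}
```

A ContextAssigner-component could then be written against GeoReferencedContextRetrieval and NetworkRetrieval to turn geo-referenced observations into ID-referenced ones, while ContextRepository- and ReasoningNode-components expose IdReferencedContextRetrieval further up the stack.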
5 Reasoning over Distributed Context Information

As an example for the reasoning functionality provided by Layer 3 of the previously introduced system architecture, this section describes a novel approach for ontological reasoning over distributed context information: Context sources in infrastructure networks are spatially distributed and operated by different organizations (compare Sect. 2.1). As a consequence, context information is produced in a distributed manner. For technical and organizational reasons, centralizing it is usually not feasible. Still, performing ontological reasoning on distributed context information should be possible, as this is the prerequisite for deriving the current condition of a monitoring subject as described in Sect. 3.2.

Reasoning over distributed context information can be formalized using the following definitions. A SHIN knowledge base K = (T, A) consists of the terminological box (TBox T) and the assertional box (ABox A). T contains axioms on SHIN concept and role descriptions, while A uses the terminology in T for asserting facts on individuals [see 12]. An interpretation I that satisfies all axioms in T or assertions in A is called a model of T (I |= T) or A (I |= A), respectively. Let NC, NR, and NI be sets of concept, role, and individual names, respectively.

In a Smart Monitoring System, reasoning has to be performed on a distributed knowledge base with a common TBox and a distributed ABox. As context information about the same monitoring subject will be available in more than one knowledge base part, there will be shared individuals, i. e. individuals which occur in more than one knowledge base part.

Definition 5.1 (Distributed Knowledge Base K and Shared Individuals Ind∩(K)) Let T be a TBox and Aj an ABox for 1 ≤ j ≤ n. A distributed knowledge base K is defined as K = (T, A1 ∪ ... ∪ An) with knowledge base parts Kj = (T, Aj), 1 ≤ j ≤ n. An interpretation I is a model of K (i. e. I |= K) iff I |= T and I |= Aj for 1 ≤ j ≤ n. Let Ind∩(K) ⊆ NI be the set of individual names which occur in more than one Kj of K, i. e. Ind∩(K) = ⋃_{1≤k≠l≤n} (Ind(Kk) ∩ Ind(Kl)).

Basic ontological query atoms are role query atoms and concept query atoms. Role query atoms are of the form r(t, t'), concept query atoms of the form C(t), with r a role name, C a concept description, and t, t' ∈ NV ∪ NI, where NV is a set of variable names. Condition-based decision support requires answering conjunctions of these query atoms. The solutions of such conjunctive queries are defined as follows:

Definition 5.2 (Solutions of Conjunctive Query Q) Let Q be a conjunctive query with m distinguished variables v1, ..., vm and I = (Δ^I, ·^I) an interpretation. A total function μ: Var(Q) ∪ Ind(Q) → Δ^I is called an evaluation if μ(a) = a^I holds for all a ∈ Ind(Q). We write I |=μ C(t) if μ(t) ∈ C^I and I |=μ r(t, t') if (μ(t), μ(t')) ∈ r^I. If I |=μ q for all query atoms q ∈ Q, then I |=μ Q. The solutions MQ,K for Q with respect to a distributed knowledge base K are all tuples mQ,K = (a1, ..., am) ∈ NI^m for which, for every model I |= K, there is an evaluation μ with μ(vi) = ai^I, 1 ≤ i ≤ m, such that I |=μ Q.
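As a concrete illustration (the individual names are hypothetical), consider the conjunctive query

Q(v1, v2) = Train(v1) ∧ hasState(v1, v2) ∧ CriticalWheelState(v2)

with distinguished variables v1 and v2. Suppose one knowledge base part asserts Train(train4711) and hasState(train4711, state1), while another part contains the wheel impact features that let the TBox classify state1 as CriticalWheelState. Then (train4711, state1) ∈ MQ,K, because the evaluation μ(v1) = train4711^I, μ(v2) = state1^I satisfies all three query atoms in every model of K, even though no single knowledge base part entails all of them on its own.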
To enable reasoning over a distributed knowledge base K, a sound and complete procedure is required, which determines the solutions of a conjunctive query Q with respect to K.
5.1 Answering Conjunctive Queries

The concept of a distributed reasoning system is introduced as the foundation for a procedure that answers conjunctions of query atoms over distributed knowledge bases. It takes into account the distribution of context information and assumes that each knowledge base part is equipped with reasoning functionality. However, no assumption is made on the employed reasoning technique or implementation, as a distributed reasoning system only defines the interface to the reasoning functionality.

Definition 5.3 (Distributed Reasoning System R) Let R be a distributed reasoning system for a distributed knowledge base K. R consists of n reasoning nodes Rj, 1 ≤ j ≤ n, which are each associated with a knowledge base part Kj. Each reasoning node Rj offers basic DL reasoning services on Kj including consistency check, instance retrieval, concept realization, role retrieval, and role realization.

Assuming a distributed reasoning system, this section develops an algorithm for answering conjunctive queries with respect to the distributed knowledge base. The procedure is based on generating queries to other reasoning nodes. Devising a procedure which is sound and complete as well as efficient requires the assumption of some restrictions on the distribution of the knowledge base (see Sect. 5.1.1). Section 5.1.2 shows how the consistency of a restrictedly distributed knowledge base is checked. Section 5.1.3 develops procedures for answering individual query atoms, while Sect. 5.1.4 combines them for answering conjunctions of query atoms.
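Definition 5.3 only fixes the reasoning services a node has to offer, not how they are implemented. A minimal Java sketch of such an interface could look as follows; the method names and the use of plain strings for concept, role, and individual names are assumptions made here for illustration, not part of the chapter's framework.

```java
import java.util.Map;
import java.util.Set;

/**
 * Reasoning services offered by a reasoning node Rj on its knowledge base part Kj
 * (compare Definition 5.3). Concepts, roles, and individuals are identified by name.
 */
interface ReasoningNode {
    /** Consistency check: is the local knowledge base part consistent? */
    boolean isConsistent();

    /** Instance retrieval: all individuals that are instances of the given concept. */
    Set<String> retrieveInstances(String conceptName);

    /** Concept realization: the most specific named concepts an individual belongs to. */
    Set<String> realizeConcepts(String individualName);

    /** Role retrieval: all pairs of individuals connected by the given role. */
    Set<Map.Entry<String, String>> retrieveRolePairs(String roleName);

    /** Role realization: the most specific roles holding between two individuals. */
    Set<String> realizeRoles(String individualName, String otherIndividualName);
}
```

Each node could implement these services with any DL reasoner over its own ABox; the distributed procedures below rely only on this interface.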
5.1.1 Restrictedly Distributed Knowledge Base

Let K be a distributed knowledge base, i∩ ∈ Ind∩(K) a shared individual, C a concept name, and r a role. b is an r-neighbor of i∩ iff r holds between b and i∩. clos(C) is the smallest set that contains C and is closed under sub-concepts and the negation normal form of C. The closure of K is defined as clos(K) = ⋃_{C(a)∈A} clos(C). K is called a restrictedly distributed knowledge base if it holds that

C(i∩) ∈ Ai for some 1 ≤ i ≤ n ⟹ C(i∩) ∈ Aj for all 1 ≤ j ≤ n   (1)
i∩ has an r-neighbor in K ⟹ (≤ n r) ∉ clos(K)   (2)
i∩ is an r-neighbor in K ⟹ (∀ r.C') ∉ clos(K) for all C'   (3)
As will be shown in the following, these restrictions enable efficient procedures for consistency checking and conjunctive query answering on (restrictedly) distributed knowledge bases. It is important to show that the ontology design pattern presented in Sect. 3.2 is compatible with these restrictions: In a Smart Monitoring System, monitoring subject individuals are shared individuals as features about the same monitoring subject are distributed over multiple knowledge base parts. However, individuals representing features, symptoms, faults, and states are non-shared individuals as they are produced by one monitoring system and managed in exactly one knowledge base part. Restriction (1) is ensured by using a naming service for monitoring subject individuals: This way it is not possible that an individual i is asserted to be a Train in one knowledge base part and to be a Track in another. Restrictions (2) and (3) apply to the roles hasState and refersToMonitoringSubject, respectively. As described in Sect. 3.2, the restricted concept descriptions are not required by the design pattern. In conclusion, the ontology design pattern for infrastructure conditions can be used in a restrictedly distributed knowledge base. As a consequence, the following reasoning procedures can be employed in a Smart Monitoring System.
5.1.2 Checking Restrictedly Distributed Knowledge Base Consistency

The consistency of a restrictedly distributed knowledge base K can be checked by checking the consistency of each individual knowledge base part Kj, 1 ≤ j ≤ n.

Theorem 5.1 (Consistency Check) Kj is consistent for all j ⇐⇒ K is consistent.

Proof. (sketched) A SHIN knowledge base is consistent iff the SHIN tableaux algorithm constructs a complete and clash-free completion forest [12]. To this end, it starts from an initial completion forest and repeatedly applies expansion rules [see 12] until a clash occurs or no more applications are possible. Starting with the initial completion trees for the set of Kj's and K, one can analyze the effects of these expansion rules. It turns out that the restrictions on a restrictedly distributed knowledge base (see Sect. 5.1.1) ensure that the union of the completion forests constructed for the set of Kj's is equivalent to the completion forest constructed for K. As a consequence, a clash occurs in the completion forest for K iff a clash occurs in the completion forests for the set of Kj's. This was to be shown.
5.1.3 Answering Role and Concept Query Atoms

Being able to show the consistency of a distributed knowledge base is the prerequisite for the following procedures for answering query atoms.

A. Answering role query atoms. A (non-transitive) role between two named individuals can only be implied if a role is asserted between them. As a consequence,
it is sufficient to pose the query atom only to those reasoning nodes Rj whose associated knowledge base part Kj contains a relevant (i. e. inverse, sub-role) role assertion between these individuals.

B. Answering concept query atoms for non-shared individuals. For a non-shared individual a, only the reasoning node Rk associated with the knowledge base part Kk with a ∈ Ind(Kk) has to perform the instance check.

Theorem 5.2 (Instance Check) K |= C(a) ⇐⇒ Kj |= C(a) with a ∈ Ind(Kj) \ Ind∩(K) for some 1 ≤ j ≤ n.

Proof. (sketched) K |= C(a) is equivalent to the proposition that K ∪ {¬C(a)} is inconsistent. Checking the consistency of K ∪ {¬C(a)} can be reduced to Theorem 5.1.

C. Answering concept query atoms for shared individuals. Shared individuals require taking into account multiple knowledge base parts. Still, centralizing the complete distributed knowledge base is inefficient and should be avoided. To this end, the so-called knowledge base digest [K]i∩ for the shared individual i∩ is introduced. It concentrates the assertions about i∩ in the different knowledge base parts by asserting the most specific concept msc_b^Kj and the most specific role msr_(i∩,b)^Kj about each of its r-neighbors b. It is easy to see that [K]i∩ is consistent for a consistent K.

Definition 5.4 (Knowledge Base Digest [K]i∩) The knowledge base digest [K]i∩ is an ABox, which is generated from the Kj as follows (with b, b' individuals and r a role):

[K]i∩ = { msc_i∩^Kj(i∩) | i∩ ∈ Ind(Kj) }
        ∪ { msr_(i∩,b)^Kj(i∩, b), msc_b^Kj(b) | b is a neighbor of i∩ in Kj }
        ∪ { b ≠ b' | b, b' are neighbors of i∩ in Kj, (b ≠ b') ∈ Aj }.

Let C∩ and D∩ be concept descriptions of the form

C∩, D∩ → A | C∩ ⊓ D∩ | C∩ ⊔ D∩ | ¬A | ∀r.A | ∃r.A | ≥ n s | ≤ n s

with A ∈ NC, r a role without a shared individual as an r-neighbor, and s a simple role. Then the knowledge base digest is used to answer concept query atoms for a shared individual i∩ as follows:

Theorem 5.3 (Distributed Instance Check) K |= C∩(i∩) ⇐⇒ [K]i∩ |= C∩(i∩).

Proof. (sketched) K |= C∩(i∩) is equivalent to the proposition that K ∪ {¬C∩(i∩)} is inconsistent. Similar to the proof for Theorem 5.1, it can be shown that the SHIN tableaux algorithm constructs a complete and clash-free completion forest for K ∪ {¬C∩(i∩)} iff it constructs a complete and clash-free completion forest for [K]i∩ ∪ {¬C∩(i∩)}.
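Operationally, Theorem 5.3 suggests a simple procedure: gather, from every node that knows the shared individual, the most specific concepts and roles about it and its neighbors, pool them into a small temporary ABox, and run an ordinary instance check on that digest. The following Java sketch outlines the idea; it reuses the hypothetical ReasoningNode interface sketched in Sect. 5.1, and the Digest class, neighborsOf, and localEntails helpers are assumptions rather than the chapter's actual implementation.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the distributed instance check for a shared individual (Theorem 5.3). */
class DistributedInstanceCheck {

    /** A tiny in-memory ABox holding the digest assertions, expressed as strings. */
    static class Digest {
        final Set<String> assertions = new HashSet<>();
        void addConceptAssertion(String concept, String individual) {
            assertions.add(concept + "(" + individual + ")");
        }
        void addRoleAssertion(String role, String subject, String object) {
            assertions.add(role + "(" + subject + ", " + object + ")");
        }
    }

    /** Does K |= C(i) hold for the shared individual i, given the nodes that know i? */
    boolean holds(List<ReasoningNode> nodesKnowingI, String i, String conceptC) {
        Digest digest = new Digest();
        for (ReasoningNode node : nodesKnowingI) {
            // Most specific concepts of i itself in this knowledge base part.
            for (String msc : node.realizeConcepts(i)) {
                digest.addConceptAssertion(msc, i);
            }
            // Most specific concepts and roles of every neighbor of i in this part.
            for (String b : neighborsOf(node, i)) {                 // assumed helper
                for (String msr : node.realizeRoles(i, b)) {
                    digest.addRoleAssertion(msr, i, b);
                }
                for (String msc : node.realizeConcepts(b)) {
                    digest.addConceptAssertion(msc, b);
                }
            }
        }
        // Ordinary local instance check on the digest against the shared TBox.
        return localEntails(digest, conceptC, i);                   // assumed helper
    }

    private Set<String> neighborsOf(ReasoningNode node, String individual) {
        // Hypothetical: would be answered from the node's role assertions about the individual.
        return new HashSet<>();
    }

    private boolean localEntails(Digest digest, String concept, String individual) {
        // Hypothetical: load the digest and the TBox into a DL reasoner and do an instance check.
        return false;
    }
}
```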
Based on the propositions shown in paragraphs A, B, and C, procedures for answering individual role and concept query atoms can be derived.

D. Identifying relevant reasoning nodes. All previously described procedures require the identification of the reasoning nodes (and their knowledge base parts, respectively) which are relevant for a particular query atom. That means that the content of their associated knowledge base parts has to be indexed with respect to query atoms. Different schemes are imaginable. Based on the assumption that each knowledge base part will contain context information about a particular domain and therefore use only a particular subset of concepts and roles from the overall system ontology, a knowledge base part Kj is profiled based on the most specific concepts and roles of its assertions as follows:

Profile(Kj) = { msc_i^Kj | i ∈ Ind(Kj) } ∪ { msr_(i1,i2)^Kj | i1, i2 ∈ Ind(Kj) }

The rationale is that context information that is continuously added to Kj will usually be represented using concepts and roles which are already part of Profile(Kj). Consequently, no update of Profile(Kj) is necessary. Based on the profiles of the knowledge base parts, the reasoning nodes which are relevant for a given query atom are identified based on the relevancy of the profile of their associated knowledge base part:

Kj is relevant for C(t) ⇐⇒ ∃D ∈ Profile(Kj) : T |= D ⊑ C
Kj is relevant for r(t, t') ⇐⇒ ∃s ∈ Profile(Kj) : T |= s ⊑ r

All procedures presented in this section will be combined in the next section into a procedure for answering conjunctions of query atoms.
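The relevance test boils down to subsumption checks against the shared TBox. A hedged sketch of how a node registry might perform it with the OWL API is shown below; the OWLReasoner here is loaded with the system TBox only, and the Profile type and its fields are hypothetical.

```java
import java.util.Set;
import org.semanticweb.owlapi.model.OWLClassExpression;
import org.semanticweb.owlapi.model.OWLDataFactory;
import org.semanticweb.owlapi.model.OWLObjectProperty;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

/** Hypothetical profile of a knowledge base part: its most specific concepts and roles. */
class Profile {
    Set<OWLClassExpression> concepts;
    Set<OWLObjectProperty> roles;
}

class RelevanceCheck {
    private final OWLReasoner tboxReasoner;   // loaded with the shared TBox T
    private final OWLDataFactory factory;

    RelevanceCheck(OWLReasoner tboxReasoner, OWLDataFactory factory) {
        this.tboxReasoner = tboxReasoner;
        this.factory = factory;
    }

    /** Kj is relevant for a concept query atom C(t) iff some D in its profile satisfies T |= D ⊑ C. */
    boolean relevantFor(Profile profile, OWLClassExpression c) {
        return profile.concepts.stream().anyMatch(
                d -> tboxReasoner.isEntailed(factory.getOWLSubClassOfAxiom(d, c)));
    }

    /** Kj is relevant for a role query atom r(t, t') iff some s in its profile satisfies T |= s ⊑ r. */
    boolean relevantFor(Profile profile, OWLObjectProperty r) {
        return profile.roles.stream().anyMatch(
                s -> tboxReasoner.isEntailed(factory.getOWLSubObjectPropertyOfAxiom(s, r)));
    }
}
```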
5.1.4 Answering Conjunctions of Query Atoms

Devising a procedure for answering conjunctions of query atoms, this section builds on the previously developed techniques for answering individual query atoms. Answering conjunctive queries has a long tradition in relational databases. It is based on the relational algebra containing operators for selection, projection, and join. As it is possible to transform a conjunction of ontological query atoms into the relational algebra [3], techniques from relational databases can be adopted. Three fundamental phases can be distinguished [11]:

1. Parsing the query. The conjunctive query is parsed and represented in relational algebra. The algebraic expression can also be represented as a tree: Each leaf represents a knowledge base part, while each inner node represents an operator (selection σ, projection π, or join ⋈). Arcs indicate the data flow between operators. The result of this phase is a relational operator tree.
2. Generating and optimizing the query execution plan. The relational operator tree is transformed into a physical operator tree (query execution plan). There
is a multitude of different query execution plans for a given relational operator tree. It is the responsibility of the optimization step to select a query execution plan which can efficiently be executed. Join operators and projection operators can be realized using classical implementations. Selection operators, which represent concept and role query atoms, have to be realized according to the previously described reasoning procedures:
   • Role query atoms: Reasoning nodes with relevant knowledge base parts are identified, and a query containing the role query atom is sent to them.
   • Concept query atoms for non-shared individuals: Reasoning nodes with relevant knowledge base parts are identified, and a query containing the concept query atom is sent to them.
   • Concept query atoms for shared individuals: Reasoning nodes with knowledge base parts containing assertions about the shared individual are identified. A query for generating the knowledge base digest is sent to each of them, and the knowledge base digest is locally aggregated. The concept query atom is answered with respect to this knowledge base digest.
3. Executing the query execution plan. The generated query execution plan is executed in order to determine the solutions of the conjunctive query. Different approaches such as the iterator model can be used for this [11].
5.2 Implementation as a Framework

This section presents a framework-based implementation of the reasoning procedures developed in Sect. 5.1. As the system ontology uses the SHIN Description Logic, it is represented using the W3C standard OWL DL. Queries are formulated using SPARQL, which is another W3C standard [20].

The distributed reasoning system is realized as a framework of reasoning node components (compare Layer 3 in Fig. 3). Each reasoning node is represented by a ReasoningNode-component. It provides the IdReferencedContextRetrieval interface to components on Layer 4. Knowledge base parts are indexed by a ReasoningNodeRegistry-component. The ReasoningNode-component is decomposed into the sub-components QueryExecution- and Reasoner-component (Fig. 4). It uses the DistributedReasoning interface for submitting queries to other ReasoningNode-components.

QueryExecution-Component. The QueryExecution-component accepts conjunctions of query atoms formulated in SPARQL and coordinates their processing by generating, optimizing, and executing the query execution plan. The implementation is based on ARQ, the SPARQL processor of Hewlett Packard's Semantic Web Toolkit Jena³.
Fig. 4 Decomposition of the ReasoningNode-component.
ARQ parses SPARQL queries and generates a relational operator tree. The generation of the relational operator tree has been extended with operators for local, remote, and distributed reasoning: A local reasoning operator represents role query atoms and concept query atoms for non-shared individuals on the local knowledge base part; a remote reasoning operator represents the same query atoms on the knowledge base part of a remote reasoning node; a distributed reasoning operator represents query atoms for shared individuals on the distributed knowledge base. Based on a query optimization framework for Jena [22], optimization heuristics for answering conjunctions of query atoms have been developed. Finally, in order to execute a query execution plan, appropriate iterators (i. e. implementations of the operators) had to be provided. They perform the procedures for local, remote, and distributed answering of query atoms as described in the previous section. While the iterators for remote and distributed reasoning submit subqueries to other reasoning nodes through the DistributedReasoning interface, the local reasoning iterator performs reasoning on the local knowledge base part. To this end, it uses the Reasoner-component.

Reasoner-Component. The Reasoner-component provides basic reasoning services on the local knowledge base part. It can be realized using existing reasoner implementations such as Pellet and RacerPro⁴. With only the interface being specified, implementations can easily be replaced.
³ Jena Semantic Web Framework: http://jena.sourceforge.net/
⁴ Pellet: http://pellet.owldl.com/, RacerPro: http://www.racer-systems.com/
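For orientation, the following is a minimal sketch of how a conjunctive query could be posed to a single reasoning node with Jena and ARQ. It uses the Jena 2.x package layout (com.hp.hpl.jena.*) matching the toolkit cited above; current Apache Jena uses org.apache.jena.* instead. The namespace, file name, and the use of Jena's built-in rule reasoner (rather than Pellet or RacerPro, and without the remote and distributed operators described above) are simplifications for illustration.

```java
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class ReasoningNodeQuerySketch {
    public static void main(String[] args) {
        // Local knowledge base part: ontology (TBox) plus context information (ABox),
        // wrapped in a model backed by one of Jena's built-in rule reasoners.
        OntModel kb = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF);
        kb.read("file:railway-monitoring-part.owl");   // hypothetical file name

        // Conjunctive query: trains together with their critical wheel states.
        String sparql =
            "PREFIX mon: <http://example.org/railway-monitoring#>\n" +
            "SELECT ?train ?state WHERE {\n" +
            "  ?train a mon:Train .\n" +
            "  ?train mon:hasState ?state .\n" +
            "  ?state a mon:CriticalWheelState .\n" +
            "}";

        // ARQ parses the query and builds the (relational) operator tree internally.
        Query query = QueryFactory.create(sparql);
        QueryExecution exec = QueryExecutionFactory.create(query, kb);
        try {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("train") + " has critical wheel state " + row.get("state"));
            }
        } finally {
            exec.close();
        }
    }
}
```

In the actual framework, the remote and distributed reasoning operators described above would intercept parts of such a query and forward them to other ReasoningNode-components.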
Fig. 5 Deployment of a Smart Monitoring System for the European railway infrastructure (Layers 2 and 3) in the context of the InteGRail research project.
6 European Railway Monitoring as a Case Study

The presented approach to Smart Monitoring has been applied to monitoring the European railway infrastructure in the context of the European research project InteGRail [6].

Application Scenarios. Among the prototypically realized Smart Monitoring applications are condition-based maintenance of tracks and condition-based maintenance of trains. Condition-based maintenance of tracks requires tracks to be modeled as monitoring subjects. According to the design pattern, the system ontology also contains specializations of these concepts, which are defined with respect to states, faults, symptoms, and features from different domains. Posing a query about these concepts to a component on Layer 3 causes the distributed reasoning system to automatically take into account distributed context information, in this case context information about track geometry and rail profile. The resulting most specific concept indicates the maintenance priority for the given track. It is appropriately visualized to the user. Condition-based maintenance of trains uses trains as monitoring subjects. Specialized concepts, which indicate maintenance priority, are defined with respect to the wheel impact load and hot axle box domains. They are automatically taken into account by the distributed reasoning system. The result is visualized to the user.

Railway Monitoring Ontology. In order to realize the previously described application scenarios, a system ontology for European railway monitoring has been engineered. Domain experts have used the proposed design pattern to model the following domains: wheel impact load, hot axle box, car doors, active standby, locomotive state, track geometry, and rail profile. The resulting integrated ontology contains 748 concepts and 162 roles in total.
System Deployment. Components on Layers 1 to 4 of the Smart Monitoring System architecture have been deployed all over Europe. For the sake of clarity, Fig. 5 shows only the ContextRepository-components on Layer 2 and the ReasoningNode-components on Layer 3. On Layer 1, ContextProvider-components have been developed for wheel impact load detectors, hot axle box detectors, door control units, active standby monitoring systems, locomotive status systems and measurement systems for track geometry and rail profile. Context information about each domain is collected and archived by ContextRepository-components on Layer 2 as displayed in Fig. 5. Each of them is equipped with a ReasoningNode-component, which together constitute a distributed reasoning system on Layer 3. Finally, the previously described demo applications have been implemented as a single component each on Layer 4. They base their functionality on the reasoning services provided by Layer 3.
7 Discussion and Conclusion

The basic idea of this work was to understand infrastructure monitoring as a Context-Aware System: The physical infrastructure, being richly equipped with sensing systems, represents a Smart Environment. Making intelligent use of this monitoring data for condition-based maintenance requires Ambient Intelligence techniques. As a consequence, this work investigated the application of the formal knowledge-based approach of Context-Aware Systems for Smart Monitoring of physical infrastructures. It enables the integration of context information from different sources in a consistent manner, the automation of query answering, and the handling of incomplete information. The availability of an ontological knowledge model both documents system behavior and facilitates maintenance.

For knowledge representation, the Semantic Web standard OWL DL and, more precisely, the Description Logic SHIN have been chosen: SHIN represents a very expressive DL, which is known to be decidable. Adopting OWL DL greatly facilitated the implementation as a rich software ecosystem including editors, repositories, and reasoners is available for OWL DL. Still, engineering ontologies with DL is not straightforward. However, the presented design pattern enables domain experts to model their domains in an extensible, maintainable, and reason-able manner. This way, the first comprehensive and consistent ontology for railway monitoring has been engineered.

For leveraging the ontological modeling of context information and infrastructure conditions in a Smart Monitoring System, a flexible and extensible system architecture was developed: Any context source whose information can be represented with respect to an ontology can be implemented as a component on Layer 1. Managing ontology-based data is handled by Layer 2. On Layer 3, arbitrary techniques for processing the available context information can be realized. This way, semantically-annotated context information will be reused. In particular, DL-based reasoning has been extended to distributed environments. Processing techniques can
again be reused by monitoring applications on Layer 4. This way, different views for different actors are provided. With respect to reasoning on ontological context information, efficiency is an issue: As the computational complexity is high, the most efficient approach is to keep knowledge bases small. This can be achieved by distributing the reasoning. Additionally, specialized techniques can be developed. One example is reasoning with respect to the topology of an infrastructure network. In conclusion, a knowledge-based approach to infrastructure monitoring promises a multitude of benefits. This chapter developed several concepts for realizing them. Their practical applicability was demonstrated by a case study on European railway monitoring.
References

[1] Baader F, Calvanese D, McGuinness D, Nardi D, Patel-Schneider P (2003) The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, New York, NY, USA
[2] Bechhofer S, van Harmelen F, Hendler J, Horrocks I, McGuinness DL, Patel-Schneider PF, Stein LA (2004) OWL Web Ontology Language Reference. W3C Recommendation
[3] Cyganiak R (2005) A Relational Algebra for SPARQL. HPL-2005-170, Digital Media Systems Laboratory, HP Laboratories Bristol
[4] Dey AK (2001) Understanding and using context. Personal and Ubiquitous Computing 5(1):4–7
[5] European Commission, Directorate-General for Research, Sustainable Energy Systems (2007) Strategic Research Agenda for Europe's Electricity Networks of the Future. EUR 22580
[6] European Commission-funded Research Project in FP6 (2005) InteGRail – Intelligent Integration of Railway Systems. http://www.integrail.info/
[7] European Rail Research Advisory Council (2007) Strategic rail research agenda 2020
[8] Feng J, Smith J, Wu Q, Fitch J (2005) Condition Assessment of Power System Apparatuses Using Ontology Systems. In: Proceedings of Transmission and Distribution Conference, IEEE/PES, Dalian, China
[9] Fuchs F (2008) Semantische Modellierung und Reasoning für Kontextinformationen in Infrastrukturnetzen (Semantic Modelling and Reasoning for Context Information in Infrastructure Networks). PhD thesis, Ludwig-Maximilians-Universität München
[10] Fuchs F, Berndl D, Treu G (2007) Towards Entity-centric Wide-area Context Discovery. In: Proceedings of MDM 2007, Workshop on Mobile Services-oriented Architectures and Ontologies (MoSO07), IEEE Computer Society, Mannheim, Germany
[11] Graefe G (1993) Query Evaluation Techniques for Large Databases. ACM Computing Surveys 25(2):73–169
[12] Horrocks I, Sattler U, Tobies S (2000) Reasoning with Individuals for the Description Logic SHIQ. In: Proceedings of 17th International Conference on Automated Deduction (CADE), Springer-Verlag, London, UK, pp 482–496
[13] Kezunovic M, Ren Z, Latisko G (2005) Automated Monitoring and Analysis of Circuit Breaker Operation. Transactions on Power Delivery 20(3):1910–1918
[14] Kreiss D (2003) Non-operational Data: The Untapped Value of Substation Automation. Utility Automation & Engineering / T & D 8(5)
[15] Krummenacher R, Strang T (2007) Ontology-Based Context Modeling. In: Proceedings of Workshop on Context-Aware Proactive Systems
[16] Lagnebäck R (2007) Evaluation of Wayside Condition Monitoring Technologies for Condition-based Maintenance of Railway Vehicles. PhD thesis, Luleå University of Technology
[17] Maly T, Rumpler M, Schweinzer H, Schoebel A (2005) New Development of an Overall Train Inspection System for Increased Operational Safety. In: Proceedings of Intelligent Transportation Systems, IEEE
[18] Mieth S, Fuchs F, Stoffel EP, Weiss D (2008) Reasoning on Geo-Referenced Sensor Data in Infrastructure Networks. In: Proceedings of the 11th International Conference on Geographic Information Science, Workshop on Semantic Web meets Geospatial Applications
[19] Ollier BD (2006) Intelligent Infrastructure the Business Challenge. In: Proceedings of IET International Conference on Railway Condition Monitoring, IET, Birmingham, UK, pp 1–6
[20] Prud'hommeaux E, Seaborne A (2008) SPARQL Query Language for RDF. W3C Recommendation: http://www.w3.org/TR/rdf-sparql-query/
[21] Schröder A, Laresgoiti I, Werlen K, von der Brelie BS, Schnettler A (2007) Intelligent Self-Describing Power Grids. In: Proceedings of the Conference on Electricity Distribution, CIRED, Vienna, Austria
[22] Stocker M, Seaborne A (2007) ARQo: The Architecture for an ARQ Static Query Optimizer. Tech. Rep. HPL-2007-92, Digital Media Systems Laboratory, HP Laboratories Bristol
[23] Svatek V (2004) Design Patterns for Semantic Web Ontologies: Motivation and Discussion. In: Proceedings of 7th Conference on Business Information Systems, Poznań
[24] Terziyan V, Katasonovic A (2007) Global Understanding Environment: Applying Semantic Web to Industrial Automation. In: Cardoso J, Hepp M, Lytras M (eds) Real-world Applications of Semantic Web Technology and Ontologies, Springer
[25] Uslar M (2005) Semantic Interoperability Within the Power Systems Domain. In: Proceedings of the 1st International Workshop on Interoperability of Heterogeneous Information Systems (IHIS), ACM Press, New York, NY, USA, pp 39–46
[26] Venkatasubramanian V, Rengaswamy R, Yin K, Kavuri S (2003) A Review of Process Fault Detection and Diagnosis. Part I: Quantitative Model-based Methods. Computers and Chemical Engineering 27:293–311
[27] Wang H, Zimmermann R, Ku WS (2005) ASPEN: An Adaptive Spatial Peer-to-Peer Network. In: Proceedings of the 13th International Workshop on Geographic Information Systems (GIS), ACM, New York, NY, USA, pp 230–239
Spatio-Temporal Reasoning and Context Awareness

Hans W. Guesgen and Stephen Marsland

Hans W. Guesgen, School of Engineering and Advanced Technology, Massey University, Palmerston North, New Zealand, e-mail: [email protected]
Stephen Marsland, School of Engineering and Advanced Technology, Massey University, Palmerston North, New Zealand, e-mail: [email protected]
1 Introduction

Smart homes provide many research challenges, but some of the most interesting ones are in dealing with data that monitors human behaviour and that is inherently both spatial and temporal in nature. This means that context becomes all important: a person lying down in front of the fireplace could be perfectly normal behaviour if it was cold and the fire was on, but otherwise it is unusual. In this example, the context can include temporal resolution on various scales (it is winter and therefore probably cold, it is not nighttime when the person would be expected to be in bed rather than the living room) as well as spatial (the person is lying in front of the fireplace) and extra information such as whether or not the fire is lit. It could also include information about how they reached their current situation: if they went from standing to lying very suddenly there would be rather more cause for concern than if they first knelt down and then lowered themselves onto the floor. Representing all of these different temporal and spatial aspects together is a major challenge for smart home research.

In this chapter we will provide an overview of some of the methodologies that can be used to deal with these problems. We will also outline our own research agenda in the Massey University Smart Environments (MUSE) group. Our interpretation of the smart home can be seen in Figure 1. We assume that unspecified sensors make observations of the house, and that changes in the observations, and hence the output, of these sensors in some way represent the behaviour of the human inhabitants.
[email protected] Stephen Marsland School of Engineering and Advanced Technology, Massey University, Palmerston North, New Zealand, e-mail:
[email protected] H. Nakashima et al. (eds.), Handbook of Ambient Intelligence and Smart Environments, DOI 10.1007/978-0-387-93808-0_23, © Springer Science+Business Media, LLC 2010
609
610
Hans W. Guesgen and Stephen Marsland Human Activity
Sensory Observations
Human
Hardware
Data (Tokens)
Patterns
AI - data mining
Behaviours
Recognition and Anomaly Detection
Appropriate Response
AI - machine learning
Fig. 1 The smart home problem consists of interpreting sensory observations caused by human behaviour in order to recognise human activity and respond appropriately.
is inherent in interpreting sensory output; rather, we assume that these signals are presented to the smart home controller in the form of 'tokens'. These could be a direct representation of current sensor states, polled at regular time intervals, or indications that something has changed (in the former case the output could be 'switch is off', 'switch is off', 'switch is off', and so on, while in the latter there would be no token until the switch was moved). However, in either case it is these tokens that form the basis of the smart home reasoning process.

Our interest in the smart home begins with this question of how to represent events and sensory signals by tokens in the most useful way. We are not directly interested in what the sensors are, or how their signals are processed, but rather in how to extract useful data from this token stream and interpret that data; as is indicated in Figure 1, we see these as data mining and artificial intelligence problems, and all of the methods that we present here are from these areas. The more we have looked into the data that is presented in smart home problems, the more important context awareness and spatio-temporal reasoning have seemed to us. Actions and events do not occur in isolation, and things that seem odd as individual actions take on different interpretations once fleshed out with further information: a person balancing on one leg with their hand clutching the other leg is not unusual in the course of a yoga exercise program, but is otherwise rather unexpected.

Before we begin to describe methods that deal with these problems of spatio-temporal reasoning and context awareness, we describe our motivation for studying smart homes at all, which is centred around our target application area.
2 Motivation

A question often raised in connection with ambient intelligence, smart homes, context awareness, and the detection of human behaviour is whether the sacrifice of privacy is warranted by the benefits that the technology has to offer. The answer to this question lies in a problem that developed countries around the world are facing,
viz. the problem of an aging population. Life expectancy is higher than ever before, and so is the expectation of a high-quality, independent lifestyle throughout one's entire existence, irrespective of age or illnesses such as Parkinson's or Alzheimer's disease. Unfortunately, this expectation is not always met. Rather than moving an individual to a nursing home or employing the continuous support of a caregiver, an ambient intelligence can help people to live independently for longer by monitoring their behaviour in a non-obtrusive way, and alerting a caregiver if a critical situation arises.

We can separate elderly people who desire to live an independent life into three groups. First, there are those with mobility and other problems associated with some sort of physical handicap (e.g., loss of a limb, severe arthritis). Second is a group who are of normal mental and physical health for their age. However, even healthy aging tends to be accompanied by at least some cognitive impairment (mainly involving memory problems and slowed processing speed). Third, there are people who suffer more than normal cognitive impairment for their age, usually involving dementia [29] (poor intellectual functioning involving impairments in memory, reasoning, and judgement). Overall, between 40% and 60% of the independently living elderly suffer from some degree of cognitive impairment [37].

There exists a very large set of everyday domestic activities that this group may have difficulty carrying out. These include turning electrical or gas-driven appliances on and off, turning taps on and off, making tea or coffee, ironing, laundry, remembering to wash hands, and turning lights on and off. Quinn et al. [37] found that bathing, personal hygiene, and getting dressed were major problems for their sample of 80 elderly people (average age 79), and 82% needed assistance with managing their medication. Many of the problems listed above can be monitored, and corrected, with a (relatively) unobtrusive ambient intelligence, and such a system could also alert caregivers to dangerous situations that it could not handle itself. Furthermore, in cases where two elderly people live together, one is often the sole caregiver for a less fortunate partner. This can be an incredibly stressful role for an aging caregiver, often resulting in serious illness, or even death [20]. An ambient intelligence, with its ability to monitor and intervene in many situations, could be of great assistance to an overworked, elderly caregiver.

The upshot of this discussion is that we are aiming for an unobtrusive monitoring system that does not necessarily interact directly with the inhabitants of the smart home. This may well reduce the resolution of the sensors that can be used, making the problem of identifying behaviours more difficult. However, it should also make the acceptance of the system rather easier for the home inhabitants, particularly if it does little more than passively observe and alert a caregiver to a potential problem, as this can be considerably less intrusive than unwanted check-up visits from an uninformed carer.

Note also that we are considering cognitive impairment, such as is typically found in Alzheimer's and similar degenerative diseases. This means that behaviours can change over time and the system therefore needs to adapt to identify what is occurring in a dynamic learning environment, something that is a challenge for many
artificial intelligence methods. Just adapting to a new activity is not the solution, because this activity might be abnormal, even if it is repetitive. Falling down should not be interpreted as normal behaviour, even if it is happening frequently, whereas preparing tea instead of the usual coffee is most likely not something a caregiver has to worry about.

There are two other potential problems with this type of smart home environment, which are rather common to all medical and health-related analyses. These are the vexed issue of false positives, and the fact that poor health outcomes are (fortunately) rather rare in the general population. False positives are a major reason why genuine emergencies are missed. In a modern-day version of the boy who cried wolf, a machine that sets off an alarm every 10 minutes is ignored when the alarm is real. A system that alerts the caregiver twice a day to false positives is worse than having no system at all. The other side of this coin is that the inhabitant of the house performing some unusual or potentially dangerous behaviour is presumably rather rare, especially since the system is observing their activity 24 hours a day. In order to achieve a 99.99% success rate, all that the smart home has to do is declare that the inhabitant never does anything abnormal. From the application point of view, this obviously removes the need for the system at all.

For machine learning methods based on real data it is a challenge to solve this problem, and there are two typical approaches. The first is to select the training data very carefully to ensure that only things of interest are seen, so that the system only learns to recognise things of real interest. There are several problems with this, including the huge amount of human involvement required, and the fact that the system will not identify a potentially dangerous activity that was not introduced in training. In addition, the bias towards 'healthy' observations has another potential problem, which is that almost all of the training data shows people behaving normally, and therefore identifying abnormal behaviour, which is a potential indicator of ill health, is rather difficult. We prefer to consider the intelligence system to be always learning, identifying for itself from the data stream when behaviours are potentially risky.

The second approach, which is rather more useful to us, is to modify the error criterion so that success for the system is not simply the number of correct predictions that it makes. If the concept of risk is introduced, where the risk of annoying the caregiver can be balanced against the risk of missing a potential illness, then the success of the system can be more usefully measured, making it more effective. This can be done by something as simple as a 'loss matrix', which specifies the cost of making an incorrect prediction for the various possible outputs [7, 26] (a small numerical sketch of this idea is given below).

Having highlighted these issues, we next describe a number of different techniques from the literature that may be useful in solving some of these problems, and discuss possible applications of them in the smart environment arena, beginning with methods to reason about human behaviours.
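To make the loss matrix idea concrete, here is a minimal numerical sketch (our own illustration, not taken from [7] or [26]; the class names and costs are invented) of choosing the prediction that minimises expected loss rather than simply maximising accuracy.

```python
import numpy as np

# Rows: true situation, columns: predicted situation.
# The cost of missing a genuine emergency is set much higher than the cost
# of a false alarm.  These numbers are purely illustrative.
classes = ["normal", "emergency"]
loss = np.array([[0.0, 1.0],     # true normal:    correct, false alarm
                 [50.0, 0.0]])   # true emergency: missed alarm, correct

def decide(posterior):
    """Pick the prediction that minimises expected loss.

    posterior: array of P(true class | observations), e.g. from a classifier.
    """
    expected_loss = posterior @ loss    # expected cost of each possible prediction
    return classes[int(np.argmin(expected_loss))]

print(decide(np.array([0.96, 0.04])))     # -> 'emergency'
print(decide(np.array([0.999, 0.001])))   # -> 'normal'
```

With these (arbitrary) costs, even a small posterior probability of an emergency is enough to justify alerting the caregiver, which is exactly the trade-off between annoyance and missed illness described above.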
3 Sensory Input

While we are not directly interested in the sensors themselves, it is worth considering to what extent they provide direct input concerning spatial and temporal context. Based on our thesis that a smart home should be minimally obtrusive, we assume that the sensors will be minimally informative. Many of them could be as simple as location sensors, sensors detecting the use of electrical devices, etc. These sensors have the benefit of being cheap and easy to install, but they usually provide very limited information compared to, for example, a camera. This has the benefit that very limited preprocessing is required, but the disadvantage that not much information is available for the ambient intelligence. Consider the difference between an image of a person moving round the kitchen, filling and boiling the kettle, placing a teabag in a cup, adding milk, and pouring boiling water on, as opposed to the fact that electricity was used from a particular plug point and the fridge opened.

The next question is precisely what form the data should be presented in. We assume that the system has some information about the location of the sensor (at least, the room it is located in) and the time at which it was activated, but there are a variety of different options for how the sensor data can be presented. At one extreme of the options, the intelligence periodically polls each sensor and gets a reading from it, so that the sensors are always reporting their state, even when there is nobody in that particular room. In this case, the output from a power point could be a series of readings of the values of the power drawn from the socket. While this could be useful (since it might enable the actual device being used to be identified), we prefer to think that this kind of processing is built into the sensor, and that the output presented is a token representing the fact that a device (whether specified or not) is being powered from that particular socket. Given that we are assuming that tokens are presented by each sensor, there are still different options, but our default hypothesis is that each sensor presents two different tokens, corresponding to 'sensor has started' and 'sensor has stopped', which are presented in an interrupt-like fashion by the various sensors. While this may not be suitable for certain types of sensor, we generally assume that it is useful.
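As an illustration of what such a token stream might look like, the following sketch shows one possible encoding; the sensor names, fields, and timestamps are hypothetical and not a prescribed format.

```python
from dataclasses import dataclass
from enum import Enum

class Event(Enum):
    STARTED = "sensor has started"
    STOPPED = "sensor has stopped"

@dataclass(frozen=True)
class Token:
    sensor: str    # e.g. "kitchen.kettle_socket" (invented identifier)
    room: str      # coarse spatial context known at installation time
    event: Event   # interrupt-like start/stop rather than polled readings
    time: float    # seconds since midnight, say

# A fragment of a token stream for boiling the kettle one morning:
stream = [
    Token("kitchen.kettle_socket", "kitchen", Event.STARTED, 8 * 3600 + 120),
    Token("kitchen.fridge_door",   "kitchen", Event.STARTED, 8 * 3600 + 150),
    Token("kitchen.fridge_door",   "kitchen", Event.STOPPED, 8 * 3600 + 158),
    Token("kitchen.kettle_socket", "kitchen", Event.STOPPED, 8 * 3600 + 300),
]

for token in stream:
    print(token.time, token.room, token.sensor, token.event.value)
```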
4 Spatio-Temporal Reasoning

As was pointed out before, interpreting human behaviour in context involves reasoning about space and time. For example, preparing a meal at noon in the kitchen is usually a perfectly normal behaviour, but if the same activity occurs at 3 in the morning in the garage, then it is a behaviour that needs some special attention. When referring to spatial and temporal reasoning, we mean reasoning in a similar way to that in which a person would deal with time and space in everyday life, and not the way a physicist, for example, would describe locations, regions, time points, intervals, events, etc. and reason about them. It has been argued in AI that the layperson's everyday form of spatio-temporal reasoning is of a qualitative,
rather than a quantitative, nature. We are not usually interested in precise descriptions of space and time. Coarse and vague qualitative representations frequently suffice to deal with the problems that we want to solve. For example, to know that the meal is prepared in the kitchen (rather than knowing the exact coordinates of this activity) and at lunchtime (rather than at 12:03:37) is often enough to decide whether the behaviour is normal or not.

Most qualitative approaches to spatial and temporal reasoning are based on relations between objects (such as regions or time intervals). They are fundamentally different from the approaches developed in the area of computational geometry [5, 36]. Of the temporal qualitative approaches, two are dominant and reappear frequently in different scenarios: Allen's temporal logic [2] and the point algebra [50]. The first uses intervals to describe specific events, and relations between intervals to describe interdependencies between events (see Figure 2). For example, if the event of the door to the room being opened occurs before the event of the person entering the room, and if I1 denotes the time interval in which the door-opening event occurs and I2 the one for the person-in-room event, then we would have the relation I1 < I2.
[Figure 2 lists Allen's thirteen atomic relations between two intervals I1 and I2: I1 < I2 (before) / I2 > I1 (after), I1 m I2 (meets) / I2 mi I1 (met by), I1 o I2 (overlaps) / I2 oi I1 (overlapped by), I1 s I2 (starts) / I2 si I1 (started by), I1 d I2 (during) / I2 di I1 (contains), I1 f I2 (finishes) / I2 fi I1 (finished by), and I1 = I2 (equals).]
Fig. 2 Allen's thirteen atomic relations.

Reasoning about the interdependencies between events is performed by applying a composition (or transitivity) table to pairs of interval relations: given the relation between I1 and I2 and the relation between I2 and I3, the table lists the possible relations between I1 and I3 (see Figure 3). The point algebra, on the other hand, is based on the three possible relations between time points (see Figure 4), where the points can be interpreted as the beginnings and endings of
the intervals. The interval relations that can be expressed in terms of point relations are called pointisable interval relations. Both Allen's interval logic and Vilain and Kautz's point algebra can be used for spatial reasoning by interpreting intervals as one-dimensional objects and time points as locations in space, respectively. For example, if a person (O1) is standing left of a box (O2) and if the box contains a book (O3), then we can conclude that the person is standing left of the book:
from O1 < O2 and O2 di O3, the composition table in Figure 3 yields O1 < O3.

[The full 13 x 13 composition table is not reproduced here.]
Fig. 3 Allen's composition table. The entry at row r1 and column r2 in the table denotes the possible relations between O1 and O3, assuming that O1 r1 O2 and O2 r2 O3.
Fig. 4 The three different point relations from the point algebra: P1 < P2 (P1 precedes P2), P1 = P2 (P1 same as P2), and P1 > P2 (P1 follows P2).

Spatial reasoning in more than one dimension can be performed by considering each spatial axis separately. Consider, for example, Figure 5, which shows three boxes (O1, O2, and O3) arranged on a flat surface such as a carpet.

Fig. 5 Three boxes arranged on a flat surface.

When
describing the spatial relations between the boxes, we can split each relation into three components, each of which describes an aspect of the relation with respect to a particular axis: the first component with respect to a horizontal axis, the second with respect to a vertical axis, and the third with respect to a depth axis: O1 ⟨d, mi, d⟩ O2 and O2 ⟨<, s, o⟩ O3. Applying the transitivity table to the components of the relation triples yields an additional spatial description: O1 ⟨<, {d, f, oi}, {<, m, o, s, d}⟩ O3.

We argue that the topological aspects of space are the most relevant ones for modelling human behaviour in an ambient intelligence. For example, to determine what activity a person might be involved in, it is useful to know which room that
person is in (although not necessarily the person's exact position in that room). Or, to determine the possible movements of a person, it is useful to know which rooms are connected to which other rooms. We will therefore focus on the topological aspects of space in the following, introducing one topological spatial reasoning calculus as an example: the Region Connection Calculus or RCC [39].

The basis of the RCC theory is a reflexive and symmetric relation, called the connection relation, which satisfies the following axioms:

1. For each region X: C(X, X)
2. For each pair of regions X, Y: C(X, Y) → C(Y, X)

From this relation, additional relations can be derived, which include the eight jointly exhaustive and pairwise disjoint RCC8 relations, illustrated in Figure 6:

RCC8 = {DC, EC, PO, EQ, TPP, TPPi, NTPP, NTPPi}.

Reasoning about space is achieved in the RCC theory by applying a composition table to pairs of relations, similar to the composition table in Allen's interval logic. Given the relation R1 between the regions X and Y, and the relation R2 between the regions Y and Z, the composition table determines the relation R3 between the regions X and Z. For example, if the region that the person is standing in (R1) is disconnected from the region occupied by the box (R2), and if the region taken up by the book (R3) is a tangential proper part of R2, then we can conclude that R1 and R3 are disconnected: from DC(R1, R2) and TPPi(R2, R3) it follows that DC(R1, R3).

The RCC theory is not the only approach to reasoning about topological relations. Egenhofer and Franzosa [10] consider the intersections of boundaries and interiors of pairs of regions and derive a formalisation from that, called the 4-intersection calculus (and, by adding the complements of the regions to the intersections, the 9-intersection calculus).
[Figure 6 illustrates the RCC8 relations: DC(X,Y): X disconnected from Y; EC(X,Y): X externally connected to Y; PO(X,Y): X partially overlaps Y; EQ(X,Y): X identical with Y; TPP(X,Y): X tangential proper part of Y; TPPi(X,Y): Y tangential proper part of X; NTPP(X,Y): X nontangential proper part of Y; NTPPi(X,Y): Y nontangential proper part of X.]
Fig. 6 An illustration of the RCC8 relations.

5 Symbolic Approaches

Since its beginning, the area of artificial intelligence has been split into two camps: symbolic and subsymbolic. Researchers from the symbolic camp base their work on the physical symbol system hypothesis, which states that a "physical symbol system has the necessary and sufficient means for intelligent action" [33]. In the context of detecting behaviour from patterns, this means the explicit representation of actions and events, together with the ramifications of these. Although there are various ways to achieve this, one approach has gained frequent attention in AI: the situation calculus [30]. This calculus is a logic-based approach that uses a state description in its formulas to describe what happens when a particular action is
performed. For each action, an axiom is provided that specifies when the action is possible, and another axiom that specifies the effects of the action. For example, if the kettle is not empty in a situation s, then it is possible to heat the water in the kettle. The heating action results in a new situation in which the kettle contains hot water. Formally:
Possibility axiom: Kettle(K) ∧ ¬Empty(K, s) ⇒ Poss(Heat(K, s))
Effect axiom: Poss(Heat(K, s)) ⇒ Hot(K, Result(Heat(K, s)))

The function Result denotes the situation that results from situation s if the kettle is heated. Functions and predicates that can change from one state to another are called fluents (e.g., the predicate Hot), whereas predicates and functions that don't change are called atemporal or eternal (e.g., Kettle(K)). The effect axioms specify which fluents change from one state to another when a particular action is performed. They do not specify the fluents that stay the same. The problem of representing these is commonly referred to as the frame problem, and it can be addressed by writing frame axioms:

Frame axiom: Door(D) ∧ ¬Open(D, s) ⇒ ¬Open(D, Result(Heat(K, s)))

The situation calculus is commonly used to perform planning tasks. Given a set of possible actions, a constructive inference engine can then be used to find a sequence of actions that achieves a desired effect. For example, given actions like filling a kettle with water, heating the kettle, putting a tea bag into a cup, pouring water into the cup, etc., the system can assemble them in such a way that the task of making tea is achieved. For the purpose of this chapter, we are interested in the inverse problem, which is referred to as the projection task: given a sequence of actions, what is the outcome of the sequence?

Although, at first glance, the projection task seems to be trivial, this is unfortunately not the case if we attempt to go beyond purely describing the state of the fluents. For example, it is not very useful to know that after performing the aforementioned actions, the kettle is empty, the tea bag is in the cup, the cup is filled with water, etc. What we rather want is to associate a symbol, in this case TeaMaking, with these actions. A straightforward solution to this problem is to keep a database that associates action sequences with their interpretations. Such an approach is feasible if each interpretation is caused by one or very few sequences of actions, which generally is not the case. There are many ways to make tea, like using a pot on the stove rather than a kettle, using loose tea leaves instead of a tea bag, replacing part of the water with rum, etc. Moreover, a sequence of actions can be interleaved with another sequence of actions. For example, while waiting for the water to boil, we might want to prepare a sandwich to go with the tea.

Another shortcoming of this approach is the lack of an explicit representation of time and space. Although it is possible to use temporal or spatial references instead of states in the axioms of the calculus (like Happens(Heat(K), [10:00, 10:10])), thereby turning the situation calculus into an event calculus, the resulting calculus still does not automatically inherit the reasoning techniques known for Allen's temporal algebra or the Region Connection Calculus. Bhatt and Loke [6] have suggested a different approach, which keeps the causal theory (represented in the situation calculus) separate from the spatial theory (represented, for example, in RCC-8). This approach provides a cleaner and more powerful model, but it also puts an extra burden onto the modeller, who has to develop the theory.
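As a rough, propositional caricature of the projection task described above (a sketch for illustration only, not the situation calculus itself; all fluent and action names are invented), a situation can be represented as a set of fluents and each action as a possibility test paired with an effect.

```python
# Fluents are simple strings; a situation is the set of fluents that hold.
initial = {"KettleFull", "CupEmpty"}

# Each action has a possibility test (precondition) and an effect that maps
# one situation to the next; everything not mentioned stays the same, which
# is a (trivial) way of sidestepping the frame problem in this toy setting.
ACTIONS = {
    "Heat":    (lambda s: "KettleFull" in s,
                lambda s: s | {"WaterHot"}),
    "PourTea": (lambda s: {"WaterHot", "CupEmpty"} <= s,
                lambda s: (s - {"CupEmpty", "KettleFull"}) | {"CupHasTea"}),
}

def project(situation, actions):
    """Projection task: which fluents hold after performing a sequence of actions?"""
    for name in actions:
        possible, effect = ACTIONS[name]
        if not possible(situation):
            raise ValueError(f"{name} is not possible in {situation}")
        situation = effect(situation)
    return situation

print(project(initial, ["Heat", "PourTea"]))
# -> a set of fluents describing the result, but nothing that says 'TeaMaking' happened
```

As the final comment indicates, projection on its own yields only the resulting fluents; associating the symbol TeaMaking with the sequence is exactly the difficulty discussed in the text.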
An approach based on the situation calculus assumes that there is a reasoning engine that interprets the detected sequence of actions and makes an assessment of whether it constitutes a normal or abnormal behaviour. Augusto and Nugent [3] suggest a different approach, which explicitly models abnormal behaviour in the form of Event-Condition-Action rules:

ON event IF condition THEN action.

The system that uses these rules keeps a current state, which is updated whenever an event occurs. If the condition associated with the event is satisfied, a particular action is triggered, which can be raising an alarm, notifying a caregiver, or just noting a change in the state of the system (a small sketch of this rule style is given at the end of this section). The rule-based approach provides explicit control over the detection of abnormal behaviour and the action to be taken if such a behaviour occurs. However, it still requires the modeller to create a reasonable set of rules that accounts for all forms of abnormal behaviour. The approaches discussed in the next section offer a solution to this problem by having the system learn to distinguish between normal and abnormal behaviour.
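The sketch referred to above is given here; it is our own minimal illustration of the Event-Condition-Action style, with invented events and conditions, and not the formalism of Augusto and Nugent [3].

```python
# A minimal Event-Condition-Action engine: ON event IF condition THEN action.
# Events are simple strings; each condition inspects the current state.

def alert_caregiver(state):
    print(f"ALERT: cooker switched on at {state['hour']}:00")

def note_state_change(state):
    state["cooker_on"] = True   # just record the change of state

RULES = [
    # ON cooker_on IF it is the middle of the night THEN alert the caregiver.
    ("cooker_on", lambda s: 0 <= s["hour"] < 5, alert_caregiver),
    # ON cooker_on otherwise, just note the change.
    ("cooker_on", lambda s: True, note_state_change),
]

def on_event(event, state):
    """Fire the first rule whose event matches and whose condition holds."""
    for rule_event, condition, action in RULES:
        if rule_event == event and condition(state):
            action(state)
            break

state = {"hour": 3, "cooker_on": False}
on_event("cooker_on", state)    # -> ALERT: cooker switched on at 3:00
```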
6 Machine Learning Approaches

In addition to the symbolic approaches identified previously, there are also significant contributions that subsymbolic and probabilistic methods can make to the area of context awareness in smart homes; subsymbolic and probabilistic methods are often combined under the generic name of 'machine learning'. In this chapter, three machine learning methods that can be used for spatio-temporal learning and context awareness in the area of ambient intelligence are described. The three methods provide potential solutions to different parts of the smart home problem, as will be identified as they are described. The section finishes with a brief discussion of other machine learning methods that might be of interest for spatio-temporal learning in smart homes.
6.1 Frequent Pattern Data Mining

Assuming that the sensory signals have been encoded in some way into a set of tokens that represent certain events (or the cessation of events), as was discussed previously, the first question is how to identify patterns in this sequence. The token stream is temporal in the sense that tokens appear when events trigger them, but they can be assembled into a sequence and this can be mined as a non-temporal sequence. For example, the token signal over each hour, or some other time window, could be assembled into a string, and then these strings can be treated as separate
tions’. The window could be strictly temporal in size, or of a fixed number of tokens (to see the difference, consider that a time window of one hour will probably have rather more tokens in it at 6 pm than at 3 am). Selecting a suitable size of window is important, and not trivial. It can be done based on the data using a technique such as cross-validation [7, 26]. The problem for the ambient intelligence is to identify from these strings those patterns of tokens that represent particular behaviours. One way to consider the problem is to say that tokens that appear together frequently are likely to be parts of the same behaviour, or parts of related behaviours. We can therefore appeal to a variety of data mining procedures that aim to do so-called market basket analysis. The aim of these methods is to extract combinations of items that occur together frequently. A typical motivation for these algorithms is to identify patterns of shopping, such as that pasta and red wine are often bought at the same time, or CDs by certain artists, so that shop layout and advertising can be better targeted to customers’ needs. One method of performing frequent pattern mining is the FP-tree [15], which is a tree data structure that extracts patterns with support (i.e., appearance in the dataset) above some pre-defined threshold. It was developed as an alterative to the Apriori algorithm of Agrawal and Srikant [1], and consists of a tree with individual items at each node, with the nodes at the top of the tree having the highest support. There is also a header table that indexes individual items so that they can be quickly found in the tree (usually implemented as a linked list if an item appears more than once within the tree). As an example, suppose that three different sets of tokens are seen in three windows: {A,C, D}, {A, B, D, E}, {B, D, E, F}. The algorithm first counts the support of each token individually, and sorts them into order of decreasing frequency, which would be D : 3, A : 2, B : 2, E : 2,C : 1, F : 1. Assuming that the minimum support (a user-defined parameter) was chosen as 2, those items with support 1 would be removed (here C and F) and then the first record ({A,C, D}) would be put into the tree, starting with D since it has the highest support. C would be ignored since it has support below the minimum, and therefore any pattern that it is part of would have support below the minimum. Each node is annotated with its support. The tree after the first pattern has been entered is shown on the left of Figure 7, and the tree after all three patterns have been entered is shown on the right of the figure. Many methods of data mining can be understood in terms of compression and related information-theoretic ideas. If the cost of transmitting or storing data is to be kept to a minimum, then a pattern that occurs frequently should be represented by a short bit sequence in a codebook, while one that occurs less commonly can have a longer one without incurring much additional cost, since it is seen less often. In this way, the problem of identifying frequent patterns can be mapped into one of finding the shortest bit sequence that encodes the information. The means that statistical measures of complexity such as the Minimum Description Length [41] are of interest, as in [13, 16, 9]. There will be two aims in mining frequent patterns: to identify the longest repeated sequences, and to identify the sequences that occur the most often. For be-
Fig. 7 Left: The FP-tree after the first pattern is entered. Right: The FP-tree after all three patterns have been entered.
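To make the support counting concrete, the following brute-force sketch finds the frequent itemsets in the three example windows above; an FP-tree or Apriori implementation would compute the same result far more efficiently on realistic data.

```python
from itertools import combinations
from collections import Counter

windows = [{"A", "C", "D"}, {"A", "B", "D", "E"}, {"B", "D", "E", "F"}]
min_support = 2

# Count every subset of every window (feasible only for tiny examples like this one).
counts = Counter()
for window in windows:
    for size in range(1, len(window) + 1):
        for itemset in combinations(sorted(window), size):
            counts[itemset] += 1

frequent = {itemset: n for itemset, n in counts.items() if n >= min_support}
print(frequent)
# e.g. ('D',): 3, ('A', 'D'): 2, ('B', 'D', 'E'): 2, ... while C and F drop out
```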
For behaviour identification we are most interested in the first of these, although the second can also provide useful information. One thing that makes the task of behaviour identification significantly more difficult is that there is no reason why patterns in a behaviour should be contiguous in the sequence; for example, when making tea, while waiting for the kettle to boil a person can perform other unrelated tasks such as reading the newspaper or taking out the rubbish. This problem is radically more difficult if there is more than one person in the house. Additionally, there is no reason to assume that tasks are always performed in the same order. Again, tea making provides an example; whether you pour the milk in first or last is a matter of preference, and choosing one or the other does not mean that you are performing a qualitatively different behaviour. Of course, some people take no milk at all, and others add sugar, and so the actual tokens that comprise one description of the behaviour can vary slightly, as well as their order.

A possible way to deal with the first problem could be to completely remove the temporal sequence by sorting the tokens into some other ordering (such as alphabetical, based on the arbitrary naming of the tokens) and to use that. This is what is commonly done in data mining, but it has two disadvantages here: firstly, a token that appears in several behaviours and is thus seen several times in the string will probably only be shown once, with the remainder suppressed; and secondly, the temporal information can be useful to discriminate between different behaviours. For example, if a person were to boil the kettle, take milk from the fridge, and remove a cup from the cupboard, then they could be making a cup of tea, but if the first action happened 35 minutes after the other two, that would be rather unlikely; rather, they are likely to be events in unrelated behaviours. A possible solution to this problem is presented by Tavenard et al. [48], based on the T-pattern of Magnusson [24]. This uses the time interval between events to try to recognise when particular events are related and when they are not.
This is not to say that frequent pattern data mining is not useful here; identifying groups of tokens that represent particular behaviours, and presenting them in a form to which spatio-temporal reasoning can be applied, may well be a very useful kind of preprocessing. By removing the various temporal aspects of the data (except that some kind of windowing is applied to the data stream, so that only tokens that appear relatively close together in time can form part of a pattern), it is possible to identify actions that frequently happen at similar times and that could therefore be part of the same behaviour, without having to consider the temporal order in which they appear. There are multiple challenges in this area, some of which have been identified above, and it is an interesting area of study. It may prove to be a fruitful approach to the problem of identifying behaviours at a relatively naïve level. The question then is how to use the identified behaviours, and the tokens that comprise them, in order to recognise whether or not a person is behaving normally. Several methods for doing this have already been described, but in the next two sections we identify two techniques that may well have their part to play in dealing with this challenge.
6.2 Graphical Models

One of the most popular areas of research in machine learning currently is that of graphical models. These are a general framework where variables are identified as nodes and edges between the nodes represent conditional relationships between them (more correctly, the absence of an edge implies conditional independence of the two nodes). Graphical models have been described as a marriage between computational graph theory and probability, and many common machine learning algorithms can be represented within this framework, including the Kalman filter [19], Hidden Markov Model [38], and Bayesian Network [18]. The graph that underlies the model can be directed (i.e., the edges have arrows on them) or undirected. In this second case the models are known as Markov random fields and have found particular application in image analysis.

In this chapter we will restrict ourselves to discussing one particular example, which is to take sequences of tokens that have been identified as forming a behaviour (possibly using the data mining methods described above), and to use examples of when these behaviours occurred in order to attempt to identify these behaviours in the stream of tokens as they are produced. In this way, behaviours can be identified in real time, and significant departures from normal behaviour hopefully recognised before they become dangerous. Since we are interested in these sequences as they appear in time, we are looking at dynamic Bayesian Networks, a subclass of graphical models that includes Hidden Markov Models (HMMs) as its most useful and computationally tractable example. Hidden Markov Models have met with considerable success in another area of temporal sequence analysis, that of speech recognition [38].
The important point about HMMs is the disconnection between observations, which for our smart home application are the tokens that are provided by the sensors, and the underlying states that gave rise to the observations. For example, observations could be that water is coming out of the tap, and possible states that give rise to that are that somebody is using the tap, that it leaks, or that there has been a sudden drip after use. As an example of a possible sequence of observations and the states that could be represented by them, Figure 8 shows a possible sequence of tokens and the states that could have given rise to them (in the figure, token F is about the tap being used, C and D are concerned with electricity being used, and X and Y with the fridge being opened and closed). The figure does not show the probabilities. Linking the sequence of observed tokens over time into lists of actions that can eventually be recognised as behaviours is still non-trivial, however. For example, from a stream of observations, there may be some that are not connected to the current activity, and the order in which the events occur can vary. Normal Hidden Markov Models will not deal well with either of these cases, and extending the paradigm to deal with these and other aspects of temporal sequences is an area of current research. One approach that shows some promise for dealing with the ordering problem is the Conditional Random Field [46], which encodes conditional probabilities of sequence order in an undirected graphical model. While these methods show some promise in removing some of the application-oriented limitations of the Hidden Markov Model, they do incur costs, such as the fact that Markov Chain Monte Carlo (MCMC) methods are needed to approximate the probability distributions, rather than the expectation-maximisation (EM) algorithm of the Hidden Markov Model [26].
Fig. 8 A sample Hidden Markov Model for recognising tea making behaviour. Note that it is not sufficient since it does not include the possibility that the milk is added before the water is boiled.
One perceived limitation of the HMM is that each state only produces one observation. While this makes sense for many applications, it is not intuitive for behaviour recognition. A suggestion of one solution is provided in Figure 8, where two tokens together are identified as a single observation. However, there are other solutions possible, such as modifying the structure of the HMM to allow for more complex
models. The Hierarchical HMM [12], where each state is allowed to be a complete hierarchical HMM of its own, is one such variation that allows complete activities to be built into each of the states. This has been applied to human behaviour recognition by Nguyen et al. [34].

Our own focus in this area is to look for a simple solution. By considering the problem of identifying and predicting behaviours as one of competition between a set of Hidden Markov Models, each of which recognises one particular behaviour, the best match for the current set of tokens can be identified probabilistically. In the event that none of the models produces a good match, either the system needs to add a new model and train it on this data, or it needs to alert the caregiver.
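The competition between behaviour-specific models can be sketched as follows, using the standard (scaled) forward algorithm to score a token sequence under each HMM. The two toy models, their parameters, and the token encoding are invented for illustration.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | HMM with start probabilities pi,
    transition matrix A and emission matrix B)."""
    log_p = 0.0
    alpha = pi * B[:, obs[0]]
    for t, o in enumerate(obs):
        if t > 0:
            alpha = (alpha @ A) * B[:, o]   # forward recursion
        scale = alpha.sum()
        log_p += np.log(scale)
        alpha = alpha / scale               # rescale to avoid underflow
    return log_p

# Two toy behaviour models over three observable tokens
# (0: tap run, 1: kettle on, 2: fridge opened).
tea = dict(pi=np.array([1.0, 0.0]),
           A=np.array([[0.6, 0.4], [0.1, 0.9]]),
           B=np.array([[0.5, 0.4, 0.1],     # state 'fill and boil'
                       [0.1, 0.2, 0.7]]))   # state 'fetch milk'
wash_up = dict(pi=np.array([1.0, 0.0]),
               A=np.array([[0.9, 0.1], [0.2, 0.8]]),
               B=np.array([[0.8, 0.1, 0.1],
                           [0.6, 0.2, 0.2]]))

tokens = [0, 1, 2]   # tap, then kettle, then fridge
scores = {name: log_likelihood(tokens, **m)
          for name, m in [("tea making", tea), ("washing up", wash_up)]}
print(max(scores, key=scores.get), scores)   # 'tea making' wins for this sequence
```

If the best log-likelihood falls below some threshold, none of the existing models explains the token stream well, which is the cue either to train a new model or to raise an alert.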
6.3 Novelty Detection and Habituation

A well-publicised criticism of neural networks and other machine learning methods is that they will always give you an answer, even when the input that is presented is significantly different from those that they were trained on. There are a variety of ways to solve this problem, but one that is of particular interest is novelty detection. In a novelty detection system the algorithm is trained on 'normal' behaviour that is seen during the training phase; then, in use, the algorithm classifies signals as either normal or novel. The application to smart homes is most clear in the monitoring scenario that we are considering. Assuming that methods have been used to process sensor signals into a pattern of human behaviour, the task is to decide whether or not the person is acting normally. Some early work in this area is reported by Rivera-Illingworth et al. [42] and Jakkula and Cook [17].

In monitoring of life activities, abnormal behaviour could be a completely new activity (jumping up and down aimlessly), performing a normal behaviour in an unexpected place (watering the TV, not the houseplants) or at an unexpected time (drinking vodka in the morning), or an unexpected person performing a behaviour that, while common, they have not done before (the staid aunt who removes her clothes and wanders around naked). As O'Keefe and Nadel [35] phrase it (page 241):
Of the four types of novelty identified, only the first lacks context as part of what makes it novel, and this highlights clearly how important context (whether in the form of the person, the place, or the time) is in identifying and categorising behaviours. There are many methods of performing novelty detection described in the machine learning literature, from the multi-layer Perceptron-based novelty filter of Kohonen and Oja [22] to methods based on Support Vector Machines [8]; for an overview of methods, see [25]. However, here, rather than looking at a method based
on a particular machine learning algorithm, we will see how a simple model of the biological phenomenon of habituation can be used to perform novelty detection.

Habituation is widely recognised as one of the simplest forms of learning in biological organisms. It consists of a reduction in response rate to a stimulus that is presented repeatedly without ill effect. This has important survival benefits for animals, as it enables them to ignore unhelpful but common stimuli in order to focus on potentially more important things. Habituation has been identified in organisms as simple as the sea slug Aplysia and the nematode worm, and as complex as humans; we habituate to stimuli constantly, from the way that we stop noticing the sensation of the socks on our feet to the ignoring of background sounds, such as those produced by machinery. The effects of habituation are often apparent in psychology experiments, and care must be taken to avoid confounding results. Thompson and Spencer [49] produced a list of nine identifying characteristics of habituation; these were recently revised by an interdisciplinary group of researchers convened for the purpose, and the revised characteristics, together with the justification for the changes from Thompson and Spencer [49], are described in Rankin et al. [40].
Fig. 9 Habituation causes an exponential decay in response with stimulus presentation that recovers when the stimulus is withdrawn.
The effect of habituation on the response to a particular stimulus is generally seen as an exponential decay of the response strength that recovers when the stimulus is not presented, as shown in Figure 9. In addition, there can be spontaneous dishabituation when a different stimulus is presented. Modelling this computationally can be as simple as a negative exponential function [44], although models are typically more complex than that. A recent review of modelling habituation and its application in machine learning is provided by Marsland [27].
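A minimal computational model of this decay-and-recovery behaviour, in the spirit of the first-order model attributed to Stanley [44] (the equation form and parameter values here are our own illustrative choices), might look as follows.

```python
import numpy as np

def habituate(stimulus, y0=1.0, tau=5.0, alpha=0.7, dt=1.0):
    """First-order habituation model: tau * dy/dt = alpha * (y0 - y) - S(t).
    The response strength y decays while the stimulus S is present and
    recovers towards y0 when the stimulus is withdrawn."""
    y = y0
    trace = []
    for s in stimulus:
        y += dt / tau * (alpha * (y0 - y) - s)
        trace.append(y)
    return np.array(trace)

# 30 presentations of the stimulus followed by 30 steps without it:
stimulus = np.concatenate([np.full(30, 0.5), np.zeros(30)])
response = habituate(stimulus)
print(response[0], response[29], response[59])   # decays, then recovers
```

Two such units with different learning and recovery rates, one that forgets quickly and one that barely forgets at all, give the long-term/short-term comparison discussed later in this section.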
For our smart home application area, to demonstrate the use of habituation, we assume that behaviours are encoded in some way, possibly as a set of identified features that categorise individual behaviours. Temporal and spatial information could be added as additional input features. These can then be classified using any clustering algorithm, such as the Self-Organising Map [21] or GWR network [28]. The outputs of the network can be modified by using the habituation model, so that inputs that are seen frequently are habituated to, while those that are seen only rarely are not. The network is then trained on data collected from normal living in the smart home; simply the typical data of daily life, either specific to a particular individual, or collected over many individuals in different houses.
Fig. 10 Combining a model of habituation with a neural network requires modifying the output of the neural network by the amount of habituation that has already occurred for that input class. When a class is seen often it is habituated to, whereas for a rare class the output is strong.
After training, the output of the habituation will suggest whether a particular input behaviour is normal or has novel features, either in the behaviour itself or in its spatial or temporal features, in relation to the training set. This model can be modified to provide additional information, such as distinguishing between behaviour that is normal because it has been seen in the past, but not recently, and completely typical behaviour. The way in which this can be done is to use two different habituating responses. One of them learns slowly, but does not forget; when an input is seen the response decrements for each presentation, but not by much. However, when the input is not seen the strength does not change. The other one learns more quickly, so that fewer presentations are required for an input to be considered normal, but it forgets, so that if an input class is not seen, the habituated response recovers to its maximum level. Consider then the effect of an input that is seen frequently. It will have both outputs habituated and will therefore produce no response. However, an input that has been seen in the past, but not recently, will produce output from one of these habituation modules, but not the other. This setup also enables a second interesting
case to be identified, which is a significant and sustained change in behaviour; when a new input is seen a few times the rapidly learning module will habituate to it, but the long-term one will not. Thus, an input that produces an output only from the long-term module is something that was rare, but is now being seen frequently.

It was suggested above that spatial and temporal context can be encoded as inputs to the learning algorithm. However, there are alternatives to this choice, such as training banks of learners, each of which is focused on a specific combination of times and places. In this way, each behaviour is interpreted only within its own spatial and temporal context.
6.4 Other Methods

Much of the learning that occurs in the smart home environment does not have clear targets. For example, the activities that the home should recognise are not necessarily described by humans, and we want to avoid having humans describe these in advance. This means that supervised learning methods are not appropriate for many of the tasks. The alternative used in the novelty detection methods described above is to use no external feedback and only expect the neural network to cluster relevant data together and detect outliers from it. However, there is an in-between approach known as reinforcement learning [47], where an external signal is provided that gives feedback, in the form of a reward, on whether or not the output was correct, but not on how to improve it. This makes the problem one of explicit search. One of the main challenges of reinforcement learning is known as the credit assignment problem, and consists of working out which actions led to the reward, since they may well not be those that were closely linked to it in time. For example, suppose that the smart home predicts that a person is making tea. The reward for this will be given once the observation of tea drinking is made, but the tokens that led to the prediction, such as running the tap and boiling the kettle, occurred long before this observation was made. This problem has been considered in a simple smart home context for lighting control by Mozer [31].

Standard neural networks can also be adapted to perform spatio-temporal learning. It is the problem of time-series prediction that has most commonly been considered, although this can be dealt with by using inputs taken within a fixed-width time window, which means that the learning algorithm does not need any information about the time series. However, it is also possible to build temporal information into the learning algorithm explicitly. For example, the multi-layer Perceptron [43] can be modified to deal with temporal data in a variety of different ways: typically by using outputs from the network at previous timesteps as inputs (this is known as a recurrent neural network; there are a variety of algorithms, but [11] is a typical reference) or by modifying the behaviour of the neurons so that they do not completely lose their activation between inputs. These are generically known as time delay neural networks, and an example is the leaky integrate-and-fire model of Stein [45]. For a review of the area of spatio-temporal neural networks, see Kremer [23].
7 Conclusion

The development of smart environments, such as smart homes, poses challenges to researchers from a wide range of areas, including engineers, who face the problem of finding adequate sensors to perceive the environment and effective actuators (including smart user interfaces) to act upon it; cognitive scientists, who seek to understand human behaviour and with that an understanding of the sensor data; and computer scientists, who endeavour to implement an ambient intelligence that can reason about the information perceived and can infer the appropriate actions from it. In this chapter we have only touched the tip of the iceberg by focusing on a very narrow aspect of the overall problem: the classification of complex human behaviour in a spatio-temporal context. In particular, we have focused on the problem of context awareness, where data concerning other activities taking place at approximately the same time, or in approximately the same place, require actions to be considered as a group, rather than in isolation. We have provided examples of this in the smart home context.

The chapter has most certainly fallen short of presenting one homogeneous solution to the problem of classifying complex human behaviour. This problem has been fundamental to the field of artificial intelligence since its beginning, and despite massive efforts for more than sixty years has not been solved in general. It therefore does not come as a surprise that we have not introduced a 'universal ambient intelligence for smart environments,' but rather have restricted ourselves to shedding some light on various aspects of the problem. One aspect is to automatically find meaning in sensor data and to distinguish between normal and abnormal behaviour on the basis of what has happened in the environment in the past. The reason for this approach is obvious: without the need for explicit modelling of human behaviour, we can in principle more easily create a system that is customised to a particular environment and the people living in that environment. Such an approach should also be able to cope with the problem of a very large number of potential behaviours, as it would learn only those behaviours that are actually occurring in a particular setting: there are many different ways of preparing a cup of tea or making a meal, but each of us usually does so in only a very small number of different ways.

Despite having these appealing advantages, subsymbolic approaches have their shortfalls, one of which is the lack of associating meaning with sequences of events. It might be easy enough to detect a tea making sequence, but this does not mean that the system knows much about tea, such as that tea making is a way to provide a drink, that providing a drink is the result of a need for nourishment or satisfying thirst, that the satisfaction of such a need is essential to maintaining a healthy condition, and so on. It would therefore not be able to conclude that cutting up juicy fruit is similar to making tea, while washing the car is not (although on the surface it might seem the other way round, as washing the car involves turning on the tap while cutting up fruit does not). Explicit modelling on a symbolic level, involving ontologies and commonsense reasoning, can overcome this problem, given enough resources.
We hope to have provided a convincing argument that neither symbolic nor subsymbolic approaches alone can provide a comprehensive solution to the problem of creating smart homes. However, we believe that many of the challenges in the development of smart homes can be overcome by carefully merging symbolic and subsymbolic approaches and linking symbolic reasoning methods with the prediction and inference made possible through machine learning. It is in this direction that our own research is moving.
Acknowledgements We are grateful for discussions and collaboration with the other members of the Massey University Smart Environments (MUSE) group (http://muse.massey.ac.nz).
References [1] Agrawal R, Srikant R (1995) Mining sequential patterns. In: Yu P, Chen A (eds) Proceedings of the Eleventh International Conference on Data Engineering, pp 3–14 [2] Allen J (1983) Maintaining knowledge about temporal intervals. Communications of the ACM 26:832–843 [3] Augusto J, Nugent C (2004) The use of temporal reasoning and management of complex events in smart homes. In: Proc. ECAI-04, Valencia, Spain, pp 778–782 [4] Balbiani P, Condotta JF, del Cerro L (1998) A model for reasoning about bidimensional temporal relations. In: Proc. KR-98, Trento, Italy, pp 124–130 [5] de Berg M, van Kreveld M, Overmars M, Schwarzkopf O (1997) Computational Geometry. Springer [6] Bhatt M, Loke S (2008) Modelling dynamic spatial systems in the situation calculus. Spatial Cognition and Computation 8(1&2):86–130 [7] Bishop C (1995) Neural Networks for Pattern Recognition. Clarendon Press, Oxford [8] Campbell C, Bennett K (2000) A linear programming approach to novelty detection. In: Leen T, Diettrich T, Tresp V (eds) Proceedings of Advances in Neural Information Processing Systems 13 (NIPS’00), MIT Press, Cambridge, MA [9] Cook D (2006) Health monitoring and assistance to support aging in place. Journal of Universal Computer Science 12(1):15–29 [10] Egenhofer M, Franzosa R (1991) Point-set topological spatial relations. International Journal of Geographical Information Systems 5(2):161–174 [11] Elman J (1990) Finding structure in time. Cognitive Science 14:179–211
[12] Fine S, Singer Y, Tishby N (1998) The hierarchical hidden markov model: Analysis and applications. Machine Learning 32:41–62 [13] Gopalratnam K, Cook D (2004) Active LeZi: An incremental parsing algorithm for sequential prediction. International Journal of Artificial Intelligence Tools 14(1–2):917–930 [14] Guesgen H (1989) Spatial reasoning based on Allen’s temporal logic. Technical Report TR-89-049, ICSI, Berkeley, California [15] Han J, Pei J, Mao R (2004) Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery 8:53–87 [16] Heierman E, Cook D (2003) Improving home automation by discovering regularly occurring device usage patterns. In: Proceedings of the International Conference on Data Mining [17] Jakkula V, Cook D (2008) Anomaly detection using temporal data mining in a smart home environment. Methods of Information in Medicine [18] Jordan M (ed) (1999) Learning in Graphical Models. The MIT Press, Cambridge, MA [19] Kalman R (1960) A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82:34–45 [20] Kiecolt-Glaser J, Glaser R (1999) Chronic stress and mortality among older adults. Journal of the American Medical Association 282:2259–2260 [21] Kohonen T (1993) Self-Organization and Associative Memory, 3rd edn. Springer, Berlin [22] Kohonen T, Oja E (1976) Fast adaptive formation of orthogonalizing filters and associative memory in recurrent networks of neuron-like elements. Biological Cybernetics 25:85–95 [23] Kremer SC (2001) Spatiotemporal connectionist networks: A taxonomy and review. Neural Computation 13:249–306 [24] Magnusson M (2000) Discovering hidden time patterns in behavior: T-patterns and their detection. Behavior Research Methods, Instruments, and Computers 32(1):93–110 [25] Marsland S (2003) Novelty detection in learning systems. Neural Computing Surveys 3:157–195 [26] Marsland S (2009) Machine Learning: An Algorithmic Perspective. CRC Press, Boca Raton, Florida [27] Marsland S (2009) Using habituation in machine learning. Neurobiology of Learning and Memory In press [28] Marsland S, Shapiro J, Nehmzow U (2002) A self-organising network that grows when required. Neural Networks 15(8-9):1041–1058 [29] Martin G (1998) Human Neuropsychology. Prentice Hall, London, England [30] McCarthy J, Hayes P (1969) Some philosophical problems from the standpoint of artificial intelligence. In: Meltzer B, Michie D (eds) Machine Intelligence, vol 4, Edinburgh University Press, Edinburgh, Scotland, pp 463–502 [31] Mozer M (2005) Lessons from an adaptive house. In: Cook D, Das R (eds) Smart environments: Technologies, protocols, and applications, pp 273–294
Spatio-Temporal Reasoning and Context Awareness
633
[32] Mukerjee A, Joe G (1990) A qualitative model for space. In: Proc. AAAI-90, Boston, Massachusetts, pp 721–727 [33] Newell A, Simon H (1976) Computer science as empirical inquiry: Symbols and search. Communications of the ACM 19(3):113–126 [34] Nguyen NT, Phung DQ, Venkatesh S, Bui H (2005) Learning and detecting activities from movement trajectories using the hierarchical hidden markov model. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 955–960 [35] O’Keefe J, Nadel L (1977) The Hippocampus as a Cognitive Map. Oxford University Press, Oxford, UK [36] Preparata F, Shamos M (1985) Computational Geometry: An Introduction. Springer, Berlin, Germany [37] Quinn M, Johnson M, Andress E, McGinnis P, Ramesh M (1999) Health characteristics of elderly personal care home residents. Journal of Advanced Nursing 30:410–417 [38] Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2):257–286 [39] Randell D, Cui Z, Cohn A (1992) A spatial logic based on regions and connection. In: Proc. KR-92, Cambridge, Massachusetts, pp 165–176 [40] Rankin C, Abrams T, Barry R, Bhatnagar S, Clayton D, Colombo J, Coppola G, Geyer M, Glanzman D, Marsland S, McSweeney F, Wilson D, Wu CF, Thompson R (2009) Habituation revisited: An updated and revised description of the behavioral characteristics of habituation. Neurobiology of Learning and Memory In press [41] Rissanen J (1989) Stochastic Complexity in Statistical inquiry. World Scientific Publishing Company [42] Rivera-Illingworth F, Callaghan V, Hagras H (2007) Detection of normal and novel behaviours in ubiquitous domestic environments. The Computer Journal [43] Rumelhart D, Hinton G, Williams R (1986) Learning internal representations by error propagation. In: Rumelhart D, McClelland JL (eds) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, The MIT Press, Cambridge, MA, vol 1, pp 318–362 [44] Stanley J (1976) Computer simulation of a model of habituation. Nature 261:146–148 [45] Stein RB (1967) The frequency of nerve action potentials generated by applied currents. Proceedings of the Royal Society, B 167:64–86 [46] Sutton CA, Rohanimanesh K, McCallum A (2007) Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. Journal of Machine Learning Research 8:693–723 [47] Sutton R, Barto A (1998) Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA [48] Tavenard R, Salah A, Pauwels E (2007) Searching for temporal patterns in ami sensor data. In: Proceedings of AmI2007, pp 53–62 [49] Thompson R, Spencer W (1966) Habituation: A model phenomenon for the study of neuronal substrates of behavior. Psychological Review 73(1):16–43
634
Hans W. Guesgen and Stephen Marsland
[50] Vilain M, Kautz H (1986) Constraint propagation algorithms for temporal reasoning. In: Proc. AAAI-86, Philadelphia, Pennsylvania, pp 377–382
From Autonomous Robots to Artificial Ecosystems

Fulvio Mastrogiovanni, Antonio Sgorbissa, and Renato Zaccaria
DIST - University of Genova, Via Opera Pia 13, Genova
e-mail: fulvio.mastrogiovanni@unige.it, antonio.sgorbissa@unige.it, renato.zaccaria@unige.it
1 Introduction

During the past few years, starting from the two mainstream fields of Ambient Intelligence [2] and Robotics [17], several authors recognized the benefits of the so-called Ubiquitous Robotics paradigm. According to this perspective, mobile robots are no longer autonomous, physically situated and embodied entities adapting themselves to a world tailored for humans: on the contrary, they are able to interact with devices distributed throughout the environment and exchange heterogeneous information by means of communication technologies. Information exchange, coupled with simple actuation capabilities, is meant to replace physical interaction between robots and their environment. Two benefits are evident: (i) smart environments overcome inherent limitations of mobile platforms, whereas (ii) mobile robots offer a mobility dimension unknown to smart environments.

Traditional autonomous robotic systems share philosophical choices that unavoidably lead to "autarchic robot design paradigms":

• Robots must be fully autonomous. Robots can cooperate with humans or other robots, but they must carry out assigned tasks relying exclusively on their own sensors, actuators, and decision processes.
• Robots must operate in non-structured environments. Robots must adapt to environments that have not been purposely modified for them.
To date, this approach has not been satisfactory, since no robot has proven to be "autonomous" in a generic, "non-structured" environment: the few examples of robots that came close to being considered "autonomous" were designed to work in specific workspaces, or to make extensive use of environmental cues. A special case is constituted by industrial Autonomous Guided Vehicles (AGV in short): reliability and fault-tolerance are achieved by purposively modifying the robot workspace to ease its tasks. The concept is not new, and it can be exemplified by Brooks' famous statement about what can be expected from biological and artificial intelligence: "Elephants don't play chess" [6]. The intelligent behavior of biological beings heavily relies on the sensors and actuators they are provided with, and on the way they evolved in tight relation with the environment where they live. It is no wonder that attempts to build autonomous artificial beings able to "live" in the same environment as biological beings fail: current technology in sensors and actuators is simply not adequate. Recently, these aspects have been formalized in [15], where Humanoid and Ubiquitous Robotics are two scenarios in which these concepts are applied and discussed in great detail.

Ubiquitous Robotics suggests a different approach: building robots with limited sensing and actuating capabilities, but operating within environments well-suited for them. This option leads to simpler, cheaper, and more reliable fully autonomous systems here and now, and not in a future where the required technology will be – hopefully – available. However, the stress has often been more on feasibility studies and prototyping than on determining general guidelines for system design and integration. In order to carry out a principled discussion about the role of mobile robots within smart environments, it is mandatory to define a number of parameters that can be used both to qualitatively classify current systems and to help in designing new ones. Four parameters have been identified, which are described in the following paragraphs.

Integration represents the consolidation between mobile robots and smart environments. This indicator takes different parameters into account, such as ease of communication, ability to undertake coordinated action, and capacity to reason upon heterogeneous information. It is particularly significant given the integrated nature of ubiquitous robotics applications: mobile robots and smart environments should not merely be juxtaposed; on the contrary, they represent complementary and mutually assisting systems, up to the limit where a clear distinction between them can no longer be drawn.

Balance refers to how high-level objectives are mapped between mobile robots and smart environments, and specifically to the trade-off between information processing at different system levels. The motivation behind this parameter is that the background of researchers in Ubiquitous Robotics is often either in smart environments or in mobile robotics: therefore, their approaches are usually biased towards one end of this spectrum. Issues related to multi-agent stability, convergence and equilibrium are not considered in this classification, since they are subject to ongoing research [7].

Adaptivity is the ability to achieve a given high-level objective possibly using different agent configurations, in order to exhibit a robust behavior.
Adaptivity enforces fault-tolerance and allows reliable 24/7 performance (the term "24/7" refers to the ability to operate autonomously 24 hours a day, 7 days a week). The parameter is also meant to address middleware-related issues, among them: coordination among different agents, inter-agent communication, the ability to deal with numerical, sub-symbolic and symbolic information, and the capacity to reconfigure agent scheduling on the basis of sudden needs.

Flexibility is the capability to adapt to different scenarios without any need for architectural changes. Considerations related to scalability, modularity and deployability are contemplated, whereas ad hoc solutions are considered strong limiting factors. The core architecture of ubiquitous robotics systems should not depend on the scenario at hand. The inclusion of this parameter is particularly significant since, during the integration process between smart environments and mobile robots, it is likely that the interface is tailored to the application, rather than based on general and principled guidelines.

Only four frameworks from the literature have been selected for a more in-depth analysis. Discarded systems are characterized by at least one of the following drawbacks: (i) the focus still remains on either the mobile robot or the smart environment as an individual entity, and/or (ii) specific solutions are widely adopted and a comprehensive theoretical and practical framework is lacking. As a matter of fact, these systems are better labeled as Networked Robotics systems [17].

In the following, a discussion of state-of-the-art architectures is first carried out. Next, an application scenario is described, meant to discuss typical problems related to the introduction of robotic systems into the real world. Finally, a more specific case study is introduced: the Artificial Ecosystem (AE in short) is described in both its theoretical and practical aspects. Conclusions are then drawn.
2 State-of-the-art: Integrated Frameworks for Robots in Smart Environments

2.1 Ambience

Ambience is among the first projects dealing with Ambient Intelligence (AmI in short) issues (the project closed in 2003; for a general description, refer to the Ambience website at http://www.hitech-projects.com/euprojects/ambience). Ambience paved the way for important work in this area, specifically in ubiquity principles, context awareness, general artificial intelligence and human-robot interaction. Furthermore, the project gave rise to the first European Symposium on AmI, where many results have been presented [1].

The main achievement of Ambience has been the definition of networked context-aware environments, which involved the first implementation of AmI concepts in real-world scenarios, such as homes, offices and public buildings. Three main areas have been selected, namely mobile, professional and home.
The first two are related to AmI technologies that do not involve mobile robots, such as intelligent meeting facilities, collaborative design processes and human navigation assistance in structured environments. However, of particular interest is the home setting, and specifically the interaction between the robotic assistant Lino and the smart environment [5, 12]. The authors claim that humans should never interact with smart objects directly. From this assumption, it is straightforward to infer that robots are mobile appendages to smart environments, covering human-machine interaction, possibly dealing with emotional cues, speech recognition and verbal communication, thereby requiring state-of-the-art onboard sensing techniques. Therefore Lino is not very different from those embodied and physically situated robots that originated research in Ubiquitous Robotics. However, Lino is networked, in the sense that it is continuously connected to its environment, providing it with situated sensory data and receiving high-level goals and objectives. In later work in AmI, this approach has been abandoned. However, among the benefits introduced by Ambience we stress the role played by robots: these are mobile, possibly specialized, entities that are able to purposively interact with humans in a way far beyond the common point-and-click paradigm. This idea has been extensively used in those scenarios where physical interaction with humans is important, such as in the case of people with disabilities or other physical injuries, as described in [14].

Ambience is the most basic approach with respect to the taxonomy characterized by the four key parameters introduced in Sect. 1. Integration is at a basic level, since robots and smart environments are individual entities whose complementary capabilities are associated solely through simple communication: the smart environment can receive information from the robot and can provide it back with strategies to interact with humans. Balance is not addressed at all, since robots are devoted solely to human-machine interaction, whereas the real intelligence is in the smart environment part of the architecture. Adaptivity is modest, since the roles played by mobile robots and smart environments are assigned by designers and cannot be modified at run-time. Flexibility is sufficient, since the architecture has been tested in three different application scenarios.
2.2 PEIS Ecology

The PEIS Ecology framework has been envisaged at the AASS Mobile Robotics Lab at Örebro University (Sweden) and carried out from 2003 onwards (PEIS is an acronym for "Physically Embedded Intelligent System"; for a description, refer to the official website at http://www.aass.oru.se/~peis/). The project is aimed at providing both cognitive and physical aids to the elderly and people with special needs [16].

Novel ideas of the PEIS Ecology framework originate from a careful investigation of the concept of object.
Each object is associated with a PEIS entity, an autonomous and physically situated entity providing information about the object itself, and able to communicate with other PEIS in order to form a PEIS Ecology. There is a strong paradigm shift with respect to Ambience: (i) PEIS Ecology enforces direct communication between objects, and between objects and humans; (ii) physical perception and object manipulation are replaced by direct communication of equivalent information between PEIS entities; (iii) the difference between intelligent (i.e., robots or other distributed devices) and non-intelligent (i.e., objects like furniture and appliances) parts of an environment must vanish, since each entity should become a PEIS, thereby being provided with the necessary capabilities to sense, act and communicate.

PEIS entities are also the weak point of the framework: PEIS Ecologies often rest on unrealistic assumptions, such as the availability of sensory modalities that are rather difficult to handle. For instance, in home and ambient assisted living scenarios, PEIS Ecologies assume the presence of artifacts such as PEIS forks, PEIS fridges or PEIS bread boxes, which are able to provide such information as "I am being used!", "Be careful, the expiry date on the milk has passed!", or "No more bread inside!". In other words, it is not possible to assume that each object is able to communicate with other surrounding objects, at least at the current technology level.

PEIS Ecology extends Ambience principles in many directions. Integration is at a good stage, since communication between PEIS entities is completely transparent with respect to the underlying device or computational capabilities. Balance has been addressed in depth: PEIS entities provide and reason upon exactly the information they have access to, whereas missing information is provided by other PEIS through communication. Adaptivity issues received much attention, especially in the form of sensor fusion: however, in certain cases, information originates from heterogeneous sources, thereby requiring a careful arbitration process that is not clearly addressed. Finally, flexibility has been validated in different scenarios, where heterogeneous teams of mobile robots successfully cooperated with smart environments.
2.3 Ubibot: Sobots, Embots and Mobots

Sobots, Embots and Mobots are the three constituents defining a Ubiquitous Robotics system, according to the Ubibot paradigm (the project has been under development at KAIST, Korea, from 2000 onwards; for a complete survey, refer to the official Ubibot webpage at http://rit.kaist.ac.kr/home/Intelligent_Robotics). With respect to both Ambience and PEIS Ecology, the Ubibot paradigm is a contemporary effort that is targeted at realizing a comprehensive solution at the infrastructural level. The main motivation behind this framework is to design systems that are seamless with respect to the medium of human-machine communication, adaptive, and aware of human activities and needs [10, 11].
The Ubibot framework originates from the work described in [4]. Brady defines Robotics as the intelligent connection of perception to action. According to this elaboration, Ubiquitous Robotics makes it possible to abstract the connection between intelligence, perception and action from its physical implications to a totally distributed mechanism: intelligence, perception and action are not necessarily situated, embodied or located in the physical space. On the contrary, the key idea is to realize them as abstract and individual entities, thus originating the intelligent software robot (i.e., the Sobot), the perceptive embedded robot (i.e., the Embot) and the physically active mobile robot (i.e., the Mobot). Sobots, Embots and Mobots represent dissociated areas of competence of a cognitive system. Each Sobot, Embot and Mobot is in charge of managing exactly one facet of the whole Ubibot system. Sobots, Embots and Mobots reside in a virtual space, the so-called ubiquitous space, which encompasses both parts of the physical world and parts of the infrastructure.

As an example, recall the home setting proposed by the Ambience framework. Adopting the Ubibot paradigm, Lino would be a perfect example of a Mobot, which is in charge of providing mobility in the physical part of the ubiquitous space. However, differently from Ambience, human-machine interaction would not be artificially distributed between Lino and smart appliances: on the contrary, only one Embot (the "human-machine interaction" Embot) would be used, i.e., a conceptual entity which exists and acts independently from the underlying, possibly distributed hardware layer. Finally, one or more Sobots could reside in the pure software part of the ubiquitous space, to be scheduled on possibly different workstations. From this example, it is possible to argue that what is being virtualized in the Ubibot framework is not the concept of object, as in the PEIS Ecology, but rather the notion of "area of competence", each having specific responsibilities, capabilities, and requirements.

The Ubibot framework is a fundamental milestone in this field. Integration is at an impressive level, since each entity can be decomposed into its main constituent parts. Balance has been addressed in depth, since this decomposition makes it possible to establish a good trade-off between the different tasks to be performed by a ubiquitous system. Adaptivity is enforced by allowing on-line techniques to reconfigure an entire Ubibot system. Flexibility has been thoroughly investigated: the approach has been tested in scenarios ranging from simple robots (i.e., only Mobots have been used), to hybrid scenarios involving Sobots, Embots and Mobots, down to virtual scenarios comprising only Sobots.
2.4 The Artificial Ecosystem

The Artificial Ecosystem framework was envisaged in 2000 by the Laboratorium group at the University of Genova, Italy. Differently from the previously described approaches, the focus is on designing systems that exhibit 24/7 operation and reliable performance, rather than on intelligent, multi-modal interfaces for man-machine interaction [13].
In order to achieve this, all the problems related to autonomy are faced from a fully multi-agent perspective. Robots are thought of as mobile individuals within a smart environment where they coexist, cooperate and compete with a network of static individuals, i.e., smart nodes that are assigned different roles, among them: (i) providing robots with cues about their position within the environment; (ii) controlling automated doors, elevators or surveillance cameras; (iii) detecting emergency situations such as fires or liquid leaks. Robots and smart nodes can mutually exchange information, and are given a high level of autonomy. According to these design principles, "autonomy" is no longer an attribute of robots alone: it is a property distributed throughout the environment. However, there is a symmetry between robots and the network: the concept of smart nodes is extended to the devices onboard robots, to the extent that robots can be considered collections of onboard smart nodes with the special ability of "moving" within the environment. Onboard smart nodes can communicate with all the other nodes in the environment within communication range, and are given some level of autonomy in taking decisions.

The authors push the metaphor further and assert that the AE shares some similarities with "natural" ecosystems: a parallelism can be drawn between the energetic flow deriving from solar radiation and the energetic supply in the AE, which is given by the building power supply system. Apart from this external source of energy, the AE is in dynamic equilibrium, in the sense that it is self-sustaining: similarly to natural ecosystems, where matter and energy are recycled and transformed as needed by "biotic" components, software agents controlling robots and smart nodes – i.e., "bionic" components consuming energy and producing information by interacting with the physical world – do not need any other exchange of energy or information with elements external to the ecosystem. The Artificial Ecosystem metaphor has been used to model intelligent buildings and smart homes as well as more traditional robotics scenarios: the approach is compatible with, and can benefit from, multi-modal man-machine interaction, Ubiquitous Computing, and Communication technologies. However, in this discussion (and differently from other Ubiquitous Robotics paradigms), humans are not the main focus: they are still considered part of the ecosystem, but their role is limited to assigning tasks, tuning the system behavior according to the customer's preferences, and scheduling activities.

The AE framework exhibits sophisticated properties in integration, balance, adaptivity and flexibility. Integration is guaranteed by the extreme plasticity characterizing the AE model. There is no real difference between mobile robots and smart environments: both of them are handled by software agents designed according to a unique paradigm, and integration is at the agent communication level. For the same reasons, architectures based on the AE framework are well balanced: software agents can be scheduled indifferently onboard mobile robots or on smart nodes embedded in the environment. Moreover, the system can quickly reconfigure in order to address sudden needs, handling both competitive and cooperative agent schemes: adaptivity is therefore enforced at the architectural level. Finally, flexibility has been proven in many experimental set-ups: different parts of the system have been deployed in heterogeneous scenarios, such as intelligent buildings, smart homes and autonomous robots.
3 Application Scenario

The following scenario is considered, envisaging a fleet of mobile robots that coordinate their activity with a network of sensors/actuators within a hospital. Since the early attempts of Transition Research Corporation to commercialize the transportation robot HelpMate (the whole story is told in [9]), autonomous transportation in hospitals has been considered a "killer application" for intelligent robotics, due to the high amount of "goods" (waste, housekeeping items, meal trays, pharmacy and lab supplies, patient records, etc.) that is continuously moved from one point to another, which has a notable impact in terms of human work, and hence cost. However, and in spite of the significant number of products available at the present day (see, for example, the Automated Transport and Logistics Integration Systems from JBT Corporation at http://www.jbtc-agv.com, or the TransCar AGV System from SwissLog at http://www.swisslog.com, just to name a few), these are mostly an evolution of industrial AGVs, whose autonomy in taking decisions to cope with complex situations is still very limited. For example, existing systems are totally unable to plan alternative paths that take into account the current situation (e.g., corridors or elevators that are more or less crowded at a given time of the day), and they focus exclusively on transportation, with no place for additional, cognitive-oriented activities to be executed concurrently. In the following, a complex scenario is described in which robots – in addition to transportation – are requested to enforce the hospital's security through autonomous surveillance.

With reference to Fig. 1, which depicts the 1st and 2nd floors of a hospital, as well as its surrounding park, assume the following scenario:

• A fleet of mobile robots operates within the hospital and in the park.
• A network of sensors and actuators for building automation is distributed in the hospital and in the park.
• A node of the network is not necessarily able to communicate with all other network nodes: instead, it is possible to envision different, non-connected sub-networks, whose topology can change over time. Nodes within a sub-network communicate with each other through wired and wireless communication (Wireless Ethernet, Bluetooth, Zigbee, or similar).
• Robots can communicate with (a subset of) network nodes through wireless communication.
• Robots are assigned two different kinds of tasks: transportation and surveillance. Transportation is performed according to a given schedule, as well as – when the schedule allows – on demand. Surveillance patrols are performed according to a given schedule and on demand; in addition, surveillance is always active during transportation.
• Networks of sensors/actuators are assigned the main tasks of automating the building (opening/closing doors and windows, controlling elevators, etc.) and of surveillance (detecting thefts and vandal acts, people in restricted areas, natural dangers such as fires, gas and liquid leaks, etc.).
Fig. 1 Map of the hospital. Left: 1st floor and park; right: 2nd floor. Some significant locations are labelled with capital letters: two elevators A and B connect the two floors. A level crossing C regulates the traffic in the park, to prevent robots from crossing the road while human-driven cars are passing by. D is the location where trash is collected, including biological waste from the operating theatres (E and F). G and H are pharmacies. I is the laundry. All doors along the robot path are automated doors (not labelled for clarity). Patient wards, waiting rooms, labs, ambulatories, closets, as well as other facilities, are not labelled either. The system is currently being implemented in a real hospital in Italy; however, the plan reproduced here is a surrogate of the real one, both for clarity of representation and for security and legal reasons.
This must be done while meeting additional constraints, which in the following will also be referred to as the "contract" between the company that provides the transportation/surveillance service and the customer.

1. Locations to be reached by robots are both outside and inside the building, on both floors of the building, and through doors that can be temporarily closed. In all cases, self-localization must guarantee that robots are always aware of their position, i.e., they are never "lost".
2. The system must work 24 hours a day, 7 days a week. Human intervention is required only for assigning tasks on schedule and/or on demand, and for periodic maintenance. All operations must be automated, including opening/closing doors, taking elevators, loading/unloading carts and battery recharging.
3. Timing must be very precise, in order to meet the assigned schedule and to execute – when possible – additional tasks that are assigned asynchronously depending on current needs.
4. Robots must safely transport material in crowded areas open to the public. It is necessary to prevent a cart from ending up in the hands of the wrong person.
5. Surveillance must include both reactively firing alarms as soon as a single sensor node detects an anomaly, and performing more complex sensor-fusion activities to understand the current situation and consequently plan an action.
6. Robots must vary their surveillance patrols in order to avoid being outwitted by possible intruders. During patrols, robots must check the state of all doors and windows.
7. The system must be able to check itself for failures, and report the result.

The contract can be more easily fulfilled by allowing robots and network nodes to Cooperate and Compete with each other.
3.1 Cooperation and Competition for Transportation and Surveillance

In general, robots and smart nodes can derive mutual benefit from cooperating with each other as a consequence of their "bionic-diversity". This principle can appear obvious, and hence it is often overlooked. However, it is worth remembering the fundamental difference between robots and smart nodes in the environment: (i) robots can move, hence they can reach all places where their intervention is needed, including areas that cannot be directly monitored by smart nodes; however, they can get lost when they move, they have problems operating on environmental features (doors, elevators, windows, etc.), and they have the major problem of electrical power consumption; (ii) smart nodes in the environment cannot move, and hence they cannot reach all places where their intervention is needed; since they are tightly connected to the place where they are located and its immediate neighbourhood, they can work as a reference for mobile robots, and can be easily and permanently set to control environmental features; smart nodes do not have the problem of electrical power consumption.

With reference to the described scenario, it is easy to envision some situations in which cooperation plays a fundamental role; the reader can easily imagine additional ones.

• Localization. The network of smart nodes increases the robots' self-localization capabilities, both during navigation and when performing specific tasks in selected locations: i.e., smart nodes can act as beacons, making it possible to correct the robot's pose through triangulation or filtering techniques (e.g., a Kalman filter); a minimal sketch of such a correction is given after this list.
• Controlling automated doors, elevators, barriers. Upon request of the robot, smart nodes allow the robot to access areas of the environment that are not connected, and hence cannot be reached through simple motion: e.g., by communicating with a node of the network through a wireless link, the robot can request the node itself to control an elevator, an automated door, or the level crossing (e.g., A, B, and C in Fig. 1).
• Autonomous loading/unloading and charging. Smart nodes allow the robot to easily control devices for autonomous loading/unloading, as well as stations for autonomous battery recharging.
• Identity checking. Through smart nodes embedded in the badges assigned to staff personnel, robots can check the identity of a person, for example during a surveillance patrol or when delivering a cart. If an attempt at theft or sabotage is detected, they can take a picture of the attacker and send it to the network, and finally to a supervision workstation.
• Surveillance. Upon request of the sensor network, the robot helps to investigate areas that static smart nodes cannot monitor directly, for example by "giving a ride" to a smart sensor located onboard the robot itself (a camera, a microphone, a temperature sensor, etc.) up to the area or the person to be monitored.
• Traffic reporting. Smart nodes can provide traffic reports about each area of the hospital, thus allowing the robot to plan in advance the best route to avoid crowded areas and to meet the current schedule.
• Pursuit-evasion. Upon request of the robots, smart nodes in the environment help the robots to track a person that a robot is pursuing, and block his/her path by denying access to elevators, doors, and windows.
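As a toy illustration of the beacon-aided pose correction mentioned in the Localization item, the following Python sketch fuses a drifting odometry estimate with an absolute fix provided by a smart node, using a simple Kalman-style gain per axis. It is only a hedged sketch: the variances, numbers and function names are assumptions, not taken from the chapter.

# Minimal sketch (illustrative only) of beacon-aided pose correction:
# a smart node acting as a beacon provides an absolute position fix that
# is fused with the drifting odometry estimate.

def fuse(odometry_xy, odometry_var, beacon_xy, beacon_var):
    """One Kalman-style correction step per axis: the gain weighs the beacon
    fix by the relative uncertainty of the two estimates."""
    gain = odometry_var / (odometry_var + beacon_var)
    fused = tuple(o + gain * (b - o) for o, b in zip(odometry_xy, beacon_xy))
    fused_var = (1.0 - gain) * odometry_var
    return fused, fused_var


if __name__ == "__main__":
    # odometry has drifted (large variance); the smart node's fix is sharper
    pose, var = fuse(odometry_xy=(12.4, 3.1), odometry_var=0.9,
                     beacon_xy=(11.8, 3.6), beacon_var=0.1)
    print(f"corrected pose: ({pose[0]:.2f}, {pose[1]:.2f}), variance: {var:.2f}")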
In addition to Cooperation, we strongly believe that it is important to let individuals (i.e., robots and smart nodes) compete with each other: in particular, Competition assumes here the meaning of demonstrating that other individuals are not able to adequately perform their own task. Even if this idea appears to be new in the context of Ubiquitous Robotics, it is quite common in many human activities. In critical domains, splitting the team into two competing groups is a common practice to find out each other's strengths and weaknesses, and to possibly mitigate the effect of the latter in favor of the former: examples range from military tactics to software engineering, from role-playing in human resource management to system and network security. Finally, the basic idea is in accordance with the engineering principle of straining the system to its breaking point: it can play an important role in autonomous transportation, since all deliveries must be performed in a timely fashion in order to meet deadlines and to guarantee the desired quality of service, and it becomes fundamental in critical tasks such as surveillance.

As an aside, notice that this principle is at the basis of security even when periodic surveillance patrols are carried out by humans: most procedures for regular patrols/inspections and video-surveillance start with the assumption that surveyors are "human" in their nature, and therefore untrustworthy. For example, during surveillance patrols, it is explicitly requested that the timing and the path of the patrol are randomly varied, since repeated patterns allow intruders to learn the best strategy to infiltrate. In addition, since humans are assumed to be inherently "lazy" and "biased" towards simpler solutions (if not dishonest), patrols are logged and traced using recording devices distributed in selected areas. In a similar manner, security guards are taught not to trust security sensors and alarms, which are periodically checked against failure by simulating intrusions: system redundancy and fault diagnosis are only a partial solution, since – in this context – the main problem is the correspondence between the output of the recognition process and ground truth (a well-known problem in AI), and not just checking that the electrical/mechanical devices are properly working.

It is easy to envisage a list of specific situations in which Competition can play a fundamental role; again, the reader can easily imagine additional ones.
• Checking the surveillance network. Robots can check whether smart nodes for surveillance are working correctly or not: do surveillance cameras detect the robot passing by? Do passive infrared sensors detect intruders? Differently from traditional surveillance, alarms do not need to be disabled during patrols: if the robot is aware of having caused an alarm, it simply tells the supervision station to ignore it. On the contrary, an alarm is fired whenever a discrepancy is found between what the robot and the surveillance network respectively say (a minimal sketch of this kind of discrepancy check is given after this list).
• Checking the automation network. Robots can check if smart nodes work correctly when requested to control environmental features, e.g., if doors are properly opened and closed, if the policy adopted to reserve the elevator is adequate (i.e., it prevents robots from starving while waiting for the elevator), if power is properly turned on/off at the station for battery recharging, etc.
• Checking robot patrols. Smart nodes can check if robots are performing surveillance patrols properly: this means recording the path followed by robots, with the purpose of verifying whether the robot is fulfilling the contract or not, e.g., all areas can be considered clear or have been covered at the required frequency, the path is randomly varied, etc. An alarm is fired whenever a discrepancy is found between what the robot and the surveillance network respectively say.
• Checking robot mobility. Smart nodes can check if robots are meeting the planned schedule during transportation tasks, e.g., if they take too long to perform a given operation (entering an elevator, docking into the charging station, etc.), if they have a wrong idea about their own position in the environment, etc.
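As a concrete, hedged illustration of the discrepancy check underlying several of the items above, the following Python sketch compares what a robot claims about its own position with what a surveillance camera reports, and raises an alarm when the two accounts disagree. The message layout, names and tolerance are assumptions for illustration only, not the system described in this chapter.

# Minimal sketch (illustrative only) of the robot/network cross-check:
# an alarm is raised whenever the robot's own account and the surveillance
# network's account of the same situation are inconsistent.

def cross_check(robot_report, camera_report, tolerance=1.0):
    """Return an alarm string when the robot and the surveillance network
    disagree, or None when their accounts are consistent."""
    if robot_report["in_camera_area"] and camera_report["detection"] is None:
        return "ALARM: robot claims to be in view, camera does not detect it"
    if not robot_report["in_camera_area"] and camera_report["detection"] is not None:
        return "ALARM: camera detects a robot that should not be there"
    if camera_report["detection"] is not None:
        dx = robot_report["pose"][0] - camera_report["detection"][0]
        dy = robot_report["pose"][1] - camera_report["detection"][1]
        if (dx * dx + dy * dy) ** 0.5 > tolerance:
            return "ALARM: robot and camera disagree on the robot position"
    return None


if __name__ == "__main__":
    robot = {"in_camera_area": True, "pose": (4.0, 7.5)}
    camera = {"detection": (4.2, 7.4)}   # camera sees the robot nearby
    print(cross_check(robot, camera) or "accounts are consistent")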
Some of these concepts have been implemented in the systems that have been reviewed; a tentative implementation of Ambience, Ubibot and PEIS Ecology to deal with the proposed scenario is now described.

3.2 Ambience, Ubibot and PEIS Ecology at Work

If one were to take a careful look at the aims, objectives, and philosophical positions of Ambience, Ubibot and PEIS Ecology, it turns out that they solve slightly different problems with respect to the proposed scenario. (In order to describe a hypothetical implementation of the hospital scenario using Ambience, PEIS, and Ubibot, we have no source of information other than our own suppositions; we are confident that the proposers of these frameworks would design a much more efficient solution than we foresee.) Ambience and the Ubibot framework are heavily geared towards serving humans. Subsystems of Ambience and Ubibot do not benefit from mutual interaction. For instance, Lino is purposively designed to be a mobile supplement to the smart environment, with the sole purpose of mediating between users and the environment itself. Lino does not introduce any real novelty within the system: human-machine interfaces could be implemented using distributed multi-media touch-screens as well.
With respect to the hospital scenario, we can imagine that Ambience would be characterized as follows:

• Architecture. A 2-layer architecture is used: the lower layer comprises a team of Lino mobile robots and distributed devices, whereas the higher layer manages the different cognitive processes. Lino robots carry out both previously scheduled and on-demand missions, whereas the smart environment comprises all the relevant distributed devices necessary to recognize user needs.
• Interaction with the physical world. The software onboard Lino robots is expanded to cope with safe object delivery, robust navigation between different floors, and surveillance. Lino robots are completely "autarchic" in doing this: self-localization, obstacle avoidance and situation assessment are carried out without any help from distributed devices. On the other hand, system goals are set by the higher layer, and communicated to Lino robots through the network.
• Representation and reasoning. The higher layer manages all the decision processes related to hospital automation, including object transportation, delivery and surveillance: a centralized architecture manages all the data collected by distributed sensors (comprising those onboard Lino robots), assessing situations, planning actions, submitting them to robots, monitoring their execution, and so on.
• Cooperation and Competition. Cooperation is addressed by simply designing in advance the overall coordinated behavior of both Lino robots and smart environments, whereas Competition strategies cannot be used.
• User interface. User interfaces onboard Lino robots are effectively used by staff personnel to interact with the Ambience system, thereby requiring the presence of the robot in a given area, providing all the necessary information for object delivery, etc.

The high level of operability of the Ubibot framework assumes a human interacting with the system. Therefore, Sobots, Embots and Mobots are designed as collections of services to be queried on demand. The key points in adopting the Ubibot paradigm for the hospital scenario are as follows:

• Architecture. A team of Mobots is developed to perform object transportation and surveillance. Each Mobot is provided with a corresponding Embot able to collect information provided by the Mobot's sensory systems, to perform basic information processing, and to communicate relevant information to Sobots. Other Embots are in charge of collecting information from devices distributed throughout the hospital. As a consequence, the architecture is neither layered nor centralized, but simply distributed.
• Interaction with the physical world. Mobots are provided with all the necessary sensors commonly adopted for surveillance applications. At the same time, Embots are extended to deal with new data types and to merge relevant information. Sobots are adapted to the scenario at hand as well. Although Mobots (and the corresponding Embots) are provided with robot-specific capabilities, activities related to object delivery and surveillance are made easier by prior knowledge gathered by Sobots, thereby improving the overall system performance. At the same time, Sobots reason upon situations and goals, consequently orchestrating the coordinated behavior of both Mobots and Embots.
• Representation and reasoning. Sobots receive all the relevant knowledge from distributed Mobots and Embots. This knowledge is used to perform situation assessment from distributed sources, to set up system goals (on the basis of user requirements provided by Embots) and to define a course of action accordingly. However, differently from the Ambience centralized approach, a distributed network of Sobots can be used to manage intelligent processes.
• Cooperation and Competition. Coordination is addressed by carefully designing interaction rules between Sobots, Embots and Mobots. These rules are defined on the basis of the scenario at hand, since human needs vary accordingly. At the same time, issues related to Competition cannot be explicitly addressed.
• User interface. Embots manage human-machine communication. In particular, Embots migrate from device to device (e.g., from Mobots to workstations) in order to ease human operations. This process is carried out under the supervision of the relevant Sobots, which are able to assess the user's situation and to provide Embots with relevant information.

In the PEIS Ecology framework, PEIS entities effectively benefit from mutual interaction, since complex behavior emerges from the cooperation among several PEIS entities. However, as suggested in [16], humans are a core part of the PEIS ecology itself: the architecture should be able to incorporate humans in the whole system, operating in symbiosis with them. Although this seems a promising research direction, it can be labeled more suitably as Ambient Intelligence, rather than Ubiquitous Robotics. If PEIS Ecology were used in the hospital setting, the stress would be on the following key points:

• Architecture. Robots, devices and other appliances are transformed into PEIS entities. This originates a completely decentralized architecture where local connections are preferred to centralized control. Each PEIS entity is able to provide other entities with requested information (if available, depending on the underlying hardware architecture) through a servicing mechanism.
• Interaction with the physical world. PEIS mobile robots interact with other PEIS entities in order to fulfil their tasks. For instance, safe object delivery requires constant communication between the PEIS mobile robot and the PEIS box to be delivered. At the same time, robot navigation is managed as a sequence of interactions with distributed PEIS devices, aimed at aiding robot self-localization and at communicating with PEIS entities such as PEIS doors, PEIS battery recharging stations, and so on.
• Representation and reasoning. High-level reasoning capabilities are achieved through direct communication between PEIS entities. System goals, issued by users, are distributed among the different PEIS entities, thereby allowing for a more reliable and fault-tolerant system, given the intrinsically distributed representation of the relevant domain knowledge.
• Cooperation and Competition. PEIS entities cooperate by implementing a set of loosely coupled communication processes: different sources of information are evaluated on the basis of their information content and reliability. On the other hand, Competition has not been explicitly addressed by this framework.
• User interface. User interfaces and explicit human-machine communication are totally missing. The PEIS Ecology, by operating in symbiosis with humans, is able to silently understand basic as well as complex user needs.
4 Case Study: the Artificial Ecosystem Approach

4.1 Basic Principles

The key idea is that of representing the system as an ecosystem, which is self-sustaining (and hence autonomous) with the only exception of an external source of energy (which, in Nature, is provided by the sun). This is achieved by finding a common paradigm to represent the mutual relationships between robots and networks of smart nodes on one side, and environmental features that are part of the physical world (doors, windows, elevators, walls, furniture, etc.) on the other. This common paradigm is that of metaphorically assuming the equation energy = information, and of dealing with changes in the real physical space as if they were taking place in the information space: energy in the system (electrical, mechanical, chemical) is transformed by sensors into information, information is elaborated through algorithms, and finally transformed into energy by actuators. Since each transformation step consumes energy, an external source of energy is required, provided by the building power supply system (i.e., the amount required by sensors and actuators to work, as well as by microprocessors to execute software processes). Humans are assumed to be part of the ecosystem: however, their only role is that of assigning objectives, tasks, and policies, possibly corresponding to a schedule of activities to be performed along the whole temporal axis, of monitoring the system at a supervision station (a direct intervention of a human supervisor is required to take countermeasures in case of an intruder, vandal acts, and so on, also for legal reasons), and of providing periodic maintenance. With the exception of these interventions, the system is completely autonomous. The stress on energetic issues is not only aimed at enforcing the biological metaphor: power supply is known to be one of the major issues to be dealt with in mobile systems, as long as electrical motors are the only reasonable choice for a robot that operates indoors.

More specifically, the Artificial Ecosystem community consists of bionic and a-bionic members. Bionic members include only software agents, which can be further classified as:

• Sensor Agents. These are agents that handle sensors, i.e., which transform electrical, mechanical, and chemical energy taken from the physical world into information. Sensor Agents communicate the produced information to Cognitive Agents;
• Actuator Agents. These are agents that handle actuators, i.e., which transform information into electrical, mechanical, and chemical energy in the physical world. Actuator Agents receive the required information from Cognitive Agents;
• Cognitive Agents. These are agents that elaborate information according to a set of rules or algorithms. Cognitive Agents can be further classified as: (i) agents with no internal state, governed by condition-action rules; they are used for purely reactive behaviors, e.g., stopping the motors to avoid a collision (robots) or opening an automated door upon request (smart nodes); (ii) agents that maintain an internal state and/or representation in order to choose which action to perform; they are used for more complex tasks, e.g., avoiding obstacles on the basis of a continuously updated local map of the environment (robots) or controlling an elevator dealing with multiple requests (smart nodes); (iii) deliberative agents that handle goals and find action sequences to achieve them; these agents are used for high-level planning and execution tasks, e.g., finding a sequence of STRIPS-like operators to reach a target location.

A-bionic members include everything that cannot be classified as a software agent, e.g., walls, doors, automated doors, the motors of a robot, humans, etc. As normally happens in the physical world, they possibly transform (and dissipate) energy.

In order to share information, all agents can communicate via a publish/subscribe protocol, which allows them to exchange messages on the basis of the main type (and, sometimes, sub-types) of each produced message. Messages convey the information produced by Sensor Agents, elaborated by Cognitive Agents, and consumed by Actuator Agents towards the right destination: the message main type makes it possible to immediately identify the content, as well as the agents that are allowed to produce and to consume that type of information. This approach makes it possible to dynamically add/remove agents in the system without other agents being aware of that, thus increasing the versatility and the reconfigurability of the system. In addition, many different communication forums are envisaged, each possibly composed of agents that are within communication range at a given time and that are interested in publishing/subscribing to messages related to a given topic (e.g., only the agents running on a robot need to know the trajectory to be followed to avoid an obstacle). One particular case of this concept is the global forum, in which all agents participate. In many practical cases all smart nodes in the environment can communicate with each other, since they are permanently connected to the same network through wired or wireless media; hence, if a robot and a smart node are within communication range, the robot can communicate with all nodes of the network, including the supervision station. In the following, we will not repeat that robots and nodes can "talk" on different forums; when it is said that a robot can communicate with a node and vice versa, it is assumed that a forum exists in which both of them participate (possibly the global forum).

Since an agent can receive messages of the same type from different sources, an arbitration function is required (possibly different for every subscribed type) to choose what to do with every new message received, either keeping it, discarding it, or merging it with the information received so far.
The default behaviour is simply consuming every message, but alternative strategies have been implemented as well, based on the authority of the producer (only the highest-authority messages are consumed, whereas the others are discarded), or on more complex data fusion techniques. Arbitration, especially when performing data fusion, requires in its turn a distance function (possibly different for every subscribed type), which ranks the similarity between messages received from different sources, and acts differently depending on whether the distance is below a threshold (e.g., data are merged) or above it (e.g., data are not merged). Summarizing (a minimal code sketch of these mechanisms is given after the list):

• each message has a type, possibly sub-types, and one or more fields to be filled, among which the identifier of the robot or smart node that produced the message;
• agents can participate in one or more forums; all agents participate in the global forum;
• agents that subscribe to a given type in a given forum are guaranteed to receive all messages of that type that have been published in that forum by agents within communication range;
• for every type subscribed, each agent is associated with an arbitration function of the form [msg-type m2] = a-funct(msg-type m1) that filters incoming messages according to one of the following behaviours:
  – none, i.e., it simply maps m1 into m2;
  – authority, i.e., it maps m1 into m2 when the sender has an authority higher than or equal to the other sources of information currently available for that msg-type, whereas it returns an "empty message" (that is discarded) if the previous condition does not hold;
  – merge, i.e., it merges m1 with the messages received so far according to a set of user-defined rules and returns the result. An example could be a weighted sum, or the output of a more complex filtering algorithm.
• for every type subscribed, each agent is associated with a distance function of the form [alert-type m3] = d-funct(msg-type m1, msg-type m2) that compares messages from different sources and acts according to one of the following behaviours:
  – none, i.e., it always returns the "empty message", which is discarded;
  – compare, i.e., it computes a normalized scalar value that ranks the distance between the two incoming messages, and returns an "empty message" (that is discarded) if the value is below a threshold; otherwise it returns and publishes a message of type alert-type containing the measured distance. The distance depends on the semantics of the message types involved, and hence is computed according to a set of user-defined rules: an example could be the binary, scalar, or Euclidean distance for numerical values, or the output of a classification process for symbolic data, i.e., one that answers the question "is the information contained in m1 and m2 the same?"
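The following Python sketch illustrates, under stated assumptions, the mechanisms summarized above: typed messages published on a forum, an "authority" arbitration function, and a "compare" distance function that raises an alert on discrepancies. It is an illustrative sketch, not the implementation used in the Artificial Ecosystem; all class, field and agent names are assumptions.

# Minimal sketch (not the authors' implementation) of typed messages,
# forums, and per-type arbitration and distance functions.
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Message:
    msg_type: str                 # main type, e.g. "position-msg"
    sender: str                   # identifier of the producing robot/smart node
    authority: int = 0
    fields: dict = field(default_factory=dict)


EMPTY = None  # the "empty message" that arbitration/distance may return


def authority_arbitration(current: Optional[Message], incoming: Message) -> Optional[Message]:
    """'authority' behaviour: keep the incoming message only if its sender's
    authority is >= the authority of the message kept so far."""
    if current is None or incoming.authority >= current.authority:
        return incoming
    return EMPTY


def compare_distance(m1: Message, m2: Message, threshold: float = 0.5) -> Optional[Message]:
    """'compare' behaviour: Euclidean distance on a numeric field; publish an
    alert message if the two sources disagree beyond the threshold."""
    d = abs(m1.fields.get("x", 0.0) - m2.fields.get("x", 0.0))
    if d < threshold:
        return EMPTY
    return Message("alert-msg", sender="arbiter", fields={"distance": d})


class Forum:
    """A communication forum: agents subscribe to message types and receive
    everything published with that type by agents in the same forum."""

    def __init__(self, name: str = "global"):
        self.name = name
        self.subscribers: dict[str, list[Callable[[Message], None]]] = {}

    def subscribe(self, msg_type: str, callback: Callable[[Message], None]) -> None:
        self.subscribers.setdefault(msg_type, []).append(callback)

    def publish(self, msg: Message) -> None:
        for callback in self.subscribers.get(msg.msg_type, []):
            callback(msg)


if __name__ == "__main__":
    forum = Forum()
    kept: list[Message] = []

    def trajectory_agent(msg: Message) -> None:
        # arbitration: prefer the higher-authority position estimate
        best = authority_arbitration(kept[-1] if kept else None, msg)
        if best is not EMPTY:
            kept.append(best)

    forum.subscribe("position-msg", trajectory_agent)
    forum.publish(Message("position-msg", "odometry", authority=1, fields={"x": 3.0}))
    forum.publish(Message("position-msg", "surveillance-camera", authority=2, fields={"x": 3.4}))
    alert = compare_distance(kept[0], kept[1])
    print(kept[-1].sender, alert.fields if alert else "no discrepancy")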
Figure 2 on the top shows a very simple robot that is controlled by four agents. The behaviour of each agent is described in the table in Fig. 2 on the bottom, showing the subscribed and published types with their respective fields.
Fig. 2 On the top: Artificial Ecosystem composed of a mobile robot and a supervision workstation; on the bottom: agents executed onboard the robot
In the example in Fig. 2, each activity is necessarily initiated by the human operator, who is modeled as an a-bionic member of the ecosystem (since the operator is not a software agent!). In fact, the agent trajectory needs target-msg messages, which are not produced by any other agent running on the robot. Without them, speed references cannot be computed, and no velocity-msg message is produced: the final effect is that the robot does not move. One can obviously assume that target-msg is produced by another agent in the ecosystem, either running on a robot or on a smart node in the environment, possibly on the basis of higher-level information: for instance, starting from a more abstract message of type patrol-msg, whose fields identify an area to be continuously patrolled until new orders are given. Whichever solution is adopted, at a certain point in this "chain" an injection of information (and, hence, energy) from humans is required: humans receive information (and, hence, energy) back by monitoring the system's state at the supervision workstation.
Since humans are modeled as a-bionic members, the human-computer interface is coherently handled by a Sensor Agent named command (for input, see Fig. 2 on the top) and an Actuator Agent named display (for output). More specifically:

• a need chain is an alternating sequence of producers and consumers, including both software agents and a-bionic members, that transform energy and/or information (e.g., the sequence trajectory → motor → wheels);
• a closed need chain is a circular chain (it corresponds to "food chains" in natural ecosystems), i.e., the first and the last element of the chain correspond to the same member (e.g., the sequence wheels → odometry → trajectory → motor → wheels);
• a producer/consumer couple can be part of several distinct need chains (e.g., the couple trajectory → motor is also part of a bigger chain that involves the agents command and display scheduled on the supervision station);
• the Artificial Ecosystem is said to be self-sustaining when all need chains are closed: all members of the community have everything they need, and everything they produce is consumed by someone.

In the example in Fig. 2, all need chains are closed; practically, this corresponds to saying that, given that an external source of energy is available (including food and beverages for the human operator!), the system is able to work in total autonomy (a minimal sketch of such a closure check is given below). The external source of energy is required since agents "consume" electrical power to consume, transform, and produce information and/or energy (strictly speaking, it is the hardware on which agents are executed that consumes electrical power). Energy is provided by the power supply system and, in the case of mobile robots, it is stored in batteries that are periodically recharged at the charging station. If this external source of energy is no longer available for some reason, the system tends towards a "deathly" static equilibrium, whose main difference from "death" is that all bionic members of the community tend to resurrect as soon as the black-out terminates.
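A minimal Python sketch of the closure check mentioned above: members are listed with the types they consume and produce, and the ecosystem is self-sustaining when every consumed type has a producer and every produced type has a consumer. The member names follow the need chains quoted above; the produced/consumed types of the command and display agents, and the "energy" entries modelling the wheels, are assumptions made for illustration.

# Minimal sketch (illustrative only) of checking that all need chains in an
# Artificial Ecosystem are closed.
# member -> (consumed types, produced types); "energy:*" entries model the
# a-bionic side (wheels), message types model the bionic side (agents).
MEMBERS = {
    "wheels":     ({"energy:torque"}, {"energy:rotation"}),
    "odometry":   ({"energy:rotation"}, {"position-msg"}),
    "trajectory": ({"position-msg", "target-msg"}, {"velocity-msg"}),
    "motor":      ({"velocity-msg"}, {"energy:torque"}),
    "command":    (set(), {"target-msg"}),          # assumed: human input at the workstation
    "display":    ({"position-msg"}, set()),        # assumed: human output at the workstation
}


def open_needs(members):
    """Return the consumed types that nobody produces and the produced types
    that nobody consumes; the ecosystem is self-sustaining when both sets
    are empty."""
    produced = set().union(*(p for _, p in members.values()))
    consumed = set().union(*(c for c, _ in members.values()))
    return consumed - produced, produced - consumed


if __name__ == "__main__":
    missing_producers, missing_consumers = open_needs(MEMBERS)
    print("unsatisfied needs:", missing_producers or "none")
    print("unconsumed outputs:", missing_consumers or "none")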
7 Including food and beverages for the human operator!
consumed by an agent scheduled on a robot (i.e., trajectory) is a special case of Cooperation; the case of an agent scheduled on a smart node (i.e., surveillance-camera) that compares messages received from two different sources (i.e., the agent odometry and itself) is a special case of Competition. In order to avoid confusion, it should be anticipated that the key concepts of Cooperation and Competition do not refer to software agents, but rather to "higher level aggregates" such as robots, smart nodes in the environment, supervision stations, or even PDAs, cellphones, etc. All these physical entities, which will be referred to as individuals in the following, have the peculiarity of being composed both of a-bionic members (i.e., the hardware, including circuitry, microprocessors, motors, wheels, etc.) and of bionic members (i.e., the software agents that control them). Cooperation and Competition refer to individuals for the main reason that all software agents that control an individual are always tightly tied to each other by a privileged relationship of some sort. This is primarily in the eye of the observer, since humans have the tendency to locate software agents in the physical space depending on the microprocessor where they are executed; but it also has objective reasons, since agents controlling the same individual are usually guaranteed to be permanently within communication range, to have the same error when positioning sensor data in the Cartesian Space or when assigning targets to actuators, etc. In order to better explain these concepts, it is necessary to "link" software agents to the physical world, which is done through the following definitions:
• each agent is associated with a Computational Unit (i.e., a CPU within a standard i586 architecture, an embedded microprocessor, etc.);
• at a given time instant, an agent can be active or not active on the corresponding Computational Unit; agents switch from not active to active, and vice versa, when they receive messages containing proper instructions to do that8;
• agents associated with a Computational Unit that can move in the physical space (e.g., the PC onboard a robot) are referred to as mobile agents;
• agents associated with a Computational Unit that cannot move in the physical space (e.g., a smart node in the environment) are referred to as static agents;
• an individual is a set composed of software agents and the related hardware, with the further assumption that all Computational Units on which the agents are executed, as well as the hardware, are "kinematically constrained";
• a mobile individual includes only mobile agents; a static individual includes only static agents.
The term "kinematically constrained" should be meant in its broader meaning, i.e., informally speaking, two objects are kinematically constrained when it is not
8 Specifically, agents are classified as (i) Periodic, i.e., agents that are always active and execute at a fixed time frequency; (ii) Aperiodic, i.e., agents that are normally inactive, but become active and execute once when messages of a given type are received; (iii) Sporadic, i.e., agents whose status switches between Periodic and Aperiodic when they receive messages containing proper instructions to do that.
The terminology is derived from Real-Time scheduling theory: since all agents in the system must meet timing constraints to cope with an uncertain and dynamic environment, the Artificial Ecosystem architecture has Real-Time characteristics that guarantee a predictable and safe behavior of the system even when computational resources are limited.
possible for them to move apart from each other beyond a maximum distance, and hence, when one object moves, the other necessarily moves as well, sooner or later. This definition includes many different classes of individuals, among which are some funny examples such as a running man holding a PDA in his left hand and a cellphone in his right hand, or a robot on top of which someone has temporarily forgotten a microphone. This allows us to give the following final definition, which is valid for all classes of individuals and will be extensively used in the following:
• when an agent belonging to an individual produces messages that are consumed by an agent belonging to a different individual, we say that the two individuals Cooperate (Fig. 3);
• when an agent belonging to an individual receives the same type of information from two or more sources, of which at least one belongs to a different individual, and ranks the similarity between the information received from the different sources through a distance function, we say that the individuals that produce the information Compete (Fig. 3).
Fig. 3 Cooperation: an agent belonging to an individual produces messages that are consumed by an agent belonging to a different individual; Competition: an agent belonging to an individual receives the same type of information from two or more sources, and ranks their similarity
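As a concrete illustration of the definitions above, the following sketch (written in Python; the chapter prescribes no implementation language, and all names are illustrative) models each member by the message types it produces and consumes and by the individual it belongs to, then applies a simplified test for self-sustainability – every consumed type has a producer and vice versa – and lists the producer/consumer couples that span different individuals, i.e., Cooperation.

from dataclasses import dataclass, field

@dataclass
class Member:
    # A bionic or a-bionic member, reduced to what it produces and consumes.
    name: str
    individual: str                          # e.g. "robot", "supervision-station"
    produces: set = field(default_factory=set)
    consumes: set = field(default_factory=set)

def is_self_sustaining(members):
    # Simplified closure test: every consumed message type is produced by some
    # member, and everything produced is consumed by someone.
    produced = {t for m in members for t in m.produces}
    consumed = {t for m in members for t in m.consumes}
    return consumed <= produced and produced <= consumed

def cooperation_couples(members):
    # Producer/consumer couples whose members belong to different individuals.
    return [(p.name, c.name, sorted(p.produces & c.consumes))
            for p in members for c in members
            if p.individual != c.individual and p.produces & c.consumes]

ecosystem = [
    Member("command", "supervision-station", produces={"target-msg"}),
    Member("trajectory", "robot", consumes={"target-msg", "position-msg"},
           produces={"velocity-msg"}),
    Member("motor", "robot", consumes={"velocity-msg"}, produces={"wheel-ticks"}),
    Member("odometry", "robot", consumes={"wheel-ticks"}, produces={"position-msg"}),
    Member("display", "supervision-station", consumes={"position-msg"}),
]
print(is_self_sustaining(ecosystem))   # True
print(cooperation_couples(ecosystem))  # command -> trajectory, odometry -> display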
4.2 Cooperation and Competition in the Reference Scenario
The AE approach was tested in the years 2000–2002 at the Gaslini Hospital of Genova, and it is currently being tested in a more complex scenario within
another Italian hospital. The experimental set-up is composed of two mobile robots MerryPorter™ (Fig. 4) and a set of static smart nodes distributed in the building, with the purpose of controlling active beacons for localization, automated doors, elevators, the station for autonomous recharging, and autonomous loading/unloading equipment9. MerryPorter™ is capable of indoor/outdoor navigation, and it is equipped with a SICK safety laser-scanner that is used for obstacle avoidance, a Hokuyo laser-scanner mounted on a high pole for map-building and self-localization, ten sonars that return the distance to the closest obstacle as an additional safety measure, the patented DLPS system for beacon-based optical guidance10, two Pan/Tilt/Zoom cameras for surveillance, and one Garmin GPS. MerryPorter™ is requested to perform transportation and surveillance tasks; to this purpose, it has a payload of about 120 kg, and it is capable of autonomous loading/unloading by positioning itself below a properly adapted cart and operating a lifting platform. Autonomous recharging is provided by a wall-mounted power supply panel connected to charge rails embedded in the floor. The control architecture of each of the two MerryPorter™ robots is modelled as a collection of agents, whose mutual relationships are by far more complex than those described in the simple case in Sect. 4.1. Some agents are executed on the onboard computer, whereas some others are executed on smart nodes located onboard the robot, as if they were sensor/actuator nodes carried around11.
Fig. 4 MerryPorter™ at the Health Care Exposition EXPOSANITA', 28–31 May 2008, Bologna. On the right, the prototype Staffetta that was tested at the Gaslini Hospital of Genoa in 2000–2002
9 MerryPorter™ is commercialized by GenovaRobot (http://www.genovarobot.com).
10 Beacons are transponders that send a signal on a wireless channel whenever they are hit by a rotating laser beam, therefore allowing localization through triangulation and/or filtering techniques, e.g., the Kalman Filter.
11 The onboard PC runs Linux, implementing the POSIX standard for Real-Time, whereas smart nodes are implemented as Echelon LonWorks control nodes, a standard for building automation.
Fig. 5 Agents executed onboard MerryPorter™
The table in Fig. 5 reports the agents that are executed onboard MerryPorter™. As in the simpler case in the previous Sub-section, the human operator starts the process by sending a message of type problem-msg: a problem can be a single task, as well as a schedule of tasks, whose level of abstraction can range from the Cartesian position of locations to be reached in sequence, to the description of objectives and policies, e.g., something that sounds like: patrol the building by randomly varying your paths. As soon as planner receives problem-msg, it iteratively decomposes it into sub-problems, and finally produces a plan-msg message, corresponding to a sequence of actions to be performed: in most cases, this is taken from a library of pre-produced plans that allow performing routine tasks, but it can require a planning phase in case of special requests or anomalous situations. executor receives the plan, and produces messages to actually execute it: for example, it publishes target-msg messages describing via-points to be reached in sequence, which are then transformed by trajectory into velocity-msg messages consumed by motor-left, motor-right, and motor-front. The state of advancement of the plan is monitored by context-assessment, which subscribes to all relevant messages and possibly publishes a new problem-msg message if the perceived status of the system has changed in an unpredicted way. During motion, the onboard agents sonar-1, sonar-2, and laser-low provide the required information that allows map-grid to build a local representation of the free space; this is used by trajectory to periodically update the computed trajectory in the presence of obstacles. The arbitration function used by motor-left, motor-right, and motor-front is based on authority: if, for some reason, the robot is very close to a collision, emergency – which has a higher authority than trajectory – takes control, and immediately stops the motors to guarantee a safe behavior. Finally, trajectory needs to know the current position to compute a path to the target. Position estimates are primarily provided by odometry; when available, kalman compares current observations with a pre-stored model of the environment, and uses observations to update the current position estimate (observations can correspond to the profile of walls and furniture as detected by laser-high and published by map-vector, as well as to gps measurements). The table in Fig. 6 reports the smart nodes, and corresponding agents, that have been implemented in the scenario. Without losing generality, it is assumed that human operators can assign tasks, objectives, and policies to the system, as well as monitor its activity, through a supervision station connected to the sensor/actuator network, which can therefore be considered as a special case of smart node with high computational power and memory storage. As usual in Ambient Intelligence applications, it is possible to foresee a more complex scenario, in which PDAs are used for this purpose, as well as cellular phones or similar portable equipment, or even ubiquitous, multi-modal, Maria-the-road-warrior-like interfaces embedded in intelligent furniture. However, this is not the focus here, and hence this issue is left in the background. A particular, yet very common, case is when it can be assumed that an action performed by an Actuator Agent on a smart node has an extremely low probability of failure. Consider a light switch, or an automated door: whenever
the agent has received and executed a message telling it to turn off the light, or to close the door, it is usually reasonable to assume that the operation worked properly. In these cases, the system is modeled as if a virtual Sensor Agent were present, which automatically detects the change produced by the Actuator in the world, and consequently produces information describing the new state of the system. That is, as soon as the Actuator Agent light-switch has turned off the light, the system assumes that the light is off, and the corresponding Sensor Agent light-check publishes a message accordingly. This approach can be dangerous in the case of an automated door: suppose that door-check publishes a message of type door-state-msg, telling the robot that the door is open, whereas it is not! However, if required, this situation can be detected by proximity sensors on board the robot, e.g., sonars and laser scanners. It is sufficient that an agent onboard the robot (say passage-check) subscribes to messages describing the free space (say map-grid-msg), and performs proper computations to finally produce messages of type door-state-msg that can be used to unmask the "lying" agent door-check. This is a case of Competition.
Fig. 6 Smart nodes, supervision stations, and corresponding agents in the hospital scenario
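The virtual Sensor Agent just described can be sketched as follows (a toy publish/subscribe bus with hypothetical message-type names; it only illustrates the pattern, not the actual middleware): whenever an actuation command is observed, the assumed resulting state is immediately published, and a competing agent such as passage-check may later contradict it.

class Bus:
    # Toy publish/subscribe bus standing in for the real message infrastructure.
    def __init__(self):
        self.subscribers = {}
    def subscribe(self, msg_type, callback):
        self.subscribers.setdefault(msg_type, []).append(callback)
    def publish(self, msg_type, payload):
        for callback in self.subscribers.get(msg_type, []):
            callback(msg_type, payload)

def add_virtual_sensor(bus, actuate_type, state_type):
    # As soon as the Actuator Agent is commanded, assume the action succeeded
    # and publish the corresponding state (extremely low probability of failure).
    def on_actuation(_type, payload):
        bus.publish(state_type, {"state": payload["command"], "source": "virtual"})
    bus.subscribe(actuate_type, on_actuation)

bus = Bus()
add_virtual_sensor(bus, "door-actuate-msg", "door-state-msg")
bus.subscribe("door-state-msg", lambda t, p: print(t, p))   # e.g. context-assessment
bus.publish("door-actuate-msg", {"command": "open"})
# prints: door-state-msg {'state': 'open', 'source': 'virtual'}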
It is now easy to foresee how the Cooperation patterns introduced in Section 3.1 can emerge. Three examples are described in detail.
Localization. Beacons increase self-localization accuracy through the following need chain: as a beacon is hit by the rotating laser beam, beacon produces beacon-msg; the robot agent kalman consumes beacon-msg (as well as dlps-enc-msg, i.e., the azimuth direction of the beacon, and position-msg, i.e., the most recent position estimate), and produces kalman-msg by comparing observations with a map that contains the position of all beacons; odometry consumes kalman-msg, thus producing a more accurate estimate of the robot's position. Chains are closed by the following producer–consumer couple belonging to different individuals: beacon → kalman.
Controlling automated doors, elevators, barriers. Consider the case of an elevator: planner produces a sequence of actions (i.e., a plan-msg message) to call the elevator, keep the door open while the robot enters, select the next floor, and exit. However, the robot by itself has no means to perform these actions, since executor publishes elevator-msg, which no other onboard agent is able to consume. This is only one side of the problem: to monitor the state of advancement of the plan, context-assessment needs to know the current state of the elevator, i.e., to check whether the elevator is at the current floor, if the doors are open, etc., but no onboard agent is able to produce elevator-state-msg. Smart nodes located on different floors, which are allowed to control the elevator and check its state, and a set of corresponding Sensor Agents and Actuator Agents, can do the work on behalf of the robot, simply by publishing/subscribing to the corresponding message types. Chains are closed by: executor → elevator, elevator-check → context-assessment.
Identity checking. Whenever a robot finds a person within a restricted area, this does not necessarily mean that she/he is an intruder; instead, the robot agent surveillance waits for messages of type badge-msg, which may have been produced by a badge agent within communication range. If such a message is received, the potential intruder is identified as a "staff member"; otherwise an alarm-msg is produced and snapshots are taken by the robot agents camera-left and camera-right, and the corresponding messages are consumed by display at the supervision station. If required, voice communication can be initiated between the human operator and the intruder, involving the two robot agents speaker and microphone, as well as command and display at the supervision station. Chains are closed by: badge → surveillance; camera-left → display; camera-right → display; command → speaker; microphone → display.
Analogously, it is possible to foresee how Competition patterns can emerge. Two examples are described in detail.
Checking the surveillance network. Agents surveillance-camera and pir, executed on smart nodes in the environment, publish surv-camera-msg and pir-msg when something has been detected. This information is received by context-assessment at the supervision station, which – after an inferential process – produces a message of type position-msg describing the position of the detected intruder (and, possibly, other messages describing the current state). The robot agent surveillance has subscribed to position-msg: when
the robot is within the field of view of a camera for videosurveillance or a PIR, surveillance can check whether the position-msg messages produced by the onboard agent odometry (i.e., the location where the robot assumes to be) are similar to the messages produced by context-assessment. If not, an alert-msg message is published, since there is a concrete possibility that the camera for videosurveillance or the PIR is not working properly. Distance function: compare; msg-type=position-msg; behaviour: computes the Euclidean distance on the basis of the fields x-pos and y-pos.
Checking robot patrols. Every movement of the robot is recorded, both through direct observation and – in particular – through explicit local communication between robots and smart nodes in the environment: in fact, since a robot periodically communicates with smart nodes in order to accomplish its own tasks, all its movements can be carefully tracked12. All messages produced by smart nodes are then consumed by context-assessment at the supervision workstation, which consequently produces a patrol-state-msg, containing a set of attributes describing the patrol performed by the robot, i.e., the percentage of coverage, whether patrol paths have been randomly varied, whether doors and windows have been checked, etc. The agent context-assessment executing on the robot produces a patrol-state-msg as well, containing attributes that describe the robot's belief about its own behaviour, and both messages are finally consumed and compared by display, which checks whether the robot is aware of how well it is performing (for example, it can happen that – due to bad localization – the robot thinks that it has visited an area, whereas it has not). Distance function: compare; msg-type=patrol-state-msg; behaviour: a classification process based on attributes.
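The position-msg distance function described above can be sketched as follows (the field names x-pos and y-pos come from the text; the threshold and the alert payload are illustrative assumptions):

import math

ALERT_THRESHOLD = 1.5   # assumed tolerance, in metres, between the two estimates

def position_distance(msg_a, msg_b):
    # Euclidean distance computed on the fields x-pos and y-pos.
    return math.hypot(msg_a["x-pos"] - msg_b["x-pos"],
                      msg_a["y-pos"] - msg_b["y-pos"])

def compete_on_position(odometry_msg, context_msg, publish):
    # Competition: if the two position-msg sources disagree too much,
    # publish an alert-msg (the camera/PIR may not be working properly).
    d = position_distance(odometry_msg, context_msg)
    if d > ALERT_THRESHOLD:
        publish("alert-msg", {"reason": "position mismatch", "distance": d})

compete_on_position({"x-pos": 3.0, "y-pos": 4.0},
                    {"x-pos": 6.0, "y-pos": 8.0},
                    lambda t, p: print(t, p))   # distance 5.0 -> alert-msg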
4.3 Distributed knowledge representation and reasoning
This Section focuses on three cognitive agents, namely context-assessment, planner and executor, which provide the Artificial Ecosystem framework with interesting and powerful capabilities, by contributing to several need chains that allow the system "as a whole" to perform context assessment, to support reasoning and to act in order to reach a given objective. In particular, up to now Cooperation and Competition have been described as spontaneously emerging when required, given that the involved individuals are within communication range. In the following, it is described how it is possible to drive the system by purposively establishing Cooperation and Competition patterns. In this scenario, it is assumed that instances of context-assessment, planner and executor are scheduled both on robots (Fig. 5) and on supervision stations (Fig. 6): different instances usually rely on different representations
12 It is extremely difficult for the robot to hide itself when continuously relying on the building automation network to perform its operations, just as it is difficult for a fugitive to hide her/himself by repeatedly using her/his credit card to pay in restaurants and hotels!
of the system state, and possibly have different objectives. The three agents can also be executed on other kinds of individuals (e.g., PDAs), provided that sufficient computational resources are available, and hybrid schemes are conceivable as well, where different components are scheduled on different individuals. Specifically, key roles include:
• Assessing and semantically classifying occurring events by exploiting the information flowing within the ecosystem. This is a fundamental counterpart within the Artificial Ecosystem of bionic self-refinement processes in Nature: the system should reason upon itself and its own environment at a higher level, in order to reconfigure the parts of concern to increase adaptivity, reliability and autonomy. With respect to this process, context assessment is the very first step, where the system gains a symbolic representation of its own state.
• Activating and deactivating agents belonging to the Artificial Ecosystem, thus establishing and restoring need chains to enforce reliability and efficiency. The configuration of the Artificial Ecosystem evolves according to the current system state (i.e., the assessed context): reasoning processes can then arrange the activation of necessary agents and the deactivation of agents that are no longer needed on the basis of the required message types.
• Planning and executing actions specifically aimed at operating in and modifying the real world. The Artificial Ecosystem is part of its own environment: as a consequence, context assessment and reasoning produce a course of action that – when applied to the environment – finally results in a given goal state.
An example is introduced (i.e., a possible implementation of the Cooperation pattern labelled Surveillance in Section 3.1) in order to explain the behavior of these three agents in detail. Suppose that two Passive Infrared sensors along a corridor leading to the Laundry produce pir-msg messages in sequence, during night time (when nobody should be there). context-assessment at the supervision station consumes these messages, and – by merging them with other information – infers that somebody is possibly in the Laundry (Fig. 7). Since it is necessary to check whether the detected person is a staff member or an intruder and – in the latter case – to take a picture of her/him, context-assessment publishes a new problem-msg, asking for some agent that has the proper skills to investigate, and to consequently produce a surveillance-msg message. planner at the supervision station consumes problem-msg: unfortunately, it quickly realizes that there are no cameras for videosurveillance in the Laundry, and hence no surveillance-camera agent able to produce surveillance-msg messages: a need chain exists that is not closed. The only chance is given by the robot MerryPorter, which is currently in the Pharmacy, and on which the agent surveillance (able to produce surveillance-msg messages as well) is currently not active in order to avoid wasting computational resources. planner produces a plan-msg message including actions to move MerryPorter to the required destination, as well as to activate the agent surveillance. Finally, plan-msg is consumed by the executor agent running on the robot, which sends proper messages to all involved agents to execute the plan.
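A minimal sketch of the planner's reasoning in this example, under the assumption of a simple agent registry (the data structures, fields and action names below are illustrative, not the chapter's actual encoding): the planner looks for an agent able to produce the required message type at the required place and, when the only candidate is inactive or elsewhere, it emits the Move and Activate actions needed to close the chain.

# Hypothetical registry of agents able to produce surveillance-msg.
agents = [
    {"name": "surveillance-camera-2", "individual": "node-2", "place": "Pharmacy",
     "produces": {"surveillance-msg"}, "active": True, "mobile": False},
    {"name": "surveillance", "individual": "MerryPorter", "place": "Pharmacy",
     "produces": {"surveillance-msg"}, "active": False, "mobile": True},
]

def plan_for(msg_type, place):
    # Return the actions needed to close the need chain for msg_type at place.
    for a in agents:    # 1. an active producer already at the right place
        if msg_type in a["produces"] and a["active"] and a["place"] == place:
            return []
    for a in agents:    # 2. otherwise, move and/or activate a mobile producer
        if msg_type in a["produces"] and a["mobile"]:
            actions = []
            if a["place"] != place:
                actions.append(("Move", a["individual"], a["place"], place))
            if not a["active"]:
                actions.append(("Activate", a["name"], a["individual"]))
            return actions
    return None         # the need chain cannot be closed

print(plan_for("surveillance-msg", "Laundry"))
# [('Move', 'MerryPorter', 'Pharmacy', 'Laundry'), ('Activate', 'surveillance', 'MerryPorter')]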
The context-assessment agent is responsible for maintaining a symbolic representation of both the current state and the goal state that the Artificial Ecosystem must be driven to. To this purpose, the agent subscribes to all the messages exchanged within the ecosystem and, on the basis of techniques related to context acquisition and recognition [8], is able to determine the current "system state" in the form of predicates13. Within the Artificial Ecosystem framework, a representation of relevant knowledge is maintained using an ontology, which represents software agents (defined in terms of published and subscribed message types), a-bionic members of the community (with respect to the individuals they belong to and their location in the Cartesian space), message types exchanged within the ecosystem, the topology of the environment (explicitly representing connections between different places), action templates, goals possibly submitted by humans or plans which must be executed, and predicates about specific properties of the current state. In general, ontologies are meant to represent what exists in a given domain using the concept-role dichotomy usually adopted in semantic networks. Concepts can be considered classes of objects that have instances, whereas roles represent named relationships among concepts. From a logic perspective, ontologies can be considered a tractable fragment of First Order Logic: reasoning in ontologies is performed through subsumption, which is a sort of membership relationship between concepts and instances. For a comprehensive survey on ontologies and related reasoning techniques refer to [3]. In particular, the ontology maintains a collection of base symbols, referred to as predicates, which represent the content of messages received over time. For example, as soon as context-assessment receives a pir-msg message, a predicate instance is added to the ontology. Therefore, each predicate is provided with a semantics that depends on the particular message type it is associated with. Besides predicates, the ontology also maintains a set of symbols, called contexts, which are described using base predicates. Contexts are provided with a semantics that is more than the mere union of the semantic properties characterizing their constituent predicates: for example, as soon as two pir-msg messages are received from the two Passive InfraRed sensors along the corridor, it is possible to infer that somebody is moving to the Laundry, which is more than just saying that two PIRs have detected something at subsequent times. Contexts are hierarchically organized, since they can contribute to more complex contexts. At every time instant, all predicate instances are combined in order to create a general picture of the current system state. Next, for each context defined within the ontology, the system determines whether it is satisfied by the current system state through subsumption. Contexts that are satisfied provide the system state with a partial interpretation that is based on their own semantics. If the satisfied contexts contribute to the definition of higher level contexts, this process is iterated until the most general contexts are evaluated. In practice, as anticipated, in the Surveillance example pir-msg messages originated by PIR sensors are associated with properly defined predicate instances
13 In the related literature, the symbolic representation of a system through logic predicates is usually referred to as a context, hence the term "context assessment".
Fig. 7 Context assessment in the Surveillance example. Grey circles correspond to concepts, either base predicates or contexts; black circles correspond to “roles”, i.e., relationships between concepts
within the ontology (Fig. 7). Analogously, another predicate takes time into account. These three predicates contribute to the definition of a context (referred to as somebody-in-the-Laundry-at-night in Fig. 7) whose semantics refers to a movement towards the Laundry during the night. When the actual messages are received, the joint description of the corresponding predicates satisfies the definition of such a context, thus making it necessary to fire proper surveillance procedures. The planner is a cognitive agent that subscribes to problem-msg and publishes plan-msg messages. A plan is simply a sequence of actions whose execution, from the current state, leads to the goal state. The broadly used nomenclature of STRIPS-like planning is adopted here14: each action is characterized by a precondition-list (i.e., a collection of predicates which must hold for the action to take place), a delete-list (i.e., made up of predicates that do not hold anymore after the action is executed) and an add-list (i.e., a collection of predicates holding after the action is performed). Usually, the delete-list and add-list are combined into a postcondition-list. However, planning in the Artificial Ecosystem requires an innovative and dramatic paradigm shift in the concept of "system state": the state must seamlessly include predicates about bionic properties of the ecosystem (e.g., information exchanged, the active/not active state of agents, etc.), a-bionic properties (e.g., the spatial relationships between physical objects), and even hybrid properties, which are related to both bionic and a-bionic members (e.g., actions and perceptions, which involve both energy and information). This approach is very different from what is usually done in reasoning systems: in the Artificial Ecosystem the planning space is not organized in a hierarchical fashion with a distinction between reasoning
14 According to the Planning Domain Definition Language (PDDL) 3.1, whose specifications can be found at http://ipc.informatik.uni-freiburg.de/PddlExtension.
and metareasoning, i.e., the system is able to reason both on the actions that it must perform in the physical world and on itself, according to a flat model which allows it to exploit bionic, a-bionic and hybrid properties of the state indifferently. The previous example can be used to illustrate these concepts (Fig. 8): when reasoning "on the system", the planner scheduled on the supervision station plans to Activate the agent surveillance on MerryPorter by publishing a proper message of type surv-act-msg. On the other hand, when reasoning "on the physical world", the plan contains proper instructions to Move MerryPorter from the Pharmacy to the Laundry. In this case, planning entities include: agent, message, individual and place. These are the variables used to ground predicates, as described in Fig. 8 on top: instances of entities include surveillance as an agent, surv-act-msg and surveillance-msg as messages, MerryPorter as an individual, and Pharmacy and Laundry as places. Relevant operators available to planner are described in Fig. 8 on the mid-top. Fig. 8 on the mid-bottom reports the fields of the problem-msg consumed by planner, including, as usual, a goal and a start state. Notice that, in the goal state, context-assessment does not need to explicitly request a robot to move to the Laundry. Instead, it is sufficient to specify that a message of type surveillance-msg must be produced by an agent that is in the Laundry, i.e., through the predicate AvailableAt(surveillance-msg, Laundry); it is up to planner to exploit the relationships between agents and the individuals they belong to, in order to infer a course of actions allowing the goal to be reached from the current state. For example, should a camera for videosurveillance be present in the Laundry, and the corresponding agent surveillance-camera be active, this would be sufficient to immediately satisfy the goal. The resulting plan is encoded within a plan-msg submitted to executor, and is described in Fig. 8 on the bottom. As usual, planner is able to re-plan should the state, as assessed by context-assessment, change in an unexpected way. Finally, the executor is a cognitive agent that subscribes to plan-msg messages and executes the plan. To this purpose, it publishes, at different times, different messages suitable to drive the overall ecosystem: in this case, a surv-act-msg message to activate the agent surveillance, and a proper sequence of target-msg messages to drive the robot to its destination. Again, its scope encompasses software agents as well as individuals acting in the physical world.
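The following sketch gives one possible STRIPS-like encoding of the Move and Activate operators and of the goal AvailableAt(surveillance-msg, Laundry); the predicate and operator names follow the example above, but the actual encoding used in Fig. 8 may differ, and the plan is applied by hand rather than found by search.

# A state is a set of ground predicates, mixing physical and "system" facts.
start = {
    ("At", "MerryPorter", "Pharmacy"),
    ("BelongsTo", "surveillance", "MerryPorter"),
    ("Inactive", "surveillance"),
}
goal = {("AvailableAt", "surveillance-msg", "Laundry")}

def move(state, robot, src, dst):
    # precondition: At(robot, src); delete: At(robot, src); add: At(robot, dst)
    if ("At", robot, src) in state:
        return (state - {("At", robot, src)}) | {("At", robot, dst)}
    return None

def activate(state, agent, robot, msg_type, place):
    # precondition: Inactive(agent), BelongsTo(agent, robot), At(robot, place)
    # delete: Inactive(agent); add: Active(agent), AvailableAt(msg_type, place)
    if {("Inactive", agent), ("BelongsTo", agent, robot), ("At", robot, place)} <= state:
        return (state - {("Inactive", agent)}) | {
            ("Active", agent), ("AvailableAt", msg_type, place)}
    return None

state = move(start, "MerryPorter", "Pharmacy", "Laundry")
state = activate(state, "surveillance", "MerryPorter", "surveillance-msg", "Laundry")
print(goal <= state)   # True: the goal state has been reached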
5 Conclusion
The main ideas underlying the new and exciting field of Ubiquitous Robotics have been investigated. In particular, it has been shown how issues related to robot autonomy and reliability can be better faced by adopting a fruitful paradigm shift: mobile robots are not autonomous, embodied and physically self-sustaining entities; on the contrary, they are situated in their own environment, which is tailored especially for them, not only for humans. Four frameworks have been selected from the literature for
Fig. 8 Tables with predicates, operators, a specific problem, and a possible plan (from top to bottom) for the reference example
their widely recognized merits in the field and, in particular, a case study has been adopted to describe general guidelines in greater detail, encompassing architectural methodologies, strategies for agent cooperation and competition, and distributed knowledge representation and reasoning.
Acknowledgements The authors wish to thank Prof. Renato Zaccaria, with whom they elaborated these ideas, as well as Dr. Francesco Capezio for useful – and sometimes harsh – discussions about the principles described in this Chapter.
References
[1] Aarts, E., Collier, R., van Loenen, E., Ruyter, B.: Proceedings of the First European Symposium on Ambient Intelligence (EUSAI 2003), vol. 2875. Veldhoven, The Netherlands (2003)
[2] Augusto, J.C., Shapiro, D. (Eds.): Advances in Ambient Intelligence. Frontiers in Artificial Intelligence and Applications (FAIA) series, vol. 164. IOS Press (2007)
[3] Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P.: The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press (2003)
[4] Brady, M.: Artificial intelligence and robotics. Artificial Intelligence 26, 79–121 (1985)
[5] van Breemen, A., Crucq, K., Krose, B., Nuttin, M., Porta, J., Deemester, E.: A user-interface robot for ambient intelligent environments. In: Proceedings of the First International Workshop on Advances in Service Robotics (ASER 2003). Bardolino, Italy (2003)
[6] Brooks, R.: Elephants don't play chess. Journal of Robotics and Autonomous Systems 6, 3–15 (1990)
[7] Chli, M., de Wilde, M.: Convergence and Knowledge-Processing in Multi-Agent Systems. Advanced Information and Knowledge Processing Series. Springer Verlag / Heidelberg (2009)
[8] Dey, A.: Understanding and using context. Journal of Personal and Ubiquitous Computing 5(1), 4–7 (2001)
[9] Evans, J.M., Krishnamurthy, B.: Helpmate, the trackless robotic courier: A perspective on the development of a commercial autonomous mobile robot. In: Autonomous Robotic Systems, Lecture Notes in Control and Information Sciences, vol. 236, pp. 182–210. Springer Berlin / Heidelberg (2006)
[10] Kim, J.: Ubiquitous robot. In: Computational Intelligence, Theory and Applications. Advances in Soft Computing Series, vol. 33. Springer Berlin / Heidelberg (2006)
[11] Kim, J., Lee, K., Kim, Y., Kuppuswamy, N., Jo, J.: Ubiquitous robot: A new paradigm for integrated services. In: Proceedings of the 2007 IEEE International Conference on Robotics and Automation (ICRA 2007). Rome, Italy (2007)
[12] Krose, B., Porta, J., van Breemen, A., Crucq, K., Nuttin, M., Deemester, E.: Lino, the user-interface robot. In: Proceedings of the First European Symposium on Ambient Intelligence (EUSAI 2003). Lecture Notes in Computer Science, vol. 2875. Veldhoven, The Netherlands (2003)
[13] Miozzo, M., Scalzo, A., Sgorbissa, A., Zaccaria, R.: Autonomous robots and intelligent devices as an ecosystem. In: Proceedings of the 2002 International Symposium on Robotics and Automation (ISRA 2002). Toluca, Mexico (2002)
[14] Park, K., Bien, Z., Lee, J., Kim, B., Lim, J., Kim, J., Lee, H., Stefanov, D., Kim, D., Jung, J., Do, J., Seo, K., Kim, C., Song, W., Lee, W.: Robotic smart house to assist people with movement disabilities. Autonomous Robots 22(2) (2007)
[15] Pfeifer, R., Bongard, J.C.: How the Body Shapes the Way We Think – A New View of Intelligence. MIT Press, Cambridge, MA (2007)
[16] Saffiotti, A., Broxvall, M., Gritti, M., LeBlanc, K., Lundh, R., Rashid, J., Seo, B., Cho, Y.: The PEIS ecology project: Vision and results. In: Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2008). Nice, France (2008)
[17] Siciliano, B., Khatib, O.: Handbook of Robotics. Springer Verlag / Heidelberg (2008)
Behavior Modeling for Detection, Identification, Prediction, and Reaction (DIPR) in AI Systems Solutions
Rachel E. Goshorn, Deborah E. Goshorn, Joshua L. Goshorn, and Lawrence A. Goshorn
1 Introduction
Distributed artificial intelligence (AI) systems for behavior analysis and prediction are a present-day requirement rather than the luxury they were in the past. The advent of distributed AI systems with large numbers of sensors and sensor types, and the otherwise unobtainable network bandwidth they would demand, is also a key driving force. Additionally, the fusion of a large number of sensor types and inputs is required and can now be implemented and automated in the AI hierarchy; therefore, it no longer requires human power to observe, fuse, and interpret. In the past, AI systems were limited because of the lack of: distributed networks, bandwidth availability, low-cost computing power (instructions/sec and memory size), new required unique and low-cost sensors, and generalized software approaches (e.g., now enabled through the sequential syntactical behavior grammars discussed in this chapter) for dissecting and implementing the intelligence requirements. The overview of an AI Systems Solution can be seen in the pyramid of Fig. 1, where each level is a system of the entire AI Systems Solution.
Rachel E. Goshorn, Naval Postgraduate School, Systems Engineering Department, Monterey, CA, U.S.A.
Deborah E. Goshorn, Naval Postgraduate School, MOVES Institute, Monterey, CA, U.S.A.; University of California, San Diego, Computer Science and Engineering Department, La Jolla, CA, U.S.A.
Joshua L. Goshorn, JLG Technologies, Inc., Pebble Beach, CA, U.S.A.
Lawrence A. Goshorn, JLG Technologies, Inc., Pebble Beach, CA, U.S.A.
The primary purpose
Fig. 1 AI Systems Solution Pyramid.
of this chapter is to describe the Detection, Identification, Prediction, and Reaction (DIPR) system (and its subsystems) in a generalized, application-independent manner; this is the second level of the AI Systems Solution pyramid. The total solution would require the Infrastructure of Sec. 3 (first level of the pyramid) and the Application Models, which engineer the DIPR subsystems to perform in a specific rather than generalized solution (Sec. 4, third level of the pyramid). The complete solution begins with an actual Application (fourth level of the pyramid) to obtain the knowledge required for the definitions of the Application Models and the DIPR subsystems. Once the Application Models and the DIPR subsystems are defined within the actual application, customization for the final solution is carried out (fifth level of the pyramid). Level two of the AI Systems Solution (DIPR Subsystems) is a generalized (non-application-specific) software approach that breaks the artificial intelligence behavior modeling problem into four distinct intelligence functions, described and implemented using distinct subsystems: Detection, Identification, Prediction, and Reaction (DIPR), as seen in Fig. 2. In order to analyze behavior in systems for various applications, the behavior analysis needs to be modeled a priori. Since the range of behavior analysis is very large, it must be approached from a systems view, describing the entire problem and its subsystems, and then modeling each part. AI and behavior modeling and the fusion of features and intelligent states can be carried out in DIPR. From the sensors, raw sensor data is inputted into the first subsystem, Detection, with extracted object features as the output. In the "Detection" subsystem, low-level classifiers extract object features from the raw sensor data. The subsystem "Identification" receives object features with spatial-temporal attributes as its input. The outputs of "Identification" are intelligent states (symbols). These intelligent states (symbols) are created by fusing the inputted object features and spatial-temporal attributes through a priori defined low-level rules of intelligence, as shown in Fig. 3. These rules of intelligence for fusion are application dependent. The inputs to "Prediction" are these intelligent states (symbols), with outputs of
Fig. 2 AI and Behavior modeling occurs at multiple levels of detail, defined in the overall system Detection, Identification, Prediction, Reaction (DIPR) and its subsystems, becoming the foundation for AI and behavior modeling.
predicted behavior outcomes. In the "Prediction" subsystem's first stage, the Behavior Classifier Module, intelligent states (symbols) are concatenated over time/space to form spatial-temporal sequences. These sequences are then classified into behaviors (behavior labels) through a sequential syntactical behavior classifier. These behavior classification labels are then inferred to predicted behavior outcomes, as shown in Fig. 4. The Infer Predicted Outcomes Module is a fusion of the behavior labels and a priori defined application knowledge, through a priori defined inference rules. The inputs to "Reaction" are predicted behavior outcomes, and its outputs are actions. The main function of "Reaction" is to create actions in response to the predicted behavior outcomes, and it is application dependent. Details of AI and behavior modeling for DIPR are discussed in Sec. 2. Overall, this chapter assumes one object per DIPR. If multiple objects occurred at the same time, multiple DIPR instances would be carried out. Therefore, there is one DIPR per object.
Fig. 3 “Identification” subsystem of the DIPR system.
Additional objectives of this chapter are: (1) to introduce the reader to advanced fusing concepts above sensor and state fusion, and (2) to introduce the reader to learning concepts to modify the DIPR process automatically. The four subsystems of DIPR can then be distributed onto an infrastructure for implementation in a wide variety of methods to meet the needs of the intended
Fig. 4 “Prediction” subsystem of the DIPR system.
applications, given the constraints and limitations of the actual hardware/software. In order to model AI and behaviors for distributed AI systems, a physical/geographical infrastructure is in place (or will be in place), onto which the AI and behavior modeling will be superimposed. This is discussed in Sec. 3. These subsystems and infrastructures can then be shown to demonstrate a number of application models such as: AI decision making, abnormal behavior detection, smart environments, ambient intelligence, and turbocharging detection classifier statistics. Application Models are developed from knowledge of the specific application and the requirements of the DIPR subsystems. This is discussed in detail in Sec. 4.
2 AI and Behavior Modeling for DIPR Subsystems
This section describes AI and behavior modeling for the four subsystems of DIPR: Detection, Identification, Prediction, and Reaction. Application models and specific applications of DIPR can be seen in Sec. 4.
2.1 DIPR "Detection" Subsystem
As the first subsystem in the DIPR subsystems, the "Detection" subsystem processes the raw data streamed from the sensors in the provided infrastructure. It takes as input raw sensor data and outputs a temporal (feature, space) matrix. Certain preselected features are extracted from the raw data at each time step. The type of features to be extracted is decided a priori and is fundamental for the modeling process. This is usually done based on which objects need to be detected and how they need to be described in order to describe a behavior. Additionally, there are several ways to decide which features to extract from each type of sensor data and behavior modeling application, using pattern recognition or data mining techniques
[14]. Features vary in complexity, ranging from low-level simple features close to the raw data to more complex high-level features. For example, a simple feature may be the raw data measured from a sensor. A complex feature may be a function of other simple features. For example, a complex feature might measure the rate of change, i.e., the derivative, of another feature. In addition to deciding what type of feature to measure, the time at which each feature is measured must be decided. This is usually a function of the sampling rate of the raw data from the various sensors. In addition, the time interval between feature measurements may be influenced by the behavior modeling application, which might specify how often new information must be obtained. Finally, features may be measured at different spaces. The particular increments in space must be decided as well. The space increments depend on the sensor locations as well as on the behavior modeling application.
Fig. 5 (a) Simple temporal (feature,space) matrix example where features are binary-valued (black and white). (b) A temporal (feature, space) matrix example where features are either real valued and normalized between zero and one (grayscale) or binary (black and white).
Since varying features are measured at varying spaces at a given time, features are stored in a temporal (feature, space) matrix for a particular time instance t, as shown in Fig. 5; this matrix is denoted f(i, s, t), and it holds the values of features i ∈ [1, k] at varying spaces s ∈ [1, m] at time instance t. This figure displays two examples of features: one with simple binary-valued features represented as pixel values of zero (white) or one (black). The second example displays both binary-valued features and features with analog values, normalized between zero (white) and one (black). The analog-valued features are represented as grayscale pixels. Note that features must all be normalized between zero and one, allowing the fusion of features from multi-modal sensors. In summary, features are detected in various ways. The simplest way is to feed through the sampled sensor raw data itself. Other features may be detected by calculations of rate of change. More high-level features may be the responses of various types of features. Features may be the class labels that are outputted from specific low-level object classifiers. Thus, the "Detection" subsystem consists of all the
low-level classifiers, filters, and methods that detect features. For every time increment, features are detected and maintained at all possible space increments, yielding a 3-dimensional matrix which stores the feature values, as seen in Fig. 6. Note that sometimes these features must be normalized using a priori normalization factors, as shown in Fig. 6.
Fig. 6 Feature space vectors compose one feature space matrix at one time instance. As time varies, there are multiple feature space matrices that make up the three dimensional spatial-temporal feature matrix.
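As a sketch of the data structure just described (assuming NumPy and purely illustrative dimensions and normalization factors; the chapter does not prescribe an implementation), the "Detection" subsystem can maintain the 3-dimensional spatial-temporal feature matrix f(i, s, t) with all values normalized to [0, 1]:

import numpy as np

K_FEATURES, M_SPACES, T_STEPS = 4, 8, 100       # illustrative sizes
f = np.zeros((K_FEATURES, M_SPACES, T_STEPS))   # f[i, s, t] in [0, 1]

# Assumed a priori normalization factors (maximum raw value of each feature).
norm = np.array([1.0, 255.0, 10.0, 5.0])

def detect(raw_samples, t):
    # Store one time slice of detected features, normalized to [0, 1].
    # raw_samples has shape (K_FEATURES, M_SPACES).
    f[:, :, t] = np.clip(raw_samples / norm[:, None], 0.0, 1.0)

detect(np.random.rand(K_FEATURES, M_SPACES) * norm[:, None], t=0)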
2.2 DIPR "Identification" Subsystem
The main function of the "Identification" subsystem is to fuse temporal (feature, space) matrix entries to identify the intelligent state (symbol) of the observed environment at a time step. Therefore, the input to the "Identification" subsystem is the temporal (feature, space) matrix, and its output is an intelligent state (symbol); an output occurs if the fusion (intelligent rules) of "Identification" is met. An intelligent state (symbol) at a current time step is simply when desired measured features take on a desired range of values for particular spaces at the current time step, as carried out through the Fusion (Intelligent Rules) Module. The Fusion (Intelligent Rules) Module can be seen in the second stage of the "Identification" subsystem in Fig. 3. The desired fusion of feature values at various spaces should be the fusion that captures an important intelligent state (symbol) worth recognizing; an important intelligent state (symbol) is worth recognizing if it occurs in order to capture part of a defined behavior, as seen in Sec. 2.3. Thus, intelligent states (symbols) are identified by recognizing the desired fusion (intelligent rules) of temporal feature values at desired spaces. The fusion of temporal (feature, space) values is described by defining an intelligent rule for that intelligent state (symbol). Intelligent rules can be generic and are defined by the user in an application-dependent manner. An intelligent rule may be as simple
as a Boolean expression which fuses ranges of features with logical ANDs, where the clauses of the expression represent the ranges of values for each desired feature that describe a particular intelligent state (symbol). If the intelligent rule is satisfied at a given time step, then the corresponding intelligent state (symbol) is identified and is outputted from the "Identification" subsystem. An example of this case is seen in Fig. 7.
Fig. 7 In this case, a symbol is identified when a Boolean expression combining binary-valued features in space is satisfied at a given time instance.
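A minimal sketch of such a Boolean intelligent rule (the feature indices, spaces and thresholds are assumptions chosen only for illustration): the rule fuses feature values at given spaces with logical ANDs and outputs a symbol, or the null symbol when it is not satisfied.

NULL_SYMBOL = None

def rule_symbol_a(f, t):
    # Intelligent rule for symbol 'a': binary feature 0 fires at space 2
    # AND feature 1 exceeds 0.5 at space 3, at time step t (assumed thresholds).
    if f[0, 2, t] >= 1.0 and f[1, 3, t] > 0.5:
        return "a"
    return NULL_SYMBOL

def identification(f, t, rules):
    # Apply the intelligent rules in order; output the first satisfied symbol,
    # or the null state (symbol) if no rule is met.
    for rule in rules:
        symbol = rule(f, t)
        if symbol is not NULL_SYMBOL:
            return symbol
    return NULL_SYMBOL

# symbol = identification(f, t=0, rules=[rule_symbol_a])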
Thus, for every temporal (feature, space) matrix inputted, there is an intelligent state (symbol) outputted when the fusion (intelligent rules) is met. If the fusion (intelligent rules) is not met, a null state (symbol) is outputted. The use of the null state (symbol) is part of the "Prediction" subsystem behavior analysis; e.g., the system lost track of an object over a set of time steps, or has not initiated a sequence (first module of "Prediction") for an object. In addition to identifying the intelligent state (symbol), the "Identification" subsystem stores the values of the (feature, space) matrix at every time increment, creating a three-dimensional matrix (feature, space, time), as seen in Fig. 6. This allows the "Identification" subsystem to fill in any entries of the temporal (feature, space) matrix that are functions of time. This three-dimensional matrix is useful for intelligent states (symbols) extracted from increments in (feature, space) values (e.g., derivatives). An example of a case in which the "Identification" stage must fill in features that are dependent on time is described in Fig. 8. In this case, feature fk is the estimated velocity, derived from the difference in space and time of the object being detected. Storing this three-dimensional matrix is also useful when the DIPR system needs to learn new fusion (intelligent rules) to describe previously unseen intelligent states (symbols) and behaviors, as well as (feature, space) values changing over temporal samples; this is described in Section 5.2, DIPR Learning.
Fig. 8 (a) In this case, the "Identification" subsystem must calculate the real-valued feature fk, which is a function of a binary-valued feature f2 at two different times (current time ti and a previous time ti − Δt). The formula for fk estimates the velocity of the object which was detected with feature f2 at two different times and spaces. (b) The "Identification" subsystem fills in the fk entry (which is independent of space) in the (feature, space) matrix. In this case, the rule for identifying a symbol b checks whether the value of fk is greater than a certain threshold.
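The time-dependent feature of Fig. 8 can be sketched as follows (the normalization constant, threshold and matrix layout are assumptions): the estimated velocity is derived from the spaces at which the binary feature f2 fired at two different times, stored space-independently as fk, and symbol b is identified when it exceeds a threshold.

import numpy as np

V_MAX = 5.0          # assumed maximum speed used to normalize fk into [0, 1]
V_THRESHOLD = 0.8    # assumed threshold of the rule identifying symbol 'b'

def fill_velocity_feature(f, feat_f2, feat_fk, t, dt):
    # Estimate the object's speed from where feature f2 fired at t - dt and t,
    # then fill the (space-independent) row fk of the (feature, space) matrix.
    s_now = int(np.argmax(f[feat_f2, :, t]))
    s_prev = int(np.argmax(f[feat_f2, :, t - dt]))
    velocity = abs(s_now - s_prev) / dt             # space increments per time step
    f[feat_fk, :, t] = min(velocity / V_MAX, 1.0)   # normalized, broadcast over space

def rule_symbol_b(f, feat_fk, t):
    # Symbol 'b' is identified when the velocity feature exceeds the threshold.
    return "b" if f[feat_fk, 0, t] > V_THRESHOLD else None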
2.3 DIPR "Predict" Subsystem
The "Prediction" subsystem is the most involved subsystem of DIPR and the heart of AI and behavior modeling. It takes as input the identified intelligent states (symbols) from the "Identification" stage and outputs predicted behavior outcomes. This stage carries out the high-level behavior classification, as well as the prediction of the corresponding outcome of the behavior. In other words, in this subsystem, sequences of intelligent states (symbols) are first classified into behaviors; these behaviors are then inferred (predicted) to be one of the known prediction outcomes or possibly the unknown prediction outcome. The "Prediction" subsystem is decomposed into two modules (the first and second "Prediction" subsystem stages), as seen in Fig. 4. The first stage of the Behavior Classifier Module (seen in Fig. 4) contains an intelligent state (symbol) buffer where sequences of intelligent states (symbols) are formed. The second stage of the Behavior Classifier Module is the sequential syntactical behavior classifier, explained below in Section 2.3.1. The output of the Behavior Classifier Module is a behavior label. This label is either one of the a priori defined behaviors or an unknown behavior label (if the sequence was not classified as any of the possible predefined behaviors). The behavior labels are inputted to the second stage of "Prediction", the Infer Predicted Outcomes Module, which infers one of the predicted behavior outcomes. The predicted outcomes inferred are based on conditions specified by the user and the application.
2.3.1 Behavior Classification Module ("Predict" Subsystem's First Stage)
The Behavior Classification Module's second stage, the Sequential Syntactical Behavior Classifier, takes in a sequence of intelligent states (symbols) and outputs the classified behavior label. The behavior label will be one of the predefined behaviors or it will be the unknown behavior label. Before discussing the behavior classification method, the approach to defining behaviors is discussed here. Behaviors are represented with a syntactical grammar-based approach [2, 1], where each predefined behavior is modeled as a Finite State Machine (FSM) [3]. Let the set of all possible intelligent states (symbols) compose an alphabet Σ. An example alphabet is Σ = {a1, a2, ..., aN}, where each letter represents an identified intelligent state-symbol. A behavior is then defined as a set of syntax rules combining the elements of Σ. For example, if a is an intelligent symbol and we want to model a behavior, denoted Bk, which describes when only the intelligent state a is observed for a duration of time, we would model it with the following FSM/language, where symbol b denotes the intelligent state of terminating the sequence:

B_k = \left[ \begin{array}{l} S \rightarrow a\,Q_1 \\ Q_1 \rightarrow a\,Q_1 \\ Q_1 \rightarrow b\,F \end{array} \right] \qquad (1)
In other words, the FSM reads the sequence of intelligent states (symbols). The sequence is classified into the behavior whose corresponding FSM accepts the sequence. Since systems are not one hundred percent predictable and reliable, it is likely that a sequence will not be accepted by any of the predefined behaviors. This could be due to errors in the low-level classification of the features, a user error by deviating from the behavior by accident, or a user behavior pattern statistically deviating from what is defined. In order to deal with these deviations, the finite state machines of each defined behavior are augmented with error production rules. For example, suppose symbols a and b are two separate intelligent states (symbols), and sometimes, because of misdetections of the features in the "Detection" subsystem, the symbols can be interchanged. Let S(a, b) denote substituting the true intelligent state (symbol) b for the mislabeled intelligent state (symbol) a, with associated cost CS(b,a), and let D(a) denote deleting a mislabeled intelligent state (symbol) a, with a cost CD(a). The corresponding modified FSM Mk is shown in Fig. 9. In order to calculate the distance between an observed sequence of intelligent states (symbols) and a predefined behavior, let the possible intelligent states (symbols) be Σ = {r1, r2, ..., rN}, where N is the total number of predefined intelligent states (symbols). With this, the distance between an observed sequence of symbols s and a predefined behavior Bl is given by

d(s, B_l) = \sum_{i=1}^{|\Sigma|} \sum_{j=1}^{|\Sigma|} C_{S(r_i, r_j)} \, n_{S(r_i, r_j)} + \sum_{i=1}^{|\Sigma|} C_{D(r_i)} \, n_{D(r_i)} \qquad (2)
Fig. 9 Augmented FSM M1 of behavior Bk. The augmented syntax rules are in red, and the original syntax rules in blue with zero cost.
where nS(ri ,r j ) is the number of substitutions of the true intelligent state (symbol) r j for mislabeled intelligent state (symbol) ri , and nD(ri ) number of deletions of the mislabeled intelligent state (symbol) ri . Note, the term symbol will be used in further descriptions here for the term intelligent state (symbol). Assuming K behaviors, each behavior and its associated FSM are augmented a priori so that any sequence of intelligent symbol is accepted by each behavior, but with a total cost. The augmented behaviors are denoted by B1 , B2 , . . . BK , and their K corresponding augmented finite state machines are denoted by M1 , M2 , . . . MK . An unknown sequence s of intelligent state symbols is then parsed by each Ml , with a cost d(s, Bl ). The sequence s is then classified as the behavior Bg , where Bg = min d(s, Bl , l = 1, 2, · · · , K), and is the behavior to which it is most similar. Therefore, sequences of symbols are classified based upon Maximum Similarity Classification (MSC) as seen in Fig. 10. Automating behavior syntax rule costs is a significant part of the behavior modeling process and is discussed in this section. The costs are automated by first considering the inherent possibility of confusing two symbols. This possibility of confusion between two symbols is denoted by a certain probability between 0 and 1. For example, if there are only three predefined symbols, a, b, and c that can be identified, then there are six possible confusions that may occur. We let P(a|b) equal a probability that a true symbol a might have been confused with the given observed symbol b. We represent the probabilities of all possible symbol pair confusions in a Symbol Confusion Probability matrix, seen in Table 1. Table 1 Symbol Confusion Probability Matrix
           Labeled: a   Labeled: b   Labeled: c
True: a    P(a|a)       P(a|b)       P(a|c)
True: b    P(b|a)       P(b|b)       P(b|c)
True: c    P(c|a)       P(c|b)       P(c|c)
Fig. 10 Maximum Similarity Classification (MSC), with input sequences of intelligent states (symbols) and output behavior labels.
In the ideal world, the Symbol Confusion Probability Matrix would be the identity matrix, where the diagonal entries are one and the off-diagonal entries are zero. If no prior knowledge about the inherent confusion probabilities between symbols is available, then the identity matrix may be used as the default for automating the costs. Otherwise, any prior knowledge of symbol confusion probabilities may be combined to define the Symbol Confusion Matrix. Examples of using prior knowledge to automate the Symbol Confusion Matrix are shown in Section 4. The substitution costs are defined as the log of the inverse of the conditional probability entries in the Symbol Confusion Matrix. This can be seen in an example. The cost for substituting the observed symbol a with the true symbol b is a function of the inverse of the probability that the true symbol is a given the observed symbol b. Denoting the conditional probability P(true symbol = a | observed symbol = b) as P(a|b), the cost for substituting a mislabeled feature a with the true feature b is:

C_{S(a,b)} = 10 \log_{10} \left( \frac{1}{P(a|b)} \right)    (3)
Thinking about this intuitively, if there is a high probability that the true symbol a is confused with symbol b, i.e. P(a|b) is high, then the cost C_{S(a,b)} for substituting symbol a with b is clearly low. Additionally, if this probability P(a|b) is nonzero, then P(a|a) is less than one, since

P(a|a) = 1 − P(a|b) − P(a|c)    (4)
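To make the cost mechanics concrete, the following Python sketch derives substitution costs from a hypothetical Symbol Confusion Probability Matrix via Eq. (3) and classifies an observed sequence by minimum total cost against representative behavior sequences, i.e. Maximum Similarity Classification. The chapter obtains the substitution and deletion counts by parsing with the augmented FSMs M_k; here a weighted edit distance against one representative sequence per behavior is used as a simplified stand-in, and the deletion-cost rule, probabilities, and behavior sequences are all assumptions for illustration.

```python
import math

# Hypothetical symbol confusion probabilities P(true | labeled); the ideal case is the identity matrix.
P = {
    ("a", "a"): 0.80, ("a", "b"): 0.15, ("a", "c"): 0.05,
    ("b", "a"): 0.10, ("b", "b"): 0.85, ("b", "c"): 0.05,
    ("c", "a"): 0.05, ("c", "b"): 0.05, ("c", "c"): 0.90,
}

def substitution_cost(true_sym, labeled_sym):
    """Eq. (3): C_S = 10*log10(1/P), so likely confusions are cheap to correct."""
    return 10.0 * math.log10(1.0 / P[(true_sym, labeled_sym)])

def deletion_cost(sym):
    # Assumption for illustration: deletions are priced at twice the cheapest off-diagonal substitution.
    off_diagonal = [substitution_cost(t, l) for (t, l) in P if t != l]
    return 2.0 * min(off_diagonal)

def edit_distance(observed, behavior):
    """Weighted edit distance, standing in for parsing by the augmented FSM M_k."""
    n, m = len(observed), len(behavior)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + deletion_cost(observed[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + deletion_cost(behavior[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if observed[i - 1] == behavior[j - 1] \
                else substitution_cost(behavior[j - 1], observed[i - 1])
            D[i][j] = min(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + deletion_cost(observed[i - 1]),
                          D[i][j - 1] + deletion_cost(behavior[j - 1]))
    return D[n][m]

# Representative (hypothetical) behavior sequences B_1 ... B_K and a noisy observation.
behaviors = {"B1": list("aaabbb"), "B2": list("ccccaa")}
observed = list("aababb")

best = min(behaviors, key=lambda k: edit_distance(observed, behaviors[k]))
print(best, {k: round(edit_distance(observed, behaviors[k]), 2) for k in behaviors})
```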
2.3.2 Infer Predicted Outcomes Module (“Prediction” Subsystem's Second Stage)
The output of the “Prediction” subsystem's first stage is behavior labels, which are inputted into the “Prediction” subsystem's second stage, the Infer Predicted Outcomes Module. The Infer Predicted Outcomes Module fuses the behavior labels with a priori defined application knowledge, through a priori defined inference rules. For example, if a behavior label output was “abnormal behavior at the airport, luggage left unattended”, an inferred predicted outcome could be “bomb may explode”. In this example, the inference rule would be: if the behavior label output is “abnormal behavior at the airport, luggage left unattended”, the inferred predicted outcome is “bomb may explode”. “Bomb may explode” would require a specified action, which is defined in the next DIPR subsystem, “Reaction”. Overall, the inference rules are application dependent and may require learning and updating for optimal implementation.
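A minimal sketch of how such inference rules might be encoded is a lookup from behavior labels to predicted outcomes; the labels, outcomes, and default below are illustrative assumptions, not part of the chapter's system.

```python
# Hypothetical a priori inference rules: behavior label -> inferred predicted outcome.
INFERENCE_RULES = {
    "abnormal: luggage left unattended at airport": "bomb may explode",
    "normal: passenger walking to gate": "everything is okay",
}

def infer_predicted_outcome(behavior_label, default="unknown outcome, flag for review"):
    """Second stage of 'Prediction': fuse the behavior label with a priori
    application knowledge to produce an inferred predicted outcome."""
    return INFERENCE_RULES.get(behavior_label, default)

print(infer_predicted_outcome("abnormal: luggage left unattended at airport"))
```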
2.4 DIPR “Reaction” Subsystem
The inputs to “Reaction” are predicted behavior outcomes, and the outputs of “Reaction” are actions. The main function of “Reaction” is to create actions in response to the predicted behavior outcomes, and it is application dependent. Therefore, in “Reaction”, reaction rules are defined a priori, per application, as a function of the inputted inferred predicted outcomes; the outputted actions need to be adapted and learned (either system-wide or user-wide) to ensure appropriate actions occur. The DIPR “Reaction” subsystem is defined per application by the user. Examples of a DIPR “Reaction” subsystem are shown in Sec. 4. In the example above, with the inferred predicted outcome “bomb may explode”, the action defined a priori could be “implement high alert and send security guards to inspect luggage”. Overall, the reaction rules are application dependent and may require learning and updating for optimal implementation.
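In the same illustrative style, the “Reaction” subsystem can be sketched as a second lookup from inferred predicted outcomes to application-defined actions; all strings and the fallback action are hypothetical placeholders.

```python
# Hypothetical a priori reaction rules: inferred predicted outcome -> action.
REACTION_RULES = {
    "bomb may explode": "implement high alert and send security guards to inspect luggage",
    "everything is okay": "no action",
}

def react(predicted_outcome):
    """'Reaction' subsystem: map a predicted behavior outcome to an action.
    In practice these rules are application dependent and may be learned/updated."""
    return REACTION_RULES.get(predicted_outcome, "escalate to human operator")

print(react("bomb may explode"))
```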
3 Infrastructure for AI Systems Solutions
This section describes the infrastructure, level one of the AI Systems Solution in Fig. 1. The DIPR subsystems perform separate logical functions of intelligent processing. These distinct subsystems are implemented as AI nodes, as shown in Fig. 11. The hierarchy of the DIPR subsystems is implemented as follows: sensors produce the raw sensor data; a smart node implements the “Detection”
subsystem function (features); an intelligent node implements the “Identification” subsystem function (intelligent states (symbols)); a behavior node implements the “Prediction” subsystem function (behaviors); and a “Reaction” node implements the reaction subsystem function (actions defined by the user per application). These AI nodes are implemented within the physical infrastructure of the system, which consists of sensors, computers, communications networks, and any data storage components (databases). The complete DIPR system may be implemented in a variety of architectures, dependent on the Application Model, engineered according to the constraints of the underlying physical infrastructure.
Fig. 11 Intelligent hierarchy implementing the DIPR system.
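One way to picture this hierarchy is as a pipeline in which each AI node wraps one DIPR subsystem function. The sketch below is purely architectural; the node functions are placeholder lambdas, not the chapter's algorithms.

```python
from typing import Any, Callable, List

class AINode:
    """A generic AI node wrapping one DIPR subsystem function."""
    def __init__(self, name: str, fn: Callable[[Any], Any]):
        self.name, self.fn = name, fn
    def process(self, data: Any) -> Any:
        return self.fn(data)

# Hypothetical subsystem functions: raw data -> features -> symbols -> behaviors -> actions.
detection      = AINode("smart node (Detection)",            lambda raw: ["feature:" + str(x) for x in raw])
identification = AINode("intelligent node (Identification)", lambda feats: ["symbol:" + f for f in feats])
prediction     = AINode("behavior node (Prediction)",        lambda syms: "behavior:" + "|".join(syms))
reaction       = AINode("reaction node (Reaction)",          lambda beh: "action for " + beh)

def run_dipr(raw_sensor_data: List[Any]) -> Any:
    data = raw_sensor_data
    for node in (detection, identification, prediction, reaction):
        data = node.process(data)          # each node consumes the previous node's output
    return data

print(run_dipr([0.1, 0.7]))
```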
3.1 Physical Infrastructure for AI Systems Solutions
The underlying physical infrastructure in which the DIPR system is implemented includes an intelligent sensor network, required to gather and communicate raw sensor data to the higher level intelligence processing modules (such as the DIPR subsystems); the computers required to implement the DIPR subsystem processing functions (AI nodes); and the communications networks required to transmit sensor data to the processing nodes. Sensor networks can range from low data rate sensors (e.g. temperature sensors) to high data rate sensors (e.g. video cameras) [21]. As the data rate of the raw sensor data increases, the processing power of the computing nodes as well as the throughput of the communications networks must increase. The advent of high bandwidth communications networks and high speed computers allows for the deployment of real-time systems capable of implementing a DIPR system, even with high data-rate sensors such as networks of video cameras [18]. Each DIPR subsystem is an intelligent processing entity and requires sufficient processing
power and memory; the computers on which the DIPR application runs need sufficient resources as well as access to the networks used to receive the sensor data they process. Communications networks, such as wired Ethernet or wireless links (e.g. 802.15.4, ZigBee, 802.11a/b/g/n), each with their own data rates and costs, are required to transmit data from the sensors to the computers on which the DIPR subsystems are implemented. The processing power (CPU cycles, memory), the communications networks' capacity and latency, and the sensors' requirements all weigh in the architectural considerations defining the implementation of the DIPR nodes, as well as the engineering of a system-wide DIPR solution. The “Detection” processing functions are closest to the physical sensor hardware, and need intimate knowledge of the workings of the sensors and the raw data they produce. With modern embedded CPU architectures, along with the progression of Digital Signal Processors (DSPs), more processing power can be embedded into the sensor itself, allowing much of the “Detection” subsystem to be implemented in the sensor itself [19, 20]. The “Identification”, “Prediction” and “Reaction” nodes remain mostly hardware agnostic, as they work with features, symbols, and behaviors, not the raw data. The next subsystem after “Detection”, “Identification”, can fuse the features from a network of heterogeneous sensor sources. This independence from the physical sensor allows more flexibility in the geographical location where the higher stages are implemented. As the trend of pushing more processing power to the sensors themselves continues, more intelligent “Detection” and feature extraction algorithms may be implemented in a distributed fashion, in order to output features with more information and maximize the effectiveness of the higher intelligence processing subsystems within the DIPR framework. With the advent of high power processors and DSPs capable of being embedded in the sensors to create an intelligent sensor, parts, or all, of the DIPR subsystems can be distributed to run within a single intelligent sensor. Alternatively, the network throughput may be adequate to transmit high data rate sensor information (e.g. video data), and the processing of the data can then be located centrally. These engineering decisions are application specific and can be configured according to the chosen system architecture of the solution. The application and implementation of intelligent algorithms on distributed smart sensors is an area of active research, including, for example, the joint optimization of network capacity and processing capacity [17, 16].
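A rough back-of-the-envelope comparison, using assumed and purely illustrative numbers for one camera, shows why pushing “Detection” into the sensor and transmitting only features can reduce the network load by orders of magnitude.

```python
# Assumed, illustrative numbers for a single camera.
frame_w, frame_h, bytes_per_pixel, fps = 640, 480, 3, 15

raw_bps = frame_w * frame_h * bytes_per_pixel * fps * 8   # uncompressed video, bits per second
feature_bps = 200 * fps * 8                                # e.g. ~200 bytes of features per frame

print(f"raw video: {raw_bps / 1e6:6.1f} Mbit/s")
print(f"features:  {feature_bps / 1e3:6.1f} kbit/s")
print(f"reduction: ~{raw_bps / feature_bps:,.0f}x")
```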
3.2 Overlay Architecture for AI Systems Solutions
The DIPR subsystems can be implemented with a variety of system-wide architectures, segmenting the different subsystems across different network components. Each application and its system requirements can dictate the appropriate architecture, as the application-specific system architecture needs to account for
the physical-geographical infrastructure in which it will be applied. For example, a DIPR system may be retrofitted to an existing infrastructure of sensor networks, or a new system design can be created matching the goals and requirements of the DIPR application with the processing and network constraints of the physical infrastructure. This section gives a brief overview of the fundamentals of a physical-geographical infrastructure to aid in the understanding of AI and behavior modeling for distributed AI systems, including a centralized “top-down” approach, a distributed “bottom-up” approach, and hybrid architectures that combine both distributed and centralized elements to implement the intelligent system. In the “top-down” architecture, the entirety of the intelligence of the DIPR system is implemented in a central location. This architecture is appropriate where the processing power of the underlying physical infrastructure is centralized, such as a server room or a command and control center. Additionally, it is an appropriate solution in cases where the sensors themselves do not have the computing capacity to implement any of the DIPR subsystem nodes. In this architecture, the raw sensor data is transmitted from the network of sensors to the central location, where the “Detection” node begins the processing. In the “bottom-up” infrastructure, the intelligence is distributed across a network of intelligent nodes. A requirement of this distributed intelligence is that the end nodes are capable of implementing all stages of the DIPR subsystems. This implementation requires complete state information to be held locally at each low-level intelligent sensor node. If there is interaction with other sensors, it therefore requires a high overhead of coordination and data hand-offs. A hybrid solution, which utilizes both distributed intelligence and centralized intelligence, would be a network of intelligent sensors in which the “Detection” node is implemented locally on the sensors, while the “Identification”, “Prediction” and “Reaction” nodes are implemented at a central location, receiving detected features from all intelligent sensors. This greatly reduces network bandwidth consumption, because only features are transmitted from the sensors and not raw sensor data. The architecture of a DIPR system could also be implemented as a hierarchy of embedded DIPR systems. The outputs of multiple DIPR systems could be fed into a larger scale DIPR system, with each DIPR system looking for behaviors at its respective scale. Suppose behavior needed to be analyzed at all of the airports in the United States of America to detect abnormal behaviors and predict potential harm nation-wide. In order to model this, the physical-geographical hierarchy must be known or designed. The top-down approach would start with one node, where all information on the airports' behavior status is known; it would then break down into, for example, the Western half of the USA's airports (one node) and the Eastern half (one node), and each of those nodes would break into the various airports in its half. This infrastructure can be seen in Fig. 12. However, each airport itself would have its own network of sensors for monitoring activities and potential behaviors occurring locally, thus requiring local DIPR systems to function at each airport.
The output of the airport DIPR systems would be fed to a higher level geographic control center, resulting in a single nation-wide DIPR system for monitoring behavior across all airports on the network.
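The nation-wide example can be sketched as a hierarchy in which each local (airport-level) DIPR system emits a behavior status report that a higher-level node aggregates; the class names, airports, and status strings below are hypothetical.

```python
class LocalDIPR:
    """Stands in for a full airport-level DIPR system; here it only reports a status."""
    def __init__(self, airport: str):
        self.airport = airport
    def behavior_status(self) -> str:
        return f"{self.airport}: normal"      # placeholder output of the local system

class RegionalNode:
    """Aggregates the reports of its children (airports or sub-regions)."""
    def __init__(self, name: str, children):
        self.name, self.children = name, children
    def behavior_status(self) -> str:
        reports = [c.behavior_status() for c in self.children]
        return f"{self.name}: [" + "; ".join(reports) + "]"

# Hypothetical top-down hierarchy: nation -> east/west -> airports.
west = RegionalNode("Western USA", [LocalDIPR("SFO"), LocalDIPR("LAX")])
east = RegionalNode("Eastern USA", [LocalDIPR("JFK"), LocalDIPR("ATL")])
nation = RegionalNode("Nation-wide DIPR", [west, east])
print(nation.behavior_status())
```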
Fig. 12 Example of top-down infrastructure in a physical-geographical hierarchy, relating to behavior analysis of airports nation-wide in the USA.
4 Application Models
The DIPR subsystems (level two of Fig. 1) have been defined in a generalized approach, with various infrastructure approaches (level one of Fig. 1). The next level up in the AI Systems Solution Pyramid, Application Models, further engineers the AI system solution, where each subsystem of DIPR is engineered into specific structures based on the application area. These specific structures of DIPR subsystems are termed Application Models (level three of the AI Systems Solution Pyramid in Fig. 1). In order to define an Application Model, knowledge of DIPR is needed in order to transform the application into DIPR subsystem constraints and structure. Application engineering requires knowledge of DIPR and the ability to interface with an application. Application Models are pre-engineered applications that put DIPR into a specific application area. The Application Model is then customized in the actual, final, unique application. Once the Application Model is defined, or selected from a set of predefined Application Models, the actual application (level four of Fig. 1) is used to customize the Application Model into the final solution, where exact parameter values and algorithms are defined. In this section, Application Models are defined, along with an actual application per Application Model. In more detail, the specific structure of each Application Model helps engineer the design of the DIPR subsystems in preparation for the actual application, where the parameter and algorithm types per DIPR subsystem are defined. Therefore, in an Application Model, the input/output types and algorithm types per DIPR subsystem are defined: the types of sensor inputs to “Detection” are defined; the “Detection” low-level classifier types and the output types of entries in the temporal (feature,space) matrix are defined; the types of intelligence rules for fusion in “Identification” are defined, along with the output intelligent state (symbol) types;
the types of behaviors to classify in the “Prediction” Behavior Classifier Module are defined, along with the types of syntax rule cost automation approaches; the types of inference rules for inferring predicted behavior outcomes in the Infer Predicted Outcomes Module of “Prediction” are defined; and the types of actions are defined as output types of “Reaction”, and therefore the reaction rule types for “Reaction” are defined. The Application Models in this chapter are: Application Model (A) AI Decision Making, Application Model (B) Abnormal Behavior Detection, Application Model (C) Enabling “Smart Environments”, Application Model (D) Environment or “Ambient Intelligence” (Learning), and Application Model (E) Turbocharging Detection Classifiers.
4.1 Application Model (A): AI Decision Making
AI Decision Making is generalized as extracting the information needed in order to make a decision, and automating the decision making process. Various sensor types can be used and are application dependent. Assuming one sensor input type, the low-level classification types in “Detection” should detect low-level features that are important building blocks of an intelligent state (symbol). The output type of “Detection” would be specific features at a given time (entries in the temporal (feature,space) matrix). Intelligent rule types in “Identification” should combine these features at each time to output a symbol type as a foundation for forming a behavior needed in order to make a decision. In “Prediction”, behavior types could be various patterns of information in a decision environment. Inference rule types in the Infer Predicted Outcomes Module of “Prediction” would be based on potential patterns defined in the behavior types, combined with knowledge of the outcome per pattern, to support making a decision in the following subsystem, “Reaction”. The “Reaction” subsystem rules and outputs are application specific, where each action is a predefined decision based on an inferred predicted behavior outcome. An application of AI Decision Making can be seen in [2], where the decision was to select a specific adaptive filter [11] to selectively filter out a corresponding interference signal in a communications channel [13]. In this application, the sensor input type to “Detection” was a digital signal at near-zero frequency. The “Detection” subsystem consisted of a few low-level classifiers. The first low-level classifier transforms the digital signal into a spectrogram (time, frequency, power) [12, 11]; the spectrogram is then transformed into normalized grayscale image pixel values (between 0 and 255) [12]. In “Detection”, two-dimensional filters are predefined, where each filter represents a line of a certain orientation (e.g. zero degrees, 90 degrees, 45 degrees, etc.). The next low-level classifier in “Detection” applies two-dimensional filtering and outputs the two-dimensional filter that had the highest correlation (this is inputted as one feature of the temporal (feature,space) matrix). The next low-level classifier extracts statistical characteristics of the object detected with the two-dimensional filter (e.g. number of rows wide, number of columns long, center
pixels of the object, pixel intensity at the center of the object). The next low-level classifier transforms the object statistics into signal statistics (length in time, frequency bandwidth, center frequency, signal power). These signal statistics are added as additional features in the temporal (feature,space) matrix (the same temporal (feature,space) matrix that has the two-dimensional filter entry). The output of “Detection” is a (feature,space) matrix at that given time. “Identification” stores the inputted temporal (feature,space) matrices. In “Identification”, two parallel intelligence rules are carried out: one to output the intelligent state, and one to update the temporal (feature,space) matrix. The intelligence rule to output the intelligent state (symbol) is: predefine the symbols to be the predefined two-dimensional filters; select the feature entry of the temporal (feature,space) matrix that corresponds to the two-dimensional filter at the given time, map this filter to its intelligent state (symbol), and output the symbol. In parallel, an additional feature for the temporal (feature,space) matrix is calculated and added into the stored matrix. The new feature added is the change of frequency from the current temporal (feature,space) center frequency feature entry. This is inputted as a chirp frequency (and will be used later, when the rate of change of the interfering signal frequency initializes a filter). In “Prediction”, the first stage of the Behavior Classifier Module forms a sequence of symbols that is then inputted into the sequential syntactical behavior classifier. The two interferer behaviors defined are a continuous wave signal and a linear chirp, where each syntax rule cost is automated empirically through domain knowledge. The inference rule defined in the Infer Predicted Outcomes Module is: given an inputted behavior label, the predicted outcome is that the signal will be distorted by the interfering signal of that behavior label. The “Reaction” subsystem reaction rules are: various adaptive filter structures are stored that correspond to the behavior labels, so, given the behavior label, the reaction rule selects the corresponding adaptive filter and primes the filter with the signal statistics (from “Identification”); the output of “Reaction” is the action of applying the selected and primed adaptive filter to the communications channel.
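The oriented two-dimensional filter step can be illustrated with a small NumPy sketch: a synthetic spectrogram patch is correlated with a few line-orientation templates, and the best-matching orientation is selected as the “Detection” feature. The template construction, sizes, and the synthetic signal are assumptions for illustration, not the filters used in [2].

```python
import numpy as np

def line_kernel(angle_deg, size=9):
    """Normalized template containing a line at the given orientation."""
    k = np.zeros((size, size))
    c = size // 2
    for t in np.linspace(-c, c, 4 * size):
        r = int(round(c - t * np.sin(np.radians(angle_deg))))
        col = int(round(c + t * np.cos(np.radians(angle_deg))))
        if 0 <= r < size and 0 <= col < size:
            k[r, col] = 1.0
    return k / k.sum()

def best_orientation(patch, angles=(0, 45, 90)):
    """Correlate the normalized spectrogram patch with each oriented template."""
    scores = {a: float((patch * line_kernel(a, patch.shape[0])).sum()) for a in angles}
    return max(scores, key=scores.get), scores

# Synthetic spectrogram patch with a horizontal (constant-frequency) interferer.
patch = np.zeros((9, 9))
patch[4, :] = 255.0
print(best_orientation(patch))
```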
4.2 Application Model (B): Abnormal Behavior Detection
Abnormal Behavior Detection is generalized as classifying known normal behaviors and detecting when known abnormal behaviors, or any unknown abnormal behaviors, occur. In general, the known normal behavior labels are not highlighted; they are typically grouped into one category, “normal behavior”. Known abnormal and unknown abnormal behaviors are outputted as “abnormal”. If a known abnormal behavior is detected, a more refined inference of the predicted outcome exists, and therefore the action from “Reaction” can be specifically defined. In more detail, the Application Model for Abnormal Behavior Detection requires a sensor input type into “Detection” that carries the data from which “Detection” can extract the required (feature,space) matrix entries at a given time. In “Detection”,
the low-level classifiers would typically extract an object and a region (from predefined spatial regions) at a given time, from which an intelligent state (symbol) can be formed in the next subsystem, “Identification”. In “Identification”, intelligent states (symbols) are typically regions (predefined areas in space) based upon a predefined environment, and the corresponding intelligence rules fuse the temporal (feature,space) entries appropriately to form the desired intelligent state (symbol). In “Prediction”, the Behavior Classifier Module's Sequential Syntactical Behavior Classifier stage would have predefined behaviors of known normal behaviors and known abnormal behaviors, and any sequence too far away from these predefined behaviors would be labeled as unknown abnormal behavior. The threshold that defines “too far away” is part of the customization process once the actual application is known. The syntax rule costs for the behaviors can be automated based on the percentage of deviation allowed from normal, and empirically from knowledge of the application domain. The Infer Predicted Outcomes Module would have inference rules per known normal behavior, known abnormal behavior, and unknown abnormal behavior. In the case of a known normal behavior as input into the Infer Predicted Outcomes Module, the typical inference rule would output a predicted outcome of “everything is normal (i.e. okay)”. In the case of a known abnormal behavior as input into the Infer Predicted Outcomes Module, a specific inference rule would be defined based on application and domain knowledge for the predicted outcome. For the case where the behavior is labeled as unknown abnormal behavior, the inference rule depends on the application and on how drastic the predicted outcome would be; the most general rule would infer an outcome of “an unusual outcome will occur” (in an airport application, “unusual” would translate to “potential harm carried out”). “Reaction” is not required if the AI Systems Solution only requires detecting abnormal behaviors. The “Reaction” subsystem is very critical if the abnormal detection application requires a reaction to the detected abnormal behaviors. An ideal application for abnormal behavior detection is to infer a predicted outcome (e.g. “potential harm may be carried out”) and then react by creating an action to keep this predicted potential harm from occurring. A critical application of AI and behavior modeling for abnormal behavior detection is a surveillance network over a large area, such as an airport. In this application, abnormal behavior is detected because it is desired to infer a predicted behavior outcome of potential harm and react to keep the potential harm from occurring. In this application, large areas need to be monitored through a network of cameras; the camera sensor data needs to be managed; and this camera sensor data also needs to be used for detecting abnormal behaviors. Predicted behavior outcomes of potential harm are inferred and stopped through a certain reaction. Based on this critical Abnormal Behavior Detection Application Model, an application from [5] is shown. In [4], an architecture for a network of clustered cameras is used to minimize and efficiently manage bandwidth utilization. From this camera network architecture, the infrastructure outputs per cluster, per person, are used to detect abnormal behaviors intra-cluster; the architecture outputs per person, per network, are also used to detect global (inter-cluster) abnormal behaviors.
In “Detection”, the outputted (feature,space) matrix contains the camera selection manager (CSM) (or Cluster Communications Manager (CCM)) per person per cluster (or per global network). In “Identification”, intelligent states (symbols) per person are formed by combining the CSM (or CCM) with time and space attributes, as seen in Fig. 13. In “Prediction”, sequences of intelligent states (symbols) are formed, from which sequences are classified as normal behavior labels or abnormal behaviors (within camera clusters or the overall global network), as shown in Fig. 14. In this application, only known normal behaviors are defined, and any behavior too far away from the known normal behaviors is classified as unknown abnormal behavior. The threshold for determining “too far away” is defined below in this subsection. In addition to the threshold, the automation of the behavior syntax rules' costs is defined, based on defining the percentage of deviation allowed. Predicted behavior outcomes are inferred from the classified behavior labels: if a normal behavior label occurs, the inferred predicted behavior outcome is that no harm will occur (everything is okay); if an abnormal behavior label occurs, the inferred predicted behavior outcome is that harm will occur and a reaction is required. In “Reaction”, the reaction rules depend on the application; if the predicted behavior outcome is that harm will occur, a corresponding action would be to set off an alarm and carry out the necessary action (e.g. send security guards to the specific location where the predicted harm will occur). In this application, the costs for substituting one symbol for another are automated for each predefined behavior. In this case, the costs are derived from how much one symbol is allowed to stand in for another symbol. Thus, the costs are derived from how much a behavior can deviate. To do this, we assume that the amount of deviation allowed from a predefined usual behavior is specified along with the behavior definition itself. Similar to a confusion matrix, a percentage-of-deviation-allowed (PDA) matrix must be specified, as seen in Table 2. Here, an example entry of the PDA matrix would be PDA(a_M, b_M). This particular entry represents the percentage of the (incorrect) b_M CSMs that are allowed to take the place of the allowed CSMs, a_M, in a predefined behavior B of a cluster M.

Table 2 Percentage-of-Deviation-Allowed Matrix D_B for Predefined Behavior B.
Observed CSM   Behavior CSM: a   Behavior CSM: b   Behavior CSM: c
a              PDA(a, a)         PDA(a, b)         PDA(a, c)
b              PDA(b, a)         PDA(b, b)         PDA(b, c)
c              PDA(c, a)         PDA(c, b)         PDA(c, c)
For example, in [5], the predefined behavior of a person solely staying in the region covered by camera a will allow the person to deviate from this behavior by going to the region covered by camera b for PDA(a, b) percent of the observed behavior. With the knowledge of how strict each behavior is in its allowed deviation, the substitution costs for each behavior can be automated.
Fig. 13 Intra-cluster symbols (top) and inter-cluster symbols (bottom) that compose the outputs of “Identification”.
Fig. 14 Abnormal behavior detection, from the Behavior Classifier Module’s stage of Sequential Syntactical Behavior Classifier.
For example, the cost for substituting an observed symbol b_M with another symbol, say a_M, which the behavior allows, would be:

C_{S(b,a)} = 10 \log_{10} \left( \frac{1}{PDA(b, a)} \right)    (5)
The costs for deleting one symbol are, for now, simply double the substitution costs; automating them is left for future work. The costs for the inter-cluster behaviors are automated in the same manner. Thresholding for detecting when a behavior is “too far away” from a known normal behavior is defined as follows. The threshold determines whether the current parsed sequence s is to be classified as abnormal or as normal with respect to one of the predefined behaviors. It is derived from the percentage-of-deviation-allowed entries for each symbol, and is a function of the length of the current sequence s of symbols:

T_{B_k} = \sum_{\forall \text{ pairs } a \neq b} C_{S(b,a)} \, PDA(b, a) \, l_s    (6)
The length of the sequence s to be classified is denoted by l_s, C_{S(b,a)} is the substitution cost between the two symbols, and PDA(b, a) is the percentage of the symbols in sequence s that are allowed to deviate from a to b. Note that, since the substitution costs were automated specific to each predefined behavior B_k, there is a threshold T_{B_k} for each behavior. These thresholds are relative to each behavior, but there is a global threshold T_{min} such that

T_{min} = \min \{ T_{B_1}, T_{B_2}, \ldots, T_{B_K} \}    (7)
Thus, after parsing/editing the current sequence s through all of the different behaviors B_k and performing Maximum Similarity Classification (MSC), the sequence is associated with a potential predefined behavior B_g, the one with the smallest distance d(s, B_g). With T = T_{min}, or T = α T_{min} to give the system a tunable threshold, the sequence is classified as abnormal if

d(s, B_g) > T    (8)
The thresholds for the inter-cluster behaviors are automated in the same manner. In future work, both the behaviors and the amounts by which a person can deviate from a behavior while still being classified as that behavior, rather than as abnormal, will be learned.
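A compact sketch of Eqs. (5)-(8) for a single hypothetical behavior follows: substitution costs come from an assumed PDA matrix, the per-behavior threshold scales with the sequence length, and a sequence whose best MSC distance exceeds the (tunable) threshold is flagged abnormal. With several behaviors, the global threshold would be the minimum of the per-behavior thresholds, per Eq. (7).

```python
import math

# Hypothetical percentage-of-deviation-allowed entries for one predefined behavior:
# PDA[(b, a)] = fraction of observed symbols allowed to be b where the behavior expects a.
PDA = {("b", "a"): 0.10, ("c", "a"): 0.02, ("a", "b"): 0.10,
       ("c", "b"): 0.02, ("a", "c"): 0.05, ("b", "c"): 0.05}

def sub_cost(b, a):
    """Eq. (5): C_S(b,a) = 10*log10(1 / PDA(b,a))."""
    return 10.0 * math.log10(1.0 / PDA[(b, a)])

def behavior_threshold(seq_len):
    """Eq. (6): threshold grows with sequence length and the allowed deviation."""
    return sum(sub_cost(b, a) * PDA[(b, a)] * seq_len for (b, a) in PDA)

def is_abnormal(distance_to_best_behavior, seq_len, alpha=1.0):
    """Eq. (8): abnormal if the best MSC distance exceeds the tunable threshold."""
    return distance_to_best_behavior > alpha * behavior_threshold(seq_len)

print(round(behavior_threshold(20), 1), is_abnormal(95.0, 20))
```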
4.3 Application Model (C): Enabling “Smart Environments”
The main goal of the Enabling “Smart Environments” Application Model is to control various computer devices (e.g. television, phone, lights, etc.) in an assisted manner through smart interpretation of human gesture behaviors (e.g. hand gestures to smart
cameras to control lights, telephones, etc.; audio behaviors can also be interpreted to control similar devices). Various applications are possible, such as Ambient Assisted Living (AAL) to assist the elderly or those who are bedridden; other applications could be smart homes, smart classrooms, smart cars, etc. For the “Detection” inputs, the sensor input needs to carry sufficient data from which the required temporal (feature,space) entries can be extracted via the low-level classifiers; typical sensor input types would be video data or audio data. In “Detection”, typical low-level classifiers would extract a person's hand or, in the audio case, a person's sound. In “Identification”, typical intelligence rules would have a library of predefined intelligent states (symbols) that correspond to predefined hand posture shapes (or sounds, for audio); the inputted temporal (feature,space) entries would be correlated with this library to output the corresponding intelligent state (symbol). Continuing with the visual application, in “Prediction”, a sequence of hand postures would form a hand gesture behavior. Various behaviors would be defined (similar to a simplified American Sign Language). The Infer Predicted Outcomes Module would have a library of computer commands corresponding to the hand gesture behaviors; the inference rule would be to correlate the behavior label with the computer command library, and the output of “Prediction” would be the inferred computer commands. In “Reaction”, reaction rules would read in the computer commands, determine the correct computer, and output an action to send the computer command to the appropriate computer (e.g. send a wireless signal to the device). A similar approach would be carried out for audio applications, where symbols would form various syllables and behaviors would be various words, which would be inferred into computer commands and flow similarly through “Reaction” to send signals to computer devices for control, therefore enabling smart environments. An application can be seen in [7, 8], where, to enable “smart environments”, computer commands are interpreted from human gesture behaviors. This is carried out through classifying human gesture behaviors, from which computer commands are inferred and communicated to the computer of interest. In this application, the vision sensor node hosts the DIPR subsystems. The Detect subsystem outputs a priori defined hand postures with a temporal attribute. The Identify subsystem forms intelligent states (symbols) combining a hand posture with a temporal attribute. Sequences of temporal hand posture features are formed in the first stage of the Behavior Classifier Module of Predict. Sequences are classified into a hand gesture behavior label as the output of the behavior classifier. The predicted behavior outcomes of Predict are the inferred computer commands per hand gesture behavior label. The React subsystem outputs the action of a wireless communication, of the inferred computer command, from the smart vision sensor to the computer of interest. A specific application of enabling “smart environments” is seen in Ambient Assisted Living (AAL) in [8], where assisted living means providing the assisted person with custom services, specific to their needs and capabilities. AAL uses the intelligence of the human's environment and behaviors to enable a “smart environment”. Detect: hand posture; Identify: hand posture and time attribute; Predict: form sequences of hand postures over time, classify the hand gesture behavior, infer the computer
command; React: transmit the inferred computer command wirelessly from the smart vision sensor to the computer of interest. For Predict: if the application needs to model behaviors where there are many misclassifications in the low-level features, then the costs would be defined in the same manner as in the “turbocharging” application, described in Sec. 4.5.
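The whole AAL chain can be caricatured in a few lines; the posture names, the exact-match gesture lookup (a simplification of the syntactic classifier), the command strings, and the wireless stub are all made-up illustrations.

```python
# Hypothetical gesture behaviors as exact posture sequences, and the commands inferred from them.
GESTURES = {("open", "open", "fist"): "lights_on",
            ("fist", "open", "fist"): "tv_off"}
COMMANDS = {"lights_on": "turn the lights on", "tv_off": "turn the television off"}

def classify_gesture(posture_sequence):
    """Stand-in for the behavior classifier: exact lookup instead of syntactic parsing."""
    return GESTURES.get(tuple(posture_sequence), "unknown_gesture")

def send_wireless(device_command):
    """Stub for the 'Reaction' transmission to the device of interest."""
    print("sending:", device_command)

label = classify_gesture(["open", "open", "fist"])   # 'Prediction' stage 1
if label != "unknown_gesture":
    send_wireless(COMMANDS[label])                   # infer the command, then 'Reaction'
```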
Fig. 15 Ambient Assisted Living (AAL) application overview.
Fig. 16 Ambient Assisted Living (AAL) DIPR system overview.
4.4 Application Model (D): Environment or “Ambient Intelligence” (Learning)
Environment or “ambient intelligence” is a learning application of becoming aware of one's surroundings through AI means. Various sensor types can be used to become aware of an environment/surroundings. One approach could run parallel DIPR systems, with one DIPR system per sensor type. Sensor input types could be visual and audio. Assuming one sensor input type of visual, the low-level classification types in “Detection” should detect an object and a region in space. Therefore, the spatial features to detect would be defined a priori. The output type of “Detection” would be object and region at a given time (entries in the temporal (feature,space) matrix). Intelligent rule types in “Identification” should combine regions for the object over time to output a symbol type of region at a given time. In “Prediction”, behavior types could be various patterns over the predefined regions. Inference rule types in the Infer Predicted Outcomes Module of “Prediction” would be based on potential patterns defined in the behavior types, combined with knowledge of the outcome per pattern. There typically is no “Reaction” subsystem in this Application Model, if the goal is to become aware of one's surroundings without any actions defined. An application of Environment or “Ambient Intelligence” can be seen in [1], where common behaviors of cars in a specific freeway environment are discovered. In this application, cameras were used for the visual sensor input into “Detection”. The low-level classifier in “Detection” extracted the object feature of a car, and a region, where regions were predefined as five possible lanes (from slow lane to fast lane, including shoulder regions on both sides of the freeway). The output of “Detection” is a matrix with a temporal entry (car object ID, region), where each row in the matrix corresponds to a specific car, i.e. an object (each car would have a separate DIPR system). In “Identification”, the intelligent rule would extract the region and fuse it with time to yield the intelligent state (symbol). Behaviors in “Prediction” were defined as various patterns predefined in the field of view of a camera (e.g. stay in the same lane (region), or change to a nearby lane and stay in that lane, etc.). Syntax rule cost automation in the Behavior Classifier Module of “Prediction” was determined empirically through domain knowledge of the spatial regions and car patterns. The result was “learning” common behaviors. Another application can also be seen in [1], where common human behaviors occurring in an engineering laboratory are discovered. Cameras were placed around the laboratory. Detect took place at one node. Identify and part of the Predict subsystem (through the behavior label output) took place on the same node, but a different node than Detect. Identify used a person's location in space and a time attribute. Sequences of spatial locations over time are formed per person. These sequences are classified into a behavior label or an unknown behavior label. Therefore, people's normal laboratory behaviors were learned. There was no React in this application.
An example of potential inferred predicted outcomes could shift this application toward normal and abnormal behavior detection, inferring potential harm to the lab computers and servers. The React subsystem would then output specific actions to take place (e.g. shut off a server if harm to that server is predicted).
4.5 Application Model (E): Turbocharging Detection Classifiers
Detecting features from raw data is the foundation of the “Detection” subsystem in the DIPR system. When such classifiers have mediocre accuracy, there may be countless tunings until the feature classifier performs acceptably for that specific sensor environment. Additionally, it is often the case that the sensor's environment changes, either through a dynamic environment or through the sensor itself being physically moved to another location. In either case, the feature classifiers must be tuned again and often tediously retrained to achieve acceptable performance in the changed sensor environment. A novel approach to eliminate this dilemma is to incorporate a high-level classifier to enhance the performance of the low-level classifier, thus “turbocharging” the detection statistics of the feature classifier [6]. This is an application of a DIPR system inside another DIPR system. In fact, the turbocharger is a paramount application of the DIPR system. The DIPR turbocharger can be introduced in the “Detection” subsystem of any other DIPR system, as shown in Fig. 17. In the turbocharging DIPR system, the input to the “Detection” subsystem is the raw data from a single sensor at one space. The output of “Detection”, which is the input to the “Identification” subsystem, is the output of a single feature classifier at one space and one time instance. The “Identification” subsystem is a direct feed-through in this application. In the Behavior Classifier Module of the “Prediction” subsystem, the possibly noisy sequences of detected feature labels (or intelligent states (symbols)) are classified as one single detected feature (or intelligent state (symbol)). This is done by predefining a behavior model for each possible feature class that may have been outputted by the feature classifier in the “Detection” subsystem. The noisy sequence of symbols is parsed by each possible symbol behavior. The substitution costs are such that they effectively denoise any symbol distortions caused by the feature classifier. The costs are automated based on the a priori performance statistics of the feature classifier, and are defined below. Finally, there is neither a Predicted Outcomes module in “Prediction” nor a “Reaction” subsystem in the DIPR turbocharger. For automating the costs of the behavior classifier in the turbocharger, assume that the probability of misclassifying certain features (symbols) as other features (symbols) is known a priori. For example, the probability of mislabeling the true symbol b as symbol a is known a priori. The probabilities of mislabeling certain symbols as other symbols are extracted from the confusion matrix, also known as the contingency table, between symbols a, b, and c.
Fig. 17 There is a DIPR Turbocharger for every feature classifier in the “Detection” subsystem of any DIPR system.

Table 3 Symbol Confusion Matrix for Feature Classifier (Ct(x, y) ≡ Count(x, y))

         Labeled: a                          Labeled: b                          Labeled: c
True: a  Ct(a,a)/[Ct(a,a)+Ct(a,b)+Ct(a,c)]   Ct(a,b)/[Ct(a,a)+Ct(a,b)+Ct(a,c)]   Ct(a,c)/[Ct(a,a)+Ct(a,b)+Ct(a,c)]
True: b  Ct(b,a)/[Ct(b,a)+Ct(b,b)+Ct(b,c)]   Ct(b,b)/[Ct(b,a)+Ct(b,b)+Ct(b,c)]   Ct(b,c)/[Ct(b,a)+Ct(b,b)+Ct(b,c)]
True: c  Ct(c,a)/[Ct(c,a)+Ct(c,b)+Ct(c,c)]   Ct(c,b)/[Ct(c,a)+Ct(c,b)+Ct(c,c)]   Ct(c,c)/[Ct(c,a)+Ct(c,b)+Ct(c,c)]
As discussed in Sec. 2.3, the cost for substituting a mislabeled feature a with the true feature b is:

C_{S(a,b)} = 10 \log_{10} \left( \frac{1}{P(a|b)} \right)    (9)

where P(a|b) = Count(a,b) / [Count(a,a) + Count(a,b) + Count(a,c)].
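A small sketch of the turbocharging idea, assuming the confusion counts of Table 3 are known a priori: the noisy label sequence produced by the low-level feature classifier is collapsed to the single symbol that explains the sequence at minimum total substitution cost. The counts and the example sequence are hypothetical.

```python
import math

# Hypothetical confusion counts Count(true, labeled) for the low-level feature classifier.
COUNT = {("a", "a"): 80, ("a", "b"): 15, ("a", "c"): 5,
         ("b", "a"): 10, ("b", "b"): 85, ("b", "c"): 5,
         ("c", "a"): 5,  ("c", "b"): 5,  ("c", "c"): 90}
SYMBOLS = ("a", "b", "c")

def p_conf(true_sym, labeled_sym):
    """Row-normalized entry of the confusion matrix, as in Table 3."""
    row_total = sum(COUNT[(true_sym, s)] for s in SYMBOLS)
    return COUNT[(true_sym, labeled_sym)] / row_total

def sub_cost(true_sym, labeled_sym):
    """Eq. (9)-style cost; mislabelings the classifier often makes are cheap to undo."""
    if true_sym == labeled_sym:
        return 0.0
    return 10.0 * math.log10(1.0 / p_conf(true_sym, labeled_sym))

def turbocharge(noisy_labels):
    """Collapse a noisy label sequence to the single most plausible feature symbol."""
    totals = {t: sum(sub_cost(t, obs) for obs in noisy_labels) for t in SYMBOLS}
    return min(totals, key=totals.get), totals

print(turbocharge(["a", "a", "b", "a", "c", "a"]))
```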
5 Advanced DIPR Systems Engineering
This section describes two advanced systems engineering techniques for the DIPR system.
5.1 Advanced Fusion Methods Using the DIPR System
Traditional fusing takes place in the “Identification” subsystem, when intelligence rules combine (fuse) features into an intelligent state (symbol). Normally, any fusing relation needed can be accomplished through a unique combining of various features into intelligent states (symbols). Additionally, advanced fusing can take place by fusing any feature set with any combination of intelligent states (symbols), behavior labels, inferred predicted outcomes, or actions. In the advanced fusion case, feedback from upstream DIPR subsystem outputs could form additional symbols, behavior labels, or inferred predicted outcomes. An example could be an AI system calling for an action that may not be wise when fused with lower level features, and that therefore needs to be confirmed or denied before the recommended action is carried out. This feedback in advanced fusion techniques implies an iteration in the DIPR system flow. Allowing such iterations would be part of the design of the Application Model.
5.2 Adaptive Learning in the DIPR System
There are various types of adaptive learning in the DIPR system. This section focuses on behavior learning. Other learning could target other DIPR subsystem outputs or rules. There are three behavior learning modes in DIPR system behavior learning: desired behavior training mode, known-behavior updating mode, and unknown-behavior learning mode. These behavior learning modes respectively (1) train the DIPR system to learn a specified behavior, (2) modify the known behaviors in response to classifying a sequence with the wrong known behavior label, and (3) learn a never-before-seen behavior based on observed sequences which do not fall into any of the known behaviors. Each behavior learning mode uses its own type of Adaptive Learning Module, which may combine both supervised and unsupervised learning techniques [15]. The structure of adaptive learning in the DIPR system is introduced in this section and depicted in Fig. 18. The first behavior learning mode arises when there is a need to program a new desired behavior into the DIPR system, where the syntax rules of the desired behavior are unknown. In other words, the sequence patterns of the intelligent states (symbols) allowed for the new desired behavior are unknown. This particular behavior learning mode, called desired behavior training mode, uses the Desired-Behavior adaptive DIPR learning module. The module is divided into two sub-cases: (a) assuming the intelligent states (symbols) are known, and (b) along with the known intelligent
Fig. 18 Adaptive learning in the DIPR system is structured in a hierarchy. This figure displays behavior learning as discussed in this section. Each behavior learning mode uses its own adaptive learning module. The second level of this hierarchy could include other learning for learning other DIPR subsystem rules and outputs.
states (symbols), having the flexibility to learn new intelligence rules for new intelligent states (symbols) if needed. For the first sub-case, supervised learning is used to track the patterns of the known intelligent states (symbols) for observed training sequences tagged with the desired behavior's label. For the second sub-case (lowered to the “Detection” subsystem output), the complete temporal (feature,space) matrix corresponding to the observed sequences tagged with the desired behavior's label must be used for unsupervised learning of the new intelligent states (symbols). Using the sequence's corresponding temporal (feature,space) matrix, which was stored in the “Identification” subsystem, new rules for combining temporal features in space to create new intelligent states (symbols) are created. New intelligent states (symbols) are created especially when there are time intervals where all of the known intelligent states (symbols) were identified as the null state (symbol) during the “Identification” subsystem. If, after new intelligent states (symbols) are created, this desired behavior does not seem to be repeatably classified on similar observation sequences, there is a trial-and-error process for creating new symbols that would ensure repeatability in classifying the desired behavior. If the trial-and-error process completes and the desired behavior has still failed to be learned, new feature classifiers in “Detection” must be added to detect new types of features with the same sensors, or even new sensor types in similar or more spaces, with the goal of aptly capturing the environment's states in time and space. The second behavior learning mode (known-behavior updating mode) occurs when the user discovers an error in the behavior-label classification of an observed sequence of intelligent states (symbols) in the “Prediction” subsystem. In this occurrence, there are two possible cases: (a) behavior-label misclassification between two known behaviors, and (b) misclassifying the observed sequence as one of the
known behaviors when it should have been classified with the unknown behavior label. The first case calls for the known-behavior updating mode, which uses the Mislabeled-Behavior adaptive DIPR learning module, made up of both supervised and unsupervised learning. The latter case is an example of unknown-behavior learning, which is defined below. The Mislabeled-Behavior adaptive DIPR learning module can be divided into two sub-cases, just as in the desired behavior training mode, depending on whether new rules must be learned for new intelligent states (symbols) in order to modify the known behaviors. Assuming it is sufficient to modify the known behaviors without creating new intelligent states (symbols), the Mislabeled-Behavior adaptive DIPR learning module updates the behaviors' syntax rules' substitution costs for the appropriate intelligent states (symbols) which caused the misclassification between the two behaviors; this should ensure the same misclassification between the two behaviors will not happen again for the same observation sequence(s) of intelligent states (symbols). If updating the costs is not sufficient for ensuring the two behaviors will not mislabel similar observation sequences, then a similar trial-and-error process of learning new intelligence rules for intelligent states (symbols) must occur. Ultimately, if this trial-and-error process does not work, the solution may be to add features (and thus more feature classifiers, sensors in more spaces, and maybe more sensor types) to the temporal (feature,space) matrix to better capture the distinction between the two behaviors in question. The third behavior learning mode (in Fig. 18) occurs when the DIPR system classifies a sequence of intelligent states (symbols) with the unknown behavior label. This particular mode, termed unknown-behavior learning mode, uses the Unknown-Behavior adaptive learning module and is the first mode mentioned which solely uses unsupervised learning. In this module, there is storage of all previously observed sequences of intelligent states (symbols) that were labeled unknown, along with the corresponding temporal (feature,space) matrices for those sequences. This module correlates these stored temporal (feature,space) matrices with the current observation sequence labeled unknown, with the objective of learning a new behavior. This new behavior is discovered using unsupervised learning. In the first level of learning, the module correlates the current observed sequence labeled unknown with all of the stored sequences labeled unknown at the level of intelligent states (symbols). If no new behaviors are learned at this level, the module correlates the sequences based on the sequences' temporal (feature,space) matrices. If no common behaviors are learned, then the current observed sequence labeled unknown, along with its corresponding temporal (feature,space) matrix, is stored.
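The unknown-behavior learning mode can be sketched, very loosely, as storing unknown sequences and proposing a new behavior once enough mutually similar ones have accumulated; the similarity measure (difflib's ratio) and the promotion threshold below are assumptions, standing in for the correlation described above.

```python
import difflib

UNKNOWN_STORE = []          # previously observed sequences labeled "unknown"

def similar(seq_a, seq_b, thresh=0.8):
    """Assumed similarity test between two symbol sequences."""
    return difflib.SequenceMatcher(None, seq_a, seq_b).ratio() >= thresh

def observe_unknown(seq, min_cluster=3):
    """Store an unknown sequence; if enough similar ones exist, propose a new behavior."""
    cluster = [s for s in UNKNOWN_STORE if similar(s, seq)]
    UNKNOWN_STORE.append(seq)
    if len(cluster) + 1 >= min_cluster:
        return "new behavior proposed from %d similar sequences" % (len(cluster) + 1)
    return "stored for later correlation"

for s in ["aabba", "aabba", "aabca", "ccccc"]:
    print(observe_unknown(s))
```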
6 Summary
The ability to automate the modeling of human behavior systems in different applications, and additionally to fuse sensor features as well as behaviors, has been demonstrated in the DIPR system and shown in various Application Models. The artificial
intelligence problem has been broken into four distinct subsystems: Detection, Identification, Prediction, and Reaction (DIPR), to allow engineering and design of each subsystem at a much lower technical level than the total system. These subsystem designs produce engineering subsets that can be taught and designed in a formal structure, for ease of an orderly design of a total AI Systems Solution. Application engineering is normal for any engineering technology or product, and it was shown here in applying the DIPR system to the applications of AI Decision Making, Abnormal Behavior Detection, Enabling “Smart Environments”, “Ambient Intelligence” (learning), and the special application of a DIPR system inside another DIPR system for “Turbocharging” (enhancing detection classifier performance statistics). The real-world requirement of fusing sensor features, as well as fusing any of the DIPR subsystem outputs, was demonstrated both in the traditional requirements of fusing and in advanced forms of fusing. Learning intelligence in the DIPR system was demonstrated in the Application Models, along with an outline of future methods of adaptive learning in DIPR systems. The distribution of the DIPR subsystems across various hardware/software architectures has been shown, with guidelines and design rules to implement various net-centric systems. Therefore, the overall AI Systems Solution has been delineated in levels of technology system segments (the levels of Fig. 1), as well as through the breaking down of the AI problem into bite-size standard DIPR subsystems for modeling.
References
[1] Goshorn, R.: Sequential Behavior Classification Using Augmented Grammars. M.S. thesis, UCSD (2001)
[2] Goshorn, R.: Syntactical Classification of Extracted Sequential Spectral Features Adapted to Priming Selected Interference Cancelers. Ph.D. thesis, UCSD (2005)
[3] Sipser, M.: Theory of Computation. PWS Publishing Co., MA (1997)
[4] Goshorn, R., Goshorn, J., Goshorn, D., Aghajan, H.: Architecture for Cluster-based Automated Surveillance Network for Detecting and Tracking Multiple Persons. International Conference on Distributed Smart Cameras, ICDSC (2007)
[5] Goshorn, R., Goshorn, D., Goshorn, J., Goshorn, L.: Abnormal Behavior Detection Using Sequential Syntactical Classification in a Network of Clustered Cameras. International Conference on Distributed Smart Cameras, ICDSC (2008)
[6] Goshorn, D.: The Enhancement of Low-Level Classifications in Sequential Syntactic High-Level Classifiers. PhD Research Exam, Computer Science, UCSD (2008)
[7] Goshorn, R., Goshorn, D.: Vision-Based Syntactical Classification of Hand Gestures to Enable Robust Human Computer Interaction. Workshop on AI Techniques for Ambient Intelligence, co-located with the European Conference on Ambient Intelligence, ECAI (2008)
[8] Goshorn, R., Goshorn, D., Kölsch, M.: The Enhancement of Low-level Classifications for Ambient Assisted Living. Behavioral Monitoring and Interpretation Workshop at the German Conference on Artificial Intelligence, KI (2008)
[9] Kölsch, M.: Vision Based Hand Gesture Interfaces for Wearable Computing and Virtual Environments. Ph.D. thesis, UCSB (2004)
[10] Kölsch, M., Turk, M.: Hand Tracking with Flocks of Features. Conference on Computer Vision and Pattern Recognition, CVPR (2005)
[11] Haykin, S.: Adaptive Filter Theory, 4th ed. Pearson Higher Education (2001) ISBN 0130901261
[12] Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd ed. Prentice Hall (2008) ISBN 9780131687288
[13] Proakis, J.G.: Digital Communications, 4th ed. McGraw-Hill Companies (2001) ISBN 0072321113
[14] Cant-paz, E., Kamath, R.: International Conference on Simulation of Adaptive Behavior (1993)
[15] Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley and Sons (2001)
[16] Hengstler, S., Aghajan, H.: Application Development in Vision-Enabled Wireless Sensor Networks. International Conference on Systems and Networks Communications, ICSNC (2006)
[17] Chen, et al.: A Low-Bandwidth Wireless Camera Network Platform. Tech. Report, UC Berkeley (2008)
[18] Akyildiz, I., Su, W., Sankarasubramaniam, Y., Cayirci, E.: A survey on sensor networks. IEEE Communications Magazine, 40(8) (2002)
[19] Texas Instruments: TMS320DM647/TMS320DM648 Digital Media Processor. Datasheet Manual (2008)
[20] Marvell Technology Group: PX320 Processor Series. Datasheet Manual (2008)
[21] Iyengar, S.S., Brooks, R.R.: Distributed Sensor Networks. CRC Press (2005) ISBN 1584883839
Part VI
Multi-Agents
Multi-Agent Social Simulation
Itsuki Noda, Peter Stone, Tomohisa Yamashita and Koichi Kurumatani
1 Introduction
While ambient intelligence and smart environments (AISE) technologies are expected to have a large impact on human lives and social activities, it is generally difficult to show the utility and effects of these technologies on societies. AISE technologies are not only methods to improve the performance and functionality of existing services in a society, but also frameworks to introduce new systems and services to it. For example, no one expected beforehand what the Internet or the mobile phone would bring to our social activities and services, although they changed our social systems and patterns of behavior drastically and gave rise to new services (and, unfortunately, new risks). The main reason for this difficulty is that the actual effects of IT systems appear only when enough people in the society use the technologies. Multiagent social simulation (MASS) is one solution to the above problem. In a MASS, a number of intelligent agents, each of whom represents an individual human or a group of individuals and has its own knowledge and preferences, interact with each other. Similar to numerical simulations of physical phenomena, the benefit of MASS is the capability to test various situations of societies and assorted combinations of technologies. Using this capability, MASS provides the following ways to evaluate new technologies:
• How the new technologies can or cannot solve the existing problems of societies.
• What kind of conditions must hold for the technologies to bring more benefits than the costs of implementing them in the societies.
• How societies and human behaviors can be changed by the technologies.
Itsuki Noda, Tomohisa Yamashita and Koichi Kurumatani
AIST, Japan
Peter Stone
Department of Computer Sciences, The University of Texas at Austin, USA
In this chapter, we focus on traffic and mass navigation systems and services in urban areas to demonstrate these utilities through simulation. They are good examples of ambient intelligence and smart environments bringing new styles of systems and services that may take the place of existing ones, because these technologies enable flexible control of mass transportation according to various demands and situations.
2 Issues on Multiagent Social Simulation
Social simulation is a major challenge for scientific and engineering methods of investigating and designing social systems, because social phenomena are complex and noisy, and it is hard to develop precise models and frameworks for them. The difficulty is typically illustrated by phenomena affected by human decision-making, such as traffic congestion. For such human-related phenomena, we need to realize multiagent social simulation (MASS), in which multiple agents, each capable of making decisions independently with some intelligence, interact with each other. Casti investigated how computational methods can be applied to analyze the behaviors of complex systems like societies [4, 9]. In his investigation, he summarizes five properties that can be used to assess models and simulations for complex systems (the order is changed from the original for the sake of later discussion):
• Clarity: The ease with which one can understand the model and the predictions/explanations it offers for real-world phenomena.
• Simplicity: The level of completeness of the model, in terms of how small the model is in things like the number of variables, the complexity of interconnections among sub-systems, and the number of ad-hoc hypotheses assumed.
• Bias-free: The degree to which the model is free of prejudices of the modeler that have nothing to do with the purported focus or purpose of the model.
• Fidelity: The model's ability to represent reality to the degree necessary to answer the questions for which the model was constructed.
• Tractability: The level of computing resources needed to obtain the predictions and/or explanations offered by the model.
For clarity, it is important to show, with meaningful examples, the possibility of applying new technologies to real-world problems. As mentioned in section 1, a new technology such as AISE may bring about or require total changes of social systems. Therefore, it is generally hard to show the advantages of the technology using small-scale experiments in real society. As a solution to this issue, MASS must show clear benefits of applying AISE using easily understandable examples. Section 3 describes a typical successful example showing such a benefit of a novel traffic control enabled by AISE.
For simplicity, it is important to focus on comparable features of the whole phenomenon. One of the important goals of MASS is to compare the performance of multiple candidate social systems relatively, especially that of newly proposed and traditional systems. Because a social phenomenon consists of many features that are related to each other, we must carefully choose certain features as a measure of performance and discard or reduce other variable features in order to simplify the comparison of the systems. The works described in section 3 and section 4 are successful examples of choosing simple measures to show the benefit of AISE.
For the bias-free property, we utilize optimization to figure out structural features of problems in social systems. It may be difficult to guarantee that an actual result value of a naive MASS is bias-free, because the setup of such a naive MASS includes a large number of parameters. Instead of investigating all cases of the setups, we can focus on structural features of the problem by optimizing a certain criterion over various setups. Through the optimization, we can reduce the number of free parameters to be checked and, as a result, analyze the bias-free structural features. Similarly to Nash's equilibrium-based analysis in game theory, this approach using optimization is a way to investigate social systems in which people and systems act rationally to improve the criteria. We show an example of this approach in section 4.
For fidelity and tractability, the framework for collecting real data is a key issue. Unlike physical simulations, it is generally hard for social simulations to obtain real data for calibration and validation of the MASS. In particular, phenomena like traffic and crowd movements, which consist of the varied behaviors of many people, should be considered not only as a collection of individual behaviors but also as an integration of interactions among people. Therefore, we need to obtain detailed data about these phenomena. Section 5 describes an attempt to acquire and utilize measurements of human behavior in towns for MASS.
3 Autonomous Vehicles at Intersections Just as the internet and mobile phones have already had profound impacts on social activities and services, the introduction of autonomous vehicles in the near future is going to drastically alter society. This introduction is imminent because 1) the potential benefits are enormous and 2) the technology is ready. The main limiting factor at this point is economic feasibility. Once autonomous vehicles are prevalent on the roads, there will be opportunities to alter traffic protocols to take advantage of their capabilities. This section illustrates the use of multiagent simulation as a way to validate one such alteration, in particular with regards to intersection traversal. The potential benefits of autonomous vehicles are numerous. First, autonomous vehicles can result in improved safety by eliminating the 95% of vehicular crashes which are caused by human errors [19]. Second, they will improve the mobility of the entire population including old people and young children who are otherwise dependent on other people for their basic travel needs. Third, autonomous vehicles can improve the entire driving experience by enabling people to concentrate on other things while driving: the stress of driving during peak-period and stop and go traffic will disappear. Fourth, autonomous vehicles provide the potential to double or triple road capacity by reducing the space headway between vehicles and by enabling ve-
hicles to operate at a constant velocity. Fifth, autonomous vehicles will move at a relatively constant velocity, resulting in reduced emissions and improved fuel efficiency due to reduction of stop and go traffic. Sixth, from a planning perspective, current traffic management decisions have to take into account the user behavior or the response of drivers to various measures, which is difficult to predict. Traffic management decisions made for autonomous vehicles can result in improved system performance as the compliance of the vehicles to these instructions is expected to be higher and faster than human responses: vehicular responses to instructions will be much easier to predict than human choices. Thus once technologically feasible, there will be many reasons for people to start taking advantage of autonomous vehicles. While the point of technological feasibility may seem to be a long way off, advances in artificial intelligence, and more specifically, Intelligent Transportation Systems (ITS) [3], suggest that it may soon be a reality. Cars can already be equipped with features of autonomy such as adaptive cruise control, GPS-based route planning [14, 16], and autonomous steering [12, 13]. Some current production vehicles even sport these features. DaimlerBenz’s Mercedes-Benz S-Class has an adaptive cruise control system that can maintain a safe following distance from the car in front of it, and will apply extra braking power if it determines that the driver is not braking hard enough. Both Toyota and BMW are currently selling vehicles that can parallel park completely autonomously, even finding a space in which to park without driver input. GM has announced plans to use a video camera, lasers, and a lot of processing power to be able to identify traffic signs, curves in the street, lane markings, and other vehicles. By the end of the decade, GM hopes to incorporate the system into production vehicles. Once these autonomous vehicles are on the road, it raises the question of how best to take advantage of their capabilities in order to make automobile travel safer and faster. This section summarizes one answer to this question that relies critically on multiagent simulation. Specifically, it summarizes a protocol for intersection traversal based on a reservation paradigm, in which vehicles “call ahead” to reserve space-time in the intersection. In brief, the protocol proceeds as follows. 1. An approaching vehicle announces its impending arrival to an "intersection manager” residing at the intersection. The vehicle indicates its size, predicted arrival time, velocity, acceleration, and arrival and departure lanes. 2. The intersection manager simulates the vehicle’s path through the intersection, checking for conflicts with the paths of any previously processed vehicles. 3. If there are no conflicts, the intersection manager issues a reservation and it becomes the vehicle’s responsibility to arrive on time and travel through the intersection as specified (within a range of error tolerance). 4. In the case of a conflict, the intersection manager suggests an alternate later reservation. 5. The car may only enter the intersection once it has successfully obtained a reservation.
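To make the message exchange concrete, here is a minimal sketch of the driver-side half of this protocol in Python. It is only a sketch under assumed names: ReservationRequest, ReservationReply, and the retry loop are illustrative and are not the interfaces of the implementation described in [7].

```python
# Illustrative sketch of the reservation exchange; all names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationRequest:
    vehicle_id: int
    arrival_time: float      # predicted arrival time at the intersection (s)
    arrival_velocity: float  # m/s
    size: float              # vehicle length (m); a real request also carries width, acceleration, lanes
    arrival_lane: int
    departure_lane: int

@dataclass
class ReservationReply:
    confirmed: bool
    slot_time: Optional[float] = None      # granted entry time, if confirmed
    counter_offer: Optional[float] = None  # suggested later arrival time, if rejected

class DriverAgent:
    """Driver-side half of the protocol: request, accept counter-offers, retry."""
    def __init__(self, intersection_manager):
        self.manager = intersection_manager
        self.reservation: Optional[float] = None

    def request_until_confirmed(self, request: ReservationRequest) -> float:
        # The vehicle may only enter once a reservation has been confirmed (step 5).
        while self.reservation is None:
            reply = self.manager.process(request)
            if reply.confirmed:
                self.reservation = reply.slot_time
            else:
                request.arrival_time = reply.counter_offer  # try the suggested later slot
        return self.reservation

class AcceptingManager:
    """Stand-in manager that grants every request at its stated arrival time."""
    def process(self, request: ReservationRequest) -> ReservationReply:
        return ReservationReply(confirmed=True, slot_time=request.arrival_time)

if __name__ == "__main__":
    agent = DriverAgent(AcceptingManager())
    req = ReservationRequest(vehicle_id=1, arrival_time=12.0, arrival_velocity=15.0,
                             size=4.5, arrival_lane=0, departure_lane=2)
    print("granted entry at t =", agent.request_until_confirmed(req))
```

A real intersection manager would answer process() by simulating the requested trajectory over its reservation tiles, as described in Section 3.1.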
Though this protocol has huge potential benefits, deploying it immediately in the real world would be far too risky. Huge investments would be required both in infrastructure and in individual vehicles. And of course safety of passengers is always a concern. Thus it is essential to validate the protocol first in simulation. The remainder of this section motivates and describes in detail our protocol for intersection traversal, and summarizes the results from detailed multiagent simulations. The results are encouraging enough to suggest that the time is right to validate the protocol further on real robots, and eventually on full-size automobiles.
3.1 The Intersection Management Protocol Replacing modern intersection control with a robust, autonomous control framework is a complex, multi-part problem. In order to establish a set of criteria by which to judge such a framework, here we enumerate a set of desiderata, which simultaneously provides a description of the problem at hand as well as both qualitative and quantitative metrics for judging solutions to that problem. Autonomy: Each vehicle should be fully autonomous. Were the entire mechanism centrally controlled, it would be susceptible to single point failure, require massive amounts of computational power, and exert unnecessary control over vehicles in situations where they are perfectly capable of controlling themselves. Low Communication Complexity: By keeping the number of messages and amount of information transmitted to a minimum, the system can afford to put more communication reliability measures in place. Furthermore, each vehicle, as an autonomous agent, may have privacy concerns which should be respected. Keeping the communication complexity low will also make the system more scalable. Sensor Model Realism: Each agent should have access only to sensors that are available with current or foreseeable technology. The mechanism should not rely on fictional sensor technology that may never materialize. Communication Protocol Standardization: The mechanism should employ a simple, standardized protocol for communication between agents. Without a standardized protocol, each agent would need to understand the internal workings of every agent with which it interacted. An open, standardized protocol is needed to make adoption of the system feasible for vehicle manufacturers. Deadlock/Starvation Avoidance: Deadlocks and starvation should not occur in the system. That is, any vehicle approaching an intersection should eventually make it through, even if it is better for the rest of the agents to leave that vehicle stranded. Incremental Deployability: The system should be incrementally deployable in two senses. First, it should be possible to set up some intersections to use the system, and then slowly expand to other intersections. Second, the system should function even with few autonomous vehicles (mostly human-controlled vehicles). At any stage of deployment, be it an increase in the proportion of autonomous vehicles or number of equipped intersections, overall performance of
the system should improve, and there should be a benefit to early adopters. At no point should there exist a net disincentive to continue deploying the system. Safety: Except for gross vehicle malfunction or extraordinary circumstances (natural disasters, etc.), as long as they follow the protocol, vehicles should never collide in the intersection. Note that no stronger guarantee is possible — as with modern mechanisms, a suicidal human driver can always steer a vehicle into oncoming traffic. Furthermore, the system should be safe in the event of total communication failure. If messages are dropped or corrupted, the safety of the system should not be compromised. It is impossible to prevent all negative effects due to communication failures, but those negative effects should be limited to efficiency. If a message gets dropped, it can make someone arrive 10 seconds later at their destination, but it should not cause a collision. Efficiency: Vehicles should get across the intersection and on their way in as little time as possible. To quantify efficiency, we introduce delay, defined as the amount of additional travel time incurred by the vehicle as the result of passing through the intersection. Note that, as argued in Section (g), average vehicle delay is also a convenient and informative indicator of overall system performance, specifically in terms of congestion and, more indirectly, throughput. Of the above desiderata, modern-day traffic lights and stop signs completely satisfy all but the last one. While many accidents take place at intersections governed by traffic lights, these accidents are rarely, if ever, the fault of the traffic light system itself, but rather that of the human drivers. However, traffic lights and stop signs are terribly inefficient. Not only do vehicles traversing intersections equipped with these mechanisms experience large delays, but the intersections themselves can only manage a limited traffic capacity — much less than that of the roads that feed into them. The aim of this project is thus to create an intersection control mechanism that exceeds the efficiency of traffic lights and stop signs without sacrificing any of the other properties listed above. With our desiderata in mind, we have introduced a novel approach to getting vehicles through intersections more efficiently that is a radical departure from existing traffic signal optimization schemes [7]. We assume that computer programs called driver agents control the vehicles, while an arbiter agent called an intersection manager is placed at each intersection. The driver agents “call ahead” and attempt to reserve a block of space-time in the intersection. The intersection manager decides whether to grant or reject requested reservations according to an intersection control policy. Figure 1 shows a sample interaction between a driver agent and an intersection manager. The system functions analogously to a human attempting to make a reservation at a hotel — the potential guest specifies when he or she will be arriving, how much space is required, and how long the stay would be; the human reservation agent determines whether or not to grant the reservation, according to the hotel’s reservation policy. Just as the guest does not need to understand the hotel’s decision process, the driver agents should not require any knowledge of the policy the intersection manager uses to make its decision. 
In addition to confirming the request made by the driver agent, the intersection manager may also respond with counter-offers.
Fig. 1 One of the driver agents attempts to make a reservation. The intersection manager responds based on the decision of an intersection control policy.
Our prototype intersection control policy divides the intersection into an n × n grid of reservation tiles, where n is referred to as the granularity of the policy. When a vehicle approaches the intersection, it transmits a reservation request, which includes its time of arrival, velocity of arrival, and vehicle characteristics such as size and acceleration capabilities, to the intersection manager. The intersection manager then passes this information to the policy, which simulates the journey of the vehicle across the intersection according to the parameters. At each time step of the simulation, the policy determines which reservation tiles will be occupied by the vehicle. If at any time during the simulation the requesting vehicle occupies a reservation tile that is already reserved by another vehicle, the policy rejects the driver’s reservation request, and the intersection manager communicates this to the driver agent. Otherwise, the policy accepts the reservation and reserves the appropriate tiles. The intersection manager then sends a confirmation to the driver. This control policy allows for much finer-grained coordination than traffic lights: cars traveling in all directions can proceed through the intersection with minimal delay. A graphical depiction of the concept can be seen in Figure 2. A video demonstration1 shows cars coming from all directions barely slowing down as they traverse the intersection and crossing closely in front of each other, yet with no collisions. Experiments in simulation indicate that our proposed reservation system may be able to reduce the time spent waiting at intersections by up to two orders of magnitude.
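The tile-checking logic can be sketched as follows. This is a simplified illustration only: it assumes a square intersection, a constant-velocity straight-line crossing, and a square vehicle footprint, none of which is how the policy in [7] models trajectories, and all class and parameter names are invented.

```python
# Simplified sketch of an n x n tile reservation policy (illustrative only).
class TileReservationPolicy:
    def __init__(self, granularity: int, side_length: float, time_step: float = 0.1):
        self.n = granularity            # the grid is granularity x granularity tiles
        self.side = side_length         # physical side length of the intersection (m)
        self.dt = time_step
        self.reserved = {}              # (time_index, row, col) -> vehicle_id

    def _tiles_at(self, x: float, y: float, size: float):
        """Tiles overlapped by a square vehicle footprint centred at (x, y)."""
        half, tile = size / 2.0, self.side / self.n
        for row in range(self.n):
            for col in range(self.n):
                cx, cy = (col + 0.5) * tile, (row + 0.5) * tile
                if abs(cx - x) <= half + tile / 2 and abs(cy - y) <= half + tile / 2:
                    yield row, col

    def request(self, vehicle_id, arrival_time, velocity, size, lane_y) -> bool:
        """Simulate a straight, constant-velocity crossing; reserve tiles if all are free."""
        needed, t, x = [], arrival_time, 0.0
        while x <= self.side:                   # until the vehicle clears the far side
            k = int(round(t / self.dt))         # discretised time index
            for row, col in self._tiles_at(x, lane_y, size):
                if self.reserved.get((k, row, col), vehicle_id) != vehicle_id:
                    return False                # conflict: reject (a later slot may be offered)
                needed.append((k, row, col))
            t += self.dt
            x += velocity * self.dt             # velocity is assumed positive
        for key in needed:                      # no conflicts: commit the reservation
            self.reserved[key] = vehicle_id
        return True

# Example: two vehicles in different lanes cross at the same time without conflict.
policy = TileReservationPolicy(granularity=24, side_length=24.0)
print(policy.request(1, arrival_time=0.0, velocity=10.0, size=4.0, lane_y=4.0))   # True
print(policy.request(2, arrival_time=0.0, velocity=10.0, size=4.0, lane_y=20.0))  # True
print(policy.request(3, arrival_time=0.0, velocity=10.0, size=4.0, lane_y=4.5))   # False (overlaps vehicle 1)
```

A granularity of 24 matches the grid size used in the experiments reported below.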
Fig. 2 (a) Successful, (b) Rejected. In (a), the vehicle's request is accepted and it reserves a set of tiles at time t. In (b), the black vehicle's request is rejected because, during the simulation of its trajectory, the policy determines that it requires a tile already reserved by the previous vehicle at time t.
1 Available online at http://www.cs.utexas.edu/~kdresner/aim/?p=video
3.2 Main Result Experiments in simulation are important for fine-tuning the system and for determining whether or not it is worth the investment to deploy the system in the real world. Here we present the main result demonstrating that our proposed reservation system can dramatically reduce delay in simulation when compared to traditional intersection control mechanisms. As indicated in Section 2, in order to study the protocol described above, we needed to carefully choose our simulation model. In particular, we needed a high fidelity simulator to model fine-grained car movement, but one that’s simple enough to enable us to tractably perform sufficient experiments to gain confidence with the model. Unfortunately, no existing simulator met all of our criteria, so we developed our own simulation of cars traversing a single intersection [7]. Figure 3 shows a screen capture of the simulator’s graphical display.
Fig. 3 Snapshot of the simulator.
In all of the experiments, our fully-implemented custom microsimulator simulates 3 lanes in each of the 4 cardinal directions. The speed limit in all lanes is 25
meters per second. Every configuration shown is run for at least 100,000 steps in the simulator, which corresponds to approximately half an hour of simulated time. Vehicles that are spawned turn with probability 0.1, and turning vehicles turn left or right with equal probability. Vehicles turning right are spawned in the right lane, whereas vehicles turning left are spawned in the left lane. Vehicles that are not turning are distributed probabilistically amongst the lanes such that the traffic in each lane is as equal as possible. The reservation system and the stop sign both have a granularity of 24. The results for the experiments are shown in Figure 4. The reservation system performs very well, nearly matching the performance of the optimal policy which represents a lower bound on delay should there be no other cars on the road.2 At higher levels of traffic, the average delay for a vehicle gets as high as 0.35 seconds, but is never more than 1 second above optimal. Under none of the tested conditions does the reservation system approach the trip times of the traffic light system, which is shown as the best empirical performance over a range of cycle times that were tested.
Fig. 4 Trip times for varying amounts of traffic for the reservation system, the stop sign, an optimized traffic light, and the optimal lower bound. Delay in seconds is plotted against traffic level in vehicles per hour (0 to 9000).
The stop sign does not perform as well as the reservation system, but for low amounts of traffic, it still performs fairly well, with average delay only about 3 seconds greater than optimal. However, as the traffic level increases, performance degrades. These experiments, presented in full detail in [7], demonstrate that the reservation system can dramatically reduce per-car delay in comparison to traffic lights and stop signs. They also demonstrate the full functionality of our intersection microsimulator.
2 Optimal is greater than 0 because of the time it takes to slow down for a turn.
3.3 Extensions A key feature of this reservation paradigm is that it relies only on vehicle to infrastructure (V2I) communication , via digital short range communication (DSRC) . In particular, the vehicles need not know anything about each other beyond what is needed for local control (e.g. to avoid running into the car in front). The paradigm is also completely robust to disruptions in network communication: if a message is dropped, either by the intersection manager or by the vehicle, delays may increase, but safety is not compromised. Nonetheless, prior to implementing it on real vehicles, it is important to study as many aspects of the system as possible. Here we briefly summarize extensions to the simulation and empirical analysis enabling the integration of human drivers into the system; learning policy selection; enabling priority to be given to specific vehicles such as ambulances and fire trucks; a fully decentralized version appropriate for lowtraffic intersections; and a detailed analysis of the impact of catastrophic failures. While an intersection control mechanism for autonomous vehicles will someday be very useful, there will always be people who enjoy driving. Additionally, there will be a fairly long transitional period between the current situation (all human drivers) and one in which human drivers are a rarity. Even if switching to a system comprised solely of autonomous vehicles were possible, pedestrians and cyclists must also be able to traverse intersections in a controlled and safe manner. For this reason, it is necessary to create intersection control policies that are aware of and able to accommodate humans, whether they are on a bicycle, walking to the corner store, or driving a “classic” car for entertainment purposes. With that in mind, we extended the reservation framework to incorporate human drivers [7]. In order to accommodate human drivers, a control policy must be able to direct both human and autonomous vehicles, while coordinating them, despite having much less control and information regarding where and when the human drivers will be. The main concept behind our extension is the assumption that there is a human-driven vehicle anywhere one could be. While this may be less efficient than an approach which attempts to detect or to more precisely model human behavior, it is guaranteed to be safe, one of the desiderata on which we are unwilling to compromise. Adding pedestrians and cyclists follows naturally, and we give brief descriptions of how this would differ from the extensions for human drivers. A reliable method of communicating with human drivers is a prerequisite for including them in the system. The simplest and best solution is to use something human drivers already know and understand — traffic lights. Traffic light infrastructure is already present at many intersections and the engineering and manufacturing of traffic light systems is well developed.
In order to obtain some of the benefits of the reservation system while still accommodating human drivers, an intersection manager needs to do two things: 1. If a light is green, ensure that it is safe for any vehicle (autonomous or humandriven) to drive through the intersection in the lane the light regulates. 2. Grant reservations to driver agents whenever possible, allowing autonomous vehicles to move through an intersection where human drivers cannot — similar to a “right on red”, but extended much further to other safe situations. Specifically, the extended reservation policy, designed to work with a mixture of autonomous and human drivers, works as follows and as depicted in Figure 5. • Upon receiving a request message, the intersection manager uses the parameters in the message to establish when the vehicle will arrive at the intersection. • If the light controlling the lane in which the vehicle will arrive at the intersection will be green at that time, the reservation is confirmed. • If the light controlling the lane will be yellow, the reservation is rejected. • If the light controlling the lane will be red, the journey of the vehicle is simulated as in the prototype system. • If no required tile is reserved by another vehicle or in use by a lane with a green or yellow light, the policy reserves the tiles and confirms the reservation. Otherwise, the request is rejected.
Fig. 5 A refinement of the reservation system architecture to embrace compatibility with traffic signals.
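As a rough sketch of the decision flow in Figure 5 (the LightModel, handle_request, and reserve_tiles names are invented for illustration; the real policy in [7] is considerably more involved):

```python
# Illustrative sketch of the light-aware reservation policy; names are hypothetical.
GREEN, YELLOW, RED = "green", "yellow", "red"

class LightModel:
    """Toy fixed schedule: lane -> time-ordered list of (start_time, colour)."""
    def __init__(self, schedule):
        self.schedule = schedule

    def colour_at(self, lane, t):
        colour = RED
        for start, c in self.schedule.get(lane, []):
            if start <= t:
                colour = c
        return colour

def handle_request(request, lights, reserve_tiles):
    """Decide on a reservation request from a driver agent."""
    colour = lights.colour_at(request["arrival_lane"], request["arrival_time"])
    if colour == GREEN:
        return "confirm"   # the green phase already guarantees safe passage in that lane
    if colour == YELLOW:
        return "reject"    # too close to the phase change
    # Red light: fall back to tile reservation. A full implementation would also
    # treat tiles reachable from lanes with a green or yellow light as off-limits.
    return "confirm" if reserve_tiles(request) else "reject"

if __name__ == "__main__":
    lights = LightModel({0: [(0.0, RED), (30.0, GREEN)], 1: [(0.0, GREEN)]})
    always_free = lambda req: True   # stand-in for a tile policy like the one sketched earlier
    print(handle_request({"arrival_lane": 0, "arrival_time": 10.0}, lights, always_free))  # confirm (via tiles)
    print(handle_request({"arrival_lane": 1, "arrival_time": 10.0}, lights, always_free))  # confirm (green light)
```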
Note that simply deferring to the reservation system does not guarantee the safety of the vehicle. If the vehicle were granted a reservation that conflicts with a vehicle following the physical lights, a collision could easily ensue. To determine which tiles are in use by the light system at any given time, we associate a set of off-limits tiles with each light. For example, if the light for the northbound left turn lane is green (or yellow), all tiles that could be used by a vehicle turning left from that lane are considered reserved. The length of the yellow light can be adjusted so that vehicles entering the intersection have enough time to clear the intersection before those tiles are no longer off limits. Compatibility with human drivers offers more than the ability to handle the occasional human driver once the levels of human drivers in everyday traffic reaches a
steady state. It will also help facilitate the transition from the current standard — all human-driven vehicles — to this steady state, in which human drivers are scarce. By way of additional simulations, we have shown experimentally that it is possible to define the protocol such that at all points on the technology penetration curve, there are incentives for both communities and private individuals to adopt autonomous vehicle technology[7]. In addition to interoperability with human drivers, we have also shown that emergency vehicles can be given precedence by granting them preferential reservations, and by blocking reservations to all crossing traffic [7]. And we have shown that certain parameters of the system, such as period of the traffic signals when mixing with human drivers, can be learned online as a function of dynamic performance measures [5]. Though the throughput benefits at large intersections warrant the cost of an intersection manager agent, which will be cost effective compared to traffic signals, placing an intersection manager at every intersection would be prohibitively expensive. Just as currently stop signs serve as a lower cost intersection management protocol today, we have also developed a peer-to-peer intersection control mechanism for autonomous vehicles, specifically for low-traffic intersections and requiring no specialized infrastructure [17]. This system complements the system described above just as stop signs complement more expensive traffic light installations. By using the same assumptions about vehicle capabilities, a driver agent can use both systems seamlessly. Finally, before implementing such a radical change in the transportation infrastructure, it’s critical to determine the worst case scenario. As long as cars follow the defined protocol, there is no risk of accidents. But since cars pass so close to each other in the reservation-based system, it is critical to test in simulation what would happen should a car suffer a mechanical failure (such as a flat tire) in the middle of the intersection. With minor extensions to the protocol described above, we have demonstrated that the expected number of cars involved in the worst case can be limited to no more than three or four [6]. Given that the 95% of accidents caused by human error will be eliminated entirely, the resulting system will yield marked improvements in safety.
3.4 Summary
Due to the combination of technological feasibility and enormous societal benefits, autonomous cars are likely to be common in the near future. Once they arrive, there will be a great opportunity to rethink traffic protocols so as to take advantage of their capabilities. However, due to the large investment needed to make such changes, it is crucial to carry out extensive multiagent simulations to verify that the interactions of the vehicles will be as expected, that the benefits will be substantial, and that safety will not be compromised.
As a result of the successful simulations reported in this section, the reservation-based intersection management protocol is ready for development and testing on real robots, and eventually on full-size vehicles.
4 Evaluation of Dial-a-Ride System by Simulation
How to operate bus systems is an important issue for local governments and communities. Most bus systems are operated in a fixed-route style, in which buses run along fixed routes according to fixed timetables. While such systems are effective for users who know the routes and timetables and are happy to follow them, they are inconvenient for users who are unfamiliar with them or who travel to places not served by the bus lines. Because of this characteristic, bus systems tend to lose favor with users, so that service is reduced and the system becomes even more inconvenient.
AISE technologies could provide a way to operate buses more flexibly. Networked mobile devices can collect users' travel demands and the locations of buses in service, and distributed computing resources can assign efficient operation plans to the buses. Using such technologies, we can develop flexible dial-a-ride bus systems in which buses run along dynamic routes according to users' demands. This raises a basic question: if we introduce a flexible dial-a-ride system using new technologies, does it provide more benefit for both users and bus operators at reasonable cost compared to traditional fixed-route systems? Is the new system universally better, or only effective under certain conditions?
In this section, we show a way to evaluate such a new style of public transportation system, as an example of how to demonstrate and appraise the utility of new technologies using simulations. As mentioned in section 1, it is difficult to show such benefits using real experiments, because the systems would first have to be deployed in society. In addition, it is difficult to evaluate the absolute benefit of public transportation systems, because the evaluation involves various psychological and commercial factors, such as the trade-off between usability and fares. In order to avoid these difficulties, we focus on the relative benefits of existing and novel bus systems under the same conditions, and investigate them by computer simulation [10, 11].
4.1 Problem Domain and Formalization
We suppose that a dial-a-ride system is operated as follows. A passenger can raise a demand (a request to be transported from one place to another) to a bus center. The center then assigns the most suitable bus for the demand at that moment. The assigned bus picks up the passenger and transports him/her to the requested destination as quickly as possible. Unlike a normal taxi, multiple passengers may share a bus in the dial-a-ride system. Therefore the route may not be the shortest path for each individual demand, but may include detours for other demands.
We also suppose that the dial-a-ride system is operated in an on-the-fly manner, in which a passenger sends a demand when he/she wants to ride, and in a free-routing style, in which a bus can use any route in the town to pick up and transport passengers.
As written above, the purpose of the simulation is to investigate the feasibility of dial-a-ride systems. To do this, we focus on the usability and profitability of bus systems. Generally, evaluating such criteria is difficult, because usability depends on subjective factors and profitability may change according to social conditions. In order to avoid these difficulties, we simplify usability and profitability as described below, and compare these factors for dial-a-ride systems with those for traditional fixed-route bus systems. Such a relative evaluation enables us to ignore varied subjective criteria and social conditions. Of course, this technique is meaningful only when the simplified measures of usability and profitability are reasonable for the comparison.
For usability, we specifically address the primary purpose of a bus system: to provide a way for a passenger to reach her destination as quickly as possible. From this point of view, usability is defined as the average elapsed time from when a demand is reported to the bus center until the demand is satisfied. Profitability is formalized as the number of demands occurring in a unit period per bus, because most of the income and cost of general bus systems depends mainly on the number of passengers and the number of buses, respectively. In the experiments shown below, we set the simulation conditions so as to keep the profitability the same for both bus systems.
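As a minimal illustration of how these two simplified measures could be computed from a simulation log (the function names and log fields are invented, not those of our simulator):

```python
# Illustrative computation of the two measures; field names are hypothetical.
def usability(completed_demands):
    """Average elapsed time from when a demand reaches the bus center
    until the passenger is delivered to the destination."""
    return sum(d["completed_at"] - d["requested_at"] for d in completed_demands) / len(completed_demands)

def profitability(num_demands, num_buses, duration):
    """Number of demands occurring per unit period per bus."""
    return num_demands / (num_buses * duration)

# Example: three demands served by two buses over one simulated hour.
log = [{"requested_at": 0.0, "completed_at": 12.0},
       {"requested_at": 5.0, "completed_at": 25.0},
       {"requested_at": 9.0, "completed_at": 20.0}]
print(usability(log))                                      # 14.33... time units per demand
print(profitability(len(log), num_buses=2, duration=1.0))  # 1.5 demands per hour per bus
```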
4.2 Simulation Setup
Fig. 6 Setup of Evaluation of Dial-a-ride and Fixed-route Bus Systems
In the simulation, both the dial-a-ride and the fixed-route bus system are operated in a town whose roads form an 11 × 11 grid, as shown in Fig. 6. Passengers' origins and destinations are at crossings of the roads. For simplicity, we assume that there are no traffic jams.
We apply a genetic algorithm (GA) to determine bus routes for the fixed-route system. The usability of a fixed-route system varies according to its bus routes, so we need to use an optimal set of routes for a fair comparison with the dial-a-ride system. However, it is difficult to find the optimal set of routes to cover a town theoretically, because it is affected by many factors such as the number of buses and the average bus speed. We therefore utilize a GA to seek a semi-optimal set of routes for a given condition.
For the simulation of a dial-a-ride system, we must solve the problem of how to assign a new demand to a bus and re-plan the path of the assigned bus. Theoretically, this assignment and routing problem is a variation of the traveling salesman problem, which is hard to solve optimally at reasonable cost. Instead, we use a simple auction-like mechanism called successive best insertion [8], in which the system seeks the best pair of insertion positions for the two new via-points (the departure point and the destination point) in the queues of the buses when a new demand occurs.
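The following sketch illustrates the flavor of such an insertion search on the grid town, assuming Manhattan distances and ignoring capacity and time-window constraints; it is not the algorithm of [8] itself, and all names are illustrative.

```python
# Sketch of best insertion of a new demand into bus via-point queues (illustrative).
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def route_length(points):
    return sum(manhattan(points[i], points[i + 1]) for i in range(len(points) - 1))

def best_insertion(buses, pickup, dropoff):
    """buses: list of via-point queues, each beginning with the bus's current position.
    Returns (bus_index, new_queue) minimising the extra travel distance."""
    best = None
    for b, queue in enumerate(buses):
        base = route_length(queue)
        for i in range(1, len(queue) + 1):             # insertion position of the pickup
            for j in range(i + 1, len(queue) + 2):     # drop-off must come after the pickup
                candidate = queue[:i] + [pickup] + queue[i:]
                candidate = candidate[:j] + [dropoff] + candidate[j:]
                extra = route_length(candidate) - base
                if best is None or extra < best[0]:
                    best = (extra, b, candidate)
    return best[1], best[2]

# Example: two idle buses at (0, 0) and (10, 10); a new demand from (2, 3) to (8, 3).
buses = [[(0, 0)], [(10, 10)]]
bus, queue = best_insertion(buses, (2, 3), (8, 3))
print("assign to bus", bus, "new via-point queue:", queue)
```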
4.3 Results and Discussion
As described in section 4.1, we measure the usability of both bus systems under fixed profitability. Figure 7 shows the changes in usability (the average time to complete a demand) for both systems. In each graph, thin lines indicate the usability of dial-a-ride systems in which the ratio between the frequency of demands and the number of buses is fixed (the number of demands per bus is 1, 2, 3, 4, 5, 8, or 16). Generally, a high ratio is better for the profit of bus companies. The thick line in each graph indicates the average elapsed time of the fixed-route system. There is only one such line per condition because its usability depends only on the number of buses and not on the frequency of demands. Graphs (a)∼(f) show the results of experiments under the following conditions:
(a) Demands occur uniformly at any point in the town.
(b)∼(e) There is a center (like a train station) where 50–99% of demands are concentrated.
(f) There are two centers where 50% of demands are concentrated.
4.3.1 General Feature of Both Systems
Graph (a) shows that the usability of the dial-a-ride system improves faster than that of the fixed-route system when the frequency of demands increases. In both systems, usability improves because a passenger has more choices for reaching a destination. In addition, because the dial-a-ride system provides more flexibility in fitting passengers' demands, the improvement is greater than for the fixed-route system.
Fig. 7 Changes of Usability (Average Elapsed Time) under Fixed Profitabilities. Panels: (a) no center; (b) one center, concentration ratio = 0.50; (c) one center, concentration ratio = 0.70; (d) one center, concentration ratio = 0.90; (e) one center, concentration ratio = 0.99; (f) two centers, concentration ratio = 0.50. Each panel plots the average time to complete a demand against the number of buses (log scale) for the fixed-route system and for dial-a-ride systems with demand frequencies of 1, 2, 3, 4, 5, 8, and 16 demands per bus.
4.3.2 Effects of Concentrated Demands
A town generally has several centers, such as a train station or a shopping center, where demands are concentrated. Graphs (b)-(e) and (f) show the results of such cases. Comparing (b)-(e), we can see that the advantage of dial-a-ride systems in usability becomes more obvious when the concentration ratio is high. For example, for a dial-a-ride system with 8 demands per time unit per bus, usability becomes better than that of the fixed-route system when the number of buses is about 20 in (a), 16 in (c), 10 in (d), and 8 in (e). On the other hand, (f) shows that the fixed-route system benefits more than the dial-a-ride system in the case of two centers. The usability of the fixed-route system improves drastically in (f), while the dial-a-ride systems show results similar to the one-center case. Therefore, the break-even point of usability moves to the right, so that dial-a-ride systems have an advantage only in situations where the number of demands is very large.
4.4 Summary
From the results of the simulations shown in this section, we can summarize the features of a dial-a-ride system compared with a fixed-route system as follows:
• Generally, dial-a-ride systems are more scalable than fixed-route systems.
• The benefit of dial-a-ride systems is enhanced when the town has a single center, while fixed-route systems have a large advantage when many people move between two centers.
These results indicate that the benefits of MAS and ubiquitous computing applications are not guaranteed under all conditions. This kind of analysis will provide more objective measures to justify the costs of these technologies.
5 User Model Constructed Based on Sensor Network
5.1 Introduction
With the maturation of ubiquitous computing technology, particularly with advances in positioning and telecommunications systems, we are now in a position to design advanced assistance systems for many aspects of our lives [18]. Much research has focused on supporting pedestrians, e.g., navigation services to a destination and the distribution of data about surrounding stores [1, 2]. In such services, the key technology is location estimation. Recent developments in location estimation with GPS, digital cameras, IC tags, and wireless LAN have made it possible to acquire location information about a person, e.g., the position and trajectory of a pedestrian.
Among the many kinds of available information, location information is one of the most useful clues for grasping the context of a visitor, because the location of a visitor reveals which objects are around him/her and what kind of relationship those objects have with the visitor. Location information about the visitor supports both the technology needed for a multiagent social simulation, i.e., the construction of a user model, and the technology that is one of the verification targets of a multiagent social simulation, i.e., the provision of information. Thus, the acquisition of accurate location information helps to construct a precise model and to provide suitable information. A large quantity of data about visitor context, inferred from location information, is analyzed to derive the tendency of behavior selection in specific situations, which is required for the construction of a precise model. The context of the visitor is also used to understand what kind of information the visitor wants, which is required for the provision of suitable information. However, although both lines of research have developed similarly based on location information, they are usually taken up in different research fields: the construction of models is dealt with in the research area of multiagent (or single-agent) systems, and the provision of information is dealt with in that of ubiquitous computing. If these two technologies are treated on the same information platform, location information acquired for providing information to visitors can also be used to construct precise visitor models, and multiagent social simulation with precise user models can be used to verify new information-providing services.
In order to realize this mechanism, we have developed an integrated exhibition support system as an information platform that provides two services: location-aware content delivery, for providing information to visitors, and tour trend analysis of visitors in a pavilion, for constructing a model of the visitors. Through a proof-of-concept experiment in Expo 2005 Aichi, we confirmed that our proposed system enables both content delivery and visitor analysis. In this section, we introduce our location-aware content delivery system and the analysis of tour trends in the pavilion at Expo 2005 Aichi. Furthermore, we present an example of a pedestrian simulation with a visitor model based on the analysis.
5.2 Integrated Exhibition Support System
In this section, we explain our location-aware content delivery system [15], which uses two kinds of user devices, Aimulet GH+ and Aimulet GH, shown in Figs. 8 and 9.
5.2.1 User device
Aimulet GH+ is composed of a Personal Digital Assistant (PDA) and an active RFID tag. The system detects the location of a user with Aimulet GH+ every second. Based on the user's location, the system updates a content list containing 6 items of explanation about exhibits near the user's location. The user chooses one item from the content list and touches it on the display to play an explanation with sound, text, and graphics.
Aimulet GH is a system using credit-card-sized audio information terminals that provide descriptions of each exhibit in either Japanese or English.
Sound data of the explanations about the exhibits, together with energy, are transmitted simultaneously by infrared rays from light sources on the ceiling. This eliminates the need for a power supply and enables the realization of credit-card-sized terminals that are only 5 mm thick and weigh 28 g. Aimulet GH is also equipped with an active RFID tag (and a lithium button cell).
Fig. 8 Aimulet GH+
Fig. 9 Aimulet GH
5.2.2 System architecture
Our location-aware content delivery system comprises RFID antennas, RFID receivers, an RFID receiver server, a location estimate engine, and a content server. The data flow of the location-aware content delivery service with Aimulet GH+ in Fig. 10 is as follows. An active RFID tag on Aimulet GH+ or Aimulet GH transmits its tag ID; RFID receivers then detect the tag ID through the RFID antennas and send it to the RFID server. The RFID server stores the RFID data (the IDs of the RFID receivers detecting a tag ID and their respective time stamps), from which the location estimate engine determines the user's subarea. The content server sends the user's subarea to the content database and requests a reply containing the optimal content list. Subsequently, the content server sends the list to Aimulet GH+ through a wireless LAN, and Aimulet GH+ receives the content list and displays it.
Fig. 10 Data flow in the location-aware content delivery system
5.2.3 Location estimate
• Bridging by linearizing simplification
If we could assume an ideal environment in which radio wave conditions are extremely stable, the distance from an RFID tag to an RFID receiver could be calculated from the Received Signal Strength (RSS) at the RFID receiver [2]. However, it is difficult in the real world to observe the RSS precisely, because the RSS received by the RFID receivers is extremely unstable even when the RFID tag remains in a single position. Moreover, instability of radio wave conditions is also reported as a result of reflection and fading phenomena. To tackle this real-world instability, we employ the number of detections of a tag ID, instead of the RSS, as the key parameter for estimating its location. For robustness of computation, we propose a linearizing simplification of the relationship between the number of detections of a tag ID by an RFID receiver (antenna) and the distance from the RFID tag to the RFID receiver: we regard this relationship simply as "the closer an RFID tag is located to an RFID receiver, the higher the number of detections of its tag ID by that receiver." This linearizing simplification decreases the complexity of the location estimation algorithm and enables the parameters to be adjusted within a practical period.
• Multi-layer location estimate
The whole area is conceptually divided into subareas. The division is organized into layers, and several layers are prepared. Subareas in the same layer have similar sizes; the subareas in the lowest layer are the smallest of all layers, and higher-layer subareas are larger. The location estimate engine selects the one subarea of a layer in which the user with Aimulet GH+ is considered, with the highest probability, to be located. First, the location estimate engine tries to select a subarea in the lowest layer, which contains the smallest subareas. Because the subareas in lower layers are small, selecting one subarea there is more difficult. If it cannot do so confidently, the location estimate engine moves to a higher layer containing larger subareas and tries to select one subarea within that layer. Based on the number of detections of tag IDs by the RFID receivers, the selection proceeds as follows. First, a set of layers L is defined as L = {l_1, l_2, ..., l_i, ..., l_n}.  (1)
Second, the set of subareas S_i in layer l_i is defined as S_i = {s_{1,i}, s_{2,i}, ..., s_{j,i}, ..., s_{m_i,i}}.  (2)
A set of RFID receivers R is defined as R = {r_1, r_2, ..., r_k, ..., r_l}.  (3)
Each RFID receiver has a receiver point. Here, the receiver point rp_{r_k}(id, T) for a tag ID id at time T is the number of detections of tag ID id by RFID receiver r_k during one second at time T. To take past RFID data from t seconds ago into account, a time weight w_time(t) is applied. With the time weight w_time(t), we define the total receiver point trp_{r_k}(id, T) for a tag ID id at time T as trp_{r_k}(id, T) = ∑_{t=0} w_time(t) · rp_{r_k}(id, T − t).  (4)
Here, the time weight w_time(t) is a monotonically decreasing function. Each subarea has a subarea point. The subarea point sp_{s_{j,i}}(id, T) of subarea s_{j,i} in layer l_i for tag ID id at time T indicates how likely a user with an Aimulet GH+ transmitting tag ID id is to be in subarea s_{j,i}, based on the receiver points around subarea s_{j,i}. To calculate the subarea point sp_{s_{j,i}}(id, T), a contribution ratio is defined. The contribution ratio cr_{r_k}(s_{j,i}) of receiver r_k to subarea s_{j,i} in layer l_i indicates how much a detection of tag ID id by RFID receiver r_k contributes to inferring that the RFID tag transmitting that tag ID exists in subarea s_{j,i}. The value of the contribution ratio is determined based on the supposition that "the closer an RFID tag is located to an RFID receiver, the higher the number of detections of the RFID tag by that receiver." Therefore, the closer an RFID receiver is to a subarea (or the more it is contained within the subarea), the greater the contribution ratio of that receiver to the subarea. Each contribution ratio is set as a real number in the range [0, 1.0]. With the contribution ratio cr_{r_k}(s_{j,i}), the subarea point sp_{s_{j,i}}(id, T) is defined as sp_{s_{j,i}}(id, T) = ∑_{p=0}^{m_i} cr_{r_p}(s_{j,i}) · trp_{r_p}(id, T).  (5)
After calculating all subarea points in a layer, the location estimate engine selects the subarea with the highest subarea point among all subareas in that layer. However, if there is little difference between the highest subarea point sp_1 and the second highest subarea point sp_2, the location estimate engine selects no subarea in layer i; instead, it selects one subarea in the next-higher layer i + 1. The condition under which the location estimate engine rises from layer i to layer i + 1 is defined as min_ratio_{i,i+1} ≤ sp_2 / sp_1.  (6)
Here, min_ratio_{i,i+1} indicates the minimum ratio at which the location estimate engine rises from layer i to layer i + 1. Otherwise, the location estimate engine considers that the user with the Aimulet GH+ transmitting tag ID id exists in the subarea with the highest subarea point.
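A compact sketch of the selection procedure defined by equations (4)-(6) follows; the detection counts, contribution ratios, time window, and decay weight here are invented purely for illustration and are not the values calibrated for Orange Hall.

```python
# Illustrative sketch of the multi-layer subarea selection (eqs. (4)-(6)).
def total_receiver_point(detections, receiver, tag_id, T, window=5, decay=0.5):
    """Eq. (4): time-weighted sum of per-second detection counts rp_{r_k}(id, T - t)."""
    return sum((decay ** t) * detections.get((receiver, tag_id, T - t), 0)
               for t in range(window + 1))

def subarea_point(subarea, contribution, detections, tag_id, T):
    """Eq. (5): contribution ratio times total receiver point, summed over receivers."""
    return sum(cr * total_receiver_point(detections, r, tag_id, T)
               for r, cr in contribution[subarea].items())

def select_subarea(layers, contribution, detections, tag_id, T, min_ratio=0.8):
    """Start in the lowest (finest) layer and rise to a coarser one when eq. (6) holds."""
    best = None
    for layer in layers:                        # layers ordered from lowest to highest
        points = sorted(((subarea_point(s, contribution, detections, tag_id, T), s)
                         for s in layer), reverse=True)
        sp1, best = points[0]
        sp2 = points[1][0] if len(points) > 1 else 0.0
        if sp1 > 0 and sp2 / sp1 < min_ratio:   # clear winner: stay in this layer
            return best
    return best                                 # otherwise report the coarsest layer's best

# Toy example: two receivers, one tag, a fine layer {A1, A2} and a coarse layer {A}.
detections = {("r1", "tag7", 10): 4, ("r1", "tag7", 9): 3, ("r2", "tag7", 10): 1}
contribution = {"A1": {"r1": 1.0, "r2": 0.2}, "A2": {"r1": 0.2, "r2": 1.0},
                "A":  {"r1": 1.0, "r2": 1.0}}
print(select_subarea([["A1", "A2"], ["A"]], contribution, detections, "tag7", 10))  # A1
```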
5.3 Proof-of-Concept Experiments in Expo 2005 Aichi
5.3.1 Outline of Expo 2005 Aichi
Fig. 11 Arrangement of main exhibits in Orange Hall
Fig. 12 Subarea configuration of the lowest layer in Orange Hall
Expo 2005 Aichi was held in Nagoya, Japan (EXPO, 2005) from March 25 through September 25, 2005, a duration of 185 days. There were about 90 pavilions, and 22 million visitors attended. Global House had three parts: the Mammoth Laboratory, the Orange Hall, and the Blue Hall. Global House was one of the most popular pavilions, since about 10% of the 22 million visitors visited it within half a year. In the Orange Hall, a maximum of 350 visitors entered every 20 min from 9:30 AM to 8:00 PM. Fig. 11 shows the arrangement of the main exhibits listed in Table 1, together with the entrance and exit (No. 1 to 11), in Orange Hall.
5.3.2 Implemented services in Orange Hall
At the entrance of Orange Hall (Fig. 11, No. 1), general visitors received an Aimulet GH, and mobility-challenged visitors (in wheelchairs) received an Aimulet GH+. At the exit (Fig. 11, No. 11), all visitors were required to return them. The content server selects 6 of the 41 contents for Aimulet GH+ based on the location of each visitor with an Aimulet GH+ and sends the content list to the device; the selected contents are mainly those near the visitor. Fifty infrared light sources for Aimulet GH are set on the ceiling of Orange Hall (there are 50 audio contents for Aimulet GH).
Table 1 Main exhibits
No.1 Entrance (Visitors receive Aimulet GH or Aimulet GH+.)
No.2 Waiting room for NHK Super High Vision Theater
No.3 Entrance of NHK Super High Vision Theater
No.4 Exit of NHK Super High Vision Theater
No.5 Restoration of Toumai Skull
No.6 Restoration of Yukagir Mammoth
No.7 Statue of Dyonisius
No.8 Virtual Reality Theater
No.9 Moon Rock
No.10 Footmark on Moon
No.11 Exit (Visitors return Aimulet GH or Aimulet GH+.)
Regarding the sensor devices carried by visitors, the active RFID tag in Aimulet GH and Aimulet GH+ was used to detect rough locations and to count the number of visitors in the exhibit areas of Orange Hall. In Orange Hall, 139 RFID receivers and antennas were set up on the ceiling near the exhibits. Fig. 12 shows part of the configuration of subareas actually applied in Orange Hall. Although our multi-layer location estimate has three layers in Orange Hall, we show only the configuration of the lowest (3rd) layer and omit the higher layers. In the 3rd layer, the whole of Orange Hall is divided into 25 subareas; the shapes of A1 to J4 are shown in Fig. 12.
5.4 Tour Trend Analysis and Its Application
5.4.1 Tour Trend
In this section, we present the analysis results based on the location data of visitors in Orange Hall. The location data were acquired by the location estimate system with the active RFID system (Aimulet GH and Aimulet GH+). Fig. 13 shows the average time that visitors who entered Orange Hall between 9:55 AM and 4:25 PM on 25 August 2005 stayed in each of the subareas shown in Fig. 12. The horizontal axis indicates the subarea, and the vertical axis indicates the average staying time in that subarea. Because subarea B1 is the waiting room for the NHK Super High Vision Theater in Fig. 11, all visitors receive an explanation about the Super High Vision Theater and how to use Aimulet GH there for about 10 minutes; as a result, the average staying time in subarea B1 is about 10 minutes. Because subareas J3 and J4 form the exit in Fig. 11, all visitors return their Aimulet GH and Aimulet GH+ in J3 and J4, and the returned devices remain there until they are carried to the back yard for storage; as a result, the average staying time of subareas J3 and J4 is longer than that of other areas. The other subareas with longer average staying times contain the popular exhibits listed in Table 2. We confirm that the average staying time is long in these subareas with popular exhibits.
Table 2 Popular exhibits and their subareas
Subarea E1: No.5 Restoration of Toumai Skull
Subarea F1: No.6 Restoration of Yukagir Mammoth
Subareas G2, G3: No.7 Statue of Dyonisius
Subarea H1: No.8 Virtual Reality Theater
Subareas H3, I1: No.9 Moon Rock
Subarea I3: No.10 Footmark on Moon
Next, we divide all visitors into four typical groups based on their staying time and examine the movement features of each group. Fig. 15 shows the frequency of staying times in three-minute intervals; the horizontal axis is the staying time and the vertical axis is the number of visitors. The staying time is defined as the period from the first time a visitor is estimated to be in subarea B1 by the location estimate system to the first time the visitor is estimated to be in subarea J4. The minimum staying time is 24 minutes, because the explanation in the waiting room of the Super High Vision Theater takes about 11 minutes, the screening takes about 11 minutes, and movement from the entrance to the exit takes about 2 minutes. Based on the frequency of the staying times, we divide visitors into the following four types: type 1 visitors spend less than 33 minutes, type 2 visitors spend 33 to 54 minutes, type 3 visitors spend 54 to 69 minutes, and type 4 visitors spend more than 69 minutes. Fig. 14 shows the staying time of types 1 to 4 in each subarea; the horizontal axis is the subarea and the vertical axis is the average staying time of each type in that subarea. From this graph, as the type changes from 1 to 4, the staying time in each subarea increases. However, the staying time does not increase in all subareas; only the staying times in the subareas with the popular exhibits shown in Fig. 11 increase. We can therefore confirm that, although the total staying time of visitors increases, only the staying time in subareas with popular exhibits increases, and the number of exhibits viewed does not increase.
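A minimal sketch of this grouping, using the staying-time thresholds (33, 54, and 69 minutes) described above:

```python
def visitor_type(staying_minutes):
    """Assign a visitor to one of the four groups used in the tour trend analysis."""
    if staying_minutes < 33:
        return 1
    if staying_minutes < 54:
        return 2
    if staying_minutes < 69:
        return 3
    return 4

# Example: a 61-minute visit falls into type 3.
assert visitor_type(61) == 3
```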
5.4.2 Application to pedestrian simulation As an application of the tour trend analysis, we assign the staying times of the visitor types 1 to 4 to the pedestrian agents in our pedestrian simulation. A screenshot of the simulation is shown in Fig. 16. Although we have not yet completed the verification of the accuracy of the simulation results, the congestion near the exit of the Super High Vision Theater and near the popular exhibits is reproduced well. It is valuable that the pedestrian simulation is built on real data acquired in the pavilion and on its analysis results.
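A minimal sketch of how the per-type staying times could drive pedestrian agents: each agent is assigned a visitor type and draws its dwell time in a subarea around that type's observed average. The exponential draw and the example averages are illustrative modeling assumptions, not the authors' simulator:

```python
import random

# Hypothetical average staying times (minutes) per (visitor type, subarea).
AVG_STAY = {(1, "E1"): 0.5, (4, "E1"): 2.5, (1, "B1"): 10.0, (4, "B1"): 10.0}

def dwell_time(visitor_type, subarea, rng=random):
    """Sample how long a pedestrian agent of a given type stays in a subarea."""
    mean = AVG_STAY.get((visitor_type, subarea), 1.0)  # fall back to 1 minute
    return rng.expovariate(1.0 / mean)  # exponential draw around the observed mean
```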
Fig. 13 Average staying time in each subarea of all visitors
Fig. 14 Average staying time of type 1 to 4 in each subarea
Fig. 15 Frequency of the staying time of all visitors
Fig. 16 Screenshot of our pedestrian simulation
5.5 Summary In this section, we described an integrated information support system with fine-grained location estimation that we developed to realize location-aware content delivery and tour trend analysis of visitors. We explained the features of our system architecture with the active RFID system and the services implemented at Expo 2005 Aichi. Finally, we showed the results of the tour trend analysis used to construct a visitor model, and presented our pedestrian simulation driven by the staying time in each subarea.
6 Concluding Remark In this chapter, we showed three examples of ways to demonstrate the validity and feasibility of AISE technologies by MASS. As introduced in Section 2, we should keep track of five issues (Clarity, Simplicity, Bias-free, Fidelity, and Tractability) to conduct such simulations; the examples illustrate the importance of these issues. As with simulations of physical phenomena, MASS can be a powerful tool to evaluate, analyze, and develop new technologies and new social systems. Because the effects of AISE on society increase nonlinearly, methodologies to evaluate these effects virtually will be necessary for AISE research. However, the simulation itself is not a perfect predictor of a future society. The main problem is a lack of sufficient data to develop and validate models of human behavior. Therefore, work like that in Section 5 will be important for future AISE research.
Multi-Agent Strategic Modeling in a Specific Environment Matjaz Gams and Andraz Bezek
1 Introduction Multi-agent modeling in ambient intelligence (AmI) is concerned with the following task [19]: how can external observations of multi-agent systems in the ambient environment be used to analyze, model, and direct agent behavior? The main purpose is to obtain knowledge about activities in the environment and thus enable proper actions of AmI systems [1]. Analysis of such systems must therefore capture complex world-state representations and asynchronous agent activities. Instead of studying basic numerical data, researchers often use more complex data structures, such as rules and decision trees. Some methods are extremely useful for characterizing the state space, but lack the ability to clearly represent the temporal state changes caused by agent actions. To comprehend simultaneous agent actions and complex changes of the state space, a combination of graphical and symbolic representation most often performs better in terms of human understanding and performance. Multi-agent strategic modeling (MASM) is a process that discovers strategic activity from raw multi-agent action data [8]. A strategic activity is a time-limited multi-agent activity that exhibits some important or unique domain-dependent characteristics. The task is to graphically and symbolically represent strategies of multi-agent systems (MAS) in a given environment, to create an algorithm that discovers strategic agent behavior, and to enable humans to understand and study the underlying behavioral principles. The task is therefore to translate a multi-agent action sequence and observations of a dynamic, complex, and multivariate world state into a graph-based and rule-based representation. By using hierarchically ordered domain knowledge, the algorithm should be able to generate strategic action descriptions and corresponding rules at different levels of abstraction.
Matjaz Gams and Andraz Bezek
Department of Intelligent Systems, Jozef Stefan Institute, Slovenia, e-mail: [email protected], [email protected]
The domain-independent approach is applied in the Multi-Agent Strategy Discovering Algorithm (MASDA) [2] and tested on the RoboCup Simulated League domain (RoboCup) [16], a multi-agent domain where two teams of 11 agents play simulated soccer games. Physical 2D soccer is simulated with additional "real life" noise when calculating the forces on objects, and continuous time is approximated with discrete cycles. All agents can move and act independently as long as they comply with the soccer rules. Agents can freely communicate with each other, but, as in the real world, their visual and hearing perception is distance-limited. The domain is at the same time a complex MAS and easily understandable by humans thanks to its soccer-related content. The simulated RoboCup environment represents a standard soccer playground; it is simplified in comparison to a human soccer field, but only to adjust to the small and inaccurate sensors the robo-players have. It is therefore suitable for testing different multi-agent strategies for acting in a dynamic environment. To show how the system is supposed to perform, an example of a sequence of RoboCup actions is presented. The four frames in Figure 1 show a successful direct-attack action. The players are shown as filled circles with a pointer indicating body direction; the attacking team is black and the defending team is gray. The ball is a white circle with a black border, and its trace is drawn throughout the sequence. Lines represent traces of movement. The action sequence in Figure 1 is typically described by humans as: "The left team's left forward player dribbles from the left half of the middle third into the penalty box, then passes the ball to the center forward in the center of the penalty box, where the receiver successfully shoots into the right part of the goal." This sequence is presented graphically in Figure 2, consisting of player roles, player actions, and factual domain-related descriptions, such as the positions of the involved players. The diagrammatic action representation is commonly used to help understand spatial and temporal activity, as shown in Figure 2. Since the aim of MASM is to discover strategic activity and to describe it in a human-understandable form, MASDA uses a similar representation, with a diagram, an action sequence, and rules that describe factual domain features. AmI systems and humans could deduce at least some understanding of user behavior from such behavior patterns. This chapter is organized as follows. Section 2 reviews the related work. Section 3 presents MASDA. Section 4 describes an example of a generated strategy. In Section 5 an evaluation of the described method is given, followed by a discussion in Section 6.
2 Overview of the Field Multi-agent modeling is a topic at several agent conferences such as AAMAS. The task of graphically representing agent actions has been addressed by several multi-agent graphical models, most notably by multi-agent influence diagrams [12] and
influence diagram networks [19]. However, all of these are concerned with agent decision making, not agent modeling, and are therefore not suitable for representing observed strategic agent behavior. The work of Intille [15] deals with the related task of recognizing multi-agent actions in the American football domain. However, the focus of that work is on recognition: using a database of plans, knowledge about football, computation of perceptual features from the trajectories, and propagation of uncertain reasoning, the recognition system can determine whether an example of a game play is in fact an example of a p51curl play. The drawback of their approach is that it explicitly makes use of a database of plans and cannot discover new behavior. The paper by Hirano and Tsumoto [9] presents a novel method for finding interesting pass patterns in soccer game records. Taking two features of the pass
Fig. 1 Four frames from a RoboCup game showing a successful direct-attack action sequence.
sequence, temporal irregularity and the requirement for multiscale observation, into account, they developed a comparison method for the sequences based on multiscale matching. The method is used with hierarchical clustering to produce clusters of spatial pass patterns. Experimental results on soccer game records demonstrate that the method is able to discover interesting pass patterns that may be associated with successful goals. That work is concerned primarily with spatial pass patterns, while our process also captures important domain-specific knowledge. There has been extensive research on quantitative modeling in the RoboCup domain. Unfortunately, the problem of constructing human-readable representations of multi-agent activity has received little interest. The work of Raines et al. [17] focused on an automated assistant for analyzing agent teams. Their system ISAAC performs post-hoc, off-line analysis of teams using agent-behavior traces in the domain. The analysis is performed bottom-up, using data mining and inductive learning techniques. ISAAC is able to discover interesting patterns based on user-provided features, thus combining the analytical capabilities of the machine with the domain knowledge of the user. Their system is similar to ours in terms of domain knowledge and rule induction; however, our approach explicitly exploits abstraction and provides a graphic representation of detected strategies. The work of Kuhlmann et al. [13] deals with the problem of formulating advice-giving as a learning problem. They are able to learn and predict agent behavior from past observations and automatically generate advice to improve team performance. Although their work concentrates on the advice-giving task, they successfully construct rules that predict the next significant action events. Our work is similar in terms of rule induction; however, we focus on increased human comprehension and visual representation as opposed to purely classification accuracy. Kaminka et al. [11] address the problem of unsupervised autonomous learning of the sequential behaviors of agents. They translate observations of the complex world state into a time series of recognized atomic behaviors. The time series are then analyzed to find repeating or statistically significant subsequences characterizing each team. Their work addresses only frequent action-subsequence matching, while we construct a graph-based action model that provides spatial, sequential, and temporal information about agent actions. Kaminka et al. later introduced [10] a
Fig. 2 An example of a soccer strategy: a left side direct attack.
behavior-based approach to plan recognition, in which hierarchical behaviors are used to model the observed robots. Efficient symbolic algorithms for answering key queries about the observed robots are also presented. However, their approach explicitly utilizes a hierarchical behavior library, while our process creates graph-based action models on the fly. The work of Sukthankar and Sycara [20] presents a cost-minimization approach to the problem of human behavior recognition in a military domain. Using full-body motion-capture data acquired from human subjects, the system is able to recognize the behaviors that a human subject is performing. Low-level motion classification is performed using support vector machines (SVMs) and a hidden Markov model (HMM); the output from the classifier is used as an input feature for the behavior recognizer. Given the dynamic and highly reactive nature of the domain, their system is able to handle behavior sequences that are frequently interrupted and often interleaved. While their approach can identify different behaviors using a behavior library, our approach detects strategic activity without prior strategic knowledge. MASDA was first presented in [2] as a conference paper. Here it is extended and enriched with several new sections.
3 The Multi-Agent Strategy Discovering Algorithm The MASDA algorithm is designed to discover strategic multi-agent activity in three steps: preprocessing, graphical-description generation, and symbolic-description generation. Each step is further split into smaller tasks (with data structures in parentheses):
• Data preprocessing:
  – Detection of actions in raw data (attributes)
  – Action sequence generation (symbolic sequence)
  – Introduction of domain knowledge (hierarchical concepts)
• Creating the graphical description:
  – Action graph creation (action graph)
  – Abstraction process (abstract action graph)
  – Strategy selection (graph path)
• Learning the symbolic description:
  – Action description generation (action description)
  – Data generation (set-valued attributes)
  – Rule induction (rules)
In the first two tasks MASDA detects actions in the raw multi-agent data and generates the corresponding action sequences. A detailed description of the preprocessing is omitted due to lack of space. The domain knowledge is introduced in Section 3.1. The process of generating graphical descriptions is described in Section 3.2. In Section 3.3 we describe the process of symbolic description generation.
3.1 Domain-Dependent Knowledge MASDA is a domain-independent algorithm that exploits domain knowledge through taxonomies of domain concepts. A taxonomy T is a hierarchical representation of domain concepts. A concept x ∈ T is an ancestor of a concept y ∈ T, written x ← y, if it denotes a more general concept than y. Hierarchically ordered domain knowledge allows MASDA to move up and down the hierarchy to produce more or less abstract descriptions. Specifically, MASDA makes use of:
• a taxonomy of agent roles (T_roles)
• a taxonomy of agent actions (T_actions)
• taxonomies of binary domain features (T_features)
The depth of a concept is defined with respect to the hierarchically ordered structure as the length of the shortest path from the root to the concept. Consequently, the smaller the depth, the more abstract the concept; the root concept has depth 0. Since agents in a MAS can change roles and thus change their behavior [14], agent roles are assigned dynamically during agent activity. Similarly, each detected agent action is assigned its corresponding hierarchical representation. While agent roles and actions are used to describe the activity of agents, the domain features are used to describe whether a particular domain feature holds in the domain state space. In our case, domain features are only defined for moving entities, the ball or an agent, and are used to describe agent positions, their relation to the ball, and their states. For example, in the RoboCup domain, the feature HasBall is true only for the agent that controls the ball (e.g., agent1:HasBall=true) and is false for all other agents (e.g., agent2:HasBall=false). A part of the ball-related domain feature taxonomy is shown in Figure 3. The following domain taxonomies are defined for soccer:
• The ball-related taxonomy defines relations of players to the ball, as shown in Figure 3. For example: has-ball, left, right, front, back, incoming, moving-
Fig. 3 An example of ball-related soccer domain concepts.
away, incoming-fast, incoming-slow, moving-away-fast, moving-away-slow, still, immediate, near, far, etc.
• The speed-related taxonomy defines the speed of an agent or the ball: stationary, moving, fast, medium, and slow.
• The position-related taxonomy defines concepts describing whether the ball or an agent is in a certain spatial region of the playing field. For example: left-wing, center-of-the-field, right-wing, left-half, right-half, defending-third, defending-third-danger-zone, defending-third-penalty-box, defending-third-goal-box, middle-third, center-circle, attacking-third, attacking-third-danger-zone, attacking-third-penalty-box, attacking-third-goal-box, etc.
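A minimal sketch of how such taxonomies can be represented and queried (depth, ancestor chain, most specific common ancestor); the class name and the small role fragment are illustrative assumptions, not part of the original implementation:

```python
class Taxonomy:
    """Hierarchy of domain concepts; each concept has at most one parent."""

    def __init__(self, parents):
        self.parents = parents  # {concept: parent, with None for the root}

    def depth(self, concept):
        d = 0
        while self.parents[concept] is not None:
            concept = self.parents[concept]
            d += 1
        return d  # root has depth 0; smaller depth means more abstract

    def ancestors(self, concept):
        path = [concept]
        while self.parents[concept] is not None:
            concept = self.parents[concept]
            path.append(concept)
        return path  # [concept, parent, ..., root]

    def common_ancestor(self, a, b):
        """Most specific concept that generalizes both a and b."""
        ancestors_a = set(self.ancestors(a))
        for c in self.ancestors(b):
            if c in ancestors_a:
                return c
        return None

# Illustrative fragment of a role taxonomy.
roles = Taxonomy({"player": None, "FW": "player", "L-FW": "FW", "R-FW": "FW"})
assert roles.common_ancestor("L-FW", "R-FW") == "FW"
```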
3.2 Graphical Representation This section presents the process of constructing a graphical action representation from the action sequence generated in the preprocessing step.
3.2.1 Action Graph An action graph (AG) is a directed graph in which connections correspond to actions and each node represents the common action concept defined by all of its outgoing connections. Nodes a and b are connected, a → b, if an action α, represented by node a, is followed by an action β, represented by node b. Terminal actions (i.e., the last action in an action sequence) are connected to a terminal node. For example, an action sequence (α, β, γ) is represented as the AG a → b → c → c_end. Node positions correspond to agent positions in the domain space, and appropriate hierarchical ac-
Fig. 4 An example of role concept taxonomy.
tion and role concepts are assigned to each node. This enables MASDA to generate more abstract descriptions of agent roles and actions. Each connection also keeps the original action instance (i.e., the start time and duration) of the represented action. An example of an action graph, obtained from the actions in 10 RoboCup games, is presented in Figure 5. A more detailed description of action graphs and the construction process is given in [3].
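A minimal sketch of building an action graph from observed action sequences, assuming each action is a record with role, action concept, position, start time, and duration; node identity is simplified here relative to the full construction described in [3]:

```python
from collections import defaultdict

class ActionGraph:
    """Directed graph: nodes are action concepts, edges are observed transitions."""

    def __init__(self):
        self.edges = defaultdict(list)        # node -> list of successor nodes
        self.instances = defaultdict(list)    # (node, successor) -> action instances

    def node_key(self, action):
        # Simplified node identity: one node per (role, action concept).
        return (action["role"], action["action"])

    def add_sequence(self, sequence):
        for current, nxt in zip(sequence, sequence[1:]):
            a, b = self.node_key(current), self.node_key(nxt)
            self.edges[a].append(b)
            self.instances[(a, b)].append((current["start"], current["duration"]))
        # The terminal action connects to a terminal node.
        self.edges[self.node_key(sequence[-1])].append("TERMINAL")

graph = ActionGraph()
graph.add_sequence([
    {"role": "L-FW", "action": "Long-dribble", "start": 10, "duration": 30},
    {"role": "L-FW", "action": "Pass-to-player", "start": 40, "duration": 5},
    {"role": "C-FW", "action": "Successful-shoot", "start": 45, "duration": 2},
])
```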
3.2.2 Abstraction Unlike computers, humans cannot deal well with large amounts of data, and large action graphs with many nodes are difficult to visualize. One way to overcome this problem is to reduce the number of nodes, but one would like to do so in a consistent manner. This can be accomplished with hierarchical clustering of graph nodes. By merging the two nearest graph nodes, we generate a new node that represents the common role and action concepts (i.e., the most specific common ancestor of both concepts). The rationale behind the merge process is that actions frequently occurring near each other in the domain space define important strategic concepts. The definition of 'near' thus solely governs how the abstraction process combines action descriptions. Although the distance function is strictly domain-dependent, it must utilize the conceptual distance between actions, agent roles, and domain positions. Formally, the distance between nodes a and b is a function distance(a, b), which must comply with the following three distance axioms [6]:
• Non-negative definiteness: ∀a, b: dist(a, b) ≥ 0 and dist(a, b) = 0 ⇐⇒ a = b
• Symmetry: dist(a, b) = dist(b, a)
• Triangular inequality: ∀a, b, c: dist(a, b) ≤ dist(a, c) + dist(c, b)
The distance between graph nodes is defined as a weighted sum of the distances between node positions and the graph distances between role and action concepts. The
Fig. 5 An example of action graph. Each line represents one action.
weights can be assigned by trial and error. A more in-depth description of the distance function and the whole abstraction process is given in [4]. An abstract action graph (AAG) is an action graph in which graph nodes represent more than one agent action; it describes agent behavior more abstractly than the original action graph. An abstract action graph in which the minimal distance between nodes is greater than or equal to some value dist is labeled AAGdist. Such a graph can be created by repeatedly merging the nearest nodes until the minimal distance between nodes is greater than dist. The action description of a node in AAGdist is a combination of the node position, the corresponding action and role concepts, and the parameter dist. AAGs with a greater dist value represent actions more abstractly than AAGs with a lower dist value; the value of the parameter dist can therefore be regarded as the degree of abstraction of an AAG. An example abstract action graph AAG8 is presented in Figure 6.
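A minimal sketch of the merge step, reusing the illustrative Taxonomy helper sketched in Section 3.1. The weights and the concept-distance measure below are assumptions for illustration, not the exact function described in [4]:

```python
import math

def concept_distance(tax, x, y):
    """Tree distance between two concepts via their most specific common ancestor."""
    if x == y:
        return 0
    common = tax.common_ancestor(x, y)
    return tax.depth(x) + tax.depth(y) - 2 * tax.depth(common)

def node_distance(a, b, roles, actions, w_pos=1.0, w_role=4.0, w_action=4.0):
    """Weighted sum of spatial and conceptual distances between two graph nodes."""
    d_pos = math.dist(a["pos"], b["pos"])
    d_role = concept_distance(roles, a["role"], b["role"])
    d_act = concept_distance(actions, a["action"], b["action"])
    return w_pos * d_pos + w_role * d_role + w_action * d_act

def merge(a, b, roles, actions):
    """Merge the two nearest nodes into a single, more abstract node."""
    return {
        "pos": tuple((x + y) / 2.0 for x, y in zip(a["pos"], b["pos"])),
        "role": roles.common_ancestor(a["role"], b["role"]),
        "action": actions.common_ancestor(a["action"], b["action"]),
        "instances": a.get("instances", []) + b.get("instances", []),
    }
```

Since both the Euclidean term and the tree distances are metrics, their weighted sum satisfies the three distance axioms listed above.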
3.2.3 Graphical Detection of Strategies How can one distinguish a strategic plan from agent activity that merely results from local agent decisions? The underlying hypothesis is that multiple similar agent activities observed over a longer time period tend to be more strategic than activities that emerge from local decisions, which can be regarded as noise. We therefore consider repeated and similar action activity to be strategic activity. Similar actions are represented as a single node in an abstract action graph, and the amount of similarity is regulated by the value of the abstraction parameter dist. Multi-agent activity is represented as a path in an AAG; hence, repeated activity can be discovered by finding an adequate number of complete action instance sequences along an AAG path. Figure 7 shows an example of a discovered path, together with the traces of its action instances. Although not all actions continue throughout the whole path, three action sequences span the whole path, indicating that the path exhibits repeated and similar agent activity and thus represents a strategic activity.
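A rough sketch of this path-selection idea, assuming each edge of the abstract action graph stores the identifiers of the original action sequences whose instances it carries; a path is reported as strategic when at least min_support sequences traverse every edge of the path. The threshold and the edge bookkeeping are assumptions for illustration:

```python
def strategic_paths(aag_edges, edge_sequences, min_support=3, max_len=6):
    """Enumerate paths whose every edge is covered by >= min_support sequences.

    aag_edges: {node: [successor, ...]}
    edge_sequences: {(node, successor): set of sequence ids observed on that edge}
    """
    results = []

    def extend(path, covering):
        if len(covering) < min_support:
            return
        if len(path) > 1:
            results.append((list(path), set(covering)))
        if len(path) >= max_len:
            return
        for nxt in aag_edges.get(path[-1], []):
            seqs = edge_sequences.get((path[-1], nxt), set())
            extend(path + [nxt], covering & seqs)

    for start in aag_edges:
        for nxt in aag_edges[start]:
            extend([start, nxt], set(edge_sequences.get((start, nxt), set())))
    return results
```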
Fig. 6 An example of abstract action graph (AAG8 ).
3.3 Symbolic Representation This section presents the process of generating a symbolic description from the obtained graphical representation of strategic activity.
3.3.1 Action Description Generation The path nodes in the obtained graphical representation of strategic activity usually represent a sequence of strategic actions. A symbolic action description is generated by analyzing the nodes in a path and extracting role and action concepts together with the node position and abstraction value. Table 1 presents an obtained strategic action description sequence; it corresponds to the AAG8 path shown in Figure 7. The sequence is strategic because it describes a strategic soccer situation: a successful shot. In general, it is assumed that a strategy generated from an AAGdist with a greater dist value is more abstract than one generated from an AAGdist with a lower dist value. Action descriptions can be used to create a pattern for detecting strategic activity. This is particularly useful when observing unseen agent activity: if opponent activity matches the starting part of a previously constructed strategic action description, it is reasonable to conclude that the agents will follow this strategic activity. Consequently, agents can anticipate further opponent actions and devise a suitable counter-activity. Table 1 A strategic action description from AAG8
LTeam.FW: Long-dribble (x: 2.3, y: -11.9), dist: 8
LTeam.FW: Pass-to-player (x: 39.2, y: -10.8), dist: 8
LTeam.FW: Successful-shoot (x: 41.0, y: -4.9), dist: 8
LTeam.FW: Successful-shoot (End) (x: 53.6, y: 4.5), dist: 8
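A minimal sketch of using such a description as a pattern for anticipation: observed activity is matched against the starting part of a stored strategic action description, with a position tolerance. The matching rule and the tolerance value are illustrative assumptions:

```python
def matches_prefix(observed, strategy, pos_tolerance=10.0):
    """True if the observed (role, action, x, y) steps match the start of a strategy."""
    if len(observed) > len(strategy):
        return False
    for obs, ref in zip(observed, strategy):
        if obs["role"] != ref["role"] or obs["action"] != ref["action"]:
            return False
        if abs(obs["x"] - ref["x"]) > pos_tolerance or abs(obs["y"] - ref["y"]) > pos_tolerance:
            return False
    return True

# If the opponent's first steps match a known strategy, the remaining steps of
# that strategy can be used to anticipate what comes next and plan a counter.
```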
Fig. 7 An example of a graphical representation of strategic activity.
3.3.2 Rule Induction From the graphical action description one can generate rules that describe specific agent actions. Data for the rule induction algorithm are constructed as follows: positive examples are the action instances in the target node, while negative examples are the instances in nearby nodes (i.e., near misses). For each time step, one generates all pairs (agent role, domain feature), written as role:feature, and stores only the true ones. Since agents dynamically change roles, the attribute-value approach is not suitable here, because it is impossible to generate all role:feature pairs for all time steps. The internal data representation of an example is shown in Table 2, with its corresponding attribute-value representation shown in Table 3. Table 2 Internal data representation.
         time      1     2     3     4
agent A  role      L-FW  L-FW  C-MF  C-MF
agent A  feature1  T     T     F     F
agent A  feature2  T     F     T     T
agent B  role      C-MF  C-MF  R-FW  R-FW
agent B  feature1  F     T     T     T
agent B  feature2  T     T     T     F
Table 3 Attribute-value data representation with roles instead of agent names.
time           1  2  3  4
L-FW feature1  T  T  /  /
L-FW feature2  T  F  /  /
R-FW feature1  /  /  T  T
R-FW feature2  /  /  T  F
C-MF feature1  F  T  F  F
C-MF feature2  T  T  T  T
The representation in Tables 2 and 3 suffers from poor expressiveness, because role:feature pairs are not defined for all times. To improve it, one can define set-valued attributes, i.e., attributes whose domains are sets instead of single values. With this approach, each feature corresponds to one set-valued attribute whose values are the sets of agent roles for which the corresponding role:feature pair is true. The example data is presented in Table 4. Now it is possible to automatically generate common parent role concepts due to the hierarchically ordered nature of role concepts; an example is shown in Table 5. This allows the rule inducer to induce rules that describe agent activity in a more
Table 4 Set-valued attribute data representation.
time      1             2                 3             4
feature1  {L-FW}        {L-FW, C-MF}      {R-FW}        {R-FW}
feature2  {L-FW, C-MF}  {C-MF}            {R-FW, C-MF}  {C-MF}
abstract manner. Namely, the rule generated from the example shown in Table 4 is IF (C-MF:feature2) ⇒ X, while the rule generated from the example in Table 5 is IF (FW:feature1 and C-MF:feature2) ⇒ X. The latter rule exhibits a more abstract role concept (FW instead of L-FW and R-FW) and is more specific, as it has two features instead of only one. Table 5 Set-valued attribute data representation with generated common parent role concepts (FW).
time      1             2                    3             4
feature1  {L-FW, FW}    {L-FW, C-MF, FW}     {R-FW, FW}    {R-FW, FW}
feature2  {L-FW, C-MF}  {C-MF}               {R-FW, C-MF}  {C-MF}
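A minimal sketch of building the set-valued representation of Tables 4 and 5, assuming the raw observation is, per time step, a mapping from agent to (role, {feature: truth value}) and reusing the illustrative Taxonomy helper sketched in Section 3.1. Adding all ancestor role concepts is a simplification of the parent-concept generation described in the text:

```python
from collections import defaultdict

def set_valued(observations, roles=None):
    """observations: {time: {agent: (role, {feature: bool})}}.

    Returns {feature: {time: set of roles for which the feature is true}}.
    If a role taxonomy is given, ancestor role concepts are added as well (Table 5).
    """
    table = defaultdict(dict)
    for t, agents in observations.items():
        for agent, (role, features) in agents.items():
            for feature, value in features.items():
                if value:
                    cell = table[feature].setdefault(t, set())
                    cell.add(role)
                    if roles is not None:
                        cell.update(roles.ancestors(role)[1:-1])  # parents, not root
    return table
```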
4 Human Analysis of Results For each action obtained from the nodes in a discovered strategic path, learning data were generated. SLIPPER [7], a set-valued rule inducer, was used with fixed parameter values to generate the corresponding rules. In a typical experiment, 10 games were given as input and several tens of strategic action descriptions were constructed each time. For a chosen description we constructed rules for AAG1 to AAG20. The example in Table 6 presents automatically constructed rules generated from the AAG8 action description (see Figure 8).
Fig. 8 An example of discovered strategic action description (thick arrows) and original action sequence instances (thin lines).
Table 6 Rules describing discovered strategy.
LTeam.FW:Long-dribble (First frame in Figure 1):
  RTeam.C-MF:Moving-away-slow ∧ RTeam.L-FB:Still ∧ RTeam.R-FB:Short-distance
LTeam.FW:To-player (Second frame in Figure 1):
  RTeam.R-FB:Immediate
LTeam.FW:Successful-shoot (Third frame in Figure 1):
  RTeam.C-FW:Moving-away ∧ LTeam.R-FW:Short-distance
LTeam.FW:Successful-shoot-END (Last frame in Figure 1):
  RTeam.RC-FB:Left ∧ RTeam.RC-FB:Moving-away-fast ∧ RTeam.R-FB:Long-distance
The rules might seem a bit difficult to comprehend, but in our experience humans get used to the formalism in about ten minutes. Analyses with human experts revealed that the rules describe important strategic information:
• LTeam.FW:Long-dribble
  RTeam.C-MF:Moving-away-slow: The right team's center midfielder was occupied with the previous ball owner and was for a while moving away from the ball. Consequently, he stayed behind the action.
  RTeam.L-FB:Still: This player, at the bottom of the first frame in Figure 1, is moving adjacent to the ball; in terms of movement relative to the ball he appears still. His movement is defensive, but he fails to directly intercept or cover any of the attackers.
  RTeam.R-FB:Short-distance: The right team's right fullback unsuccessfully tried to intercept the dribbler and consequently stayed at a short distance behind the ball. Humans would probably favor the equivalent RTeam.R-FB:behind feature.
  The three clauses successfully point out the three main reasons for the success of the attack.
• LTeam.FW:To-player
  RTeam.R-FB:Immediate: This rule essentially states that a pass occurred because the right team's right fullback came too close to the ball. The ball owner was threatened by the opponent, so he passed the ball to a teammate who was left uncovered.
• LTeam.FW:Successful-shoot
  RTeam.C-FW:Moving-away: Since shots represent very fast ball movement, players are moving away from the ball. The rule inducer therefore just picked one feature among many equivalents; this is essentially the same as ball:fast, which would have been far more intuitive.
  LTeam.R-FW:Short-distance: There are many agents at a short distance from the ball (LTeam.L-FW, LTeam.R-FW, RTeam.L-FB, RTeam.LC-FB, RTeam.RC-FB, and RTeam.R-FB). As the rule inducer considers shorter rules to be better than longer ones,
it only included one such feature instead of more, as a human would expect. The rule inducer was not able to choose the most appropriate, human-favored features. One reason for this is the small number of learning examples and the consequently substantial number of equivalent agent:feature pairs for each situation. As the rule inducer has no means to favor one feature over another, it simply picks one. However, expert analysis showed that this nevertheless results in meaningful rules describing actual causal strategic relations.
5 Evaluation MASDA was evaluated on two domains related to robotic soccer: the first is the complex RoboCup domain, the second is the simpler 3vs2 Keepaway domain.
5.1 RoboCup The RoboCup domain is a computer-simulated environment in which two teams of 11 agents play simulated soccer. One of the advantages of this domain is the absence of technical problems such as object perception, communication, and robot motion. This abstraction enables researchers to focus on higher-order goals like planning, cooperation, and learning. The simulated RoboCup environment represents a standard soccer playground. The RoboCup Soccer Server System (RCSSS) simulates the movement of the players and the ball and enforces the soccer rules. The environment is distributed, and only the score and time are globally known values; agents must gather other information by themselves or through communication. MASDA was evaluated on 10 RoboCup games played during SSIL [18]. A leave-one-out strategy was used to generate 10 learning tasks. A pre-determined strategy, shown in Table 1 and Figure 7, was used as a reference and was generated from all 10 games, for abstractions AAG1 to AAG20. For each learning task, a strategy was generated on 9 games and tested on the remaining game, again for AAG1 to AAG20. The tests measured the difference between action descriptions, the quality of the generated rules for all four action concepts, and the joint use of rules together with action descriptions. The results presented in Figure 9 are averaged over the 10 tests, where the x-axis is the level of abstraction (parameter dist for AAGdist). The results in Figure 10a indicate that the accuracy of action descriptions is approximately constant with respect to the abstraction. The accuracy of rules quickly increases until it peaks at dist = 4 and then slowly decreases. This is expected because for lower abstractions, nodes represent only a few action instances, which is insufficient for the rule inducer to generate good rules. With high dist values, nodes represent quite different action concepts, thus producing more abstract and less accu-
rate rules. Accuracy when using both rules and descriptions follows the rule accuracy for lower abstraction values, since the rules are too specific. With higher abstraction values (> 14) the overall accuracy follows the description accuracy, since the rules become too general. When measuring recall, all three test scenarios give similar results, as shown in Figure 10b: the quality of classifying true cases increases up to dist = 3 and then remains steady until dist = 14, when it quickly drops. A similar phenomenon is observed when measuring precision, shown in Figure 10c. This can be explained by the generated agent behavior becoming too abstract. The last test was performed to verify whether the abstraction process really generates more abstract descriptions. For this test we measured the abstraction of the generated rules, defined as the average feature depth in the feature taxonomy for the features used in the rules; the lower the depth, the more abstract the rule. The results, presented in Figure 9, clearly show that the average feature depth is negatively correlated with the abstraction parameter dist. This confirms that as the abstraction value increases, the rules contain more abstract features. Since the rules were generated from node data, we can safely conclude that the generated strategies and AAGs exhibit the same principle: an AAG with a higher dist value presents more abstract action activity than one with a lower dist value.
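As a small illustration of this abstraction measure, the following sketch computes the average feature depth of a set of rules, reusing the illustrative Taxonomy helper sketched in Section 3.1; the rule representation (a list of (role, feature) conditions) is an assumption:

```python
def rule_abstraction(rules, feature_taxonomy):
    """Average depth of the features appearing in a set of rules.

    A lower average depth means the rules use more abstract features.
    """
    depths = [
        feature_taxonomy.depth(feature)
        for rule in rules
        for _, feature in rule  # each condition is a (role, feature) pair
    ]
    return sum(depths) / len(depths) if depths else 0.0
```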
Fig. 9 The abstraction of generated rules and the number of action instances in the strategy from Figure 7.
Fig. 10 Measuring the relation between abstraction and accuracy (left figure), recall (middle figure), and precision (right figure). The x-axis is the degree of abstraction (parameter dist).
5.2 3vs2 Keepaway RoboCup is a complex domain with 22 agents. There is a well-known simpler domain in which 3 keepers (marked K1, K2, K3) have to keep the ball away from 2 attackers (marked T1, T2). The domain is also simplified with respect to agent actions: only the player with the ball can act. It can hold the ball or pass it to one of its teammates; actions can also involve moving a player to a better position. When constructing the 3vs2 Keepaway domain, researchers tried to simplify the RoboCup domain while preserving its attraction. In 3vs2 Keepaway it is easier to implement different agent strategies, easier to compare their quality, and easier to interpret the results. In spite of the simplifications, the domain preserves all the important properties of multi-agent domains: a single agent cannot solve the problem by itself, there is no central control, and execution is asynchronous and noisy. The distance dist(a, b) is defined as the distance from agent a to agent b, and the angle ang(a, b, c) as the angle between agents a and c with the vertex at agent b. Let C denote the center of the playground. The domain space is described with 13 attributes: dist(K1, C), dist(K2, C), dist(K3, C), dist(T1, C), dist(T2, C), dist(K1, K2), dist(K1, K3), dist(K1, T1), dist(K1, T2), min(dist(K2, T1), dist(K2, T2)), min(dist(K3, T1), dist(K3, T2)), min(ang(K2, K1, T1), ang(K2, K1, T2)), min(ang(K3, K1, T1), ang(K3, K1, T2)). We describe actions with the hierarchy of actions presented in Figure 12a and agent roles with the hierarchy of roles presented in Figure 12b. The input for MASDA is a sequence of agent actions. Every game in 3vs2 Keepaway is relatively short (less than 20 s), so the complete input corresponds to a sequence of a number of games. In this way, similar situations happen often, which is one of the main conditions for the successful performance of MASDA. The 3vs2 Keepaway domain is different from RoboCup, therefore a new distance function is defined. The distance between features a and b is defined as

dist(a, b) = dist_pos(a, b)   if distH_role(a, b) + distH_action(a, b) = 0,
dist(a, b) = ∞                otherwise,
Fig. 11 Possible actions of agent with a ball in 3vs2 Keepaway domain.
where dist_pos(a, b) is the Euclidean distance: dist_pos(a, b) = √((a.pos.x − b.pos.x)² + (a.pos.y − b.pos.y)²). The goal here is not discovery but the actual imitation of input strategies in sequences of games with a known strategy. We generated descriptions of macro-action concepts. The constructed rules and the sequence of action concepts obtained with MASDA were entered into the program that controls the agents in the 3vs2 Keepaway environment. We then played 3vs2 games with our agents and compared this game style with the original. The procedure is described in Figure 13. For testing we used three strategies:
• the two reference strategies included in the 3vs2 Keepaway environment: hand and rand,
• the strategy hand3-6-9, where each player has a different strategy.
The rand strategy randomly chooses an action at every interval. The hand strategy holds the ball if the distance to an opponent is more than 5 m and passes the ball to a teammate if the distance is less than 5 m and the teammate is open. With the hand3-6-9 strategy, the players use different variations of the hand strategy: one player uses a 3 m threshold, one a 6 m threshold, and one a 9 m threshold. We compared the learned strategies with the reference strategies. The comparison was made with four methods:
Fig. 12 a) Hierarchy of actions and b) hierarchy of roles in the 3vs2 Keepaway domain.
Fig. 13 Procedure for imitation of the reference strategy.
• Measuring the average episode time, which measures the success of the learned and reference strategies.
• Measuring the agreement between the actions chosen by the learned and reference strategies, which shows the similarity between them.
• Analyzing the produced rules, which gives an insight into the play mechanism.
• Visual judgment as an auxiliary evaluation, based on human judgment of the similarity of play.
The results are presented in Table 7, showing a high match between the learned and the reference hand strategy. The match between learned rand and reference hand is less exact, which shows that the hand and rand strategies are quite different. The match between the learned strategies and reference rand is between 31% and 34%, as expected, since rand randomly chooses actions with probability 1/3. Table 7 Percentage of agreement of learned played actions with reference played actions.
Strategy              Agreement with reference strategy hand  Agreement with reference strategy rand
reference hand        100%                                    33.07%
best learned hand     94.14%                                  33.93%
average learned hand  93.24%                                  33.00%
worst learned hand    92.69%                                  32.41%
reference rand        33.85%                                  100%
best learned rand     67.76%                                  33.72%
average learned rand  58.90%                                  32.83%
worst learned rand    47.93%                                  31.33%
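A minimal sketch of the action-agreement measure reported in Table 7, assuming both the learned and the reference strategy can be queried for an action in each logged state; the state log and policy interface are assumptions. The example hand_policy follows the 5 m hold/pass rule described above:

```python
def action_agreement(states, learned_policy, reference_policy):
    """Percentage of logged states in which both strategies choose the same action."""
    if not states:
        return 0.0
    same = sum(
        1 for state in states
        if learned_policy(state) == reference_policy(state)
    )
    return 100.0 * same / len(states)

def hand_policy(state, threshold=5.0):
    """Hold unless an opponent is closer than the threshold and a teammate is open."""
    if state["nearest_opponent_dist"] > threshold or not state["open_teammate"]:
        return "hold"
    return "pass"
```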
The hand and rand strategies do not distinguish between players of the same team. For testing MASDA we implemented the strategy hand3-6-9, where every player has a different strategy. The rules produced by MASDA are very similar to the rules of hand3-6-9. For example, the reference rule for agent 1 is that it must hold the ball if the opponent is farther than 9 m: (IF DistK1T1>9 → hold). The learned rule is (IF 8
6 Advantages of Strategic Modeling The overall assumption is that understanding the behavior of agents in an environment is helpful for several AmI-related tasks. With this aim, a new general domain-independent framework for discovering strategic behavior of multi-agent systems in a given environment was designed. Domain-specific knowledge was introduced in the form of role, action, and domain feature taxonomies. It is assumed that changing
a domain is a straightforward task that would require changing specific domain knowledge of a similar form. We believe that there is a wide range of possible domains that can be exploited by MASDA, since its essence is stepwise abstraction in the domain space. The experiments indicate that the approach is practically useful in terms of accuracy, recall, and precision. Our tests also confirm that the abstraction process in fact generates more abstract descriptions of agent activity. The weak point of the presented research is that MASDA was evaluated only on the RoboCup domain. Although the authors believe that no major problem should emerge when introducing another domain, this should be verified in practice. Understanding the strategies of agents (i.e., users) on the basis of their movement represents an important asset for several AmI tasks, and it might not be as far-fetched as it seems. Acknowledgements The authors would like to thank Gal Kaminka for helpful discussions and thoughtful comments. This work was partially funded as a PhD project for Andraž Bežek by the Slovenian Research Agency.
References [1] J. C. Augusto: Ambient Intelligence: Basic Concepts and Applications. In Selected Papers from ICSOFT'06. Springer Verlag, 2008. [2] A. Bezek: Automatic Modeling of Multiagent Systems. PhD Thesis, Faculty of Computer and Information Science, Ljubljana, 2007. [3] A. Bezek: Modeling Multiagent Games Using Action Graphs. Proceedings of Modeling Other Agents from Observations (MOO 2004), 2004. [4] A. Bezek: Discovering Strategic Multi-Agent Behavior in a Robotic Soccer Domain. Proceedings of AAMAS 05, 2005. [5] A. Bezek, M. Gams, and I. Bratko: Multi-agent strategic modeling in a robotic soccer domain. In Proc. AAMAS, 2006, pp. 457-464. [6] J. Calmet, A. Daemi: From Entropy to Ontology. Proceedings of the Fourth International Symposium From Agent Theory to Agent Implementation, 2004. [7] W. W. Cohen and Y. Singer: A simple, fast, and effective rule learner. Proceedings of the Sixteenth National Conference on Artificial Intelligence, pp. 335-342, Orlando, United States, 1999. [8] M. Gams: Weak Intelligence: Through the Principle and Paradox of Multiple Knowledge (Advances in Computation, vol. 6). Huntington: Nova Science, 2001. [9] S. Hirano and S. Tsumoto: Finding Interesting Pass Patterns from Soccer Game Records. The Eighth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2004), 2004. [10] G. Kaminka and D. Avrahami: Symbolic Behavior-Recognition. Proceedings of Modeling Other Agents from Observations (MOO 2004), 2004.
[11] G. Kaminka, M. Fidanboylu, A. Chang, and M. Veloso: Learning the sequential coordinated behavior of teams from observations. Proceedings of the RoboCup-2002 Symposium, 2002. [12] D. Koller and B. Milch: Multi-Agent Influence Diagrams for Representing and Solving Games. Games and Economic Behavior, 45(1), pp. 181-221, 2003. [13] G. Kuhlmann, P. Stone, and J. Lallinger: The UT Austin Villa 2003 Champion Simulator Coach: A Machine Learning Approach. In Daniele Nardi, et al., editors, RoboCup-2004: Robot Soccer World Cup VIII, Springer Verlag, Berlin, 2005. [14] R. Nair, M. Tambe, and S. Marsella: Role allocation and reallocation in multiagent teams: Towards a practical analysis. Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems, 2003. [15] S. S. Intille: Visual Recognition of Multi-Agent Action. PhD Thesis, MIT, 1999. [16] I. Noda et al.: Overview of RoboCup-97. In Hiroaki Kitano, editor, RoboCup-97: Robot Soccer World Cup I, pp. 20-41. Springer Verlag, 1997. [17] T. Raines, M. Tambe, and S. Marsella: Towards automated team analysis: a machine learning approach. Third International RoboCup Competitions and Workshop, 1999. [18] RoboCup 2004: RoboCup Simulation League. RoboCup04 world cup game repository, http://carol.science.uva.nl/jellekok/robocup/rc04/, 2004. [19] T. Steffens et al.: RoboCup Special Interest Group on Multi-Agent Modelling. http://www.cogsci.uni-osnabrueck.de/tsteffen/mamsig/, 2001. [20] G. Sukthankar and K. Sycara: A cost minimization approach to human behavior recognition. Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, The Netherlands, ACM Press, 2005. [21] D. Suryadi and P. J. Gmytrasiewicz: Learning models of other agents using influence diagrams. Proceedings of the Int. Conference on User Modeling, pp. 223-232, 1999.
Learning Activity Models for Multiple Agents in a Smart Space Aaron Crandall and Diane J. Cook
1 Introduction With the introduction of more complex intelligent environment systems, the possibilities for customizing system behavior have increased dramatically. Significant headway has been made in tracking individuals through spaces using wireless devices [1, 18, 26] and in recognizing activities within the space based on video data (see the chapter by Brubaker et al. and [6, 8, 23]), motion sensor data [9, 25], wearable sensors [13], or other sources of information [14, 15, 22]. However, much of the theory and most of the algorithms are designed to handle one individual in the space at a time. Resident tracking, activity recognition, event prediction, and behavior automation become significantly more difficult in multi-agent situations, when there are multiple residents in the environment. The goal of this research project is to model and automate resident activity in multiple-resident intelligent environments. There are simplifications that would ease the complexity of this task. For example, we could ask residents to wear devices that enable tracking them through the space [7, 17, 26]. This particular solution is impractical for situations in which individuals do not want to wear the device, forget to wear the device, or enter and leave the environment frequently. Similarly, video cameras are valuable tools for understanding resident behaviors, even in group settings [2, 10]. However, surveys with target populations have revealed that many individuals are averse to embedding cameras in their personal environments [8]. As a result, our aim is to identify the individuals and their activities in an intelligent environment using passive sensors.
Aaron S. Crandall
School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA, e-mail: [email protected]
Diane J. Cook
School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA, e-mail: [email protected]
To achieve this overall goal, our first step is to design an algorithm that maps each sensor event to the resident responsible for triggering it. This information will allow our algorithms to learn profiles of resident behaviors, identify the individuals currently in the environment, monitor their well-being, and automate their interactions with the environment. Some earlier approaches allow for multiple residents in a single automated setting [4]. In addition, a number of algorithms have been successfully designed for activity recognition, albeit in a single-resident setting [12]. To date, however, the focus has often been on looking at global behaviors and preferences with the goal of keeping a group of residents satisfied [21]. In contrast, our research is focused on identifying an individual and logging their preferences and behaviors in the context of multi-resident spaces. The solutions used in this work revolve around using very simple passive sensors, such as motion, contact, and door sensors, appliance interaction, and light switches, to give a picture of what is transpiring in the space. These information sources offer the benefits of being fixed, unobtrusive, and robust devices. Examples of the motion detectors and light switches we use in our testbed are shown in Figure 3. Smart homes are often targeted towards recognizing and assisting with the Activities of Daily Living (ADLs) that the medical community uses to categorize levels of healthy behavior in the home [5, 16, 20]. The ability of smart homes to help disabled and elderly individuals continue to operate in a familiar and safe environment is one of the greatest reasons for their continued development. So far, most smart home research has focused on monitoring and assisting a single individual in a single space. Since homes often have more than a single occupant, building solutions for handling multiple individuals is vital. Dealing with multiple residents has rarely been the central focus of research so far, as there have been numerous other challenges to overcome before the technology can effectively handle multiple residents in a single space. Since smart home research has the ultimate goal of being deployable in real-world environments, seeking solutions that are as robust as possible is always a factor in the systems we engineer. With that in mind, building an entirely passive solution gives the advantage of keeping the technology separate from the residents while they go about their daily routines. This lets the smart home feel as "normal" as possible to the residents and their guests. By reducing the profile of the new devices as much as possible, people should be less affected by the technology that surrounds them. In this chapter we present a solution to part of the problem described above. Specifically, we apply a supervised machine learning algorithm to the task of mapping sensor events to the resident responsible for the event. The solution proposed in this work offers the advantage of using previous behavioral data collected from the set of known residents without requiring significant additional actions to be performed by the residents. This historical behavior is used to train the learning algorithm for future real-time classification of the individuals and can be updated over time as new data arrives.
Here we present the results of using a naive Bayesian classifier to learn resident identities based on observed sensor data. Because this machine learning algorithm is probabilistic, likelihood values are generated for each resident that can be used to appropriately modify the behavior of the intelligent environment. Because the algorithm is efficient and robust, we hypothesize that it will be able to accurately handle the problem of learning resident identities and be usable in a real-time intelligent environment. We validate our hypothesis using data collected in a real smart workplace environment with volunteer participants.
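A minimal sketch of the naive Bayes formulation for this task, assuming each sensor event has already been turned into a feature string such as those in Table 2 later in the chapter; the Laplace smoothing and the feature encoding are illustrative choices, not necessarily those of the deployed system:

```python
from collections import Counter, defaultdict
import math

class ResidentNaiveBayes:
    """P(resident | feature) ∝ P(feature | resident) * P(resident)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, events):
        """events: iterable of (feature_string, resident_id) pairs."""
        for feature, resident in events:
            self.class_counts[resident] += 1
            self.feature_counts[resident][feature] += 1
            self.vocabulary.add(feature)

    def likelihoods(self, feature):
        """Return a log-likelihood score for each known resident."""
        total = sum(self.class_counts.values())
        scores = {}
        for resident, count in self.class_counts.items():
            prior = math.log(count / total)
            conditional = math.log(
                (self.feature_counts[resident][feature] + 1)
                / (count + len(self.vocabulary))  # Laplace smoothing
            )
            scores[resident] = prior + conditional
        return scores

    def classify(self, feature):
        scores = self.likelihoods(feature)
        return max(scores, key=scores.get)
```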
2 Testbeds The smart home testbed environments at Washington State University consist of a lab space on campus and a town home off campus. These testbeds are part of WSU's CASAS smart environments project [19]. For our study, we used the lab space on campus, as multiple faculty, staff, and students regularly enter the space and a number of different kinds of activity take place throughout the rooms. The space is designed to capture temporal and spatial information via motion, door, temperature, and light control sensors.
Fig. 1 Layout of motion (M), door (D), and light switch (S) sensors in the testbed environment.
Fig. 2 2D view of inner office furniture layout.
For this project we focus on events collected from motion sensors and resident interaction with lighting devices. The testbed layout is shown in Figures 1 and 2. Throughout this space, motion detectors are placed on the ceilings and pointed straight down, as shown in Figure 3. Their lenses are occluded to a smaller rectangular window, giving each roughly a 3' x 3' coverage area of the corresponding floor space. By placing them roughly every four feet, they overlap (by a few inches up to a foot) and allow tracking of an individual moving across the space. The motion sensor units are able to sense a motion as small as reaching from the keyboard to a mouse. With this level of sensitivity, sensors around workspaces trip even when people sit quietly in a private space to work at a computer. To provide control and sensing over the lighting, Insteon switches are used to control all of the ceiling and desk lights in the room. These switches communicate with a computer, and all interactions with them are logged. See Figure 3 for images of both the motion and light switch sensors. The entire lab space, including the portion shown in Figure 2, has two doors with simple magnetic open/closed sensors
Fig. 3 CASAS motion sensor and Insteon light switch controller.
These door sensors record door openings and closings via the same bus as the motion detectors. By being able to log any major movement throughout the space, as well as device interactions, this system captures basic temporal and spatial behaviors that can be used to identify individuals based on their behavior. Residents in this lab have unique work spaces, as well as unique time frames within which they operate. The tools used in this project are designed to exploit both the spatial and temporal differences between individuals to accurately classify a given individual. Such differences are features of most kinds of living spaces and can be used by software algorithms to accurately identify the current resident.
3 Data Representation

The data gathered by CASAS for this study is represented by a quintuple:
1. Date
2. Time
3. Sensor Serial Number
4. Event Message
5. Annotated Class (Resident ID)
The first four fields are generated automatically by the CASAS data collection infrastructure. The annotated class field is the target field for this problem and represents the resident ID to which the sensor event can be mapped. Training data was gathered over several weeks in the lab space by asking three individuals working in the lab to log their presence. These participants registered their presence in the lab by pushing a unique button on a pinpad when they entered and again when they left the space. During post-processing, the database was filtered to only use sensor events during the time windows when there was a single resident
in the space. The corresponding data for the given time frame was then annotated and supplied as training data to our machine learning algorithm. The total time frame for data collection was three weeks, and over 6000 unique events were captured and annotated as training data. A sample of the data as quintuples is shown in Table 1.

Table 1 Sample of data used for classifier training.
Date        Time      Serial      Message  ID
2007-12-21  16:41:41  07.70.eb:1  ON       Abe
2007-12-21  16:44:36  07.70.eb:1  OFF      Abe
2007-12-24  08:13:50  E9.63.a7:5  ON       John
2007-12-24  14:31:30  E9.63.a7:5  OFF      John
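To make this representation concrete, the following minimal Python sketch (not the actual CASAS collection code) shows one way an annotated event such as those in Table 1 might be represented and parsed; the class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class SensorEvent:
    """One annotated CASAS-style event: date, time, sensor serial, message, resident ID."""
    date: str
    time: str
    serial: str
    message: str
    resident: str

def parse_event(line: str) -> SensorEvent:
    # Expects whitespace-separated fields, e.g. "2007-12-21 16:41:41 07.70.eb:1 ON Abe"
    date, time, serial, message, resident = line.split()
    return SensorEvent(date, time, serial, message, resident)

events = [parse_event("2007-12-21 16:41:41 07.70.eb:1 ON Abe"),
          parse_event("2007-12-24 08:13:50 E9.63.a7:5 ON John")]
```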
One aspect of the data that presents a particular challenge for supervised machine learning algorithms is the time value. In order to insightfully reason about the time at which events occur, we created more complex features designed to capture the differences in behavior between individuals. Primarily, these strategies revolved around using the date and time information to give the classifier additional information in the form of "feature types", as shown in Table 2. The times that different people work, especially in a research lab, are very helpful in discriminating the likely resident that is currently in the space. In our lab, one of the three participants worked late at night on several occasions, while another of the participants was the only one to ever arrive before 10am. By incorporating temporal information into the features, this kind of behavior can improve the accuracy of the classifier. Given an automatic training system, picking the best feature type(s) to use can be based on a combination of the resulting accuracy and false positive rates.

Table 2 Feature types used to represent time values for classifier training.
Feature Type  Example
Simple        07.70.eb:1#ON
Hour of Day   07.70.eb:1#ON#6
Day of Week   07.70.eb:1#ON#FRI
Part of Day   07.70.eb:1#ON#MORNING
Part of Week  07.70.eb:1#ON#WEEKDAY
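As an illustration of how the feature types in Table 2 might be derived from the raw quintuple fields, the sketch below builds the feature strings; the exact hour boundaries used for the part-of-day buckets are assumptions rather than values taken from the chapter.

```python
from datetime import datetime

def make_feature(serial: str, message: str, date: str, time: str, feature_type: str) -> str:
    """Build a feature string in the style of Table 2 (illustrative sketch)."""
    stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    base = f"{serial}#{message}"
    if feature_type == "Simple":
        return base
    if feature_type == "Hour of Day":
        return f"{base}#{stamp.hour}"
    if feature_type == "Day of Week":
        return f"{base}#{stamp.strftime('%a').upper()}"
    if feature_type == "Part of Day":
        # Bucket boundaries are an assumption for illustration only.
        part = "MORNING" if stamp.hour < 12 else "AFTERNOON" if stamp.hour < 18 else "EVENING"
        return f"{base}#{part}"
    if feature_type == "Part of Week":
        return f"{base}#{'WEEKEND' if stamp.weekday() >= 5 else 'WEEKDAY'}"
    raise ValueError(f"unknown feature type: {feature_type}")

# e.g. make_feature("07.70.eb:1", "ON", "2007-12-21", "16:41:41", "Day of Week") -> "07.70.eb:1#ON#FRI"
```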
4 Classifier

We selected a naive Bayes classifier for our learning problem. These kinds of classifiers have been used to great effect in other smart home research projects [24]. Applying the same kind of tool to individual identification from the same kind of data set is a logical extension. In this case, a simple naive Bayes classifier was trained, where the features were built from the event information. The target class is the individual with whom the event is associated. This requires each event to be distilled to a single feature paired with a given class. The class is set by the annotation, but the feature chosen can be built from a number of the fields. For the simplest interpretation, only the serial number coupled with an event message was used, as in Row 1 of Table 2. This simple feature set provides a good baseline to compare with more complex parsings. The more complex parsings, such as "Part-of-Week" (i.e., WEEKDAY or WEEKEND), capture more information about the given behavior, and can give the classifier more information for correct future classifications. Depending on the facets of the data set, different kinds of feature types can give the classifier better or worse results.

The data set was randomly split into training and testing sets, with 10% of each class set aside for testing. The classifier was trained on the 90% and run against the testing set. Each class was given an accuracy rate and a false positive rate. This process was repeated for each of our feature types for comparison of accuracy and false positive rates. Training the classifier followed a simple naive Bayes algorithm, as shown in Equation 1.

Likelihood(Person_n | Event_i) = P(Person_n) * P(Event_i | Person_n)    (1)
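As a concrete reading of Equation 1, the following is a minimal counting-based sketch of such a classifier; it assumes events have already been reduced to single feature strings like those in Table 2, and the add-one smoothing is an added assumption rather than part of the original system.

```python
from collections import Counter, defaultdict

class SimpleNaiveBayes:
    """Minimal naive Bayes over single feature strings, following Equation 1."""

    def __init__(self):
        self.class_counts = Counter()               # events observed per resident
        self.feature_counts = defaultdict(Counter)  # per-resident counts of feature strings
        self.total = 0

    def train(self, labeled_features):
        # labeled_features: iterable of (feature_string, resident) pairs
        for feature, resident in labeled_features:
            self.class_counts[resident] += 1
            self.feature_counts[resident][feature] += 1
            self.total += 1

    def likelihood(self, feature, resident):
        # P(resident) * P(feature | resident), with add-one smoothing (an assumption)
        prior = self.class_counts[resident] / self.total
        vocab = {f for counts in self.feature_counts.values() for f in counts}
        cond = ((self.feature_counts[resident][feature] + 1) /
                (self.class_counts[resident] + len(vocab)))
        return prior * cond

    def classify(self, feature):
        return max(self.class_counts, key=lambda r: self.likelihood(feature, r))

# Illustrative usage:
nb = SimpleNaiveBayes()
nb.train([("07.70.eb:1#ON#16", "Abe"), ("E9.63.a7:5#ON#8", "John")])
nb.classify("07.70.eb:1#ON#16")  # -> "Abe"
```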
In this case, Event_i is defined by what kind of feature type is being used (see Table 2 for the ones used in this study). So, the likelihood that a particular person, n, generated sensor event i can be calculated as the product of the probability that person n generates a sensor event (independent of a particular event) and the probability that person n generates event i. The different feature choices available (i.e., Simple vs. Hour of Day) split the data up in different ways. Each way captures the behaviors of the residents with varying degrees of accuracy, depending on the feature types chosen and the behavior of the individuals in the data set. The purely statistical nature of a naive Bayes classifier has the benefit of being fast for use in prediction engines, but lacks the ability to handle context in the event stream that could be advantageous in discerning different behaviors. The naive Bayes classifier is a bag-of-sensor-data learning algorithm, which neither explicitly represents nor reasons about the temporal and spatial relationships between the data points. In contrast, a Markov model encompasses this type of information and, therefore, may do a better job of performing the classification task. A Markov Model (MM) is a statistical model of a dynamic system. A MM models the system using a finite set of states, each of which is associated with a multidimensional
probability distribution over a set of parameters. The system is assumed to be a Markov process, so the current state depends on a finite history of previous states (in our case, the current state depends only on the previous state). Transitions between states are governed by transition probabilities. For any given state a set of observations can be generated according to the associated probability distribution. Because our goal is to identify the individual that caused the most recent event in a sequence of sensor events, we generate one Markov model for each smart environment resident whose profile we are learning. We use the training data to learn the transition probabilities between states for the corresponding activity model and to learn probability distributions for the feature values of each state in the model.

A Markov model processes a sequence of sensor events. We feed the model with a varying-length sequence of sensor readings that lead up to and include the most recent reading. To label the most recent reading with the corresponding resident that caused the sensor event, we compute Person_n as argmax_{n in N} P(n | e_{1..t}) = argmax_{n in N} P(e_{1..t} | n) P(n). P(Person_n) is estimated as before, while P(e_{1..t} | n) is the result of computing the sum, over all states S in model n, of the likelihood of being in each state after processing the sequence of sensor events e_{1..t}. The likelihood of being in state s in S is updated after each sensor event e_j is processed using the formula found in Equation 2. The probability is updated based on the probability of transitioning from any previous state to the current state (the first term of the summation) and the probability of being in the previous state given the sensor event sequence that led up to event e_j.

P(S_j | e_{1..j}) = P(e_j | S_j) * sum over s_{j-1} of P(S_j | s_{j-1}) P(s_{j-1} | e_{1..j-1})    (2)
We learn a separate Markov model for each resident in the smart environment. To classify a sensor event, the model that best supports the sensor event sequence is identified and the corresponding resident is output as the label for the most recent sensor event.
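The sketch below illustrates this scheme in a simplified form: each observed feature string is treated as its own state, so the emission term of Equation 2 becomes trivial and each per-resident model reduces to a first-order Markov chain over feature strings. It is an illustration of the idea, not the authors' implementation, and the smoothing constant is an assumption.

```python
import math
from collections import Counter, defaultdict

class ResidentMarkovChain:
    """Per-resident first-order Markov chain over feature strings (simplified sketch)."""

    def __init__(self, smoothing=1e-3):
        self.start = Counter()             # counts of features that begin a trace
        self.trans = defaultdict(Counter)  # transition counts between consecutive features
        self.smoothing = smoothing         # smoothing constant (an assumption)

    def train(self, sequences):
        # sequences: list of feature-string lists generated by this resident
        for seq in sequences:
            if not seq:
                continue
            self.start[seq[0]] += 1
            for prev, cur in zip(seq, seq[1:]):
                self.trans[prev][cur] += 1

    def _prob(self, counter, key):
        total = sum(counter.values())
        return (counter[key] + self.smoothing) / (total + self.smoothing * (len(counter) + 1))

    def log_likelihood(self, trace):
        ll = math.log(self._prob(self.start, trace[0]))
        for prev, cur in zip(trace, trace[1:]):
            ll += math.log(self._prob(self.trans[prev], cur))
        return ll

def classify_trace(models, priors, trace):
    """models: {resident: ResidentMarkovChain}; priors: {resident: P(resident)}."""
    return max(models, key=lambda r: math.log(priors[r]) + models[r].log_likelihood(trace))
```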
5 Results

Figure 4 shows the classification accuracy of our naive Bayes classifier for the three residents we tested in our lab space. In order to keep the actual participant names anonymous, we label the three residents John, Abe, and Charlie. In Figure 4 we graph not only the classification accuracy for each target value, but also the false positive rate. Note that the classification accuracy is quite high for the John values, but so is the false positive rate. This is because our John participant was responsible for most (roughly 62%) of the sensor events in the training data. As a result, the a priori probability that any sensor event should be mapped to John is quite high, and the naive Bayes classifier incorrectly attributes Abe and Charlie events to John as well.
Fig. 4 Classification accuracy using all sensor events.
On the other hand, while Charlie has a much lower correct classification rate, he also has a lower false positive rate. If the intelligent environment can take likelihood values into account, this information about false positives can be leveraged accordingly.

In order to address these classification errors, we added more descriptive features to our data set. In particular, we added the date and time of each sensor event, as shown in Table 2. The classifier can now use time-of-day or day-of-week information to differentiate between the behaviors of the various individuals. For example, John always arrived early in the day, while Abe was often in the space late into the evening. Finding the correct features to use to capture this kind of behavior can be done by balancing the overall correct classification rate and the false positive rate against one another.

The choice of feature descriptors to use is quite important and has a dramatic effect on the classification accuracy results. Looking at the accuracy rate as affected by the chosen feature type (Figure 5), using hour-of-day increases the average identification accuracy significantly. Additionally, by using hour-of-day, the false positive rate drops dramatically, as shown in Figure 6. When the right features are selected from the data set, the classifier is able to make better overall classifications.

To demonstrate the effect that picking the best time-based identification features has on an individual's accuracy, refer to Figure 7. The first column represents the simplest feature set (Table 2, row 1); comparing it against the hour-of-day feature (Table 2, row 2) readily shows that, given this extra information, John's accuracy percentage barely moves, while his false positive rate drops drastically, as shown in Figure 8. The false positive rate actually drops from 34% to 9%, which is a marked improvement. Use of the other time-based features results in some improvements to John's classification, but none of the others is as useful as adding the hour-of-day feature.
Fig. 5 Average classification accuracy for alternative feature types.
Fig. 6 Average false positive rates for alternative feature types.
As an example that has accuracy improvements but also trade-offs, Charlie's behavior responds differently to the choice of feature type. To demonstrate the improvements in accuracy rate, refer to Figure 9. Charlie's initial 31% accuracy with simple features jumps to 87% when the hour-of-day feature type is used. This is again likely because Charlie's activities occur at times of day that do not overlap as much with Abe's or John's. The cost in this example is that Charlie's rate of false positives goes up from 3% to 6%, as shown in Figure 10. This type of trade-off needs to be taken into account when performing feature selection for any classification task.

Choosing the best feature type involves balancing classification accuracy against the false positive rate. A visual representation of this balancing task is shown in Figure 11. By choosing the time-of-day feature, the benefits to the accuracy rate will probably outweigh the increase in false positive rate.
Fig. 7 Classification accuracy for John class across feature types.
Fig. 8 False positive rates for John class across feature types.
In this case, a 2.5x increase in accuracy balances against a 2x increase in false positives. Unless the final smart environment use of the classification is highly dependent on the certainty of the predictions, it should be simple to determine algorithmically which feature type is most advantageous. For this data set, the other features, including day-of-week, part-of-day and part-of-week, offer little improvement over the simple feature strategy. With a correctness of over 93% and a false positive rate below 7%, a prediction engine relying on this classifier can have a high degree of confidence that it is correctly choosing the proper preferences for a given individual.
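One simple way to automate that choice is to score each feature type by its accuracy penalized by its false positive rate, as in the sketch below; the scoring form, the weight, and the day-of-week numbers are illustrative assumptions (the Simple and Hour of Day figures echo the Charlie example discussed above).

```python
def pick_feature_type(results, fp_weight=1.0):
    """Pick the feature type that best balances accuracy against false positives.

    results: {feature_type: (accuracy, false_positive_rate)}, both in [0, 1].
    fp_weight expresses how strongly the application penalizes false positives.
    """
    def score(item):
        accuracy, fp_rate = item[1]
        return accuracy - fp_weight * fp_rate
    return max(results.items(), key=score)[0]

# Example with partly made-up numbers in the spirit of Figures 5 and 6:
results = {
    "Simple":      (0.31, 0.03),
    "Hour of Day": (0.87, 0.06),
    "Day of Week": (0.35, 0.04),
}
best = pick_feature_type(results, fp_weight=2.0)  # -> "Hour of Day"
```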
5.1 Time Delta Enhanced Classification

Adding more features to our data set did improve the resident classification accuracy. However, the results were still not as good as we anticipated.
Fig. 9 Classification accuracy for Charlie class across feature types.
Fig. 10 False positive rates for Charlie class across feature types.
We hypothesize that one reason for the remaining inaccuracies is the type of sensor events we are classifying. Many motion sensor events occur when individuals are moving through the space to get to a destination, and do not differentiate well between residents in the space. On the other hand, when a resident is in a single location for a significant amount of time, that location is a type of destination for the resident. They are likely performing an activity of interest in that location, and as a result the corresponding motion sensor data should be used for resident classification.

To validate our hypothesis, the data set was culled of all extra sensor events where the same sensor generated multiple readings in a row, and only the first event in the series was kept. The multiple readings were likely due to small movements occurring repeatedly within one small area of the lab. Replacing the set of readings with one representative motion sensor event allowed that event to represent the entire activity taking place at that location.

With this reduced set of events, the time deltas, or time elapsed between the remaining events, were calculated. The chart shown in Figure 12 gives a count of how long an individual spent at any one motion sensor location before moving to a new location.
Fig. 11 Overall classification rates for Charlie class across feature types.
The average time spent on any sensor was 35 seconds, with a standard deviation of 10 seconds. With a graph of this shape, we have evidence supporting the initial hypothesis that additional information for training can be garnered by focusing on locations with significant duration.
Fig. 12 The number of motion sensor events that spanned a given number of seconds.
Next, we removed from our data set any motion sensor events whose durations, or time elapsed between events, fell below two standard deviations above the mean, leaving only the longest deltas. With this even more reduced set in hand, the data splitting, training and testing were all done the same way as before with the full data set. The resulting classifier only used a handful of the available sensors throughout the living space, but the accuracy and false positive rates improved dramatically. This is attributed to the fact that motion sensors in shared spaces or walkways will mostly have very small time deltas associated with them. Since these sensors are also the ones with the highest false positive rates in the full-set classifier, removing these sensor events improves the overall performance of the classifier. Note that with this filtered-data approach, sensor events with short durations will not be assigned a mapping to a specific resident.
Fig. 13 Delta-filtered classification accuracy results.
However, by combining this tool with one that tracks residents through the space [7], only a handful of sensor events need to be classified, as long as they are classified with high accuracy.
Fig. 14 Delta-filtered classification false positive results.
This new classifier saw correct classification rates over 98% with false positives as low as 1%. Again, there was some difference in performance with different feature choices, as shown in Figures 13 and 14. Once again, the hour-of-day feature performed the best, as it seems to give the naive Bayes classifier information that could be used to differentiate between resident behaviors within this data set.
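The filtering described in this subsection can be sketched as follows; the representation of events as (serial, timestamp) pairs and the exact form of the two-standard-deviation threshold are illustrative assumptions.

```python
from statistics import mean, stdev

def collapse_repeats(events):
    """events: list of (serial, timestamp_seconds) pairs in time order.
    Keep only the first of consecutive readings from the same sensor."""
    collapsed = []
    for serial, ts in events:
        if not collapsed or serial != collapsed[-1][0]:
            collapsed.append((serial, ts))
    return collapsed

def long_dwell_events(events, num_std=2.0):
    """Keep events whose dwell time (time until the next distinct event) is
    unusually long; the threshold rule is one reading of the chapter's cut."""
    deltas = [t2 - t1 for (_, t1), (_, t2) in zip(events, events[1:])]
    if len(deltas) < 2:
        return []
    threshold = mean(deltas) + num_std * stdev(deltas)
    return [ev for ev, d in zip(events, deltas) if d >= threshold]
```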
5.2 Markov Model Classification Accuracy

In our Markov model approach to labeling sensor events, we need to determine the number of sensor events that we will include in the sequence input to each Markov model.
Fig. 15 Markov model classification accuracy for all three residents as a function of trace size.
There is no magic formula that specifies the ideal size for such an event trace. As a result, we computed the accuracy of the Markov model classifier for a variety of event trace sizes. The results are shown in Figure 15. As the figure shows, the classification performance peaks at a trace size of 25. None of the accuracies outperforms the naive Bayes classifier for this task. We note that finding the ideal trace size needs to be automated for future applications of the Markov model. One possible approach is to build a model on a sample of the data and empirically determine the trace size that yields the best result for a given classification task.

The results from these experiments indicate that resident behavior histories can be insightful mechanisms for identifying the resident that triggered a sensor event, even when multiple residents are present simultaneously in an environment. We did find that some sensor events (e.g., those with long durations) are easier to characterize than others, because they are more closely associated with an activity that the resident may be performing. We also observed that the more complex Markov model did not outperform the naive Bayes classifier. This indicates that viewing the classification task as a bag-of-sensor-events model is effective for identifying residents who are currently in the smart environment and who are triggering sensor events.

Once it is able to learn profiles of resident activities, even in multi-resident settings, a smart environment will still face the challenge of trying to meet the needs of all the residents. Because these needs and preferences may diverge or even compete, there is not necessarily a straightforward solution to such decisions. Lin and Fu [11] learn a model of users' preferences that represents the relationships among users and the dependency between services and sensor observations. By explicitly learning relationships between these entities, their algorithm can automatically select the service that is appropriate for a given resident and context. Because their approach infers the best service at the right time and place, the comfort of residents can be maintained, even in a multi-resident setting.
The competition for resources can be observed even at a software component level. In the work of Colley and Stacey [3], the need for software agents to communicate is recognized. Communication in a complex setting such as a smart environment can become quite costly when the amount of gathered data and the number of provided services increase. In some circumstances it is advantageous for software agents to form clusters, or associate themselves with other agents with whom they will communicate often. Colley and Stacey investigate alternative voting schemes to design optimal clusters for this type of agent organization that meet an optimal number of needs and preference requests.

While the previous approaches have demonstrated success in negotiating the needs of multiple agents in smart environments, there are many situations in which not all of the stated needs or preferences can be met. In response, the smart environment could prioritize the needs or the residents themselves. An alternative approach, put forward by Roy et al. [21], focuses not on meeting all of the needs of all of the agents, but instead on striking a balance between multiple diverse preferences. The goal of this approach is to achieve a Nash equilibrium state so that predictions and actions achieve sufficient accuracy in spite of possible correlations or conflicts. As demonstrated in this work, the result can be effective for goals such as minimizing energy usage in a smart environment.
6 Conclusions

In this chapter, we investigated a multi-agent issue that pervades smart environment research: attributing sensor events to individuals in a multi-resident setting. To address the challenge, we introduced an approach that uses a naive Bayes classifier to map sensor events to residents. We felt that a classifier of this type would be able to accurately and quickly differentiate between residents in a smart home environment. Using a real-world testbed with real-world activity, a classifier was built and tested for accuracy.

The results shown in this work are encouraging. With simple, raw smart home sensor data, the classifier showed an average accuracy of over 90% for some feature selections. After applying some filtering to the data set to exaggerate the behavior of the residents, accuracy rates over 95% and false positive rates under 2% were possible. Choosing the best time-based features can strongly influence the performance of any temporally-dependent environment, and this work is no exception. Whether the final application needs a very high level of certainty for one or more of the residents, or can trade that certainty off for higher accuracy across all individuals, is up to the needs of the final smart home application. Fortunately, developing an algorithmic way of determining the proper features to use is easily done.

This method of assigning sensor events to residents also adds valuable information to tracking systems that identify people based on an observed series of events by taking into account the likelihoods at every stage of a person's behavior. Continued work combining simple tools like these is leading to very strong identification
strategies. To continue to grow the capabilities of these kinds of classifiers, a number of things can help. Additional data with more individuals will show how robust a solution this is. Differentiating between two people with very similar schedules might be very difficult for this kind of tool. Comparing this tool as a baseline solution with Hidden Markov Model or Bayesian network based solutions will allow the continued research to show how much contextual information assists with the classification of individuals.

A next step for this research is to couple the classifier with a larger preference and decision engine. Adding this tool to a passive tracking solution will give significantly more information to any individual's history for future learning and prediction systems that are deployed in the CASAS testbed. Comparing it to a system without this kind of identification process, or one based on device tracking, will be a significant step for smart home research.

Acknowledgements This work is supported in part by NSF grant IIS-0121297.
References

[1] P. Bahl and V. Padmanabhan. Radar: An in-building RF-based user location and tracking system. In Proceedings of IEEE Infocom, pages 775–784, 2000.
[2] O. Brdiczka, P. Reignier, and J. Crowley. Detecting individual activities from video in a smart home. In Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pages 363–370, 2007.
[3] M. Colley and P. Stacey. Agent association group negotiation within intelligent environments. In Proceedings of the International Conference on Intelligent Environments, 2008.
[4] D. Cook and S. K. Das. How smart are our environments? An updated look at the state of the art. Journal of Pervasive and Mobile Computing, 3(2):53–73, 2007.
[5] M. Diehl, M. Marsiske, A. Horgas, A. Rosenberg, J. Saczynski, and S. Willis. The revised observed activities of daily living: A performance-based assessment of everyday problem solving in older adults. Journal of Applied Gerontology, 24(3):211–230, 2005.
[6] W. Feng, J. Walpole, W. Feng, and C. Pu. Moving towards massively scalable video-based sensor networks. In Proceedings of the Workshop on New Visions for Large-Scale Networks: Research and Applications, 2001.
[7] J. Hightower and G. Borriello. Location systems for ubiquitous computing. IEEE Computer, 32(8):57–66, 2001.
[8] S. S. Intille. Designing a home of the future. IEEE Pervasive Computing, 1:80–86, 2002.
[9] V. Jakkula, A. Crandall, and D. J. Cook. Knowledge discovery in entity based smart environment resident data using temporal relations based data mining. In Proceedings of the ICDM Workshop on Spatial and Spatio-Temporal Data Mining, 2007.
[10] J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, and S. Shafer. Multi-camera multi-person tracking for EasyLiving. In Proceedings of the Third IEEE International Workshop on Visual Surveillance, pages 3–10, 2000.
[11] Z. Lin and L. Fu. Multi-user preference model and service provision in a smart home environment. In Proceedings of the IEEE International Conference on Automation Science and Engineering, pages 759–764, 2007.
[12] C. Lu, Y. Ho, and L. Fu. Creating robust activity maps using wireless sensor network in a smart home. In Proceedings of the Third Annual Conference on Automation Science and Engineering, 2007.
[13] U. Maurer, A. Smailagic, D. Siewiorek, and M. Deisher. Activity recognition and monitoring using multiple sensors on different body positions. In Proceedings of the International Workshop on Wearable and Implantable Body Sensor Networks, 2006.
[14] S. Moncrieff. Multi-modal emotive computing in a smart house environment. Journal of Pervasive and Mobile Computing, special issue on Design and Use of Smart Environments (to appear), 2007.
[15] R. J. Orr and G. D. Abowd. The smart floor: A mechanism for natural user identification and tracking. In Proceedings of the ACM Conference on Human Factors in Computing Systems, The Hague, Netherlands, 2000.
[16] M. B. Patterson and J. L. Mack. The Cleveland scale for activities of daily living (CSADL): Its reliability and validity. Journal of Clinical Gerontology, 7:15–28, 2001.
[17] M. Philipose, K. P. Fishkin, M. Perkowitz, D. J. Patterson, D. Hahnel, D. Fox, and H. Kautz. Inferring activities from interactions with objects. IEEE Pervasive Computing, 3(4):50–57, 2004.
[18] N. Priyantha, A. Chakraborty, and H. Balakrishnan. The Cricket location support system. In Proceedings of the International Conference on Mobile Computing and Networking, pages 32–43, 2000.
[19] P. Rashidi and D. Cook. Adapting to resident preferences in smart environments. In Proceedings of the AAAI Workshop on Advances in Preference Handling, 2008.
[20] B. Reisberg, S. Finkel, J. Overall, N. Schmidt-Gollas, S. Kanowski, H. Lehfeld, F. Hulla, S. G. Sclan, H.-U. Wilms, K. Heininger, I. Hindmarch, M. Stemmler, L. Poon, A. Kluger, C. Cooler, M. Bergener, L. Hugonot-Diener, P. H. Robert, and H. Erzigkeit. The Alzheimer's disease activities of daily living international scale. International Psychogeriatrics, 13:163–181, 2001.
[21] N. Roy, A. Roy, K. Basu, and S. K. Das. A cooperative learning framework for mobility-aware resource management in multi-inhabitant smart homes. In Proceedings of the IEEE Conference on Mobile and Ubiquitous Systems: Networking and Services (MobiQuitous), pages 393–403, 2005.
[22] D. Sanchez, M. Tentori, and J. Favela. Activity recognition for the smart hospital. IEEE Intelligent Systems, 23(2):50–57, 2008.
[23] L. Snidaro, C. Micheloni, and C. Chivedale. Video security for ambient intelligence. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 35:133–144, 2005.
[24] T. van Kasteren and B. Krose. Bayesian activity recognition in residence for elders. In Proceedings of the International Conference on Intelligent Environments, 2008.
[25] C. Wren and E. Munguia Tapia. Toward scalable activity recognition for sensor networks. In Proceedings of the Workshop on Location and Context-Awareness, 2006.
[26] J. Yin, Q. Yang, and D. Shen. Activity recognition via user-trace segmentation. ACM Transactions on Sensor Networks, 2008.
Mobile Agents
Ichiro Satoh
1 Introduction

Mobile agents are autonomous programs that can travel from computer to computer in a network, at times and to places of their own choosing. The state of the running program is saved and transmitted to the destination, where the program is resumed, continuing its processing with the saved state. Mobile agents can provide a convenient, efficient, and robust framework for implementing distributed applications and smart environments for several reasons, including improvements to the latency and bandwidth of client-server applications and reduced vulnerability to network disconnection. In fact, mobile agents have several advantages in the development of various services in smart environments in addition to distributed applications.

• Reduced communication costs: Distributed computing needs interactions between different computers through a network. The latency and network traffic of interactions often seriously affect the quality and coordination of two programs running on different computers. As we can see from Figure 1, if one of the programs is a mobile agent, it can migrate to the computer the other is running on and communicate with it locally. That is, mobile agent technology enables remote communications to operate as local communications.
• Asynchronous execution: After migrating to the destination-side computer, a mobile agent does not have to interact with its source-side computer. Therefore, even if the source is shut down or the network between the destination and source is disconnected, the agent can continue processing at the destination. This is useful with unstable communications, including wireless communication, in smart environments.
Ichiro Satoh, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan, e-mail: [email protected]
• Direct manipulation: A mobile agent is locally executed on the computer it is visiting. It can directly access and control the equipment of that computer as long as the computer allows it to do so. This is helpful in network management, in particular in detecting and removing device failures. Installing a mobile agent close to a real-time system may prevent delays caused by network congestion.
• Dynamic deployment of software: Mobile agents are useful as a mechanism for the deployment of software, because they can decide their destinations and their code and data can be dynamically deployed there, only while they are needed. This is useful in smart environments, because these consist of computers whose computational resources are limited.
• Easy development of distributed applications: Most distributed applications consist of at least two programs, i.e., a client-side program and a server-side program, and often spare code for communications, including exception handling. However, since a mobile agent itself can carry its information to another computer, we can write only a single program to define the distributed computation. A mobile agent program does not have to define communications with other computers. Therefore, we can easily modify standalone programs into mobile agent programs.

As we can see from Figure 2, mobile agents can save themselves to persistent storage, duplicate themselves, and migrate themselves to other computers under their own control, so that they can support various types of processing in distributed systems.
Fig. 1 Reduced communication
Although not all applications for distributed systems will need mobile agents, there are many other applications that will find mobile agents the most effective technique for implementing all or part of their tasks. Mobile agent technology can be treated as a type of software agent technology, but it is not always required to offer intelligent capabilities, e.g., reactive, pro-active, and social behaviors that are
Fig. 2 Functions of mobile agents in distributed system
features of existing software agent technologies. This is because these capabilities tend to be large in terms of scale and processing, and no mobile agent should consume excessive computational resources, such as processors, memory, files, and networks, at its destinations. Also, the technology is just an implementation approach for distributed systems rather than intelligent systems.
1.1 Mobility and Distribution

Fuggetta, et al. [4] provided a description of mobile software paradigms for distributed applications. These are classified as client/server (CS), remote evaluation (REV), code on demand (COD), and mobile agent (MA) approaches. By decomposing distributed applications into code, data, and execution, most distributed executions can be modeled as primitives of these approaches, as we can see from Figure 3.

• The client-server approach is widely used in traditional and modern distributed systems (Figure 3 a)). The code, data, and execution remain fixed at computer B. Computer A requests a service from the server with some data arguments of the request. The code and remaining data to provide the service are resident within computer B. As a response, computer B provides the service requested by accessing the computational resources provided in it. Computer B returns the results of the execution to computer A.
• The remote evaluation approach assumes that the code to perform the execution is stored at computer A (Figure 3 b)). Both the code and data are sent to computer B. As a response, computer B executes the code and data by accessing the computational resources, including data, provided in it. An additional interaction returns the results from computer B to computer A.
• The code-on-demand approach is an inversion of the remote evaluation approach (Figure 3 c)). The code and data are stored at computer B and execution is done at computer A. Computer A fetches the code and data from computer B and then executes the code with its local data as well as the imported data. An example of this is Java applets, which are Java codes that web browsers download from remote HTTP servers to execute locally.
Fig. 3 Client/server, remote evaluation, code on demand, and mobile agent
• The mobile agent approach assumes that the code and data are initially hosted by computer A (Figure 3 d)). Computer A migrates the data and code it needs to computer B. After the agent has moved to computer B, the code is executed with the data and the resources available on computer B.
2 Mobile Agent Platform

Mobile agent platforms consist of two parts: mobile agents and runtime systems. The former defines the behavior of software agents. The latter are called agent platforms, agent systems, or agent servers, and support their execution and migration. The same architecture exists on all computers at which agents are reachable. That is, each mobile agent runs within a runtime system on its current computer. When an agent requests the current runtime system to migrate it, the runtime system can migrate the agent to a runtime system on the destination computer, carrying its state and code with it. Each runtime system itself runs on top of the operating system as middleware. It provides interpreters or virtual machines for executing agent programs, or the systems themselves are provided on top of virtual machines, e.g., the Java virtual machine (JVM).
2.1 Remote Procedure Call

Agent migration is similar to RPC (Remote Procedure Call) or RMI (Remote Method Invocation). RPC enables a client program to call a procedure in server programs running in separate processes, generally on different computers from the client [2]. RMI is an extension of local method invocation that allows an object to invoke the methods of an object on a remote computer. RPC or RMI can pass arguments to a procedure or method of a program on the server and receive a return value from the server. The mechanism for passing arguments and results between two computers through RPC or RMI corresponds to that for agent migration between two computers. Figure 4 shows the flow of the basic RPC mechanism between two computers.
Fig. 4 Remote procedure call between two computers
2.1.1 Agent Marshaling

Data items, e.g., objects and values, in a running program cannot be directly transmitted over a network. They must be transformed into an external data representation, e.g., a binary or text form, before being migrated (Figure 5). Marshaling is the process of collecting data items and assembling them into a form suitable for transmission in a message. Unmarshaling is the process of disassembling them on arrival to produce an equivalent collection of data items at the destination.1 The marshaling and unmarshaling processes are carried out by runtime systems in mobile agent systems. The runtime system at the left (sender-side computer) of Figure 6 marshals an agent to transmit it to a destination through a communication channel or message, and then the runtime system at the right (receiver-side computer) of Figure 6 receives the data and unmarshals the agent.
1 Note that marshaling and serialization are often used without any distinction between them. The latter is a process of flattening and converting an object, including the objects it refers to, into a sequence of bytes to be sent across a network or saved on a disk.
Fig. 5 Marshaling agent
Fig. 6 Agent migration between two computers
2.1.2 Agent Migration

Figure 6 shows the basic mechanism for agent migration between two computers.

Step 1. The runtime system on the sender-side computer suspends the execution of the agent.
Step 2. It marshals the agent into a bit-chunk that can be transmitted over a network.
Step 3. It transmits the chunk to the destination computer through the underlying network protocol.
Step 4. The runtime system on the receiver-side computer receives the chunk.
Step 5. It unmarshals the chunk into the agent and resumes the agent.

Most existing mobile agent systems use TCP channels, SMTP, or HTTP as their underlying communication protocols. Mobile agents themselves are separated from the underlying communication protocols.
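Most of the platforms discussed here are Java-based, but the five steps can be sketched in a language-neutral way. The following Python outline uses pickle over a TCP socket with a hypothetical agent class; as with real platforms (see the code-deployment discussion below), it assumes the agent's class code is already available at both runtimes.

```python
import pickle
import socket

class CounterAgent:
    """A hypothetical agent in the weak-migration style: only instance state travels."""
    def __init__(self):
        self.count = 0

    def resume(self):
        # Invoked by the receiving runtime after unmarshaling (Step 5).
        self.count += 1
        print(f"resumed; visited {self.count} computer(s)")

def send_agent(agent, host, port):
    # Steps 1-3: suspend (implicit here), marshal, transmit.
    chunk = pickle.dumps(agent)
    with socket.create_connection((host, port)) as sock:
        sock.sendall(chunk)

def receive_agent(port):
    # Steps 4-5: receive the chunk, unmarshal it, and resume the agent.
    with socket.create_server(("", port)) as server:
        conn, _ = server.accept()
        with conn:
            chunk = b"".join(iter(lambda: conn.recv(4096), b""))
    agent = pickle.loads(chunk)  # requires the agent's class to be importable here
    agent.resume()
    return agent
```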
2.1.3 Strong Migration vs. Weak Migration

The state of execution is migrated with the code so that computation can be resumed at the destination. According to the amount of detail captured in the state, we can classify agent migration into two types: strong and weak.

• Strong migration is the ability of an agent to migrate over a network, carrying the code and execution state, where the state includes the program counter,
saved processor registers, local variables, which correspond to variables allocated in the stack frame of the agent's memory space, and global variables, which correspond to variables allocated in the heap frame. The agent is suspended, marshaled, transmitted, unmarshaled, and then restarted at the exact position where it was previously suspended on the destination node, without loss of data or execution state.
• Weak migration is the ability of an agent to migrate over a network, carrying the code and partial execution state, where the state is the variables in the heap frame, e.g., instance variables in object-oriented programs, instead of its program counter and local variables declared in methods or functions. The agent is moved to and restarted on the destination with its global variables. The runtime system may explicitly invoke specified agent methods.

Strong migration subsumes weak migration, but only a minority of systems support it. This is because the execution state of an agent tends to be large, and marshaling and transmitting that state over a network require heavy processing. Moreover, like weak migration, strong migration cannot migrate agents that access computational resources available only on their current computers, e.g., input-and-output equipment and networks. Strong migration unfortunately has no significant advantages in the development and operation of real distributed applications, as discussed by Strasser et al. [22].

The program code for an agent needs to be available at the destination where the agent is running. The code must be deployed at the source at the time of creation and at the destination to which it moves. Therefore, existing runtime systems offer a facility for statically deploying the program code that is needed to execute the agent, for loading the program code on demand, or for transferring the program code along with the agent.
2.2 Mobile Agent Languages

Since mobile agents are programming entities, programming languages for defining mobile agents are needed. There is a huge number of programming languages, but not all of them are suitable for mobile agents. Programming languages for mobile agents must support the following functions: they should enable programs to be marshaled into data and vice versa, and they should be able to download code from remote computers and link it at run-time. A few researchers have provided newly designed languages for defining mobile agents, e.g., Telescript [24], but most current mobile agent systems use existing general-purpose programming languages that can satisfy the above requirements, e.g., Java [1]. Telescript provides primitives for defining mobile agents, e.g., the go operation, and enables a thread running on an interpreter to migrate to another computer. The Java language itself offers no support for the migration of executing code, but offers dynamic class loading, a programmable class loader, and a language-level marshaling mechanism, which can be directly exploited to enable code mobility. Creating distributed systems based on mobile agents is a relatively easy paradigm because most existing mobile
agents are object-oriented programs, e.g., in Java, and can be developed by using rapid application development (RAD) environments. Distributed systems are characterized by heterogeneity in hardware architectures and operating systems. To cope with this heterogeneity, the state and code of an agent need to be saved in a platform-independent representation. Differences between platforms are hidden at the language level, by using an intermediate byte code representation in Java or by relying on scripting languages, such as Python and Ruby. Therefore, Java-based mobile agents are executed on Java virtual machines. The costs of running agents in a Java virtual machine on a device are decreasing thanks to just-in-time compiler technologies.
2.3 Agent Execution Management

The runtime system manages the execution and monitoring of all agents on a computer. It allows several hundred agents to be present at any one time on a computer. It also provides these agents with an execution environment and executes them independently of one another. It manages the life-cycle of its agents, e.g., creation, termination, and migration. Each agent program can access basic functions provided by its runtime system by invoking APIs (Table 1). The agent uses the go command to migrate from one computer to another with the destination system address (and its target agent's identifier) and does not need to concern itself with any other details of migration. Instead, the runtime system supports the migration of the agent. It stops the agent's execution and then marshals and transmits the agent's data items to the destination via the underlying communication protocol, e.g., a TCP channel, HTTP (hypertext transfer protocol), or SMTP (simple mail transfer protocol). The agent is unpacked and reconstituted on the destination.

Table 1 Functions available in agents
command      parameters                              function
go           destination address, agent-identifier   agent migration
terminate    agent-identifier                        agent termination
duplicate    agent-identifier                        agent duplication
identify     agent-type                              identification
lookup       agent-type, runtime system address      discovery of available agents
communicate  agent-identifier                        inter-agent communication
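As an illustration of how such commands might be exposed to an agent programmer, the following sketch wraps some of them in a hypothetical base class with a stub runtime; neither the API nor the host names correspond to any particular platform.

```python
class StubRuntime:
    """A stand-in runtime that just logs what a real platform would do."""
    def migrate(self, agent, destination, agent_id=None):
        print(f"would marshal {type(agent).__name__} and send it to {destination}")
    def duplicate(self, agent):
        return type(agent)(self)
    def send(self, agent_id, message):
        print(f"would deliver {message!r} to agent {agent_id}")

class Agent:
    """Hypothetical base class exposing a few of the Table 1 commands."""
    def __init__(self, runtime):
        self.runtime = runtime
    def go(self, destination, agent_id=None):
        self.runtime.migrate(self, destination, agent_id)
    def duplicate(self):
        return self.runtime.duplicate(self)
    def communicate(self, agent_id, message):
        return self.runtime.send(agent_id, message)

class MonitoringAgent(Agent):
    def run(self):
        # Visit two hosts in turn, then report back; the addresses are placeholders.
        for host in ("host-a.example.org", "host-b.example.org"):
            self.go(host)
        self.communicate("manager-agent", "tour finished")

MonitoringAgent(StubRuntime()).run()
```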
2.4 Inter-agent Communication

Mobile agents can interact with other agents residing within the same computer or with agents on remote computers, as in other multi-agent systems. Existing mobile agent systems provide various inter-agent communication mechanisms, e.g., method invocation, publish/subscribe-based event passing, and stream-based communications.
2.5 Locating Mobile Agents

Since mobile agents can autonomously travel from computer to computer, a mechanism for tracking the location of agents is needed so that users can control their agents and agents can communicate with other agents. Several mobile agent systems provide such mechanisms, which can be classified into three schemes:

• A name server multicasts query messages about the location of an agent to the computers and receives a reply message from the computer hosting the agent (Figure 7 (a)).
• An agent registers its current location at a predefined name server whenever it arrives at another computer (Figure 7 (b)).
• An agent leaves a footprint specifying its destination at its current computer whenever it migrates to another computer, so that the agent's trail can be tracked (Figure 7 (c)).

In many cases, locating agents is application specific. For example, the first scheme is suitable for an agent moving within a local region. It is not suitable for agents visiting distant nodes. The second scheme is suitable for an agent migrating within a far-away region; in the case of a large number of nodes, registering nodes are organized hierarchically. However, it is not suitable for a large number of migrations. The third scheme is suitable for a small number of migrations; it is not appropriate for long chains.
2.6 Security

Security is one of the most important issues with mobile agent systems. Most security issues in mobile agents are common to existing computer security problems in communication and the downloading of software. In addition, many researchers have explored mechanisms to enhance the security of mobile agent systems. There are two problems in mobile agent security: the protection of hosts from malicious mobile agents and the protection of mobile agents from malicious hosts. It is difficult to verify with complete certainty whether an incoming agent is malicious or not. However, there are two solutions for protecting hosts from malicious mobile agents. The first is to provide access-control mechanisms, e.g., Java's security manager.
Fig. 7 Discovery for migrating agents
They explicitly specify the permissions of agents and restrict any agent behaviors that are beyond their permissions. The second is to provide authentication mechanisms by using digital signatures or authentication systems. They explicitly permit runtime systems to only receive agents that have been authenticated, have been sent from authenticated computers, or have originated from authenticated computers. There have been no general solutions to the second problem, because it is impossible to keep an agent private from the runtime systems executing it. However, (non-malicious) runtime systems can authenticate the destinations of their agents, to check whether these are non-malicious, before they migrate the agents to those destinations. While strong security features would not immediately make mobile agents appealing, the absence of security would certainly make mobile agents unattractive and impractical.
2.7 Remarks

Several technologies have been presented for enabling software to migrate between computers, e.g., mobile code, process migration, and mobile objects. Mobile agents differ from mobile code, e.g., downloadable applets, in that mobile code cannot maintain the state of a running program. As a result, downloaded code must start from its initial state after it has been deployed at a remote computer. One of the most important differences between mobile agents and traditional techniques, e.g., process migration or mobile objects, is in their acceptable levels of mobility-transparency. Introducing too much transparency can adversely affect other characteristics, such as complexity, or the scope of modifications made to the underlying environment. For example, a solution allowing the migration of processes or objects at any time in response to a request from any other object would require significant changes to the underlying environment, e.g., for balancing the processor load and escaping from a computer that is shutting down, whereas mobile agents can move where and when they choose, typically through a go statement. Similarly, solutions that insist on continuous communication and name resolution could be achieved for naming and open channel handling, but they would incur significant complexity in communication support and the naming model. Process-migration and mobile object technologies require fully transparent solutions at the operating system level to minimize complexity. For example, processes and objects still continue to access the computational resources, e.g., file systems, database systems, and channels, that they accessed at their source-side computers, even after they have moved. With a reasonable choice of transparency requirements, mobile agents can access the computational resources provided by the current computer after they have moved to their destinations. Although mobile agents are similar to mobile objects at the programming level, they contain threads and are therefore active and can act autonomously, whereas most mobile objects are implemented as passive entities.
3 Mobile Agent Applications

Many researchers have stated that there are no killer applications for mobile agent technology [7], because almost everything you can do with mobile agents can be done with more traditional technologies. However, mobile agents make it easier, faster, and more effective to develop, manage, and execute distributed applications than other technologies. We describe typical applications of mobile agents as follows.
3.1 Remote Information Retrieval

This is one of the most traditional applications of mobile agents. If all information were stored in relational databases, a client could send a message containing
SQL commands to database servers. However, given that most of the world's data is in fact maintained in free text files on different computers, remote searching and filtering require the ability to open, read, and filter files. Since mobile agents can perform most of their tasks locally at the destination, the client can send its agents to database servers so that they locally perform a sequence of query or update tasks on the servers. Communication between the client and server can be minimized, i.e., reduced to the migration of a search agent to the server and the migration of an agent back to the client. Since agents contain program code for filtering the information that is of interest to their users from the databases, they only need to carry the wanted information back to the client, reducing communication traffic. Furthermore, agents can migrate among multiple database servers to retrieve and gather interesting data from the servers. They can also determine their destinations based on information they have acquired from the database servers that they have visited so far.
3.2 Network Management

Mobile agent technology provides a solution to the flexible management of network systems. Mobile agents can locally observe and control equipment at each node by migrating among nodes. Mobile agent-based network management has several advantages in comparison with traditional approaches, such as the client/server one.

• As code is very often smaller than the data it processes, the transmission of mobile agents to sources of data creates less traffic than transferring the data itself. Deploying a mobile agent close to the network nodes that we want to monitor and control prevents delays caused by network congestion.
• Since a mobile agent is locally executed on the node it is visiting, it can easily access the functions of devices on this node.
• The dynamic deployment and configuration of new or existing functionalities into a network system are extremely important tasks, especially as they potentially allow outdated systems to be updated in an efficient manner.
• Network management systems must often handle networks that may have various malfunctions and disconnections and whose exact topology may not be known. Since mobile agents are autonomous entities, they may be able to detect proper destinations or routings on such networks.

Adopting mobile agent technology eliminates the need for administrators to constantly monitor many network management activities, e.g., the installation and upgrading of software and periodic network auditing. There have been several attempts to apply this technology to network management tasks. Karmouch presented typical mobile agent approaches to network management [8]. Satoh proposed a framework for building and operating agent itineraries for network management systems [14, 17] and constructed domain-specific languages for describing agent migration for network management [19].
3.3 Cloud Computing

Load balancing is a classic application of process migration and mobile agent technologies. In a distributed system, e.g., a grid or cloud computing system, computers tend to be numerous and their computational loads differ. Computers may also be dynamically added to or removed from the system. Tasks should be dynamically deployed at computers whose loads are light rather than at those with heavy loads. Since mobile agents can migrate to other computers, tasks that are implemented as mobile agents can be relocated to suitable computers whose processors can execute them. This is practical for implementing massively multi-agent systems that must operate huge numbers of agents, which tend to be dynamically created and terminated, on a distributed system consisting of heterogeneous computers.
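The placement decision itself can be very simple. The following minimal sketch (illustrative only; load reporting and migration are assumed to be provided by the runtime systems) picks the least-loaded host as the destination for an agent-based task.

def pick_destination(hosts):
    # hosts: dict mapping a host name to its reported load average.
    return min(hosts, key=hosts.get)

# Example: pick_destination({"node-a": 0.9, "node-b": 0.2, "node-c": 0.5})
# returns "node-b", so the agent carrying the task would migrate there.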
3.4 Mobile Computing

Mobile agents use the capabilities and resources of remote servers to process their tasks. When a user wants to perform tasks beyond the capabilities of his or her own computers, the agents that perform the tasks can migrate to and be executed at a remote server. Mobile agents can also mask temporary disconnections in networks. Mobile computers are not always connected to networks, because their wired networks are disconnected before they are moved to other locations, or because wireless networks become unstable or unavailable due to deteriorating radio conditions, or because the area is not covered at all. A stable connection is required only at the beginning, to send the agent, and at the end of the task, to take the agent back; it is not required during the execution of the application itself. Several researchers have explored mechanisms for migrating agents through unstable networks [3, 6, 16]. When a mobile agent requests a runtime system to migrate itself, the system tries to transmit the moving agent to the destination. If the destination cannot be reached, the system automatically stores the moving agent in a queue and then periodically tries to transmit the waiting agent either to the destination or to another runtime system on a reachable intermediate node as close to the destination as possible. These relay runtime systems repeat the process until the agent arrives at its destination.
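A store-and-forward migration loop in this spirit might look as follows. This is a sketch only: send, reachable, and the relay list are assumptions standing in for the mechanisms described in the cited work.

import time

def migrate_with_retry(agent, destination, relays, reachable, send,
                       retry_interval=5.0, max_attempts=10):
    # 'relays' lists intermediate runtime systems, ordered from the one
    # closest to the destination to the one farthest from it.
    for _ in range(max_attempts):
        if reachable(destination):
            send(agent, destination)       # deliver to the final destination
            return True
        for relay in relays:               # otherwise push the agent as close
            if reachable(relay):           # to the destination as possible;
                send(agent, relay)         # the relay repeats this procedure
                return True
        time.sleep(retry_interval)         # keep the agent queued and retry
    return False                           # caller keeps the agent in its queue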
3.5 Software Testing

Mobile agents are useful in the development of software as well as in the operation of software in distributed and mobile computing settings. An example of such an application is a testing methodology for software running on mobile computers, called Flying Emulator [15, 18]. Fourth-generation (4G) networks incorporate wireless LAN technologies, and mobile terminals can access the services provided by the LANs as
well as global network services. Therefore, software running on mobile terminals may depend not only on its application logic but also on services within the LANs to which the terminals are connected. Effective software constructed to run on mobile terminals for 4G wireless networks and wireless LANs must be tested in all the networks to which a terminal could be moved and then connected. Like existing approaches, the Flying Emulator framework provides software-based emulators of mobile terminals for the software designed to run on them. However, it constructs the emulators as mobile agents that can travel between computers. As shown in Fig. 8, these emulators carry the target software to the networks the terminals would be connected to and allow it to access the services provided there, in the same way as if the software had been carried by and executed on a terminal connected to those networks.
Fig. 8 Correlation between the movement of a target mobile computer and the migration of a mobile agent-based emulator
3.6 Active Networking

There are two approaches to implementing active networks (for example, see [23]). The active packet approach replaces destination addresses in the packets of existing
architectures with miniature programs that are interpreted at nodes on arrival. The active node approach enables new protocols to be dynamically deployed at intermediate and end nodes using mobile code techniques. Mobile agents are very similar to active networks, because a mobile agent can be regarded as a specific type of active packet, and an agent platform in traditional networks can be regarded as a specific type of active node. There have been a few attempts to combine mobile agent technology with active network technology (for example, see [8]). Of these, the MobileSpaces system [16] provides a mobile agent-based framework for integrating both approaches. The framework enables network processing for mobile agents to be implemented with mobile agent-based components; since the components are themselves mobile agents, they can be dynamically deployed at computers to customize network processing.
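The active packet idea can be caricatured in a few lines: instead of carrying a fixed destination address, a packet carries a small program that each node interprets on arrival to decide how the packet is processed and forwarded. The classes and attributes below are toy assumptions, not the design of any system cited above.

class ActivePacket:
    def __init__(self, payload, handler):
        self.payload = payload
        self.handler = handler            # code carried inside the packet

    def process_at(self, node):
        # The node hands itself to the packet's own program on arrival.
        return self.handler(node, self.payload)

# Example handler: forward a copy of the payload to every neighbour except
# the one the packet came from (a miniature flooding protocol). The 'node'
# object is assumed to expose 'neighbours' and 'previous_hop'.
def flood(node, payload):
    return [(neighbour, payload) for neighbour in node.neighbours
            if neighbour != node.previous_hop]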
3.7 Active Documents

Mobile code technology is widely used in plug-in modules for rich internet applications (RIAs) in web browsers, e.g., Java applets and Macromedia Flash. Such modules provide interactive user experiences because their virtual machines, e.g., Java virtual machines and Flash players, can locally execute and render them across multiple platforms and browsers without having to communicate with remote servers. However, it is not easy to save their results on local computers or remote servers and to resume them later from those results, since their code can be transported but their state cannot. Mobile agents solve this problem and provide a basis for next-generation RIAs. Mobile agent-based modules for RIAs can naturally carry both their code and their state to client computers. For example, MobiDoc [11] is a mobile agent-based framework for building mobile compound documents, where a compound document can be dynamically composed of mobile agent-based components that view or edit their contents, e.g., text, images, and movies. Such a document can migrate over a network as a whole, with all its embedded components. Each component is self-contained in the sense that it maintains its content and the program code for viewing and modifying that content, and multiple components can be combined into an active and mobile document.
4 Ambient Computing

Ambient computing is one of the most important applications of mobile agents, and mobile agents are useful in building and operating ambient computing environments.
4.1 Mobile Agent-based Middleware for Ambient Computing

Ambient computing environments consist of computers that often have limited resources, such as restricted CPU power and amounts of memory. Mobile agents can help to conserve these limited resources, since each agent only needs to be present at a computer while the computer needs the services provided by that agent. In the future, ambient computing environments will be used as social infrastructures in buildings and cities, supporting massive numbers of users and consisting of numerous devices. The scale and complexity of such large-scale ambient computing environments are beyond our ability to manage with traditional approaches, such as centralized, top-down ones. Mobile agents, by contrast, support a decentralized approach to ambient computing environments.
4.2 Context-aware Mobile Agents

Several research projects have also attempted to migrate agents between computers according to changes in the real world. The SpatialAgent framework [12, 13] provides a bridge between the movement of physical entities, e.g., people and things, and the movement of mobile agents to support and annotate the entities, using location-tracking systems, e.g., RFID technology. It binds physical entities to mobile agents and navigates the agents to stationary or mobile computers near the current locations of the entities and places to which the agents are attached, even after those locations have changed. Figure 9 (a) shows a moving entity carrying an RF-tagged agent host and a space containing a place-bound RF-tag and an RF reader. When the reader detects the presence of the RFID tag that is bound to the agent host, the framework instructs the agents attached to the tagged place to migrate to the visiting agent host to offer location-dependent services for that place. Figure 9 (b) shows a configuration in which an agent host and a stationary RF reader have been installed. When an RF-tagged moving entity enters the coverage area of the reader, the framework instructs the agents attached to the entity to migrate to the agent host within the same coverage area to offer entity-dependent services to the entity. One of the most typical applications of mobile agent technology in ambient computing is the follow-me application [5], which tracks the current location of the user and allows him or her to access applications at the nearest computer while moving around a building. Previous studies of such approaches deployed only the screen images of applications at appropriate computers. Mobile agent technology can migrate not only the user interfaces of applications but also the applications themselves to computers, as shown in Figure 10. To provide context-aware services according to contextual information in the real world, we need a model of the real world, called a world model or location model. Satoh proposed a mobile agent-based location model that also serves as a location-based service discovery system, because it maintains not only the spatial relationships between physical spaces and entities but also
the locations of location-based services and computing devices within the model [20].

Fig. 9 Linkages between physical and logical worlds
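The linkage in Fig. 9 can be pictured as a simple event rule: when a reader detects a tag, the agents bound to that tag are asked to migrate to an agent host within the reader's coverage area. The sketch below illustrates the idea with hypothetical names and data structures; it does not show the SpatialAgent framework's actual API.

def on_tag_detected(tag_id, reader_id, bindings, hosts_in_range, migrate):
    # bindings: tag id -> agent ids bound to that tag (entity or place)
    # hosts_in_range: reader id -> agent hosts covered by that reader
    # migrate: platform primitive that deploys an agent at a given host
    hosts = hosts_in_range.get(reader_id, [])
    if not hosts:
        return
    target = hosts[0]                      # e.g., the nearest host in range
    for agent_id in bindings.get(tag_id, []):
        migrate(agent_id, target)          # the agent follows the tagged entity or place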
Fig. 10 Follow-Me Desktop Applications
4.3 Personal-Assistant Agents

Agents are often used to assist users in the real world as well as in cyberspace. Their assistance should be provided in a personalized form according to their users' requirements. Mobile agents can maintain user preferences and experiences inside them and migrate to nearby computers to interact with their users without communication delays between the users and the agents. Satoh proposed a framework for managing and operating agent-based user guides in museums [21]. Visitors move from exhibit to exhibit in a museum. Therefore, when they move to another exhibit, their agents should be deployed at computing devices close to their destination exhibits. That is, the virtual agents accompany their visitors and annotate exhibits in front of
the visitors in the real world on behalf of museum guides or curators (Figure 11). Agents acting as visitor guides can record and maintain user preferences and experiences because they are active and carry state inside them. They can therefore provide users with assistance according to the users' previous behavior, navigating visitors to the next exhibits based on the exhibits they have already watched as well as on their routes.
Fig. 11 User assistant agent at Museum of Nature and Human Activities in Hyogo.
Each agent is attached to at most one visitor and maintains his or her preference information and the programs that provide annotation and navigation. To enable agents to be easily developed and configured without professional administrators, we divided each agent into three parts. The first is a user-preference part that maintains and records information about the visitor, e.g., his or her knowledge, interests, routes, name, and the durations and times spent at the exhibits visited. The second is an annotation part that defines a task for playing annotations about exhibits or interacting with visitors. The third is a navigation part that defines a task for navigating visitors to their destinations. In the experiment, the third part provides four types of built-in navigation corresponding to typical behaviors of visitors in museums, outlined in Figure 12. The navigation service instructs users to move to at least one specified destination spot. The selection service enables users to explicitly or implicitly select one spot or route from one or more spots or routes close to their current position, by moving to the selected spot or to a spot along the selected route. The termination service informs users that they have arrived at the final destination spot. The warning service informs users that they have missed their destination exhibit or route. As application-specific services could be defined and encapsulated within the agents, we were able to easily change the services provided by modifying the cor-
responding agents while the entire system was running. More than two different visitor-guide services could also be supported simultaneously. Even while visitors were participating, curators with no knowledge of context-aware systems were able to configure the annotative content by drag-and-drop manipulation in the GUI-based configuration system. Such dynamic configuration is useful, because museums need to provide and configure services for visitors continuously.
Fig. 12 User-navigation patterns.
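The three-part decomposition described above can be summarized in a small sketch. The class and field names below are illustrative assumptions, not the classes of the framework itself.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class UserPreferencePart:
    name: str = ""
    interests: List[str] = field(default_factory=list)
    visited: List[str] = field(default_factory=list)   # exhibits seen so far

@dataclass
class GuideAgent:
    preferences: UserPreferencePart
    annotation: Callable    # plays annotations / interacts with the visitor
    navigation: Callable    # navigation, selection, termination, or warning task

    def on_arrival(self, exhibit_id):
        # Called after the agent has migrated to the device near an exhibit.
        self.preferences.visited.append(exhibit_id)
        self.annotation(exhibit_id, self.preferences)
        return self.navigation(exhibit_id, self.preferences)   # next instruction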
5 Conclusion

Mobile agents are an implementation technique for the development and operation of distributed systems, including smart environments, just as other software agents, including multi-agent systems, are not goals in themselves but tools for modeling and managing our societies and systems. Therefore, the future of mobile agents may not be specifically as mobile agents: they will be used as essential technologies in distributed computing or ambient computing, even if they are not called mobile agents. In fact, although monolithic mobile agent systems were developed in the past decade to illustrate the concepts of mobile agents, several recent mobile agent systems have been developed with slightly different semantics for mobile agents. Ambient computing environments are becoming social infrastructures in buildings and cities. Mobile agents are useful for building and operating such large-scale ambient computing environments, which involve massive numbers of users and numerous devices.
References

[1] K. Arnold and J. Gosling, The Java Programming Language, Addison-Wesley, 1998.
[2] A. Birrell and B. Nelson, Implementing Remote Procedure Calls, ACM Transactions on Computer Systems, vol. 2, no. 1, February 1984.
[3] J. Cao, X. Feng, J. Lu, and S. K. Das, Mailbox-Based Scheme for Designing Mobile Agent Communication Protocols, IEEE Computer, vol. 35, no. 9, pp. 54-60, 2002.
[4] A. Fuggetta, G. P. Picco, and G. Vigna, Understanding Code Mobility, IEEE Transactions on Software Engineering, vol. 24, no. 5, May 1998.
[5] A. Harter, A. Hopper, P. Steggeles, A. Ward, and P. Webster, The Anatomy of a Context-Aware Application, Proceedings of Conference on Mobile Computing and Networking (MOBICOM'99), pp. 59-68, ACM Press, 1999.
[6] D. Kotz, R. S. Gray, S. Nog, D. Rus, S. Chawla, and G. Cybenko, Mobile Agents for Mobile Computing, in D. Milojicic, F. Douglis, and R. Wheeler (eds.), Mobility, Mobile Agents and Process Migration, Addison Wesley and ACM Press, 1999.
[7] D. Milojicic, Mobile Agent Applications, IEEE Concurrency, vol. 7, no. 4, pp. 80-90, July-September 1999.
[8] V. A. Pham and A. Karmouch, Mobile Software Agents: An Overview, IEEE Communications Magazine, vol. 36, no. 7, pp. 26-37, July 1998.
[9] I. Satoh, MobileSpaces: A Framework for Building Adaptive Distributed Applications Using a Hierarchical Mobile Agent System, Proceedings of IEEE International Conference on Distributed Computing Systems (ICDCS'2000), pp. 161-168, April 2000.
[10] I. Satoh, MobiDoc: A Framework for Building Mobile Compound Documents from Hierarchical Mobile Agents, Proceedings of International Symposium on Agent Systems and Applications/International Symposium on Mobile Agents (ASA/MA2000), Lecture Notes in Computer Science (LNCS), vol. 1882, pp. 113-125, Springer, September 2000.
[11] I. Satoh, MobiDoc: A Mobile Agent-based Framework for Compound Documents, Informatica, vol. 25, no. 4, pp. 493-500, December 2001.
[12] I. Satoh, Physical Mobility and Logical Mobility in Ubiquitous Computing Environments, Proceedings of 6th International Conference on Mobile Agents (MA'2002), Lecture Notes in Computer Science (LNCS), vol. 2535, pp. 186-202, Springer, October 2002.
[13] I. Satoh, SpatialAgents: Integrating User Mobility and Program Mobility in Ubiquitous Computing Environments, Wireless Communications and Mobile Computing, vol. 3, no. 4, pp. 411-423, John Wiley, June 2003.
[14] I. Satoh, Building Reusable Mobile Agents for Network Management, IEEE Transactions on Systems, Man and Cybernetics, Part C, vol. 33, no. 3, pp. 350-357, August 2003.
[15] I. Satoh, A Testing Framework for Mobile Computing Software, IEEE Transactions on Software Engineering, vol. 29, no. 12, pp. 1112-1121, December 2003.
[16] I. Satoh, Configurable Network Processing for Mobile Agents on the Internet, Cluster Computing, vol. 7, no. 1, pp. 73-83, Kluwer, January 2004.
[17] I. Satoh, Selection of Mobile Agents, Proceedings of 24th IEEE International Conference on Distributed Computing Systems (ICDCS'2004), pp. 484-493, IEEE Computer Society, March 2004.
[18] I. Satoh, Software Testing for Wireless Mobile Computing, IEEE Wireless Communications, vol. 11, no. 5, pp. 58-64, IEEE Communication Society, October 2004.
[19] I. Satoh, Building and Selecting Mobile Agents for Network Management, Journal of Network and Systems Management, vol. 14, no. 1, pp. 147-169, Springer, 2006.
[20] I. Satoh, A Location Model for Smart Environment, Pervasive and Mobile Computing, vol. 3, no. 2, pp. 158-179, Elsevier, 2007.
[21] I. Satoh, Context-aware Agents to Guide Visitors in Museums, Proceedings of 8th International Conference on Intelligent Virtual Agents (IVA'08), Lecture Notes in Artificial Intelligence (LNAI), vol. 5208, pp. 441-455, September 2008.
[22] M. Strasser, J. Baumann, and F. Hole, Mole: A Java Based Mobile Agent System, Proceedings of Workshop on Mobile Object Systems, Lecture Notes in Computer Science (LNCS), vol. 1222, Springer, 1997.
[23] D. L. Tennenhouse et al., A Survey of Active Network Research, IEEE Communications Magazine, vol. 35, no. 1, 1997.
[24] J. E. White, Telescript Technology: Mobile Agents, in J. Bradshaw (ed.), Software Agents, MIT Press, 1997.
Part VII
Applications
Ambient Human-to-Human Communication

Aki Härmä
1 Introduction

In the current technological landscape, colored by environmental and security concerns, the logic of replacing travel by technical means of communication is indisputable. For example, consider a comparison between a normal family car and a video conference system with two laptop computers connected over the Internet. The power consumption of the car is approximately 25 kW, while the two computers, together with their share of the power consumption of the intermediate routers, consume in total in the range of 50 W. Therefore, meeting a person who lives at a one-hour driving distance by car consumes as much power as roughly 1000 hours of video conferencing. The difference in the costs is also increasing: a similar estimate of the cost difference between travel and video conference made twenty years ago gave only three days of continuous video conference for the same situation [29]. The cost of a video conference depends on the duration of the session, while the cost of traveling depends only on the distance. Nevertheless, in a strictly economic and environmental sense, even a five-minute trip by car in 2008 becomes more economical than a video conference only when the meeting lasts more than three and a half days.
It is often suggested that transportation and telecommunications are strongly coupled, or complementary. The telephone is an important means of making appointments and therefore enables and justifies more travel, but at the same time travel increases the need for communication. For example, Short et al. [97] give an example where the opening of a bridge in the UK led to a significant increase in the telephone traffic between the two previously separated areas. In fact, a recent analysis by Choo and Mokhtarian [23] of travel and telephony data in the USA since the 1950s indicates that travel triggers more need for telecommunication than telephony leads to travel.
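A quick check of the arithmetic behind these figures, assuming the meeting requires a round trip at the stated power levels:

\[
\frac{2\,\mathrm{h}\times 25\,\mathrm{kW}}{50\,\mathrm{W}}=\frac{50\,\mathrm{kWh}}{0.05\,\mathrm{kW}}=1000\,\mathrm{h},
\qquad
\frac{(2\times 5\,\mathrm{min})\times 25\,\mathrm{kW}}{50\,\mathrm{W}}\approx\frac{4.2\,\mathrm{kWh}}{0.05\,\mathrm{kW}}\approx 83\,\mathrm{h}\approx 3.5\,\mathrm{days}.
\]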
The analysis of travel and telephony statistics in the USA from the 1950s to the early 2000s [23] probably draws a quite accurate picture of this complementarity, because both travel (mainly by car) and telephony in the early 2000s were actually very similar to those of the 1950s. However, in the last few years travel has become even less attractive due to increasing energy prices and environmental and security concerns, while in telephony there are suddenly many new possibilities related to the availability of broadband communications, the Internet, and Voice-over-IP (VoIP) technologies. Still, it seems that some aspects are missing from current telecommunications which make it only a partially acceptable replacement for travel. A realistic face-to-face experience and the possibility to touch the other are certainly contributing factors, but the experience of a common space and control over the involvement in a communication session may also play a significant role.
The topic of this chapter is ambient communication technologies. We may define an ambient communication system as a spatially distributed system of connected terminal devices which enables dynamic migration of a communication session from one location in the environment to another, and spatial capture and rendering for multiple simultaneous sessions. The goal of this chapter is to give a broad picture of the elements of ambient communication systems and to give more detailed examples of the architectures and specific solutions needed in distributed voice-only speaker telephony, that is, ambient telephony. The main principles of ambient communications are introduced in Sections 2 and 3. In Section 4 we give an overview of some of the existing solutions and research challenges related to the development of a full-scale ambient telephone system. In Section 7 we also give some ideas on how the same concept can be extended to visual communication technologies. We review some approaches to user tracking and calibration of ambient communication systems in Sections 5 and 6, respectively. Finally, we discuss several additional topics in Section 8 and give a summary of areas where more research is needed.
2 The Long Call

When the first commercial telephone service was started in New Haven, CT, USA, 130 years ago, the early telephones were leased in pairs [19]. There was fixed wiring between the two devices and the connection was always open, and there were no phone bills because the call counter had not been invented yet. The modern scenario is very similar: Voice-over-IP (VoIP) systems are also based on a peer-to-peer connection over the Internet and there are no counters or phone bills. If connection time is not charged, one could expect call durations to increase. Recent telecom statistics [33] show that the average call durations in traditional telephony and IP telephony are 163 and 379 seconds, respectively.
What is interesting in this change is that not only does the mean (or median) call duration increase, but there is a new category of very-long-duration calls. For example, the percentage of traditional PSTN calls lasting over one hour was 0.7% [33]. The re-
Fig. 1 Hall’s classification of the social interpersonal distance as a function of the physical interpersonal distance.
sults from a Skype traffic measurement [42] indicate that 4.5% of the calls were longer than one hour and that 0.5% of the calls took more than three hours. It is known that many people make VoIP calls that last hours or days. These can no longer be considered traditional telephone calls: the VoIP phone is used as an awareness system [75] eliciting the experience of connectedness to the other in the user's environment [1]. However, the statistics cited above show that currently few people make very long calls even when they are free. One possible reason is that the current terminal technologies do not support such use.
The typical speech rate in a traditional telephone call is around 120 words per minute. In a long-duration call, it can be expected that the words-per-minute rate fluctuates in a similar way as in natural interaction between people who are in the same room. That is, the phone call becomes a fragmented sequence consisting of interactions and silent periods. In many ways, the form of social interaction over a continuously open telephone line could be similar to the interaction between people who, for example, live together.
If the model for next-generation telephony is taken from the natural interaction between people who are physically present, we also have to take into account the fluctuations in the interpersonal distance. The theory of proxemics by Hall [44], see Fig. 1, suggests that the social distance, or the intensity of interaction, between people correlates with the physical distance. The possibility to control the interpersonal distance depending on the situational, social, and emotional context is one of the missing aspects in traditional communication technologies. For natural one-to-one conversation the typical distance is the personal distance of around one meter. There are clear biases depending on cultural background, personality, gender [90], and the relation between the persons, see, e.g., [97, 68] for a review. However, it seems that there is often an optimal interpersonal distance for communication. If the other talker is too close, e.g., closer than one meter, it may be perceived as inconvenient or arousing. On the other hand, in one study [100], when the nose-to-nose distance was larger than 1.7 meters, familiar subjects in one-to-one conversation tended to search for a seating position closer to each other. A speakerphone system aiming at mimicking physical presence should be able to support these fluctuations in the interpersonal distance. In Section 4.4 we give an overview of some of the technologies to control the distance in audio telephony.
It is clear that the traditional handset is not an optimal terminal device for a long call with a fluctuating conversational state. Holding a phone, or sitting in front of
a videoconference device, for more than an hour causes fatigue and makes the user unavailable for other communication with local and other remote people. Also, if the users leave the terminal devices, it would be difficult to know when the other is next close to the phone and available to continue the conversation. Naturally, one way of staying within reach is to wear the terminal continuously, for example, a Bluetooth earpiece or a head-mounted display. Advanced techniques to combine the local acoustic scene and the remote persons' voices using special headsets have been proposed by several authors, see, e.g., [50, 55]. However, long-term use of body-worn appliances is inconvenient for various reasons. In this chapter we focus on technologies where the same service is provided by a network of terminal devices distributed in the environment. This scenario facilitates flexible control of the spatial attributes of the presence of the remote person, which may solve most of the terminal problems mentioned above.
3 Spatial Attributes Of Presence

The word presence implies that the other has some location in the user's environment. Let us first make a distinction between two extremes: telepresence and social presence [59, 13]. Telepresence is often characterized by the experience of being there. Telepresence technology aims at providing an experience of being present in another location, typically in virtual reality with the help of a head-mounted display and data gloves, or in immersive environments such as the CAVE environments installed in several research laboratories. Collaborative virtual environments have been studied extensively in business, conference, and e-learning applications [24], but not usually in the context of home communications. Social presence may be characterized as an experience of the other being here. In ambient communication the users remain in their natural environment and therefore the focus is more specifically on social presence than on telepresence, although it is usually not possible to make a complete separation. For example, while a high-quality speech communication system may give a social presence experience of having the other here, the background sounds heard in any telephony system may also give an experience of being there. The mixing of the two main branches of presence technology in practical systems is partly due to the limitations in technology, i.e., in segmentation, transmission, and rendering of the audiovisual or multi-modal representations of the other person.
A typical use case for social presence technology is illustrated in Fig. 2, where the desired experience is the social presence of the other in the user's own natural environment. In this scenario, a representation of a remote person, the other, is virtually transferred to and rendered in the natural environment of the user. In other words, the user's environment is augmented by a mediated presence of the other. Augmented reality (AR) is produced by adding synthetic objects to the real environment [20]. Many traditional voice-only teleconference systems built in dedicated rooms, starting already in the 1950s, can be considered augmented reality
Fig. 2 Augmented presence.
systems, see e.g. [82, 111] for a review of historical systems. In these voice-only teleconference systems the microphone is typically placed on a meeting table and one or more loudspeakers also on the table, or in some cases in the ceiling or walls of the room. The Remote Meeting Table system used in the 1970s by the UK Civil Service Department is a very clear example. In the reported system the voices of remote participants in a meeting room teleconference application were played from individual loudspeakers, each with a name sign and an activity lamp, placed on a meeting table in the positions where the remote persons would be seated if present [97]. More recently, similar systems including a small camera and a video screen have been proposed, for example, by [96, 114]. The same idea can also be taken even further by including an anthropomorphic robotic interface, such as in the TELENOR system introduced in [104]. Mixed reality (MR) includes Virtual Reality (VR), AR, and a continuum between them [78, 105]. One example of a mixed reality system is illustrated in Fig. 3, where the user's physical environment is extended by an opening to the physical environment of the other. This configuration is essentially the spatial model for all videoconference systems, see [82, 62] for a review.
Fig. 3 Extended space by a virtual opening.
The opening between the two rooms in Fig. 3 resembles an open window between the two environments. Conceptually, the environment of the other is virtually moved to the neighborhood of the user and the two walls are co-positioned and removed. Communication systems based on the concept of an acoustic window [99, 46, 43], or audio-visual window [17, 54] have been proposed by many authors. Naturally, one may also consider a mixed reality system where the remote environment, or several distant environments, are mixed with the local environment, see Fig. 4. This does not necessarily mean that all multi-modal inputs from two or more environments are mixed into one complex scene. The idea of overlapping the home floor plans is typically used in order to position representations of the remote people in an intuitive way. Grivas introduced an ambient communication system where a small number of similar devices at two homes were represented in the home of another person by colored light sources [41]. For example, if a user switched on a coffee machine, it turned on a light in the other user’s home in a location corresponding to the coffee machine.
Fig. 4 The physical spaces of the user and the others can be positioned freely, e.g., on top of each other, or as virtual neighbors.
The technologies for real-time communication can be characterized by the map of Fig. 5. Conventional telephony falls into the bottom-left corner of the map. It is a session-based technology for which the characteristic model of interaction is the call. The call is a session which is started and terminated, and it has a high intensity, at the level of 120 words per minute, during the session. In visual communication
Fig. 5 The map of telephones.
there is continuous eye contact between the participants. A long VoIP call is an example of a more persistent form of communication where the concept of the call session is vanishing and is replaced by a continuous acoustic presence of the other. Moreover, the traditional telephone is an example of a terminal-centric technology where the attention of the user to the terminal itself is needed to use the system. The speakerphone is a step towards ambient communications where the user may carry on a conversation in the neighborhood of the device. The concept of ambient telephony, in the top-right corner of the map of Fig. 5, is introduced in detail in the following sections. An ambient telephone is a speakerphone which supports persistent use and is ubiquitously available in the home environment [48]. Ambient communication is essentially a service provided by the networked infrastructure of the home. In some sense, this service is similar to heating, air conditioning, or lighting, which are also infrastructure services provided everywhere and continuously in the modern home environment.
4 Ambient Speech Technology

The ambient telephone is a speakerphone system based on arrays of loudspeakers and microphones which are distributed in the home environment and are connected to each other via a home network. The audio rendering and capture in the ambient communication system are performed in a spatially selective way. Therefore, it is possible to position the voice of a remote person and the capture position of the local speech, in principle, at any location in the environment, or to move them from one room to another with a local user.
The possibility to move the call from one device and one spatial location to another is one of the central features of the ambient telephone. The free mobility of the remote and local participants in the environment is one of the elementary properties of real physical presence and it is essential for the control of the interpersonal distance. The mobility also makes it possible to carry on with other activities while having a call. Another benefit of the ambient telephone is the easy management of multiple simultaneous calls. For example, receiving a new call or making a new call while another call is still open can be performed very naturally because of the possibilities to move talkers to distinct spatial positions or leave calls open, for example, in different rooms.
4.1 Ambient Telephone Architecture

One possible architecture for an ambient telephone system is shown in Fig. 6. The system has one master phone, which is the main gateway to external communication channels such as the Internet, the mobile phone network, or the public switched telephone network (PSTN). The master phone is typically the central control point of the system. The number of individual telephone units, or phonelets, in the system may be arbitrary and they can be placed freely. In this chapter we use the term phonelet for the individual units of a distributed communication system. Note that the term is not necessarily related to the phonelets used in Java-based telephone software development [67]. The communication between the phonelets and the master phone includes audio streams and different types of control messages. The network platform for an ambient telephone system can be a TCP/IP network where individual devices may be connected in different ways, including wired, wireless, and powerline connections. However, similar functionality can also be built on top of other network platforms, including the DECT/CAT-iq cordless telephone network [27]. In order to implement spatially selective capture and rendering of audio it is necessary to know the physical locations of the phonelets in the environment. The configuration and calibration of the ambient communication system is discussed in Section 6.
A typical block diagram of a phonelet is shown in Fig. 7. For each incoming call each phonelet initializes a caller instance which contains the implementations of the signal processing algorithms for capture and rendering needed in the distributed speakerphone. The master phone may be similar to the other phonelets, but it has certain additional functions mainly related to the control of the entire phone system, as illustrated in Fig. 8. The master phone also creates a new software instance for each incoming call, and it is responsible for controlling the rendering and capture position of the call and for monitoring the activity status of the call.
Fig. 6 A distributed ambient telephone system.
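The bookkeeping implied by this architecture can be sketched as follows. The data structures and method names below are illustrative assumptions, not the implementation described in [48].

class MasterPhone:
    def __init__(self):
        self.phonelets = {}    # phonelet id -> (x, y) position in the home, in metres
        self.calls = {}        # call id -> per-call state (one caller instance each)

    def register_phonelet(self, phonelet_id, position):
        self.phonelets[phonelet_id] = position

    def start_call(self, call_id, initial_position):
        # One caller instance per incoming call; its rendering/capture position
        # is moved around the home as the local user moves.
        self.calls[call_id] = {"position": initial_position, "active": True}

    def move_call(self, call_id, new_position):
        self.calls[call_id]["position"] = new_position
        # In a real system, updated rendering and capture parameters would be
        # sent to the affected phonelets here.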
4.2 Speech Capture And Enhancement

The fundamental problem in speech capture is to maximize the signal-to-interference ratio for the desired talker. The interferences are caused by other sound sources in the environment, reverberation, and the speech of the remote person rendered into the environment. In the ambient telephone the distance between the talker and the microphone device is often beyond the echo radius, which is the distance from the talker at which the energies of the direct sound and the room reverberation are at the same level. To reduce the amount of reverberant sound it is beneficial to use microphone arrays combined with beamforming to maximize the amplitude of the direct sound [107]. The use of beamforming requires tracking of the users, which is briefly discussed below. It is also often necessary to actively cancel unwanted sound sources using side-lobe cancellation techniques, see [40, 61], and active noise suppression techniques [30]. The suppression of reverberation can also be treated separately, see, e.g., [106]. Several powerful algorithms exist for speech enhancement using microphone arrays, see, e.g., [9, 108] for a review. However, the dynamic hand-over of speech capture in an ambient telephone system with two or more spatially separated arrays poses new challenges for speech enhancement algorithms.
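As an illustration of the array processing referred to above, a minimal delay-and-sum beamformer is sketched below. This is a generic textbook formulation under idealized assumptions (known talker position, integer-sample delays), not the algorithm of [107].

import numpy as np

def delay_and_sum(mic_signals, mic_positions, talker_position, fs, c=343.0):
    # mic_signals: array of shape (num_mics, num_samples); positions in metres.
    # Align the channels on the propagation delays from the assumed talker
    # position and average them, which reinforces the direct sound relative
    # to diffuse reverberation and uncorrelated noise.
    distances = np.linalg.norm(mic_positions - talker_position, axis=1)
    delays = (distances - distances.min()) / c          # relative delays in seconds
    num_mics, n = mic_signals.shape
    out = np.zeros(n)
    for m in range(num_mics):
        shift = int(round(delays[m] * fs))              # delay in samples
        out[: n - shift] += mic_signals[m, shift:]      # advance the later channel
    return out / num_mics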
Fig. 7 A typical architecture of a phonelet.
Fig. 8 A typical architecture of a master phone.
Most work on speech capture in stereophonic and multichannel audio communication has been performed in the extended space configuration of Fig. 3. The benefits of a stereophonic representation of the audio are usually linked to the cocktail
party effect [21], which in this case translates to better separation and identification of far-end talkers and higher intelligibility in noisy conditions. The problem of acoustic echo arises from the acoustic propagation of the sound played back from a loudspeaker to the microphone capturing the local speech. In the first stereophonic speakerphones [15, 3] this was solved by using a half-duplex communication solution where the near-end microphones were muted when the far-end talker was active. Better solutions based on adaptive echo cancelation [101] and later adaptive multichannel echo cancelation [8] have been developed to make communication systems full-duplex. However, near-end speech activity detection is still necessary to control the adaptation of the cancelation filters during double-talk and the residual echo attenuation. Basically all solutions are based on adaptive cancelation filters and dynamic attenuation of the remaining echo.
In a multichannel audio communication system there is a separate acoustic propagation path from each active loudspeaker to each active microphone. The task of the echo canceller is to synthesize copies of the echo path signals, typically using frequency-domain adaptive filters [98] to model each echo path. The generated synthetic signals are then subtracted from the microphone signals. This is illustrated in Fig. 9, where H̃ represents a matrix of acoustic transfer functions from each loudspeaker to each microphone. The task of the multichannel echo canceller is to use a truncated estimate H of these acoustic transfer functions, as illustrated in Fig. 9, to cancel the corresponding components from the microphone signals. The stereophonic echo problem is significantly more challenging than the single-channel echo problem. The main problem is related to the fact that the two signals played from the loudspeakers may be, and usually are, correlated. Therefore, there is no unique way to determine which signal components observed in the microphone signal are arriving from which loudspeaker in the room. Consequently, there is no unique solution for the normal equations needed to solve the coefficients of the adaptive filters modelling the acoustic paths. In addition, it can be shown that most possible solutions depend on the acoustic transfer functions from the talker to the microphone in the far-end room. The two most common solutions are to decorrelate the loudspeaker signals by applying non-linear distortion to one of the signals [7], or to control the estimation of the adaptive filter coefficients dynamically in such a way that the filters are only adapted in frequency and time slots where the two signals are uncorrelated [94].
In principle, the use of the multichannel acoustic echo cancelation of Fig. 9 is necessary when the audio is being transmitted in a stereo or multichannel representation. This is typically the case in the extended space scenario [15, 17, 18, 43]. However, in the ambient telephone system for the home environment we may usually assume that the speech signals transmitted from the far end are monophonic. Therefore, it is possible to simplify the system in at least two ways. Fig. 10 gives an example of a system of multiple single-channel echo cancelers, and Fig. 11 shows an even simpler system with only one echo canceler between the received far-end signal, before spatial reproduction, and the transmitted single-channel speech signal derived from the multi-microphone input. The obvious problem with the solution in Fig.
10 is that the echo path being modeled by the adaptive filter will also include the spatial sound rendering method R, which can usually be represented as a linear, slowly-
Fig. 9 A multichannel echo cancelation system.
Fig. 10 A multiple single-channel echo cancelation system.
varying matrix filtering operation. In the system of Fig. 11, the modeled echo path will also contain the processing of the microphone signals, that is, the echo canceler can be characterized as a system given by RHC [22]. In general, it is difficult to isolate the estimation of the filter matrix representing the pure acoustic paths H̃ in the systems of Figs. 10-11. Reed, Hawksford, and Hughes [88] have proposed a modification of an AEC algorithm where the rendering can be separated in the case of amplitude panning in spatial sound reproduction. The implementation of the ambient telephone system using network-connected phonelets in the way shown in Fig. 6 poses additional requirements for the algorithmic architecture. In principle, all echo paths in the system of Fig. 9 need to be modeled, which is a computationally expensive task. [18] and [84] have demonstrated that the complexity of the multi-channel echo canceller can be signifi-
Fig. 11 A single channel echo cancelation system.
cantly reduced in the case where audio reproduction is based on wavefield synthesis of a single source signal. This can also be seen as a method for the separation of R from the estimation of the canceller filter. The adaptation of the set of filter coefficients H in Fig. 9 requires that all loudspeaker and microphone signals are available in one device, which translates to a high continuous load on the local network. In addition, since there is a separate echo canceler array for each incoming caller, the multi-channel echo cancelation solution quickly leads to a large computational load. In the multiple single-channel solution of Fig. 10 the data transmission can be kept relatively low and there is only a single echo canceler for each caller, which makes it somewhat easier to scale the solution to different numbers of phonelets. The speech enhancement and the decision to transmit the microphone data to the master phone can also be performed locally in each phonelet, which reduces the network load. This is essentially the solution illustrated in Fig. 6, but there is additionally a secondary echo canceler in the master phone. A similar solution has been proposed earlier, for example, by [63], but not in the context of a network-distributed telephone system. The system of Fig. 11 is also a usable solution. The main additional problem in this configuration is that the dynamic microphone preprocessing C is now also inside the echo cancellation loop. Modified AEC algorithms for the case of an adaptive beamformer in the echo loop have been proposed, e.g., in [63, 45]. In the real-time 16-channel hard-wired ambient telephone system installed in the HomeLab [92] of Philips Research, Eindhoven, The Netherlands, in 2007, the echo cancelation solution was a single-channel echo canceler whose echo loop contained the spatial rendering and the speech capture, the latter being a relatively simple microphone selection method based on user tracking. In practical tests the performance of this solution was found acceptable, although some residual echo can be heard at the remote end, especially during the periods when the rendering position of the caller is being moved from one room to another. The movement of the rendering and capture position is per-
formed by changing the rendering and capture processing, R and C. Any changes in those naturally also affect the acoustic transfer functions in H. The speech capture should be controlled by voice activity detection (VAD) [108]. The fluctuating activity level of long calls sets high requirements for the detection of speech activity. In a multi-user environment, the VAD should also be combined with speaker recognition. Moreover, the problem of near-end interference is emphasized in ambient telephony because of acoustic multitasking, e.g., listening to music or watching television during a call, which are natural activities in the case of real physical presence with people.
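The adaptive cancelation filters discussed in this section can be illustrated with a generic single-channel normalized LMS (NLMS) echo canceller. This is a textbook formulation under simplified assumptions (time-domain filter, no double-talk control), not the HomeLab implementation.

import numpy as np

def nlms_echo_canceller(far_end, mic, filter_len=512, mu=0.3, eps=1e-8):
    # Estimate the echo path from the far-end (loudspeaker) signal and
    # subtract the estimated echo from the microphone signal.
    h = np.zeros(filter_len)            # estimated echo path impulse response
    out = np.zeros(len(mic))            # echo-reduced near-end signal
    x_buf = np.zeros(filter_len)        # most recent far-end samples
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        e = mic[n] - h @ x_buf          # residual: near-end speech + residual echo
        out[n] = e
        # NLMS update; in practice adaptation is frozen during double-talk.
        h += (mu / (x_buf @ x_buf + eps)) * e * x_buf
    return out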
4.3 Speech Transmission

There are a large number of standardized speech coders which are used in different speech communication applications, see, for example, [102, 66] for an overview. In VoIP applications the standardized ITU-T G.711, G.729/G.729A, and G.723.1 coders are commonly used, but there is also a large number of proprietary coders used in popular VoIP applications. In wideband speech communication the G.722 [76] and AMR-WB [11] coders are also popular. In the ambient telephone application the transmitted content can be based on many different coders, depending on whether the incoming call is received through a mobile or a VoIP network. When the speech signal of the remote caller is played through a loudspeaker system the requirements for the voice quality are high. For example, low-quality mobile phone speech played through a loudspeaker system may not sound acceptable. Therefore, it may be necessary to develop techniques for post-transmission speech enhancement to alleviate the quality problems and preferably make every caller sound as good as possible. For example, in the system illustrated in Fig. 6 this is performed in the master telephone, see Fig. 8. Typically, post-transmission speech enhancement should contain algorithms for speech bandwidth extension, see [73], and for packet-loss concealment in the case of VoIP transmission [108].
The requirements for the echo cancelation are coupled with the choice of the audio transmission format. For example, in the network-distributed ambient telephone system illustrated in Figs. 6-8 the residual echo cancellation is performed in the master phone. The speech signal from a phonelet should preferably be transmitted over the local network in encoded form. It is known that many sophisticated coders such as AMR-WB introduce non-linear distortions to the microphone signal which severely degrade the performance of the adaptive filters used in echo cancellation [47]. For this reason, the system illustrated in the figures actually uses a simple G.722 coder for in-home transmission, because there this problem is much smaller.
4.4 Spatial Speech Reproduction

One of the challenges for spatial speech reproduction is to create positioned sound sources in a very large listening area. The rendering of the speech of the other person can be performed using any spatial audio rendering technique, including amplitude panning [86, 74], various types of holophonic reproduction methods such as wave field synthesis (WFS) [10, 110], adaptive methods such as transaural reproduction [65], or adaptive wave field synthesis [36]. In all these cases the listener is assumed to stay more or less in an optimal listening position, the sweet spot, or at least within a restricted listening area inside a volume enclosed by the loudspeakers. Furthermore, the sources are typically restricted to the space outside this volume. Some of these restrictions should be relaxed in the ambient telephone scenario, because one cannot necessarily assume that the listener is seated in the sweet spot, and such a configuration does not allow flexible control of the interpersonal distance.
4.4.1 Follow-Me Effect For A Large Listening Area

In ambient telephony the user is free to move in the environment, for example, to walk from one room to another or to go outside of the listening area. In theory, it is possible to build a sound field reproduction system which can create a physically plausible approximation of the actual acoustic wavefield produced by a human talker in the environment [10]. However, this is not always feasible or desired, e.g., in the home environment, because the required number of loudspeakers increases rapidly as a function of the area or the volume of the listening space. In practice, we are limited to sparse and widely distributed arrays of loudspeakers. The most familiar example of such an ambient audio reproduction system is the public announcement system of a railway station or an airport. While the public announcement system is designed to deliver the same message to everyone in the area, the goal of an ambient telephone system is to produce the localized voice of a remote person for a tracked individual.
Clearly a sparse array of loudspeakers does not allow for an exact reproduction of the desired wave field for most listening positions. In fact, it turns out that the reproduced sound field is very different from the desired physical sound field. However, in [51] it was demonstrated that this type of sparse array can actually create a very strong illusion that the sound source is moving with the listener when the listener walks, for example, through a hallway or from one room to another. This can be done simply by playing the same speech signal synchronously from all the loudspeakers. The produced effect is similar to the visual phenomenon that the sun always appears to move with the observer. Fig. 12 shows the predictions of the direction of a sound source for a person walking past a line array of loudspeakers. The directional estimates were computed using a model of binaural hearing detailed in [51].

Fig. 12 A prediction of the binaural localization model of the perceived direction of the source in positions in front of a uniform line array with listeners facing parallel to the array.

The simple method of playing identical signals from a line array of loudspeakers to create the follow-me effect for a moving observer has many shortcomings. First, for a listener who is not in front of the array but outside the range spanned by the
two ends of the array, the sound appears severely colored due to the comb filtering effect. In hands-free telephony, it is desired that the sound follows the user. The sound source should appear localized close to the user also for a second person with a different state of motion; for a second person who is not using the application, the stationary follow-me effect may appear disturbing.
The simplest method for localized playback is to play the voice of the remote person always and only from the nearest loudspeaker. Typically the transition from one loudspeaker to another should be performed smoothly to avoid artifacts related to the onset and offset of audio. However, this method has the limitation that the voice appears to jump from one device to another. It is also possible to use multiple loudspeakers simultaneously to create localized phantom sound sources. Several dynamic spatial rendering algorithms were compared in [51] in a listening experiment illustrated in Fig. 13. The listening test was carried out in a quiet and relatively damped listening room with a line array of eight small loudspeakers. In the experiments the subjects were asked to walk back and forth along the path marked in Fig. 13. The reproduction algorithms were controlled in real time based on camera-based tracking of the position of the subject. The subjects gave grades for various attributes related to the perceived position and movement of a sound source. It turned out that it is relatively easy to create a convincing illusion that the remote talker is moving with the user. The movement in most algorithms is smooth and the coloration of the sound is at a low level. However, larger differences were found for a second observer who is not moving. Therefore the authors of [51] recommended the use of a rendering technique based on the secondary source method.

Fig. 13 The listening test setup.

In the secondary source method the sound is delayed and attenuated in the individual loudspeakers to simulate the actual propagation of sound from the desired source position to the position of each loudspeaker. A similar method is used in conference and lecture hall reinforcement systems to improve the localization of the local talkers [26]. In the system of Fig. 6 the secondary source method can be implemented in a distributed way such that each incoming speech signal is multicast to all phonelets, where the signal is modified for reproduction in the rendering block indicated in Fig. 7 using the dynamic geometric model maintained by the master phone of Fig. 8.
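The secondary source method can be summarized in a few lines of code: each loudspeaker receives the same speech signal, delayed and attenuated as if it had propagated from the desired virtual source position to that loudspeaker. The sketch below is a generic formulation of this idea, not the implementation evaluated in [51].

import numpy as np

def secondary_source_feeds(speech, fs, source_pos, speaker_positions, c=343.0):
    # speech: mono signal; source_pos and speaker_positions are (x, y) in metres.
    feeds = []
    for p in speaker_positions:
        d = float(np.linalg.norm(np.asarray(p) - np.asarray(source_pos)))
        delay = int(round(fs * d / c))        # propagation delay in samples
        gain = 1.0 / max(d, 0.1)              # 1/r attenuation, limited near zero
        feed = np.zeros(len(speech) + delay)
        feed[delay:] = gain * speech          # delayed, attenuated copy
        feeds.append(feed)
    return feeds                              # one loudspeaker feed per position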
4.4.2 The Control Of Interpersonal Distance The secondary source method described above has a low complexity and it produces a clearly localized ambient speech source relatively independently of the position of the observer. However, the perceived distance of the remote person is always at the minimum at the distance of the nearest loudspeaker. Several acoustical properties that depend on the distance between source and listener are known to affect the perception of distance. In the free field the source intensity at the position of the listener is inversely proportional to the source distance. Various studies have shown that for familiar sources the intensity correlates with perceived distance [25, 103]. The ratio between the direct and reverberant sound is inversely proportional to the source distance and provides a potential cue for distance perception [77] and the temporal structure of early reflections from walls is know to contribute [64, 16]. Other acoustical cues that can contribute to perceived distance at near distances close the listener are spectral cues due to head scattering resulting in a relative increase in low frequency energy [31], interaural level difference cues, and to a lesser extent interaural time delay cues, while at very large
Fig. 14 The loudspeaker configuration combining a directional loudspeaker S with a surround audio system (see text).
At very large distances the attenuation of high frequencies may provide additional cues for distance perception (cf. [25]). Dynamic distance cues such as the Doppler effect, see e.g. [49] for a review, are probably not relevant in ambient telephony because the effect only becomes audible when the speed of the sound source is high. It is relatively easy to reproduce the sound of a remote talker such that it appears at any distance farther away than the nearest loudspeaker [37], but it is difficult to create an illusion that the other person is at the intimate or even personal range if the loudspeakers are farther away than two meters. One of the few potential technologies for creating a virtual sound source very close to a listener is an exotic parametric array reproduction technique known as the audio spotlight [116, 85]. A somewhat similar illusion can also be created by focused beam-shaping with an array of loudspeakers [69]. It is also possible to use highly directional loudspeakers as part of the sound reproduction system of the ambient telephone to control the distance effect. A loudspeaker configuration combining a highly directional panel loudspeaker and a surround audio system was introduced in [52]. The system is shown in Fig. 14, where S is a directional speaker and the surrounding loudspeaker system is used to create virtual image sources [2] that correspond to the desired virtual source position. It was demonstrated in [52] that by carefully controlling the level of the highly directional loudspeaker aimed at the subject and the level of the simulated early reflections it is possible to create a controllable effect of having the remote talker's voice in a position between the listener and the nearest loudspeaker. However, this method again requires knowing the position of the subject and some means of controlling the radiation pattern of the directional loudspeaker.
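As a rough illustration of how the two main static cues could be driven together, a rendering controller may let the direct-path level follow the 1/r pressure law while keeping the simulated reverberant level fixed, so that the direct-to-reverberant ratio also decreases with the intended distance. This is only a hypothetical control rule, not the scheme of [52]:

```python
import math

def distance_cues(target_dist, ref_dist=1.0, reverb_level_db=-20.0):
    """Direct-path gain (dB) and resulting direct-to-reverberant ratio (dB)
    for a desired virtual source distance in metres."""
    direct_db = -20.0 * math.log10(max(target_dist, 0.1) / ref_dist)  # 1/r pressure law
    drr_db = direct_db - reverb_level_db  # simulated reverberant level held constant
    return direct_db, drr_db
```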
5 Tracking And Interaction

The control of spatial rendering and capture usually requires that the position of the local talker is known. For example, in the follow-me scenario, where the virtual speakerphone follows a user from one room to another, the position of the local user should be tracked. The technologies for localization of users are reviewed in another chapter of this handbook (Part VI, Chapter 6). The three major techniques for location-sensing are triangulation, proximity, and scene analysis, see [56] for the taxonomy. Triangulation is based on multiple distance measurements between known points, in proximity techniques nearness to a known set of points is detected, and in scene analysis a view from a particular vantage point is examined. In many tracking methods the user is required to carry a beacon, which may be a transmitter or a receiver device. Active RFID tags are commonly used for tracking based on either proximity or triangulation. However, the preferred solution for ambient communication systems is beaconless tracking, where the location is determined from the user's voice, moving body, radiated heat, or some other measurable quantity related to the presence of the user. The most commonly used beaconless tracking techniques are based on localization of active talkers using microphone arrays [28] or cameras, or a combination of both microphone and camera data, see, e.g., [35]. In the case where there are several potential users in the environment it is also necessary to identify the users in order to determine how the speech signal(s) should be rendered in the environment. This requires the development of algorithms that are capable of both localizing and identifying the users. Most likely a functional solution to tracking in ambient communication applications will be based on multiple modalities and high-level reasoning and learning algorithms. Software architectures for ubiquitous tracking in networked environments have been proposed, see, e.g., [57, 93]. In this chapter we have mainly focused on the media technology of ambient communication. Various topics related to user interaction with the system and to privacy need further study. The privacy issues in ubiquitous systems such as the ambient telephone have recently been reviewed, e.g., in [72]. The user interaction aspects of ambient intelligence have a dedicated chapter in this book (Part III).
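For the microphone-array techniques mentioned above, the core operation is estimating the time-difference of arrival (TDOA) between microphone pairs and mapping it to a direction. A minimal two-microphone sketch is given below; the variable names are illustrative, and practical systems such as those in [28] use more robust, weighted correlation measures.

```python
import numpy as np

def tdoa_direction(x1, x2, fs, mic_spacing, c=343.0):
    """Estimate the direction of an active talker from a two-microphone pair.

    x1, x2:      signal frames from the two microphones
    fs:          sampling rate in Hz
    mic_spacing: distance between the microphones in metres
    Returns the angle (radians) of the source relative to the array broadside.
    """
    # Cross-correlate the channels and find the lag at which they align best.
    corr = np.correlate(x1, x2, mode="full")
    lag = np.argmax(corr) - (len(x2) - 1)   # relative delay in samples
    tau = lag / fs                          # time-difference of arrival in seconds
    # Far-field model: tau = (mic_spacing / c) * sin(theta)
    sin_theta = np.clip(c * tau / mic_spacing, -1.0, 1.0)
    return float(np.arcsin(sin_theta))
```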
6 Calibration And Configuration

It is easy to build an ambient telephone setup in a laboratory environment. The loudspeakers and microphones can be put in optimal positions and the rendering and capture can be controlled by a known geometric model. When a user installs such a system in a home environment the geometry is unknown. Therefore, the system should be able to calibrate itself. The networked nodes of the system can find each other via some middleware such as UPnP [60]. However, the network discovery of devices does not provide information about the physical locations of the devices in the environment. When the devices are connected via a wireless network it is possible
to find their relative locations by measuring the signal strength, time-of-arrival, or angle-of-arrival of the radio frequency signal between each transmitter and receiver, see [83]. However, the RMS localization errors in WLAN and DECT are typically above 2 meters [109]. In Ultra-Wideband (UWB) wireless technology the localization errors are well below one meter [38], which is probably sufficient for most ambient communication scenarios. However, an RF measurement does not provide accurate enough information to control audiovisual capture and rendering in a multi-room application. It may happen that there is no line-of-sight between two devices although they are in the same room, or two devices may be in different rooms but on opposite sides of a wall. For example, drywall is acoustically and visually an efficient isolator between rooms but has a low tangent loss for UWB radio propagation [80]. Therefore, it is necessary to also use acoustic or vision techniques in the calibration of the ambient communication system. The acoustic calibration can be performed in a separate measurement session, see, e.g., [79]. Many modern high-end surround audio systems feature automatic offline calibration of the loudspeaker setup. Usually these are based on playing test sequences from the loudspeakers and comparing them to a microphone signal captured at the central listening area. In [112] it was demonstrated that an ad hoc network of laptops can be used as a sensor array for capturing speech from the participants of a meeting, and such arrays can also be calibrated automatically using off-line measurements [87]. A controlled off-line measurement provides accurate data on an acoustic path. However, a separate measurement session needs to be repeated every time a new device is added to the system or devices are moved from one place to another, and the orchestration of such measurements between various devices in a dynamic environment becomes very difficult. An alternative is to measure the acoustic paths between devices continuously during the normal operation of the system. These techniques are typically based on adaptive system identification, where the acoustic path is modelled as a high-order FIR filter and the coefficients of that filter are estimated by comparing the original signal to a captured microphone signal. For example, [71] reported that a standard FIR LMS algorithm converged to a useful solution in one room within a few minutes of playing typical audio material. Adaptive estimation can also be performed simultaneously for several positions in the listening area [32]. It is also possible to embed low-level test signals into the audio signals and use those in the identification of the path [81]. One potential method for the automatic measurement of acoustic paths in a system of networked audio devices was recently introduced in [47]. The calibration should ideally give a physical model of the environment and the devices in it, which could then be used to control the focus and nimbus [89], or the audibility and visibility, of the remote and local participants. The geometric model should be the basis for the software architecture used to control an ambient communication system. The free mobility of the call in the user's environment is a challenge for the software architecture because it requires decoupling the communication application from the individual devices, such as phonelets, running the software.
For example, the migration of a call from one device to another can be performed with the help of the Session Initiation Protocol (SIP) [58], but this essentially closes the call in
one terminal and opens it in another. A middleware solution more suitable for the control of ambient communication applications has recently been proposed in [93].
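A minimal sketch of the continuous adaptive identification idea discussed in this section: the acoustic path from a loudspeaker to a microphone is modelled as an FIR filter whose coefficients are updated with a normalized LMS rule while ordinary audio is being played. The filter length and step size below are illustrative assumptions, not the settings reported in [71].

```python
import numpy as np

def nlms_path_estimate(played, captured, taps=2048, mu=0.5, eps=1e-6):
    """Estimate an FIR model of the loudspeaker-to-microphone path.

    played:   signal sent to the loudspeaker
    captured: signal recorded by the microphone
    """
    w = np.zeros(taps)                 # FIR coefficient estimate
    x_buf = np.zeros(taps)             # most recent input samples, newest first
    for n in range(len(played)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = played[n]
        y_hat = np.dot(w, x_buf)       # predicted microphone sample
        e = captured[n] - y_hat        # prediction error
        w += mu * e * x_buf / (eps + np.dot(x_buf, x_buf))  # NLMS update
    return w
```

Because the estimate adapts continuously, the path model tracks changes such as devices being moved, without a dedicated measurement session.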
7 Ambient Visual Communication

Sound and light are fundamentally different in the sense that diffraction around obstacles and reflection from surfaces largely preserve acoustic information, while visual information is usually completely lost. Therefore, a direct line-of-sight is needed both in the capture of the visual representation of a subject and in the rendering of the received image at the receiving end. In the extended space scenario of Fig. 3, e.g., traditional video-telephony, the geometric properties of localized audio and video are similar: in both the visual and the acoustic domain the goal is to capture and reproduce a view through a window connecting the two locations. The experience of interpersonal distance is naturally an essential aspect of video communication. It can also be used, for example, as a part of the user interface of the system [91]. In the ambient communication scenario of Fig. 2, however, there is no clear matching paradigm in visual reproduction for spatially localized audio. For example, true holographic projection systems capable of creating visible 3D characters in open air will probably remain science fiction for the foreseeable future. Three-dimensional autostereoscopic display technologies are already mature [70] and provide interesting possibilities for the control of the interpersonal distance in conversation. However, the viewing area is still limited to the front of the flat display, thus making the communication mode terminal-centric. Various volumetric display systems, see, e.g., [14] for a review, are not limited to a 2D display, but the visual representation is projected inside an enclosed volume that can only be observed from outside. In a large multi-room environment continuous capture of a freely moving subject can only be performed with a network of cameras. Multi-camera techniques for tracking and capturing a human character have been reviewed in another chapter of this book (Part VII, Chapter 1) and, e.g., in [53]. For the rendering of the visual representation there are currently three solutions available: the use of multiple display devices, video projection [12], or pixelated light techniques integrated into the lighting. In multi-display reproduction separate display devices are placed in different rooms. In the follow-me scenario they can be controlled in such a way that the current audio-visual call migrates from one display device to another following the moving user. This is technically feasible and supported by middleware solutions for the networked home; however, intelligent power management techniques are needed in the displays to make the application environmentally acceptable. One possibility is to use very low-power electrophoretic display techniques (e-paper), for example integrated in wallpaper or a piece of furniture. The video projection techniques provide interesting new opportunities
due to the progress in solid-state lighting and liquid-crystal active optics, which makes it possible to build power-efficient projection devices. With e-paper and with video projection on an arbitrary surface it is difficult to create a visually accurate representation of the remote person, although full-colour e-paper technology will be available in the near future and there also exist various techniques for photometric compensation in video projection, see, e.g., [5, 34]. It is possible to render a reduced visual representation, which may be sufficient for determining the pose and current activity of a remote person and can also support audio telephony with hand and body gestures. For example, the remote person can be represented as an animated character or an avatar [95] that is optimized for the limited visual rendering. In video projection it is also possible to render silhouette images as shadows representing the remote people [4, 115].
8 Future Research

The topic of this chapter has been ambient communication systems, defined as distributed systems of connected terminal devices which enable the dynamic migration of a communication session from one location in the environment to another and spatial capture and rendering for multiple simultaneous sessions. The chapter has given a broad overview of existing research and challenges in the area of ambient communications rather than focusing on algorithmic details. If the idea of ambient human-to-human communication modeling the real physical presence of remote people in the user's natural environment is taken literally, the list of open research questions is very long. In voice-only telephony the main challenge is the clean capture of speech in a noisy and reverberant environment. In audio rendering, the accurate control of the perceived interpersonal distance requires more research. In visual communication the challenges lie in capturing the appearance of a person over a large area in varying lighting conditions. The capture is tightly coupled with the rendering of the image at the receiving end. For ambient visual rendering several solutions are available, but they are limited either by terminal-centricity or by quality and visibility. In addition to the continuous development of algorithms for audio-visual communication, there are also very interesting new developments in materials and hardware technologies. For example, the pressure-sensitive paint introduced in [39] may evolve into a usable solution for true ambient capture of speech: the microphones can literally be painted on any surface and the pressure signal impinging on the surface can be read using a laser scanner from another location in the room. For audio reproduction, the development of ferroelectret materials [6] and new advances in nanotechnology [113] may in the future offer thin free-form loudspeaker panels that can be integrated into furniture or wallpaper.
9 Acknowledgements

The author would like to thank S. van de Par and W. de Bruijn for some of the material reprinted here from [51, 52], and B. den Brinker, M. Bulut, and T. Määttä for valuable comments on an early manuscript.
References [1] Ackerman MS, Starr B, Hindus D, Mainwaring SD (1997) Hanging on the ‘wire’: a field study of an audio-only media space. ACM Transactions on Computer-Human Interaction 4(1):39–66 [2] Allen JB, Berkley DA (1979) Image method for efficiently simulating smallroom acoustics. J Acoust Soc Am 65(4):943–950 [3] Aoki A, Koizumi N (1987) Expansion of listening area with good localization in audio conferencing. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’87), pp 149 – 152 [4] Apperley M, McLeod L, Masoodian M, Paine L, Phillips M, Rogers B, Thomson K (2003) Use of video shadow for small group interaction awareness on a large interactive. In: Proc. of Fourth Australasian User Interface Conference (AUIC2003) [5] Ashdown M, Okabe T, Sato I, Sato Y (2006) Robust content-dependent photometric projector compensation. In: Proc. of ProCams 2006: International Workshop on Projector-Camera Systems, pp 60–67 [6] Bauer S, Gerhard-Multhaupt R, Sessler GM (2004) Ferroelectrets: Soft electroactive foams for transducers. Physics Today 57:37–43 [7] Benesty J, Morgan DR, Sondhi MM (1998) A better understanding and an improved solution to the specific problems of stereophonic echo cancellation. IEEE Trans Audio Speech Processing 6:156–165 [8] Benesty J, Gaensler T, Eneroth P (2000) Multi-channel sound, acoustic echo cancellation, and multi-channel time-domain adaptive filtering. In: Acoustic Signal Processing for Telecommunications, Kluwer Academic Publ., Boston, USA, chap 6, pp 101–120 [9] Benesty J, Makino S, Chen J (eds) (2005) Speech enhancement. Springer [10] Berkhout AJ, de Vries D, Vogel P (1993) Acoustic control by wave field synthesis. J Acoust Soc Am 93(5):2764–2778 [11] Bessette B, Salami R, Lefebvre R, Jelinek M, Rotola-Pukkila J, Vainio J, Mikkola H, Järvinen K (2002) The adaptive multirate wideband speech codec (amr-wb). IEEE Trans Speech Audio Processing 10(8):620 – 636 [12] Bimber O, Emmerling A, Klemmer T (2005) Embedded entertainment with smart projectors. IEEE Computer Society pp 48–55 [13] Biocca F, Harms C, Burgoon JK (2003) Toward a more robust theory and measure of social presence: Review and suggested criteria. Presence: Teleoperators and virtual environments 12(5):456–480
[14] Blundell BG, Schwarz AJ (2002) The classification of volumetric display systems: Characteristics and predictability of the image space. IEEE Trans Vis Computer Graphics 8(1):66–75 [15] Botros R, Abdel-Alim O, Damaske P (1986) Stereophonic speech teleconferencing. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’86), pp 1321 – 1324 [16] Bronkhorst A, Houtgast T (1999) Auditory distance perception in rooms. Nature 397:517–520 [17] de Bruijn W (2004) Application of wave field synthesis in videoconferencing. PhD thesis, TU Delft, Delft, The Netherlands [18] Buchner H, Spors S, Kellermann W (2004) Wave-domain adaptive filtering: acoustic echo cancellation for full-duplex systems based on wave-field synthesis. In: Proc. Int. Conf. Acoustics, Speech, Signal Process. (ICASSP’04), Montreal, Canada, vol IV, pp 117–120 [19] Casson HN (1910) The History of the Telephone, 1st edn. A. C. McClurg & Co., Chicago, USA, univ. Virginia Library Electr. Text Center [20] Caudell T, Mizell D (1992) Augmented reality: an application of headsup display technology to manual manufacturing processes. In: Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences, IEEE, Hawaii, vol II, pp 659 –669 [21] Cherry EC (1953) Some experiments on the recognition of speech, with one and two ears 25(5):975–979 [22] Chiucchi S, Piazza F (2000) A virtual stereo approach to stereophonic acoustic echo cancellation. In: Proc. Int. Symp. Circuits and Systems (ISCAS’00), Geneva, Switzerland, vol IV, pp 745–748 [23] Choo S, Mokhtarian PL (2005) Do telecommunications affect passanger travel or vice versa? Transportation Research Record (1926):224–232 [24] Churchill EF, Snowdon DN, Munro AJ (eds) (2001) Collaborative Virtual Environments: Digital Places and Spaces for Interaction. Springer-Verlang London Ltd [25] Coleman P (1963) An analysis of cues to auditory depth perception in free space. Psychol Bull 60:302–315 [26] Davis D, Davis C (1997) Sound System Engineering, 2nd edn. Focal Press, Boston, MA, USA [27] DECT (2008) Cordless advanced technology - internet and quality. http://www.dect.com [28] Di Biase JH, Silverman HF, Brandstein MS (2001) Robust localization in reverberant rooms. In: Brandstein MS, Ward D (eds) Microphone arrays: Signal Processing Techniques and Applications, Springer-Verlag, chap 7, pp 131– 154 [29] Dickson E, Bowers R (1973) The videotelephone: a new era in telecommunications. Tech. rep., Cornell University [30] Diethorn E (2005) Foundations of spectral-gain formulae for speech noise reduction. In: Proc. Int. Workshop Acoustic Echo and Noise Control (IWAENC’2005), Eindhoven, The Netherlands
[31] DS Brunghart WR NI Durlach (1999) Auditory localization of nearby sources. II Localization of a broadband source. J Acoust Soc Am 106:1956– 1968 [32] Elliott SJ, Nelson PA (1989) Multiple-point equalization in a room using adaptive digital filters. J Audio Eng Soc 37(11):899–907 [33] FCC (2007) Trends in telephone service. Tech. rep., Federal Communication Commission, Washington DC, USA [34] Fujii K, Grossberg M, Nayar S (2005) A projector-camera system with realtime photometric adaptation for dynamic environments. In: In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, VPR 2005, vol 2, pp 1180–1187 [35] Gatica-Perez D, Lathoud G, Odobez JM, McCowan I (2007) Audiovisual probabilistic tracking of multiple speakers in meetings. IEEE Trans Audio, Speech, Language Processing 601(15):2 [36] Gauthier PA, Berry A (2006) Adaptive wave field synthesis with independent radiation mode control for active sound field reproduction: Theory. J Acoust Soc Am 119(5):2721–2737 [37] Gerzon MA (1992) The design of distance panpots. In: 92nd AES Convention preprint 3308, Vienna, Austria [38] Gezisi S, Tian Z, Giannakis GB, Kobayashi H, molisch AF, Poor HV, Sahinoglu Z (2005) Localization via ultra-wideband radios. IEEE Signal Process Mag 22(4):70–84 [39] Gregory JW, Sullivan JP, Wanis SS, Komerath NM (2006) Pressure-sensitive paint as a distributed optical microphone array. JAcoust Soc Am 119(1):251– 261 [40] Griffiths LJ, Jim CW (1982) An alternative approach to linearly constrained adaptive beamforming. IEEE Trans Antennas and Propagation AP-30(1):27– 34 [41] Grivas K (2006) Digital selves: Devices for intimate communications between homes. Pers Ubiquit Comput 10:66–76 [42] Guha S, Daswani N, Jain R (2006) An experimental study of the skype peerto-peer voip system. In: Proc. The 5th Int. Workshop on Peer-to-Peer Systems (IPTPS ’06), Santa Barbara, CA, USA, pp 1–6 [43] Haapsaari T, de Bruijn W, Härmä A (2007) Comparison of different sound capture and reproduction techniques in a virtual acoustic window. In: 122nd AES Convention, paper 6995, Vienna, Austria [44] Hall ET (1963) A system for the notation of proxemic behavior. American Anthropologist 65:1003–1026 [45] Hämäläinen M, Myllylä V (2007) Acoustic echo cancellation for dynamically steered microphone array systems. In: Proc. IEEE Workshop Appl. Digital Signal Process. Audio Acoustics (WASPAA’2007), New Paltz, NY, USA, pp 58–61 [46] Härmä A (2002) Coding principles for virtual acoustic openings. In: Proc. AES 22nd International Conference on Virtual, Synthetic and Entertainment Audio, Espoo, Finland
[47] Härmä A (2006) Online acoustic measurements in a networked audio system. In: 120th AES Convention Preprint 6666, Paris, France [48] Härmä A (2007) Ambient telephony: scenarios and research challenges. In: Proc. INTERSPEECH 2007, Antwerp, Belgium [49] Härmä A, van de Par S (2007) Spatial track transition effects for headphone listening. In: Proc. 10th DAFx, Bordeaux, France [50] Härmä A, Jakka J, Tikander M, Karjalainen M, Lokki T, Hiipakka J, Lorho G (2004) Augmented reality audio for mobile and wearable appliances. J Audio Engineering Society 52 [51] Härmä A, van de Par S, de Bruijn W (2007) Spatial audio rendering using sparse and distributed arrays. In: 122nd AES Convention, paper 7056, Vienna, Austria [52] Härmä A, van de Par S, de Bruijn W (2008) On the use of directional loudspeakers to create a sound source close to the listener. In: 124nd AES Convention, paper 7328, Amsterdam, The Netherlands [53] Hartley RI, Zisserman A (2004) Multiple view geometry in computer vision, 2nd edn. Cambridge University Press [54] Heeter C, J Gregg FBDD J Climo (2003) Being There: Concepts, effects and measurement of user presence in synthetic environments, Ios Press, Amsterdam, The Netherlands, chap 19: Telewindows: Case Studies in Asymmetrical Social Presence [55] Higa K, Nishiura T, Kimura A, Shibata F, Tamura H (2007) A system architecture for ubiquitous tracking environments. In: IEEE International Symposium on Mixed and Augmented Reality (ISMAR’07), Nara, Japan [56] Hightower J, Borriello G (2001) Location systems for ubiquitous computing. In: Dimitrova N (ed) Trends, Technologies & Applications in Mobile Computing, IEEE Computer Society, special report 5, pp 57–66 [57] Huber M, Pustka D, Keitler P, Echtler F, Klinker G (2007) A system architecture for ubiquitous tracking environments. In: IEEE International Symposium on Mixed and Augmented Reality (ISMAR’07), Nara, Japan [58] IETF (2007) Session Initiation Protocol. http://www.cs.columbia.edu/sip/ [59] IJsselsteinjn WA (2005) 3D Video Communication, John Wiley & Sons, chap History of telepresence, pp 7–21 [60] Jeronimo M, Weast J (2003) UPnP design by example - software developers guide to Universal Plug and Play. Intel Press [61] Johnson DH, Dungeon DE (1993) Array Signal Processing: Concepts and Techniques. Prentice Hall, Englewood Cliffs, NJ, USA [62] Kauff P, Schreer O (2005) 3D Video Communication, John Wiley & Sons, chap Immersive videoconferencing, pp 76–90 [63] Kellermann W (1997) Strategies for combining acoustic echo cancellation and adaptive beamforming microphone arrays. In: P. Int. Conf. Acoustics, Speech, Signal Processing (ICASSP’97), Munich, Germany, pp 219–222 [64] Kendall G, Martens W (1984) Simulating the cues of spatial hearing in natural environments. In: Proc. International Computer Music Conf., Paris, pp 111–125
[65] Kirkeby O, Nelson PA, Orduna-Bustamante F, Hamada H (1996) Local sound field reproduction using digital signal processing. J Acoust Soc Am 100:1584–1593 [66] Kleijn WB, Paliwal KK (eds) (1995) Speech Synthesis and Coding. Elsevier Science Publ. [67] Klinner III KV, Walker DB (2001) Building a telephone/voice portal with java. Java developers journal pp 74–82 [68] Knapp ML, Hall JA (2006) Nonverbal communication in human interaction, 6th edn. Thomson Wadsworth [69] Komiyama S, Morita A, Kurozumi K, Nakabayshi K (1991) Distance control system for a sound image. In: AES 9th Int. Conf., Detroit, USA [70] Kubota A, Smolic A, Magnor M, Tanimoto M, Chen T, Zhang C (2007) Multiview imaging and 3DTV. IEEE Signal Processing Magazine 10:10–21 [71] Kuriyama J, Furukawa Y (1989) Adaptive loudspeaker system. J Audio Eng Soc 37(11):919–926 [72] Langheinrich M (2005) Personal privacy in ubiquitous computing: Tools and system support. PhD thesis, Swiss Fed. Inst. Tech. Zurich, Zurich, Switzerland [73] Larsen E, Aarts RM (2004) Audio Bandwidth Extension: Application of Psychoacoustics, Signal Processing and Loudspeaker Design. Wiley [74] van Leest AJ (2005) On amplitude panning and asymmetric loudspeaker setups. In: AES 119th Convention preprint 6613, New York, USA [75] Markopoulos P, IJsselsteijn WA, Huijnen C, de Ruyter B (2005) Sharing experiences through awareness systems in the home. Interacting with Computers 17 pp 506–521 [76] Mermelstein P (1988) G.722, a new CCITT coding standard for digital transmission of wideband audio signals. IEEE Communications Magazine 26:8– 15 [77] Mershon D, King E (1975) Intensity and reverberation as factors in the auditory perception of egocentric distance. Percept Psychophys 18:409–415 [78] Milgram P, Kishino F (1994) A taxonomy of mixed reality visual display. IEICE Trans Information and Systems E77-D(12):1321–1329 [79] Mourjopoulos J (1988) Digital equalization methods for audio systems. In: Proc. 84th Conv. Audio Engineering Society [80] Muqaibel A, Safaai-Jazi A, Bayram A, Attiya AM, Riad SM (2005) Ultrawideband through-the-wall propagation. IEE Proc Microwaves, Antennas and Propagation 9:581–588 [81] Nielsen JL (1996) Maximum-length sequence measurement of room impulse responses with high level disturbances. In: AES 100th Convention Preprint 4267, Copenhagen, Denmark [82] Olgren CH, Parker LA (1983) Teleconference technology and applications. Artech House, Inc., Dedham, MA, USA [83] Patwari N, Ash JN, Kyperountas S, III AOH, Moses RL, Correal NS (2005) Locating the nodes. IEEE Signal Process Mag 22(4):54–69
[84] Peretti P, Palestini L, Cecchi S, Piazza F (2008) A subband approach to wavedomain filtering. In: Proc. HSCMA’2008, Trento, Italy, pp 204–207 [85] Pompei FJ (1998) The use of airborne ultrasonics for generating audible sound beams. In: 105th AES Conv. Preprint 4853, San Francisco, USA [86] Pulkki V (1997) Virtual source positioning using vector base amplitude panning. J Audio Eng Soc 45(6):456–466 [87] Raykar VC, Kozintsev IV, Lienhart R (2005) Position calibration of microphones and loudspeakers in distributed computing platforms. IEEE Trans Audio and Speech Processing 13(1):70–83 [88] Reed MJ, Hawksford MO, Hughes P (2005) Acoustic echo cancellation for stereophonic systems derived from pairwise panning of monophonic speech. Proc IEE Vis Image Signal Process 152(1):122–128 [89] Rodden T (1996) Populating the application: A model of awareness for cooperative applications. In: Proc. ACM 1996 Conf. Computer Supported Cooperative Work, Boston, MA, USA, pp 87–96 [90] Rosegrant T, McCroskey JC (1975) The effects of race and sex on proxemic behavior in an interview setting. The Southern Speech Communication Journal 40:408–419 [91] Roussel N, Evans H, Hansen H (2004) Proximity as an interface for video communication. IEEE Multimedia 11(3):12–16 [92] de Ruyter B, Aarts E (2005) Ambient intelligence: visualising the future. In: Proc. Advanced Visual Interfaces (ACM) (AVI 2005) [93] Schmalenstroeer J, Leutnant V, Haeb-Umbach R (2007) Amigo context management service with applications in ambient communication scenarios. In: Proc. European conference on Ambient intelligence (AMI’07), Darmstad, Germany [94] Schobben DWE, Sommen PCW (1999) A new algorithm for joint blind signal separation and acoustic echo canceling. In: Proc. 5th Int. Symp. Signal Processing App., Brisbane, Australia, pp 889–892 [95] Schreer O, Englert R, Eisert P, Tanger R (2008) Real-time vision and speech driven avatars for multimedia applications. IEEE Trans Multimedia 10(3):352–360 [96] Sellen A, Buxton B (1992) Using spatial cues to improve teleconferencing. In: Proc. CHI’92 [97] Short J, Williams E, Christie B (1976) The social psychology of telecommunications. John Wiley & sons [98] Shynk JJ (1992) Frequency-domain and multirate adaptive filtering. IEEE Signal Processing Magazine pp 14–37 [99] Snow WB (1934) Auditory perspective. Bell Laboratories Record 12(7):194– 198 [100] Sommer R (1961) Leadership and group geometry. Sociometry 24:99–110 [101] Sondhi MM (1967) An adaptive echo canceller. Bell Syst Tech J XLVI:497– 510 [102] Spanias AS (1994) Speech coding: a tutorial review. Proc IEEE 82(10):1541– 1582
[103] Strybel T, Perrott D (1984) Discrimination of relative distance in the auditory modality: The success and failure of the loudness discrimination hypothesis. J Acoust Soc Am 76:318–320 [104] Tadakuma R, Asahara Y, Kawakami HKN, Tachi S (2005) Development of anthropomorphic multi-d.o.f. master-slave arm for mutual telexistence. IEEE Transactions on Visualization and Computer Graphics 11(6):626–636 [105] Tamura H, Yamamoto H, Katayama A (2001) Mixed reality: Future dreams seen at the border between real and virtual worlds. IEEE Computer Graphics and Applications 21(6):64–70 [106] Triki M, Slock DTM (2008) Robust delay-predict equalization for blind SIMO channel dereverberation. In: Proc. HSCMA’2008, Trento, Italy, pp 248–251 [107] Van Veen BD, Buckley KM (1988) Beamforming: A versatile approach to spatial filtering. IEEE ASSP Magazine 5(2):4–24 [108] Vary P, Martin R (2006) Digital Speech Transmission: Enhancement, Coding and Error Concealment. John Wiley & Sons, Chichester, England [109] Vossiek M, Wiebking L, Gulden P, Wieghardt J, Hoffmann C, Heide P (2003) Wireless local positioning. IEEE Microwave Magazine pp 77–86 [110] Ward DB, Abhayapala TD (2001) Reproduction of plane-wave sound field using an array of loudspeakers. IEEE Trans Speech, Audio Processing 9(6):697–707 [111] Watanabe K, Murakami S, Ishikawa H, Kamae T (1985) Audio and visual augmented teleconferencing. Proc IEEE 73(4) [112] Wehr S, Kozintsev I, Lienhart A, Kellermann W (2004) Synchronization of acoustic sensors for distributed ad-hoc audio networks and its use for blind source separation. In: Proc. IEEE Sixth Int. Symp. Multimedia Software Eng. (ISMSE’04) [113] Xiao L, Chen Z, Feng C, Liu L, Bai ZQ, Wang Y, Qian L, Zhang Y, Li Q, Jiang K, Fan S (2008) Flexible, stretchable, transparent carbon nanotube thin film loudspeakers. Nano Letters [114] Yankelovich N, Simpson N, Kaplan J, Provino J (2007) Porta-person: Telepresence for the connected conference room. In: Proc. CHI’2007, San Jose, CA, USA [115] Yasuda S, Hashimoto S, Koizumi M, Okude N (2007) Teleshadow: Shadow lamp to feel other presense. In: Proc. SIGGRAPH’2007, San Diego, CA, USA [116] Yoneyama M, Fujimoto JI (1983) The audio spotlight: An application of nonlinear interaction of sound waves to a new type of loudspeaker design. J Acoust Soc Am 73(5):1532–1536
Part VII
Applications
Smart Environments for Occupancy Sensing and Services

Susanna Pirttikangas, Yoshito Tobe, and Niwat Thepvilojanapong
1 Overview

The term smart environment refers to a physical space enriched with sensors and computational entities that are seamlessly and invisibly interwoven. A challenge in smart environments is to identify the location of users and physical objects. A smart environment provides location-dependent services by utilizing the obtained locations. In many cases, estimating location depends on received signal strength or on the relative location of other sensors in the environment. Although the devices employed for location detection are evolving, identification of location is still not accurate. Therefore, in addition to the devices and the physical phenomena they exploit, algorithms that enhance the accuracy of location estimates are important. Furthermore, other aspects of utilizing location information need to be considered: who is going to name important places, and how are the name ontologies used? This chapter introduces detecting devices, algorithms that enhance accuracy, and some prototypes of smart environments. Many interesting open research topics can be identified within the smart environment framework.
2 Detecting Devices

Devices that detect location utilize some kind of physical measurement. The means of measurement range from wireless radio to vision and pressure.

Susanna Pirttikangas, University of Oulu, Oulu, Finland, e-mail: [email protected]
Yoshito Tobe, Tokyo Denki University, Tokyo, Japan, e-mail: [email protected]
Niwat Thepvilojanapong, Tokyo Denki University, Tokyo, Japan, e-mail: [email protected]
Fig. 1 ActiveBadge (from http://www.cl.cam.ac.uk/research/dtg/attarchive/thebadge.html).
2.1 Infrared

Infrared communication cannot be used in many environments: it spans very short distances, cannot penetrate walls or other obstructions, and works only in a direct "line of sight." However, where these shortcomings are not obstacles, infrared can be used for localization. An infrared indoor location system, the Active Badge, was developed by AT&T Cambridge (Fig. 1) [59]. It consists of a cellular proximity system that uses diffuse infrared technology. Each person the system can locate wears a small infrared badge. A central server collects badge data from fixed infrared sensors around the building. As with any diffuse infrared system, Active Badges work poorly in locations with fluorescent lighting or direct sunlight because of the spurious infrared emissions these light sources generate. Diffuse infrared radiation has an effective range of several meters, which limits cell sizes to small or medium-sized rooms.
2.2 Ultrasonic Sound

Ultrasonic sound is in the range 20–100 kHz and can be used for localization. Several localization systems have used ultrasonic sound.
2.2.1 Active Bat

AT&T's researchers have developed the Active Bat location system (Fig. 2) [61, 60, 19, 2], which uses an ultrasound time-of-flight lateration technique to provide more accurate physical positioning than Active Badges. The Active Bat system is based on the principle of trilateration – position finding by measurement of distances (the better-known principle of triangulation refers to position finding by measurement of angles). A short pulse of ultrasound is emitted from a transmitter (a Bat) attached
Fig. 2 A ceiling receiver (left) and a Bat (right). Inset: An installed receiver (from [18])
to the object to be located, and the times-of-flight of the pulse to receivers mounted at known points on the ceiling are measured. The speed of sound in air is known, so the distances from the Bat to each receiver can be calculated – given three or more such distances, the 3D position of the Bat can be determined (and hence that of the object on which it is mounted). By finding the relative positions of two or more Bats attached to an object, its orientation can be calculated. Furthermore, some information about the direction in which a person is facing can be deduced, even if they carry only a single Bat, by analysing the pattern of receivers that detected ultrasonic signals from that transmitter and the strength of the signal they detected. Figure 2 shows a ceiling receiver and a Bat , which may be attached to objects or carried by personnel. Bats measure 7.5cm × 3.5cm × 1.5 cm, and are powered by a single 3.6V Lithium Thionyl Chloride cell, which has a lifetime of around fifteen months. Each Bat has a unique 48-bit code and is linked to the fixed location system infrastructure by a bidirectional 433MHz radio link. Bats have two buttons, two LEDs, and a piezo speaker, allowing them to be used as ubiquitous input and output devices, and a voltage monitor that allows their battery status to be interrogated remotely.
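With four or more receivers the range equations can be linearized by subtracting one of them from the others, which turns position estimation into a linear least-squares problem; with exactly three receivers the nonlinear equations must be solved directly. The sketch below is a generic formulation of this computation, not the production Active Bat code.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed known

def ranges_from_tof(times_of_flight):
    """Convert ultrasound times-of-flight (seconds) to distances (metres)."""
    return SPEED_OF_SOUND * np.asarray(times_of_flight, dtype=float)

def trilaterate(receivers, ranges):
    """Least-squares 3D position from four or more receiver positions and ranges.

    receivers: (n, 3) array of known ceiling receiver coordinates
    ranges:    (n,) array of measured distances to the Bat
    """
    p = np.asarray(receivers, dtype=float)
    r = np.asarray(ranges, dtype=float)
    # Subtract the first range equation from the others to remove the quadratic terms.
    A = 2.0 * (p[1:] - p[0])
    b = (r[0] ** 2 - r[1:] ** 2) + np.sum(p[1:] ** 2, axis=1) - np.sum(p[0] ** 2)
    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos
```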
2.2.2 Cricket

Complementing the Active Bat system, the Cricket Location Support System [52, 45, 46] uses ultrasound emitters to create the infrastructure and embeds receivers in the object being located. This approach forces the objects to perform all their own triangulation computations. Cricket uses a combination of RF and ultrasound technologies to provide location information to attached host devices. Wall- and ceiling-mounted beacons placed throughout a building publish information on an RF channel. With each RF advertisement, the beacon transmits a concurrent ultrasonic
Fig. 3 A cricket device (from http://cricket.csail.mit.edu/)
pulse. Listeners attached to devices and mobiles listen for RF signals, and upon receipt of the first few bits, listen for the corresponding ultrasonic pulse. When this pulse arrives, the listener obtains a distance estimate for the corresponding beacon by taking advantage of the difference in propagation speeds between RF (speed of light) and ultrasound (speed of sound). The listener runs algorithms that correlate RF and ultrasound samples (the latter are simple pulses with no data encoded on them) and pick the best correlation. Even in the presence of several competing beacon transmissions, Cricket achieves good precision and accuracy quickly. In addition to determining spaces and estimating position coordinates, Cricket provides an indoor orientation capability via the Cricket compass. A Cricket listener connects to the host device via an RS232 serial connection. The Cricket beacon and listener are identical hardware devices (see Figure 3). A Cricket unit can function as either a beacon or a listener, or it can be used in a “mixed” mode in a symmetric location architecture (which may be appropriate in some sensor computing scenarios), all under software control. A variety of sensors can be attached to a Cricket device via the 51-pin connector on the Cricket. There are prototypes of Crickets with a Compact Flash (CF) interface, which may be a more convenient way to attach to handhelds and laptops than the RS232 interface.
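Since the RF advertisement propagates at the speed of light, its travel time over room-scale distances is negligible, and the delay of the ultrasonic pulse relative to the RF reception essentially gives the range. A small sketch of that computation (the function name and inputs are illustrative):

```python
SPEED_OF_SOUND = 343.0   # m/s
SPEED_OF_LIGHT = 3.0e8   # m/s

def cricket_range(t_rf_arrival, t_us_arrival):
    """Distance to a beacon from the arrival times (seconds) of its RF and ultrasound signals."""
    dt = t_us_arrival - t_rf_arrival
    # Exact form: dt = d/v_sound - d/v_light  =>  d = dt / (1/v_sound - 1/v_light)
    return dt / (1.0 / SPEED_OF_SOUND - 1.0 / SPEED_OF_LIGHT)
```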
2.2.3 DOLPHIN

Inherently, a trilateration-based positioning system such as the Active Bat and Cricket systems requires precisely positioned references. Since an ultrasonic signal can usually propagate less than five meters and does not penetrate obstacles, a large number of references are required to locate objects in an actual indoor environment. To overcome the configuration cost of an indoor location system, Minami et al. [36] designed and implemented a fully distributed indoor ultrasonic positioning system called DOLPHIN (Distributed Object Localization System for Physical-space Internetworking)
that locates various indoor objects based on a distributed positioning algorithm similar to iterative multilateration. In the positioning algorithm, the nodes in the system play three different roles in a sequence: there is one master node, one transmitter node, and the rest are receiver nodes. A node that has determined its position can become a master node or a transmitter node. In every positioning cycle, the master node sends a message via an RF transceiver for time synchronization of all nodes. When a transmitter node receives the message, it sends an ultrasonic pulse. At the same time, each receiver node starts its internal counter and stops it as soon as the ultrasonic pulse arrives. The receiver nodes that received the ultrasonic pulse then compute the distance to the transmitter node. If a receiver node measures a sufficient number of distances, it computes its position based on multilateration.
2.3 FM Radio

Frequency Modulation (FM) radio can also be utilized for localization. The motivation for this method comes from a new class of smart personal objects that receive digital data encoded in regular FM radio broadcasts. Krumm et al. [30] propose RightSPOT, a simple algorithm for computing the location of a device based on signal strengths from FM radio stations. RightSPOT uses a vector of radio signal strengths taken from different frequencies to identify location. Each time a location is to be inferred, the device scans through a set of FM frequencies and records the signal strength of each one. The Bayesian classification algorithm does not use absolute signal strengths, but instead uses a ranking of signal strengths to help ensure robustness across devices and other variables. This classification is motivated by the assumption that different locations will show different relative signal strengths.
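A toy version of the ranking idea: each candidate location stores a typical signal-strength ranking of the scanned frequencies, and a new scan is assigned to the location whose stored ranking it matches most closely. This sketch only illustrates the principle and is not the Bayesian classifier of [30].

```python
import numpy as np

def rank_vector(signal_strengths):
    """Ranks of the scanned frequencies, strongest first (0 = strongest)."""
    order = np.argsort(-np.asarray(signal_strengths))
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(order))
    return ranks

def classify_scan(scan, location_profiles):
    """Pick the location whose stored rank profile best matches the new scan."""
    scan_ranks = rank_vector(scan)
    best, best_cost = None, None
    for name, profile_ranks in location_profiles.items():
        cost = np.sum(np.abs(scan_ranks - np.asarray(profile_ranks)))  # rank distance
        if best_cost is None or cost < best_cost:
            best, best_cost = name, cost
    return best
```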
2.4 WiFi

In general [21], [13], location estimation using radio frequency (RF) is done by building a radio map that tabulates the signal strength values sent by the access points in the environment. In order to provide accurate results, this tabulation needs to be done with a very high density, every few square meters. With this map, a statistical or deterministic prediction model for the area of interest is built [63]. Location can then be estimated by utilizing the Euclidean distance to calculate the similarity of two radio scans [4]. An excellent reference list on the subject is provided in [32]. Another problem is locating small devices in a sensor network using reference sensors that have a predetermined location. In that case, the locations of the devices need to be obtained recursively.
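A minimal nearest-neighbour sketch of fingerprint matching: the radio map stores one signal-strength vector per surveyed position, and a new scan is assigned the position of the closest stored vector in the Euclidean sense. Deployed systems use denser maps and statistical models [63]; the data layout here is an illustrative assumption.

```python
import numpy as np

def locate(scan, radio_map):
    """Return the surveyed (x, y) whose stored scan is closest to the new scan.

    scan:      vector of signal strengths for a fixed list of access points
    radio_map: dict mapping (x, y) -> stored signal-strength vector
    """
    scan = np.asarray(scan, dtype=float)
    best_pos, best_dist = None, np.inf
    for pos, fingerprint in radio_map.items():
        d = np.linalg.norm(scan - np.asarray(fingerprint, dtype=float))
        if d < best_dist:
            best_pos, best_dist = pos, d
    return best_pos
```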
2.5 Ultra-Wideband

Ultra-wideband (UWB) radio is a relatively new technology [16, 51] that utilizes electromagnetic pulses of very short duration (sub-nanosecond) and a wide bandwidth (several gigahertz). A transmitter emits such a pulse, and receivers accurately measure the local time-of-arrival of the pulse with high resolution and accuracy in real-time using a microwave hardware design. Such resolution and accuracy are needed for positioning at indoor (sub-meter) scales since radio pulses propagate at very close to the speed of light. To date, UWB positioning systems have typically employed one of two architectures: (i) a set of stationary receivers that are tightly synchronized by wire estimate the local time-differences-of-arrival of a pulse emitted by a mobile tag; or (ii) pairs of wireless nodes (such as those in an ad hoc sensor network) measure the round-trip times-of-flight of UWB pulses to estimate inter-node ranges. These time-differences-of-arrival or ranges are then fed into an appropriate location algorithm (such as multi-lateration using nonlinear regression, or a Kalman filter) to produce a location estimate for the mobile tag or sensor nodes. In open environments without obstructions that significantly attenuate or reflect the UWB pulses, accuracies of several tens of centimeters are easily achievable. UWB positioning pulses can propagate through internal building walls without much attenuation, and UWB receivers can incorporate an array design to additionally estimate the angle-of-arrival (azimuth and elevation) of the incoming pulse. As a result, a much lower sensor density may be required to get an accurate positioning result, compared with other similarly fine-grained positioning technologies. However, this accuracy may not satisfy all applications (such as those involving real-time user interfaces or augmented/virtual reality), and multipath fading remains a challenge, particularly for indoor home, office and industrial environments [33].
2.6 Vision

A person can be tracked by cameras. As presented in [9], a person is tracked in crowded and/or unknown environments using multi-modal integration. This combines stereo, color, and face detection modules into a single robust system. Dense, real-time stereo processing is used to isolate users from other objects and people in the background. Skin-hue classification identifies and tracks likely body parts within the silhouette of a user. Face pattern detection differentiates and localizes the face within the identified body parts. Faces and bodies of users are tracked over several temporal scales: short-term (user stays within the field of view), medium-term (user exits/reenters within minutes), and long-term (user returns after hours or days).
2.7 Pressure

Smart floor techniques provide location estimation by utilizing embedded pressure sensors that capture footsteps. These systems use the data for position tracking, distinguishing between robots, animals, and humans, recognizing gender, classifying the correctness of sport activity, evaluating different gait-related injuries in medical studies, and pedestrian biometric identification. The main advantage of these systems is that the user does not need to carry or wear a device or tags. However, installation of the floor is time-consuming and expensive. In addition to the Active Badge system, the same researchers developed the Active Floor [1]. This was one of the first smart floor systems and it was built from load cells composed of L168 aluminum alloy. The compressive load on the cell is measured by a 350 Ω full bridge strain gauge and the measured quantity is called the ground reaction force (GRF). The prototype of the active floor consisted of a 4x4 array of load cells. The active floor was constructed over normal flooring. Similar floor cells were utilized in [41]. In these experiments, Hidden Markov models and nearest-neighbor classification were used to identify and track persons on the active floor. A recent study that utilizes the GRF sensor is presented in [47]. In this study, support vector machines are the main technique for analyzing the data the floor produces. The EMFi floor is a pressure-sensitive floor that has been installed under the normal flooring of the University of Oulu's robotics research laboratory. The material, Electro-Mechanical Film [42] (EMFi), is a thin, flexible, low-price electret material that consists of cellular, oriented polypropylene film coated with metal electrodes. In the EMFi's manufacturing process, a special voided internal structure is created in the polypropylene layer, which makes it possible to store a large permanent charge in the film with a corona method using electric fields that exceed the EMFi's dielectric strength. An external force affecting the EMFi surface causes a change in the film's thickness, resulting in a change in the charge between the conductive metal layers. This charge can then be detected as a voltage. The EMFi floor in the laboratory is constructed of 30 vertical and 34 horizontal EMFi sensor stripes, 30 cm wide each, which are placed under the laboratory's normal flooring (see Figure 4). Thus the stripes form a 30x34 matrix with a cell size of 30x30 cm. Each stripe out of a total of 64 produces a continuous signal that is sampled at a rate of 100 Hz and streamed into a PC, where the data can be analyzed in order to detect and recognize the pressure events, such as footsteps, affecting the floor. This floor has been utilized to identify persons, distinguish between a robot and a human, detect abnormal events on the floor, and track people walking on the floor [57], [28]. Simple ON/OFF switch sensors have been used in [64]. This Ubifloor recognizes users with a multilayer perceptron. The sizes of the installations range from small constellations comprised of a few cells to big rooms equipped with sensors [56], [38]. Markov chains, support vector machines, and combined classifiers have been utilized in the different tasks mentioned at the beginning of this section.
Fig. 4 Setup of EMFi sensor stripes under the University of Oulu laboratory's normal flooring.
In many cases, not only one footstep but several consecutive footsteps need to be taken into account in making the final decision on the identity or location of the person.
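Because the stripes form orthogonal rows and columns, a single footstep can, in the simplest case, be located at the intersection of the vertical and horizontal stripes with the strongest responses in a short analysis window. The sketch below follows that idea; the threshold, the window handling, and the names are hypothetical and not taken from the Oulu system.

```python
import numpy as np

def locate_footstep(vertical, horizontal, threshold=0.05):
    """Locate one footstep on a stripe-matrix floor.

    vertical:   (n_v, window) signal window for each vertical stripe
    horizontal: (n_h, window) signal window for each horizontal stripe
    Returns the (column, row) cell index, or None if no footstep is detected.
    """
    v_energy = np.sum(np.square(vertical), axis=1)
    h_energy = np.sum(np.square(horizontal), axis=1)
    if v_energy.max() < threshold or h_energy.max() < threshold:
        return None                      # no pressure event in this window
    return int(np.argmax(v_energy)), int(np.argmax(h_energy))
```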
3 Estimation Algorithms

Pervasive computing systems seek to maintain current information about the status of users, objects, and environments, or other relevant data about the situation. Occupancy sensing aims at providing abstract information about the users' places, rather than locations as coordinates. Furthermore, one must consider whether the goal is to determine the movement of a person or object, the static position of a person or object, or the relative distance between two objects, all depending on the application.
3.1 Location Estimation

The location-awareness of ubicomp applications is a key element in providing calm and smart services for users. Indeed, location information is the most intuitively and easily exploitable of all the context variables one could think of.
Naturally, exploitation starts with creating a locating system by providing the environment and the users with sensors and by developing algorithms for estimating the location of the different actuators in the environment. When location is measured, the measurement needs to be processed to remove artifacts and noise, whatever the sensing apparatus is. This processing can be called smoothing or averaging of the location estimates. While simple averaging (geometric mean) of measurements within a certain time window might be a good solution in some cases, there is a constant need for multi-sensor fusion and more sophisticated techniques in pervasive computing. Unfortunately, in the real world, things are under constant change. Therefore, the algorithms we utilize need to be dynamic in nature, or the system will need to be modified and calibrated frequently. For example, when trying to locate a user who is using a wireless LAN, one needs to take into account the fact that the signals are prone to environmental variations and that the terminals might not work optimally in terms of battery life or local resources of the device. The signal strength is not the same in different locations, even at the same distance from the access points, due to obstructions and reflections in the environment.
3.2 WiFi Locationing

Three techniques and different metrics for evaluating the characteristics of 802.11 location estimation are described in Table 1 (freely modified from [32]).
3.3 GSM cells and GPS

In the previous sections we described location estimation using WiFi. A second widely used method for locating utilizes GSM cell-network information. According to [20], the most popular algorithm for estimating location from GSM cells is a particle filter variation utilizing point estimation that places the estimate at the latest measurement. A good comparison of algorithms for location estimation with cellular network data in different settings is presented in [8]. Locating with the NAVSTAR Global Positioning System (GPS) represents very advanced technology and is utilized in many applications, the most natural being vehicle navigation. The technique was developed for the U.S. military [15]. To get an accurate location, one needs visibility to at least four satellites, and the basic algorithm determines the position of a GPS receiver by utilizing Euclidean distances [37]. The satellites send signals and the positioning is done on the receiver devices. It is possible, with different techniques, to specify the location of a receiver with an accuracy of a few millimeters [55]. There are drawbacks in using GPS, as it requires a clear view to at least four satellites at all times for the GPS receiver to estimate the position.
Table 1 Metrics for evaluating three different WiFi location estimation methods (from [32]).

Technology:
- 802.11 signal-strength fingerprinting
- 802.11 signal-strength modeling
- 802.11 proximity

Accuracy:
- Fingerprinting: 2D coordinates with 1-3 m median accuracy
- Modeling: 2D coordinates with 10-20 m median accuracy
- Proximity: location accuracy dependent on AP density

Coverage:
- Fingerprinting: building to campus scale; requires 802.11 coverage and a radio map; best accuracy achieved when 3+ APs are visible
- Modeling: areas with 802.11 coverage and a radio map; best accuracy achieved when 3+ APs are visible
- Proximity: anywhere with 802.11 coverage and an AP location map

Infrastructure cost:
- Fingerprinting: no additional infrastructure needed beyond 802.11 APs; creating a radio map is time-intensive and new/removed APs require remapping
- Modeling: no additional infrastructure needed beyond 802.11 APs; creating radio maps is less work than for fingerprinting
- Proximity: no additional infrastructure needed beyond 802.11 APs

Per-client cost (all three): software-only solution for devices with 802.11 NICs

Privacy (all three): very good when localization is performed on the client, but poor when localization is performed in the infrastructure

Well-matched use cases:
- Fingerprinting: asset and personnel tracking in indoor environments, indoor mapping/navigation/tour guides
- Modeling: social networking, tour guides, indoor/outdoor navigation/tour guides, fitness/activity tracking
- Proximity: outdoor tour guides, nearby resource advertisement, activity tracking
This kind of system preserves privacy, as the individual receiver makes the location estimation, and it also assures worldwide coverage for all users. A drawback is that the GPS signal sent from the satellites does not penetrate walls, soil, or water, and therefore it cannot be used indoors, underground, or under water. In cities, tall buildings affect coverage.
3.4 Bayes Filtering

To fuse location measurements from various sources and remove measurement noise, the most widely applied methods are Bayesian filters. Bayesian filters use noisy measurements to estimate a system's state in a probabilistic way. Usually the angle and distance between the observed object and the observer are estimated. A good overview of Bayesian techniques for location estimation can be found in [14].
3.4.1 Kalman Filtering

The Kalman filter is an efficient and easily implementable algorithm for estimating and predicting the system's state S(n + 1). The system's state is estimated from measurements z(n), according to a measurement model and a state model. The Kalman filter is especially suitable for tracking [26], [6], [44], [53]. In addition, the Extended Kalman Filter (EKF) is also used to solve the tracking problem. There are three dynamical EKF models [65] for tracking, based on the number of internal states estimated for the target: the Position (P) model, the Position-Velocity (PV) model, and the Position-Velocity-Acceleration (PVA) model. As one would expect, the PVA model achieves the best tracking results.
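A compact sketch of the PV model for one spatial dimension, with noisy position measurements z(n) and a constant-velocity state model. The noise covariances are illustrative placeholders.

```python
import numpy as np

def kalman_pv_track(measurements, dt=0.1, q=0.01, r=1.0):
    """Track 1D position and velocity from noisy position measurements."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (constant velocity)
    H = np.array([[1.0, 0.0]])              # only position is observed
    Q = q * np.eye(2)                       # process noise covariance
    R = np.array([[r]])                     # measurement noise covariance
    x = np.zeros(2)                         # state: [position, velocity]
    P = np.eye(2)                           # state covariance
    track = []
    for z in measurements:
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # Update
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.array([z]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        track.append(x.copy())
    return np.array(track)
```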
3.5 Particle Filtering

Particle filters, a special form of Bayesian filters, estimate the location of a user at time t using a collection of weighted particles (p_t^i, w_t^i), i = 1, ..., n. Each particle p_t^i represents a hypothesis of the user's position, and the weight w_t^i is the likelihood that the user is at the location the particle represents. Particle filters have been used in robotics [17]; a very attractive property is that many different kinds of sensor measurements can be fused into the locating system. Furthermore, [20] show that the particle filter approach is as accurate as deterministic algorithms and is practical to implement and run on small devices.
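A compact sketch of the particle filter recursion described above (a generic textbook form, not the specific variant studied in [20]): particles are moved by a motion model, reweighted by the measurement likelihood, and resampled.

```python
import numpy as np

def particle_filter_step(particles, weights, z, motion_std=0.5, meas_std=1.0):
    """One step of a 1-D particle filter.
    particles: candidate positions p_t^i, weights: their likelihoods w_t^i,
    z: a noisy position measurement."""
    rng = np.random.default_rng()
    # Motion model: diffuse the particles (replace with a real dynamics model).
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # Measurement update: Gaussian likelihood of z given each particle.
    weights = weights * np.exp(-0.5 * ((z - particles) / meas_std) ** 2)
    weights = weights / weights.sum()
    # Resample particles in proportion to their weights (plain multinomial
    # resampling keeps the sketch short; systematic resampling is common).
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

particles = np.random.default_rng(1).uniform(0, 10, size=200)
weights = np.full(200, 1.0 / 200)
for z in [4.0, 4.2, 4.1, 4.3]:               # measurements near position 4
    particles, weights = particle_filter_step(particles, weights, z)
print(particles.mean())                       # estimate close to 4
```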
4 From Measurements to Meaning and Services

In this section we present different location-aware services in pervasive computing research. Location awareness means that a system, device, or object is aware of its own or its wearer's current location.
4.1 Location-Aware Reminders

In pervasive computing, the first applications that utilized location information were context-aware reminders. Their adoption by the larger community has been limited, however, because the underlying locating systems have been restricted to certain areas, such as campuses or office buildings. One of the first reminder systems was comMotion [35], which uses GPS to locate a mobile device and lets people attach reminders to specific locations; users were given reminders whenever they approached the location (within a certain user-set time limit). The authors of [11] developed CybreMinder, in which location is not the only trigger for a reminder; the main goal of that research was a toolkit [10] and a framework for easy development of context-aware applications. MemoClip is a location-based wearable reminder appliance fed by a distributed location beacon system [5]. Another wearable reminder application is the Sulawesi system, where the user is located by GPS and infrared signals [39]. Stick-e Notes [7] and Place-Its [54] are designed to resemble post-it notes placed in the environment; the device is located using a GPS-enabled PDA (Stick-e) or a Symbian Series 60 mobile phone (Place-Its). The advantage of Place-Its over Stick-e Notes is its ability to locate the device with GSM cell-tower information in addition to GPS.
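Functionally, all of these reminder systems reduce to a proximity test between the estimated position and a stored reminder location. The following minimal sketch (hypothetical reminder data, haversine distance on GPS coordinates) illustrates the triggering logic:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS-84 coordinates."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def due_reminders(position, reminders, radius_m=150):
    """Return the reminders whose stored location lies within radius_m of the
    current position estimate."""
    lat, lon = position
    return [r for r in reminders
            if haversine_m(lat, lon, r["lat"], r["lon"]) <= radius_m]

# Hypothetical example: one reminder attached to a grocery store.
reminders = [{"text": "Buy milk", "lat": 65.0593, "lon": 25.4663}]
print(due_reminders((65.0595, 25.4660), reminders))   # fires: roughly 25 m away
```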
4.2 A Scenario on Routine Learning

Once a location estimate that is satisfactory in some sense (e.g. in accuracy, speed, computational complexity, or cost) has been obtained, the information can be used to infer higher-level context. In a smart environment it is not enough that the system reacts to the user's behavior; it must also learn from the past and predict the next move. With the right algorithms, locations evolve into significant places, and from significant places one can find traces and paths and then predict the user's next moves. The scenario might be as follows (from [43]).

Marie buys a new mobile device. She is not very familiar with all the services the device has to offer, so it is mainly used as a phone. Marie also has an electronic calendar. She takes the phone everywhere she goes.
Fig. 5 A scenario of utilizing the user’s routines and significant places. (from [43])
After a week, the device asks Marie if she wants to get familiar with all the possibilities it has to offer. If Marie helps the device place her important locations on the map, she can, for example, start to use location-sensitive services. Marie has heard from her friends that these services are really convenient and wants to try them. The phone shows on a map the places Marie has frequently visited and asks her to name them. Marie names the places, and for some time she plays with the new navigation service, which is activated by clicking the names of the places in the calendar; the device then shows a route from her current location to the clicked place. Marie cannot believe her eyes when the device asks whether it should automatically go into silent mode when she enters her office's meeting room. This is something she has been waiting for! Marie has a meeting in half an hour at her office, and she is still at home. The device reminds her that it is now time to go. Marie goes to the meeting at the office. The device is silent.
4.2.1 The Significant Place

Typically, algorithms need a certain amount of data to learn the routines of a user. First, significant locations are recognized, and then one can start making assumptions about traces, paths, or sequences of places in the data. In section 4.2.3 we describe different experiments and the collected data. Interestingly, at the beginning of this line of research the significant place was not actually recognized from the measurements but from the lack of measurements: Marmasse [35] and Ashbrook [3] both utilize the GPS signal, which is lost when the user goes indoors. They estimate the time that the signal is out of reach
and also the distance between the spot where it disappeared and the spot where it came up again (to identify cases where the GPS has been switched off rather than taken into a building). An important requirement for a place-recognition algorithm is that it must handle the transition phase between two important places. If one uses a simple clustering algorithm, such as k-means (or a modification like the one in [3]) or a Gaussian Mixture Model [24], the resulting clusters include coordinates from the transition stretches. The algorithm should therefore filter out unimportant coordinates by dropping clusters in which little time is spent, as in [25]. Their algorithm forms a new cluster whenever the stream of coordinates moves away from the current cluster. This approach also bypasses a limitation of traditional clustering algorithms, namely that the number of clusters must be determined beforehand. Besides locating information, one can use sensors such as accelerometers or magnetic sensors [58], or bus schedules and stop locations [34], as additional information for determining important locations. A more recent development is the automatic identification of meaningful places from GPS data [40], using a fully automatic Bayesian approach that does not require any parameter tuning. The above algorithms rely on GPS signals for locating, but [31] used GSM-network cell information and developed their own algorithm to recognize important locations and routes. The problem with utilizing GSM cells is that the cells are large and a single cell may contain several significant locations; the authors state that their algorithms would work better with WiFi than with cell networks. BeaconPrint is a framework and a technical foundation for recognizing and labeling significant places from both GSM and 802.11 radio [22]. In their algorithms, the authors emphasize the importance of recognizing places that are visited infrequently or for short durations.
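A simplified sketch in the spirit of the time-based clustering in [25] follows; the thresholds and helper names are illustrative, not taken from that work. A new cluster is started whenever the coordinate stream moves away from the current cluster, and only clusters where enough time was spent are kept as significant places.

```python
import math

def extract_places(trace, dist_m=100.0, min_stay_s=600.0):
    """trace: list of (timestamp_s, x_m, y_m) fixes in a local metric frame.
    Returns (centre, dwell_time) pairs for clusters the user stayed in for at
    least min_stay_s seconds; transition coordinates are discarded."""
    def close_cluster(cluster, places):
        dwell = cluster[-1][0] - cluster[0][0]
        if dwell >= min_stay_s:
            cx = sum(p[1] for p in cluster) / len(cluster)
            cy = sum(p[2] for p in cluster) / len(cluster)
            places.append(((cx, cy), dwell))

    places, cluster = [], []
    for t, x, y in trace:
        if cluster:
            cx = sum(p[1] for p in cluster) / len(cluster)
            cy = sum(p[2] for p in cluster) / len(cluster)
            if math.hypot(x - cx, y - cy) > dist_m:
                close_cluster(cluster, places)   # left the cluster: keep it if dwell was long
                cluster = []
        cluster.append((t, x, y))
    if cluster:
        close_cluster(cluster, places)
    return places

# Toy trace: ~20 minutes at home, a short transition, ~15 minutes at the office.
trace = [(t, 0.0, 0.0) for t in range(0, 1200, 60)]
trace += [(1200 + i * 60, i * 200.0, 0.0) for i in range(1, 5)]     # moving
trace += [(1500 + t, 1000.0, 0.0) for t in range(0, 900, 60)]
print(extract_places(trace))   # two places: near (0, 0) and (1000, 0)
```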
4.2.2 The Routine

After the important locations have been identified, the user needs to name them or the system needs to geocode them. In a small-scale environment, such as an office, infrared or ultrasound beacons can carry the location names within the system, but this becomes too hard in larger-scale environments that utilize GSM cells or GPS; in such systems, labeling usually has to be done manually (see Fig. 6). Naming the places is a difficult problem, as naming conventions are personal and multi-dimensional. Once the places have been learned, paths can be predicted and the user's routines can be found from the data. A typical algorithm for user path prediction is the Markov model and its variations [12], [3], [35]. For example, in [3], significant locations were clustered from GPS data and a Markov model was formed for each location, with transitions to every other location. Each node in the Markov model was a location, and each transition carried the probability that the user travels between the two locations.
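A minimal sketch of such a first-order Markov predictor over named places follows (the place sequence is illustrative): transition counts are turned into probabilities, and the most probable successor of the current place is predicted.

```python
from collections import Counter, defaultdict

def build_markov_model(place_sequence):
    """Estimate first-order transition probabilities P(next | current)
    from a chronological sequence of visited places."""
    counts = defaultdict(Counter)
    for current, nxt in zip(place_sequence, place_sequence[1:]):
        counts[current][nxt] += 1
    return {p: {q: c / sum(nbrs.values()) for q, c in nbrs.items()}
            for p, nbrs in counts.items()}

def predict_next(model, current):
    """Most probable next place given the current one (None if unseen)."""
    if current not in model:
        return None
    return max(model[current], key=model[current].get)

visits = ["home", "office", "cafe", "office", "home",
          "office", "cafe", "office", "gym", "home"]
model = build_markov_model(visits)
print(model["office"])                 # {'cafe': 0.5, 'home': 0.25, 'gym': 0.25}
print(predict_next(model, "office"))   # 'cafe'
```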
Fig. 6 Office space where the important locations have been recognized from WiFi data and the places have been prompted from the user. (from [43])
An effort has also been made to recognize relative position with respect to other people in [12]. They utilized a standard Bluetooth-enabled mobile phone and recognized social patterns in daily user activity, inferring relationships, identifying socially significant locations, and modeling organizational rhythms.
4.2.3 Data Collection Tasks

Data collection must be long enough to capture the routines and the typical behavior of the users. In this section we discuss the different data sets collected in place-recognition experiments. In [3], four months of data were collected from a single user, resulting in hundreds of thousands of GPS coordinates to be classified into significant locations. A single-user data set was also collected in [40], but there the period was one year instead of four months. The researcher lived normally in the city of Enschede and recorded daily situations with a Nokia 6680 mobile phone and an external Bluetooth GPS receiver, resulting in 700 distinct location measurements (duplicates filtered).
They also had a second data set, collected in Innsbruck during Ubicomp 2007, for recognizing work- and tourism-related location traces. To develop BeaconPrint [22], one month of multi-sensor data (802.11 and GSM radios) was collected from three persons. In [31], GSM cellular data were collected from three users over a period of six months. The data-collection periods are long, but usually only a few persons have participated, because the required equipment has not been available in, for example, ordinary mobile phones. The situation is changing rapidly: many mobile devices can now measure location in several ways, so more statistically meaningful results can be achieved in future field tests. Field testing is expensive, since people need to be given new devices and instructions, and the experiment must be followed through and the results analyzed. However, the authors' view is that bringing technology to people is the only way to learn, to improve systems, and to make these technologies mainstream.
5 Platforms of Smart Environments

This section introduces some smart-environment platforms that have actually been built.
5.1 EasyLiving

The EasyLiving system [29], developed at Microsoft Research, is a smart environment that utilizes computer vision. It focuses on tracking people and on visual interaction between the system and its users (Figure 7). For example, if a user has a desktop (an interactive computing session) open on one set of devices and then moves to another suitably equipped location, the desktop moves to those devices. In addition, lights are turned on and off based on the user's location, and an appropriate set of audio speakers can be selected using knowledge of the locations of the speakers and the user. EasyLiving has four different sensor types that measure location and identity: cameras, pressure mats, a thumbprint reader, and a keyboard login system. 3-D stereo cameras measure the locations of foreground people-shaped blobs and report each position as a 3-coordinate tuple. The pressure mats periodically report their binary state, i.e., whether a person is sitting or standing on them. The thumbprint reader and the login system report once whenever someone uses them to identify themselves. The person-tracking module combines past tracking history, knowledge about where people are likely to appear in the room, pressure-mat measurements, and the most recent camera measurements to produce a continually updated database of estimates of where particular people are located.
Fig. 7 EasyLiving (from http://research.microsoft.com/easyliving/).
Fig. 8 AwareHome (from http://awarehome.imtc.gatech.edu/).
Since the camera sensors cannot determine identity, but can only keep track of the identity of a blob once it has been assigned, this layer combines measurements from the login sensors with the person-tracking information. Because the locations of the keyboard and the thumbprint reader are known, the person track can be assigned a unique identity when a person-identification event occurs. (The person-tracking system never needs to know the actual identity of the person it is tracking; all it knows is an ID number, which it assigns when the person arrives.) The output of the person-tracking module is a list of locations of people in the 2D plane.
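The identity-assignment step can be pictured as attaching the identity produced by a login event to the nearest anonymous track; the sketch below is a hypothetical simplification of that logic, not EasyLiving's actual code.

```python
import math

def assign_identity(tracks, sensor_position, identity, max_dist=1.5):
    """tracks: dict track_id -> {'pos': (x, y), 'identity': str or None}.
    When a keyboard/thumbprint login reports `identity` at `sensor_position`,
    attach it to the closest unidentified track within max_dist metres."""
    best_id, best_dist = None, max_dist
    for track_id, track in tracks.items():
        if track["identity"] is not None:
            continue
        d = math.dist(track["pos"], sensor_position)
        if d < best_dist:
            best_id, best_dist = track_id, d
    if best_id is not None:
        tracks[best_id]["identity"] = identity
    return best_id

tracks = {7: {"pos": (2.1, 3.0), "identity": None},
          9: {"pos": (5.5, 0.4), "identity": None}}
assign_identity(tracks, sensor_position=(2.0, 3.2), identity="id-042")
print(tracks[7]["identity"])   # 'id-042'
```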
5.2 Aware Home

The Georgia Institute of Technology has developed a platform called Aware Home for the home domain [27]. The research focuses on evaluating user experiences of domestic technologies against the background of health, education, entertainment, and usable security. The location of the user is detected with the following techniques.
• RFID floor-mat system. This system gives room-level information about people. Floor-mat antennas are placed at known locations in the home and each has
unique information about its location. The inhabitants have to wear a passive RFID tag below their knees, usually in their shoes, so that the floor-mat antenna identifies the person walking over it.
• Computer vision techniques. Sometimes room-level information is not enough and more detail is required, such as where in the room somebody is. Computer vision techniques are used in Aware Home to obtain the whereabouts of people in the home automatically. This technique does not identify people; it gives information about the location and orientation of moving objects. Cameras are installed in the ceiling: four in the kitchen, four in the living room, and six in the hallway.
• Audio techniques. Voice can be used to identify a person, as each individual has a unique voice. The system has two modes, a training mode and a test mode. In the training mode, the user's voice is recorded, characteristic parameters of the voice are extracted, and these are stored in a feature vector. In the test mode, a person is recognized by his or her voice: the voice is recorded, its characteristic parameters are extracted and compared with the stored feature vectors, and the closest match is taken to be the current speaker (a minimal matching sketch follows this list).
• Fingerprint detection. Fingerprint techniques are also used to identify people. Scanners are installed at door locks, with each scanner having a unique location in the home. The scanners read fingerprints and identify the person.
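The audio identification step mentioned in the list above amounts to a nearest-neighbour match between a test feature vector and the stored training vectors. The following toy sketch uses hypothetical feature values; real systems extract richer features (e.g. MFCCs) from the audio.

```python
import numpy as np

def identify_speaker(enrolled, test_vector):
    """enrolled: dict name -> stored voice feature vector (training mode).
    Returns the enrolled speaker whose feature vector is closest to the
    test vector (test mode, nearest-neighbour match)."""
    test = np.asarray(test_vector, dtype=float)
    return min(enrolled,
               key=lambda name: np.linalg.norm(np.asarray(enrolled[name]) - test))

# Hypothetical 4-dimensional voice feature vectors.
enrolled = {"alice": [1.0, 0.2, 3.1, 0.7],
            "bob":   [0.1, 1.9, 0.4, 2.2]}
print(identify_speaker(enrolled, [0.9, 0.3, 3.0, 0.8]))   # 'alice'
```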
5.3 PlaceLab House_n

PlaceLab House_n, developed by the Massachusetts Institute of Technology and TIAX LLC, is a residential condominium built from smart components [23]. Hundreds of sensing components installed in the home are used to monitor activity in the environment. Each of the interior components contains a microcontroller and a network of 25 to 30 sensors. Small, wired and wireless sensors are located on the objects that people touch and use, including cabinet doors and drawers, controls, furniture, passage doors, windows, and kitchen containers; they detect on-off, open-closed, and object-movement events. Radio-frequency devices permit identification and approximate position detection of people within the PlaceLab. Nine infrared cameras, nine color cameras, and 18 microphones are distributed throughout the apartment, in cabinet components and above working surfaces such as the office desk and kitchen counters. Eighteen computers run image-processing algorithms to select the four video streams and one audio stream that best capture an occupant's behavior, based on motion and the camera layout in the environment. The positioning accuracy of a GSM beacon-based locating system has also been studied in this environment.
Fig. 9 Place Lab House_n (from http://architecture.mit.edu/house_n/placelab.html).
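The stream-selection logic described above can be pictured as ranking the camera streams by a motion score and keeping the top few. The sketch below is a hypothetical simplification (plain frame differencing as the motion measure), not the PlaceLab implementation.

```python
import numpy as np

def select_streams(frames_now, frames_prev, k=4):
    """frames_now/frames_prev: dict camera_id -> grayscale frame (2-D array).
    Rank cameras by mean absolute frame difference (a crude motion score)
    and return the ids of the k most active streams."""
    scores = {cam: float(np.mean(np.abs(frames_now[cam].astype(float) -
                                        frames_prev[cam].astype(float))))
              for cam in frames_now}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy example: camera 3 sees motion, the others see a static scene.
rng = np.random.default_rng(0)
prev = {cam: rng.integers(0, 255, (120, 160), dtype=np.uint8) for cam in range(9)}
now = {cam: frame.copy() for cam, frame in prev.items()}
now[3] = rng.integers(0, 255, (120, 160), dtype=np.uint8)   # a changed frame
print(select_streams(now, prev))   # camera 3 ranked first
```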
5.4 Robotic Room
Fig. 10 Robotic Room (from http://www.ics.t.u-tokyo.ac.jp/index-j.html).
A group from the University of Tokyo has created the Robotic Room to identify a person's activity in order to support human living [48, 49, 50]. Functions for monitoring human respiration with a TV camera on the ceiling or with pressure sensors on the surface of the bed have been realized. A human-motion tracking system is based on a full body model and a bed fitted with pressure sensors. The full body model consists of a
skeleton and a surface model. The bed has 210 pressure sensors under the mattress. It can measure the pressure distribution image of a lying person. The lying person’s motion is tracked by considering potential energy, momentum, and the difference between the measured pressure distribution image and a pressure distribution image calculated from the full body model.
6 Conclusion

In this chapter we have presented different devices, techniques, and prototypes for detecting a person occupying a smart environment. The locating devices and algorithms most widely utilized in pervasive computing were listed, and tables were presented in which the different approaches are compared with each other. Occupancy sensing was examined broadly, with references to state-of-the-art technology.

One cannot say which locating method or device is perfect in every situation. The accuracy of the system is one of the main driving forces, but privacy issues also need to be considered. In many countries the law states that a person's location information cannot be utilized without permission. Furthermore, people do not want to be located everywhere and all the time. Fortunately, with current technologies, users can perform the locating themselves on the client side. Solutions for anonymous locating have been presented [32], but real privacy still needs to be pursued, as it is important that users feel safe when using the available services.

Another obvious matter to remember is that location is not free. For example, setting up a WiFi locating system requires installing several access points and manually calibrating a map of the area with the corresponding signal strengths, not to mention the costs of satellite locating systems. Administering the systems also costs money.

The services that occupancy sensing enables are vast, ranging from navigation and memory aids, automatic life-blogging, and shopping aids to military applications, gaming, and more. The next phase is systems that can accurately predict the users' next moves. Efforts have been made in this kind of behavioral analysis, but current systems still rely on present information rather than utilizing history and prediction. How to accurately classify what is happening now, where, why, and what happens next is still an open research problem.

Pervasive systems are becoming more and more complex. The authors' opinion is that combining several locating algorithms will provide a better solution than utilizing only a single technique. When location can be obtained anywhere, at any time, and by anyone, sensor fusion techniques will come out on top. One promising effort is [62], where a map of a building, a wearable acceleration sensor, and WiFi (particle filter locating) were combined to locate pedestrians. Such systems are more expensive now, but as they become more common and as proper software libraries for a wide variety
of locating sensors become available, increased accuracy and ease of use can be expected.
References

[1] Addlesee, Jones, Livesey, Samaria (1997) The ORL active floor. IEEE Personal Communications 4(5):35–41
[2] Addlesee M, Curwen R, Hodges S, Newman J, Steggles P, Ward A, Hopper A (2001) Implementing a sentient computing system. IEEE Computer Magazine 34(8):50–56
[3] Ashbrook, Starner (2003) Using GPS to learn significant locations and predict movement across multiple users. Pers Ubiquit Comput 7(5):275–286
[4] Bahl, Padmanabhan (2000) RADAR: an in-building RF-based user location and tracking system. In: IEEE INFOCOM, pp 775–784
[5] Beigl (2000) MemoClip: A location based remembrance appliance. In: 2nd International Symposium on Handheld and Ubiquitous Computing (HUC2000), Springer, pp 230–234
[6] Bernardin K, Gehrig T, Stiefelhagen R (2006) Multi- and single view multi-person tracking for smart room environments. In: CLEAR, pp 81–92
[7] Brown (1996) The stick-e document: a framework for creating context-aware applications. In: Proceedings of EP'96, Palo Alto, also published in EP-odd, pp 259–272, URL http://www.cs.kent.ac.uk/pubs/1996/396
[8] Chen, Sohn, Chmelev, Haehnel, Hightower, Hughes, LaMarca, Potter, Smith, Varshavsky (2006) Practical metropolitan-scale positioning for GSM phones. In: Proc. Ubicomp 2006
[9] Darrell T, Gordon G, Harville M, Woodfill J (1998) Integrated person tracking using stereo, color, and pattern detection. In: Conf. Computer Vision and Pattern Recognition, pp 601–608
[10] Dey A, Abowd G (2000) The context toolkit: Aiding the development of context-aware applications. In: Workshop on Software Engineering for Wearable and Pervasive Computing
[11] Dey AK, Abowd GD (2000) CybreMinder: A context-aware system for supporting reminders. In: HUC '00: Proceedings of the 2nd International Symposium on Handheld and Ubiquitous Computing, Springer-Verlag, London, UK, pp 172–186
[12] Eagle, Pentland (2006) Reality mining: Sensing complex social systems. Pers Ubiquit Comput 10:255–268
[13] Ekahau, Inc. (2002) Ekahau Positioning Engine 2.0. Tech. rep., Ekahau, Inc., Technology White Paper
[14] Fox, Hightower, Liao, Schulz, Borriello (2003) Bayesian filtering for location estimation. IEEE Pervasive Computing 2(3):24–33
[15] Getting (1993) The global positioning system. IEEE Spectrum 30(12):43–47
[16] Gezici S, Tian Z, Giannakis GB, Kobayashi H, Molisch AF, Poor HV, Sahinoglu Z (2005) Localization via ultra-wideband radios. IEEE Signal Processing Mag 22(4):70–84
[17] Gustafsson, Gunnarson, Bergman, Forssell, Jansson, Karlsson, Nordlund (2002) Particle filters for positioning, navigation and tracking. IEEE Trans Signal Processing 50:425–435
[18] Harle, Hopper (2005) Deploying and evaluating a location-aware system. In: MobiSys '05: Proceedings of the 3rd International Conference on Mobile Systems, Applications, and Services, ACM, New York, NY, USA, pp 219–232, DOI http://doi.acm.org/10.1145/1067170.1067194
[19] Harter A, Hopper A, Steggles P, Ward A, Webster P (1999) The anatomy of a context-aware application. In: Proceedings of the Fifth Annual ACM/IEEE International Conference on Mobile Computing and Networking (MOBICOM'99), Seattle, Washington, USA, pp 59–68
[20] Hightower, Borriello (2004) Particle filters for location estimation in ubiquitous computing: A case study. In: Lecture Notes in Computer Science, Ubicomp 2004, vol 3205/2004, pp 88–106
[21] Hightower, Smith (2006) Practical lessons from Place Lab. IEEE Pervasive Computing 5(3):32–39
[22] Hightower, Consolvo, LaMarca, Smith, Hughes (2005) Learning and recognizing the places we go. In: Ubicomp 2005, pp 159–176
[23] Intille SS, Larson K, Tapia EM, Beaudin J, Kaushik P, Nawyn J, Rockinson R (2006) Using a live-in laboratory for ubiquitous computing research. In: Proceedings of PERVASIVE 2006, pp 349–365
[24] Jang JS, Sun CT, Mizutani E (1997) Neuro-Fuzzy and Soft Computing. Prentice Hall
[25] Kang, Welbourne, Stewart, Borriello (2005) Extracting places from traces of locations. ACM SIGMOBILE Mobile Computing and Communications, pp 58–68
[26] Katsarakis N, Souretis G, Talantzis F, Pnevmatikakis A, Polymenakos L (2006) 3D audiovisual person tracking using Kalman filtering and information theory. In: CLEAR, pp 45–54
[27] Kidd CD, Orr R, Abowd GD, Atkeson CG, Essa IA, MacIntyre B, Mynatt E, Starner TE, Newstetter W (1999) The Aware Home: A living laboratory for ubiquitous computing research. In: Proceedings of the 2nd International Workshop on Cooperative Buildings (CoBuild'99)
[28] Koho, Suutala, Seppänen, Röning (2004) Footstep pattern matching from pressure signals using segmental semi-Markov models. In: Proceedings of the 12th European Signal Processing Conference (EUSIPCO 2004), Vienna, Austria, pp 1609–1612
[29] Krumm J, Harris S, Meyers B, Brumitt B, Hale M, Shafer S (2000) Multi-camera multi-person tracking for EasyLiving. In: Proceedings of the 3rd IEEE Workshop on Visual Surveillance, Dublin, Ireland
[30] Krumm J, Cermak G, Horvitz E (2003) RightSPOT: A novel sense of location for a smart personal object. In: Proceedings of the International Conference on Ubiquitous Computing (UBICOMP)
[31] Laasonen, Raento, Toivonen (2004) Adaptive on-device location recognition. In: LNCS 3001, Springer-Verlag, pp 287–304
[32] LaMarca, Lara (2008) Location systems: An introduction to the technology behind location awareness. Synthesis Lectures on Mobile and Pervasive Computing, 122 p.
[33] Lee JY, Scholtz RA (2002) Ranging in a dense multipath environment using an UWB radio link. IEEE J Select Areas Commun 20(9):1677–1683
[34] Liao, Patterson, Fox, Kautz (2007) Learning and inferring transportation routines. Artificial Intelligence 171:311–331
[35] Marmasse, Schmandt (2000) Location-aware information delivery with comMotion. In: Lecture Notes in Computer Science, Handheld and Ubiquitous Computing, Springer Berlin/Heidelberg, vol 1927/2000, pp 361–370
[36] Minami M, Fukuju Y, Hirasawa K, Yokoyama S, Mizumachi M, Morikawa H, Aoyama T (2004) DOLPHIN: A practical approach for implementing a fully distributed indoor ultrasonic positioning system. In: Ubicomp 2004, pp 347–365
[37] Misra, Enge (2006) Global Positioning System: Signals, Measurements, and Performance. Ganga-Jamuna Press, Lincoln, MA
[38] Murakita, Ikeda, Ishiguro (2004) Human tracking using floor sensors based on the Markov chain Monte Carlo method. In: Proc. 17th Int. Conf. Pattern Recognition, IEEE, vol 4, pp 917–920
[39] Newman, Clark (1999) Sulawesi: A wearable application integration framework. In: Proc. 3rd Int. Symp. Wearable Computers, pp 170–171
[40] Nurmi, Bhattacharya (2008) Identifying meaningful places: The non-parametric way. In: LNCS 5013, Proc. Pervasive, Springer-Verlag Berlin Heidelberg, pp 111–127
[41] Orr, Abowd (2000) The Smart Floor: A mechanism for natural user identification and tracking. In: Proc. 2000 Conf. Human Factors in Computing Systems (CHI), pp 275–276
[42] Paajanen M, Lekkala J, Kirjavainen K (2000) Electromechanical film (EMFi) - a new multipurpose electret material. Sensors and Actuators A 84(1-2)
[43] Pirttikangas S, Porspakka JRS, Röning J (2004) Know your whereabouts. In: Proc. Communication Networks and Distributed Systems Modeling and Simulation Conference (CNDS'04), San Diego, California, USA
[44] Pnevmatikakis A, Polymenakos L (2006) 2D person tracking using Kalman filtering and adaptive background learning in a feedback loop. In: CLEAR, pp 151–160
[45] Priyantha NB, Chakraborty A, Balakrishnan H (2000) The Cricket location-support system. In: Proceedings of the 6th Annual ACM/IEEE International Conference on Mobile Computing and Networking (Mobicom 2000), Boston, MA, USA
[46] Priyantha NB, Miu A, Balakrishnan H, Teller S (2001) The Cricket Compass for context-aware mobile applications. In: Proceedings of the 7th Annual ACM/IEEE International Conference on Mobile Computing and Networking (Mobicom 2001), Rome, Italy
[47] Rodriguez, Lewis, Mason, Evans (2008) Footstep recognition for a smart home environment. International Journal of Smart Home
[48] Sato T (2003) Robotic Room: Human behavior measurement, behavior accumulation and personal behavioral adaptation by intelligent environment. In: Proceedings of the IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM 2003), pp 515–520
[49] Sato T, Harada T, Mori T (2004) Environment-type robot system Robotic Room featured by behavior media, behavior contents and behavior adaptation. IEEE/ASME Transactions on Mechatronics 9(3):529–534
[50] Sato T, Harada T, Mori T (2004) Robotic Room: Networked robots system in configuration of room. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Network Robot System Workshop, pp 11–18
[51] Shi Q, Kyperountas S, Correal NS, Niu F (2005) Performance analysis of relative location estimation for multihop wireless sensor networks. IEEE J Select Areas Commun 25(4):830–838
[52] Smith A, Balakrishnan H, Goraczko M, Priyantha N (2004) Tracking moving devices with the Cricket location system. In: Proceedings of the 2nd International Conference on Mobile Systems, Applications, and Services (MobiSys '04), pp 190–202
[53] Smith K, Schreiber S, Potucek I, Beran V, Rigoll G, Gatica-Perez D (2006) 2D multi-person tracking: A comparative study in AMI meetings. In: CLEAR, pp 331–344
[54] Sohn, Li, Lee, Smith, Scott, Griswold (2005) Place-Its: A study of location-based reminders on mobile phones. In: Ubicomp 2005, Springer, pp 232–250
[55] Stenbit (2001) GPS SPS performance standard. Tech. rep., The National Space-Based Positioning, Navigation, and Timing (PNT) Executive Committee
[56] Suutala, Fujinami (2008) Gaussian process person identifier based on simple floor sensors. In: Proc. 3rd European Conf. Smart Sensing and Context (EuroSSC 2008), Zurich, Switzerland
[57] Suutala, Röning (2008) Methods for person identification on a pressure-sensitive floor: Experiments with multiple classifiers and reject option. Information Fusion Journal, Special Issue on Applications of Ensemble Methods 9:21–40, URL http://dx.doi.org/10.1016/j.inffus.2006.11.003
[58] Vildjiounaite, Malm, Kaartinen, Alahuhta (2002) Location estimation indoors by means of small computing power devices, accelerometers, magnetic sensors, and map knowledge. In: Lecture Notes in Computer Science, Pervasive 2002, Springer-Verlag Berlin Heidelberg, pp 211–224
[59] Want R, Hopper A, Falcão V, Gibbons J (1992) The Active Badge location system. ACM Transactions on Information Systems 10(1):91–102
[60] Ward A (1998) Sensor-driven computing. PhD thesis, University of Cambridge
[61] Ward A, Jones A, Hopper A (1997) A new location technique for the active office. IEEE Personal Communications 4(5):42–47
[62] Woodman, Harle (2008) Pedestrian localization for indoor environments. In: Proc. 10th Int. Conf. Ubiquitous Computing, pp 114–123
[63] Yin, Yang, Ni (2005) Adaptive temporal radio maps for indoor location estimation. In: IEEE PerCom, pp 85–94
[64] Yun, Lee, Woo, Ryu (2003) The user identification system using walking pattern over the UbiFloor. In: Proc. Int. Conf. Control, Automation, and Systems (ICCAS)
[65] Zhu Y, Shareef A (2006) Comparisons of three Kalman filter tracking algorithms in sensor network. In: International Workshop on Networking, Architecture, and Storages 2006 (IWNAS'06)
Smart Offices and Intelligent Decision Rooms

Carlos Ramos, Goreti Marreiros, Ricardo Santos, Carlos Filipe Freitas
Carlos Ramos, GECAD and ISEP, e-mail: [email protected]
Goreti Marreiros, GECAD and ISEP, e-mail: [email protected]
Ricardo Santos, GECAD and ESTGF, e-mail: [email protected]
Carlos Filipe Freitas, GECAD, e-mail: [email protected]
ISEP - Institute of Engineering of Porto, R. Dr. Antonio Bernardino de Almeida, 431, 4200-072 Porto, Portugal
GECAD - Knowledge Engineering and Decision Support Research Center at ISEP
ESTGF - School of Management and Technology at Felgueiras, Rua do Curral - Casa do Curral, Margaride, 4610-156 Felgueiras, Portugal

1 Introduction

Nowadays, computing technology research is focused on the development of smart environments. Following that line of thought, several smart-room projects have been developed, and their applications are very diverse, spanning the workplace, everyday living, entertainment, play, and education. These applications aim to acquire and apply knowledge about the state of the environment in order to reason about it, define a state desired by its inhabitants, and adapt to those desires, thereby improving the inhabitants' involvement in and satisfaction with that environment. Such "intelligent" or "smart" environments and systems interact with human beings in a helpful, adaptive, active, and unobtrusive way. They may also be proactive, acting autonomously, anticipating the users' needs, and in some circumstances they may even replace the user. Decision making is one of the noblest activities of the human being. Most of the time, decisions are taken by several persons together (group decision making) in specific spaces (e.g. meeting rooms). On the other hand, the
present economic scenario (e.g. increased business opportunities, intense competition) demands fast decision making. In this chapter we focus on smart offices and intelligent meeting rooms. These topics are well studied and are intended to support the decision-making activity; however, they have received a new perspective from the Ambient Intelligence concept, a different way of looking at traditional offices and decision rooms, in which these environments are expected to support their inhabitants in a smart way, promoting easy management and efficient actions and, most importantly, supporting the creation and selection of the most advantageous decisions. Smart office applications are presented in the next section (2), where we give an overview of existing approaches, match them with definitions of such environments, and then describe a few important projects from the literature in terms of hardware, software, and capabilities. In section 3 we discuss Intelligent Meeting Rooms (IMR), presenting a definition of such environments as well as some projects. In section 4 we detail an intelligent decision room project, LAID (Laboratory of Ambient Intelligence for Decision Making), installed at the GECAD research center at ISEP. The environment created in LAID supports meeting participants, assisting them in the argumentation and decision-making processes and combining rational and emotional aspects. Special attention is given to ubiquity, because in a group decision context it is natural that group members cannot always meet at the same time and in the same place. We finish the chapter with some conclusions.
2 Smart Offices

To build intelligent environments it is vital to design and make effective use of physical components such as sensors, controllers, and smart devices. Using the information collected by these sensors, software (e.g., intelligent agents) can reason about the environment and trigger actions that change its state by means of actuators. Such sensor/actuator networks need to be robust and self-organizing in order to create a ubiquitous/pervasive computing platform. The IEEE 1451 studies formalised the notion of a smart sensor as one that provides additional functions beyond the sensed quantity, such as signal conditioning or processing, decision-making functions, or alarm functions [12]. This led to devices that do not attempt to solve the entire intelligent-environment problem [7] but provide intelligent features confined to a single object and task. The design of these sensors/actuators raises pervasive computing and middleware issues; here we find challenges such as invisibility, service discovery, interoperability and heterogeneity, pro-activity, mobility, privacy, security, and trust [53]. Consequently, to design and build a smart office it is fundamental to select middleware that reduces the effort of developing and using software
solutions that bring together all the data collected by the sensors, reason about the environment, and act on it.

We now focus on the desired features of these peculiar AmI environments, smart offices, as reported in the literature. In the "SENSE-R-US" project [36], researchers focus on gathering real-world data about employees in an office environment and on identifying the steps needed to clean that data and extract relevant information, in order to build a meeting-driven movement model for office scenarios. The resulting system can be queried about the position and status of persons and can resolve queries such as "What is the temperature in room X?", "Which rooms are available?", "Is person X in a meeting?" or "How many meetings has person P already had today?". In [34], on the other hand, the research focuses on using software agents to automate environment customisation, enabling the environment to adapt to users' needs and to provide customised interfaces to the services available at each moment, without neglecting security. The system is expected to flexibly support the addition and removal of users, devices, and services, to scale, to provide reliable authentication and access control, and to support devices with different capabilities in terms of power supply, bandwidth, and computing power.

In [19] we can find several approaches to the decision-making task, defined as the process of determining the action the system should take in a given situation in order to optimise one or more performance metrics. The algorithms map the information gathered from past observations, together with what is known about the current state of the environment and its inhabitants, to the decision to be made at the current point in time. Decision-making problems are well documented in the literature and can be summarised into three kinds of algorithms: reactive and rule-based decision systems, planning algorithms, and learning algorithms. In reactive and rule-based decision systems, decisions are based on a set of facts and rules that represent the various states of the environment and its inhabitants, combined into a set of condition-action mappings; a rule is triggered when all of its conditions are satisfied. An example of such an application can be seen in [24]. Within this approach different implementations have been presented, such as fuzzy logic in [64] or influence diagrams, when uncertainty is present, in [50]. However, the rule base must be constructed by the programmer or the user, a problem that does not exist in planning algorithms, which provide a framework that allows decision-making agents to select the best actions autonomously. To do so, such systems have a model of the environment, of the available actions, and of their effects, and they generate a solution in the form of a sequence of actions that leads to the desired objectives. Several planning approaches are discussed in detail in [50]. Neither of the previous mechanisms, however, provides the possibility of self-customisation; that is only possible if the system designer or the user builds customised models. Autonomous adaptation to the inhabitants requires machine-learning methods such as supervised learning, although such techniques need the inhabitant or the programmer to provide a training data set. Such approaches are well reported in [37].
Neural networks, Bayesian networks, Markov models, Hidden Markov Models, and state predictors are other techniques that have already been used in AmI systems [9, 45, 46], and these can remove the need for user or programmer intervention. Such algorithms pose many challenges in terms of safety considerations and computational efficiency, mainly in large-scale environments, but the reward is an environment that can perform many useful tasks and improve comfort and energy efficiency. To allow highly complex environments to be controlled according to the preferences of their inhabitants, solutions rely on hierarchical multi-agent systems with reinforcement learning; details of such techniques can be found in [14, 3, 19, 6].

Besides the monitoring, prediction, autonomy, and adaptation features already mentioned as key features of smart offices, [7] also focuses on the ability of these environments to communicate with humans in a natural way. Classical offices normally use common Graphical User Interfaces (GUIs) for interaction between user and system. However, some authors consider this kind of interface intrusive and not truly user-friendly, and a new class of interfaces has emerged from work on human-machine interaction: Perceptual User Interfaces (PUIs), which are based not on the keyboard-and-mouse pair but on human actions such as speech, gestures, and interaction with objects. Very good examples of this kind of interface are Wellner's MagicDesk [61], Berard's MagicBoard [4], which detects user commands such as "print" or "solve equation" given by human gestures, and, more recently, the Intelligent Room Project [5], which is able to recognise human speech and gestures.

Having described the common features of such environments, we can now consider the definitions of smart environments found in the literature. [25] defines a smart office as an environment that is able to help its inhabitants perform everyday tasks by automating some of them and by making communication between user and machine simpler and more effective. In [33, 34], smart offices are defined as environments that adapt themselves to user needs, release users from routine tasks they would otherwise perform, change the environment to suit their preferences, and give access to the services available at each moment through customised interfaces. In other words, smart offices help to shorten the decision cycle by offering, for instance, connectivity wherever the user is and by aggregating knowledge and information sources. Smart offices handle several devices that support everyday tasks, and they may anticipate user intentions, performing tasks on the user's behalf, facilitating other tasks, and so on. As can be seen, these definitions match the features presented above. Having established the definition of a smart office, the next subsections report some important projects from the literature and describe what can be done in such environments, specifying the capabilities of the systems in terms of hardware, software, and functionality.
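Returning to the reactive, rule-based decision systems described earlier in this section, the condition-action mapping can be made concrete with a small sketch. The facts and rules below are hypothetical and are not taken from any of the cited systems.

```python
def run_rules(state, rules):
    """Fire every rule whose conditions all hold in the current state.
    state: dict of sensed facts; rules: list of (conditions, action) pairs,
    where conditions is a dict of required fact values."""
    actions = []
    for conditions, action in rules:
        if all(state.get(key) == value for key, value in conditions.items()):
            actions.append(action)
    return actions

rules = [
    ({"occupancy": "meeting", "projector": "on"}, "dim_lights"),
    ({"occupancy": "empty"},                      "switch_off_hvac"),
    ({"occupancy": "working", "lux_low": True},   "raise_blinds"),
]
state = {"occupancy": "meeting", "projector": "on", "lux_low": False}
print(run_rules(state, rules))   # ['dim_lights']
```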
2.1 Active Badge

Active Badge is considered the first smart office application; its goal is to forward incoming calls for a user of a building to the nearest phone [59]. The hardware consists of a badge, which the user must wear inside the building and which emits infrared signals unique to each user; infrared sensors that receive the badge signals; and a central database that is updated by the sensors whenever they detect a badge. The software is an application used by the receptionist to query the database for the last known position of a user, so that the call can be transferred to the telephone terminal nearest to that user. Despite the simplicity and specificity of the system, the intermediate database makes it possible to add new sensors, or even different kinds of mechanisms (e.g., fingerprint lockers), without changing or even restarting the application.
2.2 Monica SmartOffice

Development of a smart office called Monica started in 1999 [26, 25]; it is intended to anticipate user intentions and to augment the environment to communicate useful information, based on user monitoring. In terms of hardware, the project involves 50 sensors (cameras and microphones). Four mobile cameras located in the corners track one or more users; a wide-angle camera observes the entire office, enables recognition of user activities, and monitors the tracking strategy; and a mobile camera on top of the user's computer monitor provides a close-up view of the user at the computer, ideal for face or facial-expression recognition. Microphones distributed across the ceiling are used for speech recognition. There are three actuators: a video projector and two speakers. Finally, six Linux Pentium III PCs connected by a fast Ethernet local network power Monica: four perform vision processing for user tracking and activity recognition, a fifth is dedicated to the MagicBoard for both video projection and image processing, and a sixth hosts the supervisor and is linked to the Internet. On top of this hardware run 12 software modules with specific tasks: gesture recognition, a 3-dimensional mouse, a MagicBoard and a MagicDesk, finger and click detection, a virtual assistant, a MediaSpace for distributed collaborative work, a tracker, face recognition, activity detection, Phycons for user interaction, an Internet agent, and the Gamma architecture, which allows all the other software agents to interoperate.
2.3 Stanford's Interactive Workspaces

The Stanford Interactive Workspaces project is based on a research facility called the iRoom, located in the Gates Information Sciences Building, and focuses on augmenting a dedicated meeting space with technology such as large displays, wireless/multimodal I/O devices, and seamless integration of mobile and wireless "appliances", including handheld PCs. The iRoom contains [20] three touch-sensitive whiteboard displays along the side wall and a custom-built 9-megapixel, 6-foot-diagonal display called the interactive mural. In addition, there is a table with a 3 x 4 foot display designed to look like a standard conference-room table. The room also has cameras, microphones, wireless LAN support, and several wireless buttons and other interaction devices. The system infrastructure, presented in [20], is called the Interactive Room Operating System (iROS). It is a meta-operating system, or middleware infrastructure, tying together devices that have their own low-level operating systems. iROS is composed of sub-systems named the Data Heap, iCrafter, and the Event Heap, designed to address the three user modalities of moving data, moving control, and dynamic application coordination, respectively. The inter-dependency between the sub-systems is very low, which makes the overall system more robust and allows devices to be deployed and replaced easily.
2.4 Intelligent Environment Laboratory of IGD Rostock

This laboratory was created as an interactive environment based on multimodal interfaces and goal-oriented interaction, differing from other systems that normally use function-oriented interaction. To this end, the components provide semantic interfaces with a description of the meaning and effect of their functions in the environment. These semantic descriptions take the form of a set of preconditions that must be fulfilled before an action can be triggered; the semantic description of the effects of that action provides the input to a planning assistant that can develop comprehensive system strategies and plans, even with new devices, to fulfil the goals [16]. The hardware includes a series of agents that control microphones, all kinds of sensors (temperature, brightness), standard devices, and computing resources. The software infrastructure includes the following modules: speech analysis and recognition, a dialogue management system (to translate user goals into system goals), a context manager (to manage the system's view of the world), different sensors (to perceive the environment), a planning assistant (to create strategies to fulfil the user's goals), a scheduler (for plan execution), and many devices (to have real-world effects). This is a different approach to smart offices, in which the interactions with the lab are goals to reach rather than actions to be performed.
2.5 Sens-R-Us

The Sens-R-Us application aims at building a smart environment in an office context and is deployed at the University of Stuttgart [36]. It has two types of sensor nodes, base stations and personal sensors. Base stations are installed in all rooms; they serve as location beacons and as gateways to the mobile nodes carried by the employees, and through a PC-based GUI the position and status (e.g., 'in a meeting') of persons, or sensor information such as room temperature, can be queried. Personal sensors receive the location beacons to determine their position, and they also send beacons which are received by other personal sensors to update their neighbourhood lists. The neighbourhood list and microphone data are used to detect the occurrence of meetings.
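The meeting-detection step described above can be pictured as combining the neighbourhood list with a simple audio-activity test. The following sketch is a hypothetical simplification, not the Sens-R-Us implementation; the thresholds are illustrative.

```python
def in_meeting(neighbour_list, mic_energy, min_people=2, energy_threshold=0.3):
    """neighbour_list: ids of other personal sensors currently reported by the
    same base-station beacon (i.e. co-located people); mic_energy: smoothed
    microphone activity level in [0, 1]. Declares a meeting when enough people
    are co-located and there is sustained speech activity."""
    return len(neighbour_list) + 1 >= min_people and mic_energy >= energy_threshold

print(in_meeting(neighbour_list=["node-12", "node-17"], mic_energy=0.6))  # True
print(in_meeting(neighbour_list=[], mic_energy=0.8))                      # False
```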
2.6 Smart Doorplate

The Smart Doorplate project, located at the University of Lancaster [57], is able to display notes for visitors and to change them remotely, e.g. via a text message (SMS) from the office owner. In the owner's absence it can thus perform simple secretary tasks, such as accepting and forwarding messages or notifying visitors of the owner's current location or return time. It is composed of a display (preferably a touch screen) and a microphone connected to a network, a person-tracking system (identification tags and tag readers), additional sensors (to detect whether a door is open), actuators such as electronic door locks, and interconnection software and middleware to connect the appliances. The project focuses on the scenario of a visitor direction service and is based on a peer-to-peer middleware architecture rather than the commonly used client/server approach.
2.7 Ambient Agoras

The research in [56] pursues augmenting the architectural envelope to create a social architectural space that supports collaboration, informal communication, and social awareness. Its main focus is the design of smart artifacts that users can interact with simply and intuitively in the overall environment. The authors introduce a distinction between two types of smart artifacts: system-oriented, importunate smartness, which creates an environment in which individual smart artifacts, or the environment as a whole, can take certain self-directed actions based on previously collected information and leaves the human out of the decision process; and people-oriented, empowering smartness, which places the empowering function in the foreground so that "smart spaces make people smarter", empowering users to make decisions and take mature and responsible actions, and keeping the users in the decision loop.
The artifacts and software components developed include InfoRiver, InforMall, and the SIAM-system [47], as well as Hello.Wall, ViewPort, and Personal Aura [56]. The Hello.Wall is an ambient display, 1.8 meters wide by 2 meters high, with integrated light cells and sensing technology; it provides awareness and notifications to people passing by or watching. In brief, it is a piece of unobtrusive, calm technology that exploits people's ability to perceive information via codes. The ViewPort is a WLAN-equipped, PDA-like handheld device with RFID readers and transponders, so that a ViewPort can sense other artifacts and be sensed itself. It is used together with the Hello.Wall, which borrows the ViewPort's display to privately show more explicit and personal information that can be viewed only on a personal or temporarily personalized device. Depending on access rights and the current context, people can use ViewPorts to learn more about the Hello.Wall, to decode visual codes on the wall, or to access a message announced by a code. Personal Aura is an easy and intuitive interface that lets users control their appearance in a smart environment: they can decide whether to be visible to a tracking system and, if so, control the social role in which they appear. The participants in [56] indicated that they perceived the Hello.Wall as an appropriate means of establishing awareness of people working at a remote site, overcoming the isolation of not being physically present without causing privacy problems. They also reported that interactions with the remote site took place more often, that spontaneous video-conference interactions were less formal, and that videoconferencing became a daily routine.
3 Intelligent Meeting Rooms

The increasing competitiveness of the business world leads people and organisations to take decisions in a short period of time, in formal group settings, and in specific spaces (e.g. meeting rooms) [41]. As those decisions have to be the most advantageous in terms of the quality of the final results, researchers have developed several decision-support systems. Early systems aimed to support face-to-face meetings [41]; today the aim is to develop systems that support distributed and asynchronous meetings, naturally allowing a ubiquitous use that adds flexibility to today's global organisational environment [13]. This direction in software development has been accompanied by the design and construction of rooms with specific hardware and software that can empower decision makers, supporting them with knowledge, driving their attention to the problem, and keeping their minds from wandering onto needless issues. Such rooms are commonly called Intelligent Meeting Rooms (IMR) and can be considered a sub-domain of smart rooms for the workplace context. The generic goal of an Intelligent Meeting Room is normally described as supporting multi-person interactions in the environment in real time, but also as being able to remember the past, enabling the review of past events and the reuse of past information in an intuitive, efficient, and intelligent way [35]. Beyond this classical definition, [29] states that an IMR should also support the decision-making
process, considering the emotional factors of the participants as well as the argumentation process. The infrastructure of such rooms includes a suite of multimodal sensory devices and appropriate computing and communication systems; a good identification of the components of a smart environment and their connections can be seen in [7]. The following subsections provide details on IMR projects from the literature, focusing on their tasks and on their hardware and software capabilities. First we present the systems that fit the definition of [35]; in the next section we describe a project we have developed, the LAID project [32], which goes beyond the classical definition of an IMR and also meets the definition of [29].
3.1 The IDIAP Smart Meeting Room

Among the interesting projects in this field is the IDIAP smart meeting room [38], which hosts meetings with up to twelve participants. The hardware comprises a table, a whiteboard, a computer projection screen, 24 microphones (configured as lapel microphones, in the ears of a binaural manikin, and in a pair of 8-channel tabletop microphone arrays), three video cameras, and equipment for capturing time-stamped whiteboard strokes. The recorded data are precisely synchronized, so that every microphone sample, pen stroke, and video sample can be associated with simultaneously captured samples from the other media streams. The software component uses XML to catalogue all the data, and an off-line interactive media-browsing system is also mentioned.
3.2 SMaRT

SMaRT [58], developed at the Interactive Systems Laboratories (ISL), is intended to provide meeting-support services that do not require explicit human-computer interaction, enabling the room to react appropriately to users' needs while they keep the focus on their own goals. It supports human-machine, human-human, and human-computer-human interactions, providing multimodal and fleximodal interfaces for multilingual, multicultural meetings. It is instrumented to collect behavioral data so that the participants' interactions and their activities during the meeting can be analyzed.
3.3 M4 and AMI

The literature also refers to the M4 (Multi Modal Meeting Manager) project, a large-scale project funded by the European Union under its 5th Framework Programme [40].
The aim of M4 is to design a meeting manager that is able to translate the information captured from microphones and cameras into annotated meeting minutes that allow for high-level retrieval questions, as well as summarisation and browsing. It is concerned with building a demonstration system that enables structuring, browsing and querying of an archive of automatically analysed meetings. The software goals of the M4 project include the analysis and processing of audio and video streams, robust conversational speech recognition to produce a word-level description, recognition of gestures and actions, multimodal identification of intent and emotion, multimodal person identification, and source localisation and tracking. The inferred data can be accessed from the M4 meeting manager. Related to this project is the AMI (Augmented Multi-party Interaction) software [40], concerned with new multimodal technologies to support human interaction in the context of smart meeting rooms and remote meeting assistants. It aims to enhance the value of multimodal meeting recordings and to make human interaction more effective in real time. Its main focus is on using the hardware for 3D centroid tracking, person identification and current-speaker recognition, event recognition for directing the attention of active cameras, best-view camera selection, active camera control, and graphical summarisation/user interfaces.
3.4 NIST Smart Space Project

The NIST Meeting Room [55] is a US Department of Commerce project that investigates pervasive devices, sensors and networks for context-aware smart meeting rooms that sense ongoing human activities and respond to them. In terms of hardware, this IMR has over two hundred microphones, five HD cameras, a smart whiteboard, and a locator system for the meeting attendees. Together, these sensors generate over a gigabyte of data per minute, which is time-tagged to millisecond resolution and stored [11]. The project includes two major software toolkits, ATLAS and SmartFlow. The sensor streams are managed by the NIST SmartFlow system, a data-flow middleware layer that provides a data transport abstraction. SmartFlow creates the connections and transports the data among the clients; it consists of a defined middleware API for real-time data transport and a connection server for sensor data source, processing, display and storage graph nodes. The SmartFlow layer therefore makes it possible to integrate components from multiple sources, such as speaker identification and speech recognition systems. The Architecture and Tools for Linguistic Analysis Systems (ATLAS) is a signal-independent linguistic annotation framework responsible for extracting low-level semantic descriptions, in the form of metadata or annotations, from the data streams and subsequently inferring higher-level annotations about the users' tasks in the meeting context. In the scope of this project some products have already been released to the market, such as the NIST Microphone Array Series, designed to minimise analog signal
paths to reduce noise while allowing flexible deployment in research laboratories [10].
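To make the data-flow abstraction more concrete, the sketch below shows a generic publish/subscribe channel in Java in the spirit of the transport layer described above. It is purely illustrative: the interface and class names are hypothetical and do not reproduce the actual NIST SmartFlow API.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// A graph node that consumes time-tagged frames from a sensor stream.
interface FlowNode {
    void onFrame(String sourceId, long timestampMillis, byte[] payload);
}

// A channel fans frames out from a source client to processing, display
// and storage nodes, mirroring the transport abstraction described above.
class FlowChannel {
    private final List<FlowNode> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(FlowNode node) {
        subscribers.add(node);
    }

    void publish(String sourceId, long timestampMillis, byte[] payload) {
        for (FlowNode node : subscribers) {
            node.onFrame(sourceId, timestampMillis, payload);
        }
    }
}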
3.5 MIT Intelligent Room Project

The coordinator of this project, Brooks [5], has argued that Intelligent Environments (IE) should adapt to, and be useful for, ordinary everyday activities; they should assist the user rather than requiring the user to attend to them; they should have a high degree of interactivity; and they should be able to understand the context in which people are trying to use them and behave appropriately. The software environment supports change, i.e. it is an adaptable system. Change arises through anonymous devices customised to users, through explicit user requests, and through the needs of applications and different operating conditions [54]. The project has produced several contributions, some of which are cited in the following text. MIT's intelligent room prototype facilitates activities via two systems, Rascal and ReBa, which are responsible for the adaptive and reactive components and provide a way to make use of perceptual information. The team developed a Semantic Network [21], a graph notation that represents knowledge of the context as patterns of interconnected nodes and links; it provides a model for capturing the vast amount of information coming from cameras, microphones and sensors and for creating context-aware room reactions. There is also MetaGlue, which can be seen as an execution platform for distributed agents; its dynamic nature allows new agents to join the system at run time, and its resource discovery and agent state recovery make the agents practically "invincible". In terms of collaboration, Meeting View and The eFacilitator are supporting tools that make it possible to effectively visualise the content of a meeting [54]. SAM, Look-To-Talk (LTT) and Multimodal Sketching are human-computer interfaces (HCI). SAM is an HCI project in which an expressive and responsive user interface agent for intelligent rooms was developed. The LTT interface involves an animated avatar representing the room and intends to bring the speech recognition capabilities of the environment into human activities. The aim of Multimodal Sketching is to integrate speech into the existing sketching system, which helps capture the user's intentions.
4 LAID - Laboratory of Ambient Intelligence for Decision Support

LAID (Laboratory of Ambient Intelligence for Decision Making) is an Intelligent Environment to support decision meetings (Fig. 1).
Fig. 1 Ambient Intelligent Decision Lab
The meeting room supports distributed and asynchronous meetings, so participants can take part in the meeting wherever they are. The software included is part of an ambient intelligence environment for decision making in which networks of computers, information and services are shared [31]. Emotion handling is also included in these applications. The middleware used is geared towards live participation in, and support of, the meeting. It is also able to support the review of past meetings in an intuitive way, although the audio and video replay features are still being set up. The GECAD Intelligent Decision Room is composed of the following hardware:
• Interactive 61'' plasma screen
• Interactive holographic screen
• Mimio note grabber
• Six interactive 26'' LCD screens used by 3 decision points
• 3 cameras, microphones and activating terminals controlled by a CAN network.
With this hardware we are able to gather all kinds of data produced in the meeting and to facilitate its presentation to the participants, confining the middleware issues to the software solutions that catalogue, organise and distribute the meeting information. As an example of a potential scenario, consider a distributed meeting involving people in different locations (some in a meeting room, others in their offices, possibly in different countries) with access to different devices (e.g. computers, PDAs, mobile phones, or even embedded systems that are part of the meeting room or of their clothes). Fig. 1 shows the Ambient Intelligence Decision Laboratory with several interactive Smartboards. The meeting is distributed but it is also asynchronous, so participants do not need to be involved at all times (like the meeting participant using a PDA and/or a notebook in Fig. 2). However, when interacting with the system, a meeting participant may wish to receive information as it appears. Meetings are important events where ideas are exposed, alternatives are considered, argumentation and negotiation take place, and where the emotional aspects of the participants are as important as the rational ones. The system helps participants by showing the available information and knowledge, analysing the meeting trends and suggesting arguments to be exchanged with the other participants.
Fig. 2 Distributed Decision Meeting
4.1 Ubiquitous Group Decision Making

The term Group Decision Support System (GDSS) [17, 27] effectively emerged at the beginning of the 1980s. According to Huber [18], a GDSS consists of a set of software, hardware and language components and procedures that support a group of people engaged in a decision-related meeting. A more recent definition from Nunamaker and colleagues [42] says that GDSSs are interactive computer-based environments that support concerted and coordinated team effort towards the completion of joint tasks. Generically, we may say that a GDSS aims to reduce the losses associated with group work (e.g. time consumption, high costs, improper use of group dynamics, etc.) and to maintain or improve the gains (e.g. groups are better at understanding problems and at flaw detection, and the participants' different knowledge and processing skills allow results that could not be achieved individually). The use of a GDSS allows groups to integrate the knowledge of all members into better decision making. Jonathan Grudin [15] classifies the digital technology to support group interaction into three phases: pre-ubiquitous, proto-ubiquitous and ubiquitous. In the pre-ubiquitous phase, which began in the 1970s, face-to-face meetings were supported. In the proto-ubiquitous phase, which came to life in the 1990s, distributed meetings were supported. The ubiquitous phase, now getting under way, supports meetings distributed in both time and space. Ubiquitous computing was introduced in the 1990s and anticipates a digital world consisting of many distributed devices that interact with users in a natural way [60]. This vision was too far ahead of its time; however, the hardware needed to implement Mark Weiser's vision is now commercially available and at low cost. In an ambient intelligence environment, people are surrounded by networks of embedded intelligent devices providing ubiquitous information, communication and services. Intelligent devices are available whenever needed, enabled by simple or effortless interactions, attuned to the senses, adaptive to users and contexts, and acting autonomously. High-quality information and content may therefore be available to any user, anywhere, at any
time, and on any device. Today there is an increasing interest in the development of ubiquitous group decision support systems (uGDSS) to formalise and support "any time and any place" group decision-making processes, instead of the conventional "same place and same time" approach. This interest comes from the need to assemble the best possible group of participants. With the globalisation of the economy, the potential participants, such as specialists or experts in specific domains, are located in different parts of the world, and there is often no way to bring them into the same decision room. Until a few years ago, the only way out of this scenario was to wait until all the participants could meet together. Currently there is a growing interest in developing systems to support such scenarios. There are many areas where ubiquitous group decision making clearly makes sense. One of the most cited areas in the literature is healthcare, since a patient's treatment involves several specialists, such as physicians, nurses, laboratory assistants and radiologists. These specialists may be distributed across departments or hospitals, or even live in different countries. The HERMES system, a web-based GDSS, was tested in this scenario [22]. Other GDSSs also support ubiquitous decision making (the GroupSystems, WebMeeting and VisionQuest software).
4.2 Ubiquitous System Architecture

Our objective here is to present a ubiquitous system able to exhibit intelligent and emotional behaviour in its interaction with individual persons and groups. The system supports people in group decision-making processes, considering the emotional factors of the intervening participants as well as the argumentation process. Groups and social systems are modelled by intelligent agents that are simulated with emotional aspects taken into account, in order to anticipate possible trends in social/group interactions. The system consists of a suite of applications, as depicted in Fig. 3. The main goals of the system are [31]:
• The use of a simplified model of groups and social systems for decision-making processes, balancing emotional and rational aspects in a proper way;
• The use of a decision-making simulation system to support meeting participants, bringing the emotional component into the decision-making process;
• The use of an argumentation support system, suggesting arguments to be used by a meeting participant in the interaction with other participants;
• A mixed-initiative interface for the developed system;
• The availability of the system in any place, on different devices and at different times.
The main blocks of the system are:
• WebMeeting Plus is an evolution of the WebMeeting project [28] with extended features for audio and video streaming.
Fig. 3 System architecture
In its initial version, based on Web-Meeting, it was designed as a GDSS that supports distributed and asynchronous meetings over the Internet. The WebMeeting system is focused on multi-criteria problems, in which several alternatives are evaluated against various decision criteria (a minimal sketch of such a data model follows this list). Moreover, the system is intended to support the activities associated with the whole meeting life cycle, i.e. from the pre-meeting phase to the post-meeting phase. It supports the activities of two distinct types of users: ordinary group "members" and the "facilitator". The system works by allowing participants to post arguments in favour of, neutral towards, or against the different alternatives under discussion for a particular problem. It is also a window onto the information repository for the current problem. It is a web-based application accessible from desktop and mobile browsers, and eventually through WML for WAP browsers;
• ABS4GD (Agent Based Simulation for Group Decision) is a multi-agent simulator whose aim is to simulate group decision-making processes, considering the emotional and argumentative factors of the participants [15]. ABS4GD is composed of several agents, the most relevant being the participant agents that simulate the human members of a decision meeting (this decision-making process is influenced by the emotional state of the agents and by the exchanged arguments). The user maintains a database of participant profiles and the group's model history; this model is built incrementally during the user's different interactions with the system;
• WebABS4GD is a web version of the ABS4GD tool for users with limited computational power (e.g. mobile phones) or users accessing the system through the Internet. The database of profiles and history is not shared among users, allowing each user to securely store their data in the server database, which guarantees that their model will be available to them at any time;
• IGTAI (Idea Generation Tool for Ambient Intelligence) is an idea generation tool that supports the idea generation task of a ubiquitous group decision meeting. It is designed for users with little experience of information systems and offers group knowledge management, ubiquitous access, user adaptiveness and pro-activeness, platform independence, and the formulation of a multi-criteria problem at the end of a work session.
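As a rough illustration of the kind of data model such a multi-criteria GDSS may manipulate, the following Java sketch represents a decision problem with alternatives, criteria and in favour/neutral/against argument postings. The class names and the simple trend measure are assumptions made for illustration, not the actual WebMeeting Plus implementation.

import java.util.ArrayList;
import java.util.List;

enum Position { IN_FAVOUR, NEUTRAL, AGAINST }

class Posting {
    final String author;
    final Position position;
    final String text;
    Posting(String author, Position position, String text) {
        this.author = author;
        this.position = position;
        this.text = text;
    }
}

class Alternative {
    final String name;
    final List<Posting> postings = new ArrayList<>();
    Alternative(String name) { this.name = name; }

    // A crude trend indicator: supporting minus opposing postings.
    int trend() {
        int score = 0;
        for (Posting p : postings) {
            if (p.position == Position.IN_FAVOUR) score++;
            else if (p.position == Position.AGAINST) score--;
        }
        return score;
    }
}

class DecisionProblem {
    final String description;
    final List<String> criteria;   // the decision criteria of the multi-criteria problem
    final List<Alternative> alternatives = new ArrayList<>();
    DecisionProblem(String description, List<String> criteria) {
        this.description = description;
        this.criteria = criteria;
    }
}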
4.3 ABS4GD and WebMeeting Plus

Agent-based simulation is considered an important tool in a broad range of areas, e.g. individual decision making ("what if" scenarios), e-commerce (simulating the behaviour of buyers and sellers), crisis situations (e.g. simulating fire fighting), traffic simulation, military training, and entertainment (e.g. movies). With the architecture we are proposing, we intend to support decision makers in both of the aspects identified by Zachary and Ryder [63]: supporting them in a specific decision situation, and giving them training facilities so that they can acquire the competencies and knowledge to be used in a real group decision meeting. We maintain that agent-based simulation can be used successfully for both tasks. As referred to in the introduction, multi-agent systems seem to be quite suitable for simulating the behaviour of groups of people working together [29, 8]. Each participant in the group decision-making process is associated with a set of agents that interact with the other participants. The community should be persistent, because it is necessary to have information about previous group decision-making processes, in particular the credibility, reputation and past behaviour of the other participants [2]. There are three different types of agents in our model: the Facilitator agent, the Assistant agent and the Participant agent (sketched below). The Facilitator agent helps the person responsible for the meeting with its organisation (e.g. the definition of the decision problem and the alternatives). During the meeting, the Facilitator agent coordinates the whole process and, at the end, reports the results of the meeting to the responsible person. The Assistant agent works as an assistant to a meeting participant, presenting all the updated information about the meeting; it acts as a bridge between the participant (user) and the Participant agent. The Participant agent is the most important agent of the model and is described in more detail in the next section.
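The division of labour among the three agent types can be summarised with the following Java interface sketch; the method names are hypothetical and merely mirror the responsibilities described above.

import java.util.List;

interface FacilitatorAgent {
    void configureMeeting(String decisionProblem, List<String> alternatives);
    void coordinateRound();        // drives the process during the meeting
    String reportResults();        // reported to the responsible person at the end
}

interface AssistantAgent {
    void presentMeetingState();    // bridge between the human participant and the participant agent
}

interface ParticipantAgentRole {
    void exchangeArguments();      // persuasion towards the other participants
    String preferredAlternative(); // classified as preferred, indifferent or inadmissible
}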
4.3.1 Participant Agent

The participant agent plays a very important role in the group decision support system. For that reason, we present its architecture and a detailed view of all its component parts. The architecture is divided into three layers: the knowledge layer, the communication layer and the reasoning layer, as shown in Fig. 4.
In the knowledge layer the agent holds information about the environment in which it is situated, about the public profiles of the other participant agents that compose the decision meeting group, and about its own preferences and goals (its own public and private profile). The information in the knowledge layer is subject to uncertainty [39] and becomes more accurate over time through the interactions carried out by the agent. The communication layer is responsible for the communication with other agents, for the interface with the user of the group decision-making simulator, and for the mixed-initiative interaction between participant and agents.
Fig. 4 Participant Agent Architecture
The reasoning layer contains four major components:
• The argumentation system, which is responsible for argument generation. This component generates explanatory arguments and persuasive arguments, the latter being more closely related to the agent's internal emotional state and to what it believes about the other agents' profiles (including their emotional state).
• The decision-making module, which supports the agent in choosing the preferred alternative and classifies the whole set of alternatives into three classes: preferred, indifferent and inadmissible.
• The emotional system [52], which generates emotions and moods, affecting the choice of the arguments to send to the other participants, the evaluation of the received arguments, and the final decision.
• The reputation module, which supports the user in defining the level of trust placed in the participant agent when delegating actions.
A structural sketch of these layers and components is given below.
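The sketch assumes nothing beyond the description above: field and class names are illustrative only and do not correspond to the actual ABS4GD classes.

import java.util.HashMap;
import java.util.Map;

class ParticipantAgentSkeleton {

    // Knowledge layer: the agent's own (public and private) profile plus
    // uncertain models of the other participant agents in the group.
    static class KnowledgeLayer {
        Profile self;
        Map<String, Profile> others = new HashMap<>();
    }

    // Communication layer: locution exchange with other agents, the simulator
    // interface and the mixed-initiative interaction with the user.
    interface CommunicationLayer {
        void send(String toAgent, String locution);
        void notifyUser(String message);
    }

    // Reasoning layer: the four components described above.
    static class ReasoningLayer {
        Object argumentationSystem;   // explanatory and persuasive argument generation
        Object decisionModule;        // preferred / indifferent / inadmissible classification
        Object emotionalSystem;       // emotions and mood
        Object reputationModule;      // trust level for delegation
    }

    // Minimal stand-in for a participant profile: preferences, credibility,
    // (un)preferred arguments and current mood would live here.
    static class Profile { }
}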
4.3.2 Argumentation System

Arguments may be classified according to type. Here we assume that the following six types of argument have persuasive force in human negotiations [43, 48]: threats; promises of a future reward; appeals to past reward; appeals to counter-example; appeals to prevailing practice; and appeals to self-interest. These are the arguments that agents will use to persuade each other. This selection of arguments is compatible with the power relations identified in the political model: reward, coercive, referent and legitimate [51]. This component generates persuasive arguments based on the information that exists in the participant agent's knowledge base [30].

Argumentation Protocol

During a group decision meeting, participant agents may exchange the following locutions: request, refuse, accept, and request with argument.
• Request(AgPi, AgPj, α, arg) - agent AgPi asks agent AgPj to perform action α; the parameter arg may be void, in which case it is a request without argument, or it may contain one of the arguments specified at the end of this section.
• Accept(AgPj, AgPi, α) - agent AgPj tells agent AgPi that it accepts its request to perform α.
• Refuse(AgPj, AgPi, α) - agent AgPj tells agent AgPi that it cannot accept its request to perform α.
The purpose of the participant agent is to replace the user when he or she is not available. For example, Fig. 5 shows the argumentation protocol for two agents. Note that one of the participants is not available at the moment, leaving all the actions to the participant agent (AgP2). Note also that this is the simplest scenario: in reality, group decision making involves more than two agents, and at the same time that AgP1 is trying to persuade AgP2, that agent may be involved in other persuasion dialogues with other group members. The autonomy of the participant agent is tied to the trust the real participant has in the agent. As the agent exchanges locutions with the other participants, the user can approve or reject the locutions made by the participant agent (no locution made by the agent is definitive), and the trust level can be increased or decreased accordingly.

Arguments Selection

In our model it is proposed that the selection of arguments should be based on the agent's emotional state. We propose the following heuristic:
• If the agent is in a good mood, it starts with a weak argument;
• If the agent is in a bad mood, it starts with a strong argument.
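The sketch below encodes the argument types and locutions above, together with the mood heuristic, in Java. The weak-to-strong ordering of the middle argument types is an assumption (only the extremes, appeals to prevailing practice and threats, are fixed by the scale adopted below), and the method names are illustrative.

import java.util.List;

// Roughly ordered from weaker to stronger; only the extremes are fixed by the
// adopted scale, the ordering of the middle types is an assumption.
enum ArgumentType {
    APPEAL_TO_PREVAILING_PRACTICE,
    APPEAL_TO_COUNTER_EXAMPLE,
    APPEAL_TO_PAST_REWARD,
    APPEAL_TO_SELF_INTEREST,
    PROMISE_OF_FUTURE_REWARD,
    THREAT
}

class Locutions {
    static String request(String from, String to, String action, ArgumentType arg) {
        return arg == null
                ? String.format("request(%s, %s, %s)", from, to, action)
                : String.format("request(%s, %s, %s, %s)", from, to, action, arg);
    }
    static String accept(String from, String to, String action) {
        return String.format("accept(%s, %s, %s)", from, to, action);
    }
    static String refuse(String from, String to, String action) {
        return String.format("refuse(%s, %s, %s)", from, to, action);
    }

    // Good mood: open with the weakest available argument; bad mood: the strongest.
    static ArgumentType openingArgument(boolean goodMood, List<ArgumentType> availableWeakToStrong) {
        return goodMood
                ? availableWeakToStrong.get(0)
                : availableWeakToStrong.get(availableWeakToStrong.size() - 1);
    }
}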
Fig. 5 Argumentation protocol
We adopt the scale proposed by Kraus for the definition of strong and weak arguments, in which appeals to prevailing practice are the weakest arguments and threats are the strongest. We define two distinct classes of arguments: a class for the weaker ones (i.e., appeals) and a class for the remainder (i.e., promises and threats). Inside each class the choice is conditioned by the existence, in the opponent's profile, of a preference (or dislike) for a specific argument type. If the agent does not hold information about that characteristic of the opponent, the selection inside each class follows the order defined by Kraus [23].

Arguments Evaluation

In each argumentation round the participant agents may receive requests from several partners, and most of them are probably incompatible. The agent should analyse all the requests based on several factors, namely the utility of the proposal, the credibility of the proponent and the strength of the argument. If a request does not contain an argument, its acceptance is conditioned by the utility of the request for the agent itself, the credibility of the proponent and one of the agent's profile characteristics, namely benevolence. We consider

Req^t_AgPi = { Request^t_1(AgP, AgPi, Action), ..., Request^t_n(AgP, AgPi, Action) },
where AgP represents the identity of the agent that performs the request, n is the total number of requests received at instant t, and Action is the requested action (e.g., voting for alternative number 1). The algorithm for evaluating this type of request (without arguments) is presented next:
if ¬profile_AgPi(benevolent) then
    foreach request(Proponent, AgPi, Action) ∈ Req^t_AgPi do
        refuse(Proponent, AgPi, Action)
    end
else
    foreach request(Proponent, AgPi, Action) ∈ Req^t_AgPi do
        if AgPO_AgPi ! Action then
            Requests ← Requests ∪ request(Proponent, AgPi, Action)
        else
            refuse(Proponent, AgPi, Action)
        end
    end
    (AgP, RequestedAction) ← SelectedMoreCredible(Requests)
    foreach request(Proponent, AgPi, Action) ∈ Requests do
        if Proponent = AgP or RequestedAction = Action then
            accept(Proponent, AgPi, Action)
        else
            refuse(Proponent, AgPi, Action)
        end
    end
end
If the requests are accompanied by arguments, then the evaluation is performed using the following criteria: the strength of the argument, the credibility of the opponent, the existence of a preference for the argument, the quality of the information held about the opponent, and a convincing factor (an analysis of the validity of the argument).
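One simple way to combine these criteria is a weighted score, as in the sketch below; the equal weights and the [0,1] scales are invented purely for illustration and are not the values used in ABS4GD.

class ArgumentEvaluation {
    double argumentStrength;      // position of the argument on the adopted strength scale, in [0,1]
    double proponentCredibility;  // taken from the agent's model of the proponent, in [0,1]
    double preferenceMatch;       // 1 if the argument type is preferred by this agent, 0 otherwise
    double informationQuality;    // quality of the information held about the opponent, in [0,1]
    double convincingFactor;      // result of checking the validity of the argument, in [0,1]

    // Equal weighting, purely for illustration.
    double score() {
        return (argumentStrength + proponentCredibility + preferenceMatch
                + informationQuality + convincingFactor) / 5.0;
    }

    boolean accept(double threshold) {
        return score() >= threshold;
    }
}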
4.3.3 Emotional System

The emotions simulated in our system [29] are those identified in the revised version of the OCC model: joy, hope, relief, pride, gratitude, like, distress, fear, disappointment, remorse, anger and dislike. The agent's emotional state (mood) is calculated in this module based on the emotions felt in the past and on the other agents' moods [52]. Each participant agent has a model of the other agents and, in particular, information about the other agents' moods. This model of the others takes into account the handling of incomplete information and the existence of explicit negation, following the approach described in [1]. Some of the properties that characterise the agent model are: gratitude debts, benevolence, credibility [2], and (un)preferred arguments. Although the emotional component is based on the OCC model, we believe that with the inclusion of mood we can overcome one of the major criticisms usually directed at this model, namely the fact
that the OCC model does not handle past interactions and (in our case) past emotions.
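As an illustration of how past emotions can feed a mood value, the sketch below keeps an exponentially decayed aggregate of recent emotion valences; the decay rule and the valence scale are assumptions, not the formula used in the system.

class MoodModel {
    private double mood = 0.0;         // negative values: bad mood, positive values: good mood
    private final double decay = 0.9;  // how strongly older emotions persist

    // valence in [-1, 1]: e.g. joy, pride or gratitude positive; distress, fear or anger negative
    void feel(double valence) {
        mood = decay * mood + (1 - decay) * valence;
    }

    boolean inGoodMood() {
        return mood >= 0;
    }
}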
4.4 Idea Generation Techniques

IGTAI aims to support the group in the idea generation task, so this section briefly reviews the techniques we studied. Several idea generation techniques are known today, e.g. the Nominal Group Technique (NGT), the Theory of Inventive Problem Solving Professional Classes-Great Results (TRIZ), mind-mapping, brainstorming [44], and cooperative KJ [62]. Mind-mapping, in its most basic form, is a simple hierarchy and can be drawn in any tree-shaped format; the idea is to add a formal structure to thinking, starting from the question that is to be answered. Brainstorming is a problem-solving method that aims to generate the maximum number of ideas for solving a problem in a collaborative and non-critical atmosphere. Alex Osborn, who founded the method in 1938, claimed that in a group a regular person can create twice as many ideas, or more, than when working alone. For the method to succeed, however, two principles and four rules must be followed. The two principles are "judgment delay" and "quantity creates quality". The first principle means that while using this technique there is no room for criticising other people's ideas, because that could stop the improvement of some ideas. The second says that having more ideas leads to a better final solution. The four rules are: focus on quantity (the greater the number of ideas generated, the greater the chance of producing a radical and effective solution); no criticism (instead of immediately stating what might be wrong with an idea, the participants focus on extending or adding to it, reserving criticism for a later "critical stage" of the process); unusual ideas are welcome (they may open new ways of thinking and provide better solutions than regular ideas); and combine and improve ideas (good ideas can be combined to form a very good idea; this approach leads to better and more complete ideas than the mere generation of new ones, and it increases the generation of ideas by a process of association).
4.5 Implementation

Some implementation details of the simulator (ABS4GD), of WebMeeting Plus and of the idea generation tool (IGTAI) are described here.

ABS4GD

ABS4GD is developed with the Open Agent Architecture (OAA), Java and Prolog. OAA is structured in order to: minimise the effort involved in the creation of new agents, which can be written in different languages and operate on diverse platforms;
encourage the reuse of existing agents; and allow for dynamism and flexibility in the creation of agent communities. More information about OAA can be found at www.ai.sri.com/oaa/. Some screens of the ABS4GD prototype running in LAID (Fig. 1) can be seen in Fig. 6 and Fig. 7.
Fig. 6 Community of agents
Fig. 6 shows the collection of agents at work at a particular moment in the simulator: ten participant agents, the facilitator agent (responsible for the follow-up of all simulations), the voting agent, the clock agent (OAA is not specifically designed for simulation, so a clock agent had to be introduced to control the simulation, as sketched below), the oaa_monitor (an agent that belongs to the OAA platform and is used to trace, debug and profile communication events for an OAA agent community) and the application agent (responsible for the communication between the community of agents and the simulator interface). Fig. 7 shows an extract of the arguments exchanged by the participant agents. Once a simulation is completed, the agents update their knowledge about the other agents' profiles (e.g. agent credibility).
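The role of the clock agent can be illustrated with the generic round-driving loop below; it does not use the real OAA API, and the interface is an assumption made for the sake of the example.

import java.util.List;

interface SimulatedAgent {
    void onTick(long round);   // e.g. exchange one round of argumentation locutions
}

class ClockAgent {
    private final List<SimulatedAgent> community;

    ClockAgent(List<SimulatedAgent> community) {
        this.community = community;
    }

    // Drives the whole community through a fixed number of simulation rounds.
    void run(long rounds) {
        for (long r = 1; r <= rounds; r++) {
            for (SimulatedAgent agent : community) {
                agent.onTick(r);
            }
        }
    }
}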
Fig. 7 Argumentation dialogues
WebMeeting Plus

WebMeeting Plus is a ubiquitous application intended to be used in a web browser and is developed in JavaScript, Java J2SE 5.0, OAA and Prolog. JavaScript and Java are used to develop the main application, while OAA and Prolog are used to implement the agents. The participants, as well as the organiser, can use it in our Ambient Intelligence for Decision Support Lab (Fig. 1) or on any other computer system; the only requirement is Internet access. The system intends to reduce the disadvantages of traditional meetings by delegating all the time-consuming and tedious duties to the agents. Every participant in the meeting has an agent that represents him or her in the interaction with the other participants. The main screen of the system can be seen in Fig. 8. Fig. 9, on the left, presents a window of the person responsible for the meeting, showing at the top a list of all the participants; the public profile of each participant can be visualised by the responsible person if he or she wishes. At the bottom of that window a graphic shows the trend of the meeting, and for each alternative its characteristics can be visualised, as well as how many participants may support it. On the right there is a participant window showing a table with all the requests made by a participant/agent of the meeting to all the other participants. For each request there is the response given by the participant who received it.
Fig. 8 WebMeeting Plus
Fig. 9 Organizer and Participant Window
A participant can change the requests he has made up to that point (e.g. he can change his mind about some subject or disagree with a request made by his agent). At the bottom, a graphic shows all the participant's preferred alternatives, together with the trends and the chances of each alternative.

IGTAI

IGTAI was developed in Java J2SE 5.0, and the communication between the
client and the server is provided by Java RMI technology (a minimal sketch is given after this paragraph). The desktop client is implemented as a Java applet, which allows it to be invoked over the Internet from any browser. The mobile client uses Java Micro Edition and the Java ME RMI 1.0 package. The data generated by the prototype is stored on the server side in a MySQL 5.0.21 relational database. For the development of the agent community, the Open Agent Architecture (OAA), developed at SRI International, was used. OAA is a framework for integrating a community of heterogeneous software agents in a distributed environment and is structured to minimise the effort of creating new agents. It allows agents to be created in various programming languages and operating platforms, and this particular detail allows us to reuse agents produced in other work. With this kind of development tools we achieve platform independence for the application. Fig. 10 shows the mobile version of IGTAI, which is for now a reduced version with limited functionality.
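The following is a minimal Java RMI sketch of the kind of client/server contract IGTAI could use for submitting ideas; the remote interface, its methods, the binding name and the server host are hypothetical.

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.util.List;

// Remote contract exposed by the idea generation server.
interface IdeaSession extends Remote {
    void submitIdea(String participant, String idea) throws RemoteException;
    List<String> listIdeas() throws RemoteException;
}

class IdeaClient {
    public static void main(String[] args) throws Exception {
        // Host, port and binding name are assumptions for illustration.
        Registry registry = LocateRegistry.getRegistry("igtai-server.example.org", 1099);
        IdeaSession session = (IdeaSession) registry.lookup("IdeaSession");
        session.submitIdea("participant-1", "Reduce the number of decision criteria");
        System.out.println(session.listIdeas());
    }
}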
Fig. 10 Meeting mobile application
Fig. 11 shows the Meeting Desktop panel, where the user ranking can be seen: a three-bar graphic representing the number of ideas introduced by the current user, the numbers of ideas introduced by the least and the most productive participants, and the total number of ideas introduced. This gives a sense of productivity and adds a little competition, which could lead to a productivity increase.

Fig. 11 Meeting Desktop panel
5 Conclusions

Nowadays, human beings spend a lot of time in working spaces such as offices and decision rooms. In spite of the amount of technology available (computers, networks, cameras, microphones), these spaces are still too passive. In order to make working environments smarter, or more intelligent, research is being carried out in the area of Smart Offices and Intelligent Decision Rooms. This research has produced some
prototype-level spaces that experiment with some of the functionalities that will certainly be available in the offices and meeting rooms of the future. In accordance with the methodology proposed in [49], Smart Offices and Intelligent Decision Rooms are expected to integrate AmI environments covering the following tasks:
• interpreting the environment state;
• representing the information and knowledge associated with the environment;
• modelling, simulating and representing entities in the environment;
• planning decisions or actions;
• learning about the environment and associated aspects;
• interacting with humans;
• acting on the environment.
The systems presented in this chapter address these tasks in part and have taken the first steps towards the future trends of this kind of environment.
6 Acknowledgements

We would like to acknowledge FCT (the Portuguese Foundation for Science and Technology) and the FEDER, POSI, POSC and PTDC programmes for their support of several R&D projects and fellowships.
References [1] Analide C, Neves J (2002) Antropopatia em entidades virtuais. In: I Workshop de Teses e Dissertacoes em Inteligencia Artificial (WTDIA02), Brazil [2] Andrade F, Neves J, Novais P, Machado J, Abelha A (2005) Collaborative Networks and Their Breeding Environments, Springer-Verlag, chap Legal Security and Credibility in Agent Based Virtual Enterprises, pp 501–512 [3] Asadi M, Huber M (2004) State space reduction for hierarchical reinforcement learning. In: FLAIRS Conference [4] Brard F (1994) Vision par ordinateur pour la ralit augmente : Application au bureau numrique. PhD thesis, Université Joseph Fourier, June [5] Brooks RA, Dang D, Bonet JD, Kramer J, Mellor J, Pook P, Stauffer C, Stein L, Torrance M, Wessler M (1997) The intelligent room project. In: Proceedings of the 2nd International Cognitive Technology Conference (CT’97), pp 271– 278 [6] Chen F, Gao Y, Chen S, Ma Z (2007) Connect-based subgoal discovery for options in hierarchical reinforcement learning. 3th International Conference on Natural Computation (ICNC 2007) 4:698–702 [7] Cook DJ, Das SK (2007) How smart are our environments? an updated look at the state of the art. Pervasive Mobile Computing 3(2):53–73 [8] Davidsson P (2001) Multi agent based simulation: beyond social simulation. In: MABS 2000: Proceedings of the second international workshop on Multiagent based simulation, Springer-Verlag, USA, pp 97–107 [9] Dodier R, Y DL, Riesz J, Mozer MC (1994) Comparison of neural net and conventional techniques for lighting control. applied mathematics and computer science. Applied Mathematics and Computer Science 4:447–462 [10] Fillinger A (Accessed in June 2008) The nist mark-iii microphone array. http://www.nist.gov/smartspace/mk3_presentation.html [11] Fillinger A (Accessed in June 2008) The nist smart space project. http://www.nist.gov/smartspace/ [12] Frank R (2000) Understanding smart sensors. Measurement Science and Technology 11(12):1830, URL http://stacks.iop.org/0957-0233/ 11/1830 [13] Freitas C, Marreiros G, Ramos C (2007) IGTAI - An Idea Generation Tool for Ambient Intelligence. In: 3rd IET International Conference on Intelligent Environments, pp 391–397, Ulm, Germany [14] Goel S, Huber M (2003) Subgoal discovery for hierarchical reinforcement learning using learned policies. In: FLAIRS Conference, pp 346–350 [15] Grudin J (2002) Group dynamics and ubiquitous computing. Commun ACM 45(12):74–78, DOI http://doi.acm.org/10.1145/585597.585618 [16] Heider T, Kirste T (2002) Intelligent environment lab. Computer Graphics Topics 15(2):8–10 [17] Huber GP (1982) Group decision support systems as aids in the use of structured group management techniques. In: Proc. of second international conference on decision support systems
[18] Huber GP (1984) Issues in the design of group decision support systems. MIS Quarterly [19] Huber M (2005) Smart Environments: Technology, Protocols and Applications, Wiley, chap Automated Decision Making [20] Johanson B, Fox A, Winograd T (2002) The interactive workspaces project: Experiences with ubiquitous computing rooms. IEEE Pervasive Computing 1(2):67–74 [21] Kang YB, Pisan Y (2006) A survey of major challenges and future directions for next generation pervasive computing. In: Computer and Information Sciences, ISCIS 2006, vol 4263/2006, pp 755–764 [22] Karacapilidis N, Papadias D (2001) Computer supported argumentation and collaborative decision making: the hermes system. Infor Syst 26(4):259–277 [23] Kraus S, Sycara K, Evenchik A (1998) Reaching agreements through argumentation: A logical model and implementation. Artificial Intelligence 104:1– 69 [24] Kulkarni A (2002) Design Principles of a Reactive Behavioral S ystem for the Intelligent Room. Bitstream: The MIT Journal of EECS Student Research [25] Le Gal C (2005) Smart Environments: Technology, Protocols and Applications, Wiley, chap Smart Offices [26] Le Gal C, Martin J, Lux A, Crowley J (2001) Smartoffice: design of an intelligent environment. IEEE Intelligent Systems 16(4):60–66, DOI 10.1109/5254. 941359 [27] Lewis L (1982) Facilitator: A microcomputer decision support systems for small groups. PhD thesis, University of Louisville [28] Marreiros G, Sousa J, Ramos C (2004) Webmeeting - a group decision support system for multi-criteria decision problems. In: International Conference on Knowledge Engineering and Decision Support, Porto, Portugal, pp 63–70 [29] Marreiros G, Ramos C, Neves J (2005) Dealing with emotional factors in agent based ubiquitous group decision. LNCS 3823:41–50 [30] Marreiros G, Ramos C, Neves J (2006) Multi-agent approach to group decision making through persuasive argumentation. In: 6th International Conference on Argumentation ISSA 2006 [31] Marreiros G, Santos R, Ramos C, Neves J, Novais P, Machado J, Bulas-Cruz J (2007) Ambient intelligence in emotion based ubiquitous decision making. In: Artificial Intelligence Techniques for Ambient Intelligence, IJCAI 07, India [32] Marreiros G, Santos R, Freitas C, Ramos C, Neves J, Bulas-Cruz J (2008) Laid - a smart decision room with ambient intelligence for group decision making and argumentation support considering emotional aspects. International Journal of Smart Home 2(2):77–94 [33] Marsa-Maestre I, de la Hoz E, Alarcos B, Velasco JR (2006) A hierarchical, agent-based approach to security in smart offices. In: Proceedings of the First International Conference on Ubiquitous Computing (ICUC-2006) [34] Marsa-Maestre I, Lopez-Carmona MA, Velasco JR, Navarro A (2008) Mobile agents for service personalization in smart environments. Journal of Networks (JNW) 3(5):30–41
[35] Mikic I, Huang K, Trivedi M (2000) Activity monitoring and summarization for an intelligent meeting room. In: Workshop on Human Motion, pp 107–112 [36] Minder D, Marrón PJ, Lachenmann A, Rothermel K (2005) Experimental construction of a meeting model for smart office environments. In: Proceedings of the First Workshop on Real-World Wireless Sensor Networks (REALWSN 2005) [37] Mitchell TM (1997) Machine Learning. McGraw-Hill Higher Education [38] Moore D (2002) The idiap smart meeting room. IDIAP-COM 07, IDIAP [39] Neves J (1984) A logic interpreter to handle time and negation in logic data bases. In: ACM 84: Proceedings of the 1984 annual conference of the ACM on The fifth generation challenge, ACM, New York, NY, USA, pp 50–54, DOI http://doi.acm.org/10.1145/800171.809603 [40] Nijholt A (2004) Meetings, gatherings, and events in smart environments. In: ACM SIGGRAPH international Conference on Virtual Reality Continuum and Its Applications in industry, pp 229–232 [41] Nunamaker JF, Dennis AR, Valacich JS, Vogel D, George JF (1991) Electronic meeting systems. Commun ACM 34(7):40–61, DOI http://doi.acm.org/ 10.1145/105783.105793 [42] Nunamaker JF, Briggs RO, Mittleman DD, Vogel DR, Balthazard PA (1996) Lessons from a dozen years of group support systems research: a discussion of lab and field findings. Journal Management Information Systems 13(3):163– 207 [43] O’Keefe D (1990) Persuasion: Theory and Research. SAGE Publications [44] Osborn AF (1963) Applied Imagination 3rd ed. Scribner, New York [45] Petzold J, Bagci F, Trumler W, Ungerer T, Vintan L (2004) Global state context prediction techniques applied to a smart office building. In: Communication Networks and Distributed Systems Modeling and Simulation Conference, USA [46] Petzold J, Bagci F, Trumler W, Ungere T (2005) Next location prediction within a smart office building. In: Gellersen HW, Want R, Schmidt A (eds) Third International Conference PERVASIVE 2005, Munich, Germany [47] Prante T, Stenzel R, Röcker C, Streitz N, Magerkurth C (2004) Ambient agoras: Inforiver, siam, hello.wall. In: CHI ’04: extended abstracts on Human factors in computing systems, ACM, NY, USA, pp 763–764 [48] Pruitt DG (1981) Negotiation Behavior. Academic Press, New York [49] Ramos C, Augusto JC, Shapiro D (2008) Ambient intelligence - the next step for artificial intelligence. IEEE Intelligent Systems 23(2):15–18 [50] Russell SJ, Norvig P (1995) Artificial intelligence: a modern approach. Prentice-Hall, Inc., Upper Saddle River, NJ, USA [51] Salancik G, Pfeffer J (1977) Who gets power and how they hold on to it: A strategic-contingency model of power. Organizational Dynamics 5:3–21 [52] Santos R, Marreiros G, Ramos C, Neves J, Bulas-Cruz J (2006) Multi-agent Approach for Ubiquitous Group Decision Support Involving Emotions, LNCS, vol 4159, Springer Berlin / Heidelberg, chap Ubiquitous Intelligence and Computing, pp 1174–1185
[53] Satyanarayanan M (2001) Pervasive computing: vision and challenges. Personal Communications, IEEE [see also IEEE Wireless Communications] 8(4):10–17 [54] Shrobe DHE (Accessed in June 2008) Intelligent room project. http://w5.cs.uni-sb.de/ butz/teaching/ie-ss03/papers/IntelligentRoom/ [55] Stanford V, Garofolo J, Galibert O, Michel M, Laprun C (2003) The (nist) smart space and meeting room projects: signals, acquisition annotation, and metrics. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03). 2003, vol 4, pp 6–10 [56] Streitz NA, Röcker C, Prante T, van Alphen D, Stenzel R, Magerkurth C (2005) Designing smart artifacts for smart environments. Computer 38(3):41– 49 [57] Trumler W, Bagci F, Petzold J, Ungerer T (2003) Smart doorplate. Personal Ubiquitous Computing 7(3-4):221–226 [58] Waibel A, Schultz T, Bett M, Denecke M, Malkin R, Rogina I, Stiefelhagen R, Yang J (2003) Smart: the smart meeting room task at ISL. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, pp 281–286 [59] Want R, Hopper A, Falcao V, Gibbons J (1992) The active badge location system. ACM Transactions Information Systems 10(1):91–102 [60] Weiser M (1991) The computer for the twenty-first century. Scientific American 265(3):94–104 [61] Wellner P (1991) The digitaldesk calculator: tangible manipulation on a desk top display. In: UIST ’91: Proceedings of the 4th annual ACM symposium on User interface software and technology, ACM, NY, USA, pp 27–33 [62] Yuizono T, Munemori J, Nagasawa Y (1998) Gungen: groupware for a new idea generation consistent support system. In: Proceedings 3rd Asia Pacific on Computer Human Interaction, pp 357–362 [63] Zachary W, Ryder J (1997) Handbook Of Human-Computer Interaction, Elsevier Science, chap Decision Support Systems: Integrating Decision Aiding and Decision Training, pp 1235–1258 [64] Zimmermann HJ (1999) Practical Applications of Fuzzy Technologies. Kluwer Academic Publishers, Norwell, MA, USA
Smart Classroom: Bringing Pervasive Computing into Distance Learning

Yuanchun Shi, Weijun Qin, Yue Suo, Xin Xiao
Yuanchun Shi, Key Laboratory of Pervasive Computing, Tsinghua University, Beijing 100084, China, e-mail: [email protected]
Weijun Qin, Key Laboratory of Pervasive Computing, Tsinghua University, Beijing 100084, China, e-mail: [email protected]
Yue Suo, Key Laboratory of Pervasive Computing, Tsinghua University, Beijing 100084, China, e-mail: [email protected]
Xin Xiao, Key Laboratory of Pervasive Computing, Tsinghua University, Beijing 100084, China, e-mail: [email protected]

1 Introduction

In recent years, distance learning has increasingly become one of the most important applications on the Internet, and it is being discussed and studied by various universities, institutes and companies. The Web/Internet provides relatively easy ways to publish hyper-linked multimedia content to a wider audience, yet we find that most courseware is simply shifted from the textbook to HTML files. In most cases, however, the teacher's live instruction is very important for catching the attention and interest of the students. That is why a Real-Time Interactive Virtual Classroom (RTIVC) always plays an indispensable role in distance learning: teachers and students located in different places can take part in the class synchronously through certain multimedia communication systems and obtain real-time and media-rich interactions using pervasive computing technologies [1]. The Classroom 2000 project [2] at GIT has been devoted to the automated capture of the classroom experience. Likewise, the Smart Classroom project [3] at our institute is focused on tele-education. Most currently deployed real-time tele-education systems are desktop-based, in which the teacher's experience is totally different from teaching in a real classroom. For example, he/she should remain stationary in front of
a desktop computer instead of moving freely in the classroom. The goal of the Smart Classroom project is to narrow the gap between the teacher's experience in tele-education and in traditional classroom education, by integrating these two currently separated education environments. Our approach was to move the user interface of a real-time tele-education system from the desktop into the 3D space of an augmented classroom (called the Smart Classroom), so that in this classroom the teacher can interact with the remote students through multiple natural modalities, just as if interacting with the local students. However, to provide this type of distance learning on a large scale, some barriers still remain:
• Desktop-based teaching metaphors are not natural enough for most teachers. Most current RTIVC systems are desktop-based, i.e. the users must remain stationary in front of a desktop computer and use the keyboard or mouse to operate the class or to interact with others. This metaphor is particularly unacceptable for teachers, because the experience differs greatly from that in a real classroom, where they can move freely, talk to the students with hand gestures and eye contact, and illustrate their ideas conveniently by scribbling on the blackboard. Many teachers involved in tele-education, when talking with us, complained that this divergence of experience makes them uncomfortable and reduces the efficiency of the teaching and learning activity.
• Lack of adequate technologies to cope with large-scale access. Most tele-education schools simply adopt commercial videoconference products as the operating platform for RTIVC, where all clients connect to a central MCU (Multi-Point Controlling Unit) and data originating from one client is replicated (and sometimes mixed) and forwarded to all other clients by the MCU. These systems are not scalable, because the maximum number of users (usually around ten) is rigidly limited by the capacity of the MCU. So, today, most tele-education schools can only operate RTIVC classes with a small number of students. A possible approach to the scalability issue is to leverage IP Multicast technology, where no central data-replicating node like the MCU is required, but the current state of the art of IP Multicast is still not fully mature.
• Lack of adequate technologies to accommodate students with different network and device conditions in one session. Most current RTIVC systems have rigid requirements on the network and device conditions of the clients. Clients with inferior conditions either cannot join the session or cannot obtain smooth service quality; on the other hand, clients with superior conditions cannot take full advantage of their extra capabilities. Since handheld devices such as Pocket PCs and smart phones, and wireless networks such as GPRS and WiFi, are becoming more and more popular, it is a natural demand to allow people to receive lifelong education through these devices, which will inevitably have diverse capabilities and network connections.
The Smart Remote Classroom project at our institute is a long-term project to overcome the above-mentioned difficulties in the current practice of RTIVC and to build an integrated system as an exemplar for the next generation of real-time
interactive distance learning in China. The aim of this project is to develop new pervasive computing technologies in the classroom that provide practical, natural and convenient multimodal interfaces and context-aware applications to assist the local teacher, and to develop large-scale remote e-learning interaction to enhance the class activity between the local teacher and the remote students. The rest of the chapter is organized as follows: Sect. 2 gives an overview of the Smart Classroom and a typical class scenario illustrating the user experience. Sect. 3 presents the key multimodal interface and context-awareness technologies for building the sensor-rich classroom environment. Large-scale real-time collaborative distance learning technology is investigated in Sect. 4, which shows how an augmented learning experience and interactions between the local class and remote students are provided. In Sect. 5 we introduce the infrastructure, called Smart Platform, that integrates the various components to create the virtual class, and conclusions are finally drawn in Sect. 6.
2 Smart Classroom: Sensor-Rich Smart Environment

In this section we briefly introduce the layout of the Smart Classroom, which was set up in the Key Laboratory of Pervasive Computing, Tsinghua University, and give a typical user experience scenario in distance learning.
2.1 The Layout of Smart Classroom

The Smart Classroom is physically built in a separate room of the Pervasive Computing Lab at Tsinghua University. As shown in Fig. 1, several video cameras and microphone arrays are installed in it to sense human gestures, motion and utterances. In keeping with the invisibility characteristic of pervasive computing environments, we deliberately moved all the computers out of sight. Two wall-sized projector displays are mounted on two perpendicular walls; according to their purposes, they are called the "Media Board" and the "Student Board". The Media Board serves as the lecturer's blackboard, on which the prepared electronic courseware and the lecturer's annotations are displayed. The Student Board displays the status and information of the remote students, who take part in the class via the Internet. The classroom is divided into two areas, complying with the real-world classroom model. One is the teaching area, which is close to the two boards and is usually occupied by the lecturer. The other is the audience area, which is the place for local students. Why are both remote students and local students supported in this room? The answer is simple: we are following the philosophy of Natural and Augmented. Natural means that we follow the real-world model of a classroom as much as possible to give the lecturer and students a feeling of reality and familiarity, which leads to the presence of local students. Augmented means that we try to extend beyond the limitations imposed by the incapability of traditional technology, which
is the reason for the remote students. A snapshot of the Smart Classroom prototype is shown in Fig. 2.
2.2 A Typical User Experience Scenario in Smart Classroom

The following is a typical user experience scenario that takes place within the Smart Classroom. Several persons enter the room through the door. At the door, an audio-visual identification module identifies each entering person through facial and voice identification. If a person is identified as the lecturer, he is granted the control rights of the Smart Classroom; in addition, he carries a badge embedded with a location sensor. The visual motion-tracking module tracks the lecturer's motion in the room. Once he steps into the teaching area, he is able to use gesture and voice commands to exploit the Smart Classroom to give lessons. Persons in the Smart Classroom other than the lecturer are regarded as local students. When the lecturer is in the teaching area, he can start the class by just saying, "Now let's start our class." The Smart Classroom starts the Microphone Array agent to capture and recognise his voice command, and then launches the necessary modules such as the Virtual Mouse agent and the Same View agent (which will be discussed later). The lecturer loads the prepared electronic courseware with an utterance such as "Go to Chapter 1 of the Multimedia course".
Fig. 1 The Layout of Smart Classroom (showing the MediaBoard, StudentBoard, Teacher Area, Audience Area, Location Receiver, Microphone Array, SmartBoard, projectors and cameras)
In the meantime, the Smart Cameraman agent is activated to focus on the lecturer based on his current location and to show a close-up view. The HTML-based courseware is then projected onto the wall display. The lecturer can use hand motions to drive the Virtual Mouse agent and annotate on the electronic board. Several types of hand gesture are assigned corresponding semantic meanings, triggering operations such as highlighting, annotating, adding pictures, removing objects, executing links and scrolling pages on the electronic board. The lecturer can also grant the speech right to remote students by finger pointing or by a voice command such as "Suo, please give us your opinion". On the Student Board, the remote students' photos and information such as name, role and speech right are displayed. When a remote student requests the floor, his icon on the Student Board twinkles. Once the lecturer grants the floor to a specific remote student, that student's video and audio streams are played synchronously both in the Smart Classroom and on the other remote students' computers.
3 Key Multimodal Interface and Context-Awareness Technologies

In this section, we present the key multimodal interface and context-awareness technologies in the Smart Classroom. They merge speech, handwriting, gestures, location tracking, direct manipulation, large projected touch-sensitive displays and laser pointer tracking, aiming to provide an enhanced experience for both teachers and students and to detect the basic contextual information needed to build context-aware applications that help people accomplish their tasks.
Fig. 2 A Snapshot of Smart Classroom Prototype
3.1 Location Tracking

In the Smart Classroom project, the aim of location tracking is to capture a target's indoor location context using location sensors and positioning technologies so as to provide location-based services. In the Smart Classroom scenario, the lecturer's location is sensed through the badge he carries and calculated from the signals received by multiple receivers. The Smart Classroom system can then provide context-aware services based on the lecturer's location. For example, the Smart Cameraman Agent, built on a pan-tilt camera, can focus on the lecturer based on his precise position while he is speaking inside the teaching area. To support location-aware computing in indoor smart environments, the location sensing and positioning system must not only provide objects' precise locations but also offer characteristics such as isotropy and portability. In this subsection, we present an indoor location sensing system, Cicada. The system estimates distance from the TDOA (time difference of arrival) between a radio-frequency signal and an ultrasound pulse, and combines a Slide Window Filter (SWF) with an Extended Kalman Filter (EKF) to calculate location. As a result, it can determine coordinate locations within a 5 cm average deviation for both static and mobile objects, and it has a nearly omni-directional working area. Moreover, it runs independently and is small and light, so it can easily be carried or even embedded into a person's belongings.
3.1.1 Working Principle and Framework

Cicada is similar in physical principle to Cricket [4]: it is also based on the TDOA between RF (radio frequency) and ultrasound. Unlike Cricket, however, Cicada operates in active mode. The framework of Cicada is shown in Fig. 3. Cicada consists of three main parts: CBadges, CReaders and the Location Server. CBadges are carried by users or attached to the objects to be located; they periodically emit an RF signal and an ultrasound pulse at the same time, and each RF signal is modulated with the emitting CBadge's ID. CReaders are deployed at fixed locations in a building, e.g. on ceilings, whose coordinates are measured beforehand. Because RF and ultrasound propagate at different speeds, the TDOA between them from a CBadge to a CReader is directly proportional to the distance between them, the coefficient being the speed of sound in air (the propagation time of the RF signal is neglected). Based on this principle, a CReader can infer its distance from a CBadge and report it to a dedicated computer, the Location Server, through a serial port cable. The Location Server collects all the distances and calculates the locations; applications acquire location information from the Location Server as its clients. Close-ups of the CBadge and the CReader are shown in Fig. 4, next to an RMB coin and a ruler (in centimeters) for scale. The CBadge has a rechargeable 3 V lithium battery attached to its back, and its omni-directional ultrasound transmitter is composed of five mutually orthogonal ultrasound sensors. The CReader has two
ports: a USB port, which connects to a computer's serial port through a USB-to-COM converter, and a parallel port, which connects to a computer's parallel port. The parallel port is left unplugged unless the CReader is being re-programmed, whereas the USB port must remain connected to the Location Server the whole time Cicada is running.
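To make the working principle concrete, the following minimal Python sketch shows how a CReader-style device could turn a measured TDOA into a distance estimate and package it into the [r, b, d, t] record used by the positioning algorithm in Sect. 3.1.2. The speed-of-sound constant and the function and field names are illustrative assumptions, not the actual Cicada firmware.

# Minimal sketch (not the actual Cicada firmware): a CReader converts the
# measured TDOA between the RF packet and the ultrasound pulse into a
# distance estimate and reports the record [reader_id, badge_id, distance, time].

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature (assumed constant)

def distance_from_tdoa(tdoa_seconds: float) -> float:
    """RF propagation time is neglected, so distance ~ v_sound * TDOA."""
    return SPEED_OF_SOUND * tdoa_seconds

def make_report(reader_id: str, badge_id: str, tdoa_seconds: float, timestamp: float):
    """Build the [r, b, d, t] record sent to the Location Server."""
    return (reader_id, badge_id, distance_from_tdoa(tdoa_seconds), timestamp)

# Example: a 10 ms TDOA corresponds to roughly 3.43 m.
print(make_report("CReader-3", "CBadge-17", 0.010, 1234.5))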
3.1.2 Positioning Algorithm

After the distance set has been acquired from the CReaders, the positioning algorithm is responsible for inferring the objects' coordinate locations. In Cicada, the positioning computation has two phases: distance filtering and location calculation.
• Distance Filter
Due to reflection, obstruction and diffraction of the ultrasound by walls, furniture, instruments and so on, the indoor multi-path effect occurs very often. That is, quite a few distance measurements are invalid because the signal did not travel along a LOS (line-of-sight) path. Before the distances enter the location calculation, these invalid values must be filtered out carefully. Cicada adopts a method called the Slide Window Filter (SWF). The data received from the CReaders consists of the CReader's ID, the CBadge's ID, the distance between the CReader and the CBadge, and the timestamp at which the CBadge emitted its signal, i.e. a 4-tuple [r, b, d, t]. For a given CReader and CBadge, the data is a set of pairs [ti, di], i = 1, 2, ..., L, in ascending order of time.
Fig. 3 Working Principle and Framework of Cicada
Define a distance tuple as the triple Di = [ti, di, vi], where vi = (di − di−1)/(ti − ti−1). A slide window is a circular queue that stores the recent distance tuples for each pair of CReader and CBadge; its rear points to the distance tuple received most recently, and its front points to the earliest tuple, which is dequeued when the window is full and a new tuple arrives. Define the average velocity of the slide window as V̄ = (1/n) ∑_{i=1}^{n} v_i. When a new distance tuple N = [t, d, v] is received, it is added to the slide window if v ≤ min{αV̄, Vmax}; otherwise it is rejected as an invalid distance. Here Vmax is the maximal velocity at which objects are expected to move, and α is a factor bounding the allowed acceleration. In our actual applications, for example, Vmax is set to 1.5 m/s, a typical indoor walking speed, and α = 1.5 (a minimal sketch of this filter is given at the end of this subsection).
• Location Calculation
For the location calculation method, after comparing it with an intuitive alternative, the Linear Equations Method (LEM), we adopted the Extended Kalman Filter (EKF) as Cicada's location calculation algorithm. The Kalman Filter [5] is an optimal estimation method for a linear dynamic system perturbed by Gaussian white noise, and the Extended Kalman Filter [6] extends it to non-linear systems. The EKF consists of circular iterations, each of which is called a time-step; a time-step comprises a prediction phase and a correction phase, and each time-step guarantees that the error covariance between the estimated value and the actual value is minimal. Moreover, since the EKF works in the time domain, its computational complexity is low compared with methods that work in the frequency domain, such as the Wiener filter. For location calculation, Cicada adopts an EKF based on a position-velocity (PV) model.
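The following Python sketch illustrates the Slide Window Filter acceptance rule described above. It is a simplified interpretation, not the project's implementation: velocity magnitudes are used, the window is seeded with the first few samples before the test is applied, and a small epsilon keeps the bound from collapsing to zero for a perfectly static badge.

# A minimal sketch of the Slide Window Filter, under the assumptions stated above.
from collections import deque

class SlideWindowFilter:
    def __init__(self, window_size=8, v_max=1.5, alpha=1.5):
        self.window = deque(maxlen=window_size)  # recent (t, d, v) tuples
        self.v_max = v_max        # maximal plausible indoor moving speed (m/s)
        self.alpha = alpha        # tolerance factor on the recent average velocity
        self.last = None          # last accepted (t, d)

    def accept(self, t, d):
        """Return True if the distance sample (t, d) is kept, False if rejected."""
        if self.last is None:
            self.last = (t, d)
            self.window.append((t, d, 0.0))
            return True
        t0, d0 = self.last
        if t <= t0:
            return False          # out-of-order or duplicate timestamp
        v = abs(d - d0) / (t - t0)                     # implied radial velocity
        v_bar = sum(x[2] for x in self.window) / len(self.window)
        # warm-up acceptance, then the rule v <= min(alpha * V_bar, V_max)
        if len(self.window) < 3 or v <= min(self.alpha * v_bar + 1e-3, self.v_max):
            self.window.append((t, d, v))
            self.last = (t, d)
            return True
        return False              # treated as a multi-path outlier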
Fig. 4 The CBadge (a) and the CReader (b) in Cicada
We evaluated the positioning precision of the Cicada system in the Smart Classroom, as shown in Table 1. The results show that the average distance deviation when tracking a moving object is under 26 cm.
Table 1 The Position Precision of Cicada
PARAMETER                     VALUE
Average Deviation of X-axis   5.7 cm
Average Deviation of Y-axis   3.8 cm
Average Deviation of Z-axis   23 cm
Average Planar Deviation      7.3 cm
Average Distance Deviation    26 cm
3.2 Direct Manipulation based on Laser Pointer Tracking

Large displays are now widely deployed in smart environments. In the Smart Classroom, large displays on wall surfaces or SmartBoards are commonly used to show courseware. However, traditional interaction devices designed for desktop screens, such as mice and keyboards, have various limitations in such environments. In this subsection, we present a novel human-computer interaction system, CollabPointer, for facilitating interaction with large displays in smart spaces. A laser pointer augmented with three additional buttons and wireless communication modules is introduced as the input device. Three features distinguish CollabPointer from other interaction technologies:
• First, the coordinates of the red laser spot that the pointer projects onto the screen are interpreted as the cursor position, and the additional buttons wirelessly emulate a mouse's buttons over radio frequency. This enables remote interaction at any distance.
• Second, when multiple users are interacting, the two-step association method described in this subsection lets the system identify the different laser pointers and support multi-user collaboration.
• Last but not least, the laser pointer transmits its identity over radio frequency during interaction, so the system can recognize different users and treat them separately.
3.2.1 Design of the CollabPointer Device

A common laser pointer augmented with three additional buttons and wireless communication modules is introduced as the interaction device in our system.
In addition, to receive the RF signal emitted by the laser pointer and transmit it to the computer, a new piece of hardware called the Receiver is introduced.
• The laser pointer
Fig. 5 shows its architecture. There are three additional buttons on the laser pointer: the On/Off button, the Right button and the Left button. Their functions are summarized in Table 2.
Table 2 Functions of Buttons
BUTTON         FUNCTION
On/Off Button  (1) Emitting a laser beam; (2) Broadcasting the user's identity
Right Button   Wirelessly emulating a mouse's right button
Left Button    Wirelessly emulating a mouse's left button
On/Off button: This button is a switch, used not only for turning on the laser pointer but also for broadcasting the user's identity through the wireless communication modules. When the button is down, the laser beam is emitted and, at the same time, the ID of the laser pointer is broadcast over radio frequency. The system receives the ID through the Receiver, and based on this ID we assign different users different access priorities. For example, an administrator has more power than a guest.
Fig. 5 Architecture of Laser Pointer
Right Button: This button wirelessly emulates a standard mouse's right button. Its state is transmitted to the computer over radio frequency; the computer receives it via the Receiver and maps it to a standard mouse's right button. When multiple users are interacting simultaneously, users are ranked according to their identities, and the computer is controlled by the user with the highest access priority. For example, when a teacher and a student press the Right Button simultaneously, the system responds to the teacher's action and ignores the student's.
Left Button: It is nearly the same as the Right Button; the only difference is that it emulates a mouse's left button.
• The Receiver
The Receiver receives the radio-frequency signal emitted by the laser pointer, decodes it, and then transfers it to the computer through a USB interface. Corresponding to the three buttons on the laser pointer, the computer interprets the received signal as the laser pointer's ID, the Right Button's state or the Left Button's state. Fig. 6 shows the prototype of the Receiver. First, it decodes the radio-frequency signal emitted by the laser pointer. Next, a 74LS00 IC converts the electrical signal to a logical zero, which serves as an interrupt input for the microcontroller. Finally, and most importantly, the microcontroller is programmed to communicate with the computer through the USB interface; Philips' PDIUSBD12 IC serves as a bridge between the microcontroller and the computer.
Fig. 6 Prototype of the Receiver
3.2.2 Interaction Mode

For a single user, the laser pointer emulates a standard mouse and can be used to interact at a distance. For multiple users, the system can identify the various laser pointers and afford seamless, parallel collaboration among several users. In addition, we obtain each user's identity while he is interacting, so different people can be treated differently.
• Single User
In this mode, the coordinates of the laser spot on the display are mapped to the position of the cursor, which means the user can directly manipulate the cursor with the laser pointer. The Right/Left button on the laser pointer wirelessly emulates a mouse's right/left button. With this new interaction device, people can "point and click" wirelessly at any distance. A computer-vision-based module called Laser2Cursor locates the laser spot and maps it to the cursor position. Laser2Cursor embodies a number of ideas not seen in previous work on laser pointer tracking. First, we developed a training process to improve the system's adaptability: by learning the background of the image captured by the cameras and parameters such as color segmentation and motion detection thresholds, the system automatically adapts to new environments. Second, to improve robustness, we integrate multiple cues such as color, motion and shape in the laser spot detection. Because most people's hands are unsteady, the spot's exact position usually jitters when a person aims a laser pointer; we use this characteristic as an additional cue to detect the spot. Next, a Kalman filter smoothes the spot's trajectory, which tends to be irregular and rough (a minimal sketch of such a smoother is given at the end of this subsection). In addition, the laser pointer broadcasts the user's ID via radio frequency while it is working; the system receives it through the Receiver and treats users separately. For example, we assign different users different access priorities. In the Smart Classroom, where the system has been implemented, teachers and students are assigned different access priorities: with a teacher's laser pointer, the user can view all files on the computer, while with a student's pointer he can only view his own documents.
• Multi-User
When multiple users are interacting simultaneously, associating the laser spots on the screen with the corresponding laser pointers is the basis for collaboration. As shown in Fig. 7, we achieve this in two steps: the first is to associate laser strokes with the corresponding laser pointers, and the second is to associate laser spots with laser strokes.
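The sketch below shows a constant-velocity Kalman filter for smoothing the detected laser-spot trajectory, as an illustration of the smoothing step mentioned above. The frame rate and the process/measurement noise parameters are illustrative assumptions, not the values used in Laser2Cursor.

# A minimal constant-velocity Kalman filter for 2D cursor smoothing (sketch only).
import numpy as np

class SpotSmoother:
    def __init__(self, dt=1/30.0, q=50.0, r=4.0):
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)   # state transition
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # we observe (x, y) only
        self.Q = q * np.eye(4)    # process noise (hand jitter)
        self.R = r * np.eye(2)    # measurement noise (detection error, pixels)
        self.x = None             # state [x, y, vx, vy]
        self.P = np.eye(4) * 1e3

    def update(self, zx, zy):
        """Feed one detected spot position; return the smoothed cursor position."""
        z = np.array([zx, zy], dtype=float)
        if self.x is None:
            self.x = np.array([zx, zy, 0.0, 0.0])
            return zx, zy
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # correct
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return float(self.x[0]), float(self.x[1])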
3.2.3 Use Case in Smart Classroom

Using VC++ 6.0, we developed an application named M-Drawing to enhance collaboration among a number of users. When it runs, a completely transparent window
covers the screen. Multiple users can draw on it with laser pointers simultaneously, and the strokes appear in different colors for different users. Fig. 8 shows a scenario in which two people are discussing a presentation slide.
3.3 Speaker Tracking based on Microphone Array and Location Sensing

In practice, the teacher sometimes has to carry a wireless microphone when the class is held in a large classroom, yet it is uncomfortable to wear such an awkward piece of equipment. In the Smart Classroom project, speaker tracking technology was investigated to capture the speaker's voice without attaching any equipment
Fig. 7 Associating Laser Spots to Laser Pointers
Fig. 8 Two Users are Discussing.
to the speaker's clothes, using microphone arrays together with the location tracking device Cicada described in Sect. 3.1. The role of Cicada is to detect the locations of the participants in the room and determine their identities: a small badge carrying a distinctive user ID is attached to each participant, so every participant has a unique ID and can be distinguished from the others. With the badges, all participants can be tracked simultaneously and precisely. Four microphone arrays fixed to the walls are used to capture the audio features of the speaker. Each microphone array has eight omni-directional microphones arranged in a horizontal line with an inter-sensor separation of 8 cm. The audio signals obtained from the microphone arrays are processed to estimate the TDOA between different microphone channels. These two modalities are then integrated by a speaker association algorithm to identify where, and who, the currently active speaker is. The following subsections describe the system in detail.
3.3.1 Capturing Audio Features using the Microphone Array

Precise person localization has become a basic technology in a smart classroom and is useful in a variety of tasks, including experience recording, activity analysis and camera steering. Meanwhile, reliable identification of the speaker (e.g. the teacher), i.e. determining who is speaking, can provide additional information for audio-video indexing and retrieval tasks in experience recording. The task of speaker localization and identification poses three basic problems: locating all the participants in the smart classroom, distinguishing the speaker from the other, silent participants, and determining the speaker's identity to provide basic context information in the smart classroom environment. Our system uses four microphone arrays to capture the audio features of the speaker. As shown in Fig. 9, the four microphone arrays are placed on different walls of the room at a height of 150 cm. The X-Y plane of the microphone array coordinate system is parallel to the ground. Since the arrays are placed at approximately the mouth level of the users, we can make the simplifying assumption that sound sources lie in the microphone array plane. The audio feature is based on estimates of the TDOA between different microphone channels in the array. Several practical TDOA estimation algorithms have
Fig. 9 Deployment of the Microphone Arrays in the Smart Classroom: (a) an eight-element microphone array; (b) four arrays fixed on different walls
been explored in [7] and [8]; we chose the PHAT-GCC (Phase Transform Generalized Cross-Correlation) method, which is relatively robust in a room environment. Consider two microphones m_i and m_j in an array, and let x_i and x_j be the signals recorded at the corresponding microphones. Then x_i and x_j can be modeled as

x_i(t) = h_i(t) \ast s(t) + n_i(t), \qquad x_j(t) = h_j(t) \ast s(t - \tau_{ij}) + n_j(t)   (1)

where s(t) denotes the source signal, h_i(t) is the acoustic impulse response between the sound source and microphone i, n_i(t) is the corresponding additive noise (assumed to be zero mean and uncorrelated with the source signal), and \tau_{ij} is the relative delay of the signal. The PHAT-GCC function can then be expressed as

R_{ij}(\tau) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \frac{X_i(\omega) X_j^{*}(\omega)}{\left| X_i(\omega) X_j^{*}(\omega) \right|} \, e^{j\omega\tau} \, d\omega   (2)

where X_i(\omega) and X_j(\omega) are the Fourier transforms of x_i(t) and x_j(t). The relative time delay \tau_{ij} is estimated as the time lag of the global maximum peak of the PHAT-GCC function:

\hat{\tau}_{ij} = \arg\max_{\tau} R_{ij}(\tau)   (3)
For any pair of microphones, the TDOA determines a hyperboloid on which the sound source must lie. In theory, by grouping microphones into pairs and estimating the TDOA of each pair, the source location can be found at the point where all the associated hyperboloids intersect. Normally the sound source is assumed to be in the far field of the array, and each hyperboloid can then be simplified to a cone, which indicates the DOA (Direction of Arrival) of the source. However, this simplification can introduce significant estimation error when the aperture of the array is large or when the source is relatively near the array. Since we can obtain precise location information for all the participants in the room from the Cicada system, the microphone arrays do not need to work out the speaker location directly; instead we simply use the TDOAs to associate the sound source with one of the participants. In this way we avoid the estimation error introduced by the simplification and improve the performance of the system.
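The following Python sketch implements a frame-based PHAT-GCC TDOA estimate for one microphone pair, following Eqs. (2) and (3) in discrete form. The sampling rate, frame handling and the optional lag limit are illustrative assumptions rather than parameters of the actual system.

# Sketch of PHAT-GCC TDOA estimation for one microphone pair (assumed framing).
import numpy as np

def phat_gcc_tdoa(xi, xj, fs, max_lag=None):
    """Estimate the time delay of channel xi relative to channel xj, in seconds."""
    n = len(xi) + len(xj)                   # zero-pad to avoid circular wrap-around
    Xi = np.fft.rfft(xi, n)
    Xj = np.fft.rfft(xj, n)
    cross = Xi * np.conj(Xj)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting, Eq. (2)
    r = np.fft.irfft(cross, n)
    half = n // 2
    r = np.concatenate((r[-half:], r[:half + 1]))   # reorder so index `half` is zero lag
    if max_lag is not None:                 # search only physically plausible lags
        lo, hi = half - max_lag, half + max_lag + 1
        lag = lo + np.argmax(r[lo:hi]) - half
    else:
        lag = np.argmax(r) - half
    return lag / fs                         # Eq. (3): arg max of R_ij(tau)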
3.3.2 Integration of the Microphone Array and Location Tracking

Cicada and the microphone arrays are integrated by a fusion algorithm that associates the sound source with one of the participants in order to determine who the current speaker is. We assume that each participant in the room carries a CBadge provided by the Cicada system, so every participant has a distinctive ID and can be precisely tracked.
For each tracked participant i we obtain a 3D position X_i, which represents the location (x, y, z) of the CBadge in the Cicada coordinate system. Since the Cicada and microphone array coordinate systems are aligned, (x, y) is approximately the participant's mouth position in the microphone array plane. Let s_j denote the two-dimensional position of a participant in the microphone array plane, and let m_{i1} and m_{i2} be the positions of the microphones in pair i. If participant s_j speaks, the TDOA at this pair of microphones is

\tau_i(s_j) = \tau_i(m_{i1}, m_{i2}, s_j) = \frac{\left\| s_j - m_{i1} \right\| - \left\| s_j - m_{i2} \right\|}{c}   (4)

where c is the sound propagation speed. We set up the associated statistical model as

p(\hat{\tau}_i \mid s_j) = N(\hat{\tau}_i;\, \tau_i(s_j),\, f(\sigma_\tau))   (5)

where N(\cdot;\, \mu, \sigma) is a Gaussian density with mean \mu and standard deviation \sigma, and \hat{\tau}_i is the actual TDOA estimated by microphone pair i. \sigma_\tau is the standard deviation obtained from a training TDOA dataset of a speaker at a known location, and f(\sigma_\tau) is a linear function of \sigma_\tau. We combine the probability density values from all pairs of microphones for participant s_j, normalized by the number of pairs m, to form p(\hat{T} \mid s_j); the posterior estimate is then

P(s_j \mid \hat{T}) = \frac{p(\hat{T} \mid s_j)\, P(s_j)}{\sum_{j=1}^{n} p(\hat{T} \mid s_j)\, P(s_j)}   (6)
The participant with the highest posterior probability is selected as the speaker. The prior probabilities can also be set unequally according to the application scenario; in a classroom, for example, the teacher is the person who is normally speaking, so a higher prior probability can be assigned to the teacher. We assume there is no simultaneous speech in the room, so each speech segment belongs to one specific participant, and the system only needs to associate the sound source with a participant at the beginning of each speech segment. To detect speech activity we adopt a simple and efficient speech detector based on the signal zero-crossing rate. A speech detector runs on each microphone channel, and if more than half of all channels detect speech activity, the system starts the speaker association process.
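The sketch below shows the speaker-association step described by Eqs. (4) to (6), selecting the participant with the highest (unnormalized) posterior. The value of the standard deviation and the priors are illustrative assumptions, not those used in the Smart Classroom.

# Sketch of speaker association from TDOAs and Cicada positions (assumed values).
import numpy as np

C_SOUND = 343.0  # m/s

def predicted_tdoa(s, m1, m2):
    """Eq. (4): TDOA expected at a microphone pair for a source at position s."""
    return (np.linalg.norm(s - m1) - np.linalg.norm(s - m2)) / C_SOUND

def associate_speaker(positions, priors, mic_pairs, measured_tdoas, sigma=1e-4):
    """positions: {participant_id: 2D position}; mic_pairs: list of (m1, m2);
    measured_tdoas: one measured TDOA per microphone pair."""
    scores = {}
    for pid, s in positions.items():
        loglik = 0.0
        for (m1, m2), tau_hat in zip(mic_pairs, measured_tdoas):
            tau = predicted_tdoa(np.asarray(s), np.asarray(m1), np.asarray(m2))
            loglik += -0.5 * ((tau_hat - tau) / sigma) ** 2   # Gaussian model, Eq. (5)
        scores[pid] = np.log(priors[pid]) + loglik
    # Eq. (6): the normalization constant does not change the arg max,
    # so the unnormalized log-posterior suffices to identify the speaker.
    return max(scores, key=scores.get)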
3.4 Context-awareness in Smart Classroom

Context-awareness enables human-centric, intelligent behavior in a smart environment; however, it is not yet widely used, owing to the lack of mature design patterns for building context-aware applications. In the Smart Classroom, we
demonstrate a context-aware application called the Smart Cameraman, which switches the live-video scene delivered to the remote side based on situational contexts such as the teacher's gestures. First, we present a formal context model that combines first-order probabilistic logic (FOPL) with web ontology language (OWL) ontologies. It provides a common understanding of contextual information, facilitates modeling and reasoning about imperfect and ambiguous contextual information, and enables context knowledge to be shared and reused. Then, we introduce a probabilistic context reasoning method to deduce high-level knowledge from the low-level, uncertain data produced by sensors, actuators or software agents.
3.4.1 Context Modeling Combining FOPL and Ontology

A well-defined context model is key to accessing context in any context-aware system [9]. For instance, Henricksen et al. [10] investigated unified modeling language (UML) and entity-relationship (ER) modeling approaches to represent context structures and properties, and Gu et al. [11] presented an ontology-based context model to derive high-level contexts from low-level context data. Important context sources are the sensors embedded in a smart space environment, which deliver uncertain, imperfect data. However, most ontology-based context models fail to represent uncertainty, while logic-based context models fail to describe the semantic relationships between context entities [9]. Here, the fundamental ontology-based and logic-based context models are combined in a first-order probabilistic logic to represent the basic context structure and to construct a probabilistic inference mechanism, which unites the expressive power of first-order logic with the uncertainty handling of probability theory [13]. A shared understanding of specific domains is provided by ontology-based context modeling, which uses ontologies and semantic web services to describe the concepts and relationships of context entities in a smart environment [12]. First-order probabilistic logic (FOPL), which combines first-order logic with the probabilistic models used in the machine learning community, is used to represent the basic context structure. The terminology, including Field, Predicate, Context Atom and Context Literal, is defined as follows.
• Field ∈ F∗, where a Field is a set of individuals belonging to the same class, e.g. Person = {Qin, Shi, Suo}, Room = {Room526, Room527}.
• Predicate ∈ V∗, where a Predicate indicates a relationship among entities or a property of an entity, e.g. location, co-locate.
• ContextAtom ∈ A∗, where a ContextAtom is written as predicate(term, term, ...), in which each term is a constant, a variable or a function, and the predicate acts on the parenthesized, comma-separated list of terms. For example, location(Qin) denotes Qin's location, and co-locate(Qin, Suo) denotes that Qin and Suo are located in the same place.
• ContextLiteral ∈ L∗, where a ContextLiteral takes the form contextAtom = v, in which contextAtom is an instance of ContextAtom and
v indicates the status of the contextAtom or the value of its terms. For example, location(Qin) = Room527 states that Qin's location is Room527. Context knowledge is then represented in the form Pr(L1, L2, L3) = c, where L1, L2, L3 ∈ ContextLiteral, which gives the joint probability of several ContextLiterals. For example, Pr(location(Qin) = Room527, location(Suo) = Room527) = 0.76 states that the probability that Qin is located in Room527 while Suo is located in Room527 equals 0.76. The structures and properties of this basic model are described in an ontology language so as to define the conceptual contexts at a rich semantic level; the basic context structure is represented using the web ontology language (OWL). Following the approach of Ding et al. [14] for representing probabilities, two OWL classes are defined, PriorProb and CondProb. A prior probability of a context literal is defined as an instance of the class PriorProb, which has the two mandatory properties hasContextLiteral and hasProbValue. A conditional probability of a context literal is defined as an instance of the class CondProb, which has the three mandatory properties hasContextLiteral, hasProbValue and hasCondition. An example of an FOPL statement represented as an OWL expression is shown in Fig. 10. The purpose of the ontology-based context model is to formalize the structured contextual entities in smart spaces, using the ontology methodology to define the concepts and relationships of the context elements. The context ontology is divided into a core context ontology for general conceptual entities in the smart space and an extended context ontology for domain-specific environments, e.g. the classroom domain. The core context ontology defines very general concepts for context in the smart space that are universal and sharable for building context-aware applications, while the extended context ontology defines additional concepts and vocabularies to support various types of domain-specific applications. The core context ontology covers seven basic concepts, user, location, time, activity, service, environment and platform, which are considered the basic and general entities existing in a Smart Space, as shown in Fig. 11. Part of the core context ontology is adopted from several widely accepted consensus ontologies, e.g. DAML-Time and OWL-S. An instance of Smart Space thus consists of the classes User, Location, Time, Activity, Service, Environment and Platform.
Fig. 10 The FOPL statement Pr(L1 | L) = 0.76 represented as an OWL expression
• User: As the user plays a central role in smart space applications, this ontology defines vocabularies to represent profile information, contact information, user preferences and mood, which are sensitive to the user's current activity or task.
• Location, Time and Activity: The relationships among location, time and the user's activity help validate inconsistent contextual information, because these contexts may be sensed by different sensors with different accuracies. For example, the sensed context "Dr. Qin is having a meeting in Room526 at 10:00 a.m." obviously conflicts with another sensed context, "Dr. Qin is having a meeting in Room527 at 10:00 a.m." By checking the activity's location in Dr. Qin's schedule, we can confirm which one is correct.
• Platform and Service: The platform ontology defines descriptions and vocabularies for hardware devices or sensors and for the software infrastructure in a smart space. The service ontology defines multi-level specifications of the services that the platform provides, in order to support service discovery and composition.
Fig. 11 Context Ontology: the core ontology classes User, Location, Time, Activity, Service, Platform and Environment with their properties, and extended ontology classes such as Scheduled Activity, Class Activity, Atomic/Composed Service, Software, Hardware, Device and Sensor
In the Semantic Web community, OWL-S has been adopted as a standardized framework for describing services in general.
• Environment: The environment ontology defines the context specification of the physical environment conditions that the user interacts with, such as noise level, lighting condition, humidity and temperature.
The extended context ontology extends the core context ontology and defines the details and additional vocabularies that apply to various domains. For instance, in the classroom domain, User has subclasses such as Teacher, Local Student and Remote Student, which play different roles and hold different control rights in class activities. The advantage of the extended context ontology is that the separation of domains reduces the scale of the context knowledge and the burden of context processing for pervasive computing applications, and enables effective context inference with limited complexity [14].
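As an illustration only, the following Python sketch shows one possible in-memory representation of the FOPL context literals and the PriorProb/CondProb structures defined above. The class and field names are hypothetical and do not correspond to an actual data structure in the project.

# Hypothetical representation of FOPL-style context knowledge (sketch only).
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextLiteral:
    predicate: str          # e.g. "location"
    terms: tuple            # e.g. ("Qin",)
    value: str              # e.g. "Room527"

@dataclass
class PriorProb:
    literal: ContextLiteral # hasContextLiteral
    prob: float             # hasProbValue

@dataclass
class CondProb:
    literal: ContextLiteral # hasContextLiteral
    conditions: tuple       # hasCondition: other ContextLiterals
    prob: float             # hasProbValue

loc_qin = ContextLiteral("location", ("Qin",), "Room527")
loc_suo = ContextLiteral("location", ("Suo",), "Room527")
knowledge = [
    PriorProb(loc_suo, 0.9),                # illustrative prior
    CondProb(loc_qin, (loc_suo,), 0.76),    # Pr(location(Qin)=Room527 | location(Suo)=Room527)
]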
3.4.2 Probabilistic Context Reasoning

Context reasoning uses a rule-based inference mechanism based on knowledge-based model construction (KBMC) to deduce high-level, new context knowledge from the low-level facts that are detected. Within the FOPL framework, context rules are defined in the form Pr(Lh | Lb1, Lb2, ...) = c, subject to the constraints LC1, LC2, LC3, ..., which means that the probability of Lh is c under the constraints LC1, LC2, LC3 and the conditions Lb1, Lb2. Note that the LCi denote only context facts, while the other literals may be arbitrary ContextLiterals. For instance, in the classroom scenario, the rule Pr(TeacherStatus(Teacher) = talking | Speaking(Student) = false) = 0.7, subject to IsBlackBoardTouched(Room527) = false, states that when the blackboard of Room527 has not been touched, the probability that the teacher is talking equals 0.7 given that the student is silent. After the context rules have been represented, an additional property element rdfs:dependOn is defined to capture the dependency relationship between the datatype and object properties in OWL. The importance of the FOPL-based context rules is that probability and Bayesian network tools can then be used to reason with uncertain context. The rdfs:dependOn element is used to translate the resource description framework (RDF) graph into the directed acyclic graph (DAG) of a Bayesian network. Each node of the DAG represents a ContextLiteral, with directed arcs between nodes representing causal dependencies between ContextLiterals. The scale of the DAG is reduced by constraints, including valid syntax rules, an independence hypothesis for causal sets that extends the causal-independence definition [11], an average-distribution hypothesis for the residual probability, and a conditional-independence hypothesis. These prevent the generation of unnecessary nodes, minimizing the scale of the DAG and ensuring a unique answer distribution. The constraints and hypotheses therefore keep the inference complexity within an acceptable range.
3.4.3 Case Study in Smart Classroom

For the Smart Classroom project, a Smart Cameraman module was designed to change the live-video scene according to the class activity in the classroom by switching among an array of cameras. Using these context-aware services, remote students can focus their attention on the relevant scene on the client side. In this case, the context-awareness support provided by the middleware captures the contextual information relevant to the user's activity and supplies class-activity clues to the Smart Cameraman module. The middleware also delivers video to remote students at different qualities, customized to the varying capabilities of their computers and networks, such as display screen size or network bandwidth. The Smart Cameraman scenario uses four types of context rules concerning the class activity:
• Teacher Writing on the MediaBoard: When the teacher is writing comments on the MediaBoard, the Smart Cameraman module may select a close-up view of the board, as shown in Fig. 12(a).
• Teacher Showing a Model: When the teacher holds up a model, the Smart Cameraman module may zoom in on the model, as shown in Fig. 12(b).
• Remote Student Speaking: When a remote student is speaking, live video of that student may be delivered to the other remote students.
• Other: In all other situations, the Smart Cameraman module may select an overview of the classroom, shown in Fig. 12(c).
A predefined probability between 0 and 1 is attached to each context rule using the basic structure of the first-order probabilistic logic, partially shown in Table 3. When an event occurs (e.g. the teacher writing on the MediaBoard), the joint probability distribution over the cameras' status is reconstructed according to the context rules, so that the cameras track the live focus. A case generator, shown in Fig. 12(d), was developed to simulate a variety of situations and contextual information in order to test the functionality of the Smart Cameraman module.
Fig. 12 (a) Teacher writing on MediaBoard; (b) Teacher showing a model; (c) Teacher having a discussion with local students; (d) Case generator
Table 3 Examples of context rules defined in the smart cameraman scenario

Rule  FOPL Formula
CR1   Pr(IsStatus(Camera) = CLOSEUP_VIEW | Action(Teacher) = WRITING, IsStatus(MediaBoard) = TRUE) = 0.8, subject to OnTouched(MediaBoard) = TRUE
CR2   Pr(IsStatus(Camera) = CLOSEUP_VIEW | Action(Teacher) = SHOWING) = 0.15, subject to OnTouched(MediaBoard) = TRUE
CR3   Pr(IsStatus(Camera) = CLOSEUP_VIEW | Action(Teacher) = SPEAKING) = 0.05, subject to OnTouched(MediaBoard) = TRUE
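As a simplified illustration of how rules like CR1 to CR3 could be applied at run time, the Python sketch below performs a direct table lookup over the rules in Table 3 rather than the full Bayesian-network construction described in Sect. 3.4.2. The rule encoding, the threshold and the fallback view are illustrative assumptions.

# Sketch: selecting a camera view from Table 3-style rules (simplified lookup).
RULES = [  # (teacher_action, mediaboard_touched, camera_view, probability)
    ("WRITING",  True, "CLOSEUP_VIEW", 0.80),   # CR1
    ("SHOWING",  True, "CLOSEUP_VIEW", 0.15),   # CR2
    ("SPEAKING", True, "CLOSEUP_VIEW", 0.05),   # CR3
]

def select_camera_view(teacher_action, mediaboard_touched, threshold=0.5):
    """Return the view of the most probable matching rule, or an overview
    shot when no rule fires with enough confidence."""
    candidates = [(p, view) for action, touched, view, p in RULES
                  if action == teacher_action and touched == mediaboard_touched]
    if candidates:
        p, view = max(candidates)
        if p >= threshold:
            return view
    return "OVERVIEW"

# Teacher writing on a touched MediaBoard -> close-up of the board (CR1).
print(select_camera_view("WRITING", True))    # CLOSEUP_VIEW
print(select_camera_view("SPEAKING", True))   # OVERVIEW (CR3 fires with low probability)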
4 Large-Scale Real-Time Collaborative Distance Learning Support

4.1 Totally Ordered Reliable Multicast

Large-scale interactive applications place demanding requirements on the underlying transport protocols for efficient dissemination of real-time multimedia data over heterogeneous networks. Existing reliable multicast protocols fail to meet these requirements for the following reasons: 1) most protocols presume a fully multicast-enabled network infrastructure, which is usually not the case on the current Internet; 2) protocols that can support multiple concurrent data sources have only limited scalability; and 3) few of them implement an end-to-end TCP-friendly congestion control policy. In the TORM (Totally Ordered Reliable Multicast) protocol, multicast is used wherever applicable, and unicast tunnels are created dynamically to connect session members located in separate multicast "islands" (see Fig. 13, where the Reliable Multicast Server is abbreviated RMS and the Reliable Multicast Client RMC). High scalability is achieved by organizing the members of a session into a hierarchical structure, but in contrast to most existing tree-based protocols, any number of concurrent sources may exist in a session. To support interactive applications that involve multiple
Fig. 13 Architecture of TORM
users cooperatively operating on a shared state, TORM also incorporates two serialization algorithms, a distributed one and a centralized one, to ensure totally ordered message delivery to all members. Another notable feature of TORM is a novel congestion control scheme that responds to congestion in a timely fashion and shares bandwidth fairly with other competing network flows. Real-time measurements of the reception rates of all receivers, of packet losses and of variations in round-trip time are used as feedback for the sender's rate adjustments. Finally, network heterogeneity is addressed in TORM by partitioning receivers with different bandwidth capacities into homogeneous "domains", so that receivers behind bottleneck links are unlikely to slow down the entire session. Trans-coding (via AMTM) can also be applied at the borders of domains to provide receivers with data at different qualities that best match their capabilities. To verify the efficiency of TORM, we compared its loss recovery performance with that of SRM/Adaptive through software simulation (specifically, with NS2). The recovery latency and the number of duplicate requests of the two protocols were tested on star, 2-cluster and random topologies. The results show that TORM performs better than SRM as a reliable multicast protocol for large-scale, real-time, interactive applications.
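To illustrate the idea of totally ordered delivery, the Python sketch below shows a centralized serialization scheme in which a single sequencer stamps every message with a global sequence number and each receiver delivers messages strictly in that order, buffering early arrivals. This is an illustration of total ordering only, not the TORM protocol itself; the class names are invented.

# Sketch of centralized total ordering with a sequencer and ordered delivery.
import heapq
from itertools import count

class Sequencer:
    """Runs at a single node (e.g. the root RMS) and assigns the global order."""
    def __init__(self):
        self._next = count()
    def stamp(self, msg):
        return (next(self._next), msg)

class OrderedReceiver:
    """Delivers stamped messages in sequence order, whatever the arrival order."""
    def __init__(self, deliver):
        self.deliver = deliver        # application callback
        self.expected = 0
        self.pending = []             # min-heap of early arrivals
    def on_receive(self, stamped):
        heapq.heappush(self.pending, stamped)
        while self.pending and self.pending[0][0] == self.expected:
            _, msg = heapq.heappop(self.pending)
            self.deliver(msg)
            self.expected += 1

seq = Sequencer()
rx = OrderedReceiver(deliver=print)
m0, m1, m2 = seq.stamp("slide 1"), seq.stamp("annotate"), seq.stamp("scroll")
for stamped in (m2, m0, m1):          # messages arrive out of order...
    rx.on_receive(stamped)            # ...but are delivered as: slide 1, annotate, scroll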
4.2 Adaptive Multimedia Transport Model

Adaptive Content Delivery (ACD) [15] is the most efficient of the possible adaptive multimedia delivery schemes [16], because the application-layer semantics of the delivered data are coupled with the underlying transport mechanisms. In the Smart Remote Classroom project, the live instructional content and the recorded courseware are stored in a compound multimedia document format. We developed AMTM to provide differentiated services for the delivery of data in this format, dynamically trans-coding the multimedia data according to the varying capabilities of each user, as mentioned above, without introducing data redundancy. The Multimedia Compound Document Semantic Model in AMTM describes the data organization, the PQoS (Perceived QoS) and the transformation of compound documents with the notions of Media Object, Content Info Value and Status Space of Transforming, respectively. A compound multimedia document is parsed into a structural description of embedded Media Objects, as shown in Fig. 14(a). Fig. 14(b) illustrates a typical transforming status tree: the original media is labeled Status0 and is converted into the new node Status11 through operation1 with parameter1, while the sub-node Status13 is transformed into further nodes, namely Status21, Status22, Status23 and so on. AMTM abstracts the process of adaptive delivery as finding the optimal resource allocation scheme and the associated transformation plan for the embedded Media Objects by searching the Status Space of Transforming. This policy is implemented in the RMS of TORM.
4.3 SameView: Real-Time Interactive Distance Learning Application in Smart Classroom

SameView is an application developed for real-time interactive distance learning based on the proposed TORM and AMTM platform. Fig. 15 shows a snapshot of a SameView client.
4.3.1 Interaction Channels

SameView provides a set of interaction channels for the teacher and students to efficiently achieve the goals of teaching and learning.
• Mediaboard, a shared whiteboard capable of displaying multimedia content in HTML format. The teacher can show slides for the class on the board; moreover, he can add annotations or scribble on the slides on the fly. All the actions the teacher performs on the board, such as jumping between slides, scrolling the slides and writing on them, are reflected on each student's client program. When permitted by the teacher, a student can also take control of the Mediaboard, for example to write down his solution to a problem posed by the teacher.
Fig. 14 Basic Notions in AMTM: (a) the Media Object tree of a compound multimedia document; (b) an example transforming status tree of a Media Object
• Audio/Video. The students can hear and see the audio and video from the teacher's side. In addition, a student can broadcast his own audio and video to the others when permitted by the teacher, for example when the teacher asks him to comment on a topic.
• Chat. Teachers and students can also communicate with text messages.
4.3.2 Session Management

The users participating in a class through SameView play different roles in the class. The possible roles are Chair, Presenter, Audience and Anonymous; Table 4 shows their respective privileges.
Table 4 Privileges of Possible Roles in Class
Role       Change other user's role  Action on Mediaboard  Broadcast Audio/Video to others  User Number with this Role  Listed on Participant List
Chair      Allowed                   Allowed               Allowed                          1                           Yes
Presenter  NA                        Allowed               Allowed                          >= 0                        Yes
Audience   NA                        NA                    NA                               >= 0                        Yes
Anonymous  NA                        NA                    NA                               >= 0                        No
The teacher usually plays the role of Chair, while students are initially assigned the role of Audience. As the class proceeds, the teacher can dynamically change a student's role as necessary, for example to invite the student to comment on a topic. If more than one participant is broadcasting audio and video at the same time, only the stream of the participant who speaks loudest is delivered.
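The Python sketch below simply mirrors Table 4 as a lookup that a client could use to check a role's privileges; the dictionary keys and the function name are illustrative, not the SameView API.

# Sketch: role/privilege check mirroring Table 4 (illustrative names).
PRIVILEGES = {
    "Chair":     {"change_role": True,  "mediaboard": True,  "broadcast_av": True,  "listed": True},
    "Presenter": {"change_role": False, "mediaboard": True,  "broadcast_av": True,  "listed": True},
    "Audience":  {"change_role": False, "mediaboard": False, "broadcast_av": False, "listed": True},
    "Anonymous": {"change_role": False, "mediaboard": False, "broadcast_av": False, "listed": False},
}

def is_allowed(role: str, action: str) -> bool:
    return PRIVILEGES.get(role, {}).get(action, False)

# A student promoted from Audience to Presenter may now broadcast audio/video.
assert not is_allowed("Audience", "broadcast_av")
assert is_allowed("Presenter", "broadcast_av")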
Fig. 15 A Snapshot of the SameView Client
4.3.3 Class Recording

SameView can capture the exact process of a class by recording everything that happens on all interaction channels, such as the slides the teacher shows, the annotations made on the slides and the live video and audio, into a structured compound document. The recorded information in the document is kept synchronized on a common time-line, and a SameView Player program is provided to play the document back, so students can review the class whenever they want. Such a recorded document is also a good form of courseware for e-learning. In addition, we provide a post-editing toolkit with which the teacher can edit the recorded document as necessary, for example to correct mistakes made in class or to add tags and indices that help students use the document more efficiently. Fig. 16 shows the interface of the post-editing tool.
5 Smart Platform: A Multiagent-Based Infrastructure

As an important notion in the vision of pervasive computing, a smart environment is an enhanced physical space in which people can obtain the services of computer systems without walking up to a computer or using a cumbersome keyboard or mouse to interact with it. Typically, a smart environment is a distributed system involving many sensors, perception devices, software modules and computers. To develop and support such a complicated system, some kind of software platform is a must. Smart Platform was designed and developed as this software platform for smart environments; it is used as the infrastructure of the Smart Classroom to integrate all the software components. In this section, we describe the features
Fig. 16 Post-editing Tool for the SameView Recorded Class
and the architecture of Smart Platform and explain how to develop an agent on this platform. We assume only a general familiarity with software engineering practices and programming language concepts. Some familiarity with the basics of XML is also helpful, as XML serves as the basic communication language throughout the platform. Smart Platform is modeled as a multi-agent system whose basic unit is the agent. Each agent is an autonomous process that either contributes services to the whole system or uses the services of other agents to achieve a specific goal. As a software platform, Smart Platform implements a runtime environment for the agents and an agent development kit for developers, which includes a programming interface and some debugging tools.
5.1 Architecture

Smart Platform runs on the network-connected computers in a Smart Space. It masks the boundaries between the computers involved and provides a uniform running environment and a highly structured communication model for the software modules that run on the platform. Fig. 17 shows the architecture of the whole Smart Platform. The runtime environment is composed of three kinds of components: Agent, Container and Directory Service (DS).
• An Agent is the basic encapsulation of a software module of the system.
• Each computer participating in the runtime environment hosts a dedicated process called a Container, which provides system-level services for, and maintains, the agents running on that computer. In a sense, the Container is the mediator between the agents and the DS: it makes the low-level communication details transparent to agent developers and provides a simple communication interface for agents.
• There is one global dedicated process called the DS in the environment. The DS mediates the "Delegated" communication between agents and provides services such as agent query and dependency resolution.
Fig. 17 Architecture and Components of Smart Platform
5.2 Features and Design Objectives

5.2.1 Dynamic Find and Join

Traditionally, to access a service on the network we must know the access address of the service in advance, i.e. the IP address of the server and the port of the service on it. Since a typical Smart Space system contains dozens of agents, manually interconnecting every agent would be tedious. Instead, with its "find and join" mechanism, Smart Platform uses IP multicasting to dynamically discover and assemble the computation environment, eliminating the need for manual configuration.
5.2.2 Combining Delegated and Peer-to-Peer Communication

Smart Platform provides two communication modes, "Delegated" and "Peer to Peer". In Delegated mode, all messages between agents are mediated by the special runtime module DS: agents post and accept messages without caring who the other party is, and the DS forwards each message to the proper agent according to its contents. The agents are thus significantly decoupled from each other. However, in this mode the DS carries a heavy load, so Delegated mode is not suitable when agents need to exchange a high volume of data. In Peer-to-Peer mode, on the other hand, a connection is established directly between two agents; this mode is for cases where the agents need high-bandwidth, real-time communication. Here the DS still plays an important role in helping the two agents set up and maintain the communication channel. Agents are free to select the appropriate communication mode according to their specific requirements.
5.2.3 Automatic Management and Resolution of Agent Dependencies

In a multi-agent system, agents must collaborate with each other, and one agent may use the services provided by another; we call this relationship agent dependency. The topological structure of the dependencies between agents may be a tree or even a net. Smart Platform can manage and resolve these dependencies. When an agent joins the computation environment, it must announce the services it provides and the services it depends on, and Smart Platform stores this information in persistent storage. If an agent starts up and asks for a service whose providing agent happens not to be running, Smart Platform uses the stored knowledge to locate that agent and launch it automatically. This feature is called "Agent Dependency Resolution".
With this mechanism, it is no longer necessary to start all the agents of a Smart Space system manually: starting a few core agents is enough to bring the whole system into a determinate state.
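The Python sketch below illustrates the dependency-resolution idea: a directory records which services each registered agent provides and requires, and recursively launches providers that are not yet running. The names, data structures and agent/service identifiers are invented for illustration and are not the actual Smart Platform API.

# Sketch of "Agent Dependency Resolution" (hypothetical names and structures).
class DirectoryService:
    def __init__(self):
        self.registry = {}          # agent name -> (provides, requires)
        self.running = set()

    def register(self, name, provides, requires):
        self.registry[name] = (set(provides), set(requires))

    def providers_of(self, service):
        return [a for a, (prov, _) in self.registry.items() if service in prov]

    def launch(self, name):
        """Start an agent, first resolving and launching its dependencies."""
        if name in self.running:
            return
        self.running.add(name)                  # mark early so cyclic dependencies terminate
        _, requires = self.registry[name]
        for service in requires:
            for provider in self.providers_of(service):
                self.launch(provider)           # recursive resolution
        print(f"starting agent {name}")         # placeholder for a real process launch

ds = DirectoryService()
ds.register("SameViewAgent", provides={"mediaboard"}, requires={"voice-command", "location"})
ds.register("MicArrayAgent", provides={"voice-command"}, requires=set())
ds.register("CicadaAgent", provides={"location"}, requires=set())
ds.launch("SameViewAgent")   # starts MicArrayAgent and CicadaAgent first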
5.2.4 XML-based ICL (Interagent Communication Language)

The ICL defines the syntactic structure, and to some extent the semantics, of the messages exchanged between agents. We chose XML as the basis of the ICL syntax in Smart Platform. The inherent advantages of XML benefit Smart Platform in several ways: (1) its extensibility and its ability to flexibly describe almost any kind of data make the ICL easy to extend; (2) as a standard Internet technology, it eases the interoperation of Smart Platform with other, heterogeneous systems; and (3) many software libraries for processing XML exist in both industry and academia, which eases the development of Smart Platform.
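To give a feel for an XML-based message language, the Python sketch below builds a small, entirely hypothetical ICL-style message with the standard library. The element and attribute names are invented for illustration and are not the actual Smart Platform ICL schema.

# Building a hypothetical ICL-style XML message (illustrative schema only).
import xml.etree.ElementTree as ET

def build_icl_message(sender, service, content):
    msg = ET.Element("icl-message", attrib={"sender": sender, "mode": "delegated"})
    ET.SubElement(msg, "service").text = service
    ET.SubElement(msg, "content").text = content
    return ET.tostring(msg, encoding="unicode")

print(build_icl_message("MicArrayAgent", "voice-command", "start class"))
# <icl-message sender="MicArrayAgent" mode="delegated"><service>voice-command</service>...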
6 Conclusion

In the Smart Classroom project, we developed a set of key technologies for real-time interactive distance learning and established a new paradigm of real-time interactive distance learning built on pervasive computing technologies: 1) integrating various kinds of sensors (e.g. location tracking, microphone arrays), multimodal interaction technologies (e.g. speech recognition, direct manipulation, speaker tracking) and context-aware applications in order to bring ambient intelligence into a real, traditional classroom environment; 2) unifying face-to-face education and tele-education in the Smart Classroom, which on the one hand gives teachers a consistent teaching experience and on the other hand reduces the teaching effort required, since the teacher no longer needs to give the same class separately for on-campus and remote students; 3) allowing a large number of users with different network and device conditions to access the virtual classroom simultaneously; and 4) enabling the class to be recorded and turned into a piece of courseware for e-learning.
References

[1] Satyanarayanan, M.: Pervasive computing: vision and challenges. IEEE Personal Communications (8), pp. 10–17 (2001)
[2] Abowd, G.D.: Classroom 2000: An experiment with the instrumentation of a living educational environment. IBM Systems Journal (38), pp. 508–531 (2000)
[3] Shi, Y.C., Xie, W.K., Xu, G.Y., Shi, R.T., Chen, E.Y., Mao, Y.H., Liu, F.: The smart classroom: merging technologies for seamless tele-education. IEEE Pervasive Computing (2), pp. 47–55 (2003)
[4] Miu, K.L.: Design and implementation of an indoor mobile navigation system (2002)
[5] Welch, G., Bishop, G.: An introduction to the Kalman filter. In: Tutorial of SIGGRAPH, pp. 1–81 (2001)
[6] Grewal, M.S., Andrews, A.P.: Kalman filtering: theory and practice. Prentice-Hall Inc., New Jersey, USA (1993)
[7] Knapp, C.H., Carter, G.C.: The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech and Signal Processing (4), pp. 320–327 (1976)
[8] Brandstein, M.S., Silverman, H.F.: A practical methodology for speech source localization with microphone arrays. Computer Speech and Language (11), pp. 91–126 (1997)
[9] Strang, T., Linnhoff-Popien, C.: A context modeling survey. In: Proceedings of the UbiComp Workshop on Advanced Context Modeling, Reasoning and Management (2004)
[10] Henricksen, K.: A framework for context-aware pervasive computing applications (2003)
[11] Gu, T., Pung, H.K., Zhang, D.Q.: A service-oriented middleware for building context-aware services. Journal of Network and Computer Applications 18, 1–18 (2005)
[12] Guarino, N.: Formal ontology and information systems. In: Proceedings of FOIS, pp. 3–15 (1998)
[13] Poole, D.: First-order probabilistic inference. In: Proceedings of IJCAI, pp. 985–991 (2003)
[14] Ding, Z.L., Peng, Y., Pan, R.: Soft computing in ontologies and semantic web (2005)
[15] Zhang, H.J.: Adaptive content delivery: a new research area in media computing. In: Proceedings of the 2000 International Workshop on Multimedia Data Storage, Retrieval, Integration and Applications, pp. 13–15 (2000)
[16] Clark, D., Tennenhouse, D.: Architectural considerations for a new generation of protocols. In: Proceedings of SIGCOMM (1999)
Ambient Intelligence in the City: Overview and New Perspectives

Marc Böhlen and Hans Frei
1 Overview and Motivation and Limitations

Ambient intelligence has the potential for improving urban life specifically and the commonwealth in general. As artists and architects working in and with ambient intelligence, our hopes and anxieties towards ambient intelligence are not primarily in the technical domain. Our interest lies in re-making urbanity with pervasive technologies as a means to invigorate urban life. For this we need to take a break, after almost two decades of ambient intelligence related research, and recalibrate all instruments. We use the term ambient intelligence, coined by Philips, to describe a future in which people are surrounded with networked sensors embedded in everyday objects that respond to their needs in a loose manner. We are not particularly concerned with the demarcation lines between ambient intelligence, ubiquitous computing, calm computing, pervasive computing and amateur computing. From our perspective, efforts emanating from all of these fields modulate urban life, and their effects overlap and reinforce one another. This text will weave its way through many topics, including contributions from political philosophy that are usually not included when debating the merits and weaknesses of ambient intelligence systems in engineering publications. Some issues will remain untouched, others treated more lightly than they deserve to be. Our goal is to take stock of critical voices and expand the discussions around ambient sensing and control in the city into a conceptual kit for thinking about building livable cities for the 21st century.
Marc Böhlen, University at Buffalo, Dept. of Media Study, 231 Center for the Arts, e-mail: [email protected]
Hans Frei, Zürich, e-mail: [email protected]
2 Ambient intelligence and Urbanism – a Reality Check

Ambient Intelligence (AmI) has been an ambitious project for the past years, one that includes the home, office, library, museum, department store, hospital, airport, as well as the infrastructures that support them [1]. AmI's scope covers all aspects of living, including work, leisure, healthcare and transportation. Many of the locations AmI has set its sights on are core components of the city as we know it and live it.

AmI in the city is in some ways the Situationists' dream come true [78], [27]. Constant's projects for New Babylon (1957-72), like all the utopian projects by Cedric Price, Archigram and Superstudio that it influenced, seem much less unrealistic today because it is now possible to design and build milieus that interact with their inhabitants directly and intimately. Doors to new urban activities are now open that allow people to share space like they share data on a network [62].

AmI in the city is a step further in the direction of ephemeralization: doing more with less, as Buckminster Fuller described technical progress. As steel-and-glass has replaced brick-and-mortar, data flows change material objects, partially fulfilling prophesies of observers of the history of modernity such as Berman [10], who predicted that "all that is solid melts into air". Urbanity today is as much a function of transmitted data as it is of built houses, streets and monuments [65]. Social contacts can be easily made and selectively maintained with ubiquitous technologies [11], [65], [64], [32]. Even one's presence is altered via ambient micro blogging utilities that aggregate continuous streams of short sentences into new forms of presence that some compare to Extra Sensory Perception [85], [86], [87].

AmI in the city is also the end of the virtual city. SimCity Societies [96], the sequel to the blockbuster city building simulation game SimCity [95], promised to simulate the social forces and cultures at work in a city. But several players and reviewers were lukewarm about the result. "There is something not quite right with SimCity Societies", wrote VanOrd on the online game review site Gamespot [89]. Players design city societies by choosing and combining buildings of various types and opposing forces. Need more stringent rules? Add a town hall. Need more authority? Add a police station. There is a disconnect between the elaborate visualization of urbanity and the 'social forces' that are to be created from within the built structures. Simulations have limits, and materiality is bound to return with a vengeance. The increasing popularity of locative utilities such as GPS for geotagged information and the rise of location based activities and services [72] mark parallel efforts to reintegrate the real world into databased abstractions. AmI is part of the next wave of systems defining a new shared reality.

AmI in the city also increases sustainability: the world's first 100 per cent carbon free community will be built in Masdar near Abu Dhabi by Foster and Partners [37]. Behind the mix of traditional (oriental) and modern (western) architecture, electronic networks control waste, water, traffic, and energy. Masdar (UAE, 2007) by Foster + Partners, Ras Al Khaimah (UAE, 2004-) and Waterfront City (UAE, 2008), both by Rem Koolhaas with OMA, Dongtan (China, 2005-) by Arup, Logroño (Spain, 2008-) by MVRDV and others are perfect 'ecological solutions' that push all other urban issues aside. Sustainability cum AmI has formed a solid and successful
coalition with the global economy. With advanced adaptive climate control, even the most uninhabitable places can put themselves on the global map. Who can question the merits of skiing in the desert if the artificially generated climate is CO2-neutral?
AmI in the city is a new form of production of space: Cities used to be functions of power, industrial production and commerce [58]. Now they are functions of consumerism, security and culture. AmI accelerates this trend. Consumption, security and culture are now the key issues organizing urban life. Each of them is related to the space of flows [15], altering the constraints of exchange across physical distance and the institutions that organize it. What the cathedrals were for the middle ages, the airport is today: the most extensive and complex built projects of our times [80]. In the case of the consumer world, a seamless transition from the global to the local stage is required. The fact that the infrastructure that makes the accompanying securitization efforts possible has slipped into the background [94] only proves how successfully and effectively pervasive sensing systems are operating.
Without a doubt AmI intensifies urban life. Some see the built city, qua city, as old fashioned [32] while urbanists make use of AmI technologies without considering the consequences for its inhabitants. As successful a marriage as AmI and urbanism should logically be, a second look reveals serious discord. AmI in the city has the potential of harming the very thing it should promote. People become data objects [50]. The interactive spaces provided by ambient systems are reduced to spectacle [24], and the ephemeralization itself increases junkspace [51]. Sustainability breeds totalitarian immersion [23], [79]. What was elegantly delegated into the background returns to the forefront with a vengeance.
AmI in the city is like Sex and the City. In both cases the public becomes a feeding ground for private needs. While the protagonists in the TV show have to actively seek new opportunities, the AmI enabled city directly supports its customers, reducing the public to public access [76].
AmI in the city is not an isolated phenomenon. The dialectics between private interests and public domain are to be found in many aspects of our lives. Economic privatization and the wish for and perceived need of one’s own home create in their wake a retreat from public life. AmI home automation systems partake in the “rhetorics of privatization” by promising to deliver technologies that are so effective one need not leave the house anymore. However, being networked with the world is not equivalent to participating in it. The advantage of having your shopping done automatically and never having to stand in line at the store will result in not meeting anyone standing in line at the store, or anywhere else for that matter.
Many academics, when describing the impact of AmI, refer to sociologists like Henri Lefebvre and Michel de Certeau [32]. Guided by these authors, they describe the new urban reality predominantly as a multiplication and intensification of social dynamics. But in the view of political philosophy the public has a different meaning. Here, the public is characterized by antagonism instead of by shared interests. Martin Heidegger, for example, interpreted the res of the res publica as a thing that has the power to bring together what it separates [45].
In this tradition Bruno Latour wrote: “We don’t assemble because we agree, look alike, feel good, are socially compatible or wish to fuse together but because we are brought by divisive
matters of concern into some neutral, isolated place in order to come to some sort of provisional makeshift (dis)agreement” [56].
3 Ambient Intelligence and Philosophies of Technology
Social scientists, political philosophers and historians of technology have a long history of reflecting on the many ways in which technologies impinge on daily life. Mumford [68], Giedion [43], Ellul [28], Feenberg [33] and others have formulated selective accounts of how automation technologies change the fabric of human life on individual, group and public levels. While some of the details of these texts are no longer relevant, many of the general observations remain valid. Ellul in particular retains a following (in addition to some critics) precisely because his concern is focused more on the ‘big picture’. For example, Ellul used the term convergence of technologies to describe the multiplicative effect of adding several technical components into a single, larger system. Indeed, the ease of living through interwoven technical systems that is at the core of AmI is a form of Ellulian convergence. Smart kitchens in sync with energy management systems function quietly while people relax in the living room watching movies on demand. Convergence however can cut two ways. It produces an effect more powerful than the sum of its individual parts, but also increases the opacity of the system. Convergence creates a kind of background discomfort, even when everything seems to be working just fine.
Ellul also reminds us that no technology can really liberate us (from other technologies). AmI rests on the unarticulated assumption that ‘smart’ technology is inherently better than ‘dumb’ (old) technology, and that our lives will be better with new smart systems. Ellul points out that the only basis for this belief is the lack of tested alternatives.
Ellul’s observations on the dynamics between the public and private domains under the influence of technologies retain validity in the context of AmI in the city. When technology enters the public domain, it receives new mandates for which it is not properly prepared. Technologies change in the hands of corporations and states. They change because the state is charged not only with maintaining law and order, but also with establishing just relations amongst its citizens. It therefore imposes limitations on the pure technique (in the Ellulian sense) of private persons, such as forbidding the free manufacture of explosives and drugs while maintaining the right to do so itself. Ellul reminds us to differentiate between the making of technique and the politics of technique, between technique as a domain-specific solution and its embedding in the world. The problems of the second act are not part of the design process of the first one. How technologies are applied defines how we come to live with them [3], [4]. Furthermore, failing to consider the dynamics of this second act tends to upset the balance of private and public responsibilities, and AmI in the city is subject to these same dynamics.
In Medienkultur, Flusser asks how politics change under the impact of technology, such as artificial intelligence [36]. Flusser understands that the most important changes made to public space through technology are not those changes directly
intended, but the (long term) side effects that alter the way one thinks about public and urban space. While Flusser claims that the information revolution diminishes public space, he understands that this is really a function of how the revolution occurs, how the networks are built and who controls them. The ability to understand and modify the way the system works depends, according to Flusser, on technological literacy: “Die grundlegende Frage ist ... die des Schaltplans der Kanäle” (“The fundamental question is ... that of the wiring diagram of the channels”). Flusser calls for something like Agre’s critical technical practice [2], [4], but a public one. What is needed, according to Flusser, is a dialog-based circuit for information transmission.
Andrew Feenberg [33] pushes the call for dialog in technical design even further. Feenberg rejects Ellul’s essentialist and reductivist theories of technology and sees technology as a contested field where individuals and social groups can struggle to influence and change technological design, uses, and meanings. Feenberg advocates for democratic debate in situating, limiting and controlling technical systems, such as (some) environmental advocacy groups already demand. This grass roots based opportunity for the reconstruction of technologies is at the core of Feenberg’s line of argumentation. The question of interest here is how AmI might be (re)constructed – who draws up the plans and how this might occur in the context of the city. Without a clearer notion of this we can hardly hope to design and build socially robust ambient systems. In particular the traditional delimitations of private versus public interests seem at risk. Ubiquitous technologies open public domains more and more to private corporations, while private living rooms and desktops become public market places. Ubiquitous technologies at their worst favor individual interests over those of the commonwealth in ways older systems never could.
Hannah Arendt compared the public sphere to a table, something that connects us as we sit around it. In The Human Condition, Arendt formulated two theses on the public that receive new significance under AmI [7]. The first one states that technical progress increases the sphere of influence of individuals while that of the public domain recedes. The second one states that the private sphere and the public sphere are interdependent, and that the two domains require a balancing act in order to remain in equilibrium.
Bruno Latour [55] suggests we build a parliament of things, open to all and everything, including human beings, animals, plants, insects and inanimate objects, to balance our new situation. Indeed, in such a parliament of things, AmI could play a very important role because it could help to articulate the intelligence, the rights and the agency of the ambient. Decried by some as naive, by others as visionary, Latour’s parliament of things retains radical appeal. And radical are the changes we need, we believe, for today we are nowhere close to balancing out the dynamics emanating from AmI in the city.
4 Out of Balance
With a history of suspicion towards large scale technical systems in public spaces, AmI and the humanities seem predestined to clash. Sure enough, even a cursory
search through some ambitious ambient and ubiquitous intelligence projects launched in the last few years stokes this fire. The final report of the European Union’s Information Society Directorate in 2001 [49] is verbose on usage scenarios but mostly silent on the qualitative content of ambient systems of the future. While this important document does acknowledge the need for building AmI upon a “humanist foundation”, it erects the foundation at the wrong spot: the user interface. But the roots of the long term perspective this text seeks to address run much deeper.
4.1 Hydra’s New Heads
The University of Notre Dame’s Hydra project [16] is an example of the kind of ‘smart’ system designed for public spaces that would make any critical observer shudder. Hydra, funded in part by the US Department of Defense, is a military and urban surveillance system that attempts to combine a number of uncoordinated video surveillance camera views of scenes from different locations in order to deduce a “global threat”. From the project website we read the following:
Uncertain scene recognition from any single view is checked and validated with potential captures from another angle in order to improve the recognition accuracy. Our infrastructure widely deploys a number of high fidelity video sensors and stores all the captured streams for certain well defined durations. These sensors are deployed as necessary in an ad hoc fashion. Smarter sensors and computer hubs can then analyze these stored streams to detect and review emerging threats, or potentially capture events that are easily missed by simpler and real time algorithms.
Like the mythical Hydra, this system is designed to remain robust against the loss of one of its parts. E-hydra can operate even when the video sensing or the processing components are compromised. The system designers were creative in their expansion of surveillance methods, coining the terms retrospective surveillance and telepostsence. Retrospective surveillance attempts to review captured scenes from several sites in order to validate whether a hint of threat detected at one site is part of a larger pattern. Telepostsence is defined as the selective combining and rejecting of small video segments from a larger stream to create a unique multimedia experience (a summary video). Technically impressive as these preemptive surveillance schemes appear, they assume that the early elements of threat can be clearly defined. What would constitute such a threat primitive in a busy city, and how would one be able to filter false positive results from the final elegantly composed summary video stream, particularly when the original footage is discarded? This E-hydra monster may grow additional heads even without having its current ones chopped off.
4.2 Information Cities
‘Information cities’ has been used by software architecture researchers [34] to describe cities not limited by physical boundaries. Information cities facilitate the access and exchange of information related to services usually provided by cities through web services. The concept imagines the city on servers becoming a full scale urban institution replacing, eventually, all public services and expanding into commerce, emergency services, health care as well as cultural and social activities. This changes buildings and by extension the city because they now function not only as material but also as data carriers, as Mitchell mentions in City of Bits [65]. Ebay, Yahoo and similar portals are prototype information cities that have no shells but house and entertain millions of inhabitants 24 hours a day all over the world.
But not all of what is built is melting into air. Public life is imploding into the private living room. From the comfort of your home you can perform your duties as a citizen, satisfy your needs as a consumer and pursue your dreams as a tourist. Because the information city is where your living room is, your living room can be anywhere. People who stay at home during their vacation and enjoy a staycation might opt to travel with utilities such as Easier-Travel.com and live off other people’s vacation memories [13]. Energy price increases have discouraged some, but not all, mobile McMansion RV owners from taking their Italian marble-floored, high-tech controlled, window-size flat-screen high-definition television enabled and satellite connected systems on the road [77].
An architectonic consequence of information displays applied to buildings is that facades have become fashion statements. Buildings, however ambitious the initial project, tend to fall out of favor after 25 years, as is now the case with Frankfurt’s Museum für Moderne Kunst [93]. In this vein one can imagine future cities to be designed with durable infrastructures for transportation, water, electricity, waste management, communication and surveillance while the visible architecture skins change as frequently as the society of spectacle desires [24]. Architecture then becomes media-production, encouraged and defined by the fad of the day. Facades become billboards rather than interfaces. If information is piped in only one direction from the emitter to the consumer, then cities “could support an as yet unimaginable form of totalitarianism” [35].
The impact of intelligent information systems on architecture can also be observed in the development of Computer Aided Design (CAD). While CAD was first used for designing the appearance of a building, it is now used for organization and environmental control in buildings [46]. Evolutionary design approaches, gleaned from adaptive systems control and data analysis, have been integrated into architectural design [39]. Fuller and Haque’s proposal [40] to consider the design dynamics in open source software as a model for architectural design and urban planning belongs to the same genealogy. Inspired by the FLOSS (Free, Libre or Open Source Software) movement, the authors suggest the equivalent of an open source license for the construction of cities. This includes altering the role of the architect from an ‘idealizer’ of form to a ‘facilitator’ of design processes. Architecture should begin immediately with building and construction. In urban planning the notion of
plan completeness does not exist. The same quality characterizes Cedric Price’s famous, never built but highly influential project Fun Palace [73]. For Fuller and Haque incompleteness is operational. They understand building and city regulations as a kind of generating algorithm for the design of neighborhoods, assuming that one can compute the opposing tensions of all given constraints. Systems built under the proposed Urban Versioning System license (UVS 1.0) would remain unfinished and evoke modularity and granularity of participation, allowing people of various levels of expertise and commitment to participate in the design process.
However, the rules for participation in large programming tasks are not really as gentle and accommodating as Fuller implies. Jonathan Corbet recently wrote a comprehensive guide to contributing to the Linux kernel [18], a showcase of collaborative open software development. He describes in detail the review process to which all changes and additions (patches) must adhere. The cyclical process is hierarchical and ordered along a predefined time line of review steps performed by many different people. There is exactly one person who can merge patches into the mainline kernel repository if previous groups fail to agree: Linus Torvalds. Furthermore, code standards and patch formatting are well defined and strongly enforced. It is the quality of the code that ensures its inclusion into future releases (mainline tree). This quality control is enforced by an equally elaborate system of reviews, the duration of which is highly variable. Contributors who fail to respond to the community’s inquiries and comments can see their code removed from the mainline tree. Even after inclusion in the tree, developers are encouraged, required even, to be responsive to questions, comments and suggestions for further improvement.
Still, the kernel development community can teach a lesson to everyone interested in harnessing the added value of group dynamics. If work occurs only behind closed doors, many problems remain hidden longer than they should be, with adverse effects for the overall outcome of the endeavor. But openness and participation are not the only essentials for public life. Dissent, argumentation and conflict need to be supported. Models based on maximizing efficiency filter disruptive elements out of the system. Models that facilitate ordering concert tickets, buying groceries and paying taxes can hardly become models for an improved public life that thrives on antagonism and debate.
5 Ambient Agoras
As part of the Disappearing Computer EU research initiative [25], Streitz and collaborators [82] have formulated novel ways of addressing the problem of computer-based support for collaboration between local and remote distributed teams. The authors stress the fact that the Ambient Agoras project goal goes “beyond traditional support for productivity oriented tasks in the office” and focuses also on designing experiences. According to the authors, the specific development was guided by three objectives. First, awareness devices should help members of a distributed team communicate in a natural way. Second, the interfaces should be adapted to the changing requirements of emerging office concepts as well as to the increased mobility of employees.
Third, a privacy concept should give people control of determining if they are being monitored and allow them to select different roles in the sensor-augmented smart environment. The third point is of interest for AmI in the city. The project authors went to great lengths to personalize activity monitoring technologies, including the integration of a ‘privacy manifesto’ compiled by the researchers Lahlou and Jegou as part of the same EU initiative [60]. The text lists nine key concepts that ambient system designs are urged to consider: think before doing, re-visit classic solutions, openness, privacy razor, third-party guarantee, make risky operations expensive, avoid surprise, consider time, good privacy is not enough. The more important ones are quoted from the guidelines document here.
Openness: “Systems should give human users access to what they do, do it, and do nothing else. Help human users construct a valid and simple mental model of what the system does. Goals, ownership and state of system should be explicit, true, and easily accessible to human users, in a simple format.”
Privacy razor: “Human user characteristics seen by the system should contain only elements which are necessary for the explicit goal of the activity performed with the system. No data should be copied without necessity.”
Expensive risky operations: “No system is 100 per cent privacy safe. Human users should be made aware of which operations are privacy-sensitive. Operations identified as privacy-sensitive should be made costly for the system, the human user, the third party.”
Avoid surprises: “Human users should be made aware when their activity has an effect on the system. Acknowledgement should be explicit for irreversible major changes. Cancellation should be an option as much as possible, not only in the interface, but in the whole interaction with the system.”
Consider time: “An expiry date should be the default option for all data.”
Compared to the current unregulated world of privacy (mis)management, this document is truly noteworthy. While these concepts are a real contribution to advancement in the construction of socially robust ambient sensing systems, they are not sufficient. Some researchers have proposed improving privacy design through ever more stringent methods of data access control [57]. However, we believe that the problem lies in a different corner. We believe that trustable systems are more important than precise systems. How can people rely on these concepts if they cannot see them in action and check them at will? Without trust, none of the nine concepts can be effective. Trust is paramount, but difficult to generate and not to be confused with the promotional slogan of ‘trusted computing’ [67] that focuses on access restrictions and secure internal operation instead of enabling external control and allowing users to verify and ‘trust’ their computing systems.
Furthermore, the implementation details of the nine point privacy manifesto are critical. Good intentions can be undone by compromised implementation details. For example, Ambient Agoras proposes to free people from indiscriminate observation by ambient sensing technologies. This is achieved by a custom designed two part mobile device (ID stick and reader) called the personal aura that enables users to control their appearance in a smart environment by deciding whether they
want to be visible for remote colleagues, and if so, in which social role they want to appear. However, ‘opting out’ through the privacy mode does not guarantee privacy from other sensing modalities. Lounge cameras persistently watch the workers as they enjoy their break, independent of their personal privacy aura choice. The concept of an ambient experience space as opposed to an ambient observation space is somewhat misleading, offering ‘designer surveillance’ to placate any pesky opposition. Aesthetically pleasing intrusive surveillance is no better than poorly styled and color mismatched surveillance. In this regard the observations from Mitsubishi Electric Research Laboratories [74] on the advantages of low-resolution occupancy sensing over high-resolution video surveillance are important (albeit intuitively obvious). Video cameras are usually ceiling mounted and watch bodies and faces while infrared sensors are wall mounted and sense motion. Infrared sensors that detect only thermal infrared emissions are blind to the kind of data most people are sensitive about. Indeed, infrared presence sensors were strongly preferred by a sample user group over video cameras as they generated a more relaxing experience of being perceived but not being watched.
The advantage of the experience metaphor used by the authors of Ambient Agoras is that it actively acknowledges the pervasiveness inherent in smart networked sensing systems and implicitly stipulates that people should be able to feel comfortable in its vicinity. But experience culture and leisure culture are rhetorical devices that discourage dissent. If ‘fun’ is to be had, who will want to argue? In this sense the experience metaphor is a distractor. It distracts from the fact that the experience created is geared to workflow improvement only. Even in lounge spaces people remain tentatively available for work related conversations while having their coffee break, effectively compromising the very idea of a break from work.
5.1 Combat Zones
In Ambient Intelligence and the Politics of Urban Space, Crang and Graham [22] identify three emerging dynamics in the “politics of visibility”: commercialization, militarization and the artistic endeavors to re-enchant and contest the urban informational landscape. Quoting McCue [63], the authors see in AmI the enabler of robust identity dominance “to provide battle space awareness necessary to identify, track and target lurking insurgents, terrorists and other targets”. The militarization and commercialization of public space through ambient systems finds a counterweight in the work of activists and artists, who, according to the authors, attempt to make visible what the other two forces make invisible. Since the majority of civilian life occurs in cities, the holy grail of surveillance within the war on terror and asymmetric warfare is the technological unveiling of cities and urban life. To achieve this, say the authors, biometric sensors will need to verify and code people’s identities, as they flow through national or other borders, through fingerprints, iris scans, DNA, face and voice/accent recognition as well as gait and odor recognition. The authors join others [21] in fearing the forward looking nature of military-like surveillance
that replaces the act of observing and watching with that of ‘identifying’, and transforms urban space into combat zones that see.
Few ambient sensing paradigms are attracting more critical attention than RFID technologies [81], [54]. Some authors such as Kranenburg foresee a complete interweaving of urban infrastructure and traceable objects [53]. He points out that the making of the new networked cities is happening behind closed doors, with little public knowledge, discussion or consent. He also points out that trust in such systems cannot be created as long as there is no transparency in data management: who really knows what happens to all the data recorded by the continuously growing network of tagged objects, and who will be checking it? While the data emanates from the public domain, it resides largely with private parties, reducing the public to an unknowing bystander. Kranenburg currently sees the only recourse to connecting to the public in, well, scandals: RFID in passports leaking data, patents by industry to scan your garbage, a secret plot to chip all old people to prevent them from wandering off on their own. Kranenburg proposes ways of acting against the trend. A new interpretation of lifestyle management might create specifically tailored privacies instead of a single personal privacy, allowing one to choose the kinds of traces one leaves in a sensor network, not unlike the personal aura of the Ambient Agoras project.
The tagged and bookmarked world of RFID is the inverse of another imagined scenario: RFID-lessness and falling off the information grid. If an object or a person is not encoded and available to search engines once everything is assumed to be tagged and searchable, it effectively, informationally, ceases to exist. Already, pets entering the European Union must have a transponder implanted [31]. Untagged pets will be confiscated. It might not be long before untagged creatures, including human beings, constitute a new class of lost and disenfranchised beings, unable to find help or be found by others [6]. In the future, no one will be allowed to exist outside searchable databases, and those who do, do so at their own peril.
6 Amateur Ambient Intelligence
Several impulses on how ambient sensing can make city life richer come not from ambient intelligence professionals, architects or city planners, but from hobbyists, dilettantes, and early technology adopters. Amateurs lack expertise but also the bias and occasional ‘group think’ that often accompanies expertise. In the realm of interaction design, artists have conceived of usage scenarios human computer interaction designers generally do not consider [14]. In Help from Strangers – Media Arts and Ambient Intelligence we described the selective decoupling of utility from technical invention to generate alternative forms of experiencing automation systems. The following paragraphs will weave through projects chosen for their contributions to inventing new ways of combining people with networked sensing particular to urban contexts.
6.1 Slow Space Design
Sociality, spatialization and temporality refer to the readings between the lines of urban culture, a concept notably formalized by de Certeau [23]. This is a form of urban context awareness where the context is soft, fuzzy and not computable. Wait a moment, look at that tree. There is no reason to hurry. Such events are not to be found on a city map; they are generated at random by people wandering through the streets, going about their daily chores.
Temporality is a key component of urban experience. The Amble Time project, designed by Media Lab Europe [26] and described by Galloway in her work on ubiquitous computing in the city [41], utilizes the fact that city maps generally do not include temporal information. Amble Time adds an element of time to a PDA-based tourist map. With a GPS system and an assumed average walking speed, it creates a bubble that indicates the possible places one might reach within an hour on foot. Alternatively, given a final destination, it can show where a pedestrian could roam along the way and still arrive in a timely manner. Amble Time is an interesting example of a project that does not serve commercial interests, but easily could be modified to do so. The project would be experienced in a completely different manner if the ambling were designed to lead people to stores in lieu of parks. Since businesses will probably pay to have themselves included in these temporal walk maps, this might be unavoidable. It will be all the more important that AmI practitioners take particular care in making sure people can add their own paths, place signs and notices for friends, and easily remove all preset entries, including those of paying patrons.
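The geometry behind such a time bubble is simple enough to sketch. The following Python fragment is a minimal illustration, not the Media Lab Europe implementation: it assumes straight-line (great-circle) distances rather than an actual street network, and the walking speed, function names and coordinates are placeholders. A detour point is feasible if the walk through it still fits the time budget, which makes the bubble around a start and a destination an ellipse with the two points as its foci.

from math import radians, sin, cos, asin, sqrt

WALKING_SPEED_M_PER_MIN = 80.0  # assumed average walking pace, roughly 4.8 km/h

def haversine_m(a, b):
    # Great-circle distance in meters between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(h))

def reachable_alone(start, point, budget_min):
    # Is `point` inside the simple bubble reachable from `start` within the time budget?
    return haversine_m(start, point) / WALKING_SPEED_M_PER_MIN <= budget_min

def feasible_detour(start, point, destination, budget_min):
    # Can a walker pass through `point` and still reach `destination` on time?
    walk_min = (haversine_m(start, point) + haversine_m(point, destination)) / WALKING_SPEED_M_PER_MIN
    return walk_min <= budget_min

# Example with placeholder coordinates: an hour to amble from a hotel to a station.
hotel, station, park_bench = (53.349, -6.260), (53.353, -6.249), (53.351, -6.256)
print(reachable_alone(hotel, park_bench, budget_min=60))
print(feasible_detour(hotel, park_bench, station, budget_min=60))

A street-network version would replace the great-circle distance with walking times along mapped paths, but the feasibility test itself stays the same.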
6.1.1 Ritornell
Sonic complexity is a hallmark of urban experience. Sonic City [42] contributes to the many sounds that pulsate through the urban landscape. It enables people to create sound tracks while walking through the city. The installation allows people to leave pre-recorded messages at hidden places in public spaces that are then whispered to passers-by as they lean towards a sensor. These audio tags create a novel and sometimes unscripted way for people to interact with their physical and social surroundings.
Cities are full of underspecified spaces, spaces that do not seem to belong to anybody, but are used by everyone. Taking a cue from Marc Augé’s seminal text on non-places [9], Texting Glances, a collaborative project at the University of Dublin, activates urban waiting spaces such as bus stops, doctors’ offices, airports and similar places where people gather, usually with no interest in the location itself, waiting for something unrelated to the place to occur. Texting Glances is a mobile phone based collaborative game that generates a story. As people wait, they text (with their mobiles) into a system which in turn responds with an image. As more people participate, a multi-authored story is generated and shared with the public through large displays in a train station.
Ambient technologies often seem to counter the urban qualities of event, flow and temporality. Technologies create an almost contradictory sense of consistency and coherency while the urban world that is experienced is one of change, flow and surprise. Part of this, according to Galloway, stems from the tendency to imagine new technologies as representational artifacts rather than temporal performative practices. Galloway suggests that ubiquitous technologies should begin to understand themselves more as diverse procedures and performances as opposed to stable (networked) artifacts. How engineers should actually ‘perform’ such work, however, is not discussed at all. The heavy lifting is graciously left to others.
6.2 Unusual Encounters
The Cornell Lab of Ornithology’s citizen science projects, such as NestWatch [19], offer lay people the opportunity of contributing to bona fide research as bird watchers. Bird watching is an activity that relies on patient people in the field dedicated to the task. Cornell is making use of the widespread interest in bird watching by offering lay people the chance of participating in the lab’s well respected research and engaging them as free lab assistants. Precisely formulated templates, when properly filled out, provide the data that the researchers seek. Observations can be input from any internet enabled computer. A program aggregates and analyzes the volunteers’ inputs. The eyes and ears of Cornell’s Ornithology Lab are expanded far beyond its campus by this human computer network, and as a result, many people identify with the work.
SETI’s (Search for Extraterrestrial Intelligence) goal is to detect intelligent life outside Earth [5]. Radio SETI uses radio telescopes to listen for narrow bandwidth radio signals from space. Such signals are not known to occur naturally, so detection might provide evidence of extraterrestrial technology. The terabytes of data collected by the system are distributed to idling computers all over the world that work on small data chunks and return their results to a central server. Anyone can sign up and donate their computer’s idle cycles to the SETI project. This citizen computing scheme makes clever use of the internet’s non-centralized infrastructure and of computers idling online. For the participants, the allure of the project resides in contributing in a modest but personally perceptible and community appreciated manner to a grand and unique challenge. In return for free computing cycles, SETI@home delivers the satisfaction of contributing to an unsolved universal riddle.
6.3 Mobile Devices Everywhere
Mobile device communication habits are forming new ways of experiencing and suffering from the city. Jan Chipchase of Nokia’s Mobile HCI Group describes his group’s experiences studying mobile TV use in South Korea. Chipchase asked a
survey group why they used the mobile TV devices [17]. Answers included the desire for immediate knowledge, the desire to kill boredom and the wish to stay up to date with popular events. The times and places where device use occurred include bars (while waiting for friends), traffic lights and the commute, which in Seoul often exceeds one hour each way. The city is an obstacle course that demands patience from its inhabitants while the newest communication technology distracts from the fundamental traffic congestion problems. The city as built environment is most noticed when device reception fails due to intermittent signal strength.
In another study, Tamminen and co-authors describe the mobile phone habits of a study group navigating their way through Helsinki via public transportation [84]. The mobile phone, like the newspaper before it, has become a personal space claiming mechanism. Sitting in a bus or train, travelers engage in all kinds of useful and useless mobile applications during their transits and signal a desire for privacy when concentrating on mobile information delivery. Temporarily and situationally, the public bus seat morphs into a personal space.
Not all usage of phones and TVs is rational. The desire to be wirelessly connected varies from querying bus schedules to feigning phone calls in response to peer pressure [44]. Fake phoning, as this activity has been termed, shows how communication technologies, the most ubiquitous of ambient information systems, are capable of supporting some of the least rational behaviors of modern urbanites. Pathologies are to be expected in global mobile information device proliferation, and not all kinds of user generated content are equally desirable. When camera enabled mobile phones are used indiscriminately by mobs that are not always smart [75], urbanity suffers by its own design and creates its very own participatory panopticon [52].
7 Ambient Intelligence for The People
AmI systems usually function autonomously, with high level control removed from those who experience them. This could be countered by making public participation in the design of AmI systems a requirement. Interestingly, computer science has precedents in this regard. The Participatory Design tradition formulated in Scandinavia [12] was originally built upon the desire to include the needs of IT professionals in the design of the workplace. Several projects expanded the initial results to include experience based design methods that brought the practical knowledge of workers such as nurses into the design of health care facilities. For AmI it might be necessary to expand the Participatory Design principles to include (lay) people’s input. How might this happen?
7.1 Knowledge from Many Minds
In the future, AmI will become part of the infrastructure of cities, just as water, gas and electricity have in the past. One important challenge is finding out what people on the street really think and feel about ambient control systems and using the knowledge of many minds as feedback for the design of these systems. Most assessments of AmI systems, such as Ambient Agoras, survey only a few people on too few issues. It is not possible to generalize from such limited and biased data. Given the ubiquity of the internet, querying large numbers of people seems to have become easy. Infotopia [83] delivers a good framework for discussion on what to expect from gathering data from many people through online deliberation, blogging, wikis and similar venues. Deliberation is often heralded as a democratic feature of the web. Sunstein describes how, contrary to intuition, deliberating groups do not do well at aggregating information objectively and fall prey to groupthink. However, prediction methods, such as prediction markets with incentives, can achieve, according to Sunstein, the kind of knowledge isolated minds can not. To make use of this in our case would require, however, that one define the experience of an AmI system according to market mechanisms and bid on its success or failure. How much are you willing to bet that people do not want to be watched entering the office building in the morning? No information market can make judgments of value, but markets can aggregate the private judgments of individuals where there is an incentive to share them. For AmI systems, the incentive might not be directly monetary but emotional (fear of surveillance, loss of opportunity). Interestingly, the scope of prediction markets is constantly expanding. Recently, one successfully predicted the acquittal of pop star Michael Jackson. The Hollywood Stock Exchange [47] regularly anticipates box office successes and failures.
8 Remaking Public Space, with and in Spite of Ambient Intelligence
As mentioned above, AmI systems in the city do not act in full view of the public. They do not define representative facades of buildings. Rather, they form a network of possibilities and obstructed opportunities that modulate life in the city. The following sections describe a few ambient intelligence like projects that seek new ways of augmenting urbanity and renegotiating the limits of private and public domains with the current state of sensing technologies and computational infrastructure.
8.1 Paths of Surveillance
The Institute of Applied Autonomy [48] created a special service for anyone wishing to avoid ubiquitous surveillance. i-See is a web-based application charting the locations of closed-circuit television (CCTV) surveillance cameras in urban environments. With i-See, users can find routes devoid of these cameras through paths of least surveillance [Fig. 1] that allow them to walk around their cities without fear of being caught on tape by unregulated security monitors. The program takes starting and destination points as inputs and maps the shortest route with the fewest security cameras. The program relies on a list of camera locations as a reference for the path generation. Several cities, including Manhattan and Ljubljana, have already been mapped out and readied for i-See, now in its third generation. Unfortunately, the army of surveillance cameras grows faster than the small group of activists can work, making the impressive system less effective than it has the potential to be.
Fig. 1 A path of least surveillance traced in Manhattan (courtesy of the Institute of Applied Autonomy).
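The route generation i-See performs can be read as a weighted shortest-path problem: each street segment costs its length plus a penalty for every camera known to watch it. The Python sketch below is a hypothetical reconstruction, not the actual i-See code; the toy street graph, the camera counts and the penalty value are invented for illustration.

import heapq

# Hypothetical street graph: node -> list of (neighbor, length in meters, cameras on segment).
STREETS = {
    "A": [("B", 200, 0), ("C", 150, 3)],
    "B": [("A", 200, 0), ("D", 250, 1)],
    "C": [("A", 150, 3), ("D", 100, 0)],
    "D": [("B", 250, 1), ("C", 100, 0)],
}

CAMERA_PENALTY_M = 500  # each camera on a segment 'costs' as much as a 500 m detour

def least_surveilled_path(start, goal):
    # Dijkstra over edge cost = length + penalty * cameras; returns (cost, path).
    queue = [(0, start, [start])]
    settled = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if cost >= settled.get(node, float("inf")):
            continue
        settled[node] = cost
        for neighbor, length, cameras in STREETS[node]:
            heapq.heappush(queue, (cost + length + CAMERA_PENALTY_M * cameras,
                                   neighbor, path + [neighbor]))
    return float("inf"), []

print(least_surveilled_path("A", "D"))  # prefers A-B-D over the shorter but camera-heavy A-C-D

Tuning the penalty trades distance against exposure: a very large penalty approximates avoiding cameras at all costs, while a penalty of zero reduces the routing to an ordinary shortest path.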
8.2 Staging City Life
OptionalTimes [59] is an interactive movie controlled by the movements and moods of city dwellers that plays itself out on a public stage in the middle of the town of Almere. People’s motions within the public space act as triggers for interactive domino effects. Activities sensed by the motion detectors become input for an evolving narrative. The affordances of the sensing system are translated into narrative opportunities, playing themselves out differently based on the activity level of pedestrians moving into and out of the scene. For example, when a man walks with great haste through the square, the installation senses his pace and releases images of a reality sped up. Then one of the fictive characters enters the square in a rush. The authors see the work as a new way to relate to one’s surroundings and translate these relations into a public space and a stage for all to see. The movie is viewed at the same location where it is made and builds tension between the ideal and the real. The system works with multiple time frames: events are recorded and periodically projected back into the square. OptionalTimes mediates the constant change in public culture with an architectural backdrop that remains static. People walk into and out of the stage that is the city, temporarily participating in a fictive event only to disappear back into their own reality. The project is succinctly location specific. The town square of Almere, reflected in the various characters, projections and re-projections that dive in and out of it, has its kinetic portrait taken.
8.3 Paths of Production
MILK [71] is a narrative GPS visualization system. The project tracks one of the dairy transportation lines from Latvia to the Netherlands, through the complete trade network between the source of the raw material and the destination of the final processed, packaged and consumed product; all the way from the cows’ udders to the mouths of the consumers. The collected data was augmented with audio recordings and photos of the places and people that make the transportation possible. Minutiae of their daily lives and routines were added to the database of geographic coordinates across Latvia and into the Netherlands. The composite dataset was edited into a visual story, played at exhibition venues in ‘cartographic’ order from the east to the west, following the path of the milk. Different parts of this path can be viewed through a website. Portraits of the farmers and close-up shots of the hands of workers and truck drivers complement the paths they created, while casual and sometimes insightful comments scroll across the bottom of the screen. Images of objects, buildings and maps add informational dimensions to the intelligently designed shots of the participants. The artist group intended, and succeeded, in building a new landscape-like representation [88] of the economic transactions of an anonymous life staple, surrounded with subjective events that no single media carrier (audio, image or locational data capture) alone can create.
8.4 Personal Pollution Control
The participatory urbanism project [69], [70] aims to promote new styles and methods for individual citizens to become proactive in their involvement with the city, the neighborhood and, as project author Paulos puts it, ‘urban self reflexivity’. Mobile phones double as measurement instruments for some environmental descriptors (carbon monoxide, sulfur dioxide, and nitrogen dioxide). According to Paulos this allows data usually collected and monitored by government agencies to be expanded and differentiated by location. It also allows lay people to contribute data with their own mobile devices.
One scenario from the project website imagines Mary suffering from asthma as she is gardening in her yard. She checks the particulate matter sensor on her mobile phone and finds that it reports dangerous levels of exposure, most likely from wood burning pollutants. The city’s centralized system has no knowledge of the problem, so Mary checks sensor data from people in her neighborhood. She sees that the problem is concentrated around a single cul-de-sac near her. She also notices that several homes in that area are using fireplaces and generating excessive airborne pollutants, which is forbidden by local ordinance. Mary forwards the measurement collection to her local air quality measurement district, where action is taken to enforce the clean air policy.
Here, networked information devices allow lay people to augment government services where they fail to operate properly. As opposed to the scenarios described in the EU Report on Ambient Intelligence in 2010 [49], participatory urbanism advocates a bottom up approach to environmental protection and enforcement. Measuring pollutants, even when it is performed properly, is only one part of environmental control, and not the most challenging one. Still, the project succeeds in mapping out a way in which networked citizen surveillance might help individuals assert their personal concerns vis-à-vis complacent government agencies.
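The neighborhood comparison in Mary’s scenario amounts to aggregating citizen readings by location and flagging clusters that exceed a guideline value. The Python sketch below shows one plausible way to do this; the data layout, the block names, the threshold and the minimum-report rule are assumptions for illustration, not part of the participatory urbanism project.

from collections import defaultdict
from statistics import mean

# Hypothetical citizen readings: (block id, pollutant, value in micrograms per cubic meter).
READINGS = [
    ("elm-court", "pm2.5", 62.0), ("elm-court", "pm2.5", 71.5), ("elm-court", "pm2.5", 58.3),
    ("oak-lane", "pm2.5", 9.1), ("oak-lane", "pm2.5", 12.4),
]

THRESHOLD = {"pm2.5": 35.0}   # assumed 24-hour guideline value, for illustration only
MIN_REPORTS = 3               # require several independent reports before flagging a block

def flag_hotspots(readings):
    # Average citizen-contributed readings per block and flag blocks over the threshold.
    by_block = defaultdict(list)
    for block, pollutant, value in readings:
        by_block[(block, pollutant)].append(value)
    return {
        block: round(mean(values), 1)
        for (block, pollutant), values in by_block.items()
        if len(values) >= MIN_REPORTS and mean(values) > THRESHOLD[pollutant]
    }

print(flag_hotspots(READINGS))  # {'elm-court': 63.9}

Requiring several independent reports before flagging a block is one simple guard against a single faulty or mischievous sensor triggering an enforcement request.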
8.5 Full Body Ambient Intelligence
AmI in the city expands to ambient urban recreation. City waterfronts are coveted areas of public leisure removed from hectic city life but well within information delivery infrastructures. The Glass Bottom Float project (GBF) is a practical example of critical ambient intelligence applied to environmental monitoring, real time responsiveness and long term observation [8]. GBF is a beach robot [Fig. 2] that floats in lakes on public beaches and informs beach visitors of the water quality and, at a later stage, of the expected pleasure of swimming in the waters. More often than not, people are informed by anonymous government agencies about water and air quality via occasional official news releases, often after critical thresholds have been exceeded. Furthermore, most official sources of environmental information are updated too infrequently [Francy2006] and the data are usually not collected where people experience their environment. Identification with water quality is often unintuitive for lay people due to the abstract measures used by the scientific community.
Fig. 2 GBF at Woodlawn Beach State Park, NY (courtesy of realtechsupport).
In order to address this, the GBF project has devised an additional measure of water quality related to people’s immediate uses of recreational waters. The authors refer to this as the swimming pleasure measure (SPM). SPM is based on several inputs [Fig. 3], including water temperature, pH, conductivity, chlorophyll, turbidity, salinity, dissolved oxygen and total dissolved solids. A local weather station delivers real time data on air temperature, relative humidity, air pressure, wind speed and direction. A commercial fish finder is reconfigured to deliver water depth, wave motion, positional information as well as the presence of fish large and small. A hydrophone listens for speed boats passing by, people playing on the float and underwater marine life.
Fig. 3 GBF sensor suite and the waterproof box (courtesy of realtechsupport).
SPM combines quantitative and qualitative data. It is a complex measure that includes subjective ways of knowing. In order to capture this, swimmers at the beach are interviewed for their opinion on the water conditions. They are asked for either a word (awful to excellent) or a number (1 to 10) that matches their experience. This input is sent to the remote database and aggregated with the sensor data to see when and how the two different ‘views’ on the water overlap or differ, leading to the possible prediction of people’s opinions, not unlike Von Ahn’s harnessing of human input to automate image annotation [91], [92]. Furthermore, the park manager enters his daily water bacteria measurements into the system. In the US, the Environmental Protection Agency stipulates that public swimming waters contain no more than 235 cfu (coliform forming units) [30]. This input serves as a control for the internal assessment and ensures that the calculated measure of swimming pleasure does not blatantly contradict government regulations. This is important as GBF collects data that is not traditionally part of the water quality assessment protocol. GBF does not attempt to replace current procedures, inadequate as they may be. GBF is an addition to ongoing efforts, and places them in a new context. The multi-pronged and collaborative approach improves the quality of the data as well as the technical and social robustness of the project. It also ensures that the government controlled beach operator actively supports the system, a prerequisite for project success.
The SPM is shared with the public in multiple ways. Beach visitors are encouraged to spend the afternoon daydreaming on the float as it bobs about in the water and maps out paths of least contamination. Depending on the water quality, a light flashes a color code that conforms to beach side practice (red: danger, blue: caution, green: good). The latest sensor data is available on mobile phones so people can check the local conditions before they drive off to the beach (and waste time in traffic) [Fig. 4]. Recordings from the hydrophone give visitors samples of the underwater audio landscape and remind them that humans share the water with many other species as well as motorboats.
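The published description does not spell out how sensor readings, swimmer votes and the manager’s bacteria counts are combined, so the following Python sketch should be read only as one plausible arrangement: the field names, weights and cut-off values are assumptions. What it does preserve from the description above is the role of the EPA limit as an override rather than as just another weighted input, and the mapping of the final score onto the red/blue/green light code.

# Hypothetical sensor snapshot and visitor votes; names, weights and cut-offs
# are illustrative assumptions, not the published SPM formula.
sensors = {"water_temp_c": 22.5, "turbidity_ntu": 4.0, "dissolved_oxygen_mgl": 8.1}
visitor_votes = [7, 8, 6, 9]          # swimmers' 1-10 ratings collected at the beach
coliform_cfu = 40                     # park manager's daily bacteria count

def swimming_pleasure(sensors, votes, cfu, epa_limit_cfu=235):
    # Blend a crude sensor score with the average visitor rating, both on a 0-10 scale.
    # The regulatory bacteria count acts as an override, never as just another input.
    if cfu > epa_limit_cfu:           # never contradict the government limit
        return 0.0, "red"
    sensor_score = 10.0
    sensor_score -= min(sensors["turbidity_ntu"], 10)            # cloudier water, lower score
    sensor_score -= max(0, 6 - sensors["dissolved_oxygen_mgl"])  # penalize low oxygen
    subjective = sum(votes) / len(votes) if votes else sensor_score
    spm = 0.5 * max(sensor_score, 0) + 0.5 * subjective
    color = "green" if spm >= 7 else "blue" if spm >= 4 else "red"
    return round(spm, 1), color

print(swimming_pleasure(sensors, visitor_votes, coliform_cfu))   # (6.8, 'blue')

Comparing the sensor score and the averaged votes over time is also where the two ‘views’ on the water can be checked against each other, in the spirit of the prediction idea mentioned above.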
Fig. 4 GBF output modalities: mobile, resting spot, micro-blog (courtesy of realtechsupport).
GBF also sends short text messages to the micro blogging service Twitter. The twittering connects the fickle but rough watery environment with fickle online communities removed from the outdoors. It brings things that concern us all into the private domain through personal communication media.
9 AmI in the City, Revitalized
We conclude with a scenario informed by the discussions above.
Sal awakens. It is a bright and sunny day. Sal looks out of the window of her condominium. Several flocks of birds are displayed on the networked active window. The city airport has been tracking geese living close to the airport to warn them of approaching aircraft. Warning the geese prevents collisions between planes and birds. Preventing even one plane from colliding with geese offsets the costs of collecting this information for decades, and allows the system to be freely shared at no cost. When air traffic is low at night, the system tracks urban bats and makes the bat sightings available to networked active windows. Sal knows now that many bats frequent her neighborhood in June, that some stay into July and that all disappear after August. Sal’s phobia of bats vanished once she began to follow the bats’ trajectories.
Sal gets her jogging gear on. Not the new set with the body sensors. It was damaged in the dryer a few days ago. While slipping on some old unwired jogging pants, she reminds herself that it makes no sense to check body vitals every time you go outdoors. There is just too much data around these days.
It is a warm morning as Sal jogs over to the neighborhood MicroPublicPark. Sal sees a crowd of people over at the far end of the park. Someone is gesticulating towards the large water tower Water-IS-us recently installed in the park. Curious, she jogs along for a closer look. She spots Joe and inquires about the commotion. Joe, a member of Water-IS-us, tells Sal that the MicroPublicPark’s water tower control room seemed to be working just fine, at last.
MicroPublicPark is open to all. It is a surveillance-free space that symbolizes and performs public maintenance of shared water resources. It operates in full view of the public. Anyone can watch the system as it analyzes the city’s water supply and informs on the state of events. Not only does the system analyze the water and warn of compromised water quality, it monitors the sources from which the water is collected. An upstart company specializing in sensor networks donated the system and volunteers keep it working. The city offers research and development grants to people who propose and implement improvements to the MicroPublicPark. Considerable expertise in designing state of the art environmental control has been collected as a result. University students get course credits for helping out. The system is networked with other MicroPublicPlaces throughout the city. Direct links with MicroPublicPlaces across the globe are in progress. Anyone can stop by the water bar at any time and get a cup of water to taste the results of the system.
Sal takes a cup of fresh MicroPublicPark water. “Hmm. Tastes like water from a glacier stream. I am surprised!” Joe adds that the MicroPublicPark has a state of the art water analysis and augmentation system that combines the most advanced technologies with old technologies. Like the Romans, the MicroPublicPark stores the water in vast underground cisterns lined with limestone from an old local quarry. The ambient water flow monitoring system can keep track of the water from the small streams outside of the city all the way to the glass of water in your hand. Furthermore, the system keeps the public abreast of water related commerce. The
stock prices of the main players in the global water business are displayed on the large data screen together with the water analysis results. Company video conferences are streamed to the display and announced in the park so everyone can see how they present themselves. Media reports and scandals are made public. Recently a company in Bolivia, GreenWater, tried to buy the water rights to large parts of South America. They argued that by privatizing the water resources they would ensure that they are put to efficient use, that they maximize shareholder profit and, as a consequence, save the earth from wasting precious water. Sal is not sure if she believes the story. Joe points to the control room display where nasty electronic comments are being posted next to the GreenWater company logo. GreenWater has climbed to the top of the chart of bad companies, an honor no company can take lightly due to the wide appeal this popular chart holds with globally networked consumers. “In Shenzhen”, Joe notes, “products from GreenWater have been pulled from the shelves due to massive consumer protest”. Sal is not sure if Joe is getting carried away. “Gotta go, see you later”, she says as she jogs off through the park.
At home under the shower Sal realizes that she is drinking the water she is showering in. Only a small part of the world’s population has that luxury, she reminds herself. Sal checks the MicroPublicPark URL she committed to memory, scrolls past the list of events and planned activities and finds many more references to water mismanagement in the US and reported water rights theft in Africa. The list seems endless. Sal is confused but determined to revisit the MicroPublicPark again the next morning.
10 Outlook

The models of the Greek polis, the European city, and the American suburb no longer suffice as models of urbanity in the age of ambient intelligence. We need new models, models that have more Africa [29] and Mars in them. The public changes under ambient intelligence. Information cities inspired exclusively by economic and technical infrastructure models cannot scale into satisfying urban life. City planning is already a multi-disciplinary domain where professionals and non-professionals of all kinds collaborate to balance contradicting interests and needs. Ambient systems design needs to come out of the bunker and move into full view so that it becomes a public affair. Knowledge from other domains must help AmI expand its current practice. In the end, the (AmI) engineers, managers and politicians will build the new ambient cities. But we all need to come together and work together, even though we do not get along that well.
References

[1] Aarts, E., Encarnação, J.: The Emergence of Ambient Intelligence, Berlin, Germany, Springer, 2006.
[2] Agre, P.: Computation and the Human Experience, Cambridge University Press, 1997.
[3] Agre, P.: Peer-to-Peer and the Promise of Internet Equality. In: Communications of the ACM, Vol. 46, Nr. 2, pp. 39–42, 2003.
[4] Agre, P.: Internet Research: For and Against. In: Mia Consalvo, Nancy Baym, Jeremy Hunsinger, Klaus Bruhn Jensen, John Logie, Monica Murero, and Leslie Regan Shade, eds, Internet Research Annual, Volume 1: Selected Papers from the Association of Internet Researchers Conferences 2000–2002, New York: Peter Lang, 2004.
[5] Anderson, D. P., Cobb, J., Korpela, E., Lebofsky, M., and Werthimer, D.: SETI@home: an experiment in public-resource computing. Commun. ACM 45, 11, pp. 56–61, 2002.
[6] Andrejevic, M.: Nothing comes between me and my CPU: smart clothes and ubiquitous computing, Theory, Culture and Society, vol. 22, no. 3, pp. 101–119, 2005.
[7] Arendt, H.: The Human Condition. Chicago: University of Chicago Press, 1958.
[8] Atkinson, J., Böhlen, M.: The Glass Bottom Float Project, work in progress.
[9] Augé, M.: Non-lieux, Introduction à une anthropologie de la surmodernité. Editions du Seuil, 1992.
[10] Berman, M.: All That Is Solid Melts Into Air: The Experience of Modernity. New York, (1982), Penguin Books, 1988.
[11] Fuller, B. R.: Nine Chains to the Moon, Anchor Books 1938, 1971.
[12] Bødker, S., Ehn, P., Kammersgaard, J., Kyng, M., Sundblad, Y.: A UTOPIAN experience. On design of powerful computer-based tools for skilled graphic workers. In G. Bjerknes, P. Ehn, and M. Kyng, editors, Computers and Democracy, a Scandinavian Challenge, pp. 251–278. Aldershot, Gower, Avebury, England, 1987.
[13] Böhlen, M., Rider, S., Baldwin, M.: Easier-Travel. Extreme tourism and the null trip, International Conference on Tourism and Literature: Travel, Imagination, Myth, Harrogate, UK, 2004.
[14] Böhlen, M.: Help from Strangers, Media Arts in Ambient Intelligence, Advances in Ambient Intelligence, Frontiers of Artificial Intelligence and Applications (FAIA), IOS Press, Amsterdam, The Netherlands, 2007.
[15] Castells, M.: The Rise of the Network Society, The Information Age: Economy, Society and Culture Vol. I. Cambridge, MA; Oxford, UK, 1996.
[16] Hydra: A Robust and Self Managing Video Sensing System for Retrospective Surveillance, Defense Intelligence Agency DIA-MASINT, PI Surendar Chandra, co-PI Pat Flynn, University of Notre Dame, 2006.
[17] Chipchase, J., Yanqing, C., Jung, Y.: Personal TV: A Qualitative Study of Mobile TV Users. In: Interactive TV: a Shared Experience, 5th European Conference, EuroITV, pp. 195–204, Amsterdam, The Netherlands, 2007.
[18] Corbet, J.: The Linux Developer Network. The Linux Foundation. How to Participate in the Linux Community, August 2008. http://ldn.linuxfoundation.org/how-participate-linux-community, accessed Dec 21, 2008.
[19] The Cornell Lab of Ornithology. NestWatch. http://watch.birds.cornell.edu/nest/home/index, accessed Dec 21, 2008.
[20] Cramer, F.: Re Interview with Christopher Kelty: the Culture of Free Software, Nettime, a moderated mailing list for net criticism, collaborative text filtering and cultural politics of the nets, 9:59 PM, 25th August, 2008.
[21] Crandall, J.: Anything that moves: Armed vision, CTheory, June edition, 1999.
[22] Crang, M., and Graham, S.: Ambient intelligence and the politics of urban space, Information, Communication and Society, 10:6, pp. 789–817, 2007.
[23] de Certeau, M.: L'Invention du Quotidien. Vol. 1, Arts de Faire. Union générale d'éditions, pp. 10–18, 1980.
[24] Debord, G.-E.: La société du spectacle, 1967, Editions champ libre, 1971.
[25] The Disappearing Computer, EU-funded initiative on the Future and Emerging Technologies activity of the Information Society Technologies (IST) research program. http://www.disappearing-computer.net/, accessed Jan 25, 2009.
[26] Taylor, A., Donovan, B., Foley-Fisher, Z., and Strohecker, C.: Time, voice, and Joyce. Proceedings of the First ACM Workshop on Story Representation, Mechanism and Context, pp. 67–70, 2004.
[27] Dourish, P., Anderson, K., Nafus, D.: Cultural Mobilities: Diversity and Agency in Urban Computing. In: Human-Computer Interaction, INTERACT 2007, pp. 100–113, 2007.
[28] Ellul, J.: La Technique ou l'enjeu du siècle, Librairie Armand Colin, 1954 (English 1964).
[29] Kelly, K.: Gossip is Philosophy. Interview with Brian Eno. Wired Magazine (The Wired Digital), March 2005.
[30] Report of the Experts Scientific Workshop on Critical Research Needs for the Development of New or Revised Recreational Water Quality Criteria, EPA 823-R-07-006, 2007.
[31] Regulation No 998/2003 of the European Parliament and of the Council on the animal health requirements applicable to the non-commercial movement of pet animals and amending Council Directive 92/65/EEC, 2003.
[32] Fassler, M., Terkowsky, C. (eds.): Die Zukunft des Städtischen. Urban Fictions. München: Wilhelm Fink Verlag, 2006.
[33] Feenberg, A.: Questioning Technology, Routledge, 1999.
[34] Ferguson, D., Sairamesh, J., and Feldman, S.: Open frameworks for information cities. Commun. ACM 47, 2, pp. 45–49, 2004.
[35] Flusser, V.: Curies' Children. In: Artforum, No. 5, Vol. 38, p. 36, May 1990.
[36] Flusser, V.: Der städtische Raum und die neuen Technologien. In: Medienkultur, Fischer Verlag, Frankfurt, 1997.
[37] The world's first zero carbon, zero waste city in Abu Dhabi, 2007. http://www.fosterandpartners.com/News/291/Default.aspx, accessed Jan 30, 2009.
[38] Francy, D. S., Darner, R. A. and Bertke, E.: Models for predicting recreational water quality at Lake Erie beaches: U.S. Geological Survey Scientific Investigations Report, 2006.
[39] Frazer, J.: An Evolutionary Architecture. London: Architectural Association, 1995.
[40] Fuller, M., and Haque, U.: Situated Technologies Pamphlet 2, Urban Versioning System 1.0, The Architectural League New York, 2008.
[41] Galloway, A.: Ubiquitous computing and the city, Cultural Studies, 18:2, pp. 384–408, 2004.
[42] Gaye, L., Holmquist, L. E., Mazé, R.: Sonic City: Merging Urban Walkabouts with Electronic Music Making, UIST'02, Paris, France, October 2002.
[43] Giedion, S.: Mechanization Takes Command, Oxford University Press, 1948.
[44] Harmon, A.: Reach Out and Touch No One, The New York Times, April 14, 2005.
[45] Heidegger, M.: Das Ding, 1950.
[46] Hovestadt, L.: Strategien zur Überwindung des Rasters. In: Archithese, Nr. 4, pp. 76–84, 2006.
[47] The Hollywood Stock Exchange. http://www.hsx.com/about/, accessed Jan 17, 2009.
[48] Institute for Applied Autonomy, i-SEE, 2003. http://www.appliedautonomy.com/isee.html, accessed Jan 28, 2009.
[49] ISTAG: Scenarios for Ambient Intelligence in 2010. Final Report of the European Commission, Information Society Directorate-General, compiled by Ducatel, K., Bogdanowicz, M., Scapolo, F., Leijten, J., Burgelman, J-C, Seville, 2001.
[50] Kittler, F.: Grammophon Film Typewriter. Berlin, Brinkmann and Bode, 1986.
[51] Koolhaas, R.: Junkspace, 2002.
[52] Kramer, M. A., Reponen, E., and Obrist, M.: MobiMundi: exploring the impact of user-generated mobile content – the participatory panopticon. In Proceedings of the 10th International Conference on Human Computer Interaction with Mobile Devices and Services, MobileHCI '08. ACM, New York, NY, pp. 575–577, 2008.
[53] van Kranenburg, R.: The Internet of Things. A critique of ambient technology and the all-seeing network of RFID, Network Notebooks 02, Institute of Network Cultures, Amsterdam, 2007.
[54] Kuitenbrouwer, K.: RFID and Agency. The Cultural and Social Possibilities of RFID. In: Open, Nr. 11, pp. 50–59, 2006.
[55] Latour, B.: Das Parlament der Dinge. Für eine politische Oekologie. Frankfurt a. Main, 2001. Original: De l'acteur-réseau au parlement des choses, in M (Mensuel, marxiste, mouvement), numéro 75 spécial sur Sciences, Cultures, Pouvoirs (interview J. C. Gaudillère), pp. 31–38, 1995.
[56] Latour, B.: How to Make Things Public. In: Latour, Bruno; Weibel, Peter (eds): Making Things Public. Atmospheres of Democracy. Cambridge: MIT, 2005.
[57] Lederer, S., Dey, A. K., Mankoff, J.: Everyday privacy in ubiquitous computing environments, presented at the Ubicomp 2002 Privacy Workshop, Gothenburg, Sweden, 2002.
[58] Lefebvre, H.: The Production of Space. Malden: Blackwell, 1991 (French: 1974).
[59] OptionalTimes/Amere, SKOR, Museum De Paviljoens, 2008. http://optionaltime.com/3/, accessed Jan 20, 2009.
[60] Lahlou, S., Jegou, F.: European Disappearing Computer Privacy Design Guidelines V1.1, Ambient Agoras IST-DC, 2004.
[61] Hovestadt, L.: Strategien zur Überwindung des Rasters. In: Archithese, Nr. 4, pp. 76–84, 2006.
[62] Matsuzawa, I.: Public Space for Ambient Intelligence. Siegerprojekt des International Architectural Design Competition, organized by NTT DoCoMo, 2006.
[63] McCue, C.: Data mining and predictive analytics: battlespace awareness for the war on terror, Defense Intelligence Journal, vol. 13, nos 1/2, pp. 47–63, 2005.
[64] McCullough, M.: On Urban Markup: Frames Of Reference in Location Models For Participatory Urbanism, Leonardo Electronic Almanac, vol. 14, issue 03, 2006.
[65] Mitchell, W. J.: City of Bits. Cambridge (MA): The MIT Press, 1995.
[66] Mitchell, W. J.: E-topia: Urban life, Jim, but not as we know it, MIT Press, 1999.
[67] Oltsik, J.: Trusted Enterprise Security. How the Trusted Computing Group (TCG) Will Advance Enterprise Security. White Paper. Enterprise Strategy Group, 2006.
[68] Mumford, L.: Technics and Civilization. New York: Harcourt, 1934.
[69] Paulos, E.: Participatory Urbanism. http://www.urbanatmospheres.net/ParticipatoryUrbanism/index.html, accessed Dec 21, 2008.
[70] Paulos, E., Honicky, R. and Hooker, B.: Citizen Science: Enabling Participatory Urbanism. In: Foth, M. (ed.) Handbook of Research on Urban Informatics: The Practice and Promise of the Real-Time City, IGI Global, Hershey, PA, 2008.
[71] MILK, Esther Polak and collaborators. http://milkproject.net/, accessed Jan 22, 2009.
[72] Prada, J. M.: The emergence of the geospatial web, posted on NETTIME, Jan 21, 2009.
[73] Price, C.: Fun Palace (1961), in: The Square Book, Academy Press, 2003 (reprinted from: Cedric Price: Works II, 1984).
[74] Reynolds, C., Wren, C.: Worse is Better for Ambient Sensing, Pervasive 2006 Workshop on Privacy, Trust and Identity Issues for Ambient Intelligence. Dublin, Ireland, (MERL TR2006-005), 2006.
[75] Rheingold, H.: Smart Mobs: the Next Social Revolution. Perseus Publishing, 2002.
[76] Sassen, S.: Public Interventions. The Shifting Meaning of the Urban Condition. In: Open, Nr. 11, p. 21, 2006.
[77] Schwartz, J.: Housing Crisis? Try Mobile McMansions, New York Times, December 2, 2007.
[78] Situationiste Internationale, Nr. 1, June 1958.
[79] Sloterdijk, P.: Architektur als Immersionskunst, in: Archplus, Die Produktion von Praesenz, No. 178, pp. 58–61, 2006.
[80] Stalder, F.: The Space of Flows: notes on emergence, characteristics and possible impact on physical space. c I T y: reload or shutdown? 5th International PlaNet Congress, Paris, August 26th–September 1st, 2001.
[81] Sterling, B.: Shaping Things, MIT Press, 2005.
[82] Streitz, N. et al.: Smart Artefacts as Affordances for Awareness in Distributed Teams. In: Streitz, N., Kameas, A., Mavrommati, I., eds.: The Disappearing Computer, LNCS 4500, pp. 3–29, Springer Verlag, Berlin, Heidelberg, 2007.
[83] Sunstein, C. R.: Infotopia: How Many Minds Produce Knowledge, Oxford University Press, 2006.
[84] Tamminen, S., Oulasvirta, A., Toiskallio, K., and Kankainen, A.: Understanding mobile contexts, Pers Ubiquit Comput, 8, 2, pp. 135–143, 2004.
[85] Tufekci, Z.: On the Internet, Everybody Knows You're a Dog: Presentation of Self for Everyday Surveillance. (Presented at the American Sociological Association, 2007).
[86] Tufekci, Z.: Can You See Me Now? Audience and Disclosure Regulation in Online Social Network Sites. Bulletin of Science, Technology and Society, 2008.
[87] Tufekci, Z.: Grooming, Gossip, Facebook and Myspace: What Can We Learn About Social Networking Sites from Non-Users. Information, Communication and Society, Volume 11, Number 4, June, pp. 544–564, 2008.
[88] Tuters, M., Varnelis, K.: Beyond Locative Media: Giving Shape to the Internet of Things, Leonardo, Volume 39, Number 4, pp. 357–363, 2006.
[89] VanOrd, K.: SimCity Societies Review, GameSpot, posted Nov 16, 2007.
[90] Virtual New York City. http://www.nyc.gov, accessed Jan 24, 2009.
[91] von Ahn, L., Blum, M., Hopper, N. and Langford, J.: CAPTCHA: Using Hard AI Problems for Security. In Advances in Cryptology, Eurocrypt, pp. 294–311, 2003.
[92] von Ahn, L.: Games With A Purpose. In IEEE Computer Magazine, June, pp. 96–98, 2006.
[93] Voss, J., Maak, N.: Mit Camouflage durch die Krise, Debatten, Die Zukunft der Staedte, in: Frankfurter Allgemeine Zeitung, 3 January 2009.
[94] Weiser, M.: The Computer for the 21st Century, Scientific American, 265, 3, pp. 66–75, 1991.
[95] Wright, W.: SimCity, 1989.
[96] Wright, W.: SimCity Societies, published by Electronic Arts, 2007.
The Advancement of World Digital Cities

Mika Yasuoka, Toru Ishida and Alessandro Aurigi
1 Introduction

Since the early 1990s, and particularly with the popularization of the Internet and the World Wide Web, a wave of experiments and initiatives has emerged, aiming at facilitating city functions such as community activities, local economies and municipal services. This chapter reviews the advancement of worldwide activities focused on the creation of regional information spaces. In the US and Canada, a large number of community networks using the city metaphor appeared in the early 1990s. In Europe, more than one hundred similar initiatives have been tried out, often supported by large governmental project funds. Asian countries are rapidly adopting the latest information and communication technologies for actively exchanging real-time city information and creating civic communication channels. All of the above cases are loosely related but independent activities, and thus their goals, services, and organizing bodies differ. A wide variety of names and buzzwords has been employed for labeling localized information and communication networks, depending on the fashion of the moment and their characteristics. 'Cybercities', 'virtual cities', 'community networks' and 'civic networks' are examples of names that have been attached to all sorts of projects. These definitions are very vague and therefore disputed. In this chapter, we will follow the terminology developed in our research community during the past decade and refer to such information spaces as 'Digital Cities'.

Mika Yasuoka, IT University of Copenhagen, e-mail: [email protected]
Toru Ishida, Department of Social Informatics, Kyoto University & Japan Science and Technology Agency, e-mail: [email protected]
Alessandro Aurigi, School of Architecture, Planning and Landscape, Newcastle University, e-mail: A.Aurigi@newcastle.ac.uk
Digital cities have been initiated by three distinct actors: 1) non-profit electronic community forums such as the 'freenet' movement in the US, 2) commercial services such as local information portals run by private companies, and 3) governmental initiatives for city informatization. Digital cities can be categorized according to their socio-technical and virtual-physical dimensions, as illustrated in Figure 1.

Fig. 1 A classification of digital cities according to their socio-technical and virtual-physical dimensions. (The figure places examples such as the Cleveland Free-Net, Blacksburg Electronic Village, Seattle Community Network, TeleCities, Digital City Shanghai, Digital City Kyoto, Virtual Helsinki, De Digitale Stad and the WELL along these two axes.)

On the socio-technical dimension, one trend is a high-tech direction where computer scientists play a central role and seek ways of applying advanced cutting-edge technologies to regional information spaces. At the opposite end of the scale, we find a social direction that focuses on how to facilitate citizens' community life using local information. On the virtual-physical dimension, some digital cities resemble virtual worlds like Second Life [26], with functions to support communities of interest [14], while others have close ties to real cities, providing functions to access regional information and activities. To some degree, though, the socio-technical and virtual-physical dimensions of digital cities are inseparable. Although digital cities in the early development stages tended to focus on one of these aspects, they have gradually been interwoven. Digital cities nowadays have become completely heterogeneous, reflecting the cultural background of the corresponding cities. In addition, they have become more integrated with citizens' daily life through the use of advanced technology.
This chapter overviews the history and further advancement of American, European and Asian digital cities along the socio-technical and virtual-physical dimensions. It observes the heterogeneity of models and experiments in the field, and reflects on the wider cross-continental issues of articulating community, industry and government within local information sites.
2 American Community Networks

In the early days of the Internet, somewhat avant-garde attempts were made in many places to explore concepts for the future of the network. It seems that no clear vision had been formed about the structure of the information space for regional communities at that time. President Clinton's administration announced the National Information Infrastructure Initiative (NII) in March 1993, and the NII agenda was proposed six months later. Both efforts contributed to establishing the foundations of information networks in the US. NII was proposed as one of the key scientific and technical policies directed towards satisfying the High Performance Computing Act. It specified concrete target values for six items such as universal access and scientific and technological research. Targeted research areas included economy, medicine, life-long education, and administrative services. As a consequence, it promoted and catalyzed a wide range of network studies. However, due to the size of the country it was impractical to depend on the state-sponsored information superhighway to construct an information network that could cover all regions. The political tradition of community-centered, privately inspired grass-roots initiatives and the ever strong emphasis on freedom of speech made it possible to spread local information communities across the country. At the dawn of community information networks, the leading community activists already recognized the necessity for local and independent networks to support the everyday life of citizens. Such activities quickly gained a high profile in the US and a great deal of worldwide visibility because of their innovative potential. The Cleveland Free-Net, born in 1986, was the world's first citizen-led community network. It started from a free electronic help line that connected doctors and patients. The Cleveland Free-Net provided a communication channel for everyone with a modem and computer. It promoted the exchange of ideas and opinions between the citizens of the Cleveland region. At the same time, the WELL (Whole Earth 'Lectronic Link), a pioneer 'virtual community' that would become famous worldwide, was born [23]. Stemming from a series of various electronic 'conferences', it formed a great virtual community where people of all generations and occupations could gather from all over the world. They ruled out anonymity and emphasized "exchange with a human face." The WELL was a network-based virtual community but it was not bound to locality. Though some of its 'conferences' extended from the virtual into the physical world and promoted face-to-face meetings and offline exchanges, the WELL was not designed to support regional communities. The early years of networked communities were characterized by a
strong emphasis – to the point of being an almost blind faith – towards the myth of the inevitability of the global village and what had been hailed as the 'death of distance' [4]. But a line between globally oriented 'non-grounded' projects and 'grounded' virtual cities with a clear relationship with geographical communities had to be drawn, because location-based digital spaces were rapidly emerging [1]. Many typical examples of region-oriented community networks started at the beginning of the 1990s [7]. The Blacksburg Electronic Village [5] was born in 1991, and the Seattle Community Network [24, 25] appeared in 1992. They differed in various ways. The former was started by a consortium consisting of regional companies and universities, while the latter emerged from civil activities. In the case of Blacksburg, Virginia Polytechnic Institute and State University worked with Bell Atlantic and local authorities. They built a consortium to create a regional virtual information space. The first two years were spent preparing computers and communication equipment. Dial-up services started in 1993. The Blacksburg Electronic Village gradually grew large enough to attract regional citizens. In Blacksburg's case, the leading role was taken not by the citizens who would use the network, but by the university, companies and local administration. The network was constructed from the technological viewpoint of companies and universities aiming at transferring research results into the telecommunications business. In the same way, many of the early community networks were promoted by 'technical' leaders. The citizens lacked the technological knowledge and experience to take the lead. Most community networks required the guidance provided by technical leaders in constructing the network infrastructure. However, once the technical infrastructure was in place, the users often rapidly acquired the necessary skills to manage the equipment. Later the development of these local information networks slowed down. For example, in 1995 the number of activities of the Blacksburg Electronic Village decreased. This was the result of a fundamental disagreement between technology providers and users due to different goals and expectations of the community networks. The Seattle Community Network (SCN) emerged from civil activities. In 1990, SCN was started as a part of the Computer Professionals for Social Responsibility (CPSR) initiative with the goal of creating a cyberspace accessible to the public. In 1992, the project was launched in Seattle and started to work on its ideals, vision and strategy. Services started in 1994. Though SCN was faced with financial problems and competition from commercial portals, it grew in size by cooperating with regional libraries, which gave an entry platform to the local community by offering a network accessible to everyone. SCN was not led by universities or the city administration, but by the citizens. Its purpose was to provide a sustainable information space for the region's inhabitants. Its services included e-mail provision, homepage creation, and support for regional activities rooted in everyday life. SCN took a leading role in grass-roots networking. Just after SCN went into operation, similar initiatives were made in many places across the US. In 1993, the Greater Detroit Free Net started services, and in 1995, Genesee Free Net commenced operation. In 1998, the number of community networks exceeded 300. But not all of them were successful. On the contrary, most were plagued by problems.
For example,
volunteer-based activities often suffered from chronic fund shortages, calling into question the economic viability of community-driven electronic networking. After the birth of pioneering community networks like Blacksburg and Seattle, commercial sites like AOL Digital City and Microsoft Sidewalk came into being. These were profit-oriented portals that provided local information. Their outstanding usability and rapid growth raised the warning that "the pursuit of profit will destroy grass-roots community networks." On the other hand, small ventures also started local portals. Even though these sites are run by companies, they, unlike the commercial portals offered by global companies, offer community sites dedicated to specific regions [24]. They act as an intermediate service medium between community networks and commercial portals.
3 European Digital Cities

The European conceptualization of the 'digital city' started with the experience of Amsterdam's De Digitale Stad (DDS) [3, 2] in 1994. The Amsterdam case was the first to use the term 'digital city'. The project quickly became well known and admired for the scope of its aims and its openness. Because of it, DDS became a paradigm, and it led a sort of 'digital city' movement within Europe. The Dutch digital city started its activities as a non-profit grass-roots organization, but the crucial difference from earlier American experiments was the government's direct support of the project. DDS's functionality ranged from the support of community activities to the encouragement of political discourse, such as linking citizens to the administration. DDS was initially funded by the regional administration, but later it became financially independent. To stand on its own feet, the digital city acquired a company-like character. To cope with the radical technological changes, top-down decisions were often required. The beginning of DDS was slanted towards the aspects of democracy, administration, politics and economy. Its 'urban' character was also strongly metaphorically emphasized by its interface. It presented and structured the system graphically by employing a city metaphor with thematic squares, cafes and 'residential' sections hosting individuals' websites, hence the 'digital city' idea. Several functions of DDS indicated that it was a non-grounded virtual community that would offer web space, email, and social virtual places where users were not necessarily interested in Amsterdam as a city. Commercial pressure eventually made DDS change into something rather different from its initial 'vision'. It lost its urban metaphorical character and had to compete with a much more varied and articulated offer of free Internet services and connectivity becoming available from many other companies. This eventually weakened much of its attractiveness and its 'cutting edge' status as a digital city [17]. At the end of 1995, Finland decided to commemorate the 450th anniversary of Helsinki. As one part of its celebrations, Helsinki initiated the construction of a high-speed metropolitan network. The consortium was established by Elisa
Communications (formerly Helsinki Telecom) and the city of Helsinki. The members included companies like IBM and Nokia, and universities in Helsinki. They jointly developed virtual tours as well as public services. For example, Helsinki City Museum provided a cultural service for citizens and visitors interested in the history of Helsinki [16, 15]. A three-dimensional virtual space was set up where visitors could wander around the Helsinki city hall of 1805. Even though the actual hall did not exist anymore, they built it on the Web and re-created the atmosphere of that age. Virtual Helsinki also offered technically advanced functions. For example, visitors to the three-dimensional virtual city model of Helsinki could make a phone call just by clicking the screen. The notable point here is that Europe has many digital city initiatives born through the cooperation of public administrations, companies and social activists. Frameworks similar to Helsinki's have been introduced in other European cities. In Ireland, for instance, the Eircom telecommunications company constructed the 'Ennis Information Age Town' under public and voluntary partnerships [6]. In Europe, it has been an important theme to integrate and coordinate the efforts of the private, public and voluntary sectors towards better regional and local information systems. TeleCities was an alliance of EU cities that started the European Digital City Project in 1993 [19]. This was originally characterized as a program to support telematics applications and services for urban areas. The TeleCities consortium characterizes itself as "an open network of local authorities dedicated to the development of urban areas through the use of information and telecommunications technologies." Its target was to share the ideas and technologies born from various city projects and to strengthen the partnerships among EU cities through this sharing. In this model, each city set two targets: (1) to utilize information and communication technologies to resolve social, economic and regional development issues, and (2) to improve the quality of social services through the use of information. The TeleCities support program allows each city to take its own course of action in facilitating the formation of partnerships and the successful bidding for European Community funds. However, while this approach seemed to have good potential for bringing together local authorities and the industrial and commercial sectors, grass-roots communities and voluntary projects were left out. As a consequence, many initiatives appeared to be 'pushed' onto their potential users in a top-down fashion. Especially in the early stages of the projects' conceptualization and construction, the top-down management failed to stimulate the citizens' participation, even though it ensured good levels of support. In some cases, management became aware of this operational gap and started to emphasize the need to "base the informatization on society" or the importance of informatization in resolving social issues. Along with this direction, the city of Vienna, one of the oldest members of TeleCities, created the informatization plan called 'The Strategy Plan for Vienna 2000'. In Vienna, both the city and the citizens shared the responsibilities for informatization. Under the strategy plan, the city was responsible for the civil services and the citizens were responsible for the projects executed by individual communities. The digital city was run by this cooperative structure.
For example, if the city wanted to resolve the issue of digital signatures, the citizens would participate willingly in
a trial project and enable its speedy introduction. Such collaboration was realized thanks to this cooperative structure. Moreover, plans, processes, results and lessons learned were reported to the EU and used in other cities, so that effective information sharing was realized. Similarly, in the Italian city of Bologna, the 'Iperbole' – Internet for Bologna and the Emilia Romagna region – initiative was started and run by the local municipality with the intention of being fairly open to grass-roots contributions. The city council would provide information and civic services, but would leave an open door to voluntary organizations and citizen groups. This approach allowed citizens to act as information providers and publishers and gave them a plaza within the official digital city. About 100 cities from 20 countries have been taking part in TeleCities. The EU's support of city informatization has been strengthening, and each city has developed its own digital counterpart, taking advantage of the sharing of best practices, project plans, and success stories. Although TeleCities acquired members steadily and facilitated projects continuously, it found that its activities still largely lacked the commercial viewpoint. In order to accommodate this need, in 2001, three of the main urban informatization organizations, eris, ELANET and TeleCities, moved towards cooperation. They also supported the Global Cities Dialogue in order to construct a worldwide urban network.
4 Asian City Informatization

The most significant trend in Asia was the emergence of city informatization as a governmental national project. Though the momentum generated by American grass-roots activities had a great influence, digital cities in Asia were created as a part of governmental initiatives. The first country in Asia to implement an informatization project was Singapore. The administration started the Singapore IT2000 Master Plan in 1992. In 1996, it launched the plan called 'Singapore One: One Network for Everyone' to develop a broadband communication infrastructure and multimedia application services. Korea proposed the KII (Korea Information Infrastructure) in 1995 in response to the American NII. In 1996, the Malaysian administration announced a plan called the Multimedia Super Corridor. Its new high-tech cities, Putrajaya and Cyberjaya, were a part of the Malaysian e-Government, and the surrounding regions were designated as multimedia zones. An organization similar to the Western community network was the research institute called the Malaysian Institute of Microelectronic Systems (MIMOS). MIMOS, which also acted as an Internet provider, broke away from the Malaysian government in 1996. Three different sectors, public administration, business and citizens, cooperated with each other to improve everyday life and promote social interaction. They realized technical innovations with the aim of contributing to the development of the country. In Japan, the regional network Koala was born with the assistance of a local prefecture. In 1985, it set up an information center, in 1994 it connected to the Web, and in 2000 it was reorganized as a business corporation that promoted community
networks. After that, many regional community networks have been developed in Japan with assistance from the administration and telephone companies. The leader of informatization in Asia was mainly the government. It was rare to see leadership exerted by civil activities or grass-roots organizations. The reaction from the industrial world was also slow. As for Koala, in spite of its public face, it started as a part of the informatization policy set by the local government. The policy of "introducing IT proactively and implementing coast-to-coast informatization" was originally proposed by the government. What tended to happen was the emergence of precise and relatively rigid government-oriented information strategies, rather than the clarification of rules and purposes for community networks in order to prepare the ground for bottom-up initiatives. As a result, the promoters of digital cities often conducted large-scale investments such as laying optical fiber lines or equipping all schools with PCs. The activities of Asian digital cities could be seen as mainly governmental initiatives conducted in the name of city informatization. An interesting example of rural networking was seen in Yamada village, Japan. This project started in 1996 with the aim of reversing the depopulation trend in rural areas. It can be defined as a regional informatization project to stimulate village life. The community site, which is mainly for village citizens, was developed with the support of the administration. In 1998, the ratio of connected villagers reached 60%, and interaction both within and outside the village increased. The Yamada project greatly contributed to a better understanding of the potential of civil participation in rural areas. However, it is worth noting that the Yamada village project was also mainly driven by the administration. So, despite its success, it was still not a citizen-led activity. More recently, however, due to a better understanding of the limitations of the top-down approach, many countries have, since the late 1990s, started to stress the importance of civil initiative. This movement has been fostered by the American community networks mentioned previously. People who experienced the grass-roots activities in the US played an important role in introducing the concept in Asia. In Japan, organizations like the Community Area Network Forum (CAN) were established. CAN was inspired by American community networks and its goal was to create a rich information space and to promote human communication in actual communities by utilizing the Internet. CAN itself was not a digital city but an organization promoting the building of regional networks in Japan. Apart from the community networking movement, several technology-oriented initiatives were made in Japan. In 1998, the Digital City Kyoto project [12], initiated by researchers at NTT and Kyoto University, developed a social information infrastructure for urban everyday life. Digital City Kyoto covered shopping, business, transportation, education and social welfare as well as tourism. The Digital City Kyoto forum was supported not only by local authorities and researchers, but also by local citizens, shop owners and students. Being led by computer scientists from telecom companies and universities, Digital City Kyoto had strong advantages in using advanced technologies such as GIS, VR, animation and social agents.
Eventually, however, top-down initiatives weakened further development of these vital activities, and the project ended in 2001 without further development by users. Still, it is noteworthy that
experience and knowledge of using advanced technologies such as real-time city sensors, 2D-3D maps, and rich video data were accumulated.
5 Advances in Information and Communication Technologies

As described in the previous sections, the early digital cities had distinct characteristics. Asian digital cities concentrated on using advanced information technology, while North American digital cities focused on strong social functions for supporting local community. The level of interaction with the real community also varied. Some digital cities created virtual communities with weak ties to the designated local communities, while others created digital versions of physical cities. Today, digital cities are widespread and have become more mature and diverse. The role of digital cities has transformed over time, reflecting the nature of the existing local cities and rapid advances in technologies. The distinct socio-technical and virtual-physical characteristics of early digital cities have often disappeared in current digital information spaces. It is hard to categorize modern digital cities as either technology test-beds or community spaces. They have evolved according to the functional demands of their local communities. For instance, if a real-world city attracts tourists, its digital city has to provide functions to support tourists as well as to support local residents with well-integrated information and online services. Recently, the integration of the virtual-physical dimension has accelerated due to the advent of widespread wireless communication networks and mobile devices. Many personal digital assistants (PDAs) and mobile phones are equipped with advanced technology such as digital cameras, video, personal data storage, GPS, WiFi and Bluetooth. They have made it possible to create virtual information cities deeply rooted in reality. In the early development stages of digital cities, several attempts to integrate the virtual and physical world were made, for example Digital City Kyoto and Virtual Helsinki, mentioned earlier in this chapter. In the Digital City Kyoto project, several sensors were installed in Kyoto city [12], showing the real-time state of Kyoto bus services such that Kyoto citizens and tourists could find efficient transportation routes. Vision sensors in Kyoto Station were synchronized with the virtual Kyoto Station and made it possible to simulate evacuation systems in Digital City Kyoto [22]. In Finland, the mobile telephone network connected to a 3D model of Virtual Helsinki and provided users with route navigation as well as location information [15]. Although these early examples of digital cities never went beyond an experimental level, later digital cities have kept on improving their interactivity with wireless communication networks and devices. In Digital City Kyoto, a city navigation system using existing city information was developed [13]. Originally, Digital City Kyoto was accessed only from desktop computers, but by providing location-based services, users also started to access it from devices with lower computational power such as PDAs and mobile phones. The use of such wireless communication and mobile technology has gradually blurred the border between virtual and physical cities.
Today there exist many other systems, apart from digital cities, that try to interact with real cities. Some use mobile devices such as PDAs or mobile phones. Some target citizens while others target tourists. Mai and colleagues [18] developed a system that provided positional information in real time with the help of a digital city model. Their system enabled tourists, who often have difficulty locating themselves on a printed map, to identify their location using images of objects captured by their PDA. By combining location information from GPS with user-captured images, the system gave positional information that was detailed enough to be used by first-time visitors. In addition to local information networks, special service networks such as medical information networks, traffic information networks and smart home networks have been established to actively support citizens' daily life. By connecting local hospitals, libraries and schools with cutting-edge digital environment technologies, a foundation for a new generation of digital cities is emerging.
6 Technologies in Digital City Kyoto

In order to give a holistic view of the information and communication technologies applied to digital cities, we describe some of the technologies of Digital City Kyoto in more detail. Digital City Kyoto was designed based on the latest technologies of the time, including GIS, VR, animation and social agents. Digital City Kyoto used the three-layer model illustrated in Figure 2 [10].
Fig. 2 The three-layer architecture of Digital City Kyoto.
1. The first layer, called the information layer, integrates and reorganizes Web archives and real-time sensory data using the city metaphor. A geographical database was used to integrate the different types of information. A tool was created for viewing and reorganizing the digital activities of people in the city.
2. The second layer, called the interface layer, uses 2D maps and 3D virtual spaces to provide intuitive views of digital cities. The animation of moving objects such as avatars, agents, cars, buses, trains, and helicopters demonstrates some of the dynamic activities in the cities. If an animation reflects a real activity, the moving object can become an interesting tool for social interaction: users may want to click on the object to communicate with it.
3. The third layer, called the interaction layer, is where residents and tourists interact with each other. Communityware technologies are applied to encourage interactions in digital cities.
The above three-layer architecture is very effective in integrating various technologies. Little guidance was required to determine where each technology should be positioned. Some of the technologies developed in Digital City Kyoto are introduced below.
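Before turning to the individual technologies, the layering itself can be pictured in software terms as thin interfaces stacked on a shared geographic database. The Python skeleton below is only a hypothetical paraphrase of the three-layer model, not code from the Kyoto system; every class, method and data value in it is invented for illustration.

class GeoDatabase:
    """Information layer core: Web pages and sensor readings keyed by location."""
    def __init__(self):
        self.entries = []  # each entry: (lat, lon, kind, payload)

    def add(self, lat, lon, kind, payload):
        self.entries.append((lat, lon, kind, payload))

    def query(self, kind):
        return [e for e in self.entries if e[2] == kind]


class MapInterface:
    """Interface layer: 2D/3D views rendered on top of the geographic database."""
    def __init__(self, db):
        self.db = db

    def render(self, kind):
        # A real interface would draw markers or animated objects; here we only list them.
        return [f"marker at ({lat:.4f}, {lon:.4f}): {payload}"
                for lat, lon, _, payload in self.db.query(kind)]


class InteractionLayer:
    """Interaction layer: chat, tours and agents attached to objects shown on the map."""
    def __init__(self, ui):
        self.ui = ui

    def click(self, kind, index):
        # Clicking a rendered object opens a social channel tied to that object.
        return "opening a chat session for: " + self.ui.render(kind)[index]


db = GeoDatabase()
db.add(35.0037, 135.7680, "bus_stop", "Shijo bus stop page")
print(InteractionLayer(MapInterface(db)).click("bus_stop", 0))

The point of such a structure is that the interface and interaction layers never touch raw Web or sensor data directly; everything reaches them through the geographic core, in line with the observation above that each technology could be positioned with little guidance.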
6.1 The Information Layer

When Digital City Kyoto was constructed, operations on Web sites mainly involved text. Users searched for information by keywords and software robots retrieved the information. This search-and-retrieve metaphor works well even now, when a wide variety of search methods have appeared. If the Internet is to be used for everyday life, however, a geographic interface is also important. As shown in Figure 2, the core of our digital city is GIS. The geographic database connects the 2D and 3D interfaces to the Web and to sensor-sourced information. From a system architecture point of view, introducing the geographic database allows us to test various interface and information technologies independently. GeoLink [8], developed by Kaoru Hiramatsu, is a collection of links to public spaces in the map of Digital City Kyoto and holds approximately 5400 pages. These public spaces, including restaurants, shopping centers, hospitals, temples, schools and bus stops, were collected in a single map. As shown in Figure 3, GeoLink can visualize how web pages relate to physical locations distributed throughout the city. People could directly register their pages in the geographic database, and the system would automatically determine the X-Y coordinates of each Web page. The challenge was, however, the unique ways of expressing addresses adopted in Kyoto. As Kyoto has more than 1200 years of history, there are various ways to express the same address, which makes the automatic process complicated. GeoLink uses a new data model called the augmented Web space. As shown in Figure 4, the augmented Web space is composed of conventional hyperlinks and geographic links. A typical example of a geographic generic link is 'within 100 meters'. The link is called generic because it is created according to each query issued by users. Suppose that the query "restaurants within 100 meters from the bus stop" is posed. Links are then virtually created from the Web page for the bus stop to those restaurants located within 100 meters of the bus stop. Efficient query processing methods for the augmented Web space have been developed.
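The idea of a geographic generic link can be made concrete with a small sketch. The Python snippet below is not the GeoLink implementation; it is a minimal illustration, over a toy in-memory table of geocoded pages with invented URLs and coordinates, of how a 'within 100 meters' link can be derived at query time for a request such as "restaurants within 100 meters from the bus stop".

from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass
class Page:
    url: str
    category: str   # e.g. "restaurant", "bus_stop"
    lat: float
    lon: float

def distance_m(a: Page, b: Page) -> float:
    """Great-circle distance between two geocoded pages, in meters (haversine)."""
    dlat, dlon = radians(b.lat - a.lat), radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(h))

def generic_links(anchor: Page, pages: list[Page], category: str, radius_m: float) -> list[Page]:
    """Create 'within radius' links from an anchor page at query time.
    Unlike ordinary hyperlinks, these links are not stored; they are derived
    from the geographic database for each query."""
    return [p for p in pages
            if p.category == category and distance_m(anchor, p) <= radius_m]

# Hypothetical data: a bus stop page and two restaurant pages.
pages = [
    Page("http://example.org/bus/shijo", "bus_stop", 35.0037, 135.7680),
    Page("http://example.org/rest/noodles", "restaurant", 35.0040, 135.7684),
    Page("http://example.org/rest/tofu", "restaurant", 35.0100, 135.7800),
]
print([p.url for p in generic_links(pages[0], pages, "restaurant", 100.0)])

In a real system the linear scan would be replaced by a spatial index; that is presumably what the efficient query processing methods mentioned above address.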
Fig. 3 GeoLink developed by Kaoru Hiramatsu.
Fig. 4 Information integration using GIS.
Real-time sensory information includes bus schedules, traffic status, weather conditions, and live video from the fire department. In Kyoto, more than 300
sensors have already been installed, gathering the traffic data of more than 600 city buses. Each bus sends its location and route data every few minutes. Such dynamic information really makes the digital city feel alive.
Fig. 5 Bus Monitor by Yasuhiko Miyazaki.
In a first experiment, real-time bus data was collected and displayed in the digital city. Figure 5 shows a bus monitor developed by NTT Cyber Solutions Laboratories. Buses in the city run on the Web in the same manner as in the physical city. Real-time city information is more important for people who are acting in the real city than for those who are sitting in front of desktop computers. For example, people would like to know when the next bus is coming, where the nearest vacant parking lot is, whether they can reserve a table at a restaurant, and what is on sale in a department store close to them. Digital City Kyoto could provide live information to mobile users through wireless phones.
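As a rough illustration of how such a feed can serve mobile users, the sketch below keeps the latest reported position of each bus and answers the everyday question of how far the nearest bus on a given route currently is from a stop. It is an assumption-laden toy, not the NTT bus monitor; the report format, identifiers and coordinates are all invented.

import math
import time
from typing import Optional

# Latest report per bus id: (route, lat, lon, timestamp). Buses are assumed to
# push an update every few minutes, as in the Kyoto experiment.
latest: dict[str, tuple[str, float, float, float]] = {}

def on_bus_report(bus_id: str, route: str, lat: float, lon: float) -> None:
    """Handle one position report from a bus-mounted sensor."""
    latest[bus_id] = (route, lat, lon, time.time())

def nearest_bus_m(route: str, stop_lat: float, stop_lon: float,
                  max_age_s: float = 600) -> Optional[float]:
    """Distance in meters from the stop to the closest recently reported bus on the route."""
    best = None
    now = time.time()
    for r, lat, lon, ts in latest.values():
        if r != route or now - ts > max_age_s:
            continue  # ignore other routes and stale reports
        # A small-area equirectangular approximation is enough at city scale.
        dx = math.radians(lon - stop_lon) * math.cos(math.radians(stop_lat)) * 6371000
        dy = math.radians(lat - stop_lat) * 6371000
        d = math.hypot(dx, dy)
        best = d if best is None or d < best else best
    return best

# Hypothetical reports and a query for route "205" near a stop.
on_bus_report("bus-17", "205", 35.0050, 135.7660)
on_bus_report("bus-03", "205", 35.0210, 135.7590)
print(nearest_bus_m("205", 35.0037, 135.7680))

Stale reports are ignored so that a bus that has stopped transmitting does not appear to hover at its last known position.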
6.2 The Interface Layer

An enormous amount of information can be accessed through the Internet. Humans have never experienced any similar expansion before in terms of scale and speed. People's various activities are being recorded in the semi-structured database called the Web over time. Consequently, it might be possible to visualize human activities and social interaction, which cannot be directly measured, by investigating the information collected on the Internet.
In GeoLink, the latitude and longitude of a shop are determined from the address given on its Web page, and a link is placed on the map accordingly. As information on the Internet increases, downtown areas gradually become visible on the map. We can 'feel' the activities in the town from the map showing restaurants, schools, hospitals and shops. Observing this map over several years enables us to understand the growth of the town. It is important that this observation can be carried out by anyone. A map of a digital city is not just a database to measure distances or to check addresses. A map is expected to become a new interface that helps us understand the activities of the city. Similarly, three-dimensional (3D) graphics technology is a key component of the interface layer when used in parallel with 2D maps. Providing 3D views of a digital city allows non-residents to get a good feel for what the city looks like, and to plan actual tours. Residents of the city can use the 3D interface to pinpoint places or stores they would like to visit, and to test walking routes. Figure 6 shows the 3D implementation of Shijo Shopping Street (Kyoto's most popular shopping street).
Fig. 6 A 3D implementation of Shijo Shopping Street.
3DML (http://www.flatland.com) is used for this visualization. It is not well suited to reproduce gardens and grounds, but has no problem with modern rectilinear buildings (see Figure 7 as an example). 3DML was originally used to construct games. Since it is also suitable for building a town, this tool was applied to develop 3D cities. A 3DML city can be basically understood as a city made of building blocks. For a start, a digital camera is used to take photos of the city. Some corrections are made in Photoshop, since it is inevitable that the upper parts
of buildings appear smaller due to perspective. The next step is to pile blocks up to create the buildings and paste the corresponding photos onto them. The results are not precise 3D spaces like those produced by virtual reality professionals, as 3DML is more a tool for non-professionals. Technical knowledge of virtual reality is not necessary to build a city in this way. Only patience is required.

Fig. 7 A 3D implementation of the Kamo River in Kyoto by Stefan Lisowski.
Fig. 8 Kobe City after the 1995 Earthquake by Stefan Lisowski.
Figure 8 shows a reproduction of Kobe right after the earthquake in 1995 created using existing photos. This work was requested by the Kyoto University Disaster Prevention Research Institute, which has collected 15,000 photos taken in the Kobe
area. The organization tried to display the photos on the Web, but the collection was not easy to search through. By rearranging the photos in a 3D space rather than displaying them like a photo album, the data could offer a different view of the scene. It was not easy to reproduce buildings from the photos, but it was possible to reproduce one part of the city. In doing this, it was realized that reproducing a 3D virtual city from photos or videos is an attractive approach. Two approaches to building a 3D space have been developed: modeling it with CAD and reproducing it from photos or videos. 3DML can be seen as a tool that combines the two approaches while giving the pleasure of personal creation.
6.3 The Interaction Layer

Social interaction is an important goal of digital cities. Even if we build a beautiful 3D virtual space, the city cannot be attractive if no one lives in it. Katherine Isbister, a social psychologist from Stanford University, hit on an interesting idea: to encourage intercultural interaction in digital cities, she designed a digital bus tour for foreign visitors. This idea was implemented in Digital City Kyoto. The tour was designed to be an entry point for foreigners to the digital city, as well as to Kyoto itself. The tour was implemented within the Web environment using I-Chat and Microsoft's agent technology (see Figure 9). A tour guide agent leads the visitors, who can interact with the system in many ways, through Nijo Castle in Kyoto, simulated using 3DML.
Fig. 9 Digital Bus Tour with Agent Guide (By Katherine Isbister).
Kyoto city, as the owner of Nijo Castle, allowed photos to be taken inside the castle. The 3DML world created from those photos is so beautiful that nobody would believe the work was done by non-professional photographers. To prepare for creating the tour guide agent, Isbister participated in several guided tours of
Kyoto. She noticed that the tour guides often told stories to supplement the rich visual environment of Kyoto and provided explanations of what Japanese people, both past and present, did in each place [9]. This system includes two new ideas. One is a service that integrates information search and social interaction. Unlike Web searches, which are solitary tasks, the digital bus tour creates an opportunity for strangers to meet. The other is the embodiment of social agents that perform tasks for more than one person. Unfortunately, this agent cannot understand the contents of conversations at this stage. After this experience, however, a virtual city platform called FreeWalk [21, 20] and a scenario description language called Q [11] have been developed. It was planned that agents should participate in the conversations of tourists on the bus and become famous as bus guides in the Internet world. Guide agents will stimulate various ideas about future digital cities.
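The scenario idea behind the tour can be sketched generically: a tour is a sequence of scripted cues, and the guide agent steps through them while reacting to what visitors type. The toy below is not written in the Q language and does not reproduce the Kyoto tour; the places, narrations and keywords are invented purely to show what a scenario-driven guide amounts to.

# Each scenario step: (place, narration, keyword -> extra remark).
SCENARIO = [
    ("Nijo Castle gate", "This is the main gate, built in the early Edo period.",
     {"photo": "Feel free to take a snapshot here; the carvings are famous."}),
    ("Ninomaru garden", "Shoguns received visitors in the palace behind this garden.",
     {"history": "The nightingale floors were designed to squeak and reveal intruders."}),
]

class GuideAgent:
    def __init__(self, scenario):
        self.scenario = scenario
        self.step = 0

    def narrate(self) -> str:
        place, text, _ = self.scenario[self.step]
        return f"[{place}] {text}"

    def listen(self, utterance: str) -> str:
        """React to visitor chat by matching scripted keywords; otherwise move on."""
        _, _, reactions = self.scenario[self.step]
        for keyword, remark in reactions.items():
            if keyword in utterance.lower():
                return remark
        self.step = min(self.step + 1, len(self.scenario) - 1)
        return "Let's move on. " + self.narrate()

guide = GuideAgent(SCENARIO)
print(guide.narrate())
print(guide.listen("Can I take a photo?"))
print(guide.listen("OK, ready."))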
6.4 Lessons Learned

From the experience of Digital City Kyoto, we have identified key technologies for digital cities and directions for future research.
1. Technology for information integration is essential to accumulate and reorganize urban information. Digital cities typically handle Web information and real-time sensor-based data from physical cities. Voluminous high-quality digital archives can also be accessed through digital cities. The idea of 'using a map' is commonly observed in digital cities. In this case, technologies are needed to integrate different kinds of urban information via geographical information systems (GIS). It is also necessary to introduce technology to handle photos and videos of the city so as to understand its activities. Great numbers of sensors embedded in the city will collect visual information automatically. These infrastructures will make 'sending and receiving information' in everyday life much easier.
2. Technology for public participation is unique to digital cities. To allow various individuals and organizations to participate in building digital cities, the entire system should be flexible and adaptive. For designing a human interface that supports content creation and social interaction, a new technology is required that encourages people with different backgrounds to join in. Social agents can be a key to encouraging people to participate in the development and life of digital cities. So far, most digital cities adopt the direct manipulation approach to realize friendly human-computer interaction, while social agents are used to support human-human interaction. The direct manipulation approach allows users to explicitly operate on information objects. Since social agents (human-like, dog-like, bird-like, and so on) will have the ability to communicate with humans in natural language, users can enjoy interacting with the agents and can access information without explicit operation. This allows a digital city to keep its human interface simple and independent of the volume of stored information.
3. Technology for information security becomes more important as more people connect to the digital city. For example, it is not always appropriate to make links from digital cities to individual homepages. We found that several kindergartens declined our request to link them to the digital city. This differs from the security problem often discussed for business applications, where cryptography, authentication and firewalls are the major technologies. Just as we have social laws in physical cities, such as peeping-tom laws, digital cities should introduce social guidelines that provide security, so that people feel comfortable about joining the information space.
7 Conclusion
The purpose of this chapter has been to elucidate digital city activities after 1990 and to provide some reflections on their geographical characteristics and cutting-edge technologies. We have described how digital cities have been developed by various organizations and how advanced information technology has been applied to provide location-based services. After intensive contributions from a wide variety of professionals, current digital cities have become deeply rooted in the daily life of citizens and widely accepted. Moreover, empowered by information and communication technologies, current digital cities reflect more city activities than ever. Cities – the physical ones – are complex entities. They have been successful for centuries and have kept their prominence despite all sorts of past predictions about their dissolution, precisely because of their ability to concentrate and somehow articulate the co-existence of community, voluntary movements, politics, commerce, tourism, culture and so forth. Digital cities need to deal with the same complex mix of things in order to attract citizens, retain their usage, and function as entities that ‘augment’ their physical counterparts. A city famous for its tourist attractions will not develop its digital city without an eye on tourism, because the commercial aspect is part of the basic structure of its everyday life. Popular squares would not play their role as the center of a city, as they normally do, without the presence of shops around them. The same, we expect, should apply to virtual squares and places. Digital cities, together with already established city information infrastructures, could form a foundation for the further development of city informatization. Based on wireless and mobile networks and handheld devices, we believe that future digital cities are moving into a new development phase in which the physical and the virtual city merge into a single entity.
References
[1] Aurigi A, Graham S (1998) The ‘crisis’ in the urban public realm. In: Loader BD (ed) Cyberspace Divide: Equality, Agency and Policy in the Information Society, Routledge, pp 57–80
[2] van den Besselaar P (2005) The life and death of the great Amsterdam Digital City. In: van den Besselaar P, Koizumi S (eds) Digital Cities III: Information Technologies for Social Capital, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 3081, Springer-Verlag, pp 64–93
[3] van den Besselaar P, Beckers D (1998) Demographics and sociographics of the Digital City. In: Ishida T (ed) Community Computing and Support Systems, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 1519, Springer-Verlag, pp 109–125
[4] Cairncross F (1998) The Death of Distance: How the Communication Revolution Will Change Our Lives. Texere Publishing
[5] Carroll JM (2005) The Blacksburg Electronic Village: A study in community computing. In: van den Besselaar P, Koizumi S (eds) Digital Cities III: Information Technologies for Social Capital, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 3081, Springer-Verlag, pp 42–63
[6] Gotzl I (2002) TeleCities: Digital cities network. In: Tanabe M, van den Besselaar P, Ishida T (eds) Digital Cities II: Computational and Sociological Approaches, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 2362, Springer-Verlag, pp 98–106
[7] Guthrie K, Dutton W (1992) The politics of citizen access technology: The development of public information utilities in four cities. Policy Studies Journal 20(4)
[8] Hiramatsu K, Kobayashi K, Benjamin B, Ishida T, Akahani J (2000) Map-based user interface for Digital City Kyoto. The Internet Global Summit (INET2000)
[9] Isbister K (2000) A warm cyber-welcome: Using an agent-led group tour to introduce visitors to Kyoto. In: Ishida T, Isbister K (eds) Digital Cities: Experiences, Technologies and Future Perspectives, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 1765, Springer-Verlag, pp 391–400
[10] Ishida T (ed) (1998) Community Computing and Support Systems, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 1519. Springer-Verlag
[11] Ishida T (2002) Q: A scenario description language for interactive agents. IEEE Computer 35(11):54–59
[12] Ishida T (2005) Activities and technologies in Digital City Kyoto. In: van den Besselaar P, Koizumi S (eds) Digital Cities III: Information Technologies for Social Capital, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 3081, Springer-Verlag, pp 162–183
[13] Koda T, Nakazawa S, Ishida T (2005) Talking Digital Cities: Connecting heterogeneous digital cities via the universal mobile interface. In: van den Besselaar P, Koizumi S (eds) Digital Cities III: Information Technologies for Social Capital, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 3081, Springer-Verlag, pp 233–246
[14] Lave J, Wenger E (1991) Situated Learning: Legitimate Peripheral Participation. Cambridge University Press
[15] Linturi R, Simula T (2005) Virtual Helsinki: Enabling the citizen; linking the physical and virtual. In: van den Besselaar P, Koizumi S (eds) Digital Cities III: Information Technologies for Social Capital, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 3081, Springer-Verlag, pp 110–137
[16] Linturi R, Koivunen M, Sulkanen J (2000) Helsinki Arena 2000 – augmenting a real city to a virtual one. In: Ishida T, Isbister K (eds) Digital Cities: Experiences, Technologies and Future Perspectives, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 1765, Springer-Verlag, pp 83–96
[17] Lovink G (2004) The rise and the fall of the Digital City metaphor and community in 1990s Amsterdam. In: Graham S (ed) The Cybercities Reader, Routledge
[18] Mai W, Tweed C, Dodds G (2005) Recognizing buildings using system and a reference city model. In: van den Besselaar P, Koizumi S (eds) Digital Cities III: Information Technologies for Social Capital, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 3081, Springer-Verlag, pp 284–298
[19] Mino E (2000) Experiences of European Digital Cities. In: Ishida T, Isbister K (eds) Digital Cities: Experiences, Technologies and Future Perspectives, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 1765, Springer-Verlag, pp 58–72
[20] Nakanishi H (2004) FreeWalk: A social interaction platform for group behavior in a virtual space. International Journal of Human Computer Studies (IJHCS) 60(4):421–454
[21] Nakanishi H, Yoshida C, Nishimura T, Ishida T (1999) FreeWalk: A 3D virtual space for casual meetings. IEEE Multimedia 6(2):20–28
[22] Nakanishi H, Koizumi S, Ishida T (2005) Virtual cities for real-world crisis management. In: van den Besselaar P, Koizumi S (eds) Digital Cities III: Information Technologies for Social Capital, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 3081, Springer-Verlag, pp 204–216
[23] Rheingold H (1994) The Virtual Community. Secker & Warburg
[24] Schuler D (2002) Digital Cities and Digital Citizens. In: Tanabe M, van den Besselaar P, Ishida T (eds) Digital Cities II: Computational and Sociological Approaches, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 2362, Springer-Verlag, pp 72–82
[25] Schuler D (2005) The Seattle Community Network: Anomaly or replicable model? In: van den Besselaar P, Koizumi S (eds) Digital Cities III: Information Technologies for Social Capital, Lecture Notes in Computer Science, State-of-the-Art Survey, vol 3081, Springer-Verlag, pp 16–41
[26] secondlife.com (2008)
Part VIII
Societal Implications and Impact
Human Factors Consideration for the Design of Collaborative Machine Assistants
Sung Park, Arthur D. Fisk, and Wendy A. Rogers
1 Introduction
Recent improvements in technology have facilitated the use of robots and virtual humans not only in entertainment and engineering but also in the military (Hill et al., 2003), healthcare (Pollack et al., 2002), and education domains (Johnson, Rickel, & Lester, 2000). As active partners of humans, such machine assistants can take the form of a robot or a graphical representation and serve the role of a financial assistant, a health manager, or even a social partner. As a result, interactive technologies are becoming an integral component of people’s everyday lives. Developing useful and usable assistants for everyday situations requires an interdisciplinary approach involving engineering, computer science, and psychology. Engineering fundamentals and the skills to produce a functional system are important, as are mathematical foundations of computing such as computer architecture, algorithm theory, and embedded systems (Kitts & Quinn, 2004). Understanding the interaction with such systems also requires scientific knowledge of the mental processes and behavior of the users. Our perspective is the psychological one, specifically that of human factors psychology. The goal of this branch of psychology is a better understanding of user characteristics, attitudes, roles, and expectations of machine assistants. Characteristics such as perceived usefulness, ease of use, and complexity are critical to consider during design, given that in a pervasive ambient intelligent environment, user interaction with these assistants is moving into the foreground.
Sung Park, School of Psychology, Human Factors and Aging Laboratory, Georgia Institute of Technology, Atlanta, GA. e-mail: [email protected]
Arthur D. Fisk, School of Psychology, Human Factors and Aging Laboratory, Georgia Institute of Technology, Atlanta, GA. e-mail: [email protected]
Wendy A. Rogers, School of Psychology, Human Factors and Aging Laboratory, Georgia Institute of Technology, Atlanta, GA. e-mail: [email protected]
H. Nakashima et al. (eds.), Handbook of Ambient Intelligence and Smart Environments, DOI 10.1007/978-0-387-93808-0_36, © Springer Science+Business Media, LLC 2010
In this chapter, we propose multiple heuristic guidelines for understanding the user variables that influence the acceptance of advanced technologies in ambient intelligent environments. Providing designers with heuristic tools to be considered during the design process, so as to increase acceptance of such technology, is the fundamental goal of this chapter. The chapter can also inform marketing approaches and the development of training programs and materials that support virtual human and robotic technology use.
2 Collaborative Machine Assistants (CMAs) Many terminologies are used interchangeably to refer to virtual humans or robots or to emphasize different aspects of them. Examples include affective virtual humans (Picard, 1997), anthropomorphic agents (Koda & Maes, 1996; Takeuchi & Naito, 1995), embodied conversational agents (Cassell, Sullivan, Prevost, & Churchill, 2000), relational agents (Bickmore & Picard, 2005), social robots (Breazeal, 2003), and assistive robots (Feil-Seifer & Mataric, 2005). These terms clearly overlap and are not exclusive. Embracing such overlap, we introduce the term, Collaborative Machine Assistant (CMA), which we define as a system or device that performs or assists a human in the performance of a task in a collaborative manner. We intend this term to include both graphical (i.e., virtual humans) and physical (i.e., robots) representations of systems with the purpose of assisting human users in performing essentially any type of task. The nature of assistance is a recursive process where both entities, users and CMAs, work together toward a common goal. This is a radically different form of assistance than we have seen with traditional machines (e.g., servant robot), which tend to be exclusively task-driven. Some tasks that require precision, timing, and coordination may benefit from traditional assistants whereas other tasks that require flexibility and intelligent problem-solving would benefit more from the collaboration of humans and CMAs (cf., Fong, Thorpe, & Baur, 2003). Because these new machine characteristics are fundamentally different from those of traditional virtual humans and robots, we perceived the need for a label that captures these unique characteristics - hence, CMA. There are two reasons we include both graphical assistants (virtual humans) and physical robots in our definition. First, we recognize a set of commonalities between the two non-human entities that have recently emerged in the ambient intelligent environment. Many characteristics involving acceptance of such high technology (e.g., perceived usefulness) apply to both entities and thus might help inform designers of both virtual agents and robots. Second, there is a disconnection between the two research communities but we would like to encourage more communication and collaboration between them. Specifically, literature on human-machine interaction
has centered on either the virtual human or the physical robot, but has rarely considered both types of agents (Dragone, Duffy, & O’Hare, 2005). CMAs are different from traditional assistants because they are collaborative, adaptive, social, and personalized. That is, a CMA will likely adapt to support the needs of its users, thus becoming a personalized assistant. This is possible because personal robots can model abstractions of the world, as opposed to traditional robots that use static models for preprogrammed commands. CMAs are also capable of social interactions that require varying degrees of intelligence, including an ability to communicate with humans through natural language. Social interaction is important when CMAs assist users with self-care tasks (e.g., disease management) that require users to be convinced to make health-related changes. Characteristics of CMAs intertwine with task demands. For example, the need for social mechanisms and explicit communication is fundamental for tasks requiring collaboration and adaptivity.
3 User Acceptance of CMA
An individual’s acceptance of a CMA becomes critical when considering personal use of such technology. Many variables have been identified as relevant to technology acceptance, some of which relate to the technology itself (technology characteristics) and others to the characteristics of the individual user (user characteristics). Understanding the potential variables that influence acceptance of a CMA may provide designers the opportunity to influence levels of acceptance. For example, variables such as knowledge of and anxiety about a CMA may be changed through training and instruction. One important dimension of technology acceptance is whether the product is incrementally new or radically new. As described earlier, CMAs are radically different from traditional robots or virtual humans. Radical technologies such as CMAs are revolutionary, original, and discontinuous (Green, Gavin, & Aiman-Smith, 1995). Variables that predict acceptance of incremental technology may not be applicable to a radical technology such as CMAs.
3.1 What is Acceptance?
What does it mean to accept a given CMA? Defining what acceptance of a CMA is, and specifying a valid index of acceptance (or rejection), is important to understanding the factors that influence acceptance. The consensus from the acceptance literature is that acceptance is a combination of attitudinal, intentional, and behavioral acceptance (Fishbein & Ajzen, 1975), as described in Table 1. The idea is that attitudes influence intentions, which in turn influence behaviors. Thus a person may have positive beliefs about a CMA, may have decided to
Table 1 Acceptance Types (adapted from Fishbein & Ajzen, 1975)

Acceptance Type | Definition | Example
Attitudinal acceptance | Positive evaluation; beliefs about something. | “I think that is a useful product.”
Intentional acceptance | Decisions to act in a certain way. | “I will buy that product.”
Behavioral acceptance | Actions. | Using the product.
purchase that CMA, or may actually have carried out acceptance behaviors such as purchase and use. The fact that CMAs are a radical technology means that designers must be mindful of the attitudinal level of acceptance, because radical technologies are often not as readily accepted as incremental innovations (Dewar & Dutton, 1986; Green, Gavin, & Aiman-Smith, 1995).
3.2 User Characteristics
3.2.1 Age
Chronological age is a general demographic variable that is thought to be related to technology acceptance. Early work led to the view that age negatively influences new product acceptance (e.g., Gilly & Zeithaml, 1985). However, a closer look into the relationship between age and acceptance provides interesting insights. Age relates to acceptance through mediators such as feelings of inability to adopt and use new technology (Breakwell & Fife-Schaw, 1988). Acceptance of a robot in the home environment was related to age in one study, in which younger adults had more positive attitudes regarding this idea (Dautenhahn et al., 2005); however, that study assessed individuals only up to age 55, and the mediating variables for this effect were not clear. Age also moderates the relationship between user perceptions and acceptance (Venkatesh, Morris, Davis, & Davis, 2003). Specifically, perceived ease of use (which we discuss further in a later section) positively influences acceptance more for older adults than for younger individuals. More fully understanding the relationship between age and technology acceptance is important because older adults could potentially benefit from CMAs. The growth of the older segment of the population (i.e., over age 65) and the shortage of labor in healthcare inspired the application of robotics in assisted-living environments (Pollack, 2005). Similarly, CMAs might perform a variety of collaborative activities with older adults, allowing them to continue their independent living at home.
Providing training programs and materials that support CMA use may diminish fears of adopting a radical technology such as CMAs. Consideration of whether a certain design change could make the CMA easier to use is very important for older adults. Benefits of using CMAs should be presented effectively and clearly, because older adults are willing to accept advanced technology if the benefit of using it is evident to them (Caine, Fisk, & Rogers, 2007; Melenhorst, Rogers, & Bouwhuis, 2006). Robots with social skills may be favored more by older adults than robots with fewer social skills. Older adults were reportedly more comfortable having a social interaction (i.e., a conversation) with a robot that had more social characteristics such as attentive listening, smiling, and using the older person’s name (Heerink, Krose, Wielinga, & Evers, 2006). The social characteristics of a CMA may increase the sense of social presence when interacting with a CMA companion and result in higher acceptance through greater enjoyment. One important point not to gloss over is that technology developments such as CMAs should be designed to augment the capabilities of older adults; that is, to enable them to do as much as possible independently. Designers should carefully allocate functions (either to the human user or to the machine) so that they assist older adults but do not impede or substitute for their capabilities, as it is critical that these technologies support older adults’ chosen living environment (Carpenter, Van Haitsma, Ruckdeschel, & Lawton, 2000).
3.2.2 Technophobia (Fear of Technology)
Technophobia is defined as fear of or dislike for new technology. Perceived anxiety toward technology negatively influences acceptance (Brosnan, 1999). Some CMAs may be disliked because they are almost too human-like, an idea called the “uncanny valley”, originally proposed by Mori (1970, in Japanese) and discussed in MacDorman (2005; see Figure 1). The uncanny valley concept refers to increasing acceptance as a robot increases in human-likeness, up until a critical point. This point, referred to as the uncanny valley, is reached when the similarity to a human becomes almost, but not completely, perfect; the subtle imperfections become disturbing or even repulsive. To create more accepted and useful CMAs, designers should be aware that, depending on the goal of CMA usage, it may be wise to adopt a less realistic or caricatured representation. Researchers do not yet understand the full extent of this uncanny valley; that is, at which point people become apprehensive about a human-like robot. For example, Oyedele, Hong, and Minor (2007) found that people’s perceptions of robots differ depending on the context. More specifically, participants were indifferent about the humanness of robotic images in the context of touching or holding a robot. However, participants showed more concern about robotic images similar to humans in the context of communicating with robots, watching robots in a movie, and living in the same house with robots.
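Since the curve in Figure 1 is only schematic, a toy model may help convey the idea. The sketch below is a minimal illustration, assuming an invented functional form (a linear rise in familiarity with a Gaussian dip subtracted near full human-likeness); the function name and parameter values are assumptions for this example and are not taken from Mori (1970) or MacDorman (2005).

```python
import numpy as np

def hypothetical_familiarity(human_likeness, valley_center=0.85, valley_width=0.07):
    """Toy uncanny-valley curve: familiarity rises roughly linearly with
    human likeness, then a Gaussian 'dip' is subtracted near-perfect
    likeness. Shape and parameters are illustrative assumptions only."""
    h = np.asarray(human_likeness, dtype=float)
    baseline = h                                                   # monotonic rise with likeness
    dip = 1.5 * np.exp(-((h - valley_center) ** 2) / (2 * valley_width ** 2))
    return baseline - dip

if __name__ == "__main__":
    # Familiarity climbs, collapses near (but not at) full human likeness,
    # then recovers as the representation becomes indistinguishable from a human.
    for h in (0.2, 0.5, 0.8, 0.85, 0.95, 1.0):
        print(f"human likeness {h:.2f} -> familiarity {float(hypothetical_familiarity(h)):+.2f}")
```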
3.2.3 Knowledge One factor that influences the acceptance of technology is the users’ existing knowledge of a product group. Prior knowledge influences the content of thinking (Alba & Hutchinson, 1987) which, in turn, influences diffusion success (Gatignon & Robertson, 1985). Based on the literature, there appears to be a complicated relationship between knowledge levels and technology acceptance. Users’ knowledge levels may constrain their ability to understand an innovation depending on the continuity of the technology (Moreau, Markman, & Lehmann, 2001). For continuous innovation, experts reported higher levels of comprehension and preferences for the technology as well as higher perceptions of benefits. A different pattern was observed for discontinuous innovations; compared to novices, experts’ entrenched knowledge related to lower comprehension and preferences and fewer perceived benefits. The same effect is most probable when replacing “accepted” ways of completing a given activity. Careful consideration of target user population and their level of knowledge should precede the design of CMAs.
3.2.4 Culture The most prominent model describing the acceptance of technology is the Technology Acceptance Model (TAM; Davis, 1986). The TAM is suggested to be robust across technologies, persons, and times (Davis, 1986; 1989; Davis, Bagozzi, & Warshaw, 1989); however, it may not predict technology use across all cultures. Straub,
Fig. 1 An illustration of the uncanny valley idea (adapted from MacDorman, 2005, and based on Mori, 1970) whereby perceived familiarity increases with human likeness of a robot but at some point it decreases dramatically.
Keil, and Brenner (1997) administered the same instrument on technology (email) use to airline employees in three different countries: Japan, Switzerland, and the U.S. When system use (measured as self-reported frequency of email use), perceived usefulness, and perceived ease of use were analyzed, the TAM significantly explained about 10% of the variance in usage behavior in both the U.S. and Swiss samples but was non-significant for the Japanese sample. Perceived usefulness predicted usage behavior for both the U.S. and Swiss samples, but not for the Japanese sample. Perceived ease of use was not significant for any of the three country samples, but this is consistent with other studies (Adams, Nelson, & Todd, 1992; Davis, 1989) suggesting that perceived ease of use exerts an indirect effect on system use (through perceived usefulness) or that its impact becomes less important at the post-adoption stage. Based on Hofstede’s (1980) research on cultural dimensions, Straub et al. (1997) suggested that cultural factors such as uncertainty avoidance and collectivist sentiments might have influenced the differing results for Japan. The literature is inconclusive as to how cultural factors influence the use and acceptance of CMAs. Shibata’s (2004) examination of individual assessments of robots from a cross-cultural perspective revealed no significant disparities among the countries surveyed. However, Oyedele, Hong, and Minor (2007) reported that U.S. respondents were less apprehensive about robotic images than were South Korean respondents.
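Analyses like the one reported by Straub et al. (1997) amount to regressing self-reported usage on perceived usefulness and perceived ease of use. As a rough illustration of the mechanics only, the following sketch fits such a TAM-style linear model to invented Likert-scale data; the sample size, variable names, and resulting coefficients and R² are assumptions for this example and bear no relation to the original study.

```python
import numpy as np

# Hypothetical 7-point Likert responses: perceived usefulness (PU) and
# perceived ease of use (PEOU); the outcome is self-reported frequency of use.
# All numbers are invented for illustration.
rng = np.random.default_rng(0)
n = 120
pu = rng.integers(1, 8, n).astype(float)
peou = rng.integers(1, 8, n).astype(float)
use = 0.4 * pu + 0.1 * peou + rng.normal(0, 1.5, n)   # assumed data-generating process

# TAM-style linear model: use ~ intercept + PU + PEOU
X = np.column_stack([np.ones(n), pu, peou])
coef, *_ = np.linalg.lstsq(X, use, rcond=None)
pred = X @ coef
r2 = 1 - np.sum((use - pred) ** 2) / np.sum((use - use.mean()) ** 2)

print("intercept, PU, PEOU coefficients:", np.round(coef, 2))
print("variance in usage explained (R^2):", round(float(r2), 3))
```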
3.3 Technology Characteristics 3.3.1 Perceived Usefulness Perceived usefulness is defined as the extent to which a technology is expected to improve a potential user’s performance (Davis, 1989, 1993) and is considered as a summary measure of all benefits related to a technology. In general, perceived usefulness increases the acceptance of technologies (Chau & Hu, 2002) for all three levels of acceptance, that is, for attitudinal acceptance, intentional acceptance, and behavioral acceptance. The consensus in the literature is that perceived usefulness is more important than ease of use (Davis, 1989), especially for post-adoption attitude formation (Karahanna, Straub, & Chervany, 1999). This implies that whether CMAs meet users’ needs and expectations is the key for their continual and persistent use. Many robotics engineers claim that their robots are useful and intuitive without considering users in the design process or deploying appropriate user testing (Adams, 2002). However, human factors research has repeatedly shown that the development of effective and usable interfaces or systems requires the consideration of the users’ point of view from the beginning of the design process (e.g., Cooke & Durso, 2007). Failure to do so results in an interface users are unwilling to accept.
3.3.2 Perceived Ease of Use
Ease of use refers to the amount of effort required to use a technology effectively. Acceptance increases with an increase in perceived ease of use (Davis, 1989) at all three levels of acceptance. People’s initial decision to use a technology is influenced more by whether it seems easy to use than by its perceived usefulness. One characteristic related to perceived ease of use is the flexibility with which a technology can be made to perform its functions. The less performance flexibility a technology has, the lower its perceived ease of use. If flexible usage is not possible, then operation of the technology should be concise and intuitive.
3.3.3 Perceived Complexity
Perceived complexity can be defined as the degree to which an innovation is perceived as difficult to understand and use (Rogers, 2003). The consensus is that complexity decreases the acceptance of technology (Aiman-Smith & Green, 2002). This closely relates to users’ beliefs about their ability to use the technology (i.e., self-efficacy; Fang, 1998). The more complex a technology is, the lower one’s belief in one’s ability to use it, and the lower the degree of technology acceptance.
3.3.4 Perceived Social Skill of the CMA (Social Intelligence)
Social intelligence is defined as a person’s ability to get along with people in general (Vernon, 1933). A socially intelligent person has a better ability to perceive and judge other people’s feelings, thoughts, and attitudes (Ruyter, Saini, Markopoulos, & Breemen, 2005). In our context, the general idea is that a socially intelligent robot can communicate more effectively than a robot lacking such intelligence, will be more pleasant to interact with, and will therefore be more readily accepted (e.g., see Heerink et al., 2006). The benefits of social intelligence have been reported by Ruyter et al. (2005). They used the Wizard of Oz technique to simulate human social behaviors through a robotic interface (iCat). A robotic interface with some social intelligence generated a positive bias in users’ perception of technology in the home environment and enhanced user acceptance of such systems. The social robot also triggered more social behavior by the users, such as laughter and relational conversation, than the neutral robot.
3.4 Relational Characteristics
Everyday interactions with CMAs are typically long-term, persistent, and relational. Some researchers purposely design a CMA to build and maintain a social and emotional relationship with its users (e.g., Bickmore & Picard, 2005). A relationship is viewed in social psychology as a formulation based on an interaction in which a change in the behavior and the cognitive and emotional state of a person produces a change in the state of the other person (Kelley, 1983). Hence, for two people to be in a relationship with each other they must interact and, as a consequence, each partner’s behavior must have been influenced (Berscheid & Reis, 1998). In our context, this change of state can be mutual at all three levels (behavior, cognitive state, and emotional state). However, in most circumstances of interaction with a CMA, changing the user’s state (e.g., having learned a task, been informed, or been entertained) and not the CMA’s is the intended goal. For example, CMAs take an assistant or advisor role in many cases, making the primary goal of the interaction to advise the user (i.e., a change of cognitive state; Park & Catrambone, 2007). As another example, a CMA can play a comforting and caring role (Bickmore & Schulman, 2006), where the goal of the interaction is to provide emotional support (i.e., a change of emotional state). Bickmore and his colleagues have conducted relationship research to gain insights into designing CMAs purposely built to maintain long-term, social-emotional relationships with their users. For example, Bickmore and Picard (2005) investigated techniques for constructing and maintaining a long-term human-computer relationship. They identified a number of constructs (e.g., trust, engagement, enjoyment, and productivity) known to be related to relationship quality in social psychology, and applied them in designing a long-term interaction. Applying such constructs, they developed and evaluated an exercise adoption system that employed a CMA-like agent, in a long-term experiment with many participants who were asked to interact with it daily over one month. Compared to a CMA without any relationship-building skills, the relational CMA was respected more, liked more, and trusted more. In addition, participants expressed a significantly greater desire to continue working with the relational agent. The purpose of the current section is to illustrate the psychological constructs and theories in social psychology that could potentially provide insights into understanding a human’s long-term relationship with a CMA. We have taken both a top-down and a bottom-up approach in this process. From the top-down approach, we have reviewed and organized the factors that are important when forming and developing a relationship. Additionally, we have separated the relationship interaction into beginning and maintaining stages, because different constructs play different roles at different stages of the relationship (e.g., attraction is relatively less important after the beginning stage). Designers of CMAs should be mindful of such relationship development. This is important not only when designing CMAs with the purpose of having a relationship (i.e., relational agents) but also when expecting a lengthy (long-term) interaction with CMAs not specifically designed to form a relationship. From the bottom-up approach, we describe two specific long-term relationship models: the service relationship model and the advisor-advisee relationship model,
which have similarity to the human-CMA relationship. Within these models, we discuss relevant social constructs (expectations, communications, trust, etc.) that have been identified as key variables influencing the relationship between people and examine how these variables should be considered in the design of an effective and useful CMA.
3.4.1 Beginning a Relationship
Voluntariness. The voluntariness dimension plays an important role in a relationship (Berscheid & Reis, 1998). Specifically, it provides insight into the future development of the relationship. If partners’ initial interaction is involuntary, their interaction will continue only if those conditions are not subject to change (e.g., they are forced to work in the same office). Conversely, if the interaction is voluntary, continuation of the relationship depends mostly on the partners’ attraction to each other (e.g., a human’s attraction to a CMA). More importantly, certain psychological processes induced by voluntariness may be generated that will affect the relationship’s subsequent quality and future. For example, Seligman, Fazio, and Zanna (1980) induced dating couples to adopt either an intrinsic cognitive set (i.e., their enjoyment as the motivation to continue the relationship) or an extrinsic set (i.e., external reasons to continue the relationship). They found that individuals whose awareness of their extrinsic reasons for continuing the relationship was heightened viewed the probability of marrying their partners as significantly lower and reported less affection for their partners than did individuals in the intrinsic group. This association between an individual’s perception (i.e., intrinsic or extrinsic motivation) and relationship outcomes is also supported by correlational data (Blais et al., 1990; Fletcher et al., 1987). Although many relationship scholars are finding the voluntariness dimension to be useful in understanding premarital and marital relationships (Rempel, Holmes, & Zanna, 1985) as well as friendships (Fischer, 1975), little research has investigated the voluntariness dimension for human-CMA interactions. Xiao, Catrambone, and Stasko (2003) investigated whether CMAs should be proactive or reactive with a paradigm in which participants were introduced to a text editing system that used key-press combinations to invoke the different editing operations. Participants were asked to make changes to a document with the aid of a virtual agent. The proactive virtual humans did not enhance performance but were instead viewed as intrusive. The fact that users were forced to interact with the virtual agents may have been the problem. This may be one of the reasons why the well-known Microsoft Office Assistant (“Clippit”) failed: the virtual agent appears on the screen uninvited and offers unsolicited advice. Designers of CMAs should be mindful of the distinction between involuntary and voluntary interactions and its effect on long-term interactions.
Attraction. Attraction to another individual is the most frequent motivator of voluntary attempts to initiate interaction (Berscheid & Reis, 1998). Two basic principles of attraction from the social psychology literature have relevance to CMAs: familiarity and physical attractiveness.
Familiar people usually are judged to be relatively safer than unfamiliar people. The general effect has been demonstrated in an experiment in which people were asked to give their impressions of various national groups, including a fictitious group called the “Danerians.” Participants reported that the unfamiliar groups, including the Danerians, had undesirable qualities, and disliked them (Hartley, 1946). Familiarity is responsible for the universal finding that people are more likely to initiate relationships with people in close physical proximity than they are with people even a short distance away (Segal, 1974). Familiarity as a major factor of attraction, which induces initial interaction with others, provides insights into CMAs. Users undeniably vary in preferences and background, and therefore feel varying degrees of familiarity with CMAs. The well-known Microsoft Office Assistant (“Clippit”) was implemented universally, with the same appearance and behaviors for users from different nations; the only difference was the regional language. It is doubtful whether users from other countries would feel the same level of familiarity as a user from the U.S., in terms of not only the appearance but also the manner of interaction (e.g., level of politeness, customs of greeting and farewell). In fact, it is probably impossible to accommodate everyone’s sense of familiarity and come up with a universal CMA. Consequently, a good design approach is to provide users a choice from a range of virtual agents or social robots. On this basis, Xiao et al. (2007) found that the user experience of CMA-like interactions depended largely on the individual’s preconceived notions (e.g., familiarity) and preferences and that people, when allowed to choose a CMA, viewed the familiar CMAs as more likable, more trustworthy, and more valuable. Another basic principle of attraction is physical attractiveness. There is a great deal of evidence that physical attractiveness influences a partner’s interaction (Funder & Dobroth, 1987; Goldstein & Papageorge, 1980). The well-known “what is beautiful is good” stereotype effect, in which physical attractiveness is linked to the inference of positive personal qualities, has been replicated many times (Berscheid & Reis, 1998). In a meta-analysis examining this stereotype’s strength and generality, Eagly et al. (1991) found that people generally do ascribe more favorable personality traits and more successful life outcomes to attractive people than to less attractive people. The magnitude of the effect was largest on the social-competence and intellectual-competence dimensions, and nonexistent for integrity and concern for others. This finding implies that attractiveness may lead CMAs to be regarded as more intelligent. Further empirical research is required to determine whether people ascribe favorable personality traits to CMAs, as they would to humans. Nevertheless, we can at least speculate that, depending on the CMA’s role as a companion or an assistant, varying its level of attractiveness might have different effects on the user.
3.4.2 Maintaining a Relationship
People continue to maintain relationships with only a small fraction of the persons they meet, even if the initial encounter generated attraction (Levinger, 1980). What matters for a developing relationship is not attraction as such but rather the partners’ impact on each other (as in a CMA’s impact on the user). People evaluate rewards and costs after the initial contact, and if the evaluations and expectations seem favorable, the relationship should continue to grow (Altman, 1974). Similarly, there is no reason to believe that a user will continue an interaction with a CMA; the user can always withdraw from it or simply not “turn it on” if that is possible. What is important for designers to understand is that the rewards, whether emotional or performance-based, must be available for the user to observe. Empathy is one of the important constructs in building and maintaining relationships. Empathy involves the process of attending to, understanding, and responding to another person’s expression of emotion. It is the foundation for behaviors that enhance relationships, including accommodation, social support, intimacy, effective communication, and problem solving (Berscheid & Reis, 1998). This is true not only for intimate relationships but also for working alliances (e.g., advisor-advisee relationships, physician-patient relationships). Decreasing perceived social distance is also important for maintaining a relationship. Many strategies known in social psychology to decrease social distance are outlined in Table 2; the sketch below illustrates one way such strategies might be operationalized in a CMA.
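To make the strategies in Table 2 concrete, the following sketch shows one way a relational CMA's dialogue manager might select among them. The strategy names follow the table; the interaction-state features, trigger conditions, and their ordering are invented for illustration and are not taken from Bickmore & Picard (2005).

```python
from dataclasses import dataclass

@dataclass
class InteractionState:
    """Minimal, invented interaction-state features a relational CMA might track."""
    first_turn_of_session: bool
    days_since_last_session: int
    user_just_disclosed: bool
    shared_task_pending: bool
    estimated_rapport: float       # 0.0 (low) .. 1.0 (high)

def choose_maintenance_strategy(state: InteractionState) -> str:
    """Pick one of the Table 2 strategies. The ordering and trigger conditions
    are illustrative assumptions, not rules from the literature."""
    if state.first_turn_of_session and state.days_since_last_session > 0:
        return "continuity behavior"                    # greet and bridge the time apart
    if state.user_just_disclosed:
        return "reciprocal deepening self-disclosure"   # reciprocate the user's sharing
    if state.shared_task_pending:
        return "sharing tasks"
    if state.estimated_rapport < 0.3:
        return "being positive and cheerful"
    if state.estimated_rapport < 0.6:
        return "use of humor"
    return "maintaining the history of interaction"

# Example: first contact after a three-day gap -> "continuity behavior"
print(choose_maintenance_strategy(InteractionState(True, 3, False, False, 0.5)))
```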
3.4.3 Service Relationship
We now turn to the two specific relationship models. These relationships were chosen because of their similarities to CMA-human interactions. A service relationship is a long-term relationship wherein a customer expects to interact with the service provider again in the future. Interestingly, a marriage metaphor has been used to understand the service relationship (Celuch, Bantham, & Kasouf, 2006), which enabled researchers to explore how relationships develop and change; the importance of social elements (e.g., trust, commitment); and cooperative problem solving. As expectation forms the core of relationship schemata (Planalp, 1987), it also plays a major role in the marriage metaphor. Expectation relates to behaviors that contribute to the outcome (e.g., a partner behaving in a cooperative and collaborative manner) and to the outcomes themselves (Benun, 1986). Partners might improve interaction either by altering expectations about desired outcomes or by altering expectations about how they would interact. With CMAs, users’ expectations are certainly different from those they hold when interacting with traditional agents. Users generally expect more human-like behavior and more flexibility from CMAs. Xiao (2006) claimed that the expectations and perceptions of users of CMAs are subject to enormous individual differences. For this reason, Xiao further emphasizes the importance of flexibility in CMA design. Providing sufficient training or practice with the virtual agent
Table 2 Strategies Beneficial to Maintaining a Relationship (adapted from Bickmore & Picard, 2005)

Strategy for maintaining a relationship | Examples | References
Continuity behavior | Behavior (e.g., greetings, farewells, and talk about time spent apart) to bridge the time people are apart is important to maintain a sense of persistence in a relationship. | Gilbertson et al. (1998)
Emphasizing commonalities | Increases solidarity and rapport. | Gill et al. (1999)
Maintaining the history of interaction | Talking about the past and future and referring to mutual knowledge are among the most reliable cues people use to differentiate interaction between acquaintances and strangers. | Planalp (1993); Planalp & Benson (1992)
Being positive and cheerful | Prosocial strategies (e.g., being nice) are known to predict liking and increase relational satisfaction. | Dindia & Baxter (1987)
Reciprocal deepening self-disclosure | Reciprocal sharing increases trust, closeness, and liking, and has been demonstrated to be effective in text-based human-computer interactions. | Altman & Taylor (1973); Moon (1998)
Sharing tasks | Helping out or sharing tasks has relational benefits by communicating affection or commitment to the partner. | Stafford & Canary (2001)
Use of humor | One of the important relationship maintenance strategies; has been demonstrated to increase liking in human-computer interaction as well as with virtual agents. | Cole & Bradac (1996); Morkes, Kernal, & Nass (1999); McGuire (1994); Stafford & Canary (1991)
might provide the opportunity and time for users to adjust their expectations of what they can achieve through the interaction and of how best to interact with CMAs. In a service relationship, communication behaviors influence problem-solving efficacy. These include non-defensive listening, paying attention to what a partner is saying while not interrupting; active listening, summarizing the partner’s viewpoint; disclosure, sharing ideas and information and directly stating a point of view; and editing, interacting politely and not overreacting to negative events (Bussod & Jacobson, 1983). One partner’s communication behavior will influence the other partner’s behavior. For example, a failure to edit negative emotions will often result in the expression of reciprocal negativity from the other partner (Celuch, Bantham, & Kasouf, 2006). Generally, a unilateral disclosure of information or ideas can elicit reciprocal disclosure from the other partner. The nature of the tasks determines the nature of communication between users and CMAs, and the design of the communication method should be deliberate. For example, when a task requires disclosure of a user’s view on a certain event, it is probably a good idea to provide the CMA’s (i.e., the designer’s) view first and ask for one in return. Expectations, communications, and appraisals (how one partner evaluates the other) all influence the longer-term outcomes of the relationship, such as satisfaction, trust, and commitment. Most marketing studies have noted that service providers should put emphasis on these variables to extend their relationship with their customers (Lee & Dubinsky, 2003). Designers who are specifically developing CMAs for a long-term relationship should be aware of these factors.
3.4.4 Advisor-Advisee Relationship Another long-term relationship studied in-depth is the advisor-advisee relationship. Advice-giving situations are interactions where advisors attempt to help the advisees find a solution for their problems (Lippitt, 1959) and reduce uncertainty (Sniezek & Swol, 2001). Finding a solution or making a decision is social, because information or advice may be provided by others. Research on advice taking has shown that decisions to follow a recommendation are not based on an advisee’s assessment of the recommended options alone (Jonas, Schulz-Hardt, Frey, & Thelen, 2001) but also on other factors, such as characteristics of the advisee, the advisor, and the situation. For example, advisees are more influenced by advisors with a higher level of trust (Sniezek & Swol, 2001), confidence (Sniezek & Buckley, 1995), and a reputation for accuracy (Yaniv & Kleinberger, 2000). Trust is the expectation that the advisor is both competent and reliable (Barber, 1983). Trust cannot emerge without social uncertainty (i.e., there must be some risk of getting advice that is not good for the advisee); trust can also reduce uncertainty, by limiting the range of behavior expected from another (Kollock, 1994). Bickmore and Cassell (2001) implemented a model of social dialogue with CMAs and demonstrated how it has an effect on trust.
Confidence is the strength with which a person believes that an opinion or decision is the best possible one (Peterson & Pitz, 1988). Higher confidence can act as a cue to expertise and influence the advisee to accept the advice. With virtual humans, a confident voice, facial expression, and tone of language can increase the acceptance of the CMA’s recommendations. Another factor in this relationship is the emotional bond, or rapport. Building rapport is crucial in maintaining a collaborative relationship. Studies have shown a significant emotional bond between therapist and client (Horvath & Greenberg, 1989), supervisor and trainee (Efstation, Patton, & Kardash, 1990), and graduate advisor and student (Schlosser & Gelso, 2001). Future studies might examine whether rapport between humans and CMAs varies as a function of the length of the relationship, the display of affect by the agent, and the type of task. Despite the similarities we have been highlighting, there are factors in a human-CMA relationship that are likely to have a different weighting relative to a human-human relationship. For example, a human-human advisor-advisee relationship can involve monetary interdependency: the advisor might receive profits from the advisee’s decision or suffer loss of reputation and even job security (Sniezek & Swol, 2001). The decision-making process is affected by this monetary factor, which does not exist in a human-CMA relationship. In another example, studies show that advisors (e.g., travel agents, friends) conducted a more balanced information search than the advisee; however, when presenting information to their advisees, travel agents provided more information supporting their recommendation than conflicting with it (Jonas, Schulz-Hardt, Frey, & Thelen, 2001). Assuming CMAs provide objective and balanced information to users, CMAs may be favored over humans in some advisor-advisee relationships.
4 Design Guidelines Soon people will be surrounded with embedded technology such as CMAs at home in the ambient intelligent environment (Aarts et al., 2001). Users may need to know how to communicate with an increasing number of devices in such an environment (Ruyter et al., 2005). This complexity necessitates that user-centered design principles proposed for technologies in general that are used by older adults (e.g., see Fisk, 1999; Rogers, Meyer, Walker, & Fisk, 1998) become more critical in the design of CMAs. The end user of a CMA should be given extensive attention early in, and at each stage of, the design process. Obviously, the first step is to identify the probable users. Who are the users of CMA? What segment of user population is involved? The designer must consider information including but not limited to age, geographical location, cultural background, and the level of experience with technology in general and CMA technology specifically. Users may be a well-constrained group, for example, engineers with an extensive knowledge of robots; or, a general consumer group where the design pro-
cess can be more complicated due to the size and diversity of the group (Aarts et al., 2001). Age influences acceptance of a CMA. Providing training and auxiliary materials reduces feelings of inability to adopt a radical technology (Rogers et al., 1996). Consideration of whether a certain design change could make the CMA easier to use is as important for older adults as it is for younger adults. It is also critical to understand users’ perceptions of and attitudes toward CMAs. For example, what are the user’s perceptions of a robot companion at home? The following are some questions to address (adapted from Dautenhahn et al., 2005, but based on general user-centered design principles):
• Are people accepting of the idea of a CMA at home?
• What specific tasks do users want a CMA to assist with?
• What appearance of a CMA is acceptable?
• What are users’ attitudes towards a socially interactive CMA in terms of behavior and character traits?
• What aspects of social interaction do users find the most and least acceptable?
Dautenhahn et al. (2005) showed in their study that participants were generally in favor of a robot-as-companion and saw the robots’ potential role as assistants, machines, or servants but not as friends. Human-like communication was desirable for a robot companion, but human-like behavior and appearance were less important. Similar results were reported by Yee, Bailenson, and Rickertsen (2007) in their meta-analysis of human-like appearances in virtual human interfaces. They found that the presence of a facial representation produced more positive interactions (i.e., an increase in task performance and in liking) than the absence of a facial representation; however, the realism of the appearance did not have a significant effect. Nevertheless, the degree of human-like behavior and appearance of a CMA should depend on the context, task space, and purpose of the CMA. It is also critical to consider technology characteristics (e.g., perceived usefulness, perceived ease of use) during the design. Designers of CMAs should ask themselves whether the following heuristics are met (a simple checklist sketch based on these heuristics appears at the end of this section):
• Do users perceive a need for a CMA?
• Do users feel that the CMA is solving the problem, by helping them to do something they would otherwise be unable to do?
• Do users believe that the CMA will be easy to use?
• Are there any design changes that could make the CMA easier to use?
• The CMA should be tested with potential users to validate that it is perceived as easy to use (and, of course, that it is easy to use).
• Do users perceive the product as complicated or difficult to understand? Design should generally attempt to minimize the complexity that users perceive.
Ambient intelligent environments can be achieved by user-centered design, where the user is the center of the design process. Iterative design processes involving user evaluations and testing (see Fisk, Rogers, Czaja, Charness, & Sharit, 2004, in press, for a review of requisite protocols) become more important in the design of CMAs
than in that of traditional robots, because CMAs have collaborative and social functionalities that require rigorous testing with the user. Initial decisions to adopt and use a CMA may be most influenced by whether it seems easy to use, and one’s decision to continue to use a CMA may result from the perceived usefulness of the CMA. An appropriate level of social presence for the CMA must be embedded in the design effort. Equally important is verifying the correctness of the ambient intelligent environment in which the CMA is embedded. The literature suggests that while such systems execute decisions on behalf of people, little attention has been given to ensuring the correctness of those decisions (Augusto & McCullagh, 2007). This is especially important in safety-critical ambient intelligent environments such as hospitals. CMAs (e.g., nursebots) in such environments should be thoroughly tested to avoid accidents (e.g., a CMA colliding with patients), perhaps by adopting predictive models of human motion patterns (Bennewitz, Burgard, & Thrun, 2002). Augusto & McCullagh (2007) provided an effective tool to model intelligent systems and the behavioral patterns of users, helping to analyze and verify behavior-related events.
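As a purely illustrative aid, the sketch below encodes the acceptance heuristics listed earlier in this section as a reviewable checklist. The question wording is lightly adapted so that a True answer always means the heuristic is satisfied; the scoring scheme and function names are assumptions added for this example, not part of any published method.

```python
# Acceptance heuristics from Section 4, lightly reworded so that True = satisfied.
ACCEPTANCE_HEURISTICS = [
    "Users perceive a need for the CMA.",
    "Users feel the CMA helps them do something they otherwise could not.",
    "Users believe the CMA will be easy to use.",
    "Design changes that could make the CMA easier to use have been explored.",
    "The CMA has been tested with potential users for perceived and actual ease of use.",
    "Users do not perceive the CMA as complicated or difficult to understand.",
]

def unmet_heuristics(answers):
    """Return the heuristics a design team has not yet satisfied.

    `answers` maps each heuristic statement to True (satisfied) or False;
    missing entries are treated as not yet satisfied."""
    return [h for h in ACCEPTANCE_HEURISTICS if not answers.get(h, False)]

# Hypothetical review of an early CMA concept:
review = {h: False for h in ACCEPTANCE_HEURISTICS}
review[ACCEPTANCE_HEURISTICS[0]] = True   # a perceived need has been established
for item in unmet_heuristics(review):
    print("Revisit:", item)
```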
5 Conclusions In this chapter, we attempted to provide a better understanding of user and technology characteristics of collaborative machine assistants, referred to as CMAs. We proposed heuristic guidelines for understanding characteristics such as perceived usefulness, ease of use, and complexity that influence the acceptance of CMAs. Our goal was to provide designers with a framework for use during the design process to increase acceptance of such radically new technology. Human factors consideration in development of robots and virtual humans often tends to be an afterthought as the process stems from an engineering perspective (Adams, 2002). General principles of user-centered design should be considered from the beginning to develop useful and acceptable CMAs. Designers must assess users’ perceptions early and throughout the design and development process. User tasks and goals should be clearly identified as well as users’ knowledge and experience with virtual humans or robots. Actual usability of the CMA must be evaluated with time for modifications if the target user population finds the CMA less than easy to use, useful, or socially appropriate. Understanding the range of potential variables that influence acceptance of CMA should inform designers of opportunities to influence levels of acceptance. Acceptance can be achieved through proper design and through proper training and instruction concerning the CMA. Acknowledgements The authors were supported in part by a grant from the National Institutes of Health (National Institute on Aging) Grant P01 AG17211 under the auspices of the Center for Research and Education on Aging and Technology Enhancement (CREATE; www.create-center.org).
References [1] Aarts, E. H. L., Harwing, R., & Schuurmans, M. (2001). Ambient Intelligence. In: Denning, P. (Ed.), The Invisible Future (pp.235–250)., New York: McGraw Hill. [2] Adams, D. A., Nelson, R. R., & Todd, P. A. (1992). Perceived usefulness, ease of use, and usage of information technology: A replication, MIS Quarterly, 16(2), 227-248. [3] Adams, J. A. (2002). Critical Consideration for Human-Robot Interface Development. 2002 AAAI Fall Symposium: Human Robot Interaction Technical Report FS-02-03, 1-8. [4] Aiman-Smith, L., & Green, S. G. (2002). Implementing new manufacturing technology: The related effects of technology characteristics and user learning activities. Academy of Management Journal, 45, 421. [5] Alba, J. W. & Hutchinson, J. W. (1987). Dimensions of Consumer Expertise. Journal of Consumer Research, 13, 411-454. [6] Altman, I. & Taylor, D. (1973). Social Penetration: The Development of Interpersonal Relationships. NY: Holt, Rinhart & Winston. [7] Augusto, J. C. & McCullagh, P. (2007). Ambient Intelligence: Concepts and Applications. International Journal on Computer Science and Information Systems, 4(1), 1-28. [8] Bennewitz, M., Burgard, W., & Thrun, S. (2002). Using EM to learn motion behaviors of persons with mobile robots. Proceedings of IEEE/RSJ International Conference on Intelligent Robots and System, 502-507. [9] Benun, I. (1986). Cognitive components of martial conflict. Behavior Psychotherapy, 47, 302-309. [10] Berscheid, E. & Reis, H. (1998). Attraction and close relationships. In D. Gilbert, S. Fiske, & G. Lindzey (Ed.), The handbook of Social Psychology, 193-281. NY: McGraw-Hill. [11] Bickmore, T. & Cassell, J. (2001). Relational agents: a model and implementation of building user trust. Proceedings of the ACM Computer Human Interaction 2001 Conference. [12] Bickmore, T. & Picard, R. (2005). Establishing and Maintaining Long-Term Human-Computer Relationships. ACM Transactions on Computer-Human Interaction, 12(2), 293-327. [13] Bickmore, T. & Schulman, D. (2006). The Comforting Presence of Relational Agents. Proceedings of the ACM Computer Human Interaction 2006 Conference. [14] Blais, M. R., Sabourin, S., Boucher, C., & Vallerand, R. J. (1990). Toward a motivational model of couple happiness. Journal of Personality and Social Psychology, 59, 1021-1031. [15] Breakwell, G. M., & Fife-Schaw, C. (1988). Aging and the impact of new technology. Social Behaviour, 3, 119-130. [16] Breazeal, C. (2003). Toward social robots. Robotics and Autonomous Systems, 42, 167-175.
[17] Brosnan, M. J. (1999). Modeling technophobia: A case for word processing. Computers in Human Behavior, 15, 105-121. [18] Bussod, N. & Jacobson, N. (1983). Cognitive behavioral martial therapy. Counseling Psychologist, 11(3), 57-63. [19] Caine, K. E., Fisk, A. D., & Rogers, W. A. (2007). Designing privacy conscious aware homes for older adults. Proceedings of the Human Factors and Ergonomics Society 50th Annual Meeting. Santa Monica, CA: Human Factors and Ergonomics Society. [20] Carpenter, B. D., Van Haitsma, K., Ruckdeschel, K., & Lawton, M. P. (2000). The psychosocial preferences of older adults: A pilot examination of content and structure. The Gerontologist, 40, 335-348. [21] Cassell, J., Sullivan, J., Prevost, S., & Churchill, E. (2000). Embodied Conversational Agents. Cambridge, MA: The MIT Press. [22] Celuch, K. G., Bantham, J. H., & Kasouf, C. (2006). An extension of the marriage metaphor in buyer-seller relationships: An exploration of individual level process dynamics. Journal of Business Research, 59, 573-581. [23] Chau, P. Y. K., & Hu, P. J. (2002). Examining a model of information technology acceptance by individual professionals: An exploratory study. Journal of Management Information Systems, 18, 191. [24] Cole, T. & Bradac, J. (1996). A lay theory of relational satisfaction with best friends. Journal of Social Personal Relationships, 13(1), 57-83. [25] Cooke, N. M. & Durso, F. T. (2007). Stories of Modern Technology Failures and Cognitive Engineering Successes. Taylor-Francis. [26] Dautenhahn, K., Woods, S., Kaouri, C., Walters, M., Koay, K. L. & Werry, I. (2005). What is a robot companion - Friend, assistant or butler? Proceedings of the IEEE IROS, (pp. 1488-1493). Edmonton, Canada: IEEE. [27] Davis, F. D. (1986). A technology acceptance model for empirically testing new end-user information systems: theory and results. Doctoral dissertation, Sloan School of Management, Massachusetts Institute of Technology. [28] Davis, F. D. (1989). Perceived usefulness, perceived ease of use and user acceptance of information technology. MIS Quarterly, 13, 319-339. [29] Davis, F. D. (1993). User acceptance of information technology: System characteristics, user perceptions and behavioral impacts. International Journal of Man-Machine Studies, 38, 475-487. [30] Davis, F. D., Bagozzi, R. P., & Warshaw, P. R. (1989). User acceptance of computer technology: A comparison of two theoretical models. Management Science, 35. [31] Dewar, R. D., & Dutton, J. E. (1986). The adoption of radical and incremental innovations: An empirical analysis. Management Science, 32, 1422. [32] Dindia, K. & Baxter, L. A. (1987). Strategies for maintaining and repairing marital relationships. Journal of social and personal relationships, 4(2), 143158. [33] Dragone, M., Duffy, B.R. & O‘Hare, G.M.P. (2005). Social Interaction between Robots, Avatars and Humans, Proceedings of 14th IEEE International Workshop on Robot and Human Interactive Communication (RO-MAN 2005).
[34] Eagly, A. H. (1978). Sex differences in influenceability. Psychological Bulletin, 85, 86-116. [35] Efstation, J., Patton, M., & Kardash, C. (1990). Measuring the working alliance in counselor supervision. Journal of Counseling Psychology, 37, 322329. [36] Fang, K. (1998). An analysis of electronic-mail usage. Computers in Human Behavior, 14, 349-374. [37] Feil-Seifer D. & Mataric’ M. J. (2005). Defining Socially Assistive Robotics. Proceedings of the 5th IEEE International Conference on Rehabilitation Robotics, 465-468. [38] Fischer, C. (1975). Toward a subcultural theory of urbanism. American Journal of Sociology, 80, 1319-1341. [39] Fishbein, M. & Ajzen, I. (1975). Beliefs, Attitude, Intention, and Behavior: An Introduction to Theory and Research. Reading, MA: Addison-Wesley Publishing. [40] Fisk, A. D. (1999). Human Factors and the older adult. Ergonomics in Design, 7(1), 8-13. [41] Fisk, A. D., Rogers, W. A., Charness, N., Czaja, S. J., & Sharit, J. (2004). Designing for older adults: Principles and creative human factors approaches. Boca Raton, FL: CRC Press. [42] Fisk, A. D., Rogers, W. A., Charness, N., Czaja, S. J., & Sharit, J. (in press). Designing for older users: Principles and creative human factors approaches (2nd ed.). Boca Raton, FL: CRC Press. [43] Fletcher, G., Fincham, F., Cramer, L., & Heron, N. (1987). The role of attributions in the development of dating relationships. Journal of Personality and Social Psychology, 53, 481-489. [44] Fong, T., Thorpe, C., & Baur, C. (2003). Robot, asker of questions. Robotics and Autonomous Systems, 42, 235-243. [45] Funder, D. C., & Dobroth, K. N. (1987). Differences between traits: properties associated with interjudge agreement. Journal of Personality and Social Psychology, 52, 409-418. [46] Gatignon, H. & Robertson, T. S. (1985). A Propositional Inventory for New Diffusion Research, Journal of Consumer Research, 11, 849-867. [47] Gilbertson, J., Dindia, K., & Allen, M. (1998). Relational continuity constructional units and the maintenance of relationships, Journal of Social and Personal Relationships, 15(6), 774-790. [48] Gill, D., Christensen, A., & Fincham, F. (1999). Predicting marital satisfaction from behavior: Do all roads really lead to rome? Personal Relationships, 6, 369-387. [49] Gilly, M. C. & Zeithaml, V. A. (1985). The elderly consumer and adoption of technologies. Journal of Consumer Research, 12(3), 353-357. [50] Goldstein, A. G. & Papageorge, J. (1980). Judgments of facial attractiveness in the absence of eye movements. Bulletin of the psychonomic society, 15, 269-270.
[51] Green, S. G., Gavin, M. B., & Aiman-Smith, L. (1995). Assessing a multidimensional measure of radical technological innovation. IEEE Transactions on Engineering Management, 42, 203-214. [52] Hartley, E. L. (1946). Problems in prejudice. NY: King’s Crown Press. [53] Heerink, M., Krose, B., Evers, V., & Wielinga, B. (2006). The Influence of a robot’s social abilities on acceptance by elderly users. Proceedings of the 15th IEEE International Symposium on Robot and Human Interactive Communication (ROMAN 2006) (pp. 521-526). Piscataway, NJ: IEEE. [54] Hill, R. W., Gratch, J., Marsella, S., Rickel, J., Swartout, W., & Traum, D. (2003). Virtual humans in the mission rehearsal exercise system. Kynstlich Intelligenz, 17(4), 32- 38. [55] Hofstede, G. (1980). Culture’s Consequences: International Differences in Work-Related Values, Beverly Hills, Ca: Sage. [56] Horvath, A. & Greenberg, L. (1989). Development and validation of the working alliance inventory. Journal of Counseling Psychology, 36, 223-233. [57] Johnson, W. L., Rickel, J. W., & Lester, J. C. (2000). Animated Pedagogical Agents: Face-to-Face Interaction in Interactive Learning Environments, International Journal of Artificial Intelligence in Education, 11, 47-78. [58] Jonas, E., Schulz-Hardt, S., Frey, D., & Thelen, N. (2001). Confirmation bias in Sequential information search after preliminary decisions: An expansion of Dissonance theoretical research on “selective exposure to information”. Journal of Personality and Social Psychology, 80, 557-571. [59] Karahanna, E., Straub, D. W., & Chervany, N. L. (1999). Information technology adoption across time: A cross-sectional comparison of pre-adoption and post-adoption beliefs. MIS Quarterly, 23, 183-213. [60] Kelley, H. (1983). Epilogue: An essential science. In H. Kelley, E. Berschedi, A. Christensen, J. Harvey, T. Huston, G. Levinger, E. McClintock, L. Peplau, & D. Peterson (Ed.), Close Relationships, 486-503. NY. [61] Kitts, C. & Quinn, N. (2004). An interdisciplinary field of robotics program for undergraduate computer science and engineering education, Journal on Educational Resources in Computing, 4(2), article no. 3. [62] Koda, T.& Maes, P. (1996). Agents with faces: the effect of personification. Proceedings of the 5th IEEE International Workshop on Robot and Human Communication, 189-194. [63] Kollock, R. (1994). The emergence of exchange structures: an experimental study of uncertainty, commitment, and trust. American Journal of Sociology, 100, 313-345. [64] Lee, S. & Dubinsky, A. J. (2003). Influence of salesperson characteristics and customer emotion on retail dyadic relationships. The International Review of Retail, Distribution, and Consumer Research, 13(1), 21-36. [65] Levinger, G. (1980). Toward the analysis of close relationships. Journal of Experimental Social Psychology, 16, 510-544. [66] Lippitt, R. (1959). Dimensions of the consultant’s job. Journal of Social Issues, 15, 5-12.
[67] MacDorman, K. F. (2005). Androids as an experimental apparatus: Why is there an uncanny valley and can we exploit it? Proceedings of the CogSci 2005 Workshop: Toward Social Mechanisms of Android Science, 106-118. [68] McGuire, A. (1994). Helping behaviors in the natural environment: dimensions and correlates of helping. Personal Social Psychological Bulletin, 20(1), 45-56. [69] Melenhorst, A. S., Rogers, W. A., & Bouwhuis, D. G. (2006). Older adults’ motivated choice for technological innovation: Evidence for benefit-driven selectivity. Psychology and Aging, 21, 190-195. [70] Moreau, C. P., Markman, A. B., & Lehmann, D. R. (2001). ’What is it?’ Categorization flexibility and consumers’ responses to really new products. Journal of Consumer Research, 38(1), 14-29. [71] Morkes, J., Kernal, H., & Nass, C. (1999). Humor in Task-Oriented ComputerMediated Communication and Human-ComputerInteraction, ACM Computer Human Interaction 1998 Conference, 215-216. [72] Moon, Y. (1998). Intimate self-disclosure exchanges: Using computers to build reciprocal relationships with consumers. Cambridge, MA: Harvard Business School. Report: 99-059. [73] Oyedele, A., Hong, S., & Minor, M. S. (2007). Contextual Factors in the Appearance of Consumer Robots: Exploratory Assessment of Perceived Anxiety Toward Humanlike Consumer Robots, Cyberpsychology & Behavior, 10(5), 624-632. [74] Park, S. & Catrambone, R. (2007). Social Facilitation Effects of Virtual Humans. Human Factors, 49 (6), 1054-1060. [75] Peterson, D. & Pitz, G. (1988). Confidence, uncertainty, and the use of information. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 85-92. [76] Picard, R. (1997). Affective Computing. Cambridge, MA: MIT Press. [77] Planalp, S. (1987). Interplay between relational knowledge and events. In R. Burnett, P. McGhee & D. Clarke (Eds.),Accounting for relationships: Explanation, representation and knowledge (pp. 175-191). New York: Methuen. [78] Planalp, S. (1993). Friends’ and acquaintances’ conversations II: Coded differences. Journal of Social Personal Relationships, 10, 339-354. [79] Planalp, S., & Benson, A. (1992). Friends’ and acquaintances’ conversations I: Perceived differences. Journal of Social Personal Relationships, 9, 483-506. [80] Pollack, M. E. (2005). Intelligent technology for an aging population: The use of AI to assist elders with cognitive impairment. AI Magazine, 26, 9-24. [81] Pollack, M. E., Engberg, S., Matthews, J. T., Thrun, S., Brown, L., Colbry, D., Orosz, C., Peintner, B., Ramakrishnan, S., Dunbar-Jacob, J., McCarthy, C., Montemerlo, M., Pineau, J., & Roy, N. (2002). Pearl: A Mobile Robotic Assistant for the Elderly, AAAI Workshop on Automation as Eldercare. [82] Rempel, J., Holmes, J., & Zanna, M. (1985). Trust in close relationships, Journal of Personality and Social Psychology, 49, 95-112. [83] Rogers, E. M. (2003). Diffusion of Innovations (5th ed.). New York: Free Press.
[84] Rogers, W. A., Fisk, A. D., Mead, S. E., Walker, N., & Cabrera, E. F. (1996). Training older adults to use automatic teller machines. Human Factors, 38, 425-433. [85] Rogers, W. A., Meyer, B., Walker, N., & Fisk, A. D. (1998). Functional limitations to daily living tasks in the aged: A focus group analysis. Human Factors, 40, 111-125. [86] Ruyter, B., Saini, P., Markopoulos, P., & Breemen A. (2005). Assessing the effects of building social intelligence in a robotic interface for the home. Interacting with Computers, 17, 522-541. [87] Schlosser, L. Z. & Gelso, C. (2001). Measuring the Working Alliance in Advisor-Advisee Relationships in Graduate School. Journal of Counseling Psychology, 48(2), 157-167. [88] Seligman, C., Fazio, R. H., & Zanna, M. P. (1980). Effects of salience of extrinsic rewards on liking and loving. Journal of Personality and Social Psychology, 38, 453-460. [89] Shibata, T. (2004). An overview of human interactive robots for psychological enrichment. Proceedings of the IEEE, 92, 1749-58. [90] Sniezek, J. A. & Buckley, T. (1989). Social influence in the Advisor-Judge Relationship. Annual Meeting of the Judgment and Decision Making Society, Atlanta, Georgia. [91] Sniezek, J. A. & Buckley, T. (1995). Cueing and Cognitive Conflict in JudgeAdvisor Decision Making, Organizational behavior and human decision processes, 62(2), 159-174. [92] Sniezek, J. & van Swol, L. (2001). Trust, confidence, and expertise in a judgeadvisor system. Organizational Behavior and Human Decision Processes, 84, 288-307. [93] Stafford, L. & Canary, D. (1991). Maintenance strategies and romantic relationship type, gender and relational characteristics. Journal of Social Personal Relationships, 8, 217-242. [94] Straub, D., Keil, M., & Brenner, W. (1997). Testing the technology acceptance model across cultures: A three country study. Information & Management, 33(1), 1-11. [95] Takeuchi, A. & Naito, T. (1995). Situated facial displays: towards social interaction. ACM Computer Human Interaction 1995 Conference, 450-455. [96] Venkatesh, V., Morris, M. G., Davis, G. B., & Davis, F. D. (2003). User acceptance of information technology: Toward a unified view. MIS Quarterly, 27, 425. [97] Xiao, J. (2006). Empirical studies on embodied conversational agents. Doctoral dissertation, Georgia Institute of Technology, Atlanta, GA. [98] Xiao, J., Stasko, J., & Catrambone, R. (2003). Be quiet? Evaluating proactive and reactive user interface assistants. Proceedings of the workshop of embodied conversational agent. [99] Xiao, J., Stasko, J., & Catrambone, R. (2007). The role of choice and customization on users’ interaction with embodied conversational agents: Effects
on perception and performance. Proceedings of the ACM Computer Human Interaction 2007 Conference, 1293-1302. [100] Yaniv, I. & Kleinberger, E. (2000). Advice taking in decision making: Egocentric discounting and reputation formation. Organizational Behavior and Human Decision Processes, 83, 260-281. [101] Yee, N., Bailenson, J. N., & Rickertsen, K. (2007). A meta-analysis of the impact of the inclusion and realism of human-like faces on user experiences in interfaces. Proceedings of the ACM Computer Human Interaction 2007 Conference, 1-10.
Privacy Sensitive Surveillance for Assisted Living – A Smart Camera Approach
Sven Fleck and Wolfgang Straßer
Sven Fleck, SmartSurv Vision Systems GmbH, spin-off of University of Tübingen, WSI/GRIS, Tübingen, Germany, e-mail: [email protected]
Wolfgang Straßer, University of Tübingen, WSI/GRIS, Tübingen, Germany, e-mail: [email protected]
1 Introduction
An elderly woman wanders about aimlessly in a home for assisted living. Suddenly, she collapses on the floor of a lonesome hallway. Usually it can take over two hours until a night nurse passes this spot on her next inspection round. But in this case she is already on site after two minutes, ready to help. She has received an alert message on her beeper: “Inhabitant fallen in hallway 2b”. The source: the SmartSurv distributed network of smart cameras for automated and privacy respecting video analysis. Welcome to the future of smart surveillance! Although this scenario is not yet daily practice, it illustrates how such systems can improve the safety of the elderly without the privacy intrusion of traditional video surveillance systems.
The demand for video surveillance systems is increasing immensely in various fields of security applications and beyond. Camera systems cover, e.g., airports and train stations and are nowadays also installed in trains and some aircraft. Further examples include the monitoring of power plants, shopping malls, museums, parking garages and hotels. A typical casino in Las Vegas uses over 2,000 cameras [39, 11]. Over 4 million cameras are installed in the UK. Moreover, additional applications arise that are not motivated by security but serve marketing purposes, e.g., statistics in retail stores such as measuring shelf attractiveness or getting information about the customers (gender, child/adult etc.). Traffic monitoring is another field. Areas aiming at increased individual safety independent of terrorist threats further extend the application range of such systems. One important example is the field of assisted living, on which we currently concentrate. According to [24] the US population over
age 85 will triple from 4.5 million in 2003 to 14.2 million by 2040. Automated 24/7 surveillance that ensures the safety of the elderly while respecting their privacy thus becomes a major challenge. Falls are one of the most serious health risks among seniors over the age of 65 [30], affecting more people than stroke and heart attacks combined. Moreover, detecting falls so that help arrives immediately reduces the risk of hospitalization by 26% and of death by over 80%.
Traditional CCTV (closed circuit television) surveillance systems used in the field, with their centralized processing and recording architecture and a simple multi-monitor visualization of the raw video streams, bear several drawbacks and limitations. The most relevant problem is the total lack of privacy. Additionally, the necessary communication bandwidth to each camera and the computational requirements on the centralized servers strongly limit such systems in terms of expandability, installation size and the spatial and temporal resolution of each camera. As the visualization is counterintuitive and fatiguing due to the massive load of raw video data, manual tracking is not always performed reliably and activities often remain undetected. For example, tracking a suspect requires manually following him within a certain camera view; when he leaves that view, an appropriate next camera has to be identified and selected manually, and the operator then has to put himself into the new point of view to keep up the tracking. Obviously, this already strongly limits the number of persons that can be tracked concurrently. A military study states that after approximately 22 minutes, an operator will miss up to 95% of scene activities [5]. Personnel costs are another major issue. An automated and privacy respecting 24/7 surveillance system is thus a strongly desired goal.
The latest video analysis systems emerging today are based on centralized approaches which impose strong limitations in terms of expandability and privacy. We instead propose a distributed smart camera approach that qualifies for surveillance in privacy sensitive areas because the major processing steps are performed automatically, embedded in each camera node. A real world prototype system at a home for assisted living has been set up. The proposed system’s goal is to analyze the real world and reflect all relevant – and only relevant – information in an integrated virtual counterpart, live and in a 24/7 way. Importantly, the privacy recommendations of the specific application are taken into account. The top level overview of the proposed concept and system is illustrated in Fig. 1. It is a fully sensor driven system.
Section 2 gives a classification of video surveillance systems. After describing the related work (Section 3), Section 4 describes the problems of centralized and distributed surveillance systems. In Section 5 the architecture of our system is presented; the smart camera component is detailed in Section 6. Section 7 presents the visualization component and at the same time illustrates results of the whole system, followed by the conclusion.
Fig. 1 Overview of proposed approach for smart surveillance.
2 Classification of AAL & Video Surveillance Systems
Video surveillance systems can be classified by the following criteria:
• Manual vs. automated – This defines whether the video analysis is performed by a human operator or whether relevant information is derived from the raw video stream in an automated way.
• Centralized vs. distributed – Automated analysis can be further divided into two philosophies, depending on whether the processing is performed on a centralized server installation or distributed and embedded within a network of smart cameras.
Fig. 2 illustrates where the proposed – distributed and automated – approach is placed within the design space of camera technology and degree of automation. The camera technology is denoted on the abscissa and can be divided into centralized and decentralized approaches. On the ordinate the degree of automation (and thus the level of abstraction of the results) in video scene analysis is given. This starts from a purely manual (CCTV) approach and proceeds via motion detection to video analysis in the image domain. Although the latter already includes the abstraction from pixel level to object level, it is still tied to individual cameras. This, in combination with a centralized camera approach, is the typical combination found in automated state of the art systems today. Moving such an approach to smart cameras would lead to the blue circular area on the bottom right. Although it is then automated and executed in a distributed way, the full potential of smart cameras is not leveraged: all cameras act as individual cameras with no common world model. The approach in this article instead addresses tracking and activity recognition in world coordinates and thus abstracts from the camera domain to the world domain. This, in combination with a smart camera based approach for distributed processing, is depicted as the yellow circle.
Fig. 2 Classification of video surveillance systems with respect to degree of automation and camera technology.
3 Related Work
3.1 Related Work – Surveillance
Traditional CCTV systems obviously belong to the class “manual”, where all raw video data is analyzed by human operators and centrally stored using DVR (digital video recording) solutions.
3.1.1 Automated Surveillance Systems
Yilmaz, Javed and Shah give an extensive survey on object tracking [48]. The IEEE Signal Processing issue on surveillance surveys the current status of automated surveillance systems; e.g., Foresti et al. present “active video based surveillance systems”. Siebel et al. especially deal with the problem of multi camera tracking and person handover within the ADVISOR surveillance system [38]. Hampapur et al. describe IBM’s S3 comprehensive multiscale tracking system, which is also detailed in [23] and includes distance estimation using stereo vision.
The IE book on “Intelligent distributed video surveillance systems” [43] also provides a broad overview. At UCSD, Trivedi et al. presented distributed interactive video arrays (DIVA) for situation awareness [42]. This work encompasses a comprehensive system for traffic and incident monitoring [34] which has been installed at multiple real world scenarios around San Diego (I-5, Qualcomm stadium, Gaslamp quarter, Coronado bridge). Shah et al. presented Knight, an automated real world surveillance system [37] running on a centralized server. It mainly targets railroad crossing scenarios. A complete system for aircraft activity monitoring within the AVITRACK project is described by Kampel et al. in [27]. Collins et al., in collaboration with DARPA, show the implementation of a system for autonomous video surveillance in [6]. A broader overview of human activity recognition is given in [33] by Ribeiro et al. All these systems share the fact that they are host centralized, which requires the transmission of video feeds; this costs bandwidth and leaves the door wide open for misuse. A distributed smart camera approach instead offers many benefits, yet only very few systems have been published that take advantage of them. Quaritsch et al. presented a multi camera tracking method in the image domain on distributed, embedded smart cameras with a fully decentralized handover procedure between adjacent cameras [32].
3.1.2 Privacy Respecting Surveillance
Relatively few surveillance systems addressing privacy issues have actually been published. Schiff et al. proposed a very interesting approach, the “respectful camera” [35], where the problem of automatically obscuring faces in real time to assist in visual privacy enforcement is considered. To this end, persons are required to wear colored hats or vests. This approach primarily targets surveillance applications where the raw video stream is necessary, e.g., if an alarm is detected. Also at CVPR, a workshop on privacy research in vision has been started, where research is carried out on scrambling videos, as shown by Dufaux and Ebrahimi [14], or on de-identifying faces, as presented by Gross et al. [22]. A layered approach where privacy levels are coupled to users’ access control rights is presented by Senior et al. [36]. Cucchiara et al. at the University of Modena also cover face obscuration [9] within their comprehensive video surveillance approach.
3.1.3 Assisted Living
Assisted living systems can be divided into manual and automated ones. State of the art products like Philips Lifeline [30] are manual systems, i.e., the user wears a wireless help button which he can press in case of an emergency, e.g., in case of a fall – if he is still in a condition to do so. This already helps tremendously. Automated systems go further and can be classified into non-visual and visual systems. Intel’s Proactive Health Strategic Research Project [13] and Tabar et al. [40] belong to the former class. Each senior has to wear a tag, which however can be lost
or forgotten easily. Visual surveillance by stationary cameras instead provides contactless observation without the need to equip the elderly. Particularly noteworthy is the work by Aghajan et al., especially [1, 28], where a distributed accident management system for assisted living is presented which combines a wireless sensor solution worn by each person with a visual surveillance system. Privacy is addressed by not recording the camera feed if no accident is detected. Cucchiara et al. [10], Hauptmann et al. [24] and Williams et al. [46] present their approaches for assisted living and visual falling person detection; Toreyin et al. [41] combine visual cues with audio cues to increase robustness.
3.2 Related Work – Smart Cameras
A variety of smart camera architectures designed in academia and industry exist today; they typically target machine vision and automation applications, although recently they have started being used in surveillance. What all smart cameras share is the combination of a sensor, an embedded processing unit and a connection, which is nowadays often a network unit. The processing means can be roughly classified into DSPs, general purpose processors, FPGAs and combinations thereof. Wolf et al. identified smart camera design as a leading-edge application for embedded systems research [47]. The special issue [12] of the EURASIP Journal on Embedded Systems provides a comprehensive overview of current research in the field of embedded vision systems. Bernhard Rinner’s group designed an innovative smart camera (Bramberger et al. [3]) for distributed embedded surveillance together with ARC. It consists of a network processor and a variable number of DSPs. Chalimbaud and Berry presented a smart camera based on FPGAs [4]. Kleihorst et al. presented a smart camera mote with a Xetal-II high performance yet low power SIMD processor [29]. Hengstler et al. presented MeshEye [25], an embedded, low-resolution stereo vision system based on an ARM7TDMI RISC microcontroller. The idea of having Linux running embedded on the smart camera is becoming more and more common (Matrix Vision, Basler, Elphel, FESTO). From the other side, i.e., the surveillance sector, IP based cameras are emerging whose primary goal is to transmit live video streams to the network using self contained camera units with an (often wireless) Ethernet connection and embedded processing that handles image acquisition, compression (MJPEG or MPEG4), a webserver and the TCP/IP stack, and that offer a plug ’n’ play solution. Further processing is typically restricted to, e.g., user definable motion detection. All the underlying computation resources are normally hidden from the user. The border between the two sides becomes more and more fuzzy, as the machine vision originated smart cameras get (often even Gigabit) Ethernet connections and, on the other hand, the IP cameras get more computing power and user accessibility to the processing resources. For example, the processors of the Axis IP cameras are also fully accessible and run Linux.
4 Identified Problems of Current Approaches and Resulting Impact
To summarize, despite the benefits that video surveillance systems provide, the systems used today impose severe problems:
• Problem #1: Privacy. Privacy is totally ignored by today’s systems, which fully rely on access to raw video feeds. Both the transmission path to the operator and the operator himself are potential points of misuse.
• Problem #2: Overstraining data. The sheer mass of raw video data simply overstrains the operator. After approximately 22 minutes, even a concentrated operator will miss up to 95% of scene activities [5]. Distractions such as zooming in on attractive women, as documented by studies [26], make it even worse. Tracking and handover of even single persons on camera level is a challenging task; tracking multiple persons concurrently is practically impossible.
• Problem #3: Limited flexibility & expandability. Adding new cameras to a centralized system can easily exceed the available bandwidth or the processing capabilities of the centralized server.
To get more insight into why privacy is identified as problem #1 of today’s centralized surveillance architectures, a brief overview of privacy as a property is given.
4.1 Privacy Privacy is a fundamental and very personal property to be respected – also by law – to allow each individual to keep control of the flow of information about himself. The definition of privacy needs to be seen in a nuanced light as it comprises multiple individual points. According to [16], privacy comprises confidentiality, anonymity, self-determination, freedom of expression and control of personal data. In the world of video surveillance it is especially important to guarantee privacy, as persons within a perimeter covered by cameras have very little choice of being surveilled or not, whereas e.g., in the case of cell phone tracking the user still has the choice to turn his phone off.
4.1.1 The Privacy Dilemma and its Impact
The Royal Academy of Engineering gives a good overview of the dilemmas of privacy and surveillance [16]: “both promise great benefits and pose potential threats. In some cases those benefits and threats are so sharply opposed, and yet so equally matched in importance, that dilemmas arise”. One apparent example is visual surveillance within an aircraft to recognize suspicious activities like group forming and attack preparation. On the one hand, this could permit the prevention of fatal consequences. On the other hand, it poses an intrusion into the privacy of each passenger
being monitored from close distance. A second example arises within a home for assisted living: monitoring whether a person falls while taking a shower or falls asleep while taking a bath is extremely helpful. However, the privacy issue is obvious. As no technical solution to this dilemma is known, the balance between privacy and surveillance is typically struck by law. Often, the law differentiates whether surveillance is operated permanently (24/7), whether data is recorded, and, on the other hand, whether a notification about the surveillance operation is present and whether a special interest (threat) motivates such installations. Thus, only compromises are achieved, which limit the added value of surveillance and at the same time still represent a direct intrusion into privacy.
4.1.2 Centralized Systems: Where Does the Privacy Breach Happen?
As shown in Fig. 3, the privacy chain can fail at multiple points. In classical non-automated “CCTV” type applications, security personnel are inherently embedded within the loop to analyze the raw video data. Obviously, this opens the floodgates to misuse (Fig. 3, (1)). Studies report security personnel zooming in on attractive women and secretly uploading video feeds to publicly accessible portals like YouTube – while at the same time missing relevant activities. Second, both for manual and for centralized & automated systems, privacy sensitive video data can be picked up within the transmission channel (even if the signal is encrypted) from the camera all the way to a centralized server (Fig. 3, (2)). Even worse, this can happen unnoticed, especially if the channel includes a wireless connection. In both cases no indication is apparent to the person in front of the camera that his privacy relevant video data is being misused. With a smart camera based approach only the camera itself remains as
Fig. 3 Privacy breach points in a surveillance system. Top: Manual, centralized architecture. Bottom: Distributed smart camera based architecture as proposed.
a point of attack (Fig. 3, (3)). The camera housing should therefore be sealed to prevent both electronic (network attacks) and mechanical access.
5 The Proposed SmartSurv Approach for AAL Based on [19] the SmartSurv 3D Surveillance System is presented in the following. The system serves as a prototype implementation to come closer to the vision depicted in Section 1.
5.1 The Smart Camera Paradigm Centralized computer vision systems typically consider cameras only as simple sensors. The processing is performed after transmitting the complete raw sensor stream via a costly and often distance-limited connection to a centralized processing server. We think it is more natural to also physically embody the processing in the camera itself: what algorithmically belongs to the camera is also physically performed in the camera. The idea is to compute the information where it becomes available – directly at the sensor – and transmit only results that are on a higher level of abstraction.
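The chapter does not specify a wire format for these abstracted results. Purely as an illustration of the paradigm, the following sketch shows what a hypothetical object-level message from a camera node to the server might look like, assuming a simple JSON-over-TCP transport; all field names and the server address are invented for this example.

```python
import json
import socket
import time

# Hypothetical object-level result record; the field names and transport are
# illustrative only and not taken from the SmartSurv implementation.
def make_result_message(camera_id, object_id, x_world, y_world, activity, confidence):
    return {
        "camera_id": camera_id,                 # which smart camera produced the result
        "object_id": object_id,                 # track ID maintained on the camera node
        "timestamp": time.time(),               # acquisition time of the processed frame
        "position_world": [x_world, y_world],   # geo-referenced ground-plane position (m)
        "activity": activity,                   # e.g. "normal" or "falling"
        "confidence": confidence,               # classifier confidence in [0, 1]
    }

def send_result(msg, server_addr=("192.168.0.10", 5000)):
    """Transmit only the abstracted result; no image data leaves the camera."""
    payload = (json.dumps(msg) + "\n").encode("utf-8")
    with socket.create_connection(server_addr, timeout=2.0) as sock:
        sock.sendall(payload)

if __name__ == "__main__":
    msg = make_result_message("cam_hallway_2b", 7, 12.4, 3.1, "falling", 0.93)
    print(json.dumps(msg))    # what would be transmitted
    # send_result(msg)        # requires a result server listening at server_addr
```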
5.2 SmartSurv 3D Surveillance Architecture The top level architecture of our distributed surveillance and visualization system is given in Fig. 4. It consists of three levels: sensor analysis level (left column), virtual world level (center column) and visualization level (right column). In the following, all components are described briefly. • Scene acquisition: The 3D model acquisition of background scenes with a mobile sensor platform is covered within this component. Alternatively, a 2D floor plan can be used. • Scene analysis: A distributed network of smart cameras for scene analysis where both geo referenced tracking and activity recognition are performed simultaneously, embedded in each camera node. All camera nodes transmit their results to the server level. This smart camera approach allows for optimal scalability, i.e., the processing requirements of a new camera node are completely covered by its embedded processing means. The bandwidth and server performance requirements are only gradually increasing by each camera node. A plug ’n’ play concept allows easy adding of new cameras. • Virtual world: The virtual world acts as an abstraction interface to decouple the visualization clients from the camera based scene analysis. It is implemented by the server level to receive all information from the smart camera network, stores it in the virtual world model and provides the access to this model for visualization clients. It manages configuration and initialization of all camera nodes, collects the resulting tracking and activity recognition data and takes
Fig. 4 SmartSurv 3D Surveillance Architecture. Left column: Scene acquisition (The Wägele) & distributed scene analysis (smart camera network). Center column: Server (virtual world) level. Right column: Visualization level.
care of object handover between cameras. It comprises a camera server node as the main component, a webserver and a web application node, a database storage node, a statistics node and an alarm node. The system’s scalability is also ensured on the server level. Each node can run on a dedicated server; alternatively, all server level nodes can run together on a single server PC. Following a Plug ’n’ Play concept, a new camera is automatically detected as soon as it is attached to the network and can be added to the server.
• Integrated visualization: Relevant results of the distributed scene analysis are embedded live in one consistent and geo referenced 3D world model. Three different visualization options are available.
5.3 Privacy Filters
Depending on the privacy requirements of the application and the visualization used, the following options are available:
• The person is represented by an augmented symbol reflecting the position and state (e.g., normal or falling). This is the preferred visualization option.
• The object currently tracked is embedded within the respective visualization. This results in a virtual world model that aims at reflecting as much as possible
the real world. Either the live texture of the ROI covering the tracked object or a silhouette, as illustrated in Fig. 15 (right), can be transmitted.
• The live video feed, e.g., of the most relevant camera can be enabled. Automated switching to the most relevant camera stream, overlaid with tracking results, is available, as shown in Fig. 9. For increased privacy demands the silhouette can be transmitted instead, as illustrated in Fig. 8.
Future work could use a multi-level privacy filter approach according to the alert level and user privileges, e.g., if a certain medical condition is met, a live video is transmitted, but then also only to the responsible medical doctor.
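Such multi-level filtering could, for instance, be driven by a small policy table that couples the transmitted representation to the viewer’s role and the current alert level. The sketch below is only an illustration of that idea; the roles, alert levels and defaults are assumptions, not part of the SmartSurv system.

```python
from enum import Enum

class Representation(Enum):
    ICON = 1         # stylized symbol with position and state only
    SILHOUETTE = 2   # binary mask of the tracked region
    ROI_TEXTURE = 3  # live texture of the region of interest
    LIVE_VIDEO = 4   # camera view overlaid with tracking results

# Hypothetical policy: what a given viewer role may see at a given alert level.
POLICY = {
    ("relative", "normal"):  Representation.ICON,
    ("relative", "alarm"):   Representation.ICON,
    ("caregiver", "normal"): Representation.ICON,
    ("caregiver", "alarm"):  Representation.SILHOUETTE,
    ("physician", "alarm"):  Representation.LIVE_VIDEO,
}

def select_representation(role: str, alert_level: str) -> Representation:
    # Fall back to the most privacy-preserving option if no rule matches.
    return POLICY.get((role, alert_level), Representation.ICON)

print(select_representation("physician", "alarm"))   # Representation.LIVE_VIDEO
print(select_representation("caregiver", "normal"))  # Representation.ICON
```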
5.4 Smart Camera Hardware Smart cameras from Matrix Vision are used within our system. Each smart camera consists of a sensor, a Xilinx Spartan 3 FPGA for low level computations, a PowerPC processor for main computations and an Ethernet networking interface. The mvBlueLYNX 600 series performs its main computations on a 400 MHz Motorola MPC 8245 PowerPC with MMU & FPU running Embedded Linux. It further comprises 64 MB SDRAM (64 Bit, 133 MHz), 32 MB NAND-FLASH (4 MB Linux system files, approx. 40 MB compressed user filesystem) and 4 MB NOR-FLASH (bootloader, kernel, safeboot system, system configuration parameters). IP based connection is used for field upgradeability, parameterization of the system and for transmission of the tracking results during runtime. It consumes about 5 W power. The mvBlueCOUGAR series offers Gigabit Ethernet (GigE) and Power over Ethernet (PoE) as distinctive features besides a more compact form factor (39 x 39 x 96 mm3 ).
5.5 SafeSenior – The Installation at an Assisted Living Home
To evaluate the system within a real world scenario, a full prototype system has been installed within a house of Germany’s leading provider of assisted living homes for the elderly (see Fig. 5). It comprises four mvBlueLYNX 621CX XGA smart cameras and one server PC. The system covers the main hallways of the building’s ground level as depicted in Fig. 17, with each camera’s position and field of view marked as a stylized eye. Fig. 6 shows the chosen position and viewing angle of the four smart cameras installed in the home for assisted living together with a snapshot image from the viewpoint of each camera. The cameras in the hallways are mounted vertically to better use the spatial resolution of the long hallway. Each camera brings its calibration data with it when registering to the system and is automatically embedded within the ground plan as shown in Fig. 6, once its camera parameters are known. The system has been running 24/7 since its installation in early 2007 (with no activity recognition at that time). The algorithms have
Fig. 5 Our installation in a home for assisted living. Smart cameras are marked with rectangles.
Fig. 6 Views from the respective smart cameras within the home for assisted living.
been developed with this system as a testbed. Although a systematic survey is part of future work, informal questioning of several inhabitants showed that the system is accepted very well. The persons feel safer, as their fear of falling is somewhat attenuated. Criticism of the fact that cameras are used was voiced very seldom, partly because an information poster clarifies that only automated video analysis takes place, with no user access to video data.
6 Smart Camera Architecture The smart camera architecture is illustrated in Fig. 7. It can be roughly partitioned
Fig. 7 Smart Camera Node’s Architecture. Top: Motion detection and tracking. Bottom: SVM based activity recognition/classification.
into the tracking related parts (Fig. 7, top) and the classification & activity recognition part (Fig. 7, bottom). The upper parts consist of the following components, which are based on [18] and detailed below: a background modeling & auto init unit, multiple instances of a particle filter based tracking unit, 2D → 3D conversion units and a network unit.
6.1 Background Modeling and AutoInit
Within the background modeling unit we take advantage of the fact that each camera is mounted statically. This enables the use of a background model for the segmentation of moving objects. The goal of this unit is to model the actual background in real time so that foreground objects can be extracted very robustly. Additionally, it is important that the model adapts to slow appearance (e.g., illumination) changes of the scene’s background. Elgammal et al. [15] give a good overview of the requirements and possible cues to use within such a unit in the context of surveillance. Due to the embedded nature of our system, the unit has to be very computationally efficient to meet the real time demands. Our unit works on a per-pixel basis and is capable of running at 20+ fps on any smart camera used. This is extended by a mixture model
to further increase robustness by keeping multiple background hypotheses alive at the same time. To handle segmentation (i.e., the transformation from raw pixel level to object level), erosion and region growing steps are applied.
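The chapter does not give the equations of its per-pixel model. As a rough illustration of the ideas described above (per-pixel operation, slow adaptation, multiple background hypotheses), the following NumPy sketch implements a generic running-average background subtractor with two hypotheses per pixel; the adaptation rate and threshold are assumed values, and the real unit additionally applies the erosion and region growing steps mentioned above.

```python
import numpy as np

class PixelBackgroundModel:
    """Generic per-pixel background model with a few hypotheses per pixel.

    A sketch in the spirit of Sect. 6.1, not the SmartSurv code: each pixel keeps
    k background hypotheses that adapt slowly to illumination changes; a pixel is
    declared foreground if it is far from all of them.
    """

    def __init__(self, first_frame, k=2, alpha=0.02, threshold=30.0):
        gray = first_frame.astype(np.float32)
        self.hypotheses = np.repeat(gray[None, ...], k, axis=0)   # shape (k, h, w)
        self.alpha = alpha            # adaptation rate (assumed value)
        self.threshold = threshold    # foreground decision threshold (assumed value)

    def update_and_segment(self, frame):
        gray = frame.astype(np.float32)
        dist = np.abs(self.hypotheses - gray[None, ...])    # distance to each hypothesis
        best = np.argmin(dist, axis=0)                      # closest hypothesis per pixel
        foreground = np.min(dist, axis=0) > self.threshold  # no hypothesis explains the pixel

        # Adapt only the matching hypothesis of background pixels (slow illumination drift).
        k = self.hypotheses.shape[0]
        matches = np.arange(k)[:, None, None] == best[None, ...]
        update = (~foreground)[None, ...] & matches
        adapted = (1 - self.alpha) * self.hypotheses + self.alpha * gray[None, ...]
        self.hypotheses = np.where(update, adapted, self.hypotheses)
        return foreground.astype(np.uint8)

# Usage with synthetic grayscale frames; erosion and region growing would follow.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame0 = rng.integers(0, 255, (120, 160)).astype(np.uint8)
    model = PixelBackgroundModel(frame0)
    frame1 = frame0.copy()
    frame1[40:80, 60:100] = 255            # simulated moving object
    mask = model.update_and_segment(frame1)
    print("foreground pixels:", int(mask.sum()))
```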
6.1.1 Shadow Suppression
Shadow suppression is required so that the applications on top of the tracking, i.e., the activity recognition, work robustly. Prati et al. [31] give an overview of different shadow detection techniques. An existing approach by Cucchiara et al. [7, 8] has been chosen for implementation on the smart camera.
6.1.2 Auto Initialization and Destruction
If a segmented foreground region is not already tracked by an existing particle filter, a new filter is instantiated with the current appearance as its target and assigned to this region. An existing particle filter that does not find a region of interest near enough over a certain amount of time is deleted. This enables the tracking of multiple objects, where each object is represented by a separate color based particle filter. In a last step, two particle filters that are assigned the same region of interest (e.g., two people walking close to each other after meeting) are detected and one of them is eliminated if the object does not split up again after a certain amount of time.
6.2 Tracking
Our tracking engine is based on particle filter units using color distributions in HSV space. For each person/object $p$ tracked at time $t$, the state ${}^{p}X_t^{(i)}$ of hypothesis (sample) $i$ comprises its position, size and velocity. The tracking unit is also capable of adapting the target’s appearance during tracking, using either the pdf’s unimodality (as in [20]) or the background model’s information as adaptation rate to circumvent overlearning. On the mvBlueLYNX 420CX smart camera (200 MHz PowerPC), we achieve about 17 fps for the whole tracking pipeline when one object is tracked and 15 fps for two objects. The 600 series and the mvBlueCOUGAR (400 MHz) stay above 20 fps, also for multiple objects. Besides the number of objects, i.e., the number of particle filter instances, the frame rate is also affected by the target size on pixel level. We use a subsampling approach to attenuate this effect. Tracking examples are shown in Fig. 8. Please see [18] for further details. The approximated pdf $p({}^{p}X_t \mid Z_t)$ for each person/object $p$ tracked is then reduced: only the maximum likelihood sample ${}^{p}X_t^{(i)}$, representing the most probable position and scale in camera coordinates, is further processed in the 2D → 3D conversion unit.
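The color-based particle filter itself is only described at a high level here (details are in [18]). The following is a minimal, generic condensation-style sketch of such a tracker: the state holds position, velocity and scale, the likelihood compares a color histogram of the candidate region with the target histogram, and the maximum likelihood particle is what would be forwarded to the 2D → 3D unit. A grayscale histogram stands in for the HSV distribution, and all noise parameters are assumptions.

```python
import numpy as np

def gray_histogram(patch, bins=8):
    """Coarse intensity histogram of an image patch (grayscale stand-in for HSV)."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 255))
    hist = hist.astype(np.float64) + 1e-6
    return hist / hist.sum()

def bhattacharyya(p, q):
    return np.sqrt(max(0.0, 1.0 - float(np.sum(np.sqrt(p * q)))))

class ColorParticleFilter:
    """Minimal condensation-style tracker; state = (x, y, vx, vy, scale)."""

    def __init__(self, init_xy, target_hist, n_particles=100, sigma_pos=5.0, sigma_obs=0.2):
        self.particles = np.tile(np.array([*init_xy, 0.0, 0.0, 1.0]), (n_particles, 1))
        self.weights = np.full(n_particles, 1.0 / n_particles)
        self.target_hist = target_hist
        self.sigma_pos, self.sigma_obs = sigma_pos, sigma_obs   # assumed noise parameters

    def step(self, frame, half_size=10):
        n = len(self.particles)
        h, w = frame.shape[:2]
        # Propagate: constant-velocity motion plus Gaussian diffusion.
        self.particles[:, 0:2] += self.particles[:, 2:4] + np.random.normal(0, self.sigma_pos, (n, 2))
        # Weight: compare the candidate patch histogram with the target histogram.
        for i, p in enumerate(self.particles):
            x = int(np.clip(p[0], half_size, w - half_size))
            y = int(np.clip(p[1], half_size, h - half_size))
            cand = gray_histogram(frame[y - half_size:y + half_size, x - half_size:x + half_size])
            d = bhattacharyya(self.target_hist, cand)
            self.weights[i] = np.exp(-d ** 2 / (2 * self.sigma_obs ** 2))
        self.weights /= self.weights.sum()
        # Maximum likelihood particle, as forwarded to the 2D -> 3D conversion unit.
        best = self.particles[np.argmax(self.weights)].copy()
        # Resample to concentrate particles on likely states.
        self.particles = self.particles[np.random.choice(n, n, p=self.weights)]
        return best[:2]

if __name__ == "__main__":
    frame = np.zeros((240, 320), dtype=np.uint8)
    frame[100:130, 150:180] = 200                      # bright "person" blob
    target = gray_histogram(frame[100:130, 150:180])
    pf = ColorParticleFilter((165.0, 115.0), target)
    print("estimate:", pf.step(frame))
```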
Fig. 8 Smart camera indoor and outdoor tracking examples.
6.3 2D → 3D Conversion Unit
Only the maximum likelihood sample, which represents the most probable position and scale in camera coordinates, is further processed in the 2D → 3D conversion unit. Alternatively, the mean could be used; however, in the case of multimodal distributions (which particle filters can handle) this would lead to a worse estimate. This unit converts from the image domain to a common (and geo referenced) 3D world coordinate system using a flat floor assumption and the camera’s calibration data. To summarize, each smart camera node provides its results in geo referenced world coordinates, taking into account the origin of the local model (e.g., floor plan or Wägele model), whose GPS position is known.
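Under the flat floor assumption, the image-to-world mapping can be expressed as a plane homography derived from the camera calibration, followed by a shift into geo referenced coordinates using the known origin of the local model. The sketch below shows this standard construction with made-up calibration values; it is an illustration of the described conversion, not the actual calibration of the installed cameras.

```python
import numpy as np

# Assumed homography mapping image pixels to floor-plane coordinates in meters.
# In the real system this would be derived from each camera's calibration data.
H_IMAGE_TO_FLOOR = np.array([
    [0.010, 0.0000, -3.2],
    [0.000, 0.0120, -1.1],
    [0.000, 0.0004,  1.0],
])

# Assumed local-model origin (WGS84) and a small-area meters-per-degree conversion;
# the real system geo-references the floor plan or Wägele model instead.
ORIGIN_LAT, ORIGIN_LON = 48.5216, 9.0576
METERS_PER_DEG_LAT = 111_320.0

def image_to_world(u, v, H=H_IMAGE_TO_FLOOR):
    """Project the foot point of a tracked object onto the flat floor plane."""
    p = H @ np.array([u, v, 1.0])
    x, y = p[0] / p[2], p[1] / p[2]     # local floor-plane coordinates (m)
    lat = ORIGIN_LAT + y / METERS_PER_DEG_LAT
    lon = ORIGIN_LON + x / (METERS_PER_DEG_LAT * np.cos(np.radians(ORIGIN_LAT)))
    return (x, y), (lat, lon)

local, geo = image_to_world(512, 640)   # e.g. bottom center of the tracked region
print("floor position [m]:", local, "geo position:", geo)
```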
6.4 Handover
To achieve seamless inter-camera tracking decoupled from each respective sensor node, the handover unit merges objects tracked by different camera nodes if they are in spatial proximity. To reduce mismatches during handover due to spatial ambiguities, the object’s appearance is also taken into account: a color histogram is used for measuring similarity in appearance, so no video data has to be transmitted for the handover process. Fig. 9 illustrates a typical handover procedure. This information is also used to track persons beyond the field of view of single cameras. The extracted data is then sent to the server, along with the texture of the object (if enabled), using the network unit.
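A handover test of the kind described above can be expressed compactly: two camera-local tracks are merged if they are close in world coordinates and their color histograms are similar. The thresholds in the following sketch are assumptions; only the abstracted track data is compared, so no video has to leave the cameras.

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    return float(np.sum(np.sqrt(p * q)))

def should_merge(track_a, track_b, max_dist_m=1.0, min_similarity=0.85):
    """Decide whether two camera-local tracks describe the same person.

    Each track is {"pos_world": (x, y), "hist": normalized color histogram}.
    Thresholds are illustrative; only abstracted track data is compared.
    """
    dx = track_a["pos_world"][0] - track_b["pos_world"][0]
    dy = track_a["pos_world"][1] - track_b["pos_world"][1]
    close = (dx * dx + dy * dy) ** 0.5 < max_dist_m
    similar = bhattacharyya_coefficient(np.asarray(track_a["hist"]),
                                        np.asarray(track_b["hist"])) > min_similarity
    return close and similar

hist = np.array([0.7, 0.2, 0.1])
a = {"pos_world": (12.4, 3.1), "hist": hist}
b = {"pos_world": (12.8, 3.3), "hist": hist * 0.9 + 0.1 / 3}   # slightly different appearance
print(should_merge(a, b))   # True: spatially close and similar in appearance
```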
6.5 Activity Recognition Each smart camera is capable of embedded classification and activity recognition based on a supervised learning approach utilizing Support Vector Machines (SVMs). The camera is trained by showing and manually labeling relevant situations (e.g., falling persons) or objects. All extracted feature vectors are automatically processed in an offline optimization step and the optimized kernel is downloaded onto each smart camera to perform activity recognition for each object currently tracked.
Fig. 9 Automated handover: Each person currently tracked is handed over automatically to the next relevant camera, managed by the server. Top row: Shown for the person in dark clothes. Bottom row: Shown for the person in the light shirt.
Object classification is also possible with the same framework: the classification works both on static images (e.g., based on appearance) and on their progression over time. The classification & activity recognition unit aims at classifying objects according to their appearance, position (from the tracking unit) or any derived feature vector. The activity recognition comprises two main parts: the feature manager and the activity manager (see Fig. 7, bottom). Both are described below.
6.5.1 Activity Manager
The activity manager keeps control of all activities (and objects) k to be recognized concurrently. Every activity is instantiated on startup and requests the required feature extractors f from the feature manager, which are then sent to the embedded SVM classifier engine. Currently, falling person detection is implemented as a first activity to demonstrate our architecture and, at the same time, to address the most important activity with respect to assisted living.
6.5.2 Feature Manager
The feature manager collects all features requested by the different activities and merges the requests, so that every feature extractor is triggered at most once per frame. The results of the extractors are then put into a ring buffer, which exists for every extractor. In these ring buffers the results of the last n frames are held to represent temporal characteristics. The feature manager then assembles the different feature vectors according to the request of each activity, and returns them
to the activity manager. As a first step, five feature extractors have been implemented so far within the context of activity recognition, intended especially for falling person detection. These include the temporal characteristics of the aspect ratio, the deviation of the aspect ratio, and the temporal characteristics of similarity in histogram space using different measures (Bhattacharyya distance etc.). Experiments showed that especially a combination of aspect ratio and histogram features performs quite well. Due to the underlying tracking mechanism and the known correspondence between tracking and activity recognition, both the activity and the location where it happened are transmitted for further visualization and fast response (e.g., in the case of falling persons). No video needs to be transmitted.
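To make the feature assembly more concrete, the following sketch keeps the last n values of two of the cues named above (aspect ratio and histogram similarity) in ring buffers and concatenates them, together with the aspect ratio deviation, into one feature vector per tracked object. The buffer length and the exact vector layout are assumptions made for this illustration.

```python
from collections import deque
import numpy as np

class FeatureRingBuffer:
    """Keeps the last n values of one feature extractor (cf. the ring buffers above)."""
    def __init__(self, n=16):
        self.buf = deque([0.0] * n, maxlen=n)

    def push(self, value):
        self.buf.append(float(value))

    def as_array(self):
        return np.array(self.buf)

def assemble_fall_features(aspect_buf, hist_sim_buf):
    """Concatenate temporal aspect-ratio and histogram-similarity cues into one vector.

    The cue selection follows the description above; the buffer length and the
    exact layout of the vector are assumptions made for this illustration.
    """
    aspect = aspect_buf.as_array()
    return np.concatenate([aspect, [aspect.std()], hist_sim_buf.as_array()])

# Usage: a falling person typically shows a drop in bounding-box aspect ratio (h/w).
aspect, hist_sim = FeatureRingBuffer(), FeatureRingBuffer()
for t in range(16):
    aspect.push(2.5 if t < 10 else 0.7)   # upright at first, then lying on the floor
    hist_sim.push(0.95)                    # appearance stays similar over time
x = assemble_fall_features(aspect, hist_sim)
print(x.shape)   # (33,): 16 aspect values + 1 deviation + 16 similarity values
```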
6.6 The Classification & Activity Recognition Flow
The whole flow for using the SVM engine on the camera is split into two phases:
• 1. Training phase: Within the training phase the supervised learning process is carried out. Here, an optimized kernel is generated and trained.
• 2. Online phase: The online phase applies during use of the system.
These phases also differ in terms of speed requirements. To ensure maximal flexibility, an approach is taken where the smart camera behaves like “hardware-in-the-loop”. The training phase is transparently outsourced to the server to fully exploit the latest SVM optimization frameworks available today without the need to port them to the smart camera. All that is required on the smart camera are efficient implementations of the feature extractors and the SVM online evaluation. The former do not need to be reimplemented on the server; instead, the feature extraction runs transparently on the camera also within the training phase. This ensures that exactly the same extractors are used online as within training and avoids double implementations with the corresponding danger of bugs. From the camera perspective, hardware-in-the-loop can thus be interpreted here as “server-in-the-loop”, compared to a full smart camera implementation. After the training phase the kernel is downloaded to the embedded SVM engine on the smart camera for fully embedded online evaluation.
6.6.1 Training Phase
The training is performed using the SPIDER framework from MPI Tübingen [45], which is an object oriented SVM framework in MATLAB. After choosing a kernel function (e.g., RBF), the kernel parameters (σ, C) are optimized using a grid search with exponential steps in conjunction with cross validation. To this end, the performance under different combinations of kernel parameters is evaluated and the one with the best cross validation precision is chosen and trained. Cross validation is a preferred method especially if only few training examples are available. The loss of
the kernel considered best is computed. Within this framework the classification performance of different feature combinations can easily be tested and compared. Finally, the kernel, together with the optimized parameters, is downloaded to the camera in the field.
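The original training runs in MATLAB with the SPIDER toolbox; since that code is not shown here, the following sketch reproduces the described procedure (RBF kernel, exponential grid over σ and C, k-fold cross validation, inspection of the resulting support vectors) with scikit-learn on synthetic, unbalanced toy data. It is meant only to illustrate the optimization step, not to replicate the actual training setup.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy stand-in for the labeled feature vectors (1 = falling, 0 = not falling).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (300, 33)),    # "not falling" examples
               rng.normal(2.0, 1.0, (30, 33))])     # few "falling" examples (unbalanced)
y = np.array([0] * 300 + [1] * 30)

# Exponential grid over the RBF width and the penalty parameter with k-fold cross
# validation, mirroring the grid search described above. scikit-learn parameterizes
# the RBF kernel by gamma = 1 / (2 * sigma**2).
param_grid = {"C": 2.0 ** np.arange(-3, 8), "gamma": 2.0 ** np.arange(-10, 1)}
search = GridSearchCV(SVC(kernel="rbf", class_weight="balanced"),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

best = search.best_estimator_
print("best parameters:", search.best_params_)
print("cross validation loss: %.2f%%" % (100 * (1 - search.best_score_)))
print("support vectors: %d of %d training vectors" % (len(best.support_vectors_), len(X)))
```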
6.6.2 Online Phase
Within the online phase the whole SVM classification process is performed embedded on the camera. We chose SVMs not only due to their theoretical and practical benefits in terms of classification performance but also due to their relatively moderate computational demands in the online phase: during runtime, essentially only one kernel evaluation between the feature vector of the current camera image and each support vector has to be computed per iteration. If the feature extractors are discriminative for the specific application and the training examples are representative, only few support vectors are necessary, which speeds up the classification process. Examples of both classification and activity recognition are depicted in the following.
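Once the support vectors, dual coefficients and bias have been exported from the training phase, the embedded online evaluation needs no SVM library at all. The sketch below shows such an evaluation for an RBF kernel with toy values; its cost per frame is one kernel evaluation per support vector, which is why a small number of support vectors keeps the online phase cheap.

```python
import numpy as np

def rbf_decision(x, support_vectors, dual_coefs, bias, gamma):
    """Evaluate an exported RBF-SVM decision function on one feature vector.

    f(x) = sum_i alpha_i * y_i * exp(-gamma * ||x - sv_i||^2) + b
    Only the (few) support vectors are touched per frame, which keeps the
    online cost low on the embedded processor.
    """
    diff = support_vectors - x                        # (n_sv, d)
    k = np.exp(-gamma * np.sum(diff * diff, axis=1))  # one kernel value per support vector
    return float(np.dot(dual_coefs, k) + bias)

# Toy exported model; in practice these numbers come from the offline training step.
sv = np.array([[0.5, 2.4], [2.0, 0.6], [1.1, 1.2]])
alpha_y = np.array([1.2, -0.8, 0.3])
bias, gamma = -0.1, 0.5

x_t = np.array([0.6, 2.2])                            # feature vector of the current frame
score = rbf_decision(x_t, sv, alpha_y, bias, gamma)
print("falling" if score > 0 else "not falling", round(score, 3))
```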
6.6.3 Activity Recognition at the Assisted Living Home
Fig. 10 shows a sequence of a person tracked and analyzed within the hallways of the assisted living demonstrator. The event of falling is detected and finally visualized as a stylized icon, as shown later in the web visualization (Fig. 17) and in the Google Earth visualization (Fig. 15). The falling person’s position is marked with a
Fig. 10 Detected falling sequence at the assisted living home.
red warning icon and transmitted as a text message to a given phone via the alarm handler. The corresponding feature vectors extracted from this video sequence by the relevant camera are depicted in Fig. 11, where the detected falling event is marked. Encouraging results are obtained with the combination of aspect ratio, aspect ratio deviation and histogram differences. It turned out that the best matching kernel is an RBF kernel. The loss after a three-fold cross validation was 0.64%. The training data was extracted from 11 different video files, in which the number of frames labeled as falling is unbalanced compared to those labeled as not falling. Our resulting kernel consists of 37 support vectors out of 1508 training vectors (2.45%), which indicates
Fig. 11 Features over time, falling sequence is marked.
good generalization performance and discriminative features. Moreover, the same kernel can be used for all cameras installed in the assisted living home. The complete prototype system with activity recognition enabled runs on the smart camera node at 8-10 fps. Feature values over time are shown in Figure 11; the frames which indicate a falling person are marked in red. Figure 10 shows the corresponding scene where the person is detected as falling. The system shows quite promising recognition performance; however, there are not yet enough positive training examples (persons actually falling) to provide a quantitative performance analysis.
7 Visualization Fully Decoupled from Sensor Layer
We present three different visualization options, which are designed for a more intuitive understanding of the surroundings, decoupled from the sensor network.
7.1 3D Visualization Node – XGRT
This visualization option enables the results of the camera network to be embedded directly within a 3D model. To this end, we developed a mobile model acquisition platform called “The Wägele” (Wägele – Swabian for a little cart). It allows for easy acquisition of indoor and outdoor scenes: 3D models are acquired just by moving the platform through the scene to be captured. Thereby, geometry is acquired continuously and color images are taken at regular intervals. Our platform (see Fig. 12) comprises an 8 MegaPixel omnidirectional camera (C1 in Fig. 12) in conjunction with three laser scanners (L1-L3 in Fig. 12) and an attitude heading sensor (A1 in Fig. 12). Two flows are implemented
to yield 3D models: a computer vision flow based on graph cut stereo and a laser scanner based modeling flow. After a recording session the collected data is assembled to create a consistent 3D model in an automated offline processing step. More details can be found in [2, 17]. The visualization node gets its data from the server
Fig. 12 The Wägele: Two setups of our mobile platform. Left: Two laser scanners L1, L2 and one omnidirectional camera C1. Center & Right: Three laser scanners L1, L2, L3 and omnidirectional camera closely mounted together.
node and renders the information (all currently tracked objects) embedded in the 3D model. This visualization option is based on the eXtensible GRaphics Toolkit (XGRT), a modular framework for real-time point-based rendering developed by Wand et al. at our institute [44]. Tracked objects can be displayed as sprites using live textures [18]. The viewpoint can be chosen arbitrarily; a fly-by mode is also available that moves the viewpoint with a tracked person or object. Results of the GRIS installation with both indoor and outdoor cameras are described next. First, a 3D model of both the indoor hallway and the outdoor environment was acquired. Afterwards, the camera network was set up and calibrated relative to each model. Fig. 13 illustrates results within the XGRT visualization. Each tracked person (see Fig. 13 (1, 3)) is embedded within the acquired 3D model as a live texture (Fig. 13 (2, 4)). Even under strong gusts, where the trees were moving heavily, our per-pixel noise process estimator enabled robust tracking by spatially adapting to the respective background movements.
7.2 Google Earth as Visualization Node A geo-referenced visualization of the scene analysis results embedded in Google Earth [21] is introduced in the following. The results of the surveillance system are continuously updated on a password-protected web server which can be accessed ubiquitously by multiple users. According to the privacy needs of the application, each embedded object is represented either as a live texture or filtered
Fig. 13 XGRT visualization: (1, 2): Indoor experiment. (3, 4): Outdoor experiment. (1, 3): Live smart camera view with overlaid tracking information. (2, 4): Live XGRT rendering of tracked object embedded in 3D model acquired by the Wägele platform.
as an icon within the Google Earth model, augmented by its status. The latter further increases the applicability of the system in areas with high privacy concerns, as the video data can stay completely in each smart camera. Tracking and activity results are updated at 20 Hz. Furthermore, in case of a recognized emergency, a live overlay of the camera view can be displayed, updated at a lower frequency (e.g., 1 Hz) to save bandwidth. Of course, the viewpoint within Google Earth can be chosen arbitrarily during runtime, independent of the objects being tracked (and independent of the camera nodes). A car tracking example is illustrated in Fig. 14: as a car approaches one camera’s field of view, a new particle filter is automatically instantiated and tracks the car along its way. Within the Google Earth visualization the car is embedded as a stylized icon. Fig. 15 shows the resulting Google Earth visualization of the home for assisted living, overlaid by a floor plan. Currently, one person is present within the perimeter covered by the prototype installation. The person is detected as a moving object, automatically tracked and analyzed. At this moment, the person’s activity is recognized as falling and thus marked by a red warning icon at the corresponding position in geo-referenced world coordinates. No person-specific data is visualized; the live view is shown here only for demonstration purposes, but could be enabled to appear in case of such a detected alarm situation. As the viewpoint can be chosen arbitrarily during runtime, Fig. 16 shows a zooming sequence towards a detected falling person in the assisted living home. This also demonstrates how a scenario with multiple homes for assisted living could look: not only the results of multiple cameras but even those of multiple complete installations at different locations around the world could be visualized concurrently in this virtual world to provide a single point of information. This addresses the problem of being overstrained by the sheer mass of video data (problem #2), which is even more pronounced if multiple installations have to be covered concurrently. At the same time, the visualization shows the benefit
Fig. 14 Live tracking results embedded in Google Earth. The approaching car is tracked as “object3” in the world domain and visualized as a stylized car icon. An overlaid video feed of one camera node’s output is shown at the bottom right.
of the spatial information of the event in the world domain, rather than just a result in the 2D image domain (“falling person in camera #555”).
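As an illustration of how a geo-referenced tracking result can be pushed into Google Earth, the sketch below emits a KML placemark fragment for one tracked object; the field names, icon files, and coordinates are hypothetical, and the prototype's actual data format is not documented here.

def tracked_object_to_kml(obj_id, lon, lat, activity):
    """Render one tracked object as a KML placemark fragment; a network
    link in Google Earth can poll the assembled KML file at the desired rate."""
    icon = "falling_alarm.png" if activity == "falling" else "person.png"  # hypothetical icons
    return (
        '<Placemark>'
        f'<name>object{obj_id}: {activity}</name>'
        '<Style><IconStyle><Icon>'
        f'<href>{icon}</href>'
        '</Icon></IconStyle></Style>'
        f'<Point><coordinates>{lon},{lat},0</coordinates></Point>'
        '</Placemark>'
    )

# Example: a person recognized as falling at a hypothetical position.
print(tracked_object_to_kml(3, 9.0576, 48.5260, "falling"))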
7.3 Web Application for Visualization An interactive web visualization application has been developed to reach users who have only a web browser, independent of Google Earth. Thereby, the system is ubiquitously available to any user, while being protected by standard security techniques. Fig. 17 depicts the web visualization. In contrast to regular synchronous web page calls, only incremental changes are transmitted, and objects and activities are rendered in the floor plan on the fly. The update frequency can be chosen at runtime, up to 10 Hz. No privacy-critical data is shown unless an emergency is recognized; in this case the live view of the associated camera is automatically integrated at a lower frequency (e.g., 1 Hz). To summarize, with the latter two visualization options, the resulting information from the distributed camera network is first integrated in one consistent world model, which is then made accessible worldwide to multiple users independently.
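A toy sketch of the incremental-update idea behind the web visualization: the browser polls with the timestamp of its last update and receives only the newer events as JSON. The endpoint shape, field names, and in-memory store are assumptions made for illustration.

import json, time

EVENTS = []   # hypothetical store fed by the server node: {"id", "t", "x", "y", "activity"}

def incremental_update(since):
    """Return only events newer than the client's last poll, so the floor
    plan can be redrawn on the fly instead of reloading the whole page."""
    delta = [e for e in EVENTS if e["t"] > since]
    return json.dumps({"now": time.time(), "events": delta})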
Fig. 15 Google Earth visualization of the assisted living home prototype system. A person is tracked and classified as falling by the smart camera network – reflected within the Google Earth visualization in world coordinates. On the right, the live view of the respective smart camera is shown, overlaid with the 2D tracking output.
7.4 Statistics The information within the virtual world model yields the following statistics. Fig. 18 illustrates the number of persons tracked concurrently within the assisted living prototype over 24 hours of operation. In Fig. 19 the distribution of the number of events per hour of the day is shown, accumulated over an entire week. As we do not have an automated way of detecting false tracking results, no quantitative results on the false object detection rate are available yet. One can see an early start to the day, a significant peak around lunch time, and a very early drop at 7 p.m., when the elderly residents already retire to their rooms. To further address problem #2 – the overstraining data problem – a time-scaled visualization functionality is available. Thus, besides the live view of the current situation reflected in the virtual world model, it is possible to access past time spans
Fig. 16 Google Earth visualization: Zooming in to a person recognized as fallen.
Fig. 17 Ground plan with embedded camera information, tracking, and activity recognition results. Every camera’s pose, together with its field of view, is depicted within this plan. Four persons are being tracked concurrently by the smart camera network, one of whom has fallen (red icon on the right).
and get a quick overview of the situation, e.g., of the last 24 hours in the home for assisted living. The user can jump between events (tracking/activity recognition results); only time spans in which events happened are logged for visualization. All events within the chosen interval width are shown concurrently to reduce the chance of missing events during fast forward. A temporal (“replay”) animation at a definable speed is available.
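A minimal sketch of how the per-hour event statistics of Fig. 19 could be accumulated from a time-stamped event log; the ISO timestamp format and example values are assumptions.

from collections import Counter
from datetime import datetime

def events_per_hour(event_timestamps):
    """Accumulate the number of events per hour of day over the whole
    logging period (e.g., one week), as plotted in Fig. 19."""
    counts = Counter(datetime.fromisoformat(t).hour for t in event_timestamps)
    return [counts.get(h, 0) for h in range(24)]

# Example with hypothetical timestamps:
print(events_per_hour(["2008-03-01T12:15:03", "2008-03-01T12:47:10", "2008-03-02T07:05:44"]))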
Fig. 18 Distribution of number of persons concurrently tracked by the camera network over one week of operation.
Fig. 19 Events over 24h within the whole perimeter covered by the camera network at the home for assisted living.
8 Conclusion and Impact Surveillance applications are booming in various fields of application. However, current manual and centralized automated systems have several strong drawbacks, especially in terms of privacy. Three problems have been identified: lack of privacy, the sheer mass of raw video data overstraining the operators, and the limited flexibility of today’s systems. A distributed and automated smart camera based approach has been proposed that addresses these problems. The approach is based on the smart camera paradigm, where a camera behaves like an application-specific
sensor that transmits only results, which are at a higher level of abstraction. The presented architecture and system are capable of distributed tracking in geo-referenced world coordinates. Furthermore, the system features activity recognition and can be extended to arbitrary events as long as appropriate features are designed. To this end, each smart camera is capable of embedded classification using Support Vector Machines (SVMs). To ensure privacy, only results are transmitted, and these are embedded in one consistent geo-referenced world model for visualization. A real-world prototype system at a home for assisted living has been set up. It has been running 24/7 for several months now and shows quite promising performance. As a first activity, falling-person detection has been addressed. An intuitive visualization system with three innovative options has been presented, in which all information from the sensor systems is integrated in one consistent and geo-referenced world model – the virtual world. The user is fully decoupled from the camera level and just sees the resulting movements within a geo-referenced map. The idea is that the user does not have to care about the underlying processes (which camera currently tracks a person and when a handover occurs); he or she just sees the path of each person in a world map, enriched by the status of the activity recognition.
8.1 Future Directions Future work includes a systematic survey of the inhabitants of the assisted living home to gain more insight into how such a system is seen from the user’s perspective. Moreover, the challenges of the proposed approach also have to be mentioned. The performance of the system relies on the performance of the automated video analysis algorithms. These do not complement the human operators but replace them, from the sensor level all the way up to a level where the information is no longer directly privacy related. Trust in the camera system has to be established on both sides: the persons being surveilled have to trust that it really works in as privacy-respecting a manner as advertised. With CCTV-style systems the user can at least be aware that he is currently being watched (even though he does not have a choice). At the same time, the operator has to trust the video analysis and self-diagnosis capabilities, as no method of control is available to check that the system is operational. One solution could be to certify the manufacturers of such systems as “trusted”. Additionally, a self-diagnosis unit within the camera should check whether it is still fully operational (e.g., by checking whether the lens is still clean). The most important remaining challenge is that automated video analysis is still a hot research topic with unsolved problems. In particular, more abstract activities involving multiple persons and objects are topics of future research. Moreover, fully distributed and collaborative processing and decision making is a major research direction for future work. This could increase robustness significantly, both on the algorithmic level in terms of computer vision and with respect to fault tolerance on the hardware level. Performing the handover process in a distributed way is a first area to start with. Fusing object classification and activity recognition results is also an interesting future direction. Moreover, instead
of using the discrete privacy filter options, i.e., either raw video texture or stylized icons, a continuous de-identification filter could be used to obtain intermediate levels of detail by combining the presented architecture with techniques like the “respectful camera” and “de-identified faces” mentioned in Section 3.1.2. Depending on the importance level of the activity (e.g., falling) and the status of the user (e.g., a medical doctor), more fine-grained access to privacy-related data could be granted in cases where video data is still of interest. Acknowledgements We would like to thank the students Florian Busch, Roland Loy, Florian Ohr, Christian Vollrath and Florian Walter for their contributions, Peter Biber for the joint work on the Wägele platform, Benjamin Huhle for many fruitful discussions and Matrix Vision for their generous support and successful cooperation.
References
[1] Aghajan, H.K., Augusto, J.C., Wu, C., McCullagh, P.J., Walkden, J.A.: Distributed vision-based accident management for assisted living. In: T. Okadome, T. Yamazaki, M. Makhtari (eds.) ICOST, Lecture Notes in Computer Science, vol. 4541, pp. 196–205. Springer (2007)
[2] Biber, P., Fleck, S., Wand, M., Staneker, D., Straßer, W.: First experiences with a mobile platform for flexible 3D model acquisition in indoor and outdoor environments – the Wägele. In: 3D-ARCH’2005: 3D Virtual Reconstruction and Visualization of Complex Architectures. Mestre-Venice, Italy (2005)
[3] Bramberger, M., Doblander, A., Maier, A., Rinner, B., Schwabach, H.: Distributed embedded smart cameras for surveillance applications. IEEE Computer 39 (2006)
[4] Chalimbaud, P., Berry, F.: Embedded active vision system based on an FPGA architecture. EURASIP J. Embedded Syst. 2007(1), 26–26 (2007). DOI http://dx.doi.org/10.1155/2007/35010
[5] CNN: Smart cameras spot shady behavior. http://edition.cnn.com/2007/TECH/science/03/26/fs.behaviorcameras/index.html [Accessed 1.3.2008] (2008)
[6] Collins, R., Lipton, A., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O.: A system for video surveillance and monitoring. Tech. Rep. CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA (2000)
[7] Cucchiara, R., Grana, C., Piccardi, M., Prati, A.: Detecting objects, shadows and ghosts in video streams by exploiting color and motion information. In: ICIAP ’01: Proceedings of the 11th International Conference on Image Analysis and Processing, p. 360. IEEE Computer Society, Washington, DC, USA (2001)
[8] Cucchiara, R., Grana, C., Piccardi, M., Prati, A., Sirotti, S.: Improving shadow suppression in moving object detection with HSV color information. In: Intelligent Transportation Systems, 2001. Proceedings. 2001 IEEE, pp. 334–339 (2001)
[9] Cucchiara, R., Prati, A., Vezzani, R.: A system for automatic face obscuration for privacy purposes. Pattern Recognition Letters 27(15), 1809–1815 (2006)
[10] Cucchiara, R., Prati, A., Vezzani, R.: A multi-camera vision system for fall detection and alarm generation. Expert Systems 24(5), 334–345 (2007). DOI 10.1111/j.1468-0394.2007.00438.x. URL http://dx.doi.org/10.1111/j.1468-0394.2007.00438.x
[11] Datz, T.: What happens in Vegas stays on tape. http://www.csoonline.com/read/090105/hiddencamera_vegas_3834.html [Accessed 1.3.2008] (2008)
[12] Dietrich, D., Garn, H., Kebschull, U., Grimm, C., Ben-Ezra, M. (eds.): Embedded Vision Systems, vol. 2007. EURASIP Journal on Embedded Systems (2007)
[13] Dishman, E.: Inventing wellness systems for aging in place. Computer 37(5), 34–41 (2004). DOI http://doi.ieeecomputersociety.org/10.1109/MC.2004.1297237
[14] Dufaux, F., Ebrahimi, T.: Scrambling for video surveillance with privacy. In: Conference on Computer Vision and Pattern Recognition Workshop on Privacy Research in Vision (CVPRW-PRIV), p. 160. IEEE Computer Society, Washington, DC, USA (2006). DOI http://dx.doi.org/10.1109/CVPRW.2006.184
[15] Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.S.: Background and foreground modeling using nonparametric kernel density for visual surveillance. In: Proceedings of the IEEE, vol. 90, pp. 1151–1163 (2002)
[16] The Royal Academy of Engineering: Dilemmas of privacy and surveillance – challenges of technological change. Tech. rep., The Royal Academy of Engineering, London (2007). URL http://www.raeng.org.uk/policy/reports/pdf/dilemmas_of_privacy_and_surveillance_report.pdf
[17] Fleck, S., Busch, F., Biber, P., Straßer, W.: Graph cut based panoramic 3D modeling and ground truth comparison with a mobile platform – the Wägele. In: IAPR Canadian Conference on Computer and Robot Vision (CRV 2006). Quebec City, Canada (2006)
[18] Fleck, S., Busch, F., Straßer, W.: Adaptive probabilistic tracking embedded in smart cameras for distributed surveillance in a 3D model. EURASIP Journal on Embedded Systems, Special Issue on Embedded Vision Systems 2007(29858) (2007)
[19] Fleck, S., Loy, R., Vollrath, C., Walter, F., Straßer, W.: SmartClassySurv – a smart camera network for distributed tracking and activity recognition and its application to assisted living. In: ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC-07). Vienna, Austria (2007)
[20] Fleck, S., Straßer, W.: Adaptive probabilistic tracking embedded in a smart camera. In: IEEE Embedded Computer Vision Workshop (ECVW) in conjunction with IEEE CVPR 2005. San Diego, CA (2005)
[21] Google: Google Earth. http://earth.google.com/ [Accessed 1.3.2008] (2008)
[22] Gross, R., Sweeney, L., de la Torre, F., Baker, S.: Model-based face de-identification. In: CVPR Workshop on Privacy Research in Vision (CVPRW-PRIV), p. 161. IEEE Computer Society (2006)
[23] Hampapur, A., Brown, L.M., Connell, J., Lu, M., Merkl, H., Pankanti, S., Senior, A.W., Shu, C.F., Tian, Y.L.: Multi-scale tracking for smart video surveillance. IEEE Transactions on Signal Processing 22 (2005)
[24] Hauptmann, A.G., Gao, J., Yan, R., Qi, Y., Yang, J., Wactlar, H.D.: Automated analysis of nursing home observations. IEEE Pervasive Computing 3(2), 15–21 (2004). DOI http://dx.doi.org/10.1109/MPRV.2004.1316813
[25] Hengstler, S., Prashanth, D., Fong, S., Aghajan, H.: MeshEye: a hybrid-resolution smart camera mote for applications in distributed intelligent surveillance. In: IPSN ’07: Proceedings of the 6th international conference on Information processing in sensor networks, pp. 360–369. ACM, New York, NY, USA (2007). DOI http://doi.acm.org/10.1145/1236360.1236406
[26] Johannes Widmer, M.G.: Spiegel Online: Die schöne neue Welt der Überwachung. http://www.spiegel.de/flash/0,5532,15385,00.html [Accessed 1.3.2008] (2008)
[27] Kampel, M., Aguilera, J., Thirde, D., Borg, M., Fernandez, G., Wildenauer, H., Ferryman, J., Blauensteiner, P.: 3D Object Localisation and Evaluation from Video Streams. In: F. Lenzen, O. Scherzer, M. Vincze (eds.) 28th Workshop of the Austrian Association for Pattern Recognition (AAPR), Digital Imaging and Pattern Recognition, pp. 113–122. Obergurgl, Austria (2006)
[28] Keshavarz, A., Maleki-Tabar, A., Aghajan, H.: Distributed vision-based reasoning for smart home care. In: ACM SenSys Workshop on Distributed Smart Cameras (DSC) (2006)
[29] Kleihorst, R., Abbo, A., Schueler, B., Danilin, A.: Camera mote with a high-performance parallel processor for real-time frame-based video processing. In: ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC-07) (2007)
[30] Philips: Philips Lifeline. http://www.lifelinesys.com/ [Accessed 1.3.2008] (2008)
[31] Prati, A., Mikic, I., Cucchiara, R., Trivedi, M.: Analysis and detection of shadows in video streams: A comparative evaluation. In: IEEE CVPR Workshop on Empirical Evaluation Methods in Computer Vision, Kauai (2001)
[32] Quaritsch, M., Kreuzthaler, M., Rinner, B., Bischof, H., Strobl, B.: Autonomous multicamera tracking on embedded smart cameras. EURASIP Journal on Embedded Systems 2007, Article ID 92827, 10 pages (2007). DOI 10.1155/2007/92827
[33] Ribeiro, P.C., Santos-Victor, J.: Human activity recognition from video: modeling, feature selection and classification architecture. In: Workshop on Human Activity Recognition and Modelling, HAREM 2005 (2005)
[34] Park, S., Trivedi, M.M.: Video analysis of vehicles and persons for surveillance. Intelligent and Security Informatics: Techniques and Applications (2007)
[35] Schiff, J., Meingast, M., Mulligan, D.K., Sastry, S., Goldberg, K.: Respectful cameras: Detecting visual markers in real-time to address privacy concerns. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2007)
[36] Senior, A., Pankanti, S., Hampapur, A., Brown, L., Tian, Y.L., Ekin, A., Connell, J., Shu, C.F., Lu, M.: Enabling video privacy through computer vision. IEEE Security and Privacy 3(3), 50–57 (2005). DOI http://doi.ieeecomputersociety.org/10.1109/MSP.2005.65
[37] Shah, M., Javed, O., Shafique, K.: Automated visual surveillance in realistic scenarios. IEEE MultiMedia 14(1), 30–39 (2007). DOI http://dx.doi.org/10.1109/MMUL.2007.3
[38] Siebel, N., Maybank, S.: The ADVISOR visual surveillance system. In: ECCV 2004 Workshop Applications of Computer Vision (ACV) (2004)
[39] Staff, S.E.: Integrator unveils security system for Wynn casino. http://www.securityinfowatch.com/article/article.jsp?siteSection=344&id=9566 [Accessed 1.3.2008] (2008)
[40] Tabar, A.M., Keshavarz, A., Aghajan, H.: Smart home care network using sensor fusion and distributed vision-based reasoning. In: ACM International Workshop on Video Surveillance & Sensor Networks, VSSN 2006 (2006)
[41] Toereyin, B.U., Dedeoglu, Y., Cetin, A.E.: HMM based falling person detection using both audio and video. In: MUSCLE Network of Excellence Project (2006)
[42] Trivedi, M.M., Gandhi, T.L., Huang, K.S.: Distributed interactive video arrays for event capture and enhanced situational awareness. IEEE Intelligent Systems, Special Issue on Homeland Security (2005)
[43] Velastin, S., Remagnino, P.: Intelligent Distributed Video Surveillance Systems. Institution of Engineering and Technology (2006)
[44] Wand, M., Berner, A., Bokeloh, M., Jenke, P., Fleck, A., Hoffmann, M., Maier, B., Staneker, D., Schilling, A., Seidel, H.P.: Processing and interactive editing of huge point clouds from 3D scanners. Computers & Graphics 2008(1) (2008)
[45] Weston, J., Elisseeff, A., Bakir, G., Sinz, F.: Spider machine learning framework. http://www.kyb.tuebingen.mpg.de/bs/people/spider/ [Accessed 1.3.2008] (2008)
[46] Williams, A., Xie, D., Ou, S., Grupen, R., Hanson, A., Riseman, E.: Distributed smart cameras for aging in place. In: Workshop on Distributed Smart Cameras, DSC 2006 (2006)
[47] Wolf, W., Ozer, B., Lv, T.: Smart cameras as embedded systems. Computer 35(9), 48–53 (2002)
[48] Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Comput. Surv. 38(4) (2006). DOI 10.1145/1177352.1177355. URL http://portal.acm.org/citation.cfm?id=1177352.1177355
Data Mining for User Modeling and Personalization in Ubiquitous Spaces Alejandro Jaimes
1 Introduction User modeling (UM) has traditionally been concerned with analyzing a user’s interaction with a system and with developing cognitive models that aid in the design of user interfaces and interaction mechanisms. Elements of a user model may include the representation of goals, plans, preferences, tasks, and/or abilities of one or more types of users, the classification of a user into subgroups or stereotypes, the formation of assumptions about the user based on the interaction history, and the generalization of the interaction histories of many users into groups, among many others. UM is by definition a cross-disciplinary field involving HCI, psychology, AI, design, and others, and it has spanned many application areas, focusing on different aspects of user behavior. Kobsa [52] gives a historical overview of UM from the late 1970s until 2001, discussing some of the early Expert Systems as well as many commercial applications. Historically, a lot of the work on user modeling has been concerned with human-computer interaction [31, 12]; thus the models have often been applied most directly to the design of user interfaces and/or the "dialogue" users undertake with systems. Research in UM in recent years, however, has expanded significantly to other areas and applications. This growth has been triggered in part by the large amounts of data that can be collected from users in multiple scenarios (web, mobile, etc.). The availability of data has created additional research opportunities for personalization [82, 5, 69, 29], and led to a proliferation of business models based on advertising and information filtering. Recommender systems [4, 16, 2], for example, are a high-growth area in which UMs are created for the purpose of recommending products and services, as well as for advertising. In addition, user modeling has become further integrated into the Data Mining and Knowledge Discovery process, where one
Alejandro Jaimes Telefonica Research, Emilio Vargas 6, Madrid, Spain, e-mail: [email protected]
of the goals of building user models is classifying users into groups for marketing and other purposes. One of the areas of significant growth for UM is ubiquitous computing. That community’s modeling, until recently focused mostly on context: the user’s location, physical environment, and social environment. The emphasis was stronger on modeling context than on modeling the user [46]. In the UM community, on the other hand, because of the influence of ubiquitous computing, there has been an increasing concern for the inclusion of context in UMs. The joining of work from the two communities was motivated by the desire to create methods that enable intelligent environments to capture and represent information about users and contexts so as to enable the environment to adapt to both [46]. A common theme is the importance of enabling users to understand system decisions that may be based on a variety of user- and context-related information collected over time in different locations and situations. Important topics of ubiquitous user modeling include methods for user-centered design and in-depth evaluation for mobile context-aware systems. The explosion of availability of real-time user data presents many technical challenges that require the development of new algorithms and approaches for data mining, modeling and personalization, particularly considering the importance of the socio-cultural context in which technologies are used, and more specifically, in ubiquitous environments. The discipline of UM, however is extremely large since it aims to model users at many different levels and for different applications. This chapter will present an introductory overview to some of the relevant topics of user modeling in ubiquitous environments. In particular, it will give an introduction to user modeling, to knowledge discovery and data mining, to context modeling, mobile user modeling, to the application of KDD techniques for user modeling, and finally to relevant human factors. The goal of this chapter is not to serve as a tutorial (many good survey and tutorial articles are cited for each sub-topic), but rather to give an overview of this important and challenging discipline. The rest of the chapter is organized as follows. In section 2 we will give an introduction to user modeling. Section 3 discusses data mining and knowledge discovery. In section 4 we give an overview of context and mobile UM. In section 5 we discuss data mining in some UM applications, while in section 6 we discuss human factors. Finally, we give a brief overview of issues related to evaluation in section 7 and conclude with a discussion of future directions in section 8.
2 User Modeling A user model (UM) can be roughly defined as a representation, in some form, of particular aspects of a user, for a specific application context. Definitions of UMs abound, but it is generally understood that a user model should include some or all information about the user’s knowledge, beliefs, goals and plans for achieving these goals, abilities, attitudes, and preferences [52], thus a UM is a set of information structures designed to represent one or more of the following elements [52]:
1) representation of goals, plans, preferences, tasks, and/or abilities about one or more types of users; 2) representation of relevant common characteristics of users pertaining to specific user subgroups or stereotypes; 3) the classification of a user in one or more of these subgroups or stereotypes; 4) the recording of user behavior; 5) the formation of assumptions about the user based on the interaction history; and/or 6) the generalization of the interaction histories of many users into groups. A user model is created through a user modeling process in which unobservable information about a user is inferred from observable information from that user, where in many applications the observable information is obtained from direct interactions of the user with a system [83, 84]. From the interaction perspective (Figure 1 and [31]), user modeling can be thought of as including two main components: implicit and explicit communication channels. Explicit communication channels generate most of the explicit data, while for implicit channels knowledge about the domain, communication process, and other aspects is considered. This relates to what Heckman defines as situated interaction [40] : the explicit, implicit and indirect interaction between a human and an intelligent environment (or systems and their interaction devices), that take situation awareness into account.
Fig. 1 User Modeling in HCI (from Fischer [31]).
User Models can be represented in many different ways: a set of keywords, or tags for example, can be used to represent a user’s tastes or preferences. In early systems (e.g., [14]), hand-crafted rule-based approaches were used to represent user
models, in particular for plan recognition [18]. As pointed out in [83], however, these early systems suffered from the knowledge bottleneck problem, which refers to the manual labor-intensive effort required to build the necessary knowledge bases (mostly rules) that correspond to the UM, and the fact that rule-based systems in general are difficult to extend or adapt. This, together with the increasing availability of data collected from large groups of users (e.g., on the WWW), led to the widespread use of statistical models, which are concerned with the use of observed sample results (observed values of random variables) in making statements about unknown, dependent parameters [56]. In predictive statistical models for user modeling (e.g., decision trees, neural networks, Bayesian networks, and many others), the parameters represent aspects of a user’s future behavior, such as his/her goals, preferences, and forthcoming actions or locations [83]. Zukerman and Albrecht [83] give an overview of the main statistical methods for UM. These are roughly classified into linear models, TF-IDF models (most popular for dealing with text, but also applied in image analysis), Markov models, Neural Networks, Classification, and Rule Induction. These techniques are also used in data mining and knowledge discovery, both of which have direct applications to user modeling (see Section 3 and also [4]). It is important to note the distinction between the approach used to construct the model and the model’s representation, although different methods may be used with the same representation. For example, a simple UM may consist of 5 keywords to describe a user. The keywords can be obtained explicitly from the user (e.g., via a questionnaire asking her to describe her preferences), or implicitly (e.g., keywords are automatically extracted from books bought by the user, and those keywords are used to describe her). In this chapter we are mainly interested in techniques that automatically build and update user models. In general, two main user modeling approaches have been adopted: content-based, in which the main assumption is that each user exhibits a particular behavior under a given set of circumstances and that this behavior is repeated under similar circumstances, and collaborative (at times also referred to as social filtering [75]), where the main assumption is that people within a particular group tend to behave similarly under a given set of circumstances (i.e., that people who have behaved similarly are likely to behave similarly in the future). In the content-based approach, the behavior of a user is predicted from his or her past behavior, while in the collaborative approach, a user’s behavior is predicted from the behavior of other like-minded people [83]. UM in this respect has had a strong impact on commercial applications in which retail items are recommended, and in advertising (e.g., users are recommended retail items others have bought, or display ads on web pages are selected using recommendation techniques). Examples include music, books, hotels, electronic items such as cameras, and many others. Given the strong relationship between recommendation techniques and UM, we outline the main techniques in Table 1 (see [16] for details). In the next section we give an overview of data mining techniques in relation to user modeling.
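To make the collaborative assumption concrete, the sketch below predicts a user's unknown rating from the ratings of users with similar behavior (user-based collaborative filtering). The rating matrix and the use of cosine similarity are illustrative choices only, not a reproduction of any specific system from the cited literature.

import numpy as np

def collaborative_predict(ratings, user, item):
    """Predict ratings[user, item] from like-minded users; zero entries
    mean 'unrated'. Similarity is cosine similarity over co-rated items."""
    weights, neighbor_ratings = [], []
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue
        a, b = ratings[user], ratings[other]
        mask = (a > 0) & (b > 0)                 # compare only co-rated items
        if not mask.any():
            continue
        sim = np.dot(a[mask], b[mask]) / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask]) + 1e-9)
        weights.append(sim)
        neighbor_ratings.append(ratings[other, item])
    if not weights:
        return None                              # cold start: no comparable neighbors
    w, r = np.array(weights), np.array(neighbor_ratings)
    return float(np.dot(w, r) / (np.abs(w).sum() + 1e-9))

# Example: 4 users x 3 items (hypothetical data); predict user 0's rating for item 2.
R = np.array([[5, 3, 0], [5, 4, 4], [1, 1, 5], [4, 3, 4]])
print(collaborative_predict(R, user=0, item=2))

A content-based predictor would instead use only the first row of the matrix (the user's own history) together with item descriptions.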
Table 1 Recommendation techniques (from [16]). U is a set of users and u is the current user. Each technique specifies if and how a user model is built, but there are many variations in each case (i.e., there are several implementations of collaborative filtering algorithms).
3 Data Mining & Knowledge Discovery (KDD) Various definitions have been used in the literature for Data Mining and Knowledge Discovery in Databases (KDD). Fayyad et al. [30] defined knowledge discovery in databases as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [30]. Data mining has been defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [32], and as the science of extracting useful information from large data sets or databases [38]. Data mining attacks such problems as obtaining efficient summaries of large amounts of data, identifying interesting structures and relationships within a data set, and using a set of previously observed data to construct predictors of future observations [44]. The terms Data Mining and Knowledge Discovery are often used interchangeably, but some researchers have made distinctions; in most cases Data Mining is considered the entire process of applying computer-based methodology, including new techniques for knowledge discovery, to data [50]. Often the term Knowledge Discovery is used to refer to processes that produce explicit information from large datasets that can be understood by a user, in contrast with, for example, predictive modeling, in which models are built that produce probabilities for particular variables (e.g., the probability that a call is fraudulent). The term KDD, however, has more recently been used as an acronym for Knowledge Discovery and Data Mining, which encompasses several research methodologies and techniques across various application areas (e.g., pattern discovery, analytics, decision support, interactive data mining, and others).
As an example of what the KDD process might be like, we outline the main steps presented in [37]:
• Developing an understanding of the application domain and the goals of the data mining process
• Acquiring or selecting a target data set
• Integrating and checking the data set
• Data cleaning, preprocessing, and transformation
• Model development and hypothesis building
• Choosing suitable data mining algorithms
• Result interpretation and visualization
• Result testing and verification
• Using and maintaining the discovered knowledge
In some sense, traditional KDD assumes the availability of large datasets and the realization of processes to discover knowledge from the data as a whole using a set of tools [37], not necessarily the building of user models for online, real-time use as is the case with recommender systems. Data sets can be very large, often requiring days of computer time to create a single model. From this perspective, some of the manual steps outlined above may not apply directly to "traditional" user modeling tasks (e.g., data cleaning, preprocessing, and transformation), but they are relevant in designing systems that rely on UMs; a toy sketch illustrating a few of these steps is given at the end of this section. It is also important to note that the term KDD does include real-time analysis (e.g., stream mining), but that is a relatively new approach. The field of KDD has grown tremendously in the last few years, in large part due to the increase in computational power and lower storage costs. KDD research and applications span a wide range of sub-areas, and distinctions can be made based on the type of data examined, for example: geographic data mining [65], web mining [69, 54], time series mining [7, 51, 80], link mining [35], hypertext data mining [20], temporal knowledge mining [72], and workflow mining [1], among several others. The techniques or scope also span many areas, for example visual data mining [67], soft computing [66], and evolutionary algorithms [33]. In addition, one can consider the goal of KDD (e.g., interestingness [62]), and particular applications such as marketing [58], biology [59], and many others. KDD techniques can be applied to practically any type of data and domain, as evidenced by an unusually large number of survey articles focusing on particular aspects of KDD. A lot of the work in KDD, particularly with respect to model development and hypothesis building, uses machine learning techniques. Hosking et al. [44] discuss the differences among three approaches to machine learning: classical statistics, Vapnik’s statistical learning theory, and computational learning theory. Undoubtedly, many statistical models exist for explaining relationships in a data set or for making predictions [44]: cluster analysis, discriminant analysis and nonparametric regression can be used in many data mining problems – it would be tempting for a statistician to regard data mining as no more than a branch of statistics [44]. The differences in scale and approach are significant, however. For instance, most of the mathematical techniques used in classical statistics assume that the form of the correct model is
known and that the problem is to estimate its parameters. In data mining, on the other hand, the form of the correct model is usually unknown [44]. There is a large body of literature on KDD, and it is beyond the scope of the chapter to enter into details (see [4] for an overview of data mining techniques). As mentioned earlier, predictive statistical modeling techniques are often used in UM and are closely related with the work in KDD. Key differences between the two areas include the goals, scope, and outputs: while in KDD the goal may be segmentation of a large dataset (e.g., for marketing purposes), grouping of many individuals, or "understanding" of a large data set (e.g., for Business Intelligence applications), in UM the goal might be obtaining models that are specific to individuals and that can be used directly in particular applications (e.g., a recommender system, a mobile museum guide, etc.), often with amounts of data that would be considered small in KDD tasks (e.g., a UM task would consider data for a single user, while a DM task would typically focus on large groups of users). The overlap between the two fields is significant, however and one could argue that some approaches to UM are indeed data mining techniques (e.g., social network analysis to characterize users can be considered to be a UM task and also a KDD task). Some of the challenges in using KDD techniques for UM occur because of the differences in scale and in the availability of the data. For instance, statistical methods are not immune to problems related to the lack of information available for some users, or the change of their interests over time. The sparse data problem, which is related to the cold start problem, refers to situations where there is insufficient information to categorize users or items [3], while concept drift represents the change over time of the attributes that characterize a user. This means that "historical" models, acquired early may become inadequate or obsolete. Efficiency issues pertain to the ability of statistical models to cope with vast amounts of data during all stages of the user modeling process: model building, prediction generation, and online adaptation, which demand the delivery of results in real time. One of the key differences between KDD and UM, particularly in ubiquitous environments, is the role of context. In traditional KDD, a large dataset is collected and analyzed, with an assumption that context remains fairly constant. In applications that utilize UM techniques, on the other hand, context may change in real time and be crucial in determining how an application behaves.
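As the toy illustration promised above, the following sketch chains a few of the listed KDD steps – data cleaning/preprocessing, transformation, model development, and result verification – using scikit-learn; the synthetic data and the specific component choices are assumptions made purely for illustration and do not correspond to any system discussed in this chapter.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 5))                                   # hypothetical usage features per user
y = (X[:, 0] + 0.1 * rng.standard_normal(200) > 0.5).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan                     # simulate missing values to be cleaned

model = make_pipeline(SimpleImputer(strategy="mean"),      # data cleaning/preprocessing
                      StandardScaler(),                    # transformation
                      DecisionTreeClassifier(max_depth=3)) # model development

print(cross_val_score(model, X, y, cv=5).mean())           # result testing and verification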
4 Context & Mobile User Modeling Ubiquitous user modeling generally encompasses user interaction with mobile devices as well as with the surrounding environment. Context is important in the construction of all UMs [27] (see short surveys in [10, 21, 78]), and should be considered not only in setting parameters for specific techniques, but also in the design of the system and the interaction within particular applications. What’s more, in ubiquitous environments context plays a crucial role because it determines, to a large extent, not only the type of possible interaction (i.e., dependent on the actual device used),
but also the real-world choices the user may make (e.g., a mobile, location-based application to recommend restaurants is constrained by the restaurants available in a particular location). Strang and Popien [78] briefly survey recent context modeling techniques in ubiquitous applications. The authors base their survey on the data structures commonly used in context modeling. We briefly outline the main ones from [78]:
Key-value models: these are simple attribute-value pairs.
Markup schemes: typically a hierarchical data structure consisting of markup tags, attributes, and content using language derivatives of SGML. An example of this is presented in [15]. Other markup-type representations are presented in [17, 49, 73, 11].
Graphical models: such models include the Unified Modeling Language (UML) and the object-role modeling approach. The approach presented in [41] bases the model on fact types classified as static or dynamic (profiled, sensed, or derived). These models are closely related to traditional Entity-Relationship (ER) models [22].
Object-oriented models: object-oriented models such as those in [74, 23] aim at representations based on an encapsulated object paradigm, in which the interfaces that allow interaction with the objects are clearly defined and objects are easily reusable. The output of each cue depends on a single sensor, but different cues may be based on the same sensors. The context is modeled as an abstraction level on top of the available cues. Thus the cues are objects providing contextual information through their interfaces, hiding the details of determining the output values.
Logic-based models: in these types of models, context is defined as a set of facts, expressions, and rules. In the work in [61], for instance, a formalization was proposed which allows simple axioms for common-sense phenomena. Other works (e.g., [36]) can be said to focus more on context reasoning and less on context modeling, but the overall framework is similar.
Ontology-based models: these models are built using a set of fixed concepts and relationships, although some researchers have proposed using ontologies in frameworks that allow reuse of knowledge and arbitrary numbers of sub-concepts.
Strang and Popien [78] compare the different approaches above based on a fixed set of criteria: distributed composition (how well does the model work in distributed environments?), partial validation (it is generally difficult to validate a model because all of the necessary information is not available at once), richness and quality of information (the model should consider the variability of the quality and availability of information), incompleteness and ambiguity (data may be incomplete or ambiguous), level of formality (since the models are deployed in distributed environments), and applicability to existing environments. Generally speaking, the items to include in a particular model depend on the specific application, and, as mentioned earlier, many techniques exist to create the models and maintain them. Simple key-value models, for instance, can be constructed from the output of sensors: those outputs could be fed into a classifier that assigns each sensor a label, and the combination of such labels, one per sensor, would constitute the context at a particular point in time, as sketched below. It is also important to note that the models used in a particular application need not exclude others – more often than not, a combination of the techniques above is used.
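A minimal sketch of the key-value idea just described, with hypothetical sensors and label functions; none of these names come from the cited surveys.

def build_context(sensor_readings, classifiers):
    """Key-value context model: one label per sensor at a given instant.
    `classifiers` maps sensor name -> function(raw reading) -> label."""
    return {name: classifiers[name](value) for name, value in sensor_readings.items()}

# Hypothetical sensors and label functions.
classifiers = {
    "accelerometer": lambda a: "walking" if a > 1.2 else "still",
    "gps":           lambda pos: "at_home" if pos == (48.52, 9.05) else "away",
    "microphone":    lambda db: "noisy" if db > 60 else "quiet",
}
print(build_context({"accelerometer": 1.6, "gps": (48.52, 9.05), "microphone": 42}, classifiers))
# -> {'accelerometer': 'walking', 'gps': 'at_home', 'microphone': 'quiet'}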
Since ubiquitous user modeling encompasses user interaction with mobile devices as well as with the surrounding environment, it refers to the ongoing modeling and exploitation of user behavior with a variety of systems which share their user models [40], creating a need to model users, devices, and sensors [19], sometimes in a distributed fashion [76]. This means that ideally the user’s behavior is constantly tracked at any time, at any location, and in any interaction context, in particular in what Heckmann defines as situated interaction [40]: the explicit, implicit and indirect interaction between a human and an intelligent environment (or systems and their interaction devices) that takes situation awareness into account. Heckmann’s [40] view is that these user models are shared and that they can be used either for mutual or for individual adaptation goals. In addition, they can be differentiated from generic user modeling by the following three concepts: ongoing modeling, ongoing sharing, and ongoing exploitation. Because of this, and the changing nature of the way we behave, one of the main desired features in ubiquitous user modeling is adaptability, which refers to the capability of a system to maintain a model of some characteristics of its users in order to provide them with information in a way that best suits their needs at a particular time and place [13]. This refers not only to adaptability of the model, but also to how the model is used and, in general, to the way the "system" responds to changing conditions. For illustration purposes we show Figure 2 (from [76]). In the figure, one of the key components of ubiquitous user modeling is the communication and information flow between the different parts of the "system." In this particular example, there are multiple devices and models, and they all communicate through brokers, which in turn contribute to the adaptability discussed above. In ubiquitous UM, another key issue is that of decentralization. This is important in scenarios such as that shown in Figure 2: different devices and objects in the environment may enter or leave a particular user’s context at any given time, so the design of architectures to manage multiple objects plays a key role. As an example, Heckmann et al. [39] present architectural modules for decentralized user modeling, especially the user model exchange language UserML and the General User Model Ontology (GUMO) [40]. In [39] a user model service manages information about users: the u2m.org user model service is an application-independent server with a distributed approach for accessing and storing user information. It provides the possibility of exchanging data between different applications, and adds privacy and transparency information to the statements about the user. The semantics of all concepts are mapped to the GUMO ontology, and applications can retrieve or add information to the server by simple HTTP requests or by a UserML web service.
Along the same lines, Kobsa and Fink [53] present a user modeling server (UMS) as a Lightweight Directory Access Protocol (LDAP) directory server that is complemented by several ’pluggable’ user modeling components and can be accessed
Fig. 2 Information flow in a ubiquitous environment (from Specht et al. [76]).
by external clients. The central Directory Component (see Figure 3) comprises the Communication, Representation, and Scheduler sub-systems. The Communication sub-system is responsible for handling the communication with external clients of the user modeling server (e.g., user-adaptive applications), and with the User Modeling Components which are internal clients of the Directory Component. Each User Modeling Component performs a dedicated user modeling task, such as collaborative filtering, domain-based inferences, etc. The Representation sub-system is in charge of managing the directory contents (mostly user information), while the main tasks of the Scheduler are to wrap the LDAP server in a component interface and to mediate between the different sub-systems and components of the User Modeling Server. The communication infrastructure does not mandate a specific distribution topography, so components can be flexibly distributed across a network of computers, e.g. dependent on available computing resources. As the reader may have noted from the previous examples, one of the key issues in ubiquitous UM is the continuous exchange of information from different sources. This exchange can, of course, be very complex and inaccurate if the models differ significantly, so one of the common approaches to ubiquitous UM is to use ontologies. For illustration purposes we give a brief overview of the GUMO user modeling framework [40]. The GUMO framework is presented for the interpretation of distributed user models in intelligent semantic web enriched environments. One of the key arguments presented in [40] is that a commonly accepted top level ontology for user models would simplify the exchange of user model data between different user-adaptive systems. The authors present a particular ontology based on OWL, a semantic web language, and make a distinction between user model ontology and user modeling ontology, which would additionally include the inference techniques
Fig. 3 Overview of the server architecture from Kobsa and Fink [53]. Components communicate via CORBA (Common Object Request Broker Architecture), Lightweight Directory Access Protocol (LDAP), and Open Database Connectivity (ODBC) APIs.
or knowledge about the research area in general. The motivations behind the authors’ proposal for a commonly accepted top-level ontology include the fact that, aside from the challenges outlined above (different data sources, sensors, devices, etc.), the existence of different ontology make it difficult for different devices to exchange information. In general, different kinds of data that user-adaptive systems may need include user data and usage data [52]. User data is further distinguished into demographic data, user knowledge, user skills and capabilities, user interests and preferences and user goals and plans. Other aspects include physiological state, mental state, emotional state, personality, demographics, and so on. The GUMO framework includes the user’s dimensions that are modeled within user-adaptive systems like the user’s heart beat, the user’s age, the user’s current position, the user’s birthplace or the user’s ability to swim. Figure 4 shows an overview of some of the user’s dimensions included in the GUMO ontology: basic dimensions which include emotional state, characteristics, and personality. Note that depending on the application, only some of the items may be available, and that many of them may require manual input. It is also worthwhile noting that some of the dimensions can change frequently (e.g., emotional state), while others (e.g., personality) are fairly fixed over long periods of time.
Fig. 4 Basic user dimensions: emotional states, characteristics and personality, from Heckmann et al. [40]. The ontology can be browsed at www.gumo.org.
5 Data Mining in User Modeling & Applications In the previous sections we have discussed several aspects of user modeling-at this point it should be apparent to the reader that there are several key challenges in ubiquitous user modeling, among which is the collection of data for creating and maintaining the various knowledge sources for UM (including data from devices and sensors [19]). Of these challenges, we are concerned mainly with the automatic collection of data from mobile devices and how such data can be utilized to build UMs using data mining techniques. In particular, we are interested in constructing UMs from data, that accurately capture user behavior and interests in order to make predictions. Many of the current approaches to build user models by mining data fall into the realm of Reality Mining, which is defined as the collection of machine-sensed environmental data pertaining to human social behavior [28]. Ashbrook and Starner [6], for example, use GPS information collected from users carrying a GPS receiver and logger over several months to predict future user movements. UMs could thus be built to answer questions in situations such as this one: the user is currently at home. "What is the most likely place she will go next?" and "How likely is the user to stop at the grocery store on the way home from work?" Combining the answers to these questions for several users can lead to serendipitous meetings: if the response to "Where is the user most likely to go next?" is the same for two colleagues, they can be alerted that they’re likely to meet each other. Such information could thus be utilized to build UMs that could be used in many end-user mobile applications and also to understand behavior of groups. Eagle and Pentland [28] perform similar tasks in a study of 100 individuals carrying GPS and Bluetooth equipped cell phones over several months. They collect additional information including call and SMS data, as well as data from face to face encounters obtained from the Bluetooth sensors, which are also used to detect presence in certain locations (e.g., at home, work, or similar). UMs with this kind of data can be extensive, ranging from representations of "isolated" individual behav-
ior (e.g., individual movement and activity patterns) to analysis of social networks (e.g., a UM can include information on a person’s social behavior). The possibilities of collecting other types of data have also been explored. For example, Yannakakis et al [81] collect data from heart rate activity of children playing in an interactive physical playground to build user models of individual "fun" preferences. A comprehensive statistical analysis shows that children’s reported entertainment preferences correlate well with specific features of the heart rate signal. Such data can be used for the design of games, but similar frameworks could be exploited for health applications. Another area of interest is the deployment of mobile applications that make recommendations in the context of location-based services [8]. In essence, these are services that use location data [42] in combination with UMs. Within this group, several application scenarios have emerged over the years, many of which can be placed in three major categories: adaptive mobile web guides [55] and museum guides [9, 77], navigation systems, and shopping assistants. Many of these applications have functionalities in common with desktop-based adaptive hypermedia systems, at least in the functionality of the interface. For illustration purposes we show the taxonomy in Figure 5 from [12]. Adaptation of the interface can be divided into two groups: presentation and screen navigation support. Note that in the ubiquitous case (i.e., with mobiles) these two categories gain special significance given the display and input limitations.
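In the spirit of (though not reproducing) the GPS-based prediction studies discussed above, a toy first-order Markov model over a user's significant places can answer questions like "Where is the user most likely to go next?". The place names and the transition history below are hypothetical.

from collections import Counter, defaultdict

class NextPlacePredictor:
    """First-order Markov model over a user's sequence of significant places."""
    def __init__(self):
        self.transitions = defaultdict(Counter)

    def train(self, place_sequence):
        # Count observed transitions between consecutive places.
        for here, there in zip(place_sequence, place_sequence[1:]):
            self.transitions[here][there] += 1

    def predict(self, current_place):
        nxt = self.transitions.get(current_place)
        return nxt.most_common(1)[0][0] if nxt else None

# Hypothetical location history for one user.
model = NextPlacePredictor()
model.train(["home", "work", "grocery", "home", "work", "home", "work", "grocery", "home"])
print(model.predict("work"))   # -> 'grocery'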
6 Human Factors

Human factors, from cognitive modeling to the modeling of cultural differences and privacy, are of great importance in UM. Culture plays a key role in computing in general, but it is particularly important in ubiquitous environments, specifically due to the recent proliferation of mobile technologies in developing regions. Although the importance of culture is recognized, the integration of cultural factors into the majority of computing applications has in general received little attention [45]. Precisely because of the widespread adoption of mobile phones, however, there has been a stronger focus on understanding cultural differences, not only for UM but also for making mobile-device design decisions. We discuss a few examples. Choi and Lee [26], for instance, present a study on cultural influences on mobile data service design. The authors identify twenty-one critical user-experience attributes that show a clear correlation with characteristics of the user's culture. Marcus and Gould [60] focus on web design using cultural factors, placing particular emphasis on Hofstede's [43] framework of cultural dimensions. Hofstede carried out research in 72 countries, resulting in a table that assigns each surveyed country a score on each of five dimensions: Power Distance, Individualism vs. Collectivism, Masculinity vs. Femininity, Uncertainty Avoidance, and Long-term vs. Short-term Orientation.
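A table of per-country dimension scores like Hofstede's is straightforward to operationalize in a user model. The sketch below is a minimal, hypothetical illustration: the country names and numbers are placeholders rather than Hofstede's published scores, and the Euclidean distance is just one plausible way an adaptive application might quantify how different two cultural profiles are.

```python
import math

# Hypothetical scores on Hofstede's five dimensions (0-100 scale);
# placeholder values only, not the published survey results.
DIMENSIONS = ["power_distance", "individualism", "masculinity",
              "uncertainty_avoidance", "long_term_orientation"]

COUNTRY_SCORES = {
    "CountryA": {"power_distance": 35, "individualism": 80, "masculinity": 60,
                 "uncertainty_avoidance": 40, "long_term_orientation": 30},
    "CountryB": {"power_distance": 70, "individualism": 25, "masculinity": 50,
                 "uncertainty_avoidance": 85, "long_term_orientation": 75},
}

def cultural_distance(country_x, country_y):
    """Euclidean distance between two countries' dimension scores."""
    a, b = COUNTRY_SCORES[country_x], COUNTRY_SCORES[country_y]
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in DIMENSIONS))

print(round(cultural_distance("CountryA", "CountryB"), 1))
```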
Fig. 5 A taxonomy of adaptive hypermedia technologies (from Brusilovsky [12])
As a more detailed example, Jhangiani [47] presents results of a study with users in the U.S.A. and India aimed at identifying possible relationships between national culture, disability culture, and design preferences for cell phone interfaces. When analysis was performed at the group level in focus groups, correlations were found between design preferences for cell phone features (based on Choi et al.'s study [26]) and the underlying cultural dimensions proposed by Hofstede. Differences were also found between disability groups in the ratings of hardware attributes, and usability ratings differed by nationality and disability group. Content analysis of the focus group sessions in that study provided insight into preferences for cell phone interface components and gave a better understanding of the mobile/cell phone culture in the two countries (U.S.A. and India). In [47], these results are summarized to provide guidelines for designing cross-cultural user interfaces that are
nationality-specific and disability-specific. The pyramid model in Figure 6 summarizes a proposed holistic process for designing cell phones for users with disabilities, integrating the findings of [47] with the pleasurability framework of [48]. The framework consists of four pleasures: sociological (relationships with others), ideological (people's values and beliefs), psychological (cognitive and emotional characteristics), and physiological (anthropometrics). In Figure 6, each of these is mapped to specific design aspects of mobile devices for people with disabilities. Although the work in [47] deals with design, the issues studied are highly relevant to the construction of user models.
Fig. 6 A mobile design pyramid model (from Jhangiani [47])
As a third example, Reinecke et al. [71] introduce a cultural user model ontology (CUMO). The approach is intended to alleviate the bootstrapping problem (the tedious collection of assumptions about the user's preferences when the user model is employed for the first time). This particular UM incorporates Hofstede's cultural dimensions together with further concepts that factor cultural influences into the user model ontology. These include classes and properties that describe cultural influence through different places of residence. CUMO comprises the user's birthplace (Figure 7), the currentResidence, and the formerResidence, all having the range Location. Location is further subdivided into the subclasses Continent and Country, and so on. For the definition of concepts in CUMO, the authors assume that information about the user is acquired both from an initial questionnaire and from the user's interaction with an application. For instance, tracking mouse movements, such as how much time is spent hovering or how many errors were made, gives information about how the user copes with adaptation.
Fig. 7 Culturally influencing concepts in CUMO (from Reinecke et al [71])
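To make this structure concrete, the sketch below captures a CUMO-like fragment using plain Python dataclasses. It is a rough approximation for illustration only, not Reinecke et al.'s implementation (CUMO is an ontology rather than program code); the concept names follow the description above, while the example values and the hofstede_scores attribute are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Continent:
    name: str

@dataclass
class Country:
    name: str
    continent: Continent

@dataclass
class Location:
    """Range of the residence/birthplace properties, simplified here."""
    country: Country

@dataclass
class CulturalUserModel:
    """CUMO-like fragment: cultural influence via places of residence."""
    birthplace: Location
    current_residence: Location
    former_residences: List[Location] = field(default_factory=list)
    # Per-user estimates on Hofstede's dimensions (assumed representation).
    hofstede_scores: Optional[Dict[str, float]] = None

user = CulturalUserModel(
    birthplace=Location(Country("Spain", Continent("Europe"))),
    current_residence=Location(Country("Japan", Continent("Asia"))),
)
print(user.current_residence.country.name)  # -> "Japan"
```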
Another topic highly relevant to UM is the management of privacy. There is a large body of literature on privacy issues, so we concentrate on a single example to give the reader some insight into the issues involved. Lehikoinen et al. [57] analyze privacy from the perspective of the social sciences and its relationship to technology. In particular, their analysis is based on extending Altman's [64] theoretical privacy framework to privacy in ubiquitous computing. According to Altman, privacy is a two-way interactive process, which makes it a promising framework for ubiquitous environments in which people, devices, and the environment interact with each other [57]. The authors extend the social framework and provide examples of control mechanisms in each communication context (Table 2).
Table 2 Examples of control mechanisms available in different forms of interaction (from Lehikoinen et al [57])
7 Evaluation

Evaluation of UM is important but very challenging because, to a large extent, the goal is to satisfy individual users, so subjective experiences need to be considered. Evaluation of UMs in ubiquitous environments, in contrast to "traditional" UM evaluation [24, 53, 25], poses additional challenges. We discuss some examples as an overview of these challenges. Theofanos and Scholtz [79] give a brief overview of the evaluation of UM in ubiquitous environments in the context of a restaurant ordering application. In particular, they define nine evaluation areas: attention, adoption, trust, conceptual models, interaction, invisibility, impact and side effects, appeal, and application robustness. Froehlich et al. [34] discuss three methods for in situ evaluation of (and with) ubiquitous computing devices that augment interviewing and questionnaire techniques: wizard-of-oz techniques, in situ data collection for self-report, and the use of a mobile device. In the first method, data is collected manually by an evaluator who directly contacts participants (evaluator initiated and evaluator entered). In the second method, data collection is initiated by context triggers (location sensing in particular), and participants enter self-reported data. In the third method, data collection is participant initiated (rather than evaluator or context determined) and participant entered. As a final topic, Miettinen and Oulasvirta [63] discuss how a mobile environment implies that the user will constantly "switch" between tasks, and that this switching must be taken into account in the design of ubiquitous systems. We show some interesting conclusions on "time-sharing" from [63] in Table 3.

In summary, evaluation of ubiquitous user modeling includes evaluation of the technical components (e.g., connections between devices, accuracy of algorithms, etc.) and human evaluation. Human evaluation refers to evaluation of the interaction within a specific application context and environment. One of the key challenges is that such evaluation depends on the interaction (and interface) as well as on the model itself and the context in which the "application" is deployed. Proper evaluation of UMs therefore requires several methodologies to obtain user feedback in ways that can be generalized and used in the UM design process, while keeping such UMs adaptable enough for multiple, diverse users.
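As a rough illustration of the second of Froehlich et al.'s methods (context-triggered, participant-entered self-reports), the following sketch shows one plausible shape such in situ data collection could take. The place-sensing stub, prompt text, polling interval, and log format are all assumptions for illustration and not the implementation described in [34].

```python
import json
import time

def current_place():
    """Stand-in for real location sensing; returns a symbolic place label."""
    return "cafeteria"

def prompt_self_report(question):
    """Stand-in for an on-device questionnaire prompt."""
    return input(question + " ")

def run_sampling(log_path, poll_seconds=60):
    """Trigger a self-report whenever the sensed place changes.

    Runs until interrupted; each report is appended as one JSON line."""
    last_place = None
    while True:
        place = current_place()
        if place != last_place:
            answer = prompt_self_report(
                f"You just arrived at '{place}'. What are you doing right now?")
            record = {"t": time.time(), "place": place, "report": answer}
            with open(log_path, "a") as f:
                f.write(json.dumps(record) + "\n")
            last_place = place
        time.sleep(poll_seconds)
```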
8 Conclusions & Future Directions

In this chapter we have given a brief overview of user modeling and of data mining in the context of ubiquitous user modeling applications. We introduced basic concepts and outlined techniques in user modeling and Knowledge Discovery and Data Mining (KDD), gave an overview of topics in context modeling and in data mining for user modeling, and briefly discussed human factors and evaluation issues. User Modeling (UM) is a large cross-disciplinary field involving HCI, psychology, AI, design, and several others, with applications in virtually all areas
Table 3 Predicting time-sharing strategies, examples, and possible rationales (from Miettinen and Oulasvirta [63])
of computing in which user behavior is important. The goal of this chapter was therefore not to serve as a tutorial (many good survey and tutorial articles are cited for each sub-topic), but rather to give an overview of this important and challenging discipline. Indeed, several important topics were omitted or mentioned only briefly. Many of the technical challenges of "traditional" UM remain in ubiquitous UM, while additional challenges emerge, including the following:
• Decentralization and integration (construction and acquisition of distributed user models)
• Distributed architectures and interoperability of personalized applications
• User control of ubiquitous data and data ownership
• Privacy, security and trust in decentralized user modelling
• Ethics, cultural heritage, languages, diversity and accessibility in ubiquitous knowledge discovery
• Knowledge modelling, integration and management for personalization in constrained environments
• Semantics of ubiquity and ubiquitous knowledge
• Memory models (augmented memories and shared memories)
• Reasoning methods in constrained environments
• Collection and formats of content, structure, and usage data in ubiquitous environments
• Automated user modeling, and mining of content, structure, and usage data in ubiquitous environments
• Evaluation of user models and user adaptivity

These are of course closely related to additional challenges from an interaction perspective (see [68]). A lot of progress has been made in UM, and the advent of ever more powerful mobile devices and greater connectivity (not just bandwidth, but also standards) is opening the door to the massive spread of UM technologies in ubiquitous settings. This is fueled in part by the success of the WWW, in which personalization plays an increasingly significant role, and also by the greater ease of collecting and re-using user data in mobile settings. One area likely to see strong growth is the combination of UM and mobile social networking [70], but the number of application areas (e-commerce, telecommunications, health, entertainment, etc.) is very large, making ubiquitous UM a key discipline for the future of computing.

Acknowledgements The author would like to thank the editors of this volume for their guidance, patience, and support.
References

[1] van der Aalst WMP, van Dongen BF, Herbst J, Maruster L, Schimm G, Weijters AJMM (2003) Workflow mining: a survey of issues and approaches. Data Knowl Eng 47(2):237–267
[2] Adomavicius G, Tuzhilin A (2005) Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17(6):734–749
[3] Albrecht D, Zukerman I (2007) Introduction to the special issue on statistical and probabilistic methods for user modeling. User Modeling and User-Adapted Interaction 17(1-2):1–4
[4] Amatriain X, Jaimes A, Oliver N, Pujol JM (2010) Data Mining Fundamentals for Recommender Systems. In: Handbook of Recommender Systems. Springer, Hingham, MA, USA
[5] Ardissono L, Maybury M (2004) Preface: Special issue on user modeling and personalization for television. User Modeling and User-Adapted Interaction 14(1):1–3
[6] Ashbrook D, Starner T (2003) Using GPS to learn significant locations and predict movement across multiple users. Personal Ubiquitous Comput 7(5):275–286
[7] Berkhin P (2006) A survey of clustering data mining techniques
[8] Berkovsky S (2005) Ubiquitous user modeling in recommender systems. In: Ardissono L, Brna P, Mitrovic A (eds) Proceedings of the 10th international conference on user modelling, pp 496–498
[9] Bohnert F, Zukerman I, Berkovsky S, Baldwin T, Sonenberg L (2008) Using interest and transition models to predict visitor locations in museums. AI Commun 21(2-3):195–202
[10] Bolchini C, Curino CA, Quintarelli E, Schreiber FA, Tanca L (2007) A data-oriented survey of context models. SIGMOD Rec 36(4):19–26
[11] Brown P, Bovey J, Chen X (1997) Context-aware applications: from the laboratory to the marketplace. IEEE Personal Communications 4(5):58–64
[12] Brusilovsky P (2001) Adaptive hypermedia. User Modeling and User-Adapted Interaction 11(1-2):87–110
[13] Brusilovsky P, Maybury MT (2002) From adaptive hypermedia to the adaptive web. Commun ACM 45(5):30–33
[14] Buchanan BG, Shortliffe EH (1984) Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. The Addison-Wesley series in artificial intelligence. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA
[15] Buchholz S, Hamann T, Hübsch G (2004) Comprehensive structured context profiles (CSCP): Design and experiences. IEEE International Conference on Pervasive Computing and Communications Workshops, p 43
[16] Burke R (2002) Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction 12(4):331–370
[17] Capra L, Emmerich W, Mascolo C (2001) Reflective middleware solutions for context-aware applications. In: REFLECTION '01: Proceedings of the Third International Conference on Metalevel Architectures and Separation of Crosscutting Concerns, Springer-Verlag, London, UK, pp 126–133
[18] Carberry S (2001) Techniques for plan recognition. User Modeling and User-Adapted Interaction 11(1):1–31
[19] Carmichael DJ, Kay J, Kummerfeld B (2005) Consistent modelling of users, devices and sensors in a ubiquitous computing environment. User Modeling and User-Adapted Interaction 15(3-4):197–234
[20] Chakrabarti S (2000) Data mining for hypertext: A tutorial survey. ACM SIGKDD Explorations 1:1–11
[21] Chen G, Kotz D (2000) A survey of context-aware mobile computing research. Tech. rep., Hanover, NH, USA
[22] Chen PPS (1976) The entity-relationship model—toward a unified view of data. ACM Transactions on Database Systems 1(1):9–36
[23] Cheverst K, Mitchell K, Davies N (1999) Design of an object model for a context sensitive tourist guide. In: Computers and Graphics, pp 883–891
[24] Chin DN (2001) Empirical evaluation of user models and user-adapted systems. User Modeling and User-Adapted Interaction 11(1-2):181–194
[25] Chin DN, Crosby ME (2002) Introduction to the special issue on empirical evaluation of user models and user modeling systems. User Modeling and User-Adapted Interaction 12(2-3):105–109
[26] Choi B, Lee I, Kim J (2006) Culturability in mobile data services: A qualitative study of the relationship between cultural characteristics and user-experience attributes. International Journal of Human-Computer Interaction 20(3):171–206
[27] Coutaz J, Crowley JL, Dobson S, Garlan D (2005) Context is key. Communications of the ACM 48(3):49–53
[28] Eagle N, Pentland A (2006) Reality mining: sensing complex social systems. Personal Ubiquitous Comput 10(4):255–268
[29] Eirinaki M, Vazirgiannis M (2003) Web mining for web personalization. ACM Trans Internet Technol 3(1):1–27
[30] Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery: an overview. In: Advances in Knowledge Discovery and Data Mining. American Association for Artificial Intelligence, Menlo Park, CA, USA, pp 1–34
[31] Fischer G (2001) User modeling in human–computer interaction. User Modeling and User-Adapted Interaction 11(1-2):65–86
[32] Frawley W, Piatetsky-Shapiro G, Matheus C (1992) Knowledge discovery in databases - an overview. AI Magazine 13:57–70
[33] Freitas A (2003) A survey of evolutionary algorithms for data mining and knowledge discovery. Advances in evolutionary computing: theory and applications pp 819–845
[34] Froehlich J, Chen MY, Consolvo S, Harrison B, Landay JA (2007) MyExperience: a system for in situ tracing and capturing of user feedback on mobile phones. In: MobiSys '07: Proceedings of the 5th international conference on Mobile systems, applications and services, ACM, New York, NY, USA, pp 57–70
[35] Getoor L, Diehl CP (2005) Link mining: a survey. SIGKDD Explorations Newsletter 7(2):3–12
[36] Ghidini C, Giunchiglia F (2001) Local models semantics, or contextual reasoning = locality + compatibility. Artificial Intelligence 127(2):221–259
[37] Goebel M, Gruenwald L (1999) A survey of data mining and knowledge discovery software tools. SIGKDD Explor Newsl 1(1):20–33
[38] Hand D, Mannila H, Smyth P (2001) Principles of Data Mining. MIT Press, Cambridge, Massachusetts
[39] Heckmann D, Schwartz T, Brandherm B, Kröner A (2005) Decentralized user modeling with UserML and GUMO. In: Dolog P, Vassileva J (eds) Decentralized, Agent Based and Social Approaches to User Modeling, Workshop DASUM-05 at the 9th International Conference on User Modelling, UM2005, Edinburgh, Scotland, pp 61–66
[40] Heckmann D, Schwartz T, Brandherm B, Schmitz M, von Wilamowitz-Moellendorff M (2005) GUMO: The general user model ontology. User Modeling 2005, pp 428–432
[41] Henricksen K, Indulska J, Rakotonirainy A (2003) Generating context management infrastructure from high-level context models. In: 4th International Conference on Mobile Data Management (MDM), pp 1–6
[42] Hightower J, Borriello G (2001) Location systems for ubiquitous computing. Computer 34(8):57–66
[43] Hofstede G (1984) Culture's consequences: international differences in work-related values. SAGE Press, London and Beverly Hills
[44] Hosking JRM, Pednault EPD, Sudan M (1997) A statistical perspective on data mining. Future Gener Comput Syst 13(2-3):117–134
[45] Jaimes A, Sebe N, Gatica-Perez D (2006) Human-centered computing: a multimedia perspective. In: Proceedings of the 14th annual ACM international conference on Multimedia, ACM, New York, NY, USA, pp 855–864
[46] Jameson A, Krüger A (2005) Preface to the special issue on user modeling in ubiquitous computing. User Modeling and User-Adapted Interaction 15(3-4):193–195
[47] Jhangiani I (2006) A cross-cultural comparison of cell phone interface design preferences from the perspective of nationality and disability. Master's thesis, Virginia Polytechnic Institute and State University, Blacksburg, Virginia
[48] Jordan PW (2000) Designing Pleasurable Products: An Introduction to the New Human Factors. Taylor and Francis, London
[49] Kagal L, Korolev V, Chen H, Joshi A, Finin T (2001) Centaurus: a framework for intelligent services in a mobile environment. In: Proceedings of the International Workshop on Smart Appliances and Wearable Computing (IWSAWC), pp 195–201
[50] Kantardzic M (2003) Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley and Sons, New York
[51] Keogh E, Kasetty S (2002) On the need for time series data mining benchmarks: a survey and empirical demonstration. In: KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, New York, NY, USA, pp 102–111
[52] Kobsa A (2001) Generic user modeling systems. User Modeling and User-Adapted Interaction 11(1-2):49–63
[53] Kobsa A, Fink J (2006) An LDAP-based user modeling server and its evaluation. User Modeling and User-Adapted Interaction 16(2):129–169
[54] Kosala R, Blockeel H (2000) Web mining research: a survey. SIGKDD Explor Newsl 2(1):1–15
[55] Krüger A, Baus J, Heckmann D, Kruppa M, Wasinger R (2007) Adaptive mobile guides. In: The Adaptive Web. Springer, Hingham, MA, USA, pp 521–549
[56] Larson H (1969) Introduction to Probability Theory and Statistical Inference. Wiley International Edition, New York
[57] Lehikoinen JT, Lehikoinen J, Huuskonen P (2008) Understanding privacy regulation in ubicomp interactions. Personal Ubiquitous Comput 12(8):543–553
[58] Ling CX, Li C (1998) Data mining for direct marketing: Problems and solutions. In: Proceedings of Fourth International Conference on Knowledge Discovery and Data Mining (KDD 98), pp 73–79
[59] Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1(1):24–45
[60] Marcus A, Gould EW (2000) Crosscurrents: cultural dimensions and global web user-interface design. interactions 7(4):32–46
[61] McCarthy J, Buvac S (1994) Formalizing context (expanded notes). Tech. rep., Stanford, CA, USA
[62] McGarry K (2005) A survey of interestingness measures for knowledge discovery. Knowl Eng Rev 20(1):39–61
[63] Miettinen M, Oulasvirta A (2007) Predicting time-sharing in mobile interaction. User Modeling and User-Adapted Interaction 17(5):475–510
[64] Altman I (1975) The Environment and Social Behavior. Brooks Cole Publishing, Monterey
[65] Miller HJ, Han J (2001) Geographic Data Mining and Knowledge Discovery. Taylor & Francis, Inc., Bristol, PA, USA
[66] Mitra S, Pal S, Mitra P (2002) Data mining in soft computing framework: a survey. Neural Networks, IEEE Transactions on 13(1):3–14
[67] Ferreira de Oliveira M, Levkowitz H (2003) From visual data exploration to visual data mining: a survey. Visualization and Computer Graphics, IEEE Transactions on 9(3):378–394
[68] Peintner B, Viappiani P, Yorke-Smith N (2008) Preferences in interactive systems: Technical challenges and case studies. AI Magazine 29(4):13–24
[69] Pierrakos D, Paliouras G, Papatheodorou C, Spyropoulos CD (2003) Web usage mining as a tool for personalization: A survey. User Modeling and User-Adapted Interaction 13(4):311–372
[70] Preibusch S, Hoser B, Gürses S, Berendt B (2007) Ubiquitous social networks - opportunities and challenges for privacy-aware user modelling. In: Proceedings of Workshop on Data Mining for User Modeling
[71] Reinecke K, Reif G, Bernstein A (2007) Cultural User Modeling With CUMO: An Approach to Overcome the Personalization Bootstrapping Problem. In: First International Workshop on Cultural Heritage on the Semantic Web at the 6th International Semantic Web Conference (ISWC 2007), Busan, South Korea
[72] Roddick JF, Spiliopoulou M (2002) A survey of temporal knowledge discovery paradigms and methods. IEEE Trans on Knowl and Data Eng 14(4):750–767
[73] Ryan N (1999) ConteXtML: Exchanging contextual information between a mobile client and the FieldNote server. URL http://www.cs.kent.ac.uk/projects/mobicomp/fnc/ConteXtML.html
[74] Schmidt A, Van Laerhoven K (2001) How to build smart appliances? Personal Communications, IEEE [see also IEEE Wireless Communications] 8(4):66–71
[75] Shardanand U, Maes P (1995) Social information filtering: algorithms for automating "word of mouth". In: CHI '95: Proceedings of the SIGCHI conference on Human factors in computing systems, ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, pp 210–217
[76] Specht M, Lorenz A, Zimmermann A (2005) Towards a framework for distributed user modelling for ubiquitous computing. In: DASUM 2005 workshop in conjunction with User Modeling 2005
[77] Stock O, Zancanaro M, Busetta P, Callaway C, Krüger A, Kruppa M, Kuflik T, Not E, Rocchi C (2007) Adaptive, intelligent presentation of information for the museum visitor in PEACH. User Modeling and User-Adapted Interaction 17(3):257–304
[78] Strang T, Popien CL (2004) A context modeling survey. In: UbiComp 1st International Workshop on Advanced Context Modelling, Reasoning and Management, Nottingham, pp 31–41
[79] Theofanos M, Scholtz J (2001) A diner's guide to evaluating a framework for ubiquitous computing applications. In: Davis H, Douglas Y, Duran DG (eds) Hypertext 2001. Proceedings of the Twelfth ACM Conference on Hypertext and Hypermedia, ACM Press
[80] Warren Liao T (2005) Clustering of time series data - a survey. Pattern Recognition 38(11):1857–1874
[81] Yannakakis GN, Hallam J, Lund HH (2008) Entertainment capture through heart rate activity in physical interactive playgrounds. User Modeling and User-Adapted Interaction 18(1-2):207–243
[82] Zimmermann A, Specht M, Lorenz A (2005) Personalization and context management. User Modeling and User-Adapted Interaction 15(3-4):275–302
[83] Zukerman I, Albrecht DW (2001) Predictive statistical models for user modeling. User Modeling and User-Adapted Interaction 11(1-2):5–18
[84] Zukerman I, Albrecht DW, Nicholson AE (1999) Predicting users' requests on the WWW. In: Proceedings of the Seventh International Conference on User Modeling (UM 99), pp 275–284
Experience Research: a Methodology for Developing Human-centered Interfaces

Boris de Ruyter and Emile Aarts

Boris de Ruyter, Philips Research Europe, HTC 34, AE 5656 Eindhoven, The Netherlands
Emile Aarts, Philips Research Europe, HTC 34, AE 5656 Eindhoven, The Netherlands
1 Introduction

At the turn of the century, a distinguished group of researchers identified the potentially devastating effects of rapid technological developments, as described by the generalized Moore's law, for the balanced relationship between humans and technology (Aarts et al., 2001). Whilst not ignoring the threats and risks of so-called technology push, the Ambient Intelligence (AmI) vision was introduced to emphasize the positive contribution these technologies could bring to our daily lives. Within the AmI vision human needs are positioned centrally, and technology is seen as a means to enrich our lives. In coarse terms, Ambient Intelligence refers to the embedding of technologies into electronic environments that are sensitive and responsive to the presence of people. In Ambient Intelligence, the term ambience refers to technology being embedded on a large scale in such a way that it becomes unobtrusively integrated into everyday life and environments. Hence, the ambient characteristic of AmI has both a physical and a social meaning. The challenge for ambient technologies is to become invisible while still providing meaningful functionality to these end users (Weiser, 1991). Such a challenge is not trivial to address and requires radically different approaches to human – system interaction (De Ruyter et al., 2005). The term intelligence reflects the situation in which the digital surroundings exhibit specific forms of cognition, i.e. the environments should be able to recognize the people that inhabit them, personalize according to individual preferences, adapt themselves to the users, learn from their behavior and possibly act on their behalf. In AmI we distinguish between several levels of system intelligence: context aware, personalized, adaptive and anticipatory system intelligence:
• Context aware. The environment can determine the context in which certain activities take place, where context refers to meaningful information about persons and the environment, such as positioning and identification.
• Personalized. The environment can be tailored to the individual needs of users. It can recognize users and adjust its appearance to maximally support them. Automatic user profiling can capture individual user profiles through which personalized settings and information filtering can be accommodated.
• Adaptive. The environment can change in response to the users' needs. It can learn from recurring situations and changing needs, and adjust accordingly.
• Anticipatory. The environment can act on the user's behalf without conscious mediation. It can extrapolate behavioral characteristics and generate pro-active responses.

The user benefits of the AmI paradigm are aimed at improving the quality of people's lives by creating the desired atmosphere and functionality via system intelligence and interconnected systems and services. Several futuristic scenarios have illustrated how technology can become supportive in people's daily lives (ISTAG, 2001). By mid-2006, a consortium of five European partners, grouped under the name SWAMI, had focused further on some potential threats and vulnerabilities of AmI scenarios. Although these scenarios are built around the same technological developments that AmI was responding to, they are distinctive in their focus on addressing human needs. With these "dark side" scenarios, the SWAMI consortium emphasized that positioning human needs in the centre of technology development is not enough to ensure that the balance between humans and technology will be safeguarded. Although scenarios have been written and books have been published, no solution to this problem has been provided other than suggestions for developing more technology (e.g., security-related algorithms).

As technology and society are changing, the vision of AmI has also changed over the years. New requirements for the enabling technologies that relate to ethics, new methodologies for empirical research to better understand the context in which these applications will be positioned, and a shift from system intelligence to social intelligence are just some examples of challenges that call for a paradigm shift in Ambient Intelligence research. Next we discuss some important trends that influence not only the definition of Ambient Intelligence but also its research approach.
1.1 From Entertaining to Caring

Whereas Ambient Intelligence research has traditionally focused on user experiences in more entertainment-oriented scenarios, there is a recent move towards the deployment of Ambient Intelligence technologies for Wellbeing and Care related application scenarios. Wellbeing and Care applications cut across the domains of Lifestyle (e.g., persuasive fitness applications) and Healthcare (e.g., remote patient monitoring systems for chronic care patients). It should be clear that the development of applications related to our wellbeing and care will demand some
important shifts in the Ambient Intelligence research paradigm. Possible ambient assisted living applications, as well as future challenges for AmI research, are discussed in more detail in De Ruyter and Pelgrim (2007).
1.2 From System Intelligence to Social Intelligence

As AmI technologies are becoming a larger part of our daily lives and are taking on the role of coaching and caring solutions, there are increased expectations that AmI technologies will adapt and fit into social contexts. With this we observe that the levels of system intelligence in the AmI paradigm require complementing with social intelligence. The earliest definition of Social Intelligence was coined by Thorndike (1920) and described as: "the ability to understand and manage other people and to engage in adaptive social interactions". To adhere to and behave in a socially intelligent manner is clearly a new challenge for AmI environments, for which new destinations have been identified in the area of wellbeing and care.

When we consider human-human social interactions, we see that there are several characteristics that make certain individuals stand out and be more liked by others, or which convey an air of trustworthiness, competence and dependability (Ford & Tisak, 1983). This list of socially intelligent characteristics is large and includes attributes like "being nice and pleasant to interact with" and "being sensitive to other people's needs and desires". In its broadest definitions social intelligence is "...a person's ability to get along with people in general, social technique or ease in society, knowledge of social matters, susceptibility to stimuli from other members of a group, as well as insight into the temporary moods of underlying personality traits of strangers" (Vernon, 1933). So the socially intelligent person has a better than average ability to judge other people's feelings, thoughts, attitudes and opinions, intentions, or the psychological traits that may determine their behavior. This judgment creates expectations on the observer's part about the likely behavior of the observed person. This in turn leads to adjustments of one's own behavior accordingly and appropriately. However, that appropriateness can only be judged when the social context is taken into account. In this sense, social intelligence is not merely something that goes on between two people in isolation; contextual factors also come into play. The complexity of and challenges in designing socially intelligent systems are further discussed in Green and De Ruyter (2008). Social intelligence in AmI environments can take the form of a socialized, empathic or conscious system (see Figure 1).
Socialized

AmI environments that are socialized are compliant with social conventions. For example, in a sensing environment some form of system intelligence can be context aware and thus know that a person is in a private situation.
Fig. 1 The extended AmI model
A personalized system would know that it is the user's preference not to be disturbed in such a situation. An intelligent system that is socialized would use common sense knowledge not to disturb the person in such a context. From this simple example it should also be clear that, although we make a distinction between system and social intelligence at the conceptual level, at the implementation level both forms of intelligence need to come together.
Empathic

An empathic system is able to take into account the inner state of emotions and motives a person has and adapt to this state. Empathy is described as the intellectual or imaginative apprehension of another's condition or state of mind without actually experiencing that person's feelings (Hogan, 1969). For example, a form of system intelligence could infer that a person is getting frustrated, while a socially intelligent system with empathic capabilities would trigger the AmI environment to demonstrate understanding and helpful behavior towards the person.
Conscious

Ultimately, a conscious system would not only be aware of the inner state of the person but also of its own inner state. With such a level of social intelligence, the conscious system could anticipate the effect a person is trying to have on the system. With this level of social intelligence it will be possible to develop rich and human-like interactions in AmI environments.
Discussion

Ambient Intelligence is a vision of the development of technology applications with an emphasis on creating end-user experiences that highlight the user benefits of new technology applications. Throughout the years of its existence, the vision has undergone several changes, and the basic model of Ambient Intelligence has been extended with the notion of social intelligence. It is clear that the Ambient Intelligence paradigm has a strong impact on the methodologies and instruments that are used for application-driven research. This impact is further discussed in the next section.
2 Experience Research

The design of ambient intelligent environments differs markedly from the design of classical single-device systems. AmI environments introduce new options for services and applications by focusing on the desired functionality rather than on the devices traditionally needed for each individual function. The fact that the technology will be integrated into these environments introduces the need for novel interaction concepts that allow users to communicate with their electronic environment in a natural way. When aiming at user experiences, requirements engineering for AmI environments has to take a step beyond the development of scenarios and the translation of use cases into system requirements. For this we propose an iterative empirical research cycle that consists of three phases: studies in the context, the laboratory and the field (see Figure 2).
Fig. 2 The Experience Research cycle
Although the presentation of the three phases might suggest a sequential approach, it should be noted that the implementation is iterative. From traditional User Centered Design cycles it is known that each phase in the research cycle will
provide new insights but might also require the researcher to go back into one of the previous phases (De Ruyter, 2003). The iterative Experience Research cycle thus consists of three phases: studies in (i) context, (ii) the laboratory, and (iii) the field. Whereas the context studies focus on collecting initial user requirements without introducing any new technology applications, the laboratory studies and field studies focus on the evaluation of new propositions in a controlled setting and in a real-life setting, respectively. Whilst some studies have reported very limited added value in conducting both laboratory and field studies (Kaikkonen et al., 2005), others have highlighted the added value of conducting field studies (McDonald et al., 2006). Although Tory and Staub-French (2008) classify empirical studies in laboratory versus field settings as quantitative versus qualitative studies, we believe that such a classification is an oversimplification. Both types of studies highlight different aspects of the user – system interaction, and both allow for the collection of qualitative as well as quantitative data. The different phases are now further discussed.
2.1 Context Studies

In context studies the focus is on today's reality, without introducing any new technology applications. By using ethnographic techniques1 (such as observations, in-situ interviews and diary studies), users are studied in their natural environment (Beyer & Holzblatt, 1998). Context studies can be seen as a way to understand the context in which future technology applications will be positioned. Although the suggested approach is not participatory design, in which the end-user is not only the object of study but also the co-designer of new technology applications (Moss & Hunt, 1927), such participatory design sessions can be conducted as a follow-up for applying the results of the context study in the design of new technology applications. As a technique for collecting contextual data, the context mapping approach (Sleeswijk Visser et al., 2005) has proven very successful in gaining insight into the context of use for future technology applications. Also within the context of user – system interaction research, Rose, Shneiderman and Plaisant (1995) have demonstrated that applying ethnographic techniques in context studies provides both qualitative and quantitative insights for improving interactive systems. Once these studies have been completed, there is usually an overload of rich contextual data, and the challenge is to abstract meaningful but not trivial insights without losing valuable information. It goes without saying that this is a difficult and cumbersome trajectory. All too often ethnographers complain that the step from the rich contextual data towards abstracted user insights is problematic due to information loss.

1 While ethnographic techniques are not restricted to context studies, it is important to note that in our positioning of context studies the aim is to study end users in their natural environments without interfering by, for example, introducing new technology applications. The latter studies will be part of field studies.
Nevertheless, one needs to make such a transition, as the contextual data is too extensive to work with. One approach that has proven to be useful is to compile some very high-level insights from the contextual data and then to return to the rich contextual data to further understand these insights. This approach is very consistent with the approach to working with rich ethnographic data suggested by Iqbal et al. (2005). For example (see Figure 3), when studying the role of social networks in real-life settings, one can conclude that people need to receive appreciation from their social network. At first one might argue that such a high-level statement is rather obvious and that such an insight was known even before conducting the contextual study. Or, to quote Beyer & Holzblatt: "The complexity of work is overwhelming, so people oversimplify". However, this is not where the analysis of the rich contextual data stops but rather where it begins: in the next step one goes back to the rich contextual data and explores the exact instances that led to the generation of the high-level statement. In this example, one would go back into the rich data to understand how people experience and express appreciation through their social network. With this second step it becomes possible to go beyond the obvious in the high-level insights formulated during the first exploration of the rich contextual data. Using this two-step approach of abstracting and detailing, it becomes possible to formulate valuable insights from the rich contextual data.
Fig. 3 From rich context mapping data towards user insights
2.2 Lab Studies

System functionalities that generate true user experiences can be studied in a reliable way by exposing users to feasible prototypes that provide proofs of concept. These are called experience prototypes, and they can be developed by means of a user-centered design approach that applies both feasibility and usability studies in order to develop a mature interaction concept. For this, laboratories are needed that contain infrastructures supporting fast prototyping of novel interaction concepts in environments that simulate realistic contexts of use. Moreover, these experience prototyping centers should also be equipped with an observation infrastructure that can capture and analyze the behavior of people who interact with the experience prototypes. Philips' ExperienceLab is an example of such an experience and application research facility. It combines the opportunity for both feasibility and usability research into user-centric innovation, leading to a better understanding of (latent) user needs and the technologies that really matter from a user perspective. The use of the ExperienceLab is discussed in more detail further in this chapter. Since the opening of ExperienceLab in 2001, several lessons have been learned with respect to its use (De Ruyter et al., 2005):

1. Real-time observation is less important; off-line scoring is preferred. When equipping the ExperienceLab with observation tools, it was assumed that researchers would code observations in real time. However, over the years we have learned that off-line scoring after the experiments is preferred. This has consequences for the way data is collected and made available for scoring, since researchers now need portable solutions and must export the observational data from the ExperienceLab system.
2. Developing good coding schemes is as much effort as developing a questionnaire. Coding schemes provide an extensive classification of potential observable behavior. This coding scheme is used to code the recorded behavior. Developing good coding schemes takes time, and reuse of these coding schemes (as for questionnaires) is desired.
3. New methods and instruments to measure the subjective user experience in an objective way are needed. Although the user experience is by nature subjective, there is a need to capture and analyze user experiences by means of objective methods.
4. ExperienceLab is a catalyst for improving technology transfer into the business. Traditionally, research results are communicated through scientific publications and presentations. Over the years ExperienceLab has proven to be a very effective communication tool within a large corporate environment. Although the original goal of the ExperienceLab was to support usability and feasibility research, there is a need to reserve capacity for demonstration and dissemination events.
5. Having a support team is essential when operating an ExperienceLab. Since the opening of the ExperienceLab there has been a permanent software engineering team available for technology integration and maintenance of the infrastructure.
Similarly, there is a need for a team of behavioral scientists to guide the empirical research in ExperienceLab.
2.3 Field Studies

Although studies in controlled laboratory settings can provide many valuable insights, the research findings are limited in terms of their ecological validity (Dix et al., 2004). In field studies the emphasis is on introducing new technology applications into realistic settings and studying the usage or behavioral change that might follow. In field studies end-users will be less enthusiastic about new technology applications, and they will demand that these applications fit into their daily life by providing functionality that is meaningful to people. Additionally, in field studies end-users will have the option to use these technology applications over longer periods of time. There are no clear guidelines on the optimal duration of a field study. Some researchers (Neustaedter et al., 2007) have used pilot studies to estimate the needed duration of their field study in order to observe, for example, behavioral change. Others (Breazeal and Scassellati, 2000) have used the term field studies to indicate both contextual studies (in which no new technology applications are introduced) and field studies in which technology applications have been introduced. Although, in contrast to laboratory studies, there seems to be very little methodological guidance on conducting field studies, one can postulate the following guidelines:

1. Field studies are often limited to the deployment of focused prototypes (rather than complete environments). This is desirable both from a practical perspective (i.e., installation and stability issues) and from a control perspective.
2. Field studies (although very much depending on the type of behavior that is being studied) will often run from 4 to 8 weeks.
3. Field studies will often be preceded by a period in which the users are not confronted with new technology applications. These periods serve as a baseline for understanding the effect of the introduced technology applications. Often these periods will end with some data collection in the form of questionnaires. These questionnaires will be repeated at the end of the actual field test.
4. Field studies rely for data collection mostly on logging data and in-situ interviews or questionnaires sampled over time; a minimal logging sketch is given after this list. Although influenced by the complexity of the introduced technology application, it has been found very useful to revisit and interview the user after an initial period of 3 days. After such an initial period the end-users have experienced most of the application's functionality and have found the major obstacles in using the application. Interviews at the end of the field study will often not reveal these issues, since end-users will have forgotten about them.
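The logging mentioned in guideline 4 can be as simple as appending timestamped usage events, tagged with the study phase, to a file, with questionnaires scheduled at fixed checkpoints. The sketch below is a minimal, hypothetical example; the event names, file format, and checkpoint days (end of baseline, the day-3 revisit, and weeks 4 and 8) are illustrative assumptions rather than a prescribed protocol.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class UsageEvent:
    timestamp: float     # seconds since epoch
    participant: str
    event: str           # e.g. "app_opened", "feature_used"
    phase: str           # "baseline" or "deployment"

def log_event(path, participant, event, phase):
    """Append one timestamped usage event as a JSON line."""
    record = UsageEvent(time.time(), participant, event, phase)
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def questionnaire_due(study_start, now, checkpoint_days=(0, 3, 28, 56)):
    """True if 'now' falls on one of the scheduled questionnaire days."""
    elapsed_days = int((now - study_start) // 86400)
    return elapsed_days in checkpoint_days

log_event("fieldlog.jsonl", "P07", "app_opened", phase="deployment")
```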
3 ExperienceLab Infrastructure

Given the high importance of controlled experiments in a realistic environment, the ExperienceLab is presented here in more detail. By developing and integrating advanced technologies in the area of Ambient Intelligence, the ExperienceLab, currently consisting of a Home, a Shop and an Apartment environment, serves as an innovation center for the development of novel consumer products and services.
Fig. 4 The ExperienceLab floorplan
3.1 HomeLab: the Home Environment

The HomeLab is built as a two-story house with a living room, a kitchen, two bedrooms, a bathroom and a study. At first glance, the home does not show anything special, but a closer look reveals the black domes at the ceilings that hide cameras and microphones. Equipped with 34 cameras throughout the home, HomeLab provides behavioral researchers with a perfect instrument for studying human behavior inside HomeLab. Adjacent to the Home there is an observation room. When HomeLab was opened in 2001, one-way mirrors were placed between the living room and the observation room. The idea was to have a direct view into the living room, but over time it became clear that observers preferred the camera images. The different viewing
angles and the possibility to zoom in on details were reasons to abandon the mirrors. The observation room is equipped with four observation stations. Each station has a large high-resolution flat screen showing a collection of six different images from the cameras in the house. The observer is free to choose which cameras he wants to use and what the pan, tilt and zoom position of every individual camera has to be, and he can route two of the roughly 30 available microphones to the left and right channels of his headphones. Each observer has an application running to feed the behavioral data to the storage system, synchronized to the video data. In the early days of HomeLab this setup was used in real time, and four observers had a hard time following the progress of the experiment. Nowadays it is more common to first have the video data stored on the capture stations and to do the behavioral data collection afterwards. Events and sensor data are also time-stamped and appended to the video data. This way of working is much more efficient, and a single observer can collect all the relevant data.
Fig. 5 HomeLab: user centered design environment for advanced studies in multimedia concepts for the home
Broadband Internet facilities enable various ways to connect parts of the HomeLab infrastructure to the Philips High Tech Campus network or even to the outside world. A wireless Local-Area Network (LAN) offers the possibility to connect people in HomeLab without running cables. However, if cables are required, double floors and double ceilings provide nice hiding places. Corridors, adjacent to the rooms in HomeLab, accommodate the equipment that researchers and developers need to realize and control their systems and to process and render audio and video signals for the large flat screens in HomeLab. Light control systems (LON and amBX) can be accessed by the researchers and offer their prototypes the possibility to affect the light settings in the rooms.
3.2 ShopLab: the Retail Environment

The ShopLab research program builds on the insight that shopping itself has become an important leisure activity for many people, and that flexible atmospheres
are needed to enhance shopping experiences. On the other hand, many retail chains want to maintain a clear house style for branding reasons. This introduces the challenge of combining these two major aspects. One approach to this, studied in ShopLab, is that one atmosphere design is sent to all stores and slightly adapted there to meet local conditions. With the introduction of solid-state (LED) lighting, a wide range of new options to create such atmospheres using color and dynamic effects is becoming available. However, tuning these atmospheres requires controlling several hundred lamp settings, introducing a complex overall control challenge. Another approach studied to enhance the shopping experience is the introduction of interactivity, in the form of interactive shop windows, interactive signage and reactive spots. Adaptation of these shop atmospheres also requires input from smart environments that detect people's presence and product interests while they are in or near a shop.
Fig. 6 ShopLab: augmented environment with advanced vision and lighting concepts for retail studies
The ShopLab is used extensively to perform user studies, both with retailers and with end-users (shoppers). By involving these users in all phases of the design process, including the evaluation of the experience prototypes, important insights into the actual experiences of users are obtained early on in the development process.
3.3 CareLab: the Assisted Living Environment

The CareLab resembles a one-bedroom apartment for seniors and is equipped with a rich sensor network to study the contextual settings in which people will use health and wellness applications. The sensor information is processed and combined to extract higher-order behavioral patterns that can be related to activities and states, such as the presence of people, the state of the home infrastructure, etc. With the CareLab it is possible to explore users' acceptance of these solutions at an early stage and to assess the interactive and functional qualities of these solutions before deploying them into
field settings. Results will be used to improve applications of innovative technologies, to eliminate imperfections and to explore new applications.
Fig. 7 CareLab: realistic aware environment with advanced sensing and reasoning capabilities to study consumer health and wellness propositions in a home context
4 The ExperienceLab as an Assessment Instrument

As described, the ExperienceLab consists of realistic environments offering an advanced instrument for studying Ambient Intelligence solutions. Such a laboratory environment should facilitate data collection without influencing the data itself. However, throughout the use of the ExperienceLab it has also been observed that test participants are impressed by the environments and expect to encounter high-tech systems during their stay in the ExperienceLab. Murray & Barnes (1998) describe this so-called 'wow' effect as "initial enthusiasm". Such a 'wow' effect could bring on a highly satisfied feeling towards Philips HomeLab, which could positively influence participants' responses collected during an experimental session. This type of response bias is also often described as a halo effect. Response bias can be described as any systematic tendency of a respondent to manifest particular response behavior for an extraneous reason that is not part of the experimental manipulation. In the past, various researchers (Donovan & Rossiter, 1982; Mehrabian & Russell, 1974; Kotler, 1974) demonstrated the influence of specific facets of the environment on consumer behavior. To measure the impact of the ExperienceLab as an instrument, a controlled experiment was conducted. This experiment involved the replication of a traditional usability test (of an experimental system for video editing) in both the ExperienceLab and a traditional laboratory environment. The experiment and its findings are now briefly discussed.2
2 A more detailed report of this study is found in De Ruyter et al., 2009.
4.1 Method

A total of 40 participants, who were unfamiliar with the ExperienceLab, were recruited for this study. The experiment used a within-subject design consisting of two experimental sessions separated by a time interval of one week. It was suggested to the participants that, based on the general results of session 1, the system's usability would be improved between the sessions (see Figure 8). In reality, only minor changes (e.g., user interface colors) were made to the system in order to be able to compare the findings from both sessions. In both sessions we conducted a typical usability test of the same interactive system.
          Session 1        Session 2
Group 1   ExperienceLab    ExperienceLab
Group 2   ExperienceLab    Laboratory
Group 3   Laboratory       ExperienceLab
Group 4   Laboratory       Laboratory
Fig. 8 The methodological design of the ExperienceLab experiment
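A balanced, randomized assignment of the 40 participants to the four environment orders of Figure 8 could be generated as in the sketch below. The even split of ten participants per group and the fixed random seed are assumptions for illustration; the chapter does not state how participants were allocated.

```python
import random

ORDERS = {
    "Group 1": ("ExperienceLab", "ExperienceLab"),
    "Group 2": ("ExperienceLab", "Laboratory"),
    "Group 3": ("Laboratory", "ExperienceLab"),
    "Group 4": ("Laboratory", "Laboratory"),
}

def assign(participants, seed=42):
    """Randomly assign participants to the four session orders, evenly."""
    rng = random.Random(seed)
    ids = list(participants)
    rng.shuffle(ids)
    per_group = len(ids) // len(ORDERS)
    plan = {}
    for i, (group, order) in enumerate(ORDERS.items()):
        for pid in ids[i * per_group:(i + 1) * per_group]:
            plan[pid] = (group, order)
    return plan

plan = assign(range(1, 41))
print(plan[1])  # e.g. ('Group 3', ('Laboratory', 'ExperienceLab'))
```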
4.2 Procedure

Set up as a typical usability test, the ease of use of a video editing system was assessed in both the ExperienceLab and a traditional laboratory. The laboratory environment is less attractive and realistic compared to the ExperienceLab environment. The most important difference between the two environments is the embedding and presentation of the experimental system as part of a home-like environment. A set of questionnaires was preceded by a short introductory text that differed depending on the experimental setting the participant was about to enter. In the introduction the participants were told they were participating in research on the usability of a recently developed video editing system. The participants were then given some information on the system so that certain expectations could be evoked. Next, the participants were told that the usability test had been divided into two sessions, with one week in between in which the video editing system would be adjusted on the basis of the first session's results. During the second session the participants had to evaluate the system again.
4.3 Materials

The experimental system deployed in the usability test is a TV-based video editing system that automatically converts a home video into an edited version, which is basically a 'summary' of the raw footage. Using a remote control, the user can edit and modify this automatically created summary: while viewing the summary, the user pauses on the desired shot and subsequently selects one of the editing functions (e.g. adding music, adding effects). During the usability test, participants are requested to complete a given set of tasks with this system. Several usability and contextual measures are collected during the experiment. The instruments used for this data collection are briefly discussed below.
Brief Mood Introspection Scale (BMIS)

In order to measure participants' mood, the Brief Mood Introspection Scale (BMIS) by Mayer & Gaschke (1988) was applied. The BMIS consists of 16 adjectives which are based on eight mood states: (1) happy, (2) loving, (3) calm, (4) energetic, (5) fearful/anxious, (6) angry, (7) tired and (8) sad.
Software Usability Measurement Inventory (SUMI)

The software's usability was measured with the Software Usability Measurement Inventory (Kirakowski, 1993). The SUMI consists of 50 statements, each rated on a three-point Likert scale on which the participants indicate whether they agree, are undecided or disagree with the statement. The SUMI contains statements like "I enjoy my sessions with this prototype", "It is obvious that user needs have been fully taken into consideration" and "The prototype has a very attractive presentation".
Pleasure, Dominance and Arousal scale (PDA)

In order to measure the degree to which participants are satisfied with the two different environments, the semantic differential Pleasure, Dominance and Arousal scale was used. Mehrabian & Russell (1974) designed this widely used instrument to investigate how consumer behaviors are influenced by atmospheres (Foxall, 1997). The instrument is based on three dimensions describing an individual's emotional responses to an environment: pleasure, arousal and dominance. Each dimension is subdivided into six opposing pairs of states of mind, and each opposing pair is rated along a seven-point Likert scale.
NASA Task Load Index (TLX)

The task load evoked by the performance of the assignments is measured by means of the NASA Task Load Index (Hart & Staveland, 1988). The TLX uses six bipolar scales to assess task load on six dimensions: mental demand, physical demand, temporal demand, performance, effort and frustration (Rubio, Díaz, Martín & Puente, 2004). The mean task load value indicates how demanding the participants found the execution of the tasks.
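As a concrete illustration of how such a mean task load value can be derived, the sketch below averages the six subscale ratings into an unweighted ('raw') TLX score; the function name, the dictionary interface and the example ratings are illustrative assumptions, not part of the original study materials.

```python
def raw_tlx(ratings):
    """Unweighted ('raw') NASA-TLX score: the mean of the six subscale ratings.

    `ratings` maps each dimension to a value on the 5-100 range used by the
    task load tables later in this section.
    """
    dims = ("mental", "physical", "temporal", "performance", "effort", "frustration")
    missing = [d for d in dims if d not in ratings]
    if missing:
        raise ValueError(f"missing TLX dimensions: {missing}")
    return sum(ratings[d] for d in dims) / len(dims)

# Hypothetical participant:
print(raw_tlx({"mental": 35, "physical": 10, "temporal": 25,
               "performance": 20, "effort": 30, "frustration": 15}))  # -> 22.5
```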
(Dis)confirmation of Expectations

In order to measure the (dis)confirmation of expectations, a commonly applied seven-point semantic differential scale is used (Aiello, Czepiel & Rosenberg, 1977; Linda & Oliver, 1979; Oliver, 1977; Swan & Trawick, 1980; Westbrook, 1980). The scale ranges from "The experimental system was worse than expected" to "better than expected". In order to measure the participants' expectation level, Churchill & Surprenant's (1982) seven-point semantic differential scale was used. This scale ranges from "My expectations about the experimental system were too high: it was poorer than I thought" to "My expectations about the experimental system were too low: it was better than I thought".
4.4 Results

The results of the statistical analysis of the collected measures are now presented. Note that, in order to process ordinal-level data as interval-level data, all values were normalized to Z-scores prior to statistical processing.
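A minimal sketch of this standardization step is given below; the data are hypothetical and numpy is assumed to be available.

```python
import numpy as np

def to_z_scores(values):
    """Standardize scores to Z-scores: subtract the sample mean and
    divide by the sample standard deviation."""
    x = np.asarray(values, dtype=float)
    sd = x.std(ddof=1)  # sample standard deviation
    if sd == 0:
        raise ValueError("scores have zero variance; Z-scores are undefined")
    return (x - x.mean()) / sd

# Hypothetical satisfaction scores for one group of 10 participants:
z = to_z_scores([2.4, 2.5, 2.3, 2.6, 2.4, 2.5, 2.2, 2.7, 2.4, 2.5])
print(z.mean(), z.std(ddof=1))  # approximately 0 and exactly 1 by construction
```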
Brief Mood Introspection Scale (BMIS)

Table 1 shows the mean mood state scores for each group prior to the first session.

Table 1 Mean group scores on mood state corresponding to session 1 (score is at least 1 and at most 7; Ngroup = 10)

            Group 1       Group 2       Group 3       Group 4
Session 1   3.62 (0.38)   3.64 (0.38)   3.80 (0.36)   3.68 (0.47)
The results of a One-Way ANOVA show that there was no significant difference between the four groups on the mean mood state scores for session 1 (F(3,36) = 1.60, p = .21).
Table 2 presents the mean mood state scores for each group prior to the second session.

Table 2 Mean group scores on mood state corresponding to session 2 (score is at least 1 and at most 7; Ngroup = 10)

            Group 1       Group 2       Group 3       Group 4
Session 2   3.67 (0.25)   3.79 (0.45)   3.84 (0.32)   3.76 (0.63)
Again, a One-Way ANOVA was conducted. The results show that, for the second session, the four groups did not differ significantly from each other on the mean mood state scores (F(3,36) < 1, p = .91). Subsequently, to check whether there was a significant difference between the scores of session 1 and session 2, an Independent Samples t-test was conducted for each group. No significant difference was found for group 1 (t(18) = 1.96, p = .07), group 2 (t(18) = 1.34, p = .20), group 3 (t(18) < 1, p = .95) or group 4 (t(18) < 1, p = .89).
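For reference, the sketch below shows how such a One-Way ANOVA across the four groups and an independent-samples t-test between two sessions of one group can be computed with scipy.stats; the per-participant scores are hypothetical and are not the data of this experiment.

```python
from scipy import stats

# Hypothetical per-participant mood scores (Ngroup = 10), session 1:
g1 = [3.5, 3.7, 3.2, 3.9, 3.6, 3.8, 3.4, 3.7, 3.6, 3.8]
g2 = [3.6, 3.4, 3.9, 3.5, 3.8, 3.7, 3.3, 3.6, 3.9, 3.7]
g3 = [3.9, 3.7, 4.1, 3.6, 3.8, 3.7, 4.0, 3.8, 3.6, 3.8]
g4 = [3.4, 3.9, 3.6, 4.2, 3.5, 3.7, 3.3, 3.9, 3.6, 3.7]

# One-Way ANOVA across the four groups: with 4 x 10 scores the degrees of
# freedom are (3, 36), matching the F(3,36) values reported in the text.
f_stat, p_anova = stats.f_oneway(g1, g2, g3, g4)
print(f"F(3,36) = {f_stat:.2f}, p = {p_anova:.2f}")

# Independent-samples t-test between session 1 and session 2 of one group
# (two samples of 10 give a t-test with 18 degrees of freedom).
g1_session2 = [3.6, 3.8, 3.5, 3.7, 3.9, 3.6, 3.7, 3.5, 3.8, 3.6]
t_stat, p_t = stats.ttest_ind(g1, g1_session2)
print(f"t(18) = {t_stat:.2f}, p = {p_t:.2f}")
```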
Software Usability Measurement Inventory (SUMI)

The group mean scores for satisfaction as measured by the SUMI in session 1 are presented in Table 3.

Table 3 Mean scores on satisfaction (based on SUMI) corresponding to session 1 (score is at least 1 and at most 3; Ngroup = 10)

            Group 1       Group 2       Group 3       Group 4
Session 1   2.45 (0.26)   2.41 (0.23)   2.40 (0.15)   2.55 (0.17)
The results of a One-Way ANOVA show that, regarding session 1, the groups did not differ significantly from each other concerning mean satisfaction scores (F(3,36) < 1, p = .57). Table 4 shows the groups' mean satisfaction scores for session 2.

Table 4 Mean scores on satisfaction (based on SUMI) corresponding to session 2 (score is at least 1 and at most 3; Ngroup = 10)

            Group 1       Group 2       Group 3       Group 4
Session 2   2.47 (0.20)   2.40 (0.14)   2.40 (0.25)   2.50 (0.27)
Again, a One-Way ANOVA was conducted to check for significant differences in mean satisfaction scores between the groups. The results show, once more, that there was no significant difference between the groups (F(3,36) < 1, p = .57).
Finally, an Independent Samples t-test was conducted to check for significant differences in mean satisfaction scores within each group between session 1 and session 2. The results show no significant differences between the two sessions for group 1 (t(18) < 1, p = .36), group 2 (t(18) < 1, p = .92), group 3 (t(18) = 1.06, p = .31) or group 4 (t(18) < 1, p = .93).
Pleasure, Dominance and Arousal scale (PDA)

Table 5 presents, for each group, the mean scores on the PDA scale corresponding to session 1.

Table 5 Mean scores on feeling evoked by environment corresponding to session 1 (value is at least 1 and at most 7; Ngroup = 10)

            Group 1       Group 2       Group 3       Group 4
Session 1   4.44 (0.59)   4.55 (0.45)   4.34 (0.73)   3.99 (0.52)
The result of a One-Way ANOVA indicated that the four groups did not differ significantly from each other concerning session 1 (F(3,36) < 1, p = .42). Table 6 shows, for each group, the mean scores on the PDA scale corresponding to session 2.

Table 6 Mean scores on feeling evoked by environment corresponding to session 2 (value is at least 1 and at most 7; Ngroup = 10)

            Group 1       Group 2       Group 3       Group 4
Session 2   4.42 (0.57)   4.28 (0.65)   4.55 (0.70)   3.59 (0.49)
Again, a One-Way ANOVA was conducted to check whether the mean scores differed within session 2. Just as in session 1, there was no significant difference between the mean scores concerning session 2 (F(3,36) < 1, p = .44). Finally, an Independent Samples t-test was carried out to find out whether the mean group scores differed significantly between session 1 and session 2. The results show that for group 1 (t(17) < 1, p = .89), group 2 (t(17) < 1, p = .41), group 3 (t(18) < 1, p = .63) and group 4 (t(17) < 1, p = .35) there were no significant differences between the sessions.
NASA Task Load Index (TLX)

Table 7 presents, for each group, the mean task load values corresponding to session 1. A One-Way ANOVA showed that the first session's results of the groups did not differ significantly from each other (F(3,36) = 1.37, p = .27).
Table 7 Mean task load values corresponding to session 1 (value is at least 5 and at most 100; Ngroup = 10)

            Group 1         Group 2        Group 3         Group 4
Session 1   26.67 (13.56)   31.08 (9.92)   33.58 (11.08)   26.67 (12.90)
Within the first session, the mean scores of groups 1 and 4 and of groups 2 and 3 were also compared. However, no significant differences were found between groups 1 and 4 (F(1,18) = 1.49, p = .24) or between groups 2 and 3 (F(1,18) < 1, p = .42). Table 8 shows, for each group, the mean task load value corresponding to session 2.

Table 8 Mean task load values corresponding to session 2 (value is at least 5 and at most 100; Ngroup = 10)

            Group 1        Group 2        Group 3        Group 4
Session 2   18.47 (5.64)   31.50 (9.52)   27.33 (9.33)   26.17 (11.62)
A One-Way ANOVA was conducted to check whether the second session's results differed significantly between the groups. The results show that there was again no significant difference between the mean values (F(3,36) < 1, p = .55). Further One-Way ANOVAs show that groups 1 and 4 (F(1,18) = 1.73, p = .24) and groups 2 and 3 (F(1,18) < 1, p = .49) did not differ significantly from each other with regard to session 2. Furthermore, to find out whether the groups' mean scores differed significantly between session 1 and session 2, an Independent Samples t-test was carried out. The results show a significant difference between the mean scores only for group 1 (t(18) = 2.35, p < .05, explained variance = 18.4 percent), and not for group 2 (t(18) < 1, p = .63), group 3 (t(18) < 1, p = .64) or group 4 (t(18) < 1, p = .88).
(Dis)confirmation of Expectations

Table 9 presents, for each group, the mean scores on (dis)confirmation of expectations towards the experimental system measured in session 1.

Table 9 Mean scores on disconfirmation of expectations measured in session 1 (value is at least 1 and at most 7; Ngroup = 10)

            Group 1       Group 2       Group 3       Group 4
Session 1   4.40 (1.17)   4.80 (1.09)   4.45 (1.09)   5.05 (0.93)
The results of a One-Way ANOVA show that the mean scores of the groups did not differ significantly from each other (F(3,36) < 1, p = .49). The mean scores of each group corresponding to the second session are presented in Table 10.

Table 10 Mean scores on disconfirmation of expectations measured in session 2 (value is at least 1 and at most 7; Ngroup = 10)

            Group 1       Group 2       Group 3       Group 4
Session 2   3.90 (0.91)   5.15 (1.31)   4.20 (0.95)   4.35 (1.11)
Once more, the results of a One-Way ANOVA show that there was no significant difference in mean scores between the four groups (F(3,36) = 2.44, p = .08). An Independent Samples t-test was conducted to check whether there were significant differences in mean scores between session 1 and session 2. No significant difference was found for group 1 (t(18) = 1.07, p = .30), group 2 (t(18) < 1, p = .52), group 3 (t(18) < 1, p = .59) or group 4 (t(18) = 1.53, p = .14).
4.5 Discussion

The main finding of this study is that there is no statistically significant difference between the usability measures obtained in the ExperienceLab and those obtained in a traditional usability laboratory. On the basis of this study one can assume that conducting a usability study in the ExperienceLab environment does not result in a more positive system evaluation than evaluating the same system in a traditional laboratory environment. During the debriefing at the end of the second session, several participants explained that, because they were performing the video editing tasks and filling in questionnaires, they had paid hardly any attention to the environment. On the basis of this explanation it may be that the occurrence of response bias due to the ExperienceLab environment is much more system-related. That is, systems that are very much part of the environment will draw more attention to the testing environment. Consider, for example, the evaluation of voice-controlled environments in which there is no single point of interaction but in which the user interacts with or through the environment as a whole. In such situations the presence of response bias due to the testing environment needs to be re-assessed. However, the present study leads us to the conclusion that the evaluation results of an interactive system that is the main locus of attention will not be biased by the testing environment. As such, the present study provides an important argument for the validity and reliability of empirical research in environments such as the ExperienceLab.
5 Case Study of the Experience Research Approach

In this section, a brief3 case study of applying the Experience Research cycle is presented. The application domain for this case study is that of Ambient Assisted Living solutions for the ageing population. The target group consists of elderly people who have recently retired and who have an active social network. For this study the main focus was not on identifying problematic situations (e.g. social isolation) but on better understanding well-functioning social networks of people who experience a major change in their lives (i.e. retirement). Equipped with these insights, researchers could conceptualize potential technology applications for elderly people who are at risk of experiencing a reduced active social network and who need support in re-activating it.
5.1 Context-Mapping Study

Following the context-mapping methodology (Sleeswijk Visser et al., 2005), a study was designed to gain more insight into the social world of the elderly. The context-mapping process generally comprises three parts: (1) eliciting information about the context, (2) structuring the contextual information and communicating the information to the development team, and (3) incorporating the contextual information in concept development activities. In this method, a key element is having users create expressive artifacts (in so-called generative sessions) and discussing them in individual interviews or group sessions. After recruiting and introducing 11 elderly participants for the context-mapping study, a probe package was provided to them. The purpose of this package is to sensitize the participants by making them conscious of their social network. More specifically, the participants were requested to complete a poster that represents their social network (Kang and Ridgway, 1996). The participants were not instructed to pay attention to any specific aspect of their social network, but rather to keep a diary of this social network by working on the poster on a daily basis. The participants were thus asked to work on these small assignments for approximately 10 minutes each day during one week, describing and annotating their social interactions on the poster. At the end of this first week, the participants were interviewed and the poster was discussed with them. During the interview it was agreed that the participants would involve a companion (a person who plays an important role in their social network) in completing together another poster representing activities and interactions within the social network (see Figure 9). This second activity was again set up as a collection of daily assignments spread over one week.
3 The case study description is brief and intended merely to illustrate the different steps of the Experience Research cycle. A more detailed report of this case study is found in De Ruyter & Leegwater, 2009.
Fig. 9 Context mapping participants detailing their social network on a poster
At the end of the second week, both the participant and the companion were interviewed to better understand the poster they had created together. On this occasion there was also a small interview on the use of today's technologies (e.g. phone, email, postal mail) within the context of staying in touch with their social network. In the next step, the findings of the context-mapping study were clustered and analyzed (see Figure 10). The richness of this material requires making high-level abstractions that represent the high-level needs of the participants. These are summarized as:
• There is a strong need for recognition from the actors in the social network.
• The elderly have a need to be independent of the social network.
• The elderly have a need to experience general satisfaction with their social world.
For example, the need for recognition was further detailed by specific behaviors and situations that were highlighted during the context-mapping study:
• Being useful. There is a strong need to make oneself useful; after retirement this group of elderly still feels responsible, to a certain degree, for adding something to society.
• Being included. People want to become a member of an association or want to be invited to someone's birthday party. In other words, they feel the need of belonging.
• Confirmation of status. People show what they are capable of by comparing themselves with others with respect to condition, expertise or skills, or by challenging each other in a competitive way.
• Mutual appreciation. Recognition is expressed for other people's skills and expertise by giving each other compliments. The elderly also show mutual interest in each other by exchanging news and asking about each other's situation. A lot of mutual help and support takes place as well: helping each other with small jobs, cooking for each other, and helping someone reach a destination.
Although many of these findings are already confirmed in the literature (Baumeister and Leary, 1995), the present study provided more insight into how these fundamental human needs are addressed with today's solutions.
Fig. 10 The classification and analysis of the context mapping findings
In a second level of the data analysis, it became clear how the participants' social networks were formed, how the participants maintained and expanded these networks, and what made these networks satisfying and rewarding.
5.2 Laboratory Study

As a further research direction, the topic of social recognition and appreciation was explored. It should be noted that the context-mapping data are rich enough for investigating many other aspects of the participants' social networks. Building on the literature (Baumeister and Leary, 1995), the following requirements for an application concept were put forward:
• Social networks consisting of a few close friendships are preferred over large networks with less intimate friendships (Caldwell and Peplau, 1982).
• Satisfying social networks involve two criteria: (i) frequent and affectively pleasant interactions, and (ii) these interactions must express an affective concern between the network's members (Baumeister and Leary, 1995).
• The social interaction should allow for small talk over trivial matters (Gerstel and Gross, 1982).
While these requirements are general and well documented in the literature, the context-mapping study results helped in understanding the context in which any technological solution would be introduced, and additional requirements for the concept were established. After a further literature study and an exploration of existing technological solutions for supporting social networks, a concept creation step was started. The concept developed in this case study is that of an interactive television channel for a closed community (see Figure 11). With this TV channel, the members of the social
network can post short messages and pictures that are meaningful to their network4. More importantly, the members of the community could rate the posted materials by attributing flowers to the postings (see Figure 12). This was done in order to provide the network's members with an explicit means to show appreciation for each other's postings. The interactive channel would only allow a limited number of postings to be shared at the same time: as soon as new materials were introduced, the oldest postings would be removed.
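The bounded sharing behavior described above can be sketched as follows; the class, the capacity value of 8 and the flower counter are illustrative assumptions for exposition, not the actual set-top box implementation.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Posting:
    author: str
    text: str
    flowers: int = 0  # appreciation received from other members

class CommunityChannel:
    """Holds only the most recent postings; adding a posting beyond the
    capacity silently evicts the oldest one (the capacity of 8 is assumed)."""

    def __init__(self, capacity: int = 8):
        self._postings = deque(maxlen=capacity)

    def post(self, author: str, text: str) -> None:
        self._postings.append(Posting(author, text))

    def give_flower(self, index: int) -> None:
        self._postings[index].flowers += 1

    def current(self) -> list:
        return list(self._postings)
```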
Fig. 11 The concept of an interactive community channel
Although the TV is a well-known device, it also introduced a major challenge for the concept's usability, since all interaction would be done using a simple remote control. After consulting existing design guidelines (e.g. readability of text on TV displays), a first prototype of the concept was built and tested in the laboratory environment in terms of its usability and initial acceptance. The usability test was based on a comparison of actual usage logging with a pre-defined task model for the developed concept. Post-experimental interviews revealed some of the participants' attitudes towards this concept.
Fig. 12 Viewing postings and expressing appreciation
4 Note: the actual posting is done using a PC with an internet connection while the consumption and sharing of the postings is limited to the TV channel.
The laboratory study highlighted several usability issues with respect to the use of the remote control, the transitions between the messages and the readability of the posted messages. The concept was adapted and prepared as a robust prototype that could be deployed in a field setting.
5.3 Field Study

The prototype (implemented as a set-top box that connects to any standard TV) was installed in six homes and was available to the users for a period of 8 days. After the basic functionality and use of the system had been explained, the end users could start using the prototype. The participants in this study were members of a small social network that organized frequent walking trips. The channel was used to share pictures and short messages related to their trips.
Fig. 13 Installation of the prototype in end-user's home
Since the prototype was implemented as a networked application, detailed logs were kept of the daily use of the system. Throughout the field test, the participants were requested to complete a daily questionnaire measuring the perceived appreciation they experienced from their social network (Adler and Fagley, 2005). The results of the field test provided more insight into the use of the concept in real-life settings. Additionally, it enabled researchers to investigate some specific mechanisms that enable and create strong social networks. The results of the field study also highlighted some unexpected behaviors. For example, one of the members of the social network was very negative towards the system due to its simplicity. After discussing the issue further, it became clear that the person raising this issue was recognized in the social network as an expert when it came to using ICT solutions such as the PC. However, due to the concept's simplicity, any member of the social network could post and retrieve messages and pictures. As a consequence, this user lost his status as expert since his support was no longer needed.
6 Discussion

It goes without saying that the AmI vision holds the promise of becoming a truly disruptive paradigm. It calls for a far-reaching multi-disciplinary and integrative approach that extends far beyond the levels of system innovation that mankind has been dealing with so far. This requirement is a challenge and a threat at the same time. The threat lies in the fact that the complexity of AmI environments may not be tractable and that the implementation of the vision will therefore be infeasible. On the other hand, the requirement may stimulate the search for innovative solutions to this complexity problem, resulting in new insights that will eventually lead to the realization of true Ambient Intelligence. Although the Experience Research cycle case study is only reported in brief, it should make clear that each phase of this cycle provides its own specific insights into the interaction between users and technology applications. The contextual study enabled the research team to better understand and instantiate some of the general findings from the literature that relate to the development and maintenance of social networks. Within the laboratory setting it was possible to study how end users could use the experimental prototype. Finally, the field study gave more insight into the actual use of, and the affective responses people have when using, the proposed technology applications. It should be clear that this process (although not presented this way) is iterative and that the findings of each phase might require the researcher to step back to a previous phase of the Experience Research cycle.
References

[1] Aarts, E.H.L., Harwig, R. & Schuurmans, M. (2001). Ambient Intelligence. In: P. Denning, The Invisible Future, McGraw Hill, New York, 235-250.
[2] Adler, M.G. & Fagley, N.S. (2005). Individual differences in finding value and meaning as a unique predictor of subjective well-being. Journal of Personality, 73, 79-112.
[3] Aiello, A. Jr, Czepiel, J.A. & Rosenberg, L.A. (1977). Scaling the heights of consumer satisfaction: An evaluation of alternate measures. In: Consumer Satisfaction, Dissatisfaction and Complaining Behavior, Ralph L. Day, ed. Bloomington, Indiana: School of Business, Indiana University.
[4] Baumeister, R.F. & Leary, M.R. (1995). The need to belong: Desire for interpersonal attachments as a fundamental human motivation, Psychological Bulletin, Vol. 117, 497-529.
[5] Beyer, H. & Holtzblatt, K. (1998). Contextual Design, Academic Press.
[6] Breazeal (Ferrell), C. and Scassellati, B. (2000). Infant-like Social Interactions Between a Robot and a Human Caretaker. In K. Dautenhahn, Special issue of Adaptive Behavior on Simulation Models of Social Agents.
[7] Caldwell, M.A. & Peplau, L.A. (1982). Sex differences in same-sex friendships, Sex Roles, 8, 721-732.
[8] Churchill, G.A. Jr & Surprenant, C. (1982). An investigation into the determinants of customer satisfaction, Journal of Marketing Research, 19, 491-504.
[9] De Ruyter, B. (2003). User Centred Design. In: Aarts, E. & Marzano, S. (Eds.), The New Everyday: Vision on Ambient Intelligence, 010 Publishers, Rotterdam, The Netherlands.
[10] De Ruyter, B. & Leegwater, E. (2009). Supporting social networks of elderly users, in preparation.
[11] De Ruyter, B. & Pelgrim, E. (2007). Ambient Assisted Living research in CareLab, ACM Interactions, Volume 14, Issue 4, July + August 2007.
[12] De Ruyter, B., Saini, P., Markopoulos, P. & van Breemen, A. (2005). Assessing the effects of building social intelligence in a robotic interface for the home, Interacting with Computers, 522-541, Elsevier.
[13] De Ruyter, B., van Loenen, E. & Teeven, V. (2007). User Centered Research in ExperienceLab, European Conference, AmI 2007, Darmstadt, Germany, November 7-10, 2007. LNCS Volume 4794/2007, Springer.
[14] De Ruyter, B., Van Geel, R., Markopoulos, P. & Kramer, E. (2009). Measuring the Halo effect of Experience and Application Research Centers, in preparation.
[15] Dix, A., Finlay, J.E., Abowd, G.D. & Beale, R. (2004). Human-computer Interaction, Pearson Education.
[16] Donovan, R.J. & Rossiter, J.R. (1982). Store atmosphere: An experimental psychology approach, Journal of Retailing, 58, 34-57.
[17] Ford, M.E. & Tisak, M.S. (1983). A further search for social intelligence. Journal of Educational Psychology, 75, 196-206.
[18] Foxall, G.R. (1997). The emotional texture of consumer environments: A systematic approach to atmospherics, Journal of Economic Psychology, 18, 505-523.
[19] Gerstel, N. & Gross, H. (1982). Commuter marriages: A review, Marriage and Family Review, 5, 71-93.
[20] Green, W. & de Ruyter, B. (2008). The design and evaluation of interactive systems with perceived social characteristics: six challenges, The 7th International Workshop on Social Intelligence Design: Designing socially aware interactions, 3-5 December 2008, Puerto Rico.
[21] Hart, S.G. & Staveland, L.E. (1988). Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In P.A. Hancock & N. Meshkati (Eds.), Human Mental Workload, 239-250. Amsterdam: North-Holland Press.
[22] Hogan, R. (1969). Development of an empathy scale, Journal of Consulting and Clinical Psychology, 33, 307-316.
[23] Iqbal, R., Gatward, R. & James, A. (2005). EthnoModel: An Approach for Developing and Evaluating CSCW Systems, Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 2005. IDAACS 2005. IEEE, 633-638.
[24] ISTAG (2001). Scenarios for Ambient Intelligence, European Commission, http://www.cordis.lu/ist/istag.html
[25] Kaikkonen, A., Kekäläinen, A., Cankar, M., Kallio, T. & Kankainen, A. (2005). Usability Testing of Mobile Applications: A Comparison between Laboratory and Field Testing, Journal of Usability Studies, Issue 1, Volume 1, November 2005, 4-17.
[26] Kang, Y.S. & Ridgway, N.M. (1996). The importance of consumer market interactions as a form of social support for elderly consumers, Journal of Public Policy and Marketing, 15, 108-118.
[27] Keating, D.P. (1978). A search for social intelligence, Journal of Educational Psychology, 70, 218-223.
[28] Kirakowski, J. & Corbett, M. (1993). SUMI: The Software Usability Measurement Inventory, British Journal of Educational Technology, 24(3), 210.
[29] Kotler, P. (1974). Atmospherics as a marketing tool. Journal of Retailing, 49, 48-64.
[30] Linda, G. & Oliver, R.L. (1979). Multiple brand analysis of expectation and disconfirmation effects on satisfaction. Paper presented at the 87th Annual Convention of the American Psychological Association.
[31] Mehrabian, A. & Russell, J.A. (1974). An approach to environmental psychology, MIT Press, Cambridge, MA.
[32] Mayer, J.D. & Gaschke, Y.N. (1988). The experience and meta-experience of mood. Journal of Personality and Social Psychology, 55(1), 102-111.
[33] McDonald, S., Monahan, K. & Cockton, G. (2006). Modified contextual design as a field evaluation method. In Proceedings of the 4th Nordic Conference on Human-Computer Interaction: Changing Roles, Oslo, Norway, October 14-18, 2006. A. Mørch, K. Morgan, T. Bratteteig, G. Ghosh & D. Svanaes, Eds. NordiCHI '06, vol. 189. ACM, New York, NY, 437-440.
[34] Mehrabian, A. & Russell, J.A. (1974). An approach to environmental psychology, MIT Press, Cambridge, MA.
[35] Moss, F.A. & Hunt, T. (1927). Are you socially intelligent?, Scientific American, 137, 108-110.
[36] Murray & Barnes (1998). Beyond the "wow" factor: Evaluating multimedia language learning software from a pedagogical viewpoint, System, 26(2), 249-259.
[37] Neustaedter, C., Brush, A.J. & Greenberg, S. (2007). A digital family calendar in the home: lessons from field trials of LINC. In Proceedings of Graphics Interface 2007 (Montreal, Canada, May 28-30, 2007). GI '07, vol. 234. ACM, New York, NY, 199-20.
[38] Oliver, R.L. (1977). Effect of expectation and disconfirmation on postexposure product evaluations: An alternative interpretation, Journal of Applied Psychology, 62, 480-6.
[39] Rose, A., Shneiderman, B. & Plaisant, C. (1995). An applied ethnographic method for redesigning user interfaces. In Proceedings of the 1st Conference on Designing Interactive Systems: Processes, Practices, Methods, & Techniques (Ann Arbor, Michigan, United States, August 23-25, 1995). G.M. Olson and S. Schuon, Eds. DIS '95. ACM, New York, NY, 115-122.
[40] Rubio, S., Díaz, E., Martín, J. & Puente, J.M. (2004). Evaluation of subjective mental workload: A comparison of SWAT, NASA-TLX, and workload profile methods, Applied Psychology: An International Review, 53(1), 61-86.
[41] Sleeswijk Visser, F., Stappers, P.J., van der Lugt, R. & Sanders, E. (2005). Contextmapping: experiences from practice, CoDesign, Vol. 1, No. 2, June 2005, 119-149.
[42] Sternberg, R. & Smith, C. (1985). Social intelligence and decoding skills in nonverbal communication, Social Cognition, Vol. 2, 168-92.
[43] Swan, J.E. & Trawick, I.F. (1980). Consumer satisfaction with a retail store related to the fulfillment of expectations on an initial shopping trip. In: Consumer Satisfaction, Dissatisfaction and Complaining Behavior, Ralph L. Day, ed. Bloomington, Indiana: School of Business, Indiana University.
[44] Tory, M. and Staub-French, S. (2008). Qualitative analysis of visualization: a building design field study. In Proceedings of the 2008 Conference on Beyond Time and Errors: Novel Evaluation Methods for Information Visualization, Florence, Italy, April 05 - 05, 2008.
[45] Taylor, A.S. (2002, 19-23 October). Teenage 'Phone-talk' and its Implications for Design, NordiCHI 2002, Bridging the gap between field studies and design workshop, Aarhus, Denmark.
[46] Thorndike, E.L. (1920). Intelligence and its uses, Harper's Magazine, 140, 227-235.
[47] Vernon, P.E. (1933). Some characteristics of the good judge of personality, Journal of Social Psychology, 4, 42-57.
[48] Weiser, M. (1991). The computer for the 21st century, Scientific American, 265(3), 94-104.
[49] Westbrook, R.A. (1980). Intrapersonal affective influences upon consumer satisfaction, Journal of Consumer Research, 7, 49-54.
Part IX
Projects
Computers in the Human Interaction Loop A. Waibel, R. Stiefelhagen, R. Carlson, J. Casas, J. Kleindienst, L. Lamel, O. Lanz, D. Mostefa, M. Omologo, F. Pianesi, L. Polymenakos, G. Potamianos, J. Soldatos, G. Sutschet, J. Terken
A. Waibel, R. Stiefelhagen: Universität Karlsruhe (TH), Interactive Systems Labs, Karlsruhe, Germany, e-mail: [email protected], [email protected]
R. Carlson: Kungl Tekniska Högskolan, Centre for Speech Technology, Stockholm, Sweden
J. Casas: Universitat Politecnica de Catalunya, Barcelona, Spain
J. Kleindienst: IBM Research, Prague, Czech Republic
L. Lamel: LIMSI-CNRS, France
O. Lanz, M. Omologo, F. Pianesi: Foundation Bruno Kessler, irst, Trento, Italy
D. Mostefa: ELDA, Paris, France
L. Polymenakos, J. Soldatos: Athens Information Technology, Athens, Greece
G. Potamianos: Institute of Computer Science, FORTH, Crete, Greece (work performed while with the IBM T.J. Watson Research Center, Yorktown Heights, NY, USA)
G. Sutschet: Fraunhofer Institute IITB, Karlsruhe, Germany
J. Terken: Technische Universiteit Eindhoven, Netherlands

1 Introduction

It is a common experience in our modern world, for us humans to be overwhelmed by the complexities of technological artifacts around us, and by the attention they demand. While technology provides wonderful support and helpful assistance, it also
causes an increased preoccupation with technology itself and a related fragmentation of attention. But as humans, we would rather attend to a meaningful dialog and interaction with other humans than control the operations of the machines that serve us. The cause for such complexity and distraction, however, is a natural consequence of the flexibility and choice of functions and features that technology has to offer. Thus flexibility of choice and the availability of desirable functions are in conflict with ease of use and our very ability to enjoy their benefits. The artifact cannot yet perform autonomously and therefore requires precise specification of every aspect of the machine's behavior under a variety of different user interface conventions. Standardization and better graphical user interfaces, as well as multimodal human-machine dialog systems including speech, pointing and mousing, have all contributed to improving this situation. Yet they have only improved the input mode; they have not provided any proactive, autonomous capabilities of the artifact, and they still force the user into a human-machine interaction loop to the exclusion of other human-human interaction.

To change the limitations of present-day technology, machines must be engaged implicitly and indirectly in a world of humans; or: we must put Computers in the Human Interaction Loop (CHIL), rather than the other way around. Computers should proactively and implicitly attempt to take care of human needs without the necessity of explicit human specification. If technology could be CHIL-enabled, a host of useful services could be possible. Could two people get in touch with each other at the best moment over the most convenient and best media, without phone tag, embarrassing ring tones and interruptions? Could an attendee in a meeting be reminded of participants' names and affiliations at the right moment without messing with a contact directory? Could computers offer simultaneous translation to attendees who speak different languages, and do so unobtrusively, just as needed? Can meetings be supported, moderated and coached without technology getting in the way? Human secretaries often work out such logistical support, reminders and helpful assistance, and they do so tactfully, sensitively and sometimes diplomatically. Why not machines? Clearly, it is a lack of recognition and understanding of human activities, needs and desires, and an absence of learning, proactive, context-aware computing services, that limit machines' ability to specify their own behavior and serve people implicitly.

To change the status quo, a consortium of 15 laboratories in 9 countries has teamed up to explore what is needed to build usable CHIL computing services. Supported under the 6th Framework Program of the European Commission, the CHIL consortium has been one of the largest consortia working on this problem. It has developed a research and development infrastructure by which different CHIL services can be quickly proposed, assembled and evaluated, exploiting a User Centered Design approach to ensure that the developed services answer real users' needs and demands. Examples of prototype services are 1) the Connector (a proactive phone/communication device), 2) the Memory Jog (for supportive information and reminders in meetings), and 3) collaborative supportive workspaces and meeting monitoring. Under the CHIL paradigm the consortium also experimented with new ideas for CHIL services based on the background of the partners (e.g.
recently, a simultaneous speech translator for the lecture domain).
To realize such CHIL computing, the work of the consortium concentrated on four key areas:
• Technologies: Proactive, implicit services require a good description of human interaction and activities. This implies a robust description of the perceptual context as it applies to human interaction: Who is doing What, to Whom, Where, How, and Why. Unfortunately, such technologies – in all of their required robustness – do not yet exist, and the consortium identified a core set of needed technologies and set out to build and improve them for use in CHIL services.
• Data Collection and Evaluation: Due to the inherent challenges (open environments, free movement of people, open distant sensors, noise, ...), technological advances are made under an aggressive R&D regimen rounded off by worldwide evaluation campaigns. In support of these campaigns, data from meetings, lectures and seminars held in offices and lecture rooms was collected at five different sites, and metrics for technology evaluation were defined.
• Software Infrastructure: A common and defined software architecture serves to improve inter-operability among partners and offers a market-driven exchange of modules for faster integration.
• Services: Based on the emerging technologies developed at different labs, using a common architecture, and within a User Centered Design framework, CHIL services are assembled and evaluated. In this fashion, first prototypes are continuously being (re-)configured and the results of user studies effectively exploited.
In the following, we review some of the highlights of each of these critical components of CHIL service design.
2 Audio-Visual Perceptual Technologies

The user-centered approach of CHIL-enabled environments requires a wide range of audio-visual (AV) perceptual technologies in order to place the Computer into the Human Interaction Loop. Such technologies are necessary for the computers to exhibit context- and content-awareness, and can be further facilitated by cognition tools, for example situation modeling, strategy and planning to decide on the most adequate interaction. The latter will be discussed in the software infrastructure section of this chapter (Sect. 4). Multimodal interface technologies "observe" humans and their environments by recruiting signals from the multiple AV sensors of the CHIL smart rooms to detect, track, and recognize human activity. The analysis of all AV signals in the environment (speech, signs, faces, bodies, gestures, attitudes, objects, events, and situations) provides the proper answers to the basic questions of "who", "what", "where", and "when", that can drive higher-level cognition concerning the "how" and "why", thus allowing computers to engage and interact with humans in a human-like manner
using the appropriate communication medium. Research work by the CHIL consortium on a number of such technologies is described next. All these technologies are rigorously evaluated on a regular basis within international evaluations, such as the Rich Transcription Spring 2006 (RT06s) and 2007 (RT07) evaluation campaigns [39, 75] and the CLEAR – Classification of Events, Activities and Relationships – evaluation [21, 85, 84] (see also Sect. 3).
2.1 Speech Recognition

Speech is the most critical human communication modality in the CHIL seminar and meeting scenarios of interest. Its automatic transcription is of paramount importance to real-time support and off-line indexing of the interaction, for example providing input to the higher-level technologies of summarization and question answering. In the spirit of ubiquitous computing, the goal of automatic speech recognition (ASR) in CHIL is to achieve high performance using far-field sensors. Therefore, despite the maturity of ASR technology, the CHIL scenarios present significant challenges for state-of-the-art systems. These include, for example, the presence of speech from multiple speakers with varying accents and frequent periods of overlapping speech, a high level of spontaneity with many hesitations and disfluencies, and a variety of interfering acoustic events, such as knocks, door slams, steps, coughs, laughter, and others.

The CHIL data consist of interactive seminars, held inside smart rooms equipped with numerous acoustic and visual sensors. The acoustic sensors include a number of microphone arrays located on the walls (typically, at least three T-shaped four-microphone arrays and at least one linear 64-channel array), as well as a number of table-top microphones (at least three). The latter provide the basis of the multiple distant microphone (MDM) primary condition of interest in the development of far-field ASR technology. In addition to these sensors, most meeting participants wear headset microphones to capture individual speech in the close-talking condition.

Three CHIL partner sites, IBM, LIMSI, and UKA-ISL, have been actively involved in the effort to develop robust ASR technology on such CHIL data. Although the sites primarily worked on their ASR systems independently of each other, all systems contain a number of standard important components. At the lower level, these include feature extraction and enhancement. ASR systems typically utilize Mel frequency cepstral coefficients (MFCCs) [24] or perceptual linear prediction (PLP) [46] features. Additional techniques include linear discriminant analysis (LDA) [44], a maximum likelihood linear transform (MLLT) [43], feature normalization steps such as variance normalization and vocal tract length normalization (VTLN) [3], as well as a novel feature extraction technique developed by UKA-ISL, which is particularly robust to changes in fundamental frequency [97]. At a higher level, for acoustic modeling, all systems made use of hidden Markov models (HMMs) estimated with the expectation maximization (EM) algorithm
(maximum likelihood training) [26], followed by discriminative model training using maximum mutual information (MMI) [71], or the minimum phone error (MPE) approach [72]. A number of adaptation techniques were also employed, ranging from maximum a-posteriori estimation (MAP) [41] to maximum likelihood linear regression (MLLR) [62], feature-space MLLR (fMLLR), or speaker adaptive training (SAT) [2]. Most ASR systems used a multi-pass decoding strategy, where a word hypothesis was employed for unsupervised model adaptation prior to the next decoding pass. For language modeling, different n-gram language models (LMs), with n = 3 or 4, have been employed by the CHIL sites. These LMs were typically developed separately for various data sources, and were subsequently linearly interpolated to give rise to a single model. Most often, CHIL and other meeting corpora were employed for this task. An important part of the language modeling work is determining the recognition vocabulary and the pronunciations, particularly to model foreign-accented speech. The use of a connectionist LM [76], shown to perform well when LM training data is limited, was also explored at LIMSI. Finally, an important aspect of all systems was the combination in the recognition process of the information coming from the multiple available acoustic sensors. Both decision and feature fusion approaches have been employed for this purpose, for example beamforming [98] and ROVER [38].

The CHIL partner sites involved in ASR work have made steady progress in the CHIL domain, as benchmarked by yearly technology evaluations. During the first two years of CHIL, the consortium evaluated ASR internally, with a first dry run held in June 2004, followed by an "official" evaluation in January 2005. Subsequently, the three ASR CHIL sites participated in the Rich Transcription evaluations of 2006 and 2007 – RT06s [39] and RT07 [75]. These international evaluations were sponsored by NIST and attracted a number of additional external participants. The evaluations have provided a quantitative measure of progress over the years. For the far-field ASR task, for example, the best system performance improved from a word error rate of over 68% in the 2004 dry run to approximately 52% in the CHIL 2005 internal evaluation, and from 51% in RT06s down to 44% in RT07. These improvements were achieved in spite of the fact that the recognition task became increasingly challenging: from 2005 to 2006, multiple recording sites involving speakers with a wider range of non-native accents were introduced into the test set, and in both 2006 and 2007 the degree of interactivity in the data increased significantly.
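The word error rates quoted above are, in essence, the word-level edit distance between the reference transcript and the recognizer output divided by the number of reference words. The sketch below illustrates that computation; the official evaluations used the NIST scoring tools, which additionally handle text normalization, segmentation and overlapping speech.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between the first i reference and j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("put computers in the human interaction loop",
                      "put computer in the interaction loop"))  # 2 errors / 7 words
```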
In addition to ASR, two other relevant technologies were investigated, namely speech activity detection (SAD) and speaker diarization (SPKR) – also known as the "who spoke when" problem. Both SAD and SPKR constitute crucial preprocessing components of state-of-the-art ASR systems, and are important for fusion with other CHIL perceptual technologies, for example acoustic-based speaker localization and identification. In particular, they locate the speech segments that are processed by the ASR systems, and attempt to cluster homogeneous sections, which is crucial for efficient signal normalization and speaker adaptation techniques. SAD technology has been investigated by many CHIL partners (AIT, FBK-irst, IBM, INRIA, LIMSI, UKA-ISL, and UPC) that explored various approaches differing in a number of factors: for example, the feature selection (MFCCs, energy-based, or combined acoustic and energy-based features, etc.), the type of classifier used (Gaussian mixture model classifiers (GMMs), support vector machines, linear discriminants, decision trees), the classes of interest, and the channel combination techniques. Similarly to ASR, these components have been evaluated in separate CHIL-internal and NIST-run evaluation campaigns. In the RT06s evaluation, the best systems achieved very encouraging error rates of 4% and 8% for the conference and lecture subtasks, respectively, when calculated by the NIST diarization metric. A number of CHIL partner sites have also developed SPKR systems (AIT [73], IBM [48, 47], LIMSI [101, 102], and UPC [63]), exploring a variety of research directions, including exploiting multiple distant microphone channels, speech parametrization, as well as segmentation and clustering [63, 4, 102].
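As an illustration of the energy-based SAD features mentioned above, the sketch below labels frames as speech when their log energy rises sufficiently above an estimated noise floor; the CHIL systems themselves relied on trained classifiers (GMMs, SVMs, decision trees) rather than a fixed threshold, so the parameter values here are illustrative only.

```python
import numpy as np

def energy_sad(signal, sample_rate, frame_ms=25, hop_ms=10, threshold_db=15.0):
    """Label frames as speech when their log energy exceeds a threshold set
    relative to the quietest frames (a rough noise-floor estimate)."""
    frame = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame) // hop)
    energies = np.array([
        10.0 * np.log10(np.sum(signal[i * hop:i * hop + frame] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    noise_floor = np.percentile(energies, 10)  # assumes >=10% of frames are non-speech
    return energies > noise_floor + threshold_db  # boolean speech label per frame

# Example with a synthetic signal: 1 s of faint noise followed by 1 s of a louder tone.
sr = 16000
rng = np.random.default_rng(0)
sig = np.concatenate([0.01 * rng.standard_normal(sr),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)])
labels = energy_sad(sig, sr)
print(labels[:5], labels[-5:])  # mostly False early on, True near the end
```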
2.2 Person Tracking

The localization and tracking of multiple persons behaving without constraints, unaware of audio/video sensors, in natural, evolving and unconstrained scenarios still poses significant challenges to the current state-of-the-art in tracking technologies.
Fig. 1 Example screenshot of a person tracking system running on CHIL data (image taken from [17]).
Video-based tracking techniques have to deal with varying illumination, shadows, occlusions, changing target appearances, the lack of color constancy, and so forth. While these difficulties are inherent to some degree in all kinds of visual tracking tasks, the problem posed by occlusions becomes particularly severe here, when observing multiple persons in a small crowded space. Close proximity to the
capturing cameras may cause persons to occlude others; partial occlusions by furniture, such as tables and chairs, are very common. The choice of reliable features for tracking is also not straightforward. Changes in the foreground support may be caused by person parts of varying sizes, by other objects, such as moved chairs, by projection surfaces, etc. The use of color or edge features on the other hand presumes that person-specific appearance models are available. These can, however, be difficult to initialize or maintain for every newly acquired target.

The primary constraint in audio-based localization and tracking, using distantly placed microphone arrays, is that the target person needs to be an active speaker. In most practical scenarios, this limits applications to the tracking of one person at a time, with further difficulties caused by the variety of acoustic conditions (e.g., room acoustics and reverberation time) and, in particular, the undefined number of simultaneous active noise sources and competing speakers. In addition to single modality tracking, audio-visual tracking approaches were also researched, to measure the advantage of modality fusion. The challenges here lie in the balancing of the modalities and in the fusion algorithms themselves, as measurements from different types of sensors come with varying levels of confidence, at different frequencies, and may not always be available at all times (e.g. for visually occluded persons or for silent audience members).

One of the key problems that needed to be solved, for visual, acoustic and multimodal tracking alike, was the definition of performance evaluation procedures for multiple object trackers. The defined metrics, usable for single or multiple object tracking alike, judge a tracker's performance based on its ability to precisely estimate an object's position (Multiple Object Tracking Precision, MOTP) and its ability to correctly estimate the number of objects and keep correct trajectories in time (Multiple Object Tracking Accuracy, MOTA) (see [8] for details). They were used in the CLEAR [88, 87] evaluations, for a broad range of CHIL- or VACE-related [93] tasks including acoustic, visual and multimodal 3D person tracking, 2D visual person tracking, face detection and tracking, and vehicle tracking.

To face the numerous further challenges mentioned above, the CHIL consortium applied several strategies for what concerns the room design and selected sensors, as well as the algorithms employed. A distributed camera and microphone network provided a good "coverage" of each area of the room, and the fusion of sensor data helped overcome the problems of occlusions or noisy observations and increase localization accuracies. On the visual side, from the point of view of algorithm design, two distinct approaches have been followed by the various 3D tracking systems built within the consortium. First, a model-based approach: a 3D model of the tracked object is maintained by rendering it onto the camera views, searching for supporting evidence in each view, and, based on that, updating its parameters [67, 59, 1, 7, 61, 17]. Second, a data-driven approach: 2D trackers operate independently on the separate camera views; then the 2D tracks belonging to a same target are collected into a 3D one [100, 92, 52]. An important advantage of the model-based approach is that model rendering can be implemented in a way that mimics the real image formation process, including
effects like perspective distortion and scaling, lens distortion, etc. In the context of multi-body tracking this is particularly advantageous, since occlusions can be handled in a systematic manner. The model-based approach also makes it easier to incorporate many different types of features, such as foreground segments, color histograms, etc., increasing tracking robustness. On the acoustic side, tracking approaches can be roughly categorized as follows: Approaches which rely on the computation of a more or less coarse global coherence field (GCF, or SRP-PHAT), on which the tracking of correlation peaks is performed [1, 14], particle filter approaches, which stipulate a number of candidate person positions at every point in time and measure the agreement of the observed acoustic signals (their correlation value) to the position hypothesis [66], and approaches that feed computed time delays of arrival (TDOAs) between microphone pairs directly as observations to a Kalman or other probabilistic tracker [54, 42]. Whereas in earlier CHIL systems, speech activity detection (SAD), necessary for correct localization, was often performed and evaluated separately, toward the end of the project, more and more approaches featured built-in speech detection techniques [42, 77]. In the field of multimodal tracking, while most initial systems performed audio and video tracking separately and combined tracker outputs in a post-processing step, a few select approaches incorporated the multimodal streams at the feature level. These were notably particle filter based trackers [66, 7, 13], as these allow for the probabilistic, flexible and easy integration of a multitude of features across sensors and modalities.
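The TDOA observations mentioned above are commonly obtained with generalized cross-correlation with phase transform (GCC-PHAT), which keeps only the phase of the cross-spectrum before searching for the correlation peak. The numpy sketch below illustrates the idea and is not a specific CHIL implementation.

```python
import numpy as np

def gcc_phat_tdoa(sig_a, sig_b, sample_rate, max_delay_s=None):
    """Estimate the time delay of arrival between two microphone signals with
    GCC-PHAT. A positive result means sig_b arrives later than sig_a."""
    n = len(sig_a) + len(sig_b)
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = np.conj(A) * B
    cross /= np.abs(cross) + 1e-15          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_delay_s is not None:
        max_shift = min(max_shift, int(max_delay_s * sample_rate))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # lags -max..+max
    delay_samples = np.argmax(np.abs(cc)) - max_shift
    return delay_samples / sample_rate

# Example: the second channel is the first delayed by 20 samples (1.25 ms at 16 kHz).
sr = 16000
rng = np.random.default_rng(1)
src = rng.standard_normal(sr)
mic_a = src
mic_b = np.concatenate([np.zeros(20), src[:-20]])
print(gcc_phat_tdoa(mic_a, mic_b, sr))  # approximately 20 / 16000 = 0.00125 s
```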
Fig. 2 Best system performances for the CLEAR 2006 and 2007 3D person tracking evaluations. The MOTA score measures a tracker’s ability to correctly estimate the number of objects and their rough trajectories.
Throughout the duration of the project steady progress was made, going from single modality systems with manual or implicit initialization, using simple features, sometimes implying several manually concatenated offline processing steps and tracking at most one person, to fully automatic, self-initializing, real-time capable systems, using a combination of features, fusing several audio and visual
sensor streams and capable of tracking multiple targets. Aside from the tracking tasks, which grew more and more complex, the evaluation data, which was initially recorded only at the UKA-ISL site, also became increasingly difficult and varied. This was due to the completion of four more recording smart rooms, the inclusion of more challenging interaction scenarios, the elimination of simplifying assumptions such as main speakers, areas of interest in the room, or manually segmented audio data that excludes silence, noise or crosstalk. Nevertheless, the performance of systems steadily increased over the years. Figure 2 shows the progress made in audio, video and multimodal tracking for the last official CLEAR evaluations [88, 87].
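Following the standard CLEAR MOT definitions referenced above [8], MOTA aggregates misses, false positives and identity mismatches over all frames relative to the total number of ground-truth objects, while MOTP averages the position error over all matched object-hypothesis pairs. The sketch below assumes that a per-frame matching step (e.g. a distance-thresholded assignment between hypotheses and ground truth) has already produced the required counts.

```python
def clear_mot_scores(frames):
    """Compute MOTA and MOTP from per-frame matching results.

    Each frame is a dict with integer counts 'misses', 'false_positives',
    'mismatches', 'num_objects', and 'match_errors', a list of distances
    (e.g. in mm) for the matched object-hypothesis pairs of that frame.
    """
    misses = sum(f["misses"] for f in frames)
    fps = sum(f["false_positives"] for f in frames)
    mismatches = sum(f["mismatches"] for f in frames)
    objects = sum(f["num_objects"] for f in frames)
    errors = [e for f in frames for e in f["match_errors"]]

    mota = 1.0 - (misses + fps + mismatches) / objects if objects else float("nan")
    motp = sum(errors) / len(errors) if errors else float("nan")
    return mota, motp

# Toy example: two frames with two ground-truth persons each.
frames = [
    {"misses": 0, "false_positives": 1, "mismatches": 0, "num_objects": 2,
     "match_errors": [120.0, 90.0]},
    {"misses": 1, "false_positives": 0, "mismatches": 0, "num_objects": 2,
     "match_errors": [110.0]},
]
print(clear_mot_scores(frames))  # MOTA = 1 - 2/4 = 0.5, MOTP ~ 106.7 mm
```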
2.3 Person Identification The challenges for AV person identification (ID) in CHIL scenarios are due to far-field, wide-angle, low-resolution sensors, acoustic noise, speech overlap and visual occlusion, unconstrained subject motion, and the absence of position/orientation assumptions that would yield well-posed signals (one cannot assume frontal faces, or speakers facing and talking to the sensors). A sample face sequence from a smart room can be seen in Fig. 3. Clearly, employing tracking technologies (detection followed by tracking) and fusion techniques, either multi-view or multimodal (e.g., speaker ID combined with face ID), is a viable approach to improving robustness.
Fig. 3 A sample face sequence from a smart room.
Identification performance depends on the enabling technologies used for audio, video and their fusion, but also on how accurately the useful portions are extracted from the audio and video streams. The detection process for audio involves finding the speech segments in the audio stream by employing speech activity detection (see Sect. 2.1). The corresponding process for video involves face detection, which might be a composite task comprising body tracking, head tracking, and face detection, each subtask narrowing the search space for the face. The mono- and multimodal ID systems developed within CHIL have been successfully evaluated in the CLEAR 2006 and 2007 campaigns [84, 85]. In the evaluations, two different training durations, 15 and 30 seconds, and four different testing durations, 1, 5, 10 and 20 seconds, were considered. The number of subjects in the
CLEAR 2006 person identification evaluation corpus was 26; it became 28 in CLEAR 2007. A significant performance increase was observed in CLEAR 2007, especially in face recognition. It was shown that longer training and testing durations yielded very high recognition rates and that multimodality improves the robustness of person identification. This indicates that person ID in open environments is quite feasible. The correct recognition rates of the best performing systems in the CLEAR 2006 and 2007 evaluations are shown in Fig. 4.
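The multimodal gains mentioned above come from combining the speaker-ID and face-ID subsystems. As a schematic illustration only (a generic weighted score-level fusion, not the specific CHIL algorithms; the weights, identities, and scores are made up), a fusion step can look like this:

```python
import numpy as np

def fuse_id_scores(audio_scores, video_scores, w_audio=0.4, w_video=0.6):
    """Weighted score-level fusion of per-identity scores from two modalities.

    Both inputs are dicts mapping identity -> similarity score; scores are
    min-max normalized per modality before the weighted sum."""
    def normalize(scores):
        vals = np.array(list(scores.values()), dtype=float)
        lo, hi = vals.min(), vals.max()
        return {k: (v - lo) / (hi - lo + 1e-12) for k, v in scores.items()}

    a, v = normalize(audio_scores), normalize(video_scores)
    fused = {k: w_audio * a[k] + w_video * v[k] for k in a}
    return max(fused, key=fused.get), fused

# Hypothetical scores for three enrolled subjects.
audio = {"subj01": 2.1, "subj02": 3.4, "subj03": 1.9}
video = {"subj01": 0.7, "subj02": 0.5, "subj03": 0.9}
print(fuse_id_scores(audio, video))
```

In practice the modality weights would be tuned on held-out data and could depend on segment duration or signal quality.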
Fig. 4 Performance comparison of the person identification systems in the CLEAR 2006 and 2007 evaluations. Two out of the eight conditions are shown per evaluation: the shortest training and testing durations (15 and 1 seconds, respectively) and the longest training and testing durations (30 and 20 seconds, respectively) (from [94]).
2.4 Interaction Cues: Gestures, Body Pose, Head Pose, Attention Studies in social psychology have experimentally validated the common feeling that non-verbal behavior, including but not limited to gaze and facial expression, is extremely significant in human interaction. In the analysis of group behavior, a key variable is the focus of attention, which indicates the object or person one is attending to. Another form of non-verbal communication is body language, which includes explicit gestures, such as raising a hand to catch attention or nodding the head to signal agreement, as well as body posture. In this chapter, approaches for the automated analysis of non-verbal communication in a meeting scenario are discussed. Although visual attention does not necessarily coincide with psychological attention, it is highly correlated with it, and it is physically observable. While the image resolution required to estimate eye gaze may not always be available in a meeting recording, a reliable indicator can still be derived from estimated head pose.
Fig. 5 Estimating head pose and focus of attention [86]. Head orientations are estimated from four camera views. These are then mapped to likely targets of attention, such as other people.
CHIL partners have investigated approaches to estimate the head orientation of people in a smart room using multiple fixed cameras (see Fig. 5). With a neural network based approach [65], it is possible to estimate relative head orientation from a single view with mean errors as low as twelve degrees. When applied to each single camera and combined by a Bayes filter to obtain an absolute value for head orientation, such an approach made it possible to correctly determine who is looking at whom in a CHIL meeting recording around 70% of the time. Another option is to perform signal-level fusion using spatial and color analysis [18]. Such an approach exploits the redundancy available in multiple images, and can be extended to also integrate acoustic input. A detailed description of a person's pose can be extracted from multiple views, e.g. using a particle filter based approach in combination with a 3D shape and appearance model for each target [60] (see Fig. 6). The state of a person includes 2D body position, horizontal and vertical head and torso orientation, and a binary value for sitting or standing pose. A coarse 3D shape model maps such a state into a set of image patches where the different body parts (head, torso, legs, arms) are expected to be visible. A likelihood score is then computed by matching those patches with a previously acquired color model of the target. This approach has the advantage of estimating a number of suitable perceptual parameters jointly (e.g. position and head orientation to determine visual attention in dynamic environments), thus reducing the impact of error propagation, which can lead to failures in cascaded solutions (e.g. a head alignment error impacts the accuracy of head orientation estimation). Other visual cues that are relevant to human behavior analysis can be detected at a finer spatio-temporal scale. To pinpoint fidgeting, temporally unstable skin pixels are recorded in a Motion History Image (MHI), a temporal record of repetitive skin motion. Patterns of interest, such as nodding, shaking and typing, are characterized in the MHI and then used for their detection [20].
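The final mapping from head orientation to a target of attention can be sketched geometrically (our own minimal illustration, with hypothetical positions and an invented angular threshold): the most likely target is the one whose direction deviates least from the estimated gaze direction.

```python
import numpy as np

def focus_of_attention(observer_xy, head_yaw_deg, targets, max_dev_deg=30.0):
    """Return the target whose direction best matches the observer's head
    orientation, or None if no target lies within `max_dev_deg` degrees.

    `targets` maps a name to an (x, y) position; `head_yaw_deg` is the
    horizontal head orientation in the same world coordinate frame."""
    best, best_dev = None, max_dev_deg
    for name, (tx, ty) in targets.items():
        direction = np.degrees(np.arctan2(ty - observer_xy[1], tx - observer_xy[0]))
        dev = abs((direction - head_yaw_deg + 180.0) % 360.0 - 180.0)  # wrap to [0, 180]
        if dev < best_dev:
            best, best_dev = name, dev
    return best

# Hypothetical meeting layout: the observer looks roughly toward person B.
targets = {"personA": (1.0, 2.0), "personB": (3.0, 0.5), "whiteboard": (0.0, -2.0)}
print(focus_of_attention((0.0, 0.0), head_yaw_deg=-10.0, targets=targets))
```

A probabilistic version, as used in the work cited above, would replace the hard threshold with a likelihood model over angular deviations and track the attention target over time.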
Fig. 6 The SmarTrack system [82] simultaneously estimates spatial location, head orientation and body pose of up to 5 persons in real time, allowing for online detection of focus of attention and macro-scale gestures such as pointing.
2.5 Activity Analysis, Situation Modeling The recognition of activities depends on many factors such as the location and number of people, speech activity, and the location and state of certain objects. Perceptual technologies like person tracking or acoustic event detection provide important information on which the higher-level analysis of the activity can be based. Due to the complexity of the scene there are, however, potentially relevant phenomena, such as a door being half open, which – due to their high number and variability – cannot all be addressed by manually designed detectors. Therefore, activity recognition may need to analyze the observations directly in order to determine what is and is not relevant for detecting a certain activity. Using only one camera and one microphone per room, we can distinguish, for example, between typical office activities such as "paperwork", "meeting" or "phone call". The approach employs a multi-level HMM framework to characterize these kinds of events and situations. It is based on a low-level local feature model that integrates adaptive background modeling, optical flow and speech activity detection. The classes are trained from a relatively small number of samples. Figure 7 depicts an example of data-driven clustering of activity regions within an office. The labeled regions represent the learned areas of activity of office workers and their visitors. The evaluation of an unconstrained one-week recording session showed acceptable recognition accuracy despite strongly varying illumination conditions, as shown in [96]. In a smart room it may also be necessary to determine the type of interactions among the people present in the room in order to perform context-dependent actions. To this end, a system based on [50], which analyzes activities using stochastic parsing, has been developed by adapting it to the specific requirements of smart room environments. The system can be divided into three modules: the tracking system, the event generator and the parser. The tracking system [58] takes as input the multi-camera video sequence, reconstructs the 3D objects in the room using foreground detection for each camera, and performs the tracking of the various objects. This tracker output is used, together with some configuration information of
Fig. 7 The first three Gaussian mixture components obtained from data-driven clustering represent typical activity regions in the office room.
the room, to detect events using simple grammars. The actual activity recognition is performed by the parser: given a chain of events and a stochastic context-free grammar, it finds the chain derivation with the maximum probability. The activities recognized by the system are: “meeting”, “presentation”, “conversation between two people”, “leave an object” and “take an object” (Fig. 8).
Fig. 8 Two snapshots from a presentation, shown as camera image and 3D blob representation. The left shows an example of the detected event "person enters"; the right shows an example of the events "person at whiteboard" and "person sits".
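To make the parsing step described above concrete, the sketch below is a toy example using NLTK's probabilistic parser; the grammar, event symbols, and probabilities are invented for illustration and are not the CHIL grammar. It parses a chain of detected events with a small stochastic context-free grammar and returns the most probable derivation.

```python
from nltk.grammar import PCFG
from nltk.parse import ViterbiParser

# Toy stochastic grammar over detected events (illustrative only).
grammar = PCFG.fromstring("""
    S -> ENTER ACTIVITY [1.0]
    ENTER -> 'person_enters' [1.0]
    ACTIVITY -> PRESENTATION [0.6] | MEETING [0.4]
    PRESENTATION -> 'person_at_whiteboard' SIT [1.0]
    MEETING -> 'person_sits' 'person_speaks' [1.0]
    SIT -> 'person_sits' [1.0]
""")

parser = ViterbiParser(grammar)
events = ["person_enters", "person_at_whiteboard", "person_sits"]

# The Viterbi parser returns the highest-probability derivation of the event chain.
for tree in parser.parse(events):
    tree.pretty_print()
    print("log-probability:", tree.logprob())
```

In the real system the event chain is produced by the event generator from tracker output, and the parser's best derivation determines the recognized activity.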
Activities that involve the motion of body limbs cannot be recognized using only person/object tracking. Therefore, a view-independent approach has been developed to recognize human gestures from multiple cameras [16]. In contrast to other systems that perform a classification based on the fusion of features coming from different views, our system performs data fusion to obtain a 3D representation of the scene, and then feature extraction and classification. The motion descriptors introduced by [11] are extended to 3D, and a set of features based on 3D invariant statistical moments is computed. A simple body model is fitted to the 3D data to capture in which body part the gesture occurs. Classification is thus performed by jointly analyzing the motion features and the pose information from the body model. Finally, a Bayesian classifier is employed to perform recognition over a small set of actions. The actions relevant to the smart room scenario are "raising hand", "sitting down" and "standing up". However, we have also tested the system with other actions such as "waving hands", "crouching down", "punching", "kicking" and "jumping". Our experience from CHIL shows that it is hard to find a common methodology for activity recognition that fits all application domains. Our current systems are
dedicated to certain domains like office situations and meetings. They define a small set of activities that are relevant within their domain, and they have proved able to recognize these on in-domain test data. Future work could, on the one hand, try to extend the application domains and, on the other hand, aim for a more detailed analysis of the activities within one domain.
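As a simplified illustration of HMM-based activity classification of the kind used for the office scenario earlier in this section (a generic sketch, not the multi-level CHIL framework; the two models and the observation symbols are invented), one HMM can be evaluated per activity and an observed feature sequence assigned to the activity whose model gives the highest likelihood.

```python
import numpy as np

def forward_loglik(obs, start_p, trans_p, emit_p):
    """Log-likelihood of a discrete observation sequence under an HMM
    (forward algorithm, scaled to avoid numerical underflow)."""
    alpha = start_p * emit_p[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans_p) * emit_p[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

# Two invented 2-state activity models over 3 observation symbols
# (0 = "no motion", 1 = "motion at desk", 2 = "speech detected").
models = {
    "paperwork": (np.array([0.8, 0.2]),
                  np.array([[0.9, 0.1], [0.2, 0.8]]),
                  np.array([[0.6, 0.35, 0.05], [0.3, 0.6, 0.1]])),
    "meeting":   (np.array([0.5, 0.5]),
                  np.array([[0.7, 0.3], [0.3, 0.7]]),
                  np.array([[0.2, 0.2, 0.6], [0.1, 0.3, 0.6]])),
}

observed = [1, 2, 2, 2, 1, 2]  # hypothetical per-second feature symbols
scores = {name: forward_loglik(observed, *params) for name, params in models.items()}
print(max(scores, key=scores.get), scores)
```

The actual systems train such models from data and layer them hierarchically, but the maximum-likelihood decision rule is the same.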
2.6 Audio-Visual Output Technologies Conveying information from the computer to the user in a natural way is another key ability on the way to realizing unobtrusive CHIL services. Multimodal output in innovative combinations, such as far-field acoustic and visual modalities, makes it possible to convey messages without disturbing other meeting participants. Experiments in CHIL show that this can be achieved by employing highly directive audio beams produced by properly designed steerable arrays. When the signal is accompanied by the video of lip-synchronized talking heads [9, 10] to improve the naturalness of interaction, the combined modalities increase intelligibility in noisy environments, a frequent occurrence in CHIL scenarios. Similarly, adding a lip-synchronized talking head makes it possible to deliver voice messages more quietly without a decrease in intelligibility from the perspective of the addressee of a message, resulting in less noise pollution for non-addressees [80, 90].
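For intuition, a directive beam from a linear array can be obtained with simple delay-and-sum steering. The sketch below is a generic textbook computation, not the CHIL array design; the element count and spacing are invented.

```python
import numpy as np

def steering_delays(num_elements, spacing_m, angle_deg, c=343.0):
    """Per-element delays (in seconds) that steer a uniform linear array
    toward `angle_deg` (0 = broadside), for sound speed `c` in m/s."""
    positions = np.arange(num_elements) * spacing_m
    delays = positions * np.sin(np.radians(angle_deg)) / c
    return delays - delays.min()  # shift so all delays are non-negative

# Hypothetical 8-element array with 4 cm spacing, steered 30 degrees off broadside.
print(steering_delays(8, 0.04, 30.0))
```

Applying these delays to the signals fed to (or received from) the array elements reinforces the chosen direction and attenuates others, which is the basic principle behind the directive audio beams mentioned above.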
2.7 Interaction Designing spoken dialog systems in line with a human metaphor, so that the computer is perceived to have human-like conversational abilities, allows us to profit from a number of characteristics of spoken interaction that are unexploited or underexploited in current systems. This, however, increases the demands on a number of conversational abilities [32]. One of the most obvious differences between human-human conversations and the vast majority of spoken human-machine interactions is the interaction flow. Spoken human-human interaction is characterized by a large number of non-lexical sounds (hums and grunts, filled pauses, false starts, breath, smacks, etc.). These are an essential part of human interaction, and are used for a great number of reasons, such as adding focus to part of an utterance or controlling who should speak next. When these are removed, as they habitually are on the computer side of human-computer communication, the human metaphor suffers and the interaction becomes more similar to GUI manipulation than to human speech. As this is contrary to the spirit of CHIL, where humans should not adapt to computers but rather the other way around, efforts were undertaken to model more human behavior – both for production and for perception. Real-time prosodic analysis is used for improved interaction control and precise timing of feedback, which in turn allows the dialog
system to barge in when it is appropriate, and users to do the same without causing the interaction to become strange and interrupted [45, 34, 33]. In CHIL dialogs, elliptical clarifications are utilized to focus on problematic fragments in order to make the dialog more natural and efficient, which in turn makes prosody and context more important [81, 95]. The human metaphor can be strengthened further by adding a visual embodiment to the system. The concept of embodied conversational agents (ECAs) – animated agents that are able to interact with a user in a natural way using speech, gesture and facial expression – holds the potential of a new level of naturalness in human-computer interaction, where the machine is able to convey and interpret verbal as well as non-verbal communicative acts, ultimately leading to more robust, efficient and intuitive interaction. A series of studies performed within CHIL was aimed at exploiting embodiment for interactional purposes [31, 49].
2.8 Natural Language Intelligent solutions for human information needs in a smart room scenario also require natural language processing technologies such as summarization and question answering (QA). Due to the fact that speech is the most natural way of human communication, spontaneous speech single document summarization (SSSDS) can help in digesting large amounts of information produced by speakers. The major aim of speech summarization is to distill relevant information from speech data, which can be communicated efficiently to the user at a later time. Summarization can be considered as a variant of speech understanding. An important issue when summarizing the output of an automatic speech recognition system is that the document to be summarized has no punctuation or capitalization. Searching textual content for short text snippets that are relevant to natural language questions is also a key objective in CHIL. QA can be regarded as a direct extension to search engines (e.g. Google) to handle cases when the type of the expected answer is known. For example, answers to questions of the form: “Who discovered/wrote/developed NAME?” are not expected to be a list of URLs but rather just the name of the author of the required work. QA can enable automated customer service, where many of the questions are of the form: “How do I perform ACTION?” Standard information retrieval techniques are not sufficient to answer such questions, because the answer is usually hidden deep inside technical manuals of hundreds of pages. Last but not least, in the context of the CHIL project, QA is one of the engines behind the Memory Jog service (see Sect. 5). Within this service, a QA system is employed to search both the transcriptions of a meeting and background knowledge offered by documents relevant to this meeting on the web.
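A very rough sketch of extractive summarization (a generic TF-IDF sentence-ranking baseline, not the CHIL summarizer; the transcript lines are invented) scores each segment of a transcript by the weight of its terms and keeps the top-ranked ones:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(sentences, num_sentences=2):
    """Return the `num_sentences` highest-scoring sentences, in original order,
    where a sentence's score is the sum of its TF-IDF term weights."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = tfidf.sum(axis=1).A1            # one score per sentence
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]

# Hypothetical ASR output: no punctuation or capitalization, as noted above.
transcript = [
    "the project review meeting starts with the budget discussion",
    "uh well you know",
    "the budget for the perception work package is increased by ten percent",
    "someone closes the door",
]
print(extractive_summary(transcript, num_sentences=2))
```

Such a baseline favors content-bearing segments over fillers; a real speech summarizer would additionally compensate for recognition errors and the missing sentence boundaries mentioned above.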
3 Technology Evaluations and Data Collection Systematic evaluation is essential to drive the rapid progress of a broad range of audio-visual perceptual technologies. Within the CHIL project, such evaluations were undertaken on an annual basis, so that improvements could be measured objectively and different approaches compared and assessed. CHIL has organized a series of technology evaluations with subsequent evaluation workshops, during which the systems and obtained results were discussed in detail. A first project-internal technology evaluation was held in June 2004. Here, baseline results were established for the available technologies in the real-life lecture scenarios on which CHIL initially focused. Twelve evaluation tasks were already conducted at this stage, including face and head tracking, 3D person tracking, face recognition, head pose estimation, hand tracking and pointing gesture recognition, speech recognition (close-talking and far-field), acoustic speaker tracking, speaker identification, acoustic scene analysis and acoustic event detection. Then, in January 2005, a more formal evaluation was organized, which now also included multimodal evaluation tasks (multimodal tracking, multimodal identification). External research groups were also invited to participate. Also in 2005, CHIL research teams took part in NIST's (National Institute of Standards and Technology, USA) Rich Transcription (RT) Meeting Evaluation. This annual evaluation series focuses on the evaluation of content-related technologies such as speech and video recognition. CHIL contributed test and training data sets in the lecture room domain, and participated in the Speaker Location Detection task. Many researchers, research labs and, in particular, a number of current major research projects worldwide are working on technologies for analyzing people's activities and interactions. However, common evaluation standards, common metrics and benchmarks for such technologies, as for example for person tracking, are missing. It is therefore hardly possible to compare the developed algorithms and systems. Hence most researchers rely on using their own data sets, annotations, task definitions and evaluation procedures. This again leads to a costly multiplication of data production and evaluation efforts for the research community as a whole. In order to overcome this situation, we decided in 2005 to completely open up the project's evaluations by creating an open international evaluation workshop called CLEAR, Classification of Events, Activities, and Relationships [88, 21], and transferred part of the CHIL technology evaluations to CLEAR. CLEAR's goal is to provide a common international evaluation platform and framework for a wide range of multimodal technologies, and to serve as a forum for the discussion and definition of related common benchmarks, including the definition of common metrics, tasks and evaluation procedures. The creation of CLEAR was made possible by joining forces with NIST, who, amongst many other evaluations, also organizes the technology evaluation of the US Video Analysis and Content Extraction (VACE) program [93]. As a result, CLEAR could be jointly supported by CHIL, NIST and the VACE program. The first CLEAR evaluation was conducted in spring 2006 and was concluded by a two-day evaluation workshop in the UK in April 2006. CLEAR 2006 was organized in cooperation with RT06s [74]. This additionally allowed for sharing data between
both evaluations by also harmonizing the 2006 CLEAR and RT evaluation deadlines. For example, the speaker-localization results generated for CLEAR were also used for the far-field ASR task in RT06s. The CLEAR 2006 evaluation turned out to be a big success. Overall, around sixty people from 16 different institutions participated in the workshop, and nine major evaluation tasks, comprising more than 20 subtasks, were evaluated [21]. Following the success of CLEAR 2006, CLEAR 2007 took place in May 2007, in Baltimore, USA, again organized in conjunction with, and collocated with, the NIST RT 2007 evaluation. An important aspect of the technology evaluations was using real-life data representing situations for which the applications and technologies in CHIL were developed, annotated with the necessary information for various modalities. Therefore, a number of data sets were created, enabling the development and evaluation of audio-visual perception technologies. The data was collected inside smart rooms at five different locations. All of these rooms were equipped with a broad range of cameras and microphones. After an initial data set collected in 2004 for the first CHIL internal technology evaluation, new data sets in slightly modified scenarios were collected in 2005 and 2006. These sets constituted the main test and training data sets used in the CLEAR 2006 and 2007 evaluations. The audio modalities (close-talk and far-field) of the 2005, 2006 and 2007 data sets were also part of the RT05s, RT06s and RT07 evaluations. Additionally, the audio portion was employed in a pilot track in the Cross Language Evaluation Forum, CLEF 2007 [22]. The utilization of the CHIL data sets in such high-profile evaluation activities demonstrates the state-of-the-art nature of the corpus, and its contribution to the development of advanced perception technology. Moreover, the numerous scientific publications produced as a result of the evaluations further enhance the importance of the corpus. The CHIL data sets are publicly available to the community through the language resources catalog [36] of the European Language Resources Association (ELRA).
3.1 Data Collection The CHIL corpus consists of multi-sensory audio-visual recordings inside smart rooms. The corpus has been collected over the past few years at five different recording sites and contains data from two types of human interaction scenarios: lectures and small meetings.
3.1.1 Sensor Setup Five smart rooms have been set up as part of the CHIL project, and used in the data collection efforts. They are located in five countries: Germany (UKA-ISL),
Greece (AIT), Italy (ITC), Spain (UPC), and USA (IBM). These five smart rooms are medium-size meeting or conference rooms with a number of audio and video sensors installed, and with supporting computing infrastructure. The multitude of recording sites provides desirable variability in the CHIL corpus, since the smart rooms obviously differ from each other in their size, layout, acoustic and visual environment (noise, lighting characteristics), as well as sensor properties (location, type). Nevertheless, it is crucial to produce a homogeneous database across sites to facilitate technology development and evaluations. Therefore, a minimum common hardware and software setup has been specified concerning the recording sensors and resulting data formats. All five sites comply with these minimum requirements, but often contain additional sensors. Such a setup consists of:
• A set of common audio sensors, namely:
  – A 64-channel linear microphone array;
  – Three 4-channel T-shaped microphone clusters;
  – Three table-top microphones;
  – Close-talking microphones worn by the lecturer and the meeting participants.
• A set of common video sensors, including:
  – Four fixed cameras located at the room corners;
  – One fixed, wide-angle panoramic camera.
This set is accompanied by a network of computers to capture the sensory data, mostly through dedicated data links, with data synchronization information provided by the Network Time Protocol (NTP). A schematic diagram of such a room including its sensors is depicted in Fig. 9.
3.1.2 Scenarios Two types of interaction scenarios constitute the focus of the CHIL corpus: lectures and meetings. In both cases, a presenter gives a seminar in front of an audience, but the two scenarios differ significantly in the degree of interactivity between the audience and the presenter, as well as in the number of participants. The seminar topics are quite technical in nature, spanning the areas of audio and visual perception technologies, but also biology, finance, etc. The language used is English; however, most subjects exhibit strong non-native accents, such as Italian, German, Greek, Spanish, Indian, Chinese, etc. In the Lecture scenario, the presenter talks in front of an audience of typically ten to twenty people, with little interaction apart from a few question-answering turns, mostly toward the end of the presentation. As a result, the audience region is quite cluttered but shows little activity and is of little interest. The focus in lecture analysis therefore lies on the presenter. As a consequence, only the presenter has been annotated in the lecture part of the CHIL corpus and is the subject of interest in the associated
Fig. 9 Schematic diagram of the IBM smart room, one of the five installations used for recording the CHIL corpus. The room is approximately 7 × 6 × 3 m³ in size and contains 9 cameras and 152 microphones for data collection.
evaluation tasks. A total of 46 lectures are part of the CHIL corpus (see also Table 1), most of which are between 40 and 60 minutes long.

Site  # Seminars  Epoch      Type      Evaluation Campaigns
UKA   12          2003/2004  Lectures  CHIL Internal Evals., CLEF07
UKA   29          2004/2005  Lectures  CLEAR06, RT05s, RT06s, CLEF07
ITC   5           2005       Lectures  CLEAR06, RT06s
UKA   5           2006       Meetings  CLEAR07, RT07
ITC   5           2006       Meetings  CLEAR07, RT07
AIT   5           2005       Meetings  CLEAR06, RT06s
AIT   5           2006       Meetings  CLEAR07, RT07
IBM   5           2005       Meetings  CLEAR06, RT06s
IBM   5           2006       Meetings  CLEAR07, RT07
UPC   5           2005       Meetings  CLEAR06, RT06s
UPC   5           2006       Meetings  CLEAR07, RT07

Table 1 Details of the 86 collected lectures/non-interactive seminars (upper table part) and meetings/interactive seminars (lower part) that comprise the CHIL corpus. The table depicts the recording site, year, type, and number of collections, as well as the evaluation campaigns where the data were used.
In the Meeting scenario, the audience is small, between three and five people, and the attendees mostly sit around a table, all wearing close-talking microphones. There is significant interaction between the presenter and the audience, with numerous questions and often a brief discussion among meeting participants. A few of these meetings are scripted to provide a more interesting, challenging, and dynamic interaction, with participants entering or leaving the room or congregating during a short coffee break. In addition, a significant number of acoustic events are generated to allow more meaningful evaluation of the corresponding technology. Clearly, in such a scenario all participants are of interest to meeting analysis. Therefore, the CHIL corpus provides annotations for all. Such data have been recorded by AIT, IBM and UPC in 2005, as well as by all five sites during 2006. A total of 40 meetings are part of the CHIL corpus (see also Table 1), most of which are approximately 30 minutes in duration.
3.2 Corpus Annotation For the collected data to be useful in technology evaluation and development, it is crucial that the corpus is accompanied by appropriate annotations. The CHIL consortium has devoted significant effort to identifying useful and efficient annotation schemes for the CHIL corpus and providing appropriate labels in support of these activities. As a result, the data set contains a rich set of annotations in multiple channels of both audio and visual modalities. Data recording in the CHIL smart room results in multiple audio files containing signals recorded by close-talking microphones (near-field condition), table-top microphones, T-shaped clusters, and Mark III microphone arrays [68] (far-field condition), in parallel. The recorded speech as well as environmental acoustic events were carefully segmented and annotated by human transcribers. Video annotations were manually generated using an ad-hoc tool. The tool allows displaying one picture every second, in sequence, for four cameras. To generate labels, the annotator performs a number of clicks on the head region of the persons of interest, i.e. the lecturer only in the non-interactive seminar (lecture) scenario, but all participants in the interactive seminar (meeting) scenario. In particular, the annotator first clicks on the head centroid (i.e., the estimated center of the person's head), followed by the left eye, the right eye, and the nose bridge (if visible). In addition, the annotator delimits the person's face with a bounding box. The 2D coordinates of the marked points within the camera plane are saved to the corresponding label file. This allows the computation of the 3D head location of the persons of interest inside the room, based on camera calibration information. In addition to 2D face and 3D head location information, part of the lecture recordings was also labeled with gross information about the lecturer's head pose. In particular, only eight head orientation classes were annotated, as this was deemed to be a feasible task for human annotators, given the low-resolution captured views of the lecturer's head. The head orientation label corresponded to one of eight
discrete orientation classes, ranging from 0° to 315° in increments of 45°. Overall, nineteen lecture videos were annotated with such information.
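The 3D head location mentioned above can be recovered from two or more calibrated views by standard linear triangulation. The sketch below is a generic direct-linear-transform computation, not the annotation tool itself; the toy projection matrices and the head position are invented.

```python
import numpy as np

def triangulate(points_2d, projections):
    """Linear (DLT) triangulation of one 3D point from N >= 2 calibrated views.

    `points_2d` is a list of (u, v) image points, `projections` the matching
    list of 3x4 camera projection matrices."""
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows.append(u * P[2] - P[0])   # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]                # de-homogenize

# Hypothetical toy calibration: two cameras looking along +Z, offset by 1 m in X.
P1 = np.hstack([np.eye(3), np.array([[0.0], [0.0], [0.0]])])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
head_3d = np.array([0.5, 1.2, 3.0, 1.0])
clicks = [(P @ head_3d)[:2] / (P @ head_3d)[2] for P in (P1, P2)]
print(triangulate(clicks, [P1, P2]))   # recovers approximately [0.5, 1.2, 3.0]
```

With the annotated head-centroid clicks from the four corner cameras and the rooms' calibration data, the same computation yields the 3D reference positions used in the tracking evaluations.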
4 Software Infrastructure CHIL aims to develop next-generation services that assist humans in a natural and unobtrusive way by utilizing the state-of-the-art in machine perception and context modeling. The CHIL software infrastructure makes this vision conceivable by implementing three key architectural pieces: the CHIL Reference Architecture [25], a set of CHIL Tools for service authoring, and the catalog of CHIL-compliant components. The CHIL Reference Architecture (also called the Ice Cube) provides a collection of structuring principles, specifications and Application Programming Interfaces (APIs) that govern the assemblage of CHIL components into highly distributed and heterogeneous interoperable systems. The CHIL Ice Cube is designed as a layered architecture model (see Fig. 10). Each level represents an abstraction of the information processed and has specific data flow characteristics such as latency and bandwidth, based on the functional requirements from the CHIL system design phase. Building software infrastructure for smart context-aware services is an iterative process, where most of the software architecture craft resides in the discovery of suitable conceptual abstractions and the refinement of their functional relationships. We now describe the basic layers of the CHIL Ice Cube.
Fig. 10 The CHIL Reference Architecture (CHIL Ice Cube).
The upper layers – User Services – are implemented as software agents and manage interactions with humans by means of various interaction devices. The components at this level are responsible for communication with the user and for presenting the appropriate information at appropriate time-spatial interaction zones. The User Services utilize the contextual information available from the situation modeling layer. In projects like CHIL, where a number of service developers concentrate on radically different services, it is of high value that a framework ensures reusability across a range of services. To this end, we have devised a multi-agent framework that:
• Facilitates integration of diverse context-aware services developed by different service providers.
• Facilitates the development of services by leveraging basic functionalities (e.g. sensor and actuator control) available within the smart rooms.
• Allows augmentation and evolution of the underlying infrastructure independent of the services installed in the room.
• Controls user access to services and supports service personalization through maintaining appropriate profiles.
• Enables discovery, involvement and collaboration of services.
The middle layer – Situation Modeling – is the place where the situation context received from audio and video sensors is processed and modeled. The context information acquired by the components at this layer helps CHIL services to respond better to varying user activities and environment changes. For example, the situation modeling layer answers questions such as: "Is there a meeting going on in the smart room?", "Who is the person speaking at the whiteboard?", "Has this person been in the room before?". In the CHIL Reference Architecture, the Situation Modeling layer is implemented by the SitCom framework. Specifically, this layer is also a collection of abstractions representing the environment context in which the user interacts with the application. It thus maintains an up-to-date state of objects (people, artifacts, situations) and their relationships. The situation model acts as an inference engine, which watches for the occurrence of certain situations in the environment and triggers respective events and actions. The lower layers – Perceptual Technologies – host perceptual components that extract meaningful events from continuous streams of video and audio signals. All kinds and variations of perceptual components live in this layer. In CHIL we actually counted a total of 64 such components provided by 8 different project partners. These components process the sensor signals from one modality (body trackers, face recognizers) or modality combinations (audio-visual speech recognition) and deliver events to upper layers via a dedicated communication mechanism, the CHiLiX middleware. The CHIL Reference Architecture defines the API contract (Access, Subscriber, Control, Introspection, and Admin APIs) as well as the life cycle (Unregistered, Registered, Launched, Running) to which the perceptual components must adhere to be considered CHIL-compliant. The compliance specifies how these perceptual components operate, "advertise" themselves, subscribe to
receiving a specific sensor data stream, and how they forward their extracted context to the higher layers of the CHIL Ice Cube. The vertical columns CHIL utilities and Ontology provide support across layers. The CHIL utilities provide global timing and other basic services that are relevant to all layers. The Ontology provides a definition of CHIL concepts and a directory service that provides access to information at any layer. The CHIL Ontology is implemented in a Knowledge Base Server. It is worth emphasizing that the layers are not strictly isolated. For example, the multimodal service residing at the user front-end can render a video stream originating from the Logical sensors and actuators at the bottom of the Ice Cube. Separating the architecture into different abstract functional layers enabled the implementation of CHIL components that are self-describing, self-regulating, self-repairing and self-configuring. Via feedback and evaluation from developers, we are enhancing the CHIL architecture with features such as dynamic service lookup and invocation, system reconfiguration and adaptation. We strongly believe that these non-proprietary APIs, tools and methods, available as part of the CHIL technology catalog [19], can be embraced by a larger community of pervasive service researchers and developers.
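The component life cycle and event flow described above can be pictured with a small sketch. This is our own Python illustration only; the actual CHIL components follow the Ice Cube APIs and communicate through the CHiLiX middleware rather than in-process callbacks.

```python
from enum import Enum

class LifeCycle(Enum):
    UNREGISTERED = 0
    REGISTERED = 1
    LAUNCHED = 2
    RUNNING = 3

class PerceptualComponent:
    """Minimal stand-in for a CHIL-compliant perceptual component: it advertises
    itself, accepts subscribers, and forwards extracted events upward."""

    def __init__(self, name, registry):
        self.name, self.registry, self.state = name, registry, LifeCycle.UNREGISTERED
        self.subscribers = []

    def register(self):                       # "advertise" the component
        self.registry[self.name] = self
        self.state = LifeCycle.REGISTERED

    def launch(self):
        self.state = LifeCycle.LAUNCHED

    def subscribe(self, callback):            # upper layers subscribe to events
        self.subscribers.append(callback)

    def on_sensor_data(self, frame):          # called for each sensor data frame
        self.state = LifeCycle.RUNNING
        event = {"source": self.name, "event": "person_detected", "data": frame}
        for callback in self.subscribers:     # forward extracted context upward
            callback(event)

registry = {}
tracker = PerceptualComponent("body_tracker", registry)
tracker.register(); tracker.launch()
tracker.subscribe(lambda e: print("situation layer received:", e["event"]))
tracker.on_sensor_data(frame="camera_frame_0001")
```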
4.1 SitCom - The Situation Composer For designing, debugging and running the situation model, we have implemented a framework called SitCom [40]. SitCom (Situation Composer) is a simulator tool and runtime for the development of context-aware applications and services. SitCom gets information from perceptual and other context acquisition components and allows the composition of situation models into hierarchies to provide event filtering, aggregation, and combination. Through its IDE controls, SitCom also facilitates the capture and creation of situation machines and subsequent realistic rendering of the scenarios in a 3D visualization module. The conceptual goal of SitCom is to extract semantically higher-level information from the stream of environment observations. Technically, this is achieved through a set of situation machines (SM), where each SM is implemented as a finite state graph. Situation machines can be organized into hierarchical structures. A state of the SM is determined by a set of entities in the entity repository that mirror real objects in the environment and by the current states of other SMs. The actual implementation of a particular situation machine can be part of the Java runtime or it can be a standalone module using a remote API. In Fig. 11, the lower middle panel shows the currently active set of situation machines along with their state information, next to a 2D and a 3D visualization and the corresponding video frame from the actual meeting recording.
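A situation machine of this kind can be thought of as a small state machine driven by entity updates. The following sketch is ours, written in Python rather than the Java runtime used by SitCom, and the trigger rule and entity fields are invented for illustration.

```python
class MeetingSituationMachine:
    """Toy situation machine: switches between 'idle' and 'meeting' based on
    the entities currently tracked in the room."""

    def __init__(self, min_people=3):
        self.state = "idle"
        self.min_people = min_people

    def update(self, entities):
        """`entities` is a view of the entity repository: a list of dicts that
        mirror tracked objects, e.g. {'type': 'person', 'sitting': True}."""
        seated = sum(1 for e in entities if e["type"] == "person" and e.get("sitting"))
        new_state = "meeting" if seated >= self.min_people else "idle"
        if new_state != self.state:
            self.state = new_state
            print("situation changed to:", self.state)   # event for higher layers
        return self.state

sm = MeetingSituationMachine()
sm.update([{"type": "person", "sitting": True}])                   # stays idle
sm.update([{"type": "person", "sitting": True}] * 3 +
          [{"type": "whiteboard"}])                                # -> meeting
```

Hierarchies of such machines, as supported by SitCom, let higher-level situations (e.g. "presentation") depend on the states of lower-level ones.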
Fig. 11 SitCom running the situation model on meeting data recorded in the IBM smart room.
4.2 Agent Infrastructure The CHIL software agent infrastructure covers the upper two layers of the CHIL architecture. Agents and components close to the user, including the User Profile, are situated in the User Front-end layer. Agents in the Services and Control layer are further subdivided into basic agents, including the communication ontology, and specific service agents. Services include reusable basic services as well as complex higher-level services composed of suitable elementary services. The CHIL Agent, the CHIL Agent Manager, the communication ontology and the Service Agent form the fundamental part of the agent framework. The CHIL Agent is the basic abstract class for all agents in the CHIL environment. It provides methods for agent administrative functionality, for registration and deregistration of services provided by agent instantiations, and additional supporting functions like creating and sending ontology-based messages, extracting message contents and logging. The CHIL Agent Manager is a central instance encapsulating and adding functionality to the JADE [51] Directory Facilitator (DF). Other CHIL agents register their services with the Agent Manager, including the resources required for carrying out these services. The Service Agent is the abstract base class for all specific service agents. Service Agents provide access to the service functionality. Based on a common ontology, they map the syntactic level of the services to the semantic level of the agent community, enabling them to use other Service Agents in supplying their own service. The Travel Agent can arrange and rearrange itineraries according to the user's profile and current situation; it uses a simulated semantic web interface to gather the
Fig. 12 CHIL agent infrastructure.
required data from several online resources on the internet. The Connector Agent comprises the core Connector Service functionality: it manages intelligent communication links, both bilateral and multilateral, stores pending connections, and finds a suitable point in time for these connections to take place. The Smart Room Agent controls the service-to-user notification inside the smart room using appropriate devices such as loudspeakers, steerable video projectors and targeted audio. The Situation Watching Agent provides access to the situation model. Finally, the Memory Jog Agent is the central agent for the Memory Jog Service, providing pertinent information to the participants. As proposed in the FIPA Abstract Architecture Specification [37], the information exchange between agents is completely based upon a specific communication ontology in order to ensure that the semantic content of tokens is preserved across agents. Distributed development of multiple services and the need for them to understand each other increase the importance of such a common semantic concept. Together with a FIPA-compliant implementation of the agent communication and an agent coordination mechanism, the communication ontology forms a sophisticated technique for agent collaboration. In order to allow the distributed development of services and to facilitate their integration and configuration, a plugin mechanism has been designed that enables the implementation of service-specific code in pluggable handlers, keeping the agent itself service-independent. Three types of pluggable handlers have been considered
to be necessary: "Setup handlers" are responsible for service-specific initialization in the setup phase of an agent; "Event handlers" are registered for certain events from outside the agent world (e.g. the user's GUI, a perceptual component, the situation model or a web service); and "Pluggable responders" are triggered by incoming messages from other agents.
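The plugin mechanism can be sketched as follows. This is a language-neutral illustration in Python; the real framework is built on JADE agents in Java, and the handler contents below are invented, while the three handler types correspond to those just described.

```python
class ServiceAgent:
    """Service-independent agent core: behavior comes from pluggable handlers."""

    def __init__(self):
        self.setup_handlers = []      # run once, during agent setup
        self.event_handlers = {}      # events from outside the agent world
        self.responders = {}          # triggered by messages from other agents

    def plug(self, setup=None, events=None, responders=None):
        if setup:
            self.setup_handlers.append(setup)
        self.event_handlers.update(events or {})
        self.responders.update(responders or {})

    def start(self):
        for handler in self.setup_handlers:
            handler(self)

    def on_event(self, kind, payload):            # e.g. from a perceptual component
        handler = self.event_handlers.get(kind)
        return handler(self, payload) if handler else None

    def on_message(self, performative, content):  # e.g. from another agent
        responder = self.responders.get(performative)
        return responder(self, content) if responder else None

# Hypothetical Connector-style plug-in configuration.
agent = ServiceAgent()
agent.plug(
    setup=lambda a: print("loading user profile"),
    events={"person_entered": lambda a, p: print("queueing pending call for", p)},
    responders={"REQUEST": lambda a, c: "connection scheduled for " + c},
)
agent.start()
agent.on_event("person_entered", "user42")
print(agent.on_message("REQUEST", "15:30"))
```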
5 CHIL Services Several prototypical services were developed that instantiate the CHIL vision of perceptive and proactive services supporting human-human interaction. These services make use of the perceptual technologies and are integrated using the software architecture described in the previous section. The chosen application domains were lectures and small office meetings (up to 4-5 participants). The development process took a User-Centered Design (UCD) approach, which is characterized by taking into consideration the user perspective in all stages of the design process. CHIL services address intrinsically social scenarios, where two or more people interact to reach both individual and shared (group-level) goals. They aim to unobtrusively and seamlessly offer better chances to people for managing meeting-related activities. On the basis of an analysis of primary processes, services were proposed for supporting task-related activities such as discussing, producing and sharing material, accomplishing common goals (the Cooperative Workspace) and providing background information and memory assistance (the Memory Jog), for managing the social interaction and fostering a better quality of the interaction (the Relational Report and the Relational Cockpit), and for managing contacts with the outside world (the Connector). These goals are pursued by means of services that perceive the ongoing communication, reason about what they perceive and act to meet people's needs and objectives, either autonomously or on demand. User-Centered Design (UCD) was adopted as a general methodology to ensure that the services developed provide value to people, e.g., in terms of performance, perceived helpfulness, etc. The consistency and appropriateness of the initial service concepts, as they emerged from brainstorming within the consortium, were validated by means of interviews and focus groups with potential end users. This stage produced a set of clearer requirements, modifications and refinements of the original concepts, which were used for the design of the various services, followed by the implementation of the first prototypes. Formative evaluations of the designs were performed by means of user studies conducted with mock-ups, simulations, and prototypes. Despite the fact that the usefulness of simulations and mock-ups can be limited by the status of the perceptual components involved – continuous monitoring of people's locations, focus of attention, acoustic events, etc., is not an easy job for humans to perform or to simulate in mock-ups – at least in some cases it was possible to both validate the services' conception and collect information for the later stages of the UCD cycle. For instance, it turned out that people find a computer's feedback about their social behavior as acceptable as that from a human expert, and that such a
feedback can actually influence behavior, making it more appropriate to the goals of the meeting. Insights from the initial phases were fed into the following implementation phase, leading to the deployment of the working prototypes for the services described above. The latter were tested in summative evaluations, with two purposes. The first purpose was to establish whether and to what extent our services enhanced productivity, both in terms of effectiveness (quality of solution) and efficiency (speed in arriving at the solution). The second purpose was to establish whether and to what extent our services enhanced the satisfaction of meeting participants, in terms of dimensions such as ease of use, usefulness, involvement, group engagement, etc. These data were used to model people's intention to use the various services in terms of networks where the psychological and motivational dimensions are posited to causally affect intention to use. The model was realized by means of a questionnaire measuring the relevant dimensions by means of appropriate psychometric scales, and was tested by applying Partial Least Squares Structural Equation Modeling.
5.1 The Memory Jog CHIL produced a number of prototype services in order to validate the technologies, but also to manifest the benefits of context-aware implicit services. One of these services was the Memory Jog (MJ), a complex integrated context-aware service that was developed in line with the CHIL Reference Architecture. Apart from the CHIL Reference Architecture, the Memory Jog service made extensive use of perceptual components of the CHIL technology catalog [19]. MJ was developed and demonstrated within the CHIL smart rooms [83]. From a functional perspective the MJ service is a pervasive, non-obtrusive, human-centric assistant for lectures, meetings and presentations, which supports:
• Continuous room monitoring enabling identification and tracking of persons in the room. To this end the service leverages face detector, body tracker and face identification components.
• Context-aware provision of information concerning past events and access to associated information. Information is provided upon user request, but also in a proactive and non-obtrusive manner (i.e. based on both "push" and "pull" modes). Moreover, information can be provided to participants physically present in the meeting/lecture, as well as remote participants who attend the meeting/lecture via a networked device.
• Context-aware access to actuating services (such as display services and targeted audio). These services are appropriately invoked based on the status of the user and his/her surrounding environment. The MJ supports a range of automatic, ambient and context-aware invocations of services (e.g., context-aware opening and display of slides).
• Ambient video recording of the meeting from a multi-camera system. A multi-camera selection algorithm optimally selects the best camera stream for the
current speaker, while recordings are tagged according to contextual states. In this way, the MJ system integrates the functionality of an automated "intelligent" cameraman, without any human intervention.
• A "Show me the video" functionality enabling searching and accessing past video recordings, according to the collection of past contextual states used to tag the recordings. Hence, participants can recall past video segments pertaining to their context.
• A "What happened while I was away?" functionality providing a summary of events that occurred during a user's short periods of absence.
• Context-aware integration with additional third-party utilities such as Skype and Google Calendar. This functionality enables the services to initiate Skype sessions and Google searches, according to the end-user's needs.
From an integration perspective the MJ service leverages structuring principles and middleware libraries specified in the CHIL Reference Architecture. Specifically, the CHiLiX middleware bridge [27] is used to integrate perceptual components and service elements, which are hosted on different platforms. At the heart of the MJ implementation is also the situation modeling approach defined in the CHIL Reference Architecture. In particular, the MJ is supported by a situation model which defines a graph of situations comprising the context states of interest for the service, along with the possible transitions between these states. Situation states are triggered by combinations of perceptual components including body trackers, face and speech identification components, speech recognizers, acoustic localization components, face detectors, as well as speech activity detectors. These components provide elementary context cues (e.g., a person's location), which are then combined toward identifying the more sophisticated contextual states (e.g., a person returning to the room, a person conducting a presentation, a person addressing a question). Perceptual component combinations for triggering the states of the situation model, as well as the situation model itself, are depicted in Fig. 13, which is derived from [27]. According to the CHIL situation modeling approach, a situation is triggered when all perceptual component outputs are close to the target values. Implementing the MJ required thorough design and testing of the situation model. This was challenging given that most of the underlying perceptual components were in their infancy when the development started. The SitCom tool [27] facilitated the parallel evolution of the development of perceptual components and situation models. Service developers were able to systematically simulate contextual states and models, until they arrived at a model that meets end-user requirements while also being in line with the functionalities of the perceptual components. Note, however, that the rule-based approach adopted for situation modeling in CHIL can be error prone, as a result of failing or inaccurate perceptual components. To this end, the CHIL project has also investigated and implemented statistical modules for situation machines (see [83] and references therein). Such statistical modules can handle and resolve conflicting input in cases of uncertainty. In addition to the CHIL situation modeling approach, MJ also makes use of the CHIL Knowledge Base Server (KBS) [83]. The KBS provides directory services for all the components of the MJ service. Sensors, devices, perceptual components,
Fig. 13 The situation model for monitoring meetings as adopted by the AIT Memory Jog service.
actuating services, as well as the smart room configuration, are registered within the semantic KBS. By querying the KBS, perceptual components can automatically get a reference to their required sensors and can also access information about the room's physical objects. Similarly, situation models can automatically discover their required perceptual components. The KBS facilitates discovery and use of the various MJ components (e.g., sensors, perceptual components, actuating services) in a nearly "plug n' play" fashion. Overall, the MJ is a manifestation of a non-trivial pervasive service that consists of several hardware, software and middleware components in a highly distributed and heterogeneous environment. This service has validated a great number of CHIL perceptual processing components, while also providing insights into the benefits of implicit human-centric services. At the same time, the MJ developments are a first-class manifestation of CHIL contributions in the area of integrated development of pervasive systems for smart spaces [27].
5.2 The Collaborative Workspace The Collaborative Workspace (CW) is an infrastructure for fostering cooperation among participants. The system provides a multimodal interface for entering and manipulating contributions from different participants, e.g., by supporting joint
discussion of minutes or joint accomplishment of a common task, with people proposing their ideas and making them available on the shared workspace, where they are discussed by the whole group. Technologies to support human-human collaboration have always been a hot topic for computer science. Meetings in particular represent a stimulating topic since they are a common, yet at the same time problematic, human activity. One of the seminal studies informing the design of technology to support meetings is [91]. The aim of that work was to inform the design of an application to support remote meetings, yet the author described observations of real face-to-face meetings. The list of work on remote meetings published since then is long enough to discourage any attempt at synthesis. In recent years, the emergence of hardware able to at least partially support multiple users has raised interest in technologies to support face-to-face collaboration [64]. Shen and colleagues [79] proposed the use of DiamondTouch, a true multi-user touch device, to support collocated interaction. They proposed a horizontal circular interface to address the problem of the different users' points of view around the table, while users manipulated the objects on the projected display by touching the device with their fingers. Kray and colleagues [55] discuss several issues that arise in the design of interfaces for multiple users interacting with multiple devices, with a focus on user and device management, technical concerns and social concerns. The Collaborative Workspace is realized as a horizontal device and implemented as a top-projected interface that turns a standard wooden table into an active surface (see Fig. 14, left). It also integrates person tracking facilities to detect the (possibly changing) location of the users around the table, moving toolbars accordingly. The main objects are virtual sheets of paper that can be created, edited, rotated and shared by the participants and used as generic textual or graphical documents. They are acted upon by means of an electronic pen (see Fig. 14, right). In order to have a more natural way of rotating documents, the Rotate and Translate algorithm proposed by Kruger and colleagues [56] was implemented. This technique allows the rotation and the movement of the window with a single gesture by dragging the window from a border. The system is also augmented with an agenda that allows the participants to keep track of the time spent on each item. Furthermore, when documents are dragged onto an agenda item, it stores them and uses them for semi-automatic generation of the minutes. In its simplest form, the minutes contain just an indication of the time spent and, if any, the documents generated from a template. A more complex use of the agenda tool, which leads to a more elaborate report, involves a special type of document called the "note". Notes look very similar to textual documents, with the difference that they also have space to store other documents as attachments. When dragged into an agenda item, however, the text of the note becomes part of the report. A careful use of the notes, therefore, allows the users to generate rich reports during the meetings. In order to evaluate the prototype, several project managers working at FBK-irst in various research projects in the Information Technology field were requested to hold their meetings in a special room equipped with the tabletop device and video recording infrastructure.
Overall, 16 meetings by 5 different groups were recorded during the period from December 2004 to June 2005. The main purpose of the
Fig. 14 Left, the Collaborative Workspace. Right, a snapshot of the interface.
meetings was project coordination, decision making, sharing of information, planning of future individual work, and so on. It is worth noting that the small groups observed can actually be defined as teams, i.e. groups of people who are task interdependent and share a goal or purpose [53]. Two broad dimensions emerged from the observations: the patterns of technology use and the styles of interaction that the groups employed. While the patterns of use make evident the lack or redundancy of some of the functionalities, thus providing direct insights for a future re-design, the styles of interaction revealed how the collocated interface may help in structuring the group's style of interaction and how the interface promotes certain kinds of collaboration. More details on the results of the evaluation and the design process can be found in [94]. The most important pattern of use relates to the use of the documents. For two of the five groups, the documents served mainly as a whiteboard. The documents were enlarged to roughly the size of the interface surface and used to support communication by drawing diagrams and schemes. In this way the documents foster communication and the common understanding of a problem by providing a shared attentional focus for participants. In other meetings, documents were used to simulate functionalities not provided by the system. For example, a calendar and a timeline have been sketched in order to plan future activities. Both these aspects relate to the concept of appropriation [29], which is defined as the process by which a technology is adopted and adapted into a working practice. Appropriation is claimed to be endemic to collaborative work and is a twofold process: on one side, users modify their behavior to fit the system constraints, while on the other side they invent new uses to accomplish tasks not foreseen by the designers. In some cases the tabletop device has been circumvented by traditional technologies and artifacts, such as whiteboards, pens and, above all, paper-based resources. In any ecological situation, the technology has to "negotiate" its space with other artifacts. Harper and Sellen [78] demonstrated, for example, the importance of paper in supporting collaborative work. In our observations too, paper and electronic tools co-exist and contribute to the meeting activities. Paper is mainly used at the personal level during the meeting, usually to take personal notes. Yet paper resources also served to bring into the meeting the preparatory work done beforehand (as
handouts or as personal notes). The necessity of linking the meeting to the activity of the team before and after the meeting itself emerged in several interviews.

Three main working styles were identified by observing the allocation of inputs between participants and the functionalities employed: division of labor, specialization in roles, and centralized management. These patterns contribute to understanding the modalities through which small groups implicitly or explicitly coordinate their actions in order to carry out a collaborative activity with the collocated interface. This work is also significant as a first step toward the identification of the learning and repair processes that groups display to overcome system limitations or unexpected system behavior. The CW proved flexible enough for each group to accommodate it into their work practices while not preventing the teams from continuing to use their normal tools (from sheets of paper to personal laptops). Yet, the introduction of the system required the groups to engage in a sense-making process that was sometimes disruptive to the cohesion and the efficiency of the meeting. In designing this kind of system it is crucial to consider both its usability and its impact on established working practices.
5.3 Managing the Social Interaction: The Relational Cockpit and the Relational Report

The availability of rich multimodal information can be exploited to provide group-oriented services. Our approach to service development was inspired by the observation that professional coaches are often brought in to facilitate group sessions and to provide feedback to team members about their behavior, in order to help the team improve its social interaction and the effectiveness of its meetings. Two services were developed and evaluated. The Relational Cockpit (RC) monitors a meeting and provides feedback during it about participants' speaking time and eye gaze behavior. The Relational Report (RR) is a report about the social behavior of individual participants that is generated from multimodal information and privately delivered to them after the meeting [70]. In both cases, the underlying idea is that the individuals, the group(s) they are part of, and the whole organization might benefit from increased awareness of participants about their own behavior during meetings.
5.3.1 The Relational Cockpit

The Relational Cockpit aimed to provide non-obtrusive feedback about participants' social behavior during a meeting. This aim led us to adopt the constraint that the service should not provide in-depth feedback (this was reserved for the Relational Report), but rather feedback that could be easily grasped from a peripheral display. The focus of our research was on meetings where all participants
are supposed to contribute and reach consensus about a course of action. Based on literature reports, analyses of meeting behavior and interviews, we identified aspects of non-verbal behavior that are indicative of the social inclusion of meeting participants as most suited for feedback. In the first place, the speaking time of individual participants provides information about dominance patterns, which can lead to the monopolization of the conversation by certain participants and the exclusion of others. In the second place, the eye gaze patterns of meeting participants provide information about turn-giving and turn-taking behavior during a meeting. Since the current speaker has the privilege to select the next speaker, and eye gaze plays an important role in identifying the addressee of an utterance, meeting participants who are not gazed at by the current speaker will feel neglected and excluded.

Based on this reasoning, we devised a service that provided feedback about speaking time and eye gaze patterns to meeting participants. Speaking time was obtained from a speech diarization component (indicating who is speaking when). Eye gaze patterns were estimated from a perceptual component that measured head orientation (the physical arrangement of the meeting participants around the table was such that shifts in eye gaze across participants required a change in head orientation). The extracted speaking time and eye gaze patterns were visualized in the form of colored circles that were projected on the table in front of each participant and that grew as the meeting evolved. The circle for speaking time reflected the amount of time that a participant had been speaking since the beginning of the meeting: a relatively large circle, compared to those of other participants, would identify that participant as an "over-participator", i.e. someone who had taken a disproportionately large share of the speaking time. The circle for eye gaze reflected the amount of time that a participant had been gazed at by the speaker since the beginning of the meeting, summed across the other participants: a relatively large circle would mean that the participant had been gazed at for a large share of the time by the other participants while they were speaking. There was a third circle for each participant, reflecting how much visual attention he or she received from the other participants while speaking, but we leave this out of consideration here.

Reliability analyses were conducted by comparing post-hoc manual annotations with the automatically extracted speaking time and eye gaze patterns. It was found that, while the accuracy of the automatically extracted patterns was only modest at the micro-level, there was satisfactory agreement between the manually and automatically extracted patterns at the cumulative level. This means that the duration of individual speaking turns and shifts in eye gaze were not extracted very accurately, but, summing over larger stretches of the conversation, the total speaking time and gaze time for individual participants were estimated at a satisfactory level.

Two user tests were conducted to test the impact of the service on the social interaction during meetings [57, 89]. Summed across the two tests, thirty-one groups of four participants tried to reach an agreement about pre-specified topics requiring negotiation in two sessions.
One session was a normal meeting; in the other session the group received feedback about its social interaction. Each session lasted
about twenty minutes. The order of sessions with and without feedback was balanced across groups. It was found that participants who contributed relatively much ("over-participators") in the session without feedback used less speaking time in the session with feedback, and that participants who contributed relatively little ("under-participators") in the session without feedback used more speaking time in the session with feedback. Further inspection of the data suggested that the effects were mainly due to the feedback on speaking time. There was no evidence that feedback on eye gaze patterns led speakers to change their eye gaze patterns in a way that would result in better inclusion of "under-participators".
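As a rough sketch of how such cumulative feedback can be derived, the following code aggregates diarization segments (who spoke when) and speaker-gaze segments into per-participant totals and maps each total to a circle radius. The segment format and the area-proportional scaling rule are assumptions for illustration, not the actual CHIL components.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

SpeechSeg = Tuple[str, float, float]      # (speaker, start_s, end_s)
GazeSeg = Tuple[str, str, float, float]   # (speaker, gazed_at, start_s, end_s)

def cumulative_feedback(speech: List[SpeechSeg],
                        gaze: List[GazeSeg]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Aggregate diarization and speaker-gaze segments into per-participant totals."""
    speaking = defaultdict(float)   # total speaking time per participant
    gazed_at = defaultdict(float)   # time a participant was looked at by the current speaker
    for who, t0, t1 in speech:
        speaking[who] += t1 - t0
    for speaker, target, t0, t1 in gaze:
        if speaker != target:
            gazed_at[target] += t1 - t0
    return speaking, gazed_at

def circle_radius(value: float, total: float, max_radius_px: float = 80.0) -> float:
    """Map a participant's share of the total onto a circle radius.
    Square-root scaling makes the circle *area* proportional to the share."""
    return 0.0 if total <= 0 else max_radius_px * math.sqrt(value / total)

if __name__ == "__main__":
    speech = [("Ann", 0, 60), ("Bob", 60, 75), ("Cem", 75, 80)]
    gaze = [("Ann", "Bob", 0, 40), ("Ann", "Cem", 40, 60), ("Bob", "Ann", 60, 75)]
    spk, gzd = cumulative_feedback(speech, gaze)
    for p in sorted(set(spk) | set(gzd)):
        r_speak = circle_radius(spk.get(p, 0.0), sum(spk.values()))
        r_gaze = circle_radius(gzd.get(p, 0.0), sum(gzd.values()))
        print(f"{p}: speaking circle {r_speak:.0f}px, gaze circle {r_gaze:.0f}px")
```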
5.3.2 The Relational Report

The success of a meeting is often hindered by the participants' behavior: professionals agree that as much as 50% of meeting time is unproductive and that up to 25% of meeting time is spent discussing irrelevant issues [30]. In order to improve the productivity of meetings, external interventions such as facilitators and training experiences are commonly employed. Facilitators participate in the meetings as external elements of the group; their role is to help participants maintain fair and focused behavior, as well as to direct and set the pace of the discussion. Training experiences aim at increasing the relational skills of individual participants by providing offline (with respect to meetings) guidance – or coaching – so that the team eventually becomes able to overcome, or to cope with, its dysfunctions.

The Relational Report (RR) aims at providing coaching support in an automatic way, by monitoring groups and generating individual reports about the participants' relational behavior. The system does not keep a verbatim trace of what people said or did during the meeting, and does not produce minutes; rather, it makes available more qualitative, meta-level interpretations of people's social behavior. The reports are delivered privately to each participant after the meeting, with the purpose of informing them about their behavior, thereby stimulating reflective processes and, ultimately, behavioral change [12]. Support for the idea that automatic coaching might be as effective as human coaching comes from studies showing that people rate reports about their own relational behavior similarly on dimensions such as perceived usefulness, reliability, intrusiveness, and acceptability, regardless of whether the reports were generated by a human expert or by a system [70].

The RR composes its reports from information about the functional relational roles the given person has played during the meeting. Building on the work of Benne and Sheats [6] and Bales [5], we identified two sets of role labels: the Task Area, roles related to facilitation and coordination tasks as well as to the technical expertise of members; and the Socio-Emotional Area, which is concerned with the relationships between group members and the "functioning of the group as a group". The Task Area functional roles include: the Orienteer, who orients and keeps the group focused; the Giver, who provides factual information and answers to questions; the Seeker, who requests information and clarifications; and the Follower, who just listens, without
actively participating. The Socio-Emotional functional roles include: the Attacker, who deflates the status of others, expressing disapproval; the Protagonist, who takes the floor and drives the conversation; the Supporter, who shows a cooperative attitude, demonstrating understanding and attention; and the Neutral, played by those who passively accept the ideas of the others, serving as an audience in the group discussion; see [69].

Information concerning the functional relational roles played by participants is assembled by the system from low-level audio-visual inputs, in particular speech activity and fidgeting (that is, the overall level of hand and body activity). These features were exploited in experiments aiming to investigate and compare the performance of a number of classifiers for role detection: Support Vector Machines, Hidden Markov Models and the Influence Model [99, 28]; the best results were obtained with the Influence Model, with an accuracy of 75% for both "task" and "socio" roles. The RR uses this information to compose multimedia reports (see Fig. 15).
Fig. 15 An example of a multimedia relational report for one meeting participant.
The textual part describes the behavior in an informative rather than normative way, helping the user accomplish the first step of the reflective process – namely, the return to experience [12]. It is based on a general-purpose schema-based text planner [15] which accesses a repository of declarative discourse schemata derived from the analysis of actual reports written by psychologists. To improve effectiveness and emotional involvement, a virtual character reads the report with emotional facial expressions appropriate to the content (e.g., a sad expression is used when something unpleasant, such as a serious disagreement with a colleague, is being recalled). When appropriate, the presentation is enriched with short audio-video clips from the actual meeting, which exemplify the information presented. Finally, a graphical representation of the participant's behavior is also provided, yielding more explicit feedback about the system's internal interpretation of what was monitored
during the meeting. The graphical part was inspired by the visualizations discussed by Bales [5], and expresses at a glance the interaction behavior of a participant by profiling the amount of time spent in each task and socio-emotional role. More information about the presentation engine can be found in [70].
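To make the classification setup concrete, here is a minimal sketch that turns per-participant speech-activity and fidgeting measurements into feature vectors and trains one of the classifiers mentioned above (an SVM, via scikit-learn). The feature choices, windowing and toy training data are invented for illustration; the actual experiments used annotated meeting corpora and also evaluated HMMs and the Influence Model [99, 28].

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

TASK_ROLES = ["orienteer", "giver", "seeker", "follower"]

def window_features(speech_activity: np.ndarray, fidgeting: np.ndarray) -> np.ndarray:
    """Summarize one observation window for one participant.
    speech_activity: binary voice-activity samples; fidgeting: body/hand energy samples."""
    return np.array([
        speech_activity.mean(),                       # fraction of the window spent speaking
        np.diff(speech_activity).clip(min=0).sum(),   # number of speaking turns started
        fidgeting.mean(),                             # average fidgeting energy
        fidgeting.std(),                              # variability of fidgeting
    ])

# Toy training data: one feature vector per (participant, window), with a role label.
rng = np.random.default_rng(0)
X = np.vstack([window_features(rng.integers(0, 2, 300), rng.random(300)) for _ in range(80)])
y = rng.choice(TASK_ROLES, size=80)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, y)
print(clf.predict(X[:3]))
```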
5.4 The Connector: A Virtual Secretary in Smart Offices

Much of the communication at the workplace – via the phone as well as face-to-face – occurs in inappropriate contexts, disturbing meetings and conversations, invading personal and corporate privacy, and more broadly breaking social norms. This is because both callers and visitors in front of closed office doors face the same problem: they can only guess the other person's current availability for a conversation. The Connector Service was designed to facilitate more socially appropriate communication at the workplace by acting similarly to a context-aware Virtual Secretary. Concordant with the CHIL paradigm, it aims at understanding a person's situation and activity in smart offices, and passes important contextual information on to callers and visitors in order to facilitate more informed human decisions about how and when to initiate contact [23]. Deploying such a context-aware Virtual Secretary requires an instrumented environment, able to sense situations, recognize people and track activities. At UKA-ISL, cameras were installed in "smart" offices (Fig. 16), and a set of perception technologies was employed, as described previously, in order to detect basic social situations: whether the office was empty, whether the occupant was alone, or whether he was in a face-to-face meeting with others.
Fig. 16 Camera snapshots of different smart offices with occupants present.
The Connector as Virtual Secretary mediated all phone calls as well as face-to-face interactions in the smart offices. Knowing the current situation, it could warn callers at the time of the call when the situation was likely to be inappropriate for talking on the phone.

Virtual Secretary [synthesized voice]: "Hello Jane. Thank you for calling Bob Smith. This is his Digital Secretary. Bob is currently in a meeting with Eric. Your call is important. Please hang on, or press 1, to leave a message. Instead, you may
now press 2 to be connected to the office phone. To schedule a call at the next available time, please press 4."

Similarly, once a visitor was detected in front of the office, the Virtual Secretary provided him or her with important contextual information about the situation inside the office (Fig. 17). The goal was to prevent untimely disruptions from taking place, such as during meetings or phone conversations. In order to detect and identify people, a small webcam was placed on the screen that greeted visitors in front of the office door. A few seconds were enough for the face recognition system [35] to accurately detect and identify members of our research lab (approximately 15 persons) and distinguish them from unknown persons.
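A condensed sketch of the mediation logic: the detected office situation is mapped to an announcement and a set of keypad options for the caller. The state names and option wording below are illustrative assumptions; the deployed Connector relied on the perception components described earlier and on richer dialogue handling [23].

```python
from enum import Enum, auto

class OfficeState(Enum):
    EMPTY = auto()
    ALONE = auto()
    IN_MEETING = auto()

def mediate_call(callee: str, state: OfficeState, companions=()):
    """Return (announcement, keypad options) for an incoming call."""
    options = {"1": "leave a message", "4": "schedule a call at the next available time"}
    if state is OfficeState.ALONE:
        announcement = f"{callee} is in the office and appears to be available."
        options["2"] = "be connected to the office phone"
    elif state is OfficeState.IN_MEETING:
        who = " and ".join(companions) or "colleagues"
        announcement = f"{callee} is currently in a meeting with {who}."
        options["2"] = "be connected anyway (may interrupt the meeting)"
    else:
        announcement = f"{callee} is not in the office at the moment."
    return announcement, options

if __name__ == "__main__":
    msg, opts = mediate_call("Bob Smith", OfficeState.IN_MEETING, ("Eric",))
    print(msg)
    for key, action in sorted(opts.items()):
        print(f"  press {key} to {action}")
```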
Fig. 17 A visitor consults the Virtual Secretary at the office door. A small webcam was placed on top of the screen from which the Virtual Secretary greeted visitors.
The Virtual Secretary was installed in front of the office of a senior researcher, who was very frequently occupied, and mediated all his phone calls and meetings over a period of three weeks. The system provided a web-based diary of all office activity, as shown in Fig. 18. At the end of each day, the participant was asked to browse this diary and annotate each interruption according to how available he had been and how appropriate the interruption was. Observations and interviews showed that the system ran robustly over this extended period of time and was easy to use, even without any prior experience with it. A total of 150 contact attempts (76 of which were phone calls) were logged by the system. Analysis of the appropriateness ratings of all interruptions showed that there were significantly fewer inappropriate workplace interruptions when the Virtual Secretary mediated phone calls and direct interactions [23].
6 Conclusion

Implicit, proactive computing services, as proposed in CHIL, are a way to free humans from unnecessary and unwanted preoccupation with technological artifacts.
Fig. 18 The office activity diary was shown as an interactive timeline widget.
We can think of CHIL computing as a whole new human-centered way of opening up access to and deployment of computing services. Such services allow the attention of the human beneficiary of the services to shift away from human-machine interaction to (machine-supported) human-human interaction, in the interest of improved productivity and user satisfaction. Such proactive services, however, require considerably broader, better and more robust perceptual capabilities to suitably judge user activities, intent and needs. Under CHIL, we have proposed, implemented, improved and evaluated such perceptual processing over publicly available benchmarks, and we were able to integrate these technologies into multiple service prototypes efficiently under a common software architecture. It is our hope that this will seed and promote a drive for better and more appealing clutter-free computing.
Acknowledgement The work presented here was funded by the European Union (EU) under the integrated project CHIL, Computers in the Human Interaction Loop (Grant number IST-506909).
References

[1] Abad, A., Canton-Ferrer, C., Segura, C., Landabaso, J.L., Macho, D., Casas, J.R., Hernando, J., Pardas, M., Nadeu, C.: UPC Audio, Video and Multimodal Person Tracking Systems in the CLEAR Evaluation Campaign. In: Multimodal Technologies for Perception of Humans, Proceedings of the First International CLEAR Evaluation Workshop. Springer LNCS 4122, Southampton, UK (2006)
[2] Anastasakos, T., McDonough, J., Schwartz, R., Makhoul, J.: A compact model for speaker adaptation training. In: Proc. Int. Conf. Spoken Language Process. (ICSLP), pp. 1137–1140. Philadelphia, PA (1996)
[3] Andreou, A., Kamm, T., Cohen, J.: Experiments in vocal tract normalisation. In: Proc. CAIP Works.: Frontiers in Speech Recognition II (1994)
[4] Anguera, X., Wooters, C., Hernando, J.: Acoustic beamforming for speaker diarization of meetings. IEEE Trans. Audio Speech Language Process. 15(7), 2011–2022 (2007)
[5] Bales, R.F.: Interaction process analysis: a method for the study of small groups. University of Chicago Press (1976)
[6] Benne, K.D., Sheats, P.: Functional roles of group members. Journal of Social Issues 4, pp. 41–49 (1948)
[7] Bernardin, K., Gehrig, T., Stiefelhagen, R.: Multi-Level Particle Filter Fusion of Features and Cues for Audio-Visual Person Tracking. In: Multimodal Technologies for Perception of Humans, Proceedings of the International Evaluation Workshops CLEAR 2007 and RT 2007, LNCS, vol. 4625, pp. 70–81. Springer, Baltimore, MD, USA (2007)
[8] Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, Special Issue on Video Tracking in Complex Scenes for Surveillance Applications (2008)
[9] Beskow, J., Karlsson, I., Kewley, J., Salvi, G.: SYNFACE - A talking head telephone for the hearing-impaired, pp. 1178–1186. Springer-Verlag (2004)
[10] Beskow, J., Nordenberg, M.: Data-driven synthesis of expressive visual speech using an MPEG-4 talking head. In: Proceedings of Interspeech 2005. Lisbon (2005)
[11] Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(3), 257–267 (2001)
[12] Boud, D., Keogh, R., Walker, D. (eds.): Reflection: Turning experience into learning. Kogan Page, London (1988)
[13] Brunelli, R., Brutti, A., Chippendale, P., Lanz, O., Omologo, M., Svaizer, P., Tobia, F.: A generative approach to audio-visual person tracking. In: Multimodal Technologies for Perception of Humans, Proceedings of the First International CLEAR Evaluation Workshop, pp. 55–68. Springer LNCS 4122, Southampton, UK (2006)
[14] Brutti, A.: A person tracking system for CHIL meetings. In: Multimodal Technologies for Perception of Humans, Proceedings of the International Evaluation Workshops CLEAR 2007 and RT 2007, LNCS, vol. 4625. Springer, Baltimore, MD, USA (2007)
[15] Callaway, C., Not, E., Stock, O.: Report generation for post-visit summaries in museum environments. In: O. Stock, M. Zancanaro (eds.). PEACH: Intelligent Interfaces for Museum Visits. Springer (2007) [16] Canton-Ferrer, C., Casas, J.R., Pardàs, M.: Human model and motion based 3D action recognition in multiple view scenarios (invited paper). In: 14th European Signal Processing Conference, EUSIPCO. EURASIP, University of Pisa, Florence, Italy (2006). ISBN: 0-387-34223-0 [17] Canton-Ferrer, C., Salvador, J., Casas, J., M.Pardas: Multi-person tracking strategies based on voxel analysis. In: Multimodal Technologies for Perception of Humans, Proceedings of the International Evaluation Workshops CLEAR 2007 and RT 2007, LNCS, vol. 4625, pp. 91–103. Springer, Baltimore, MD, USA (2007) [18] Canton-Ferrer, C., Segura, C., Casas, J.R., Pardàs, M., Hernando, J.: Audiovisual head orientation estimation with particle filters in multisensor scenarios. EURASIP Journal on Advances in Signal Processing (2007) [19] The CHIL technology catalogue. http://chil.server.de/ servlet/is/5777/ [20] Chippendale, P., Lanz, O.: Optimised meeting recording and annotation using real-time video analysis. In: Proc. 5th Joint Workshop on Machine Learning and Multimodal Interaction, MLMI08. Utrecht, The Netherlands (2008) [21] CLEAR – Classification of Events, Activities, and Relationships Evaluation and Workshop: http://www.clear-evaluation.org [22] The CLEF Website: http://www.clef-campaign.org/ [23] Danninger, M., Stiefelhagen, R.: A context-aware virtual secretary in a smart office environment. In: Proceedings of the ACM Multimedia 2008. Vancouver, Canada (2008) [24] Davis, S., Mermelstein, P.: Comparison of parametric representations of monosyllabic word recognition in continuously spoken sentences. IEEE Trans. on Acoustics, Speech, and Signal Process. 28(4), 357–366 (1980) [25] D2.2 functional requirements and chil cooperative information system software design, part 2, cooperative information system software design. Available on http://chil.server.de [26] Dempster, A.P., Laird, M.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. of the Royal Statistical Society Series B (methodological) 39, 1–38 (1977) [27] Dimakis, N., Soldatos, J., Polymenakos, L., Curin, J., Fleury, P., Kleindienst, J.: Integrated development of context-aware applications in smart spaces. IEEE Pervasive Computing 7(4), 71–79 (2008) [28] Dong, W., Lepri, B., Cappelletti, A., Pentland, A., Pianesi, F., Zancanaro, M.: Using the influence model to recognize functional roles in meetings. In: Proceedings of the International Conference on Multimodal Interaction ICMI2007. Nagoya, Japan (2007) [29] Dourish, P.: The appropriation of interactive technologies: Some lessons from placeless documents. Computer Supported Cooperative Work (2003)
[30] Doyle, M., Straus, D.: How To Make Meetings Work. The Berkley Publishing Group, New York, NY (1993) [31] Edlund, J., Beskow, J.: Pushy versus meek - using avatars to influence turntaking behaviour. In: Proceedings of Interspeech 2007 ICSLP, pp. 682–685. Antwerp, Belgium (2007) [32] Edlund, J., Gustafson, J., Heldner, M., Hjalmarsson, A.: Towards human-like spoken dialogue systems. Speech Communication 50(89), 630–645 (2008). URL http://www.speech.kth.se/prod/ publications/files/3145.pdf [33] Edlund, J., Heldner, M.: Exploring prosody in interaction control. Phonetica 62(2-4), 215–226 (2005) [34] Edlund, J., Heldner, M.: Underpinning /nailon/: automatic estimation of pitch range and speaker relative pitch. In: C. Müller (ed.) Speaker Classification. Springer/LNAI (2007) [35] Ekenel, H.K., Stiefelhagen, R.: Analysis of local appearance-based face recognition: Effects of feature selection and feature normalization. In: CVPR Biometrics Workshop. New York, USA (2006) [36] ELRA Catalogue of Language Resources: http://catalog.elra. info [37] FIPA: The foundation for intelligent physical agents. http://www.fipa. org [38] Fiscus, J.G.: A post-processing system to yield reduced word error rates: Recogniser output voting error reduction (ROVER). In: Proc. Automatic Speech Recognition and Understanding Works. (ASRU), pp. 347–352. Santa Barbara, CA (1997) [39] Fiscus, J.G., Ajot, J., Michel, M., Garofolo, J.S.: The Rich Transcription 2006 Spring meeting recognition evaluation. In: S. Renals, S. Bengio, J.G. Fiscus (eds.) Machine Learning for Multimodal Interaction, vol. 4299, pp. 309–322. LNCS (2006) [40] Fleury, P., Cuˇrín, J., Kleindienst, J.: S IT C OM - development platform for multimodal perceptual services. In: Proceedings of the 3nd International Conference on Industrial Applications of Holonic and Multi-Agent Systems, pp. 106–113. Regensburg, Germany (2007). V. Marik, V. Vyatkin, A.W. Colombo (Eds.): HoloMAS 2007, LNAI 4659 [41] Gauvain, J.L., Lee, C.: Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains. IEEE Trans. on Speech and Audio Processing 2(2), 291–298 (1994). URL ftp://tlp.limsi. fr/public/map93.ps.Z [42] Gehrig, T., McDonough, J.: Tracking multiple speakers with probabilistic data association filters. In: Multimodal Technologies for Perception of Humans, Proceedings of the First International CLEAR Evaluation Workshop. Springer LNCS 4122, Southampton, UK (2006) [43] Gopinath, R.: Maximum likelihood modeling with Gaussian distributions for classification. In: Proc. Int. Conf. Acoustics Speech Signal Process. (ICASSP), pp. 661–664. Seattle, WA (1998)
[44] Haeb-Umbach, R., Ney, H.: Linear discriminant analysis for improved large vocabulary continuous speech recognition. In: Proc. Int. Conf. Acoustics Speech Signal Process. (ICASSP), vol. 1, pp. 13–16 (1992) [45] Heldner, M., Edlund, J., Carlson, R.: Interruption impossible. In: M. Horne, G. Bruce (eds.) Nordic Prosody: Proceedings of the IXth Conference, Lund 2004, pp. 97–105. Peter Lang, Frankfurt am Main (2006) [46] Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. J. Acoustic Society America 87(4), 1738–1752 (1990) [47] Huang, J., Marcheret, E., Visweswariah, K.: Improving speaker diarization for CHIL lecture meetings. In: Proc. Interspeech, pp. 1865–1868. Antwerp, Belgium (2007) [48] Huang, J., Marcheret, E., Visweswariah, K., Potamianos, G.: The IBM RT07 evaluation systems for speaker diarization on lecture meetings. In: Multimodal Technologies for Perception of Humans, Proceedings of the International Evaluation Workshops CLEAR 2007 and RT 2007, LNCS, vol. 4625, pp. 497–508. Springer, Baltimore, MD, USA (2007) [49] Hugot, V.: Eye gaze analysis in human-human communication. Master thesis, KTH Speech, Music and Hearing (2007) [50] Ivanov, Y.A., Bobick., A.F.: Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 852–872 (2000) [51] JADE: Java Agent DEvelopent Framework. http://jade.tilab.com [52] Katsarakis, N., Talantzis, F., Pnevmatikakis, A., Polymenakos, L.: The AIT 3D audio / visual person tracker for CLEAR 2007. In: Multimodal Technologies for Perception of Humans, Proceedings of the International Evaluation Workshops CLEAR 2007 and RT 2007, LNCS, vol. 4625, pp. 35–46. Springer, Baltimore, MD, USA (2007) [53] Katznbach, J., Smith, D.: The Wisdom of Teams. Creating the High Performance Organisations. Harvard Business School Press, Cambridge, MA (1993) [54] Klee, U., Gehrig, T., McDonough, J.: Kalman filters for time delay of arrivalbased source localization. Journal of Advanced Signal Processing, Special Issue on Multi-Channel Speech Processing (2006) [55] Kray, C., Wasinger, R., Kortuem, G.: Concepts and issues in interfaces for multiple users and multiple devices. In: Proceedings of the Workshop on Multi-User and Ubiquitous User Interfaces (MU3I), IUI/CADUI (2004) [56] Kruger, R., Carpendale, M., Scott, S., Tang, A.: Fluid integration of rotation and translation. In: Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2005). Portland, Oregon (2005) [57] Kulyk, O., Wang, C., Terken, J.: Real-time feedback based on nonverbal behaviour to enhance social dynamics in small group meetings. In: MLMI’05: Proceedings of the Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms, LNCS, vol. 3869, pp. 150–161 (2006)
[58] Landabaso, J.L., M. Pardas, M.: Foreground regions extraction and characterization towards real-time object tracking. In: Machine Learning for Multimodal Interaction (MLMI), vol. 3869, pp. 241–249. Springer LNCS (2006) [59] Lanz, O.: Approximate Bayesian Multibody Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(9), 1436–1449 (2006) [60] Lanz, O., Brunelli, R.: Dynamic head location and pose from video. In: IEEE Conf. Multisensor Fusion and Integration (2006) [61] Lanz, O., Chippendale, P., Brunelli, R.: An appearance-based particle filter for visual tracking in smart rooms. In: Multimodal Technologies for Perception of Humans, Proceedings of the International Evaluation Workshops CLEAR 2007 and RT 2007, LNCS, vol. 4625, pp. 57–69. Springer, Baltimore, MD, USA (2007) [62] Leggetter, C.J., Woodland, P.C.: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language 9(2), 171–185 (1995) [63] Luque, J., Anguera, X., Temko, A., Hernando, J.: Speaker diarization for conference room: The UPC RT07s evaluation system. In: Multimodal Technologies for Perception of Humans, Proceedings of the International Evaluation Workshops CLEAR 2007 and RT 2007, LNCS, vol. 4625, pp. 543–554. Springer, Baltimore, MD, USA (2007) [64] Morris, M., Piper, A., Cassanego, A., Huang, A., Paepcke, A., Winograd, T.: Mediating group dynamics through tabletop interface design. IEEE Computer Graphics and Applications pp. 65–73 (2006) [65] M.Voit, R.Stiefelhagen: Tracking head pose and focus of attention with multiple far-field cameras. In: International Conference On Multimodal Interfaces - ICMI 2006. Banff, Canada (2006) [66] Nickel, K., Gehrig, T., Ekenel, H.K., McDonough, J., Stiefelhagen, R.: An audio-visual particle filter for speaker tracking on the CLEAR’06 evaluation dataset. In: Multimodal Technologies for Perception of Humans, Proceedings of the First International CLEAR Evaluation Workshop. Springer LNCS 4122, Southampton, UK (2006) [67] Nickel, K., Gehrig, T., Stiefelhagen, R., McDonough, J.: A Joint Particle Filter for Audio-visual Speaker Tracking. In: Proceedings of the Seventh International Conference On Multimodal Interfaces - ICMI 2005, pp. 61–68. ACM Press (2005) [68] The NIST MarkIII Microphone Array: http://www.nist.gov/ smartspace/mk3_presentation.html [69] Pianesi, F., Zancanaro, M., Lepri, B., Cappelletti, A.: A multimodal annotated corpus of consensus decision making meetings. The Journal of Language Resources and Evaluation 41(3–4) (2007) [70] Pianesi, F., Zancanaro, M., Not, E., Leonardi, C., Falcon, V., Lepri, B.: Multimodal support to group dynamics. Personal and Ubiquitous Computing 12(2) (2008)
[71] Povey, D., Woodland, P.: Improved discriminative training techniques for large vocabulary continuous speech recognition. In: Proc. Int. Conf. Acoustics Speech Signal Process. (ICASSP). Salt Lake City, UT (2001) [72] Povey, D., Woodland, P.C.: Minimum phone error and I-smoothing for improved discriminative training. In: Proc. Int. Conf. Acoustics Speech Signal Process. (ICASSP), pp. 105–108. Orlando, FL (2002) [73] Rentzeperis, E., Stergiou, A., Boukis, C., Pnevmatikakis, A., Polymenakos, L.C.: The 2006 Athens Information Technology speech activity detection and speaker diarization systems. In: Machine Learning for Multimodal Interaction, vol. 4299, pp. 385–395. LNCS (2006) [74] The Rich Transcription 2006 Spring Meeting Recognition Evaluation Website: http://www.nist.gov/speech/tests/rt/2006-spring [75] Rich Transcription 2007 Meeting Recognition Evaluation. http://www. nist.gov/speech/tests/rt/2007 [76] Schwenk, H.: Efficient training of large neural networks for language modeling. In: IJCNN, pp. 3059–3062 (2004) [77] Segura, C., Abad, A., Nadeu, C., Hernando, J.: Multispeaker localization and tracking in intelligent environments. In: Multimodal Technologies for Perception of Humans, Proceedings of the International Evaluation Workshops CLEAR 2007 and RT 2007, LNCS, vol. 4625, pp. 82–90. Springer, Baltimore, MD, USA (2007) [78] Sellen, A., Harper, R.: The Myth of the Paperless Office. MIT Press (2001) [79] Shen, C., Vernier, F., Forlines, C., Ringel, M.: Diamondspin: An extensible toolkit for around-the-table interaction. In: ACM Conference on Human Factors in Computing Systems (CHI) (2004) [80] Siciliano, C., Williams, G., Beskow, J., Faulkner, A.: Evaluation of a multilingual synthetic talking face as a communication aid for the hearing impaired. In: Proc of ICPhS, XV Intl Conference of Phonetic Sciences, pp. 131–134. Barcelona, Spain (2003) [81] Skantze, G., House, D., Edlund, J.: User responses to prosodic variation on fragmentary grounding utterances in dialogue. In: Proceedings Interspeech 2006, pp. 2002–2005. Pittsburgh, PA (2006) [82] SmarTrack - a SmarT people Tracker. Patent pending. Online at http:// tev.fbk.eu/smartrack/ [83] Soldatos, J., Dimakis, N., Stamatis, K., Polymenakos, L.: A Breadboard Architecture for Pervasive Context-Aware Services in Smart Spaces: Middleware Components and Prototype Applications. Personal and Ubiquitous Computing Journal 11(2), 193–212 (2007). URL http://www. springerlink.com/content/j14821834364128w/ [84] Stiefelhagen, R., Bernardin, K., Bowers, R., Garofolo, J., Mostefa, D., Soundararajan, P.: The CLEAR 2006 Evaluation. In: Multimodal Technologies for Perception of Humans, Proceedings of the First International CLEAR Evaluation Workshop, CLEAR 2006, no. 4122 in Springer LNCS, pp. 1–45. Southampton, UK (2006)
[85] Stiefelhagen, R., Bernardin, K., Bowers, R., Rose, R.T., Michel, M., Garofolo, J.: The CLEAR 2007 Evaluation. In: Multimodal Technologies for Perception of Humans, Proceedings of the International Evaluation Workshops CLEAR 2007 and RT 2007, LNCS, vol. 4625, pp. 3–34. Springer, Baltimore, MD, USA (2007) [86] Stiefelhagen, R., Bernardin, K., Ekenel, H., McDonough, J., Nickel, K., Voit, M., Woelfel, M.: Audio-visual perception of a lecturer in a smart seminar room. Signal Processing - Special Issue on Multimodal Interfaces 86(12) (2006) [87] Stiefelhagen, R., Bowers, R., Fiscus, J. (eds.): Multimodal Technologies for Perception of Humans, Proceedings of the International Evaluation Workshops CLEAR 2007 and RT 2007, LNCS, vol. 4625. Springer, Baltimore, MD, USA (2007) [88] Stiefelhagen, R., Garofolo, J. (eds.): Multimodal Technologies for Perception of Humans, First International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR’06. No. 4122 in LNCS. Springer, Southampton, UK (2006) [89] Sturm, J., van Herwijnen, O.H., Eyck, A., Terken, J.: Influencing social dynamics in meetings through a peripheral display. In: ICMI ’07: Proceedings of the 9th international conference on Multimodal interfaces, pp. 263–270. ACM, New York, NY, USA (2007) [90] Svanfeldt, G., Olszewski, D.: Perception experiment combining a parametric loudspeaker and a synthetic talking head. In: Proceedings of Interspeech, pp. 1721–1724 (2005) [91] Tang, J.C.: Finding from observational studies of collaborative work. International Journal of Man-Machine Studies 34(2), 143–160 (1991) [92] Tyagi, A., Potamianos, G., Davis, J.W., Chu, S.M.: Fusion of multiple camera views for kernel-based 3D tracking. In: Proc. IEEE Works. Motion and Video Computing (WMVC). Austin, Texas (2007) [93] VACE - Video Analysis and Content Extraction, http://iris.usc. edu/Outlines/vace/vace.html [94] Waibel, A., Stiefelhagen, R. (eds.): Computers in the Human Interaction Loop. Human-Computer Interaction. Springer (2009) [95] Wallers, Å., Edlund, J., Skantze, G.: The effects of prosodic features on the interpretation of synthesised backchannels. In: E. André, L. Dybkjaer, W. Minker, H. Neumann, M. Weber (eds.) Proceedings of Perception and Interactive Technologies, pp. 183–187. Springer, Kloster Irsee, Germany (2006) [96] Wojek, C., Nickel, K., Stiefelhagen, R.: Activity recognition and room level tracking in an office environment. In: IEEE Int. Conference on Multisensor Fusion and Integration for Intelligent Systems. Heidelberg, Germany (2006) [97] Wölfel, M.: Warped-twice minimum variance distortionless response spectral estimation. In: Proc. EUSIPCO (2006)
[98] Wölfel, M., McDonough, J.: Combining multi-source far distance speech recognition strategies: Beamforming, blind channel and confusion network combination. In: Proc. Interspeech (2005) [99] Zancanaro, M., Lepri, B., Pianesi, F.: Automatic detection of group functional roles in face to face interactions. In: Proceedings of the International Conference of Multimodal Interfaces ICMI-06 (2006) [100] Zhang, Z., Potamianos, G., Senior, A.W., Huang, T.S.: Joint face and head tracking inside multi-camera smart rooms. Signal, Image and Video Processing pp. 163–178 (2007) [101] Zhu, X., Barras, C., Lamel, L., Gauvain, J.L.: Speaker diarization: from Broadcast News to lectures. In: Machine Learning for Multimodal Interaction, vol. 4299, pp. 396–406. LNCS (2006) [102] Zhu, X., Barras, C., Lamel, L., Gauvain, J.L.: Multi-stage speaker diarization for conference and lecture meetings. In: Multimodal Technologies for Perception of Humans, Proceedings of the International Evaluation Workshops CLEAR 2007 and RT 2007, LNCS, vol. 4625, pp. 533–542. Springer, Baltimore, MD, USA (2007)
Part IX
Projects
Eye-based Direct Interaction for Environmental Control in Heterogeneous Smart Environments

Fulvio Corno, Alastair Gale, Päivi Majaranta, and Kari-Jouko Räihä

Fulvio Corno: Politecnico di Torino, Dip. di Automatica e Informatica, Corso Duca degli Abruzzi 24, 10129 Torino, Italy, e-mail: [email protected]
Alastair Gale: Applied Vision Research Centre, Loughborough University, Garendon Wing, Holywell Park, Loughborough LE11 3TU, UK, e-mail: [email protected]
Päivi Majaranta: Unit for Computer-Human Interaction (TAUCHI), Department of Computer Sciences, FI-33014 University of Tampere, Finland, e-mail: [email protected]
Kari-Jouko Räihä: Unit for Computer-Human Interaction (TAUCHI), Department of Computer Sciences, FI-33014 University of Tampere, Finland, e-mail: [email protected]
1 Introduction

Environmental control is the control, operation, and monitoring of an environment via intermediary technology such as a computer. Typically this means control of a domestic home. Within the scope of COGAIN, this environmental control concerns the control of the personal environment of a person (with or without a disability). This defines environmental control as the control of a home or domestic setting and those objects that are within that setting. Thus, we may say that environmental control systems enable anyone to operate a wide range of domestic appliances and other vital functions in the home by remote control.

In recent years the problem of self-sufficiency for older people and people with a disability has attracted increasing attention and resources. The search for new solutions that can guarantee greater autonomy and a better quality of life has begun to exploit easily available state-of-the-art technology. Personal environmental control can be considered to be a comprehensive and effective aid, adaptable to the functional possibilities of the user and to their desired actions. We may say that the main aim of a smart home environment is to:
Reduce the day-to-day home operation workload of the occupant
There are several terms currently used to describe domestic environmental control. A domestic environment that is controlled remotely is often referred to as a "Smart Home," "SmartHouse," "Intelligent Home" or a "Domotic Home," where "Domotics" is the application of computer and robot technologies to domestic appliances; it is a portmanteau word formed from domus (Latin, meaning home) and informatics. All of these terms may be used interchangeably. Within this chapter, which treats environmental control within the home as a form of assistive technology to aid users with disabilities, we will use the term "Smart Home" (unless a system referred to has a different name), as it encompasses all of these terms in a clear and understandable way and allows for levels of automation and possible artificial intelligence over and above simple direct environmental control of objects within the home.

Often the term "remotely controlled" gives the impression that the home is controlled from some other place outside of the home (perhaps from the workplace, for example). This can be the case, but within the terms of COGAIN, and within this chapter, "remotely controlled" means an object or function of the home that is controlled without the need to handle or touch that object. In these terms, the object may be right in front of a user, with that user controlling the object remotely via a computer screen rather than actually handling, lifting and manipulating the object: physical actions that the user may not be able to accomplish. Users with a physical disability may not be able to physically manipulate objects in their environment at all. An environmental control system thus moves from being a useful labour-saving device to a personal necessity for independent living, by enhancing and extending the abilities of a disabled user and allowing independence to be maintained. The environmental control system may be the sole way such a person can control their environment. Hence, extending the definition above, when the occupant has a physical disability, the aim of smart home systems is extended; it will:

Reduce the day-to-day home operation workload of the occupant and enable the occupant of the home to live as autonomously as possible
Such personal autonomy over their environment has the benefit of reducing the reliance on the continuous help of a carer and/or family member, and of increasing people's self-esteem as they can control the world around them. Typical targets are those items within the home that may be easily controlled, and which the user would often wish to control. Examples include:

• Lighting (switches, dimmers);
• Windows and blinds;
• Door lock;
• Intercom (microphone and button at the door as well as bell and loudspeaker in the apartment);
• Heating, Ventilating, and Air-Conditioning (HVAC);
• Home appliances (also called white goods: domestic equipment such as refrigerators, cookers, washing machines and central heating boilers);
• Brown goods (a name for audio-visual appliances and devices);
• Home security devices (e.g., burglar alarm, fire alarm);
• Telecommunication;
• Information access.
However, such a system is limited in that it requires the user to actively and deliberately control each device. This may become tiresome and cause fatigue over time. Many functions could be automated to save the user the effort of controlling the system themselves. Thus, an improvement to direct control is to add a level of "intelligence" to the system.

Electronic devices that contain processors, or that are computers able to communicate with other systems, are increasingly popular in ordinary homes. In order to address security and control, communications, leisure and comfort, environmental integration and accessibility, it is desirable to have a central control system to integrate and manage all the devices throughout a domestic home. It is therefore essential that this environmental control system can interact and work together with all the electronic devices within the home to benefit the users in the home. Most modern homes have appliances that allow some degree of remote control; environmental control aims to integrate and extend this control throughout the home. A home with an environmental control system installed might have many computers, perhaps built into the walls, allowing the homeowner to control applications in any part of their home from any other, or it may have a single simple computer operating an essential function such as emergency call and security. In addition to integration, intelligent systems are being developed. The great potential of these systems lies in their ability to apply computer programming to act in what could be called "intelligent" ways.
2 The COGAIN network

COGAIN (Communication by gaze interaction) is a Network of Excellence funded by the European Commission in 2004–2009. COGAIN has brought together researchers, users, user organisations, and eye tracking manufacturers to foster the development of solutions for users with severe motor disorders. Eventually the only means of communication for such users may be the use of their eyes, the modality that COGAIN focuses on. This can be caused by a number of diseases, including Amyotrophic Lateral Sclerosis; Motor Neurone Disease; Multiple Sclerosis; Cerebral Palsy; Spinal Cord Injury; Spinal Muscular Atrophy; Rett Syndrome; Stroke and Traumatic Brain Injury. The potential number of such users in the EU alone has been estimated at circa three million. In addition, other users who have some minor motor or mobility impairment could also use an eye-based system as an optional interface mechanism to augment existing controls.

COGAIN was launched with several goals in mind. The different stakeholders involved did not previously have a forum for meeting and working together, and the network has provided unprecedented opportunities for them all to share their
expertise. When COGAIN started, the application base available for the target user community was rather limited; a second goal of COGAIN was to radically increase both the number of applications and the application areas where eye gaze could be used. Third, the equipment needed for making use of eye gaze in interaction has been rather expensive, often prohibitively so for most people who could benefit from the use of eye trackers. Here COGAIN has worked in two ways: first, researchers have actively sought solutions that would enable the development of low-cost trackers; and second, COGAIN has tried to influence decision makers, conveying the information that eye tracking is mature, that it works, and that it is both a necessity and a critical life-improvement technology for certain users.

As the name of the network indicates, COGAIN started by focusing on communication in the traditional sense: that of producing messages to be delivered to other people, usually in written form. This area, eye typing or gaze typing, is the most established (actively studied and actively used) in the eye tracking field. COGAIN has collected this knowledge together and published recommendations and observations on how to make the installations work for individual users [10]. As the work of COGAIN evolved, its scope has expanded from utility applications on the desktop to other types of applications, such as games and entertainment, to supporting the mobility of users (for instance, users in wheelchairs), and to environmental control, the focus of this chapter.

The selection of desktop applications that exist today is fairly extensive. A major hindrance preventing full access to these solutions by all users is that they are often proprietary, working only on specific platforms. COGAIN has worked to overcome this problem in the desktop environment by proposing standards to enhance interoperability [1], but the take-up of the proposal has been slow outside the network. In environmental control the problems are an order of magnitude more challenging, because several technologies are involved, and there may be alternative interaction and control mechanisms in simultaneous use for them. The next section presents the approach COGAIN has taken to facilitate solutions that are flexible in the presence of multiple solution providers. Similarly, just as there are many alternative techniques for gaze-based interaction in the desktop environment (for instance, selection by dwell time, selection by eye gestures, or selection by switches), there are two fundamentally different approaches to environmental control by eye gaze: control of an interface on the computer screen, or direct control of the environment by embedding intelligence in the objects in the environment. The second main section of this chapter discusses these approaches, both of which have been investigated in COGAIN, in more detail.
3 Domotic Technology

Domotic systems, also known as home automation systems, have been available on the market for several years; however, only in the last few years have they started to spread to residential buildings, driven by the increasing availability of low-
cost devices and by new, emerging needs for home comfort, energy saving, security, communication and multimedia services.
3.1 Overview of Domotic Technologies

For the purposes of the current discussion, environmental control systems may be thought of as comprising five elements:

• The communication capabilities of the user – the physical capabilities of the user to convey their needs to the environmental control system (such as the ability to click a switch, or point with their eyes).
• The user interface to the system – which allows the user to control the environment by showing the state of the environment and receiving user control input for that environment (such as a computer screen showing a graphical representation of the environmental state).
• The central domotic system – which processes user commands and environmental states and adds integration and intelligence to the system, receiving input from the user and the environment and sending out environmental control signals (the central computer-based 'hub' of the system).
• The communications system – which carries signals between the central domotic system and the environmental actuators and sensors in the domestic environment (the wired or wireless transmission system that links the distributed actuators and sensors in the environment).
• The domestic environment – the real-world devices and functions controlled by the system (such as doors, heating systems, curtains, alarms).

A Smart Home for disabled people may include assistive devices (electromechanical/robotic systems for movement assistance), devices for health monitoring and special user interfaces; in the case of COGAIN it is a gaze-based interface. From the technical point of view, any complex smart home integrates several types of components, cooperating in a multi-layer architecture. A comprehensive picture summarizing the different layers and the main components in each layer is given in Figure 1. The four main layers, right to left in the figure, are: the Smart Environment layer; the Protocol, Network, Access interface layer; the Control application user interface layer; and the User interaction device layer. The following paragraphs specify in more detail the structure and the role of each of these layers and its internal devices/functions.
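Purely as an illustrative sketch, the layered decomposition of Figure 1 can be summarized by the interface each layer exposes to the one above it. The class and method names below are assumptions made for illustration and are not part of any COGAIN or manufacturer specification.

```python
from abc import ABC, abstractmethod
from typing import List

class SmartEnvironment(ABC):
    """Layer 1: plants, devices and appliances reachable over their native buses."""
    @abstractmethod
    def send_bus_message(self, raw: bytes) -> None: ...
    @abstractmethod
    def read_bus_messages(self) -> List[bytes]: ...

class AccessGateway(ABC):
    """Layer 2: hides bus protocols behind a common command/status vocabulary."""
    @abstractmethod
    def execute(self, device_id: str, command: str) -> None: ...
    @abstractmethod
    def status(self, device_id: str) -> str: ...

class ControlUI(ABC):
    """Layer 3: presents devices and scenarios, keeps its model in sync with the house."""
    @abstractmethod
    def render(self) -> None: ...
    @abstractmethod
    def on_status_change(self, device_id: str, new_status: str) -> None: ...

class InteractionDevice(ABC):
    """Layer 4: the user's input channel (eye tracker, switch, touch screen, ...)."""
    @abstractmethod
    def next_selection(self) -> str: ...
```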
Fig. 1 Generic Layers of a Domotic System

3.1.1 Smart Environment layer

This layer comprises the actual equipment composing the smart house. Such equipment is able to operate independently and autonomously, even without the additional layers (though in that case with limited or no programmability). The layer consists of zero or more domotic plants, plus zero or more appliances. In particular:

Domotic plants. Each plant is characterized by a specific technology. The technology comprises aspects such as: bus technology (wired, wireless, powerline, ...), bus protocol (usually proprietary), the types of supported devices, and the brands that produce such devices. The plant technology is either defined unilaterally by a single manufacturer for its products, or is the result of an industrial standard agreed among a group of manufacturers. For an overview of the available plant technologies see [2]. It is normally safe to assume that each domotic plant is equipped with at least one 'intelligent' interface gateway able to convert bus traffic (a low-level, proprietary issue) into some form of TCP/IP protocol. Such a device is not strictly needed, but all advanced plants should have one anyway.

Domotic devices. Such devices, attached to a plant, are strictly technology dependent. Each plant manufacturer (or group of manufacturers adhering to some standard) usually offers a wide range of devices: switches, buttons, relays, motors (for opening/closing doors, shutters, windows, ...), sensors (presence, intrusion, fire and smoke, temperature, ...), as well as a range of multimedia capabilities (ambient microphones, analog or digital cameras, loudspeakers, radio tuners, ...). Some plants are also integrated with the communication system in the house (telephone, intercom, ...) and sometimes with entertainment systems (TV sets, media centers, radio tuners, ...). The simplest devices (i.e., switches, relays, motor controls) offer similar functionality across all manufacturers and plants, even if the addressing and activation details differ. More complex devices tend to be more dissimilar across manufacturers, and are therefore more difficult to classify in a uniform manner and to integrate seamlessly into control interfaces.
Smart appliances. Because of the explosive growth of consumer electronics, enabled by digital media technologies that call for new ways of dealing with photographs, music, and video, 'smart' appliances are rapidly increasing both in number and in functionality. Such appliances often aim at covering the communication and multimedia needs of the users. For these kinds of devices, "Current end-to-end solutions that are based on proprietary vertical implementations bring products to market early but have little impact on rapidly establishing a new category of products. Moreover, end users do not have the opportunity to select parts of a system from different manufacturers because there is no interoperability between those non-standard devices. Thus industry leaders must define guidelines to enable an interoperable network of CE [Consumer Electronics], PC and mobile devices" [9]. This problem is starting to be understood by some of the major players in the consumer electronics industry, which have teamed up to define interoperability standards for media control and media delivery in the Digital Living Network Alliance (DLNA, http://www.dlna.org/). Such interoperability, even if backed by some industries, is not yet a reality. Further, in the context of COGAIN we want to achieve a still higher level of interoperability: that of home automation (and not just entertainment) systems, and that of accessible interfaces.
3.1.2 Protocol, Network, Access interface layer If one just wants to operate a smart home system with the functions and the user interface predefined by their manufacturer(s), no additional layer is needed. However, in order to be able to create and customize the user interface, a first requirement is being able to communicate with the smart home. We recall that at the Environment layers there are numerous incompatible protocols to handle. The protocol access layer should comprise the function to: • Interact with the domotic plant(s) according to its network protocol. This requires first knowledge of the details of such network protocols, which are often proprietary, or are available only commercially. Once the protocol is known, one must properly configure the network (we recall that we are assuming a TCP/IP—wired or wireless—infrastructure in the smart home) and properly authenticate with the system (home security requirements often impose secret or ‘weird’ authentication procedures). Once the connection is established, the protocol handler may issue the commands and enquiries (as requested by the upper layers) to the underlying domotic system. In the cases in which in the same house more than one domotic plant is installed (i.e., you have an anti-theft system of a different brand from your lightning and automation one), several protocols must be supported at the same time. • Interact with the various appliances present in the house. This requires studying, one by one, such appliances, checking if they support any standard protocol, or reverse-engineering any proprietary communication features they may pos1
This is a highly device-specific task that can be accomplished with varying degrees of success, depending mainly on the openness of the appliance designers. In general terms, we have found that it is often considerably easier to control a device (i.e., to issue commands that it will execute) than to observe it (i.e., to query it to determine its current status).
• Convert all the protocols to a common one, at a higher level. Ideally, the control application developer should not have to bother with all the details and issues of interacting with a complex smart house. It can therefore be useful to design a common protocol, encompassing all (or most) functions of the lower-level protocols, to be understood by a gateway. This component is the basis of the proposed COGAIN standard for accessing smart house environments, as it dramatically eases the work of application developers. The difficulty in the gateway component lies in the complexity and diversity of the information it should exchange, and in the varying levels of detail and functionality offered by different protocols on different topics.
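The conversion of heterogeneous plant protocols into a single higher-level one can be pictured as a thin driver abstraction sitting behind a gateway. The sketch below is illustrative only: the DomoticDriver and CommonGateway names and the plant:device addressing scheme are assumptions made for the example, not the COGAIN or DOG API, and a real gateway would also have to handle authentication, connection loss and asynchronous status events.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction of one plant-specific protocol handler.
interface DomoticDriver {
    void connect() throws Exception;                    // open TCP/IP session, authenticate
    void sendCommand(String deviceAddress, String command) throws Exception;
    String queryStatus(String deviceAddress) throws Exception;
}

// Hypothetical gateway exposing a single common protocol to the upper layers.
// Device identifiers are prefixed with the plant they belong to, e.g. "knx:light.kitchen".
class CommonGateway {
    private final Map<String, DomoticDriver> driversByPlant = new HashMap<>();

    void registerDriver(String plantId, DomoticDriver driver) throws Exception {
        driver.connect();
        driversByPlant.put(plantId, driver);
    }

    // Control applications see only this uniform call, regardless of the plant technology.
    void execute(String qualifiedAddress, String command) throws Exception {
        String[] parts = qualifiedAddress.split(":", 2);   // "plant:device"
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected plant:device, got " + qualifiedAddress);
        }
        DomoticDriver driver = driversByPlant.get(parts[0]);
        if (driver == null) {
            throw new IllegalArgumentException("No driver for plant " + parts[0]);
        }
        driver.sendCommand(parts[1], command);
    }
}
```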
3.1.3 Control application user interface layer
Assuming that the previous layer (which may be embedded into the same executable application or may reside on a different server) has resolved the connection and protocol issues, we may build control applications for all the intelligent devices and appliances available in the house. Such an application, relieved of low-level control issues, can concentrate on user interaction issues, such as navigation, layout, and graphics. The control interface for environmental control systems is peculiar in at least three ways that will be explored further in the remainder of this chapter:
• Asynchronous additional control sources. The control interface is not the sole source of commands: the house occupants may choose to operate wall-mounted switches, some external event may change the status of a sensor, or the current music track may simply end. In other words, the control interface needs to continuously update its knowledge about the status of the house (the 'model' in the model-view-controller pattern; a minimal sketch of this update path is given after this list). Icons and menu labels must change as the status evolves, so that the control interface offers a coherent view of the environment.
• Time-sensitive behaviour, including safety issues. Some actions are time dependent, such as the time available to react to an alarm condition. In such cases the user is put in a stressful situation, since they have limited time to take important decisions, which may pose threats to their safety. Here the control interface must offer very simple and clear options, easy to select, and must be able to take a correct (i.e., the safest) action in case the user cannot answer in time. Another example is actions 'initiated by the house' (meaning that the rules programmed in the intelligent devices have just triggered some preprogrammed action, such as closing the shutters at nighttime). In this case the user should be allowed to interrupt the automatic action and to override it. The control interface should make the user aware that an automatic action has been
initiated, and offer ways to interrupt it (unobtrusively with respect to the current user interaction status).
• Trade-off between structural and functional views. From the Information Architecture point of view, there are at least two possible hierarchies by which to organize the user interface: structural and functional. The structural hierarchy is the one adopted by most environmental control interfaces, and follows the physical organization of the devices in the house. Successive menu levels represent floors, rooms, devices within the room, and possible actions on those devices. While natural, since it exploits the user's knowledge of the house structure, this choice leaves out some important actions: where in the hierarchy should one put 'global' or 'un-localized' actions such as switching the anti-theft system on, connecting to a news bulletin, or uploading personal health data? Often some 'Other,' 'Scenarios,' 'Global' or similarly named menu plays the catch-all role for these situations. For this reason a functional hierarchy may also be adopted, where actions are grouped according to their nature rather than by their location.
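Because commands can also originate from wall switches or from rules running inside the plant, the interface has to treat the house as an asynchronous event source. The following minimal sketch of that model-update path assumes a listener-style API; the StatusEvent and HouseStatusListener types are invented for illustration and are not taken from any specific product.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative event carrying a device identifier and its new state.
record StatusEvent(String deviceId, String newState) {}

interface HouseStatusListener {
    void onStatusChanged(StatusEvent event);
}

// The 'model' of the model-view-controller pattern: a cache of device states
// that is refreshed by events and read by the view when redrawing icons and menus.
class HouseModel implements HouseStatusListener {
    private final Map<String, String> states = new ConcurrentHashMap<>();

    @Override
    public void onStatusChanged(StatusEvent event) {
        states.put(event.deviceId(), event.newState());
        // In a real interface this would also notify the view, so that the icon for
        // the device is redrawn (e.g. a lamp that was switched on at the wall).
    }

    public String currentState(String deviceId) {
        return states.getOrDefault(deviceId, "unknown");
    }
}
```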
3.1.4 User interaction device layer
This final layer represents the actual devices (as simple as a mouse or as complex as an eye tracker) that the user adopts to interact with the control application user interface. These are usually standard devices or systems that materialize the tangible part of user interaction. The diversity of these devices, coupled with the desire and need to let the user select their preferred interaction method, poses additional challenges to the user interface, which must accommodate different screen resolutions, different pointing precisions, different modalities (e.g., audio instead of graphics), and so on. Interaction between devices and the control interface uses the modalities supported by the device and by the underlying operating system.
3.2 Architecture for Intelligent Domotic Environment
Current domotic solutions suffer from two main drawbacks: they are produced and distributed by various electric component manufacturers, each having different functional goals and marketing policies; and they are mainly designed as an evolution of traditional electric components (such as switches and relays), and are thus unable to natively provide intelligence beyond simple automation scenarios. The first drawback causes interoperation problems that prevent different domotic plants or components from interacting with each other, unless specific gateways or adapters are used. While this was acceptable in the first evolution phase, when installations were few and isolated, it now becomes a matter of concern as many large buildings such as hospitals, hotels and universities mix different domotic components, possibly realized with different technologies, and need to coordinate them as a single system.
On the other hand, the roots of domotic systems in simple electric automation prevent them from satisfying the current requirements of home inhabitants, who are becoming more and more accustomed to technology and require more complex interaction possibilities. In the literature, solutions to these issues usually propose smart homes [8, 7, 20], i.e., homes pervaded by sensors and actuators and equipped with dedicated hardware and software tools that implement intelligent behaviors. Smart homes have been actively researched since the late 1990s, pursuing a revolutionary approach to the home concept, from the design phase to the final deployment. The costs involved are very high and have so far prevented a real diffusion of such systems, which still retain an experimental and futuristic connotation. The approach proposed in this chapter lies somewhat outside the smart home concept, and is based on extending current domotic systems by adding hardware devices and software agents that support interoperation and intelligence. Our solution takes an evolutionary approach, in which commercial domotic systems are extended with a low-cost device (an embedded PC) allowing interoperation and supporting more sophisticated automation scenarios. In this case, the domotic system in the home evolves into a more powerful integrated system that we call an Intelligent Domotic Environment (IDE). IDEs promise to achieve intelligent behaviors comparable to smart homes, at a fraction of the cost, by reusing and exploiting available technology, and by providing solutions that may be deployed even today. Most solutions rely on a hardware component called a residential [6] or home gateway [15], originally conceived for providing Internet connectivity to the smart appliances available in a given home. In our approach, this component evolves into an interoperation system, called DOG (Domotic OSGi Gateway), whose connectivity and computational capabilities are exploited to bridge, integrate and coordinate different domotic networks, which thus become able to learn user habits, to provide automatic and proactive security, and to implement comfort and energy saving policies using low-cost, commercially available technologies. An Intelligent Domotic Environment (IDE, Figure 2) is usually composed of one (in which case interoperation may not be needed, but intelligence still needs to be supported) or more domotic systems, a variable set of (smart) home appliances, and a Home Gateway that supports the implementation of interoperation policies and provides intelligent behaviors. Domotic systems usually include domotic devices such as plugs, lights, door and shutter actuators, etc., and a so-called network-level gateway that allows low-level protocol messages to be tunnelled over more versatile, application-independent interconnection technologies, e.g., Ethernet. These gateways are not suitable for implementing the features needed by IDEs, as they have reduced computational power and are usually closed, i.e., they cannot be programmed to provide more than factory-default functionalities. However, they play a significant role in an IDE architecture as they offer an easy-to-exploit access point to domotic systems. The Home Gateway is the key component for achieving interoperation and intelligence in IDEs; it is designed to respond to different requirements, ranging from simple bridging of network-specific protocols to complex interaction support.
Fig. 2 Intelligent Domotic Environment.
These requirements have been grouped into three priority levels (Table 1): priority 1 requirements include all the features needed to control different domotic systems using a single, high-level communication protocol and a single access point; priority 2 requirements cover all the functionalities needed for defining inter-network automation scenarios and for allowing inter-network control, e.g., enabling a Konnex switch to control an OpenWebNet light; and priority 3 requirements are related to intelligent behaviors, to user modeling and to adaptation. A domotic home equipped with a home gateway is said to be an Intelligent Domotic Environment if the gateway satisfies at least the priority 1 and priority 2 requirements. Priority 3 requirements can be considered advanced functionalities and may impose tighter constraints on the gateway, concerning both the software architecture and the computational power. While priority 1 requirements basically deal with architectural and protocol issues, the requirements listed in priorities 2 and 3 imply the adoption of sophisticated modeling techniques. In fact, the cornerstone of intelligent behaviors and applications is a suitable house modeling methodology and language (R2.1). In DOG, we chose to adopt a semantic approach, relying on technologies developed in the Semantic Web community, to solve this knowledge representation problem and enable DOG to host or support intelligent applications.
Table 1 Requirements for Home Gateways in IDEs.

Priority R1 Interoperability
  R1.1 Domotic network connection: Interconnection of several domotic networks.
  R1.2 Basic interoperability: Translation / forwarding of messages across different networks.
  R1.3 High level network protocol: Technology independent, high-level network protocol for allowing neutral access to domotic networks.
  R1.4 API: Public API to allow external services to easily interact with home devices.

Priority R2 Automation
  R2.1 Modeling: Abstract models to describe the house devices, their states and functionalities, to support effective user interaction and to provide the basis for home intelligence.
  R2.2 Complex scenarios: Ability to define and operate scenarios involving different networks / components.

Priority R3 Intelligence
  R3.1 Offline Intelligence: Ability to detect misconfigurations, structural problems, security issues, etc.
  R3.2 Online Intelligence: Ability to implement runtime policies such as energy saving or fire prevention.
  R3.3 Adaptation: Learning of frequent interaction patterns to ease users' everyday activities.
  R3.4 Context based Intelligence: Proactive behavior driven by the current house state and context, aimed at reaching specific goals such as safety, energy saving, and robustness to failures.
3.3 The DOG Domotic OSGi Gateway
The core component of an IDE is the software running on the additional embedded PC (either a Residential or a Home Gateway [23, 15]). In particular, in the COGAIN network, the DOG gateway has been studied for environmental control [4]. DOG (Domotic OSGi Gateway) is a domotic gateway designed to transform new or existing domotic installations into IDEs by fulfilling the requirements defined in Section 3.2. Its design principles include versatility, addressed through the adoption of an OSGi-based architecture; advanced intelligence support, tackled by formally modeling the home environment and by defining suitable reasoning mechanisms; and accessibility to external applications, through a well-defined, standard API also available through an XML-RPC [26] interface. OSGi [22] is a universal middleware that provides a service-oriented, component-based environment for developers and offers standardized ways to manage the software life cycle. It provides a general-purpose, secure, and managed framework that supports the deployment of extensible service applications known as bundles. DOG exploits OSGi as a coordination framework for supporting dynamic module activation, hot-plugging of new components and reaction to module failures. Such basic features are integrated with an ontology model of domotic systems and
surrounding environments, named DogOnt [3]. The combination of DOG and DogOnt supports the evolution of domotic systems into IDEs by providing the means to integrate different domotic systems, to implement inter-network automation scenarios, to support logic-based intelligence and to access domotic systems through a neutral interface. Cost and flexibility concerns played a significant part in the platform design, and we propose an open-source solution [5] capable of running on low-cost hardware such as an ASUS eeePC 701.
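As an illustration of how an external, non-OSGi application might reach the gateway through an XML-RPC interface, the sketch below uses the Apache XML-RPC client library as one possible option; the endpoint URL and the method name DomoticAPI.execCommand, together with its parameters, are assumptions made for the example rather than the documented DOG API.

```java
import java.net.URL;
import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

public class DogXmlRpcExample {
    public static void main(String[] args) throws Exception {
        // Assumed location of the gateway's XML-RPC endpoint.
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        config.setServerURL(new URL("http://localhost:8080/xmlrpc"));

        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);

        // Hypothetical remote method: send the 'on' command to a kitchen lamp.
        Object result = client.execute("DomoticAPI.execCommand",
                new Object[] { "KitchenLamp", "on" });
        System.out.println("Gateway replied: " + result);
    }
}
```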
3.3.1 DOG Architecture
DOG is organized in a layered architecture with 4 rings, each dealing with different tasks and goals, ranging from low-level interconnection issues to high-level modeling and interfacing (Figure 3). Each ring includes several OSGi bundles, corresponding to the functional modules of the platform. Ring 0 includes the DOG common library and the bundles necessary to control and manage interactions between the OSGi platform and the other DOG bundles. At this level, system events related to runtime configurations, errors or failures are generated and forwarded to the entire DOG platform. Ring 1 encompasses the DOG bundles that provide an interface to the various domotic networks to which DOG can be connected. Each network technology is managed by a dedicated driver, similar to device drivers in operating systems, which abstracts network-specific protocols into a common, high-level representation that allows different devices to be driven uniformly (thus satisfying requirement R1.1). Ring 2 provides the routing infrastructure for messages travelling across network drivers and directed to DOG bundles. Ring 2 also hosts the core intelligence of DOG, based on the DogOnt ontology, which is implemented in the House Model bundle (R1.2, R1.3, R2.1 and, partially, R2.2). Ring 3 hosts the DOG bundles offering access to external applications, either by means of an API bundle, for OSGi applications, or by an XML-RPC endpoint for applications based on other technologies (R1.4).
Fig. 3 DOG architecture (OSGi bundles including the DOG Library, Platform Manager, Configuration Registry, Status, network drivers for e.g. Konnex and MyHome, Message Dispatcher, Executor, House Model, Simulator, API and XML-RPC, arranged in Rings 0 to 3).
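Since every functional module of DOG is packaged as an OSGi bundle, a Ring 1 network driver could be structured as a bundle that registers a driver service with the framework. The sketch below uses the standard OSGi BundleActivator API; the NetworkDriver interface and the KnxDriver class are hypothetical stand-ins, not the actual DOG driver contracts.

```java
import java.util.Hashtable;
import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceRegistration;

// Hypothetical common contract that Ring 1 drivers expose to the rest of the platform.
interface NetworkDriver {
    void send(String deviceId, String command);
}

// Hypothetical Konnex/KNX driver implementation.
class KnxDriver implements NetworkDriver {
    public void send(String deviceId, String command) {
        // Translate the abstract command into network-specific telegrams here.
    }
}

public class KnxDriverActivator implements BundleActivator {
    private ServiceRegistration<NetworkDriver> registration;

    @Override
    public void start(BundleContext context) {
        Hashtable<String, Object> props = new Hashtable<>();
        props.put("network.technology", "knx");   // lets the dispatcher pick the right driver
        registration = context.registerService(NetworkDriver.class, new KnxDriver(), props);
    }

    @Override
    public void stop(BundleContext context) {
        registration.unregister();
    }
}
```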
3.3.2 Modeling in DOG
Modeling the house structure, its domotic components, their states and functionalities is achieved by defining a custom ontology, called DogOnt [3]. According to Gruber's classical definition [13], an ontology is an "explicit specification of a conceptualization," which is, in turn, "the objects, concepts, and other entities that are presumed to exist in some area of interest and the relationships that hold among them". Today's W3C Semantic Web standard suggests a specific formalism for encoding ontologies (OWL), in several variants that vary in expressive power [19]. DogOnt is an OWL meta-model for the domotics domain describing where a domotic device is located, the set of its capabilities, the technology-specific features needed to interface it, and the possible configurations it can assume. Additionally, it models how the home environment is composed and what kind of architectural elements and furniture are placed inside the home. It is organized along 5 main hierarchy trees (Figure 4), including: Building Thing, modeling available things (either controllable or not); Building Environment, modeling where things are located; State, modeling the stable configurations that controllable things can assume; Functionality, modeling what controllable things can do; and Domotic Network Component, modeling the peculiar features of each domotic plant (or network).
Fig. 4 An overview of the DogOnt ontology (main hierarchy trees: Building Thing with Controllable and Uncontrollable things; Building Environment with e.g. Room, Garage and Garden; Functionality with Control, Query and Notification functionalities; State with Continuous and Discrete states; and Domotic Network Component with e.g. BTicino and Konnex components).
The BuildingThing tree subsumes the Controllable concept and its descendants, which are used to model devices belonging to domotic systems or that can be controlled by them. Devices are described in terms of capabilities (Functionality concept) and possible configurations (State concept). Functionalities are mainly divided into Continuous and Discrete, the former describing capabilities that can be varied continuously, the latter referring to the ability to change device configurations in a discontinuous manner, e.g., to switch on a light. In addition, they are also categorized according to their goal, i.e., whether they allow controlling a device (Control Functionality), querying a device condition (Query Functionality) or notifying a condition change (Notification Functionality). Each functionality instance defines the set of associated commands
and, for continuous functionalities, the range of allowed values, thus enabling runtime validation of commands issued to devices. Devices also possess a state instance deriving from a State subclass, which describes the stable configurations that a device can assume. Each State class defines the set of allowed state values; states, like functionalities, are divided into Continuous and Discrete. DOG uses the DogOnt ontology for implementing several functionalities: command validation at run-time, using the information encoded in functionalities; stateful operation, using the state instances associated with each device; and device abstraction, leveraging the hierarchy of classes in the Controllable subtree. The last operation, in particular, allows unknown devices to be handled by treating them as a more generic type, e.g., a dimmer lamp can be controlled as a simple on/off lamp. Ontology instances modeling controlled environments are created off-line by means of proper editing tools, some of which are currently being designed by the authors, and may leverage auto-discovery facilities provided by the domotic systems interfaced by DOG.
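The run-time command validation that DOG derives from DogOnt can be pictured as checking each incoming command against the functionality metadata attached to a device. The classes below are a simplified, hypothetical rendering of that idea rather than DogOnt classes: discrete functionalities carry an allowed command set, continuous ones an allowed value range, and a command is accepted if any declared functionality admits it.

```java
import java.util.Set;

// Simplified stand-ins for functionality descriptions extracted from the ontology.
abstract class Functionality {
    abstract boolean accepts(String command, Double value);
}

class DiscreteFunctionality extends Functionality {
    private final Set<String> allowedCommands;
    DiscreteFunctionality(Set<String> allowedCommands) { this.allowedCommands = allowedCommands; }
    boolean accepts(String command, Double value) { return allowedCommands.contains(command); }
}

class ContinuousFunctionality extends Functionality {
    private final String command;
    private final double min, max;
    ContinuousFunctionality(String command, double min, double max) {
        this.command = command; this.min = min; this.max = max;
    }
    boolean accepts(String cmd, Double value) {
        return command.equals(cmd) && value != null && value >= min && value <= max;
    }
}

class DeviceValidator {
    // A dimmer lamp could be modelled with both an on/off and a dimming functionality:
    // validation succeeds if any declared functionality accepts the command.
    boolean validate(Iterable<Functionality> functionalities, String command, Double value) {
        for (Functionality f : functionalities) {
            if (f.accepts(command, value)) return true;
        }
        return false;
    }
}
```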
4 Eye-based Environmental Control
An intelligent domotic environment is of particular importance for an individual with any form of restricted mobility or other disability, which can arise as a result of disease or aging. The ability to operate environmental systems and devices directly, without either having to move physically to the device location and interact with it, or having to communicate to a carer the wish to have the device operated on their behalf, is important in achieving independent living. For a disabled user to initiate a device operation there exists a wide range of user interface mechanisms (e.g., head movement switch, sip/puff switch) which can be tailored to the individual's needs and physical abilities. Usually the requirements of the individual are first assessed by a rehabilitation professional, and an appropriate interface or range of interfaces for that person is then determined and implemented. Follow-up of the individual over time then ascertains whether such interfaces remain appropriate. For some disabled users operating environmental systems by their eye movements is one optional control mechanism among several; for others it is the only control mechanism available to them.
4.1 Overview of Eye Based Environmental Control
For some deteriorating health conditions the voluntary control of eye movements is the last controllable physical movement available. Such individuals can benefit from eye-operated assistive technology as part of a gaze-operated home automation system. A wide range of diseases and conditions can leave an individual with very restricted motor ability, which limits their use of other
interfaces, and so eye-based systems can be useful. The background to eye-based control lies in the fact that humans exhibit a wide range of different types of eye movements which are usually not consciously controlled. However, saccadic eye movements can be voluntarily controlled and can therefore consciously and purposefully shift eye gaze, and hence direct visual attention, to particular areas of the environment. These movements have long been studied to understand perceptual and cognitive aspects of numerous activities such as reading, examining pictorial displays or vehicle driving behaviour. In considering such saccadic movement behaviour, vision is construed as consisting of an alternating stream of very fast saccades (when little or no visual input takes place) interspersed with eye fixations of differing lengths, during which eye gaze is almost stationary (although there are still micro-movements of the eye) at some spatial position. Usually the length of an eye fixation is indicative of the degree of visual processing taking place at that location, coupled with determining where to fixate next in the environment. Because eye gaze can link directly to visual attention, it is feasible to use saccadic eye movements as an HCI control movement. However, eye gaze direction does not always relate to visual attention; this has been termed the Midas touch problem [17], and it poses a particular difficulty when eye gaze is used to make a visual selection of a device control operation. Consequently eye-based systems usually need some additional technique to indicate that an actual selection is required [16]. Virtually all systems that record saccadic eye movements are based on trying to determine the user's point of gaze as accurately as possible and then using this measured gaze location to indicate a particular selection. More recently, research has begun to investigate using the direction of the saccadic movements themselves and not just the gaze location [14]. Numerous commercial eye tracking systems exist which can be used easily by a disabled user. However, the cost of such systems remains comparatively high, which limits their widespread adoption even though the technique has wide applicability. Research, particularly that supported through the COGAIN network, is gradually leading to a range of cost-effective and affordable technological options (see www.cogain.org). Eye trackers for assistive technology usage typically fall into two camps: head mounted on the user, or remote, not attached to the user but positioned in front of them. Head-mounted systems are small, lightweight, inconspicuous and allow the user to move their head freely without any loss of eye data recording. Remote systems, on the other hand, are generally mounted in front of a user and record where the user is looking within a fixed range of measurement, generally circa 20-30° of visual angle. These are ideally suited to a user interacting with a display positioned in front of them if the user is relatively immobile, or else they can be mounted on a user's wheelchair. With either type of system, environmental devices can be represented on a monitor in front of the user as a symbol or in text, and gazing at this can bring up a menu from which the user selects the particular action they require. The advantages of such eye-based approaches over other user control selection systems are that they can provide quick and effective direct control, and that they can be fairly intuitive in nature.
In terms of limitations, some systems can be difficult
to set up for a particular disabled user, but research developments are increasingly making them very user-friendly (e.g., MyTobii).
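The dwell-time technique mentioned above (requiring gaze to remain within a small region for a minimum time before a selection is made) is one common answer to the Midas touch problem. The sketch below shows the idea under assumed parameter values, here a 1000 ms dwell threshold and a 40-pixel dispersion radius; real systems tune such values per user and per tracker.

```java
// Minimal dwell-time selector: feed it gaze samples (x, y, timestampMillis)
// and it reports a selection once gaze has stayed within a small radius long enough.
public class DwellSelector {
    private static final long DWELL_MS = 1000;     // assumed dwell threshold
    private static final double RADIUS_PX = 40.0;  // assumed dispersion radius

    private double anchorX, anchorY;
    private long anchorTime = -1;

    /** Returns true when the current sample completes a dwell-based selection. */
    public boolean onGazeSample(double x, double y, long timestampMillis) {
        if (anchorTime < 0 || Math.hypot(x - anchorX, y - anchorY) > RADIUS_PX) {
            // Gaze moved away: start a new candidate fixation here.
            anchorX = x;
            anchorY = y;
            anchorTime = timestampMillis;
            return false;
        }
        if (timestampMillis - anchorTime >= DWELL_MS) {
            anchorTime = -1;   // reset so the same fixation does not fire repeatedly
            return true;
        }
        return false;
    }
}
```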
4.2 The ART system
One innovative approach to eye-based interaction control is illustrated by the ART (Attention Responsive Technology) system [12, 24]. Its overall aim is to allow the user to operate environmental systems simply by looking directly at a particular device, which then gives rise to device operation. The concept obviates the need for the user to first select the device they wish to operate from a displayed menu of potential devices. It was also a design requirement that the user would be mobile, for instance in an electric wheelchair, and so could approach different real-world devices in their environment from various directions. To ensure the final system was fit for purpose, input from potential end users was gathered at key developmental stages. The system was developed using a head-mounted and a remote eye movement monitoring system separately, to illustrate that either approach can be implemented and that both are feasible for a disabled user. Each system is coupled to a digital video camera which monitors the environment in front of the user (either a miniature camera mounted on a headband, if a head-mounted eye movement system is used, or a camera mounted directly on a wheelchair). Firstly, environmental devices which the user may wish to operate are imaged from various points of view by this camera, and a SIFT algorithm [18, 25] is then run to extract image-processed SIFT features representative of each device, which are stored in a database. Additional or new devices can be added easily to the system by imaging them in a similar manner, and their SIFT features are automatically added to the database. Control operations for each device are also added, so that when a device is recognised by the ART system the appropriate control options can be presented to the user. A user is first calibrated for the particular eye movement system in use; the system then works autonomously by constantly monitoring the user's eye movements and recognising whenever they are steadily gazing, according to user-defined fixation length criteria, at some point in their environment. The coordinates of where they are gazing are then calculated and related to the video camera image. A SIFT algorithm is then applied which analyses the camera image data around the point of gaze and determines whether a known device is actually being gazed at. If it is, an interface specific to that device alone is offered to the user, who can decide whether to operate it; this overcomes any accidental device operation. Figure 5 shows a flow diagram of the system. The actual operation of the interface itself can again be by eye control. The approach has the advantage of not needing to present the user with an initial menu of all available devices, instead offering direct selection through eye gaze at the device itself.
Fig. 5 Operational flow chart of the ART system.
In the developmental system several dedicated device interfaces have been constructed so that they can be presented to the user on a small monitor, but they could equally well be projected as an overlay onto, or beside, the actual device itself, or presented in any other manner (e.g., audio) and format that the user could respond to efficiently. There is no real requirement for the user to be positioned in front of a monitor. Two system parameters are used to prevent false operation of a device simply because the user's gaze is recorded as falling upon it. Firstly, the user must gaze at a device for a pre-determined criterion time; this allows the software to identify the device in the scene camera image from the database of SIFT features, and prevents the ART system from trying to recognise other potential objects. Secondly, the user's eye gaze does not (of itself) initiate device operation, but instead initiates the presentation of a dedicated interface just for that device. This permits a check on whether or not the user does in fact wish to operate the device. Figure 6 shows the developmental system in a laboratory with a user sitting in an electric wheelchair. The camera beside the user's head monitors the environment in front of them. In this set-up the user's gaze is monitored by a Smarteye eye movement system (www.smarteye.se), which uses three small cameras coupled with
infrared light sources. For actual implementation, both the eye movement system and the environmental monitoring camera are mounted on the wheelchair itself.
Fig. 6 Laboratory set up with user in a wheelchair
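Putting the pieces of the ART pipeline together, a dwell criterion, a crop of the scene image around the point of gaze and a lookup against the stored SIFT features, gives a control loop roughly like the one below. The FeatureMatcher interface hides the actual SIFT matching (which would come from an image-processing library), and all class names and the crop size are illustrative assumptions, not the ART implementation.

```java
import java.awt.image.BufferedImage;
import java.util.Optional;

// Abstraction over the SIFT-based lookup against the device feature database.
interface FeatureMatcher {
    Optional<String> matchDevice(BufferedImage region);   // returns a device id if recognised
}

interface DeviceInterfacePresenter {
    void offerInterfaceFor(String deviceId);               // show the device-specific dialog
}

class ArtController {
    private static final int REGION = 200;                 // assumed crop size around gaze, in pixels

    private final FeatureMatcher matcher;
    private final DeviceInterfacePresenter presenter;

    ArtController(FeatureMatcher matcher, DeviceInterfacePresenter presenter) {
        this.matcher = matcher;
        this.presenter = presenter;
    }

    /** Called whenever the dwell criterion is met at scene-image coordinates (gx, gy). */
    void onSteadyGaze(BufferedImage sceneFrame, int gx, int gy) {
        int x = Math.max(0, gx - REGION / 2);
        int y = Math.max(0, gy - REGION / 2);
        int w = Math.min(REGION, sceneFrame.getWidth() - x);
        int h = Math.min(REGION, sceneFrame.getHeight() - y);
        BufferedImage region = sceneFrame.getSubimage(x, y, w, h);

        // Only a confirmed match brings up an interface; gazing alone never operates a device.
        matcher.matchDevice(region).ifPresent(presenter::offerInterfaceFor);
    }
}
```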
4.3 Eye Based Environmental Control and Communication Systems
The ART system is currently designed solely to facilitate direct operation of environmental devices. One next step is to incorporate other user requirements into its design. The approach represents just one way of utilising eye gaze as an assistive technology, and it deliberately set out largely to free the user from being tied to their computer monitor. Other approaches to using eye movements assistively, however, build on exactly that monitor-centred interaction: in terms of recording eye gaze it is an easier proposition to have the user located directly in front of a monitor. For this scenario different eye tracking techniques have been developed, as have different software packages which facilitate communication, work, game playing and other leisure activities. Monitoring the eye movements of a user seated in a wheelchair can be used to provide real-time precise steering and guidance [11], as well as for planning movement around an environment before actually
moving, coupled with multisensory collision avoidance systems to prevent accidents [21]. Whilst product development needs require that research targets specialised usages, there is an overall need for eye-based control to interface holistically with the everyday living requirements of a disabled or elderly user, whether for environmental control or for communication. Enabling the user to move around their environment, as well as to control it, is yet another exciting step.
5 Conclusions
This chapter provided an overview of the opportunities offered by domotic technologies and by eye tracking systems. Domotics is a new and promising way of implementing ambient intelligence and of enabling ambient assisted living for elderly or disabled people. Eye tracking technology is essential for supporting mobility-impaired people and for enabling them to use a computer system. The chapter presented the latest advances within the COGAIN project, where environmental control and eye tracking are coupled to provide a comprehensive system allowing users to remotely control all aspects of their home, by exploiting natural and intuitive gaze-based interactions.
Acknowledgements This work was supported by the European Commission Project 511598, Network of Excellence on Communication by Gaze Interaction (COGAIN). The ART project is supported by EPSRC grant EP/F008937/1. The authors also wish to acknowledge the significant contribution of Richard Bates, Emiliano Castellina and Dario Bonino to the results described in this chapter.
References
[1] Bates R, Spakov O (2006) D2.3 Implementation of COGAIN gaze tracking standards. Deliverable, Communication by Gaze Interaction (COGAIN), IST-2003-511598, URL http://www.cogain.org/results/reports/COGAIN-D2.3.pdf
[2] Bates R, Daunys G, Villanueva A, Castellina E, Hong G, Istance H, Gale A, Lauruska V, Spakov O, Majaranta P (2006) D2.4 A survey of existing 'de-facto' standards and systems of environmental control. Deliverable, Communication by Gaze Interaction (COGAIN), IST-2003-511598, URL http://www.cogain.org/results/reports/COGAIN-D2.4.pdf
[3] Bonino D, Corno F (2008) DogOnt - ontology modeling for intelligent domotic environments. In: 7th International Semantic Web Conference, October 26-30, 2008, Karlsruhe, Germany
[4] Bonino D, Castellina E, Corno F (2008) DOG: an ontology-powered OSGi domotic gateway. In: 20th IEEE Int'l Conference on Tools with Artificial Intelligence, IEEE, USA
[5] Bonino D, Castellina E, Corno F (2008) The DOG Domotic OSGi Gateway. http://domoticdog.sourceforge.net
[6] Chung SL, Chen WY (2007) MyHome: A Residential Server for Smart Homes. Knowledge-Based Intelligent Information and Engineering Systems 4693/2007:664-670
[7] Cook DJ, Youngblood M, Heierman EO III, Gopalratnam K, Rao S, Litvin A, Khawaja F (2003) MavHome: An agent-based smart home. In: PERCOM '03: Proceedings of the First IEEE International Conference on Pervasive Computing and Communications, IEEE Computer Society, Washington, DC, USA, p 521
[8] Davidoff S, Lee MK, Zimmerman J, Dey A (2006) Principles of Smart Home Control. In: Proceedings of the Conference on Ubiquitous Computing, Springer, pp 19-34
[9] DLNA (2007) DLNA overview and vision whitepaper 2007. Digital Living Network Alliance, URL http://www.dlna.org/en/industry/pressroom/DLNA_white_paper.pdf
[10] Donegan M, Oosthuizen L, Bates R, Daunys G, Hansen J, Joos M, Majaranta P, Signorile I (2005) D3.1 User requirements report with observations of difficulties users are experiencing. Deliverable, Communication by Gaze Interaction (COGAIN), IST-2003-511598, URL http://www.cogain.org/results/reports/COGAIN-D3.1.pdf
[11] Figuereido L, Nunes T, Caetano F, Gomes A (2008) Magic environment. In: Istance H, Štěpánková O, Bates R (eds) Proceedings of COGAIN 2008
[12] Gale A (2005) Attention responsive technology and ergonomics. In: Bust PD, McCabe PT (eds) Proceedings of the International Conference on Contemporary Ergonomics (CE2005), Taylor and Francis, 978-0-415-37448-4
[13] Gruber T (1995) Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies 43(5-6):907-928
[14] Heikkilä H (2008) Gesturing with gaze. In: Istance H, Štěpánková O, Bates R (eds) Proceedings of COGAIN 2008
[15] HGI (2008) Home gateway technical requirements: Residential profile. Tech. rep., Home Gateway Initiative, URL www.homegatewayinitiative.org/publis/HGI_V1.01_Residential.pdf
[16] Istance H (2008) User performance of gaze based interaction with on-line virtual communities. In: Istance H, Štěpánková O, Bates R (eds) Proceedings of COGAIN 2008
[17] Jacob R (1990) What you look at is what you get: eye movement-based interaction techniques. In: Carrasco J, Whiteside J (eds) Proceedings of the ACM CHI 90 Human Factors in Computing Systems Conference, Washington, USA, pp 11-18
[18] Lowe D (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2):91-110
[19] McGuinness DL, van Harmelen F (2004) OWL Web Ontology Language overview. W3C Recommendation, http://www.w3.org/TR/owl-features/
[20] Nakashima H (2007) Cyber Assist Project for Ambient Intelligence. In: Advances in Ambient Intelligence, Series: Frontiers in Artificial Intelligence and Applications, vol 164, pp 1-20
[21] Novák P, Krajník T, Přeučil L, Fejtová M, Štěpánková O (2008) AI support for a gaze-controlled wheelchair. In: Istance H, Štěpánková O, Bates R (eds) Proceedings of COGAIN 2008
[22] OSGi (2007) OSGi service platform release 4. Tech. rep., The OSGi Alliance
[23] Saito T, Tomoda I, Takabatake Y, Arni J, Teramoto K (2000) Home gateway architecture and its implementation. IEEE Transactions on Consumer Electronics 46(4):1161-1166
[24] Shi F, Gale A, Purdy K (2006) Helping people with ICT device control by eye gaze. In: Miesenberger K, Klaus J, Zagler W, Karshmer A (eds) Computers Helping People with Special Needs, Springer Verlag, Berlin, Lecture Notes in Computer Science, pp 480-487
[25] Shi F, Gale A, Purdy K (2006) SIFT approach matching constraints for real-time attention responsive system. In: Byun J, Pan Z (eds) Proceedings of the 5th Asia Pacific International Symposium on Information Technology, Hangzhou, China
[26] Winer D (2003) XML-RPC specification. Tech. rep., UserLand Software
Middleware Architecture for Ambient Intelligence in the Networked Home
Nikolaos Georgantas, Valerie Issarny, Sonia Ben Mokhtar, Yerom-David Bromberg, Sebastien Bianco, Graham Thomson, Pierre-Guillaume Raverdy, Aitor Urbieta and Roberto Speicys Cardoso
1 Introduction
With computing and communication capabilities now embedded in most physical objects of the surrounding environment and most users carrying wireless computing devices, the Ambient Intelligence (AmI) / pervasive computing vision [28] pioneered by Mark Weiser [32] is becoming a reality. Devices carried by nomadic users can seamlessly network with a variety of devices, both stationary and mobile, both nearby and remote, providing a wide range of functional capabilities, from base sensing and actuating to rich applications (e.g., smart spaces). This then allows the dynamic deployment of pervasive applications, which dynamically compose functional capabilities accessible in the pervasive network at the given time and place of an application request. In the smart home environment, this translates into integrating today's, still mostly distinct, four application domains of the networked home:
• Personal computing, based on the home computers, printers and Internet connection;
• Mobile computing, manifested by the increasing use of home Wi-Fi and Bluetooth networks connecting laptops, PDAs and tiny personal devices;
• Consumer electronics, targeting multimedia in the home; and
• Home automation, adding intelligence to household appliances like washing machines and lighting systems.
Nikolaos Georgantas, Valerie Issarny, Yerom-David Bromberg, Sebastien Bianco, Graham Thomson, Pierre-Guillaume Raverdy and Roberto Speicys Cardoso: INRIA Paris-Rocquencourt, France, e-mail: [email protected]
Sonia Ben Mokhtar: University College London, UK, e-mail: [email protected]
Aitor Urbieta: Mondragon Unibertsitatea, Spain, e-mail: [email protected]
Still, easing the development, from design to deployment, of pervasive applications raises numerous challenges. In this chapter, we concentrate on the software engineering aspect of pervasive computing, and more specifically on the adequate abstraction of networked resources, both software and hardware, and supporting software systems, so as to effectively enable pervasive applications to dynamically deploy over the pervasive network. Specifically, pervasive applications should be dynamically composed out of capabilities that are in reach to realize the target functional behavior while also meeting the required quality of service (also referred to as non-functional properties). Our aim is to offer a solution that enables exploiting most networked resources, without being restrictive in terms of underlying software and hardware platforms, nor in terms of assumed resources. Indeed, while the development of advanced middleware platforms for pervasive computing has led to significant progress over the last decade [10], those platforms require consumers and providers of resources to agree on a common syntactic description of capabilities and a common distributed runtime environment for them to actually network. This assumption is too restrictive for truly open pervasive environments, despite existing ubiquitous software technologies, and in particular those from the Web. A common syntactic description of capabilities is not achievable on a large-scale basis, nor is the usage of a common middleware platform. Concretely, this chapter introduces the Amigo middleware approach for the open networked home, which was designed as part of the IST FP6 Amigo1 project on the development of a networked home system towards the ambient intelligence vision. Furthermore, in Amigo we targeted the extended home environment, where users get access to applications also between homes, between home and workplace, and potentially anytime, anywhere. This chapter more specifically focuses on the AmIi2 middleware interoperability solution developed at INRIA as part of the Amigo system architecture. As opposed to most existing pervasive system architectures, the Amigo approach does not impose any specific middleware technology, as it allows heterogeneous technologies to be integrated, establishing interoperability at a higher, semantic, level. In Amigo, we exploit service oriented architectures [22], which appear as the appropriate architectural paradigm to cope with the above requirements [14]. Therein, networked devices and hosted applications are abstracted as services, which may dynamically be retrieved and composed, based on service discovery protocols as well as choreography and orchestration protocols [23]. We further advocate the usage of semantic services. The semantics of an entity encapsulate the meaning of this entity by reference to an established vocabulary of terms (ontology) representing a specific area of knowledge. In this way, meanings of entities become machine-interpretable, enabling machine reasoning on them. Such concepts come from the Knowledge Representation field and have been applied and further evolved in the Semantic Web3 domain [5].
1 http://www.hitech-projects.com/euprojects/amigo/
2 Ambient Intelligence interoperability
The Web Ontology Language (OWL)4 is a recommendation by W3C supporting formal description of ontologies and reasoning on them. A natural evolution of this has been the combination of the Semantic Web and Web Services5, the currently dominant service oriented technology, into Semantic Web Services [19]. This effort aims at the semantic specification of Web Services towards automating Web service discovery, invocation, composition and execution monitoring. The Semantic Web and Semantic Web Services paradigms address Web service interoperability [30, 21]. Nevertheless, our goal in Amigo is wider, that is, to address service interoperability without being tied to any service technology. To this end, we establish semantic service modelling independently of underlying service technologies. Based on such modelling, we empower the networked environment with interoperability mechanisms so as to allow networked services to seamlessly compose, independently of their underlying software technologies. The remainder of this chapter is structured as follows. Sect. 2 presents the Amigo abstract reference service architecture that targets service interoperability, and further discusses limitations of related work with respect to achieving interoperability in service oriented systems. Sect. 3 then introduces the AmIi solution to interoperable service discovery; it is based on a generic service model on which the various existing service description languages may be mapped and thus compared against each other for the purpose of service matching; it further introduces a repository-based service discovery solution. Sect. 4 then concentrates on the AmIi solution to achieving interoperability among heterogeneous middleware communication protocols; once discovered as in Sect. 3, heterogeneous services can communicate as in Sect. 4. Finally, Sect. 5 concludes with a summary of our contribution and sketches our future work on enablers for pervasive computing.
2 Achieving Interoperability in AmI Environments
The Amigo service oriented system architecture aims to enable the integration of diverse technologies in terms of networks, devices and software platforms. Thus, in the design of the Amigo system architecture, only a limited number of technology-specific restrictions are imposed. This further means that existing service platforms (e.g., Web Services6, UPnP7, etc.) relevant to the four home application domains are retained. Our intention was not to develop yet another service platform imposing a homogeneous middleware layer on all devices within the Amigo home, but to introduce an abstract reference service architecture for the Amigo system which can represent various service platforms by abstracting their fundamental features.
3 http://www.w3.org/2001/sw/
4 http://www.w3.org/TR/owl-semantics/
5 http://www.w3.org/TR/ws-arch/
6 http://www.w3.org/2002/ws/
7 http://www.upnp.org
In the following section, we introduce the Amigo architecture, while in Sect. 2.2 and Sect. 2.3 we discuss related work on service platform interoperability.
2.1 The Amigo Service Architecture for Interoperability
The Amigo reference architecture [11] is based on a typical service oriented architecture, which follows a general application-middleware-platform layering structure [22]. In this typical architecture, the platform layer offers base system and network support, while the middleware layer includes essential mechanisms for service discovery and service communication. Furthermore, services in the application layer are functionally described based on a common syntactic service description language in order to enable service discovery and invocation independently of service implementation details. The elements of the typical service architecture are depicted on the left-hand side of Fig. 1 in normal typeface. In Amigo, we enhanced the typical architecture to account for the open AmI environment. Hence, it is assumed that all three layers of the architecture may be heterogeneous and based on diverse technologies. Furthermore, support for context-awareness and quality of service (QoS)-awareness is incorporated, as these are core features in AmI environments. This leads to the enhancement of service description with context and QoS features, and to the establishment of capability- and resource-awareness in the base system and network. The above advanced elements added to the typical service architecture are depicted on the left-hand side of Fig. 1 in boldface. Furthermore, to deal with the heterogeneity present in this enhanced service architecture for AmI, we introduce a common architectural and behavioural description at a higher, technology-independent level, based on semantic concepts. This description aims at enabling interoperability between heterogeneous service platforms in AmI environments, effectively offering Ambient Intelligence interoperability (AmIi); we call this description the AmIi Service Description Model (ASDM). ASDM covers the elements identified in the enhanced service architecture, not only in the application layer but also in the underlying middleware and platform layers. This abstraction is illustrated in Fig. 1, where the right-hand layered structure complements the left-hand enhanced service architecture introduced so far, to produce the Amigo abstract reference service architecture. Based on ASDM, we enable deployment of AmIi interoperability mechanisms within the reference architecture, which may concern any of the three layers (see Fig. 1). These mechanisms aim at establishing semantic-based service interoperability, and comprise conformance relations and interoperability methods. Conformance relations aim at checking conformance (matching) between services for assessing their capacity to interoperate. Interoperability methods aim at enabling integration of partially conforming services, thus allowing their seamless discovery, communication and composition.
Building on the above principles, we have developed in Amigo interoperability solutions to all of service discovery, communication and composition. In this chapter, we present the first two solutions; for the third, the interested reader is referred to [2]. In the next two sections, we survey related research work on service discovery and communication interoperability.
2.2 Service Discovery Interoperability
Service discovery protocols (SDPs) enable services on a network to discover each other, express opportunities for collaboration, and compose themselves into larger collections that cooperate to meet an application's needs. Many academic and industry-supported SDPs have already been proposed, such as UDDI or CORBA's Trading Service for the Internet, or SLP and Jini for local and ad hoc networks. Classifications for SDPs [33] distinguish between pull-based and push-based protocols. In pull-based protocols, clients send a request to service providers (distributed pull-based mode) or to a third-party repository (centralized pull-based mode) in order to get a list of services compatible with the request attributes. In push-based protocols, service providers provide their service descriptions to all clients, which locally maintain a list of the available networked services. Leading SDPs in pervasive environments use a pull-based approach (Jini, SSDP), often supporting both the centralized and distributed modes of interaction (SLP, WS-Discovery).
Fig. 1 The Amigo abstract reference service architecture
In centralized pull-based discovery protocols, one or a few repositories store the descriptions of the available services in the network, and their location is either well-known (e.g., UDDI) or dynamically discovered (e.g., Jini). Repositories are usually kept up to date by requiring explicit sign-off or by removing entries periodically. If multiple repositories exist, they cooperate to distribute the service registrations among them or to route requests to the relevant repository according to pre-established relationships. Although many SDP solutions with well-proven protocol implementations are now available, middleware heterogeneity raises interoperability issues between the different SDPs (e.g., SLP, SSDP, UDDI) active in the environment. Existing SDPs do not directly interoperate with each other, as they employ incompatible formats and protocols for service descriptions or discovery requests, and also use incompatible data types or communication models. In any case, the diverse environment constraints and the de facto standard status of some of the existing protocols make it unlikely for a global, unique SDP to emerge. Several projects have thus investigated interoperability solutions [20, 12, 15], as requiring clients and service providers to support multiple SDPs is not realistic. SDP interoperability is typically achieved using intermediate common representations of service discovery elements (e.g., service description, discovery request) [8] instead of direct mappings [15], as the latter does not scale well with the number of supported protocols. Furthermore, the interoperability layer may be located close to the network layer [8], and efficiently and transparently translate network messages between protocols, or may provide an explicit interface [24] to clients or services so as to extend existing protocols with advanced features such as context management [25]. Furthermore, the matching of service requests and service advertisements is classically based on assessing the syntactic conformance of functional and non-functional properties. However, an agreement on a common syntactic standard is hardly achievable in open environments. Thus, higher-level abstractions, independent of the low-level syntactic realizations specific to the technologies in use, should be employed for denoting service and context semantics [3]. A number of approaches for semantic service specification have been proposed, in particular for semantic Web services, such as OWL-S [18] or SAWSDL. In addition, it has been shown that semantic service discovery can be performed efficiently, which is a key requirement for the resource-limited devices found in pervasive environments [4]; supporting solutions lie in encoding ontology concepts off-line and adequately classifying service descriptions in the repository based on these encoded concepts. While the essence of the above issues is well understood, and individual solutions have been proposed that may form the foundations of a comprehensive service discovery solution for pervasive environments, a number of problems remain, or arise from such a combination. First and foremost, syntactic-based and semantic-based solutions have mostly been considered separately. Indeed, interoperability solutions enabling multi-protocol SDP have focused on syntactic SDPs. At the same time, semantic-based SDPs manage neither protocol nor network heterogeneity, and
context-aware SDPs assume the consistent use of a common ontology by all clients and providers. A better integration of the semantic and syntactic worlds is required, which is supported by the AmIi interoperability solution to service discovery, as presented in Sect. 3.
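The intermediate common representation approach can be made concrete with a small sketch: each legacy plugin translates whatever it hears on its own protocol into one neutral record, and matching is then performed over that record only. The field names and the purely capability-set matching below are deliberately simplistic placeholders for ASDL and its conformance relations, not the AmIi design itself.

```java
import java.util.List;
import java.util.Set;

// Neutral description produced by every protocol plugin (a toy stand-in for ASDL).
record CommonServiceDescription(String name, Set<String> capabilities, String endpoint) {}

// One plugin per legacy SDP (SLP, UPnP, WS-Discovery, ...) translates into the common form.
interface SdpPlugin {
    List<CommonServiceDescription> collectAdvertisements();
}

class CommonRepository {
    // A request matches a description when all requested capabilities are offered.
    static boolean matches(CommonServiceDescription description, Set<String> requestedCapabilities) {
        return description.capabilities().containsAll(requestedCapabilities);
    }

    static List<CommonServiceDescription> discover(List<SdpPlugin> plugins,
                                                   Set<String> requestedCapabilities) {
        return plugins.stream()
                .flatMap(plugin -> plugin.collectAdvertisements().stream())
                .filter(d -> matches(d, requestedCapabilities))
                .toList();
    }
}
```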
2.3 Service Communication Interoperability
In a dynamic open networked environment, applications/services need to adapt themselves to the context by switching, for instance, their communication protocol on the fly. This is currently not feasible, as the way service clients and providers are designed depends strongly on the middleware upon which they are developed. Thus, applications cannot be decoupled from their underlying middleware. For instance, considering RPC-based communication, an RMI client cannot switch its communication protocol on the fly to interact with a CORBA service, and vice versa, unless coupled with some interoperability system. The above issue outlines the need for a system enabling interoperability among middleware communication protocols. A number of such systems have been introduced since the emergence of middleware. These include middleware bridges, which can be direct or indirect. Direct bridges (e.g., RMI-IIOP8, IIOP-.NET9) provide interoperability between two fixed middleware, whereas indirect bridges assume the predominance of one specific middleware that acts as an intermediary [29]. Bridges may appear an attractive solution for interoperability. However, bridges are not suitable for dynamic open networks, and in particular those that are formed in an ad hoc manner, where the communication protocols used are not known in advance. Applications and/or services are strongly coupled to a given bridge and hence are not able to switch their communication protocol on the fly according to the networked environment context. Indeed, direct or indirect bridges are a static means (i.e., fixed at design or possibly deployment time) of overcoming middleware heterogeneity, as they require knowing in advance between which heterogeneous communication protocols interoperability is required. One possible solution to overcome the above constraint is to decouple applications/services from their bridge through proxy-based bridging, which acts as an intermediary. The proxy encapsulates one or more bridges and hides their implementation details from applications/services, thus making it possible to change transparently the bridge used according to the networking context. The most famous implementation of such a mechanism is the Java dynamic proxy. Still, its shortcomings are: (i) proxies are platform-specific; and (ii) if applications/services are not aware in advance of the bridges to be used, they need to have, or to dynamically download, the code of all possibly needed bridges (a dedicated bridge is needed for each pair of heterogeneous middleware), which can be very resource-consuming.
8 http://java.sun.com/products/rmi-iiop/
9 http://iiop-net.sourceforge.net/index.html
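The following sketch illustrates proxy-based bridging with the standard java.lang.reflect.Proxy API; the Bridge and TemperatureService interfaces are invented for illustration and do not correspond to any concrete bridge mentioned above.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

/** Illustrative bridge abstraction; a concrete bridge would wrap, e.g., RMI-IIOP. */
interface Bridge {
    Object invoke(String operation, Object[] args) throws Exception;
}

/** Example service contract the client programs against. */
interface TemperatureService {
    double currentTemperature(String room);
}

/** Invocation handler that forwards every call to whichever bridge is currently selected. */
class BridgingHandler implements InvocationHandler {
    private volatile Bridge bridge;                 // can be swapped at run time

    BridgingHandler(Bridge initial) { this.bridge = initial; }

    void switchBridge(Bridge newBridge) { this.bridge = newBridge; }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        return bridge.invoke(method.getName(), args);
    }
}

class ProxyBridgingDemo {
    static TemperatureService client(BridgingHandler handler) {
        return (TemperatureService) Proxy.newProxyInstance(
                TemperatureService.class.getClassLoader(),
                new Class<?>[] { TemperatureService.class },
                handler);
    }
}
```

The handler hides which bridge is in use and lets it be swapped at run time, but, as noted above, the code of every potentially needed bridge must still be present or downloadable on the device.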
Greater flexibility in bridge-based interoperability is brought by Enterprise Service Buses (ESBs), which address the shortcomings of proxy-based bridging noted above. An ESB [9] is a server infrastructure that acts as an intermediary among heterogeneous middleware through the integration of a set of reusable bridges. Applications/services are freed from the management overhead of interoperability. However, this requires an ESB server to be deployed and configured in the network. It is not reasonable to assume that such a server exists in every open networked environment joined by mobile devices, and in particular in ad hoc networks. The extended home environment targeted by Amigo is potentially such an open environment. As we cannot make any assumptions about the software infrastructure in such an environment, interoperability must be manageable by the networked devices themselves through their middleware, as in, e.g., ReMMoC. ReMMoC is one of the pioneering middleware systems to introduce interoperability support for open (wireless) networks [12]. In its latest version, ReMMoC is a Web Service-based reflective middleware that uses the Web Services abstraction (the abstract part of a WSDL document) to hide from applications the concrete communication protocol used to invoke remote services. Indeed, ReMMoC, thanks to its reflection mechanisms [31], dynamically selects the most appropriate communication protocol according to the context. Although ReMMoC is currently one of the most efficient and innovative middleware systems for interoperability, it faces some constraints. First, client applications must be developed using the ReMMoC middleware; interoperability is thereby available only to ReMMoC-based clients. In addition, ReMMoC is dedicated to client applications, thus excluding service providers from interoperability support. Providing interoperability on the provider side enables clients that are not themselves interoperable (e.g., not based on ReMMoC) to still interoperate with services that are based on a different communication protocol. In a way similar to ReMMoC, RMIX is a middleware that permits transparent dynamic binding with multiple communication protocols [16]. The originality of RMIX comes from its programming model, which is based on RMI. Hence, the reengineering of existing RMI applications to benefit from RMIX is reduced to a minimum. As a result, however, RMIX is dedicated to Java and uses functionalities inherent to the Java platform. Thus, interoperability is restricted to Java-compliant devices and/or services. The need to embed a Java Virtual Machine (JVM) and to rewrite non-Java applications to make them interoperable is a strong limitation. The same applies to OSGi10, which is a popular Java-based middleware that provides the capability to integrate different communication protocols for OSGi-specific applications. Summarizing the above survey of existing solutions to middleware communication interoperability, there is, to the best of our knowledge, no satisfying solution to middleware interoperability for open dynamic networks. This has led us to develop the AmIi interoperability solution to service communication, as detailed in Sect. 4.
10 http://www.osgi.org/
3 AmIi Interoperable Service Discovery
In Sect. 2, we identified the AmIi service description model (ASDM) and the AmIi interoperability mechanisms based upon it (comprising conformance relations and interoperability methods) as the elements that enable interoperability in the Amigo reference architecture. In this section, we detail our AmIi interoperable service discovery solution (AmIi-SD), which includes: (i) the design of the ASDM conceptual model and a concrete realization of the model, the AmIi Service Description Language (ASDL); (ii) conformance relations for matching heterogeneous services based on ASDL; and (iii) a repository-based service discovery mechanism. The ASDM model serves as a basis for mapping between the heterogeneous service description languages, both syntactic- and semantic-based, used by existing service discovery protocols (SDPs). Specifically, service descriptions given in syntactic-based languages (e.g., UPnP, SLP, WSDL) and semantic-based languages (e.g., SAWSDL, OWL-S, WSMO) can be translated to ASDM-based (more specifically ASDL-based) descriptions. As a result, a service request may be matched against a service description even if they are expressed in different service description languages. Interoperable service discovery is then achieved by deploying multi-SDP repositories in the network. Repositories run plugins associated with legacy SDPs and are thus able to serve requests and capture service advertisements from the various SDPs. Concretely, as depicted in Fig. 2, the AmIi service repository is a (logically) centralized repository that enables service discovery within the pervasive environment. Specific legacy SDP plugins register with the active SDPs in the network and translate requests and advertisements from legacy formats to ASDL (e.g., UPnP in Fig. 2). Depending on the specific SDP, the legacy plugin either directly performs service discovery (i.e., for pull-based only protocols) or registers for service advertisements (i.e., for push-based and hybrid protocols). In the latter case, the ASDL description generated from a service announcement is sent to the ASDL descriptions directory for storage. Additionally, the AmIi repository provides an explicit
Fig. 2 AmIi interoperable service discovery
API supported by the AmIi plugin that enables clients (resp. providers) in a network to issue service requests (resp. advertisements) directly in the ASDL format, thus benefiting from all its advanced semantic features. In this case, the ASDL directory stores the ASDL descriptions issued by the AmIi plugin. Finally, the matching engine combines various conformance relations to support both syntactic-based and semantic-based service descriptions (included in requests or advertisements), and thus provides interoperability between SDPs. When there is an incoming service request, it is translated by the appropriate plugin, and then the matching engine matches it against the ASDL directory. In the next sections, we present the ASDM model (Sect. 3.1) and its ASDL instantiation (Sect. 3.2). We then detail our interoperable matching relations (Sect. 3.3) and method for ranking the matching results (Sect. 3.4).
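Before detailing the model, a rough Java skeleton of this repository-and-plugin organization is given below; the interfaces and the simplistic name-based matching are assumptions made for illustration only and do not reflect the actual Amigo implementation.

```java
import java.util.List;

/** Hypothetical common form of a translated description or request (stands in for ASDL). */
record AsdlDescription(String serviceName, List<String> capabilityNames, String sourceProtocol) {}

/** One plugin per legacy SDP; it converts native messages to ASDL and back. */
interface SdpPlugin {
    String protocol();                                  // e.g., "UPnP"
    AsdlDescription translateAdvertisement(byte[] nativeAdvertisement);
    AsdlDescription translateRequest(byte[] nativeRequest);
}

/** Skeleton of the (logically) centralized multi-SDP repository of Fig. 2. */
class AmiiRepository {
    private final List<AsdlDescription> directory = new java.util.ArrayList<>();
    private final MatchingEngine engine = new MatchingEngine();

    /** Push-based or hybrid SDPs: store the translated advertisement. */
    void onAdvertisement(SdpPlugin plugin, byte[] nativeAdvertisement) {
        directory.add(plugin.translateAdvertisement(nativeAdvertisement));
    }

    /** Any SDP: translate the request and match it against the stored descriptions. */
    List<AsdlDescription> onRequest(SdpPlugin plugin, byte[] nativeRequest) {
        AsdlDescription request = plugin.translateRequest(nativeRequest);
        return directory.stream().filter(d -> engine.matches(request, d)).toList();
    }
}

/** Placeholder for the conformance relations of Sect. 3.3. */
class MatchingEngine {
    boolean matches(AsdlDescription request, AsdlDescription advertisement) {
        return advertisement.capabilityNames().containsAll(request.capabilityNames());
    }
}
```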
3.1 ASDM: A Model for Semantic and Syntactic Service Specification
The design of ASDM results from the analysis of many existing service description languages; it further builds upon two models coming from our previous work: (i) the MUSDAC service model [25], which supports interoperability between syntactic-based languages (e.g., UPnP, SLP and WSDL); and (ii) the EASY service model [4] for the semantic specification of service functional and non-functional capabilities. ASDM extends these two models to account for both semantic- and syntactic-based descriptions, as well as for non-functional properties associated with services. In this chapter, due to lack of space, we do not present the features related to the complete specification and matching of service non-functional properties; the interested reader is referred to [1]. In ASDM, a service description is composed of two parts: a profile and a grounding (see the corresponding UML diagram in Fig. 3). The service profile is described as a non-empty set of capabilities and a possibly empty set of non-functional properties, while the service grounding prescribes the way of accessing the service. A service capability is any functionality that may be provided by a service and sought by a client. It is described with: its name; its interface, comprising a possibly empty set of inputs and a possibly empty set of outputs; a potential conversation; and a possibly empty set of non-functional properties. Capabilities that do not have any associated input/output descriptions are those retrieved by legacy SDPs that simply use names to characterize capabilities (e.g., a native SLP service). Inputs associated with a capability are the information necessary for the execution of the capability, while outputs correspond to the information produced by the capability. Inputs and outputs of capabilities are described with their names, types, and a possible semantic annotation, which is a reference to a concept in an existing ontology. Capabilities may themselves have a semantic annotation. Capabilities that do not have semantic annotations associated with them or with their inputs and outputs are those provided by legacy services without an enriched semantic interface (e.g., a native UPnP service).
A conversation associated with a capability prescribes the way of realizing this capability through the execution of other capabilities. This conversation is described as a workflow of activities that may correspond either to elementary or to composite capabilities. Elementary capabilities are those that do not have a conversation, while composite capabilities are those that are themselves composed of other capabilities. In concrete terms, an elementary capability represents a basic interaction with a service. For instance, in the case of SLP services, it corresponds to the invocation of the whole service, whereas in the case of UPnP or Web Services it corresponds to the invocation of one of the service operations. Conversations are said to be semantic if they involve capabilities that have associated semantic annotations; they are said to be syntactic otherwise. Further, in ASDM, non-functional properties are related to context and QoS information of services. They are expressed at two levels: at the service (profile) level and at the capability level. Non-functional properties defined at the service level are those that apply to all the capabilities of the service. For instance, a property given at the service level stating that the service encrypts its messages using a particular encryption algorithm means that all the capabilities of the service employ the same encryption algorithm. On the other hand, a property given at the capability level and expressing, for instance, a latency property concerns only the capability itself. Non-functional properties have semantic annotations associated with them.
Fig. 3 The ASDM model
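As a reading aid, the ASDM structure of Fig. 3 can be transliterated into a few Java records, as sketched below. This is a simplification under the assumption that ontology references and groundings can be represented as plain strings; it is not the actual model implementation.

```java
import java.util.List;
import java.util.Optional;

/** Input or output of a capability: name, type, optional ontology reference. */
record Parameter(String name, String type, Optional<String> semanticAnnotation) {}

/** Context/QoS property, annotated with an ontology concept. */
record NonFunctionalProperty(String name, String value, String semanticAnnotation) {}

/** A capability: name, interface, optional conversation, optional semantics and properties. */
record Capability(String name,
                  List<Parameter> inputs,
                  List<Parameter> outputs,
                  Optional<String> semanticAnnotation,
                  Optional<Conversation> conversation,
                  List<NonFunctionalProperty> properties) {}

/** Workflow of activities over elementary or composite capabilities. */
record Conversation(List<Capability> activities) {}

/** Profile (capabilities + service-level properties) and grounding (how to access the service). */
record ServiceProfile(List<Capability> capabilities, List<NonFunctionalProperty> properties) {}
record ServiceGrounding(String protocol, String address) {}
record ServiceDescription(ServiceProfile profile, ServiceGrounding grounding) {}
```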
3.2 ASDL: A Language for Semantic and Syntactic Service Specification
Complementing the ASDM conceptual model, we introduce the AmIi Service Description Language (ASDL) as a concrete realization of the model. For the implementation of ASDL, we opted for an XML-based schema defining a container, which is combined with two emerging standard service description languages, namely SAWSDL and WS-BPEL. The ASDL description acts primarily as a top-level container for additional files describing facets of the service. SAWSDL is used to describe the capability interfaces, while WS-BPEL is used to express the conversations associated with capabilities. We employ SAWSDL for the definition of capability interfaces because it supports both semantic and syntactic specification of service attributes (e.g., inputs, outputs). Thus, both legacy syntactic descriptions and rich semantic descriptions can be translated to SAWSDL. On the other hand, WS-BPEL is a comprehensive language for workflow specification, which makes it adequate for conversation specification. It has been largely adopted both in industry and in academia.
Fig. 4 An ASDL description
WS-BPEL supports only syntactic conversation specification; however, if combined with SAWSDL, semantic conversations can be defined. Additional files may optionally be linked to the ASDL container to describe a service's non-functional properties using existing QoS and context models (e.g., SLAng [17], EASY [4]). Fig. 4 shows an example of an ASDL description where the service is composed of two capabilities. The first capability has a functional and non-functional description that comprises a reference to a SAWSDL file defining the capability interface, a WS-BPEL description that defines the conversation associated with the capability, as well as QoS and context descriptions given in a SLAng and an OWL file, respectively. The second capability of this service is only given with an interface description defined in a SAWSDL file. Given ASDL, heterogeneous service descriptions can be translated into reference ASDL descriptions by using translators associated with the service description languages that come along with legacy SDPs, e.g., UPnP2ASDL (see the legacy SDP plugins in Fig. 2; a translator of this kind is sketched after the scenarios below). It is also possible that a service provider makes available a rich semantic description of a service in ASDL, thus exploiting the expressiveness of the language set included in the ASDL container (see the AmIi plugin in Fig. 2). ASDL descriptions are used to assess the conformance of services against service requests. Fig. 5 gives an overview of how various legacy service descriptions are translated into ASDL, focusing on the functional aspects of services. In this figure, five different scenarios are identified:
1. A legacy service specified with the name of its provided functionality (e.g., an SLP service). The produced ASDL description contains the SLP grounding information and links to a SAWSDL description that contains a single operation whose name is the name of the SLP service, without any input or output specification.
2. A service that provides a list of operations described syntactically by their signatures, as is the case for UPnP services or Web services. The produced ASDL description links to a SAWSDL description that comprises a list of WSDL operations corresponding to the operations specified in the legacy description, without semantic annotations.
3. A service described as a set of semantically annotated operations (e.g., given as a SAWSDL description). The produced ASDL description can link directly to the given SAWSDL description file or, in the case of a description other than SAWSDL, to the result of mapping the given description to SAWSDL.
4. A syntactic capability described with an associated conversation of operations (e.g., a service described as a WSDL operation that is realized through the execution of a WS-BPEL conversation). The generated ASDL description contains the specification of both an interface and a conversation. The interface points to a SAWSDL description that contains a single operation without semantic specification and is used to describe the capability. The conversation links to a WS-BPEL description that describes the conversation associated with the operation. This WS-BPEL description itself uses another WSDL file that specifies the operations used in the conversation.
5. A semantic capability having an associated conversation of semantic operations (e.g., an OWL-S service with a profile that describes the semantic capability and a process model that describes the associated conversation). The generated ASDL description comprises both an interface and a conversation, as in the previous scenario. However, contrary to the previous scenario, the SAWSDL description used to describe the capability includes semantic annotations of the capability elements (i.e., inputs, outputs). Furthermore, the WS-BPEL file describing the conversation associated with the capability uses another SAWSDL description in which the operations used in the conversation are also semantically annotated.
Fig. 5 Various legacy service descriptions translated into ASDL
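The first two scenarios can be illustrated with the hypothetical translator sketch below. The record types and class names (Slp2Asdl, Upnp2Asdl) are assumptions for illustration; a real translator would emit an ASDL container referencing SAWSDL files rather than in-memory records.

```java
import java.util.List;

/** Simplified stand-ins for the ASDL elements produced by a translator (hypothetical names). */
record AsdlOperation(String name, List<String> inputs, List<String> outputs, String semanticAnnotation) {}
record AsdlContainer(String grounding, List<AsdlOperation> operations) {}

/** Scenario 1: a legacy SLP service is known only by name (no inputs, outputs, or semantics). */
class Slp2Asdl {
    AsdlContainer translate(String slpServiceName, String slpUrl) {
        AsdlOperation op = new AsdlOperation(slpServiceName, List.of(), List.of(), null);
        return new AsdlContainer(slpUrl, List.of(op));
    }
}

/** Scenario 2: a UPnP-like service exposes a syntactic operation list; no semantic annotations. */
class Upnp2Asdl {
    record LegacyOperation(String name, List<String> inArgs, List<String> outArgs) {}

    AsdlContainer translate(String controlUrl, List<LegacyOperation> actions) {
        List<AsdlOperation> ops = actions.stream()
                .map(a -> new AsdlOperation(a.name(), a.inArgs(), a.outArgs(), null))
                .toList();
        return new AsdlContainer(controlUrl, ops);
    }
}
```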
3.3 Interoperable Matching of Service Capabilities
Based on their common mapping to ASDL descriptions, heterogeneous service descriptions (a service request and a service advertisement) can be matched. We introduce a set of conformance relations for matching the various cases of heterogeneous service descriptions, inspired by the scenarios of the previous section. These conformance relations enable different degrees of matching, from basic capability name matching to advanced semantic conversation matching. The choice of the appropriate conformance relation depends on how much information is given in the service descriptions. Obviously, in each case, the highest degree of matching possible is limited by the greatest common denominator of information given in the two service descriptions. For instance, comparing a service request described as a syntactic capability name (first scenario) with a rich semantic service description (third scenario) requires ignoring the semantic annotations as well as the input and output information, and performing a syntactic comparison of the request with the names of the service capabilities. The different cases of matching heterogeneous service descriptions are outlined in the table depicted in Fig. 6. In this table, following the five scenarios of the previous section, a service request and a service advertisement can each be described as: (1) a syntactic capability name; (2) a list of syntactic capabilities; (3) a list of semantic capabilities; (4) a syntactic capability with an associated syntactic conversation; or (5) a semantic capability with an associated semantic conversation. Fig. 6 thus shows twenty-five cases of matching a service request with a service advertisement. Among these cases, we can identify a certain redundancy by applying the greatest common denominator rule discussed above. This leads us to introduce the following five matching relations:
Fig. 6 Interoperable matching of service capabilities
1. Syntactic matching of capability names, noted SynNameMatch(). It applies when the service request or the service advertisement is described only as a syntactic capability name. This function is defined as follows:

SynNameMatch(C_adv, C_req) =
  C_adv.CapabilityName = C_req.CapabilityName

2. Syntactic signature matching, noted SynSigMatch(). It applies – to the remaining cases – when the request or the advertisement is described as a list of syntactic capabilities, or when additionally one of them is described as a syntactic conversation but the other side provides no conversation. The SynSigMatch() function is based on syntactic matching of the capability names and of the inputs and outputs of the corresponding capabilities of the request and the advertisement; it is defined as follows:

SynSigMatch(C_adv, C_req) =
  C_adv.CapabilityName = C_req.CapabilityName
  ∧ ∀ in_adv ∈ C_adv.Input, ∃ in_req ∈ C_req.Input : in_adv.Name = in_req.Name ∧ in_adv.Type = in_req.Type
  ∧ ∀ out_req ∈ C_req.Output, ∃ out_adv ∈ C_adv.Output : out_req.Name = out_adv.Name ∧ out_req.Type = out_adv.Type

3. Semantic signature matching, noted SemSigMatch(). It applies when both the request and the advertisement are described as a list of semantic capabilities, or when additionally one of them is described as a semantic conversation but the other side provides no conversation. The SemSigMatch() function is based on semantic matching of the capabilities and of the inputs and outputs of the corresponding capabilities of the request and the advertisement; it is defined below. It is based on the ConceptMatch() function, which is used to check whether two concepts are related in an ontology, i.e., if they are equivalent or one is more generic than the other [4]. If semantic conformance is established, syntactic adaptation should then be performed between the syntactic signatures of the capabilities.

SemSigMatch(C_adv, C_req) =
  ConceptMatch(C_adv.CapabilityName, C_req.CapabilityName)
  ∧ ∀ in_adv ∈ C_adv.Input, ∃ in_req ∈ C_req.Input : ConceptMatch(in_adv.SemanticAnnotation, in_req.SemanticAnnotation)
  ∧ ∀ out_req ∈ C_req.Output, ∃ out_adv ∈ C_adv.Output : ConceptMatch(out_req.SemanticAnnotation, out_adv.SemanticAnnotation)

4. Syntactic conversation matching, noted SynConvMatch(). It applies when both the request and the advertisement are described as conversations, but at least one
of them is only syntactic. First, the SynSigMatch() function is used to match the request and the advertisement based on their signature. Then, matching of conversations is performed. As WS-BPEL can be attributed formal semantics based on process algebras, conformance between the requested and provided conversations can be assessed using process bi-simulation. In the case where no service that provides an equivalent conversation to the request is found, we may employ mechanisms for dynamic composition of heterogeneous services to reconstruct the requested conversation, as addressed by our previous work reported in [2]. 5. Semantic conversation matching, noted SemConvMatch(). It applies when both the request and the advertisement are described as semantic conversations. This case is similar to the previous one, except that we employ the SemSigMatch() function for the initial matching.
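A direct transcription of the first three relations into Java might look as follows; the record types and the Ontology interface standing in for ConceptMatch() are illustrative assumptions, not the chapter's implementation.

```java
import java.util.List;

/** Minimal capability view used by the matching sketch (names, types, ontology concepts). */
record Param(String name, String type, String concept) {}          // concept may be null for syntactic descriptions
record Cap(String name, String concept, List<Param> inputs, List<Param> outputs) {}

/** Stand-in for the ontology reasoner behind ConceptMatch() [4]. */
interface Ontology {
    boolean conceptMatch(String c1, String c2);   // equivalent, or one is more generic than the other
}

class CapabilityMatching {

    /** SynNameMatch: capability names must be syntactically equal. */
    static boolean synNameMatch(Cap adv, Cap req) {
        return adv.name().equals(req.name());
    }

    /** SynSigMatch: names equal, every advertised input covered by the request,
     *  every requested output covered by the advertisement (name and type). */
    static boolean synSigMatch(Cap adv, Cap req) {
        return synNameMatch(adv, req)
            && adv.inputs().stream().allMatch(ia -> req.inputs().stream()
                   .anyMatch(ir -> ia.name().equals(ir.name()) && ia.type().equals(ir.type())))
            && req.outputs().stream().allMatch(or -> adv.outputs().stream()
                   .anyMatch(oa -> or.name().equals(oa.name()) && or.type().equals(oa.type())));
    }

    /** SemSigMatch: the same structure, but all comparisons go through the ontology. */
    static boolean semSigMatch(Cap adv, Cap req, Ontology ont) {
        return ont.conceptMatch(adv.concept(), req.concept())
            && adv.inputs().stream().allMatch(ia -> req.inputs().stream()
                   .anyMatch(ir -> ont.conceptMatch(ia.concept(), ir.concept())))
            && req.outputs().stream().allMatch(or -> adv.outputs().stream()
                   .anyMatch(oa -> ont.conceptMatch(or.concept(), oa.concept())));
    }
}
```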
3.4 Ranking Heterogeneous Matching Results
As discussed in the previous sections, service descriptions are heterogeneous in terms of their expressiveness, ranging from very simple syntactic definitions to very rich semantic definitions with associated conversations. We therefore defined five matching relations to assess the conformance of heterogeneous service advertisements with respect to a particular service request. When multiple service advertisements match a service request, we should further be able to select the service that best matches the request. This requires being able to rank the heterogeneous matching results. Ranking such results depends on the expressiveness of service requests and service advertisements. For instance, services that have semantic annotations are preferred to syntactic services when the request is given with semantic annotations. Furthermore, semantic services that match a semantic request should be ranked with respect to their semantic distance to the request; the semantic distance evaluates the degree of conformance of a semantic service capability with respect to a request. We present in this section our mechanism for ranking service advertisements with respect to a service request. First, according to the degree of expressiveness of the service request, results coming from our five matching relations are ranked according to the table in Fig. 7, where m1 > m2 means that matching m1 is more accurate than m2 and thus preferred. Second, the results of the semantic signature matching function, i.e., SemSigMatch(), are themselves classified according to their degree of conformance to the given service request. The degree of conformance between a semantic request and a semantic service advertisement is evaluated using the function SemSigDegreeOfMatch(), which sums the results of ConceptDegreeOfMatch() for all concept pairs matched during SemSigMatch(), as follows:
SemSigDegreeOfMatch(C_adv, C_req) =
  Σ_{i=1..n} ConceptDegreeOfMatch(c_i, c'_i), ∀ c_i, c'_i : ConceptMatch(c_i, c'_i)

where ConceptDegreeOfMatch(c_1, c_2) returns the number of levels that separate the concepts c_1 and c_2 in the ontology hierarchy that contains them [4].
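A possible implementation sketch of this ranking step is shown below. It assumes, as the text suggests but does not state explicitly, that a smaller accumulated concept distance denotes a closer match and is therefore ranked first; all type names are hypothetical.

```java
import java.util.Comparator;
import java.util.List;

/** Stand-in for the ontology distance used by ConceptDegreeOfMatch() [4]. */
interface ConceptHierarchy {
    int levelsBetween(String c1, String c2);   // 0 if equivalent, >0 if one subsumes the other
}

/** Pair of matched concepts coming out of SemSigMatch(). */
record MatchedConcepts(String requested, String advertised) {}

/** Advertisement together with the concept pairs matched against a given request. */
record SemanticMatch(String serviceId, List<MatchedConcepts> matchedConcepts) {}

class SemanticRanking {

    /** SemSigDegreeOfMatch: sum of ontology distances over all matched concept pairs. */
    static int degreeOfMatch(SemanticMatch m, ConceptHierarchy ontology) {
        return m.matchedConcepts().stream()
                .mapToInt(p -> ontology.levelsBetween(p.requested(), p.advertised()))
                .sum();
    }

    /** Smaller accumulated distance means closer to the request, hence ranked first. */
    static List<SemanticMatch> rank(List<SemanticMatch> matches, ConceptHierarchy ontology) {
        return matches.stream()
                .sorted(Comparator.comparingInt((SemanticMatch m) -> degreeOfMatch(m, ontology)))
                .toList();
    }
}
```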
4 AmIi Interoperable Service Communication
Services located through the AmIi interoperable service discovery (AmIi-SD) may embed various middleware communication protocols. It is then necessary to overcome such heterogeneity, so that interaction with these services may be possible independently of the middleware technologies embedded on the client's and provider's sides. In this section, we present our AmIi solution to interoperable service communication (AmIi-COM) [7], which is based on runtime protocol translation. We more specifically focus on interoperability among heterogeneous RPC protocols, as this is an essential communication model in service-oriented computing. In the following, we first discuss the relation between AmIi-COM and AmIi-SD (Sect. 4.1). We then detail our AmIi-COM solution. More specifically, we recall the characteristics of RPC communication protocols (Sect. 4.2), in order to introduce efficient event-based techniques to overcome RPC protocol heterogeneity (Sect. 4.3). Finally, Sect. 4.4 discusses the deployment and configuration of AmIi-COM.
4.1 Interoperable Service Discovery and Communication
In the previous sections, we already discussed discovery of services. We herein recall some of the elements of this process and see how it is coupled with service communication.
Fig. 7 Ranking heterogeneous matching results
In order to interact with services in open networked environments, clients must first find remote services using some SDP. They then rely on specific information about the discovered services to actually interact with them. Clients look up the information needed to interact with services in service repositories, which are logical centralization points for such information. Each RPC communication middleware depends on a dedicated repository. For instance, Web Services, which are based on the SOAP protocol, use UDDI11, whereas the RMI and CORBA middleware use repositories called, respectively, rmiregistry and the CORBA Naming/Trading Service. Hence, remote services must first publish/export their description to a repository in order to be accessed by clients (Fig. 8, Step 1). Through the export process, services advertise their interface, communication protocol, and unique reference/address. The first is a set of methods describing the service's communication contract, whereas the latter two provide a means to locate and access the service instance. This information enables producing a client-side stub that acts as a proxy for the remote service. Clients then use the stub as a handle to make method calls on the remote service. The way stubs are produced and obtained by clients may differ from one middleware to another. Stubs can be obtained statically or dynamically. In the former case, stubs are generated at development time, so that clients do not need to get them at run time; still, clients retrieve at run time from the repository the service's reference/address, which they feed into the static stub. In the latter case, stubs are transparently created by the export process and registered with the repository (Fig. 8, Step 2); in this case, clients retrieve the whole stub. In the following, we consider only the dynamic stub case as representative of both cases. The repository's location is either known in advance by the client or dynamically discovered using some SDP. Once the client gets the stub (Fig. 8, Step 3), it can interact with the desired service. To invoke a method on the remote service, the client makes a local call on the stub (Fig. 8, Step 4). The stub first marshals the call into a request message according to the communication protocol used by the middleware (e.g., IIOP for CORBA, JRMP for RMI) and then sends the message to the remote service (Fig. 8, Step 5). Hence, clients are not aware of the implementation specifics of services; stubs abstract their location, programming language and communication protocol. Finally, on the ser-
Fig. 8 RPC-based middleware architecture
11 http://www.uddi.org/specification.html
vice side, the incoming request message is unmarshalled by the service-side stub into a local call (Fig. 8, Step 6). Stubs are not always mandatory; some clients may dynamically generate method calls to the remote service (e.g., CORBA DII). In Amigo, we extended the above essential mechanism to deal with interoperability among heterogeneous discovery and communication protocols. As introduced in Sect. 3, the AmIi service repository assumes the role of a universal repository for different SDPs. Depending on the specific mechanism of each supported SDP and the RPC communication middleware coupled with it, the AmIi repository enables the export and retrieval of stubs as depicted in Fig. 9 (Steps 1 and 2). More specifically, using its native SDP, the service exports its stub, which conforms to (and indicates) its native communication middleware. On its side, using its native SDP, the client looks up the service. The native communication middleware of the client is inferred from its SDP – as indicated above, a communication middleware employs a standard SDP. Then, a translation is performed between the stub exported by the service and the stub to be retrieved by the client, as each stub conforms to its own native communication middleware. This procedure allows providing the appropriate stub to the requesting client. The retrieved stub is further customized so that, once used, it tunnels the interaction between the client and the service through the AmIi communication interoperability mechanism. AmIi-COM deals with communication protocol translation between the client and the service by employing appropriate protocol units (see Fig. 9, Steps 3 and 4). To enable AmIi-COM to configure the appropriate protocol units, the AmIi repository dispatches to it its knowledge about the communication protocols, along with the references/addresses of the client and the service. AmIi-COM may be located on the client, the service, or even a gateway device. AmIi-COM instances register themselves with the AmIi repository in their vicinity, which allows the AmIi repository to select the appropriate AmIi-COM instance. The whole process (AmIi-COM deployment and use) is carried out in a decentralized way that is totally transparent to the client and the service. In the following, we detail the AmIi-COM mechanism, starting with RPC basics in the next section.
Fig. 9 Coupling between AmIi-SD and AmIi-COM
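The client-side half of this mechanism (Fig. 8, Steps 3–5) can be sketched as follows; the Invocation, Marshaller and Transport abstractions are invented for illustration and do not correspond to the stub code actually generated by RMI or CORBA.

```java
/** Hypothetical invocation captured by a stub before marshalling. */
record Invocation(String serviceAddress, String method, Object[] args) {}

/** Protocol-specific wire formatting (e.g., an IIOP or JRMP encoder in a real middleware). */
interface Marshaller {
    byte[] marshal(Invocation call);        // Fig. 8, Step 5: build the request message
    Object unmarshalReply(byte[] reply);
}

/** Network send/receive abstraction (details omitted in this sketch). */
interface Transport {
    byte[] send(String address, byte[] message);
}

/** Minimal client-side stub: it hides the service location and protocol behind a local call. */
class TemperatureStub {
    private final String serviceAddress;    // retrieved from the repository (Fig. 8, Step 3)
    private final Marshaller marshaller;
    private final Transport transport;

    TemperatureStub(String serviceAddress, Marshaller marshaller, Transport transport) {
        this.serviceAddress = serviceAddress;
        this.marshaller = marshaller;
        this.transport = transport;
    }

    /** Step 4: the client makes a local call; the stub marshals and sends it (Step 5). */
    double currentTemperature(String room) {
        byte[] request = marshaller.marshal(
                new Invocation(serviceAddress, "currentTemperature", new Object[] { room }));
        byte[] reply = transport.send(serviceAddress, request);
        return (Double) marshaller.unmarshalReply(reply);
    }
}
```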
4.2 RPC Communication Stack
According to the OSI model, RPC communication protocols can be decomposed into layers, providing a functional division of the tasks required to enable successful interaction. As depicted in Fig. 10, RPC communication protocols decompose into 5 layers, defining a reference RPC communication stack. The network and transport layers are similar to the OSI ones. The former determines how data are transferred between networked devices, whereas the latter specifies how to manage end-to-end message delivery among networked entities. The invocation layer, refining the OSI session layer, defines how to manage sessions with remote services across the network and specifies the types of messages exchanged during an open session. The serialization layer, refining the OSI presentation layer, encodes messages according to a format specification. Finally, the application layer provides applications with an interface to perform remote procedure calls. For illustration, consider a device A hosting an RMI communication stack and another device B hosting a Web Services stack. As depicted in Fig. 11, assume an RMI-based application of A wishes to invoke a Web service of B (Fig. 11, Step 1). The corresponding request message passes through the 5 layers of the stack hosted on A (Fig. 11, Step 2). Specifically, the request is first passed to the application layer, which adds a header to the data. The resulting message is passed to the serialization layer, which adds its own header to the message it just received from its upper layer, and so on, all the way down to the IP network layer. At the IP layer, the resulting message is transmitted through the network medium to B (Fig. 11, Step 3). The message then traverses the 5 layers of the communication stack hosted on B (Fig. 11, Step 4). Each crossed layer extracts its corresponding header and passes the message payload to the next layer, and so on, all the way up to the topmost layer. Each added header contains information dedicated to the crossed layer and thus enables a direct layer-to-layer communication between the two stacks respectively hosted on A and B. However, although the communication stacks have a similar design, interoperability is not supported, as the stacks of A and B are bound to specific message types and data formats. In the RMI stack, the serialization layer offers functions to encode/decode application data in binary format according to the Java Object Serialization Stream Protocol12 (JOSSP) specification. In the Web Services stack, the same layer en-
Fig. 10 RPC communication protocol stack
12 http://java.sun.com/j2se/1.5.0/docs/guide/serialization/
codes/decodes data in XML format according to the SOAP specification. Thus, with regard to the serialization layer, RPC-based communication protocols do not differ in terms of functionality but in the way their communication stacks represent/transform data. Similarly, for the invocation layer, applications based on RMI send messages across the network in a binary format following the JRMP specification, whereas Web services use the HTTP specifications13. Regardless of the RPC-based communication protocol, the invocation layer always offers the same functions but differs, as before, in the way messages are sent across the network. Thanks to these functional commonalities of the protocol layers among heterogeneous protocols, a way to achieve communication protocol interoperability is to offer per-layer interoperability among heterogeneous communication protocol stacks. For instance, we should enable the invocation layers of RMI and Web Services to interoperate. Although these layers use different specifications to marshal/unmarshal network messages, this challenge can be addressed because they provide identical functions. The same applies to the serialization layer. Obviously, if the stack related to one communication protocol is enriched with new features through the addition of a new layer, interoperability may be compromised. However, our aim is not to modify existing communication protocols by enriching them with functionalities that they do not implement, even if others do. More particularly, we do not want to add new features if this implies changing existing applications. Hence, we enable communication protocol interoperability among different middleware only if there exist enough similarities in their corresponding protocol stacks. In other words, the quality of the interoperability among different protocol stacks achieved by AmIi-COM depends on the degree of their functional similarity. This is measurable in terms of the number of similar functions shared among the different protocol stacks, independently of the heterogeneity of the message/data formats, which is efficiently overcome through the use of event-based parsing techniques, as described in the next section. In fact, AmIi-COM provides interoperability for the greatest common denominator of similar functions.
Fig. 11 Layer-to-layer communication
13 Depending on the point of view, SOAP can also be considered an invocation protocol
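The downward and upward traversal of such a stack can be sketched as below, under the simplifying assumption that every layer merely prepends and strips a header; real serialization layers transform the payload itself rather than just wrapping it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** One layer of the reference RPC stack (application, serialization, invocation, transport, network). */
interface StackLayer {
    byte[] addHeader(byte[] payload);      // outbound: prepend this layer's header
    byte[] stripHeader(byte[] message);    // inbound: remove it and return the payload
}

/** Walks a message down the sender's stack and up the receiver's stack, as in Fig. 11. */
class LayeredStack {
    private final List<StackLayer> topToBottom;   // application first, network last

    LayeredStack(List<StackLayer> topToBottom) { this.topToBottom = topToBottom; }

    byte[] sendDown(byte[] applicationData) {
        byte[] message = applicationData;
        for (StackLayer layer : topToBottom) {     // application -> ... -> network
            message = layer.addHeader(message);
        }
        return message;                            // what is put on the wire
    }

    byte[] receiveUp(byte[] wireMessage) {
        Deque<StackLayer> bottomToTop = new ArrayDeque<>();
        topToBottom.forEach(bottomToTop::push);    // reverse the traversal order
        byte[] message = wireMessage;
        for (StackLayer layer : bottomToTop) {     // network -> ... -> application
            message = layer.stripHeader(message);
        }
        return message;
    }
}
```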
4.3 Event-based Interoperability
Following our previous work on the design of the INDISS interoperability system [8], interoperability for one layer of the communication protocol stack results from the composition of a protocol parser with a protocol composer. Specifically, a protocol parser generates semantic events according to input protocol messages, and the protocol composer performs the inverse process, each for a specific protocol. Cooperation between a parser and a composer is achievable because the parsed and produced protocols share similar functions, which are abstracted as events. An interoperability process is then a translation process resulting from the composition of a parser and a composer. Thus, for each layer, we have an interoperability process based on the set of events abstracting the functionalities of the layer that are common across heterogeneous communication protocols. According to the RPC communication stack, at least 5 interoperability processes are required to enable interoperability between two middleware based on different communication protocols. However, these processes cannot always be completely known in advance (e.g., an RMI stack may use JRMP or HTTP for the invocation layer). In this case, AmIi-COM can dynamically discover the structure of a remote protocol stack in order to select the appropriate parsers and to create and chain the interoperability processes, provided that AmIi-COM receives a message from the remote protocol stack. This is enabled by the structure of the network-layer message, as illustrated in Fig. 11. More specifically, every network message embeds the headers corresponding to the layers previously crossed. The set of headers is therefore a signature that reveals the composition of the protocol stack. Furthermore, by definition, a header always contains a magic number and/or a field specifying the current protocol used and/or the protocol expected at the next upper layer. This property hence enables progressively chaining the adequate parsers belonging to the different layers to generate a stream of events that semantically represents the RPC message. The chaining of interoperability processes is depicted in Fig. 12. In Step 1, the RPC call from device A is first parsed by the network parser. The parser decomposes the message into two distinct parts: the header and the payload. The former is transformed into an event stream that is forwarded to the network composer, while the payload is passed to the transport parser, which is the next parser in the chain. Recursively, the transport parser extracts from the received payload a new header, translated into events that are sent to the transport composer, and a new payload that is directed to the invocation parser, and so on, all the way down to the application parser, which finally translates the data of the RPC call into an event stream. Events from each parser are sequentially forwarded to the composers (Fig. 12, Step 2). However, the composers are not able to generate a message until the last parser of the chain has parsed the last payload. In fact, the composer at the bottom level generates the payload that is required by the composer at the level immediately above, and so on, all the way up to the network level (Fig. 12, Step 3). The resulting message is finally compliant with the protocol stack of device B (Fig. 12, Step 4). A similar process applies to the RPC reply from B to A. Therefore, to provide bidirectional communi-
cation between two different communication protocol stacks, a protocol unit corresponds to each protocol layer; it embeds the protocol parser and composer for the specific protocol layer, as depicted in Fig. 13. Further details about protocol units are given in [8] in the context of service discovery protocols. In practice, as layers are not always fully independent (e.g., SOAP defines both serialization and invocation protocols), the flow of events generated by the chain of parsers is forwarded to the whole chain of composers, and each composer is free to handle or ignore incoming events. In a way similar to Horus [26, 27], Ensemble [13], Coyote [6], and Microsoft Indigo14, protocol units are independent protocol modules or blocks that are stacked on top of each other to constitute a vertical protocol stack. However, we further introduce a dynamic composition of protocol stacks that is both vertical and horizontal. Vertical stack composition (i.e., vertical unit chaining) enables translating an RPC call into a stream of semantic events, whereas horizontal stack composition (i.e., horizontal unit chaining) translates the stream of semantic events into another protocol (see Fig. 13). In contrast to traditional protocol stacks, the combined vertical and horizontal stack composition of AmIi-COM enables translating one communication protocol into another without interpreting them. Also note that, contrary to the above systems that provide reconfiguration of protocol stacks [26, 6, 13], with AmIi-COM, applications and services are not aware of the reconfiguration of protocol compositions and are therefore not bound to the AmIi-COM system. The latter acts at the network layer on top of the operating
Fig. 12 Event-based interoperability
14 http://windowscommunication.net/Default.aspx
system and below legacy middleware (See Fig. 14). Further, AmIi-COM needs only to be deployed on one of the nodes involved in the communication, whether the client or the service host, or even a gateway.
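The following sketch captures the spirit of this parser/composer chaining; it is a simplified illustration of the event-based technique, not the INDISS or AmIi-COM code, and the event and interface names are assumptions.

```java
import java.util.List;
import java.util.function.Consumer;

/** A semantic event abstracting a protocol-independent function (e.g., "invoke", "reply"). */
record SemanticEvent(String kind, byte[] data) {}

/** Parser side of a protocol unit: turns a native message fragment into events plus the remaining payload. */
interface ProtocolParser {
    byte[] parse(byte[] message, Consumer<SemanticEvent> events);   // returns the payload for the next parser
}

/** Composer side of a protocol unit: consumes events and (re)builds a native message fragment. */
interface ProtocolComposer {
    void onEvent(SemanticEvent event);
    byte[] compose(byte[] innerPayload);   // wraps the payload produced by the layer below
}

/** Vertical chain of parsers feeding a vertical chain of composers (horizontal composition). */
class TranslationPipeline {
    private final List<ProtocolParser> parsers;      // source protocol, outermost (network) layer first
    private final List<ProtocolComposer> composers;  // target protocol, outermost (network) layer first

    TranslationPipeline(List<ProtocolParser> parsers, List<ProtocolComposer> composers) {
        this.parsers = parsers;
        this.composers = composers;
    }

    byte[] translate(byte[] incomingMessage) {
        byte[] payload = incomingMessage;
        for (ProtocolParser parser : parsers) {
            // every composer sees every event and is free to handle or ignore it
            payload = parser.parse(payload,
                    event -> composers.forEach(c -> c.onEvent(event)));
        }
        byte[] outgoing = new byte[0];
        for (int i = composers.size() - 1; i >= 0; i--) {   // re-composition runs from the innermost layer outward
            outgoing = composers.get(i).compose(outgoing);
        }
        return outgoing;
    }
}
```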
4.4 AmIi-COM Instances
In practice, there are not as many units to compose as there are protocol stack layers. In general, protocol stacks share a number of identical layers, thus reducing the number of units involved in protocol interoperability. For instance, as illustrated in Fig. 11, a majority of RPC protocols are based on TCP/IP, simplifying the interoperability system, which then works only on Layers 3 and 4 (i.e., the invocation and serialization layers). The TCP/IP drivers of the operating system act as the units dedicated to Layers 1 and 2. However, if the latter are heterogeneous, our system can also enable, through adequate units, dynamic interoperability among different networks. AmIi-COM is built around the concept of vertical and horizontal dynamic unit chaining. However, dynamic chaining is not without cost in terms of resource con-
Fig. 13 Vertical and horizontal chain composition to provide interoperability
Fig. 14 Localisation of the AmIi-COM system
sumption, is not always required, and is not always possible (as discussed above, it is based on the analysis of incoming messages). In these cases, the vertical composition of protocol units for each supported protocol stack can be achieved statically. Specifically, the service discovery process (the AmIi repository of AmIi-SD) enables AmIi-COM to select the adequate vertical stack, which is statically composed (see Sect. 4.1). However, interoperability among heterogeneous stacks remains dynamic, as does the horizontal composition of protocol units. More specifically, the AmIi-COM interoperability system is defined as a set of protocol units that can be either statically or dynamically composed. As illustrated in Fig. 15, the specification of an AmIi-COM instance defines the supported units (for the invocation and serialization layers) and the vertical protocol stacks that are statically composed. However, at run time (see Fig. 16), AmIi-COM may still dynamically create new vertical stacks, or reconfigure the existing, statically composed stacks by adding, removing or replacing protocol units, according to the context. Protocol units are not necessarily specific to one communication protocol and may be stacked in various ways. For instance, the vertical stack named RMI_2 in Fig. 15, which handles mobile code of RMI-based clients/services, depends on the HTTP unit, which is also used by the SOAP stack. In general, AmIi-COM instances evolve over time according to the communication protocols used by both the hosted applications and the available networked services. Accordingly, protocol units are reconfigured in order to provide interoperability between clients and services.
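The static part of such an instance specification could be captured, for illustration, by a small configuration object like the one below; the layer assignments and stack names mirror those mentioned in the text (e.g., RMI_2 sharing the HTTP unit with the SOAP stack), but the format itself is an assumption, as the chapter does not define the actual specification syntax.

```java
import java.util.List;
import java.util.Map;

/** A protocol unit for one layer (invocation or serialization), e.g., "JRMP", "HTTP", "SOAP", "JOSSP". */
record ProtocolUnit(String layer, String protocol) {}

/** A vertical stack is an ordered list of unit names; stacks may share units (e.g., HTTP). */
record VerticalStack(String name, List<String> unitNames) {}

/** Static part of an AmIi-COM instance specification, in the spirit of Fig. 15. */
record AmiiComSpec(Map<String, ProtocolUnit> units, List<VerticalStack> staticStacks) {}

class ExampleSpec {
    /** Illustrative instance: two RMI stacks and a SOAP stack sharing the HTTP invocation unit. */
    static AmiiComSpec build() {
        Map<String, ProtocolUnit> units = Map.of(
                "JRMP",  new ProtocolUnit("invocation",    "JRMP"),
                "HTTP",  new ProtocolUnit("invocation",    "HTTP"),
                "JOSSP", new ProtocolUnit("serialization", "JOSSP"),
                "SOAP",  new ProtocolUnit("serialization", "SOAP"));
        List<VerticalStack> stacks = List.of(
                new VerticalStack("RMI_1", List.of("JRMP", "JOSSP")),
                new VerticalStack("RMI_2", List.of("HTTP", "JOSSP")),   // RMI over HTTP, as in Fig. 15
                new VerticalStack("SOAP",  List.of("HTTP", "SOAP")));
        return new AmiiComSpec(units, stacks);
    }
}
```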
5 Conclusion
Ambient intelligence / pervasive computing environments, such as the smart networked home, should integrate networked devices, possibly wireless, from various application domains, e.g., the home automation, consumer electronics, mobile and personal computing domains. Such environments have introduced new challenges for middleware. Devices need to dynamically detect services offered by other de-
Fig. 15 Specification of an AmIi-COM instance
vices available in the open networked environment and adapt their communication protocols to interact with them, as services are implemented on top of diverse middleware (e.g., UPnP used in the home, Java RMI in the mobile domain). Addressing such a requirement calls for enabling interoperability among networked devices throughout their system architecture, i.e., their application, middleware and platform layers. Application-layer interoperability should allow networked services to meet and coordinate according to the applications to be provided to users, even though they have been developed independently, without a priori knowledge of their respective specific functionalities. Significant helpers towards dealing with this requirement are the Semantic Web and in particular Semantic Web Services, which allow high-level description of service functionalities and rigorous reasoning about them. However, being tied to a single service technology, such as Web Services, is also restrictive; application-layer interoperability should be enabled independently of service platforms. This further calls for middleware-layer interoperability (which may also include platform-layer aspects), as each service technology/platform employs a specific middleware technology/platform supporting the execution and networking of services; it cannot be assumed that all networked devices in the open AmI environment will eventually converge to a unique middleware technology. In the above context, the IST FP6 Amigo project aimed at the development of an extended home AmI environment. As part of our work in the INRIA ARLES15 group on the development of distributed systems enabling the AmI vision, we devise enablers for pervasive services. In this chapter, we specifically concentrated on our AmIi interoperability solution developed within Amigo. We introduced a reference system architecture that has three key features: (i) it is an enhanced service architecture imposing no specific technologies; (ii) it provides a full semantic description of services at a higher, technology-independent level; and (iii) it includes interoperability mechanisms as first-class entities based on this semantic description. Relying on these principles, we detailed in this chapter two interoperability mechanisms: (i) the AmIi interoperable service discovery (AmIi-SD), enabling service discovery across heterogeneous, both syntactic- and semantic-based, discovery protocols; and (ii) the AmIi interoperable service communication (AmIi-COM), enabling communication
Fig. 16 AmIi-COM instances
15 http://www-rocq.inria.fr/arles/
across heterogeneous RPC protocols. AmIi-SD and AmIi-COM work together to support complete interoperability among heterogeneous devices in the open AmI environment. Prototypes of the related software components have been implemented as part of the Amigo project and are available under an open source software license on the Amigo open source software Web site16 and our group's Web site17. Due to lack of space, we cannot detail these implementations or report on our performance evaluation results in terms of both resource consumption and response time; details may be found in the references included in the chapter. Performance evaluation has shown that our solutions comply with the requirements of the pervasive computing environment, i.e., resource constraints and high interactivity. The Amigo project and the presented AmIi interoperability approach were a step forward in enabling the AmI / pervasive vision. In our future work, we aim at further removing the technology barriers that constrain this vision. AmIi, although dynamic, is based on protocol translators that are statically designed and developed to address existing service-oriented architectures. In our view, the networking of systems should be completely agnostic to their specific technologies. We envisage systems that network behaviorally, as opposed to networking technologically. Systems should be able to unambiguously specify their networked functional and non-functional behavior based on adequate theoretical foundations. This should allow automated reasoning for behavioral matching and further adaptation of systems, enabling them to interact irrespective of their underlying technologies. More specifically, building upon work in the software architecture domain, we aim to investigate the formal specification of connectors, which allows reasoning upon and adaptation of a system's networked behavior at runtime. This adaptation will be performed through on-the-fly synthesis of connectors customized for the communicating networked systems. This approach envisages "eternal systems" that can seamlessly join any networked environment, existing or even future.
16 https://gforge.inria.fr/projects/amigo/
17 http://www-rocq.inria.fr/arles/

References

[1] Ben Mokhtar, S.: Semantic middleware for service-oriented pervasive computing. Ph.D. thesis, University of Paris 6, France (2007)
[2] Ben Mokhtar, S., Georgantas, N., Issarny, V.: Cocoa: Conversation-based service composition in pervasive computing environments with qos support. J. Syst. Softw. 80(12), 1941–1955 (2007). DOI http://dx.doi.org/10.1016/j.jss.2007.03.002
[3] Ben Mokhtar, S., Kaul, A., Georgantas, N., Issarny, V.: Efficient semantic service discovery in pervasive computing environments. In: Middleware '06: Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware, pp. 240–259. Springer-Verlag New York, Inc., New York, NY, USA (2006)
[4] Ben Mokhtar, S., Preuveneers, D., Georgantas, N., Issarny, V., Berbers, Y.: Easy: Efficient semantic service discovery in pervasive computing environments with qos and context support. J. Syst. Softw. 81(5), 785–808 (2008). DOI http://dx.doi.org/10.1016/j.jss.2007.07.030
[5] Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284(5), 34–43 (2001)
[6] Bhatti, N.T., Hiltunen, M.A., Schlichting, R.D., Chiu, W.: Coyote: A system for constructing fine-grain configurable communication services. ACM Transactions on Computer Systems 16, 321–366 (1998)
[7] Bromberg, Y.D.: Résolution de l'hétérogénéité des intergiciels d'un environnement ubiquitaire. Ph.D. thesis, University of Versailles-Saint Quentin en Yvelines, France (2006)
[8] Bromberg, Y.D., Issarny, V.: Indiss: Interoperable discovery system for networked services. In: Proceedings of the ACM/IFIP/USENIX 6th International Middleware Conference, pp. 164–183. Grenoble, France (2005)
[9] Chappell, D.A.: Enterprise Service Bus. O'Reilly Media (2004)
[10] Georgantas, N., Inverardi, P., Issarny, V.: Software platforms. In: E.H.L. Aarts, J.L. Encarnacao (eds.) True Visions: The Emergence of Ambient Intelligence, pp. 151–170. Springer Berlin Heidelberg (2006)
[11] Georgantas, N., Mokhtar, S.B., Bromberg, Y.D., Issarny, V., Kalaoja, J., Kantarovitch, J., Gerodolle, A., Mevissen, R.: The amigo service architecture for the open networked home environment. In: WICSA '05: Proceedings of the 5th Working IEEE/IFIP Conference on Software Architecture, pp. 295–296. IEEE Computer Society, Pittsburgh, Pennsylvania (2005). DOI http://dx.doi.org/10.1109/WICSA.2005.71
[12] Grace, P., Blair, G.S., Samuel, S.: A reflective framework for discovery and interaction in heterogeneous mobile environments. SIGMOBILE Mob. Comput. Commun. Rev. 9(1), 2–14 (2005). DOI http://doi.acm.org/10.1145/1055959.1055962
[13] Hayden, M., van Renesse, R.: Optimizing layered communication protocols. In: Proceedings of the Sixth IEEE International Symposium on High Performance Distributed Computing, pp. 169–177. Portland, Oregon (1997). DOI 10.1109/HPDC.1997.626686
[14] Issarny, V., Sacchetti, D., Tartanoglu, F., Sailhan, F., Chibout, R., Levy, N., Talamona, A.: Developing ambient intelligence systems: A solution based on web services. Automated Software Engg. 12(1), 101–137 (2005). DOI http://dx.doi.org/10.1023/B:AUSE.0000049210.42738.00
[15] Koponen, T., Virtanen, T.: Service discovery: a service broker approach. In: Proceedings of the 37th Annual Hawaii International Conference on System Sciences (2004). DOI 10.1109/HICSS.2004.1265669
[16] Kurzyniec, D., Wrzosek, T., Sunderam, V., Slominski, A.: Rmix: a multiprotocol rmi framework for java. In: Proceedings of the International Parallel and Distributed Processing Symposium (2003). DOI 10.1109/IPDPS.2003.1213269
[17] Lamanna, D.D., Skene, J., Emmerich, W.: Slang: A language for defining service level agreements. In: FTDCS '03: Proceedings of the Ninth IEEE Workshop on Future Trends of Distributed Computing Systems, pp. 100–106. IEEE Computer Society, Washington, DC, USA (2003)
[18] Martin, D., Paolucci, M., Mcilraith, S., Burstein, M., Mcdermott, D., Mcguinness, D., Parsia, B., Payne, T., Sabou, M., Solanki, M., Srinivasan, N., Sycara, K.: Bringing semantics to web services: The owl-s approach. In: J. Cardoso, A. Sheth (eds.) SWSWPC 2004, LNCS, vol. 3387, pp. 26–42. Springer (2004)
[19] McIlraith, S.A., Martin, D.L.: Bringing semantics to web services. IEEE Intelligent Systems 18(1), 90–93 (2003). DOI http://dx.doi.org/10.1109/MIS.2003.1179199
[20] Nakazawa, J., Tokuda, H., Edwards, W., Ramachandran, U.: A bridging framework for universal interoperability in pervasive systems. In: Proceedings of ICDCS'06: The 26th IEEE International Conference on Distributed Computing Systems. Lisboa, Portugal (2006). DOI 10.1109/ICDCS.2006.5
[21] O'Sullivan, D., Lewis, D.: Semantically driven service interoperability for pervasive computing. In: MobiDe '03: Proceedings of the 3rd ACM international workshop on Data engineering for wireless and mobile access, pp. 17–24. San Diego, CA, USA (2003). DOI http://doi.acm.org/10.1145/940923.940927
[22] Papazoglou, M.P., Georgakopoulos, D.: Service-oriented computing. Commun. ACM 46(10), 24–28 (2003). DOI http://doi.acm.org/10.1145/944217.944233
[23] Peltz, C.: Web services orchestration and choreography. IEEE Computer 36(10), 46–52 (2003). DOI http://dx.doi.org/10.1109/MC.2003.1236471
[24] Raverdy, P.G., Issarny, V., Chibout, R., de La Chapelle, A.: A multi-protocol approach to service discovery and access in pervasive environments. In: Proceedings of the 3rd Annual International Conference on Mobile and Ubiquitous Systems, pp. 1–9. San Jose, CA, USA (2006). DOI http://doi.ieeecomputersociety.org/10.1109/MOBIQ.2006.340448
[25] Raverdy, P.G., Riva, O., de La Chapelle, A., Chibout, R., Issarny, V.: Efficient context-aware service discovery in multi-protocol pervasive environments. In: MDM '06: Proceedings of the 7th International Conference on Mobile Data Management. IEEE Computer Society, Nara, Japan (2006). DOI http://dx.doi.org/10.1109/MDM.2006.78
[26] van Renesse, R., Birman, K.P., Friedman, R., Hayden, M., Karr, D.A.: A framework for protocol composition in horus. In: PODC '95: Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing, pp. 80–89. ACM, New York, NY, USA (1995). DOI http://doi.acm.org/10.1145/224964.224974
[27] van Renesse, R., Birman, K.P., Maffeis, S.: Horus: a flexible group communication system. Commun. ACM 39(4), 76–83 (1996). DOI http://doi.acm.org/10.1145/227210.227229
[28] Satyanarayanan, M.: Pervasive computing: Vision and challenges. IEEE Personal Communications 8, 10–17 (2001)
[29] Slominski, A., Govindaraju, M., Gannon, D., Bramley, R.: Design of an XML based Interoperable RMI System: SoapRMI C++/Java 1.1. In: Proceedings of PDPTA, pp. 1661–1667 (June 25-28, 2001)
[30] Tsounis, A., Anagnostopoulos, C., Hadjiefthymiades, S.: The role of semantic web and ontologies in pervasive computing environments. In: Proceedings of the Mobile and Ubiquitous Information Access Workshop, Mobile HCI '04. Glasgow, UK (2004)
[31] Van Engelen, R.A., Gallivan, K.A.: The gsoap toolkit for web services and peer-to-peer computing networks. In: CCGRID '02: Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, p. 128. IEEE Computer Society, Washington, DC, USA (2002)
[32] Weiser, M.: Some computer science issues in ubiquitous computing. Commun. ACM 36(7), 75–84 (1993). DOI http://doi.acm.org/10.1145/159544.159617
[33] Zhu, F., Mutka, M.W., Ni, L.M.: Service discovery in pervasive computing environments. IEEE Pervasive Computing 4(4), 81–90 (2005). DOI http://dx.doi.org/10.1109/MPRV.2005.87
The PERSONA Service Platform for AAL Spaces
Mohammad-Reza Tazari, Francesco Furfari, Juan-Pablo Lázaro Ramos, and Erina Ferro
1 An Introduction to the PERSONA Project
The PERSONA project aims at advancing the paradigm of Ambient Intelligence through the harmonization of Ambient Assisted Living (AAL) technologies and concepts for the development of sustainable and affordable solutions for the independent living of senior citizens. PERSONA is one of the integrated projects funded by the European Commission within the 6th Framework Program for IST (Information Society Technologies) on AAL for the Aging Society. It involves 21 partners from Italy, Spain, Germany, Greece, Norway and Denmark, with a total budget of around 12 million Euros. Ambient Assisted Living (AAL) is the concept that groups the set of technological solutions, named AAL Services, targeting the extension of the time that elderly and disabled people live independently in their preferred environment, i.e. their own four walls and the neighborhood and town where they are used to living [1]. AAL Services provide personalized continuity of care and assistance, dynamically adapted to the individual needs and circumstances of the users throughout their lives. The main objective of PERSONA is the development of a scalable, open, standard technological platform for building a broad range of AAL Services. A number of AAL Services are being implemented on top of this platform in order to demonstrate, test and evaluate their social impact and potential exploitation scenarios. The AAL Services
Mohammad-Reza Tazari
Fraunhofer-IGD, Darmstadt, Germany, e-mail: [email protected]
Francesco Furfari
CNR-ISTI, Pisa, Italy, e-mail: [email protected]
Juan-Pablo Lázaro Ramos
ITACA-TSB, Valencia, Spain, e-mail: [email protected]
Erina Ferro
CNR-ISTI, Pisa, Italy, e-mail: [email protected]
under development are divided into four different categories, according to an analysis of elderly needs with regard to a life with the highest possible level of independence:
• AAL Services supporting social inclusion and experience exchange,
• AAL Services supporting elderly users in their daily life activities,
• AAL Services supporting elderly people to feel more confident, safe and secure, and helping their relatives to manage risky situations, and
• AAL Services fostering mobility and supporting elderly people outside their homes.
Figure 1 shows an example of a scenario where the user is supported to live in a more confident, safe and secure way. It shows how the home environment is able to detect potentially risky situations and to propose actions to the user for mitigating the risks. The interaction mechanisms between the user and the system comply with the paradigm of natural and contextual communication, in order to minimize the impact on the user's life.
Fig. 1 Virtual scenario of AAL Services supporting elderly people to feel safe and secure
The targeted AAL technological platform is expected to be developed under the umbrella of a reference architecture for AAL spaces and AAL service provision, as the main scientific and technological outcome of the project. The PERSONA reference architecture for AAL systems has been designed to enable the smooth, flexible and modular integration of service-providing components using appropriate Information and Communication Technologies, and to create systems that are cost-effective, reliable and trusted, and adaptable to the requirements of each specific user in both home and urban environments. Of course, such an open reference architecture promotes its wide adoption by the related R&D community when it can be used as a powerful tool to build affordable AAL Services that help elderly people to live independently. Hence, one
of the most important challenges of PERSONA in the medium term is to transfer the project's results into a commercial setting within two or three years. For this reason, considerable resources are being devoted to defining the business strategy for bringing the proposed scenarios to a real-life market on a European scale, in which the different types of stakeholders participate and coordinate to provide innovative, affordable, sustainable and effective AAL Services for elderly people.
2 Requirements on a Service Platform for AAL Spaces
To derive both user and technical requirements, we investigate below the requirements on an AAL service platform from the viewpoint of both the services to be hosted and the users of such services.
2.1 User Requirements
An AAL Service is a composition of one or more functionalities that covers specific needs of elderly or disabled people, allowing them to live more independently. It ensures the continuity of care over time and in the preferred environments (home, neighborhood, village, also named AAL Spaces) by using an Ambient Intelligence infrastructure that provides the basis for service development and deployment. Before listing the requirements on such an ambient intelligence infrastructure, also known as an AAL Service Platform, we would like to analyze the requirements of end users on the final AAL Services to be delivered to them. In the scope of PERSONA, the end users are elderly people. This fact alone has many implications in terms of both coverage of needs and service delivery, because the targeted services involve technologies that are unusual or even unknown to this group of users, at least within the early decades of the 21st century, and can therefore cause a lack of acceptance or even rejection. Our analysis took the four categories of elderly needs (social integration, support for daily life activities, support for feeling safe and secure, and mobility) as the starting point, and targeted the realization of an ambient assisted living system that covers those needs or minimizes their impact. The sources of information used to gather the list of generic requirements for AAL Services were:
• ISTAG documents [7, 13] describing the Ambient Intelligence paradigm
• The ARTEMIS Strategic Research Agenda [2]
• The AAL program [1]
• PERSONA project investigations with experts about elderly care and ethical and legal issues.
The following list summarizes requirements derived from the combination and the analysis of the above sources, with a focus on the development of AAL Services:
• General requirements
– The new services must be as effective as the usual way of dealing with situations
– The user must be able to disconnect the service at all times
– The system must not be designed as a full replacement of personal care by relatives, neighbors or friends
– The system should always take into account the current context of the user in order to avoid inconsistency and ensure coherent service provision
– It should be possible to easily add new functionalities and services to a concrete physical infrastructure
– It should be possible to add new devices to the system while enabling their easy adoption and usage by the rest of the logical or physical components
– Devices used for building up the ambient infrastructure should be "invisible" as far as possible; in particular, installed sensors and those used in wearables shall be hidden / unnoticeable
– The automatic decisions of the system must have the user's confirmation and remain traceable for him/her
• User interaction with AAL Services
– Services should conform to usability standards and apply design-for-all with a focus on older adults
– The system must be easy to learn
– Adaptation: user interfaces shall incorporate features for coping with access impairments and capability changes due to aging
– Communication with the user should be done in a consistent way when using different modalities, such as voice, text, graphics, signals, actions, and state changes
• Privacy and data management
– Data obtained must be relevant for the purposes of the service, and from the early design stages emphasis must be put on the necessity of including data items
– The user shall be able to prevent devices from gathering information about him/her at any time he/she wishes
– For various purposes, such as provision of services and information, exercise of rights, and gathering of data, the system must be able to authenticate who the user is
– Transmission of data to third parties must be kept to a minimum, and the security of data transmission to external entities must comply with legal requirements
– The system shall allow the verification of access rights of third parties, with the possibility for rectification, opposition and cancellation
2.2 Technical Requirements
The most important deliverable of the PERSONA project is the AAL Service Platform. Hence, for defining the technical requirements, which must correspond to the targeted platform, we had to establish a common understanding of such a platform for AAL services. Normally, a platform is understood as a framework on which applications (in this case AAL services) may run. The framework may be very specific and tied to certain hardware and operating systems, or more open by abstracting those levels based on a more generalized runtime environment, such as Java. Apart from the runtime environment, the platform may readily provide services that perform various common functions in order to avoid duplicated effort. For the purpose of interoperability, a platform must also facilitate communication and collaboration between various components, be they application components or those providing the platform services. Ontologies, communication protocols and data exchange formats can be provided as part of such an interoperability framework. AAL systems, as applications of Ambient Intelligence, are highly distributed systems, which increases the importance of the interoperability framework. In order to create a list of requirements based on the above understanding, we thoroughly explored the existing research initiatives in the field of ambient intelligence architectures and middleware implementations at the national, European, and international level. The most important inputs came from the research priorities of ARTEMIS [2] in the fields of "Seamless Connectivity and Middleware" and "Reference Designs and Architectures", the SODAPOP notion of self-organization [12], and even more specific initiatives such as the Jaspis architecture [25]. Some of the more general requirements can be summarized as follows:
• guarantee a high level of flexibility in the distribution of functionalities and facilitate the integration of arbitrary numbers of sensors, actuators, control units, appliances, and applications into the system
• support ad-hoc networking to enable components to immediately act as a node in the networking infrastructure
• support different communication patterns, such as event-based and call-based
• provide service discovery and binding mechanisms: explicit service requests must be resolved and responded to, as far as the corresponding service is available at all
• support service chaining: meaningful continuation, in another component, of (intermediate) results reached / generated by one component, by discovering alternatives and choosing a reasonable next "step"
• provide mechanisms for event aggregation: seamless combination of events generated / captured by different components in order to infer a possibly more significant event
• support parallel processing of events while detecting and resolving / prohibiting possibly conflicting interpretations and actions in favor of more reasonable ones
• provide for service composition and orchestration: serving explicit service requests that cannot be resolved directly, but only by intelligent strategy-planning algorithms that use available simpler services as building blocks to compose the demanded service; to serve the original service request, the composed service must then be executed1
• facilitate explicit user interaction with the system while separating the presentation mechanism from the application logic and supporting multimodality in an ensemble of devices distributed at different locations
• support personalization and context-awareness in all layers of the system and adaptability in the presentation layers
• hide the complexity of utilizing services from the users accessing them (cf. the SET concept of eMobility2)
• provide for identity management, privacy-awareness, and QoS-awareness
• follow a modular design to better support the distribution and extensibility of the system, higher maintainability / dependability, and a greater possibility of reusing existing functionality and composing new functionality
• at the realization level, consider characteristics such as stability, diagnosis efficiency, performance, and the completeness and simplicity of the API
3 Evaluation of Existing Solutions
A declared goal of PERSONA is to avoid re-inventing the wheel in its technical developments, especially by identifying promising conceptual approaches and combining them with the newest technological trends. The development of the platform followed this principle using an appropriate methodology outlined in [3]. Starting with the above-mentioned requirements, over 30 R&D projects with promising solutions in the fields of architectural design and middleware for AmI systems were gathered. After two quick iterations for filtering out those with comparatively less significant outcomes, this list was reduced to six: Oxygen (launched by MIT in 1999), EMBASSI / DynAMITE (1999-2006, two German national research projects introducing the SODAPOP model), RUNES (2004-07, EU-IST project), Amigo (2004-08, EU-IST project), and ASK-IT (2004-08, EU-IST project). The theoretical and practical evaluations (e.g. by testing the provided software toolkits) showed that a combination of the concepts from the SODAPOP model and those developed within Amigo could lead to an improved solution for AmI systems. Section 4 discusses different aspects of the PERSONA solution derived in this way.
1 The latter five requirements collectively facilitate the self-organization of the distributed system in an intelligent way, guaranteeing a higher level of service in comparison with using each of the nodes in a standalone way.
2 The technology platform for mobile and wireless communications in the seventh framework program of the European Union for the funding of research and technological development in Europe – www.emobility.eu.org
In the remainder of this section, an overview of the two major "inputs" (SODAPOP & Amigo) will be given, followed by a discussion of our evaluation results. A system built according to the SODAPOP3 model [11, 12] consists of two types of components: transducers and channels. Transducers read one or more messages during a time interval and map them to one (or more) output message(s). They are the functional units of the system; they may have a memory and accept messages selectively, and they provide for the temporal aggregation of multiple input messages into a single output message. A transducer may attach to multiple channels; channels, in turn, read a single message at a point in time and map it to one or more messages which are delivered to transducers (conceptually, without delay). Channels can be realized within a middleware solution and may be seen as the "cutting points" for distributing a system. They have to accept every message and provide for the spatial distribution of a single event to multiple transducers. Channels accept (and deliver) messages of a certain type t; transducers map messages from a type t1 to a type t2. A system is defined by a set of channels and a set of transducers connecting to channels. There are two types of channels: event-channels, on which messages are posted without expecting any reply, and RPC-channels, on which posted messages are answered with an appropriate response. Events and RPCs are (in general) posted by transducers to channels without specific addressing information: in a dynamic system, a sender can never be sure which receivers are currently able to process a message. It is up to the channel to identify a suitable (set of) receiver(s) as the result of its arbitration. How this process of choosing the receiver set works is defined by the channel's strategy. The channel strategy must ensure that message routing is transparent to transducers, no matter whether the message contains an event, a call or a reply. This has the following effects: a) a coarse-grained self-organization is achieved based on data-flow partitioning through the definition of several channels to which transducers may attach, b) a fine-grained self-organization among functionally similar transducers is achieved by the channel strategy based on a kind of "pattern matching" approach, c) transducer designers are relieved of the burden of dealing with the distribution of functionality and of the task of finding and selecting the appropriate receiver set, and d) the risk of misbehaving transducers that, e.g., always select receivers from the same vendor is minimized. The architecture of Amigo [14, 23] is based on a special view of a logical software layer in computing units, called the middleware layer, that is placed between the platform layer (comprising the operating system and the networking facilities) and the application layer. This way, the middleware is not a single piece of software responsible for ensuring seamless connectivity and interoperability, but exists only logically, comprising all the Amigo platform services that provide common general-purpose functions. On a more abstract level, Amigo identifies the need for context information to be accessible from all three layers, while the application and middleware layers require privacy and security services. An enriched platform layer may provide some support for quality of service (QoS) and information on reusable resources and device capabilities. An enriched middleware layer uses the resource
3 Self-Organizing Data-flow Architectures suPporting Ontology-based problem decomPosition – www.igd.fhg.de/igd-a1/projects/sodapop/sodapop.zip
constraints and device capabilities to provide QoS support and may turn service discovery into a semantic, QoS-aware and context-aware mechanism with support for service composition. Finally, the service description on the application layer can be enriched by a semantic functional specification, a syntactic and semantic QoS specification, and a context specification. The logical view on the middleware layer divides it into two main blocks, namely the base middleware and the "intelligent user services". The former has a core part, called the "interoperable middleware core", that provides for service discovery and interaction while incorporating QoS parameters. On top of this core, diverse sub-components are suggested, such as mobility management, billing, security & privacy, and even content storage and interoperability. The "intelligent user services" comprise context management & notification services, user modeling & profiling, and user interface services. A device capable of being integrated into an Amigo environment is defined as a device that: a) implements (a subset of) the abstract reference architecture employing specific technologies, b) provides some Intelligent User Services, and c) may implement some interoperability methods for service discovery and interaction. A comparison of the two solutions shows the following differences:
• The SODAPOP model provides a message-brokering solution, whereas the Amigo solution is based on object brokering.
• In SODAPOP, the middleware is a piece of software that implements the distributed channels and is responsible for seamless connectivity and interoperability among transducers. This view on the middleware is comparable with the "interoperable middleware core" of Amigo, but the interesting point in SODAPOP is that all the common general-purpose functions to be provided by the integral components of a concrete AmI system use the middleware in the same way as integrable components. This means that even such system services do not enjoy any special privilege and may face competition and be substituted by alternative realizations; a characteristic that guarantees a higher level of system openness.
• Amigo uses modern approaches, such as service orientation and the ontological description and match-making of services. Although SODAPOP also talks about channel ontologies, at the time its two realizations were developed, ontological technologies were not mature enough, so those ideas were never realized.
• Although service orientation was chosen as the fundamental principle in Amigo, it provides the application layer with a special broker for context sources, with special protocols for supporting event-based communication. Similarly, the interaction between human users and the system is handled specifically, e.g. with special protocols for device selection. This is comparable with the definition of several channels in SODAPOP, each responsible for a certain communication "field". However, the SODAPOP approach was evaluated more positively because of its inherent neutrality in the modularization of such communication "fields", which allows for the modular specification of ontologies, protocols and strategies.
• The comprehensive set of system services proposed by Amigo can be used as a good basis for deriving the set of PERSONA platform services.
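To make the SODAPOP notions described above more tangible, the following minimal Java sketch models transducers and the two channel types; all interface and method names here are illustrative assumptions and do not reproduce the actual SODAPOP or PERSONA APIs.

    // Illustrative sketch only (hypothetical names): transducers map messages,
    // channels deliver them to a receiver set chosen by the channel's strategy.
    interface Transducer<I, O> {
      // Temporal aggregation: one or more inputs read over time yield the output(s).
      java.util.List<O> transform(java.util.List<I> inputs);
    }

    interface EventChannel<M> {
      void post(M event);                        // fire-and-forget, no reply expected
      void attach(Transducer<M, ?> transducer);  // the strategy picks the receiver set
    }

    interface RpcChannel<M, R> {
      R call(M request);                         // a posted message is answered with a reply
      void attach(Transducer<M, R> callee);
    }

The point of the sketch is that senders only talk to channels; which attached transducers actually receive a message is decided by the channel strategy, yielding the coarse- and fine-grained self-organization discussed above.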
4 The PERSONA Architectural Design
Based on the analysis presented above, PERSONA chose the following strategy for deriving a reference-architecture for AmI systems:
• rely on the bus4-based architecture of SODAPOP,
• simultaneously adopt the service-orientation and the ontological approach of Amigo in service discovery, match-making, and composition, and still
• examine the set of integral system services to be considered within the architectural model based on proposals made by Amigo and even other solutions.
In the following subsections, we present the results of this process, leading to a proposal for a reference-architecture for AAL spaces.
4.1 The Abstract Physical Architecture
A vision of ambient intelligence is that distributed functionality embedded in appliances, controllers, actuators and sensors should be utilized seamlessly and made available to human users based on natural interaction paradigms. This leads immediately to the intuitive conclusion that an AAL space is an open distributed system that can be modeled as a dynamic ensemble of networked nodes (see figure 2), where each node may be such a networking-enabled physical component as a sensor, an appliance / consumer electronics device, a controller, an actuator, or a full-scale computing device like a PC. As shown in figure 2, we call the gluing software that facilitates the integration of, and the collaboration among, the nodes through support for seamless connectivity and interoperability the middleware; each node must bind the middleware in order to play a role in the ensemble, i.e. to utilize and/or offer functionality. A point to consider here is that, at least in the case of general-purpose computing devices such as PCs, a physical node may host several logical components, each with certain functionalities. So, when zooming in on each node, a question is how such logical components may bind the middleware. Figure 3 shows different scenarios in this regard and in regard to wrapping legacy nodes / components5. It is worth mentioning that node #1 of figure 3 will act as two separate nodes even though it is a single physical node, because each instance of the middleware represents a distinct "node" in the ensemble.
4 The term "communication bus" (or shorter "bus") was preferred to the original SODAPOP term "channel".
5 See section 5 for the PERSONA standard solution for wrapping thin components like sensors.
Fig. 2 An ensemble of networked nodes arranged either with infrastructure (left) or ad-hoc (right)
Fig. 3 Different scenarios of hosting logical components and wrapping legacy nodes / components
The project also made the following decisions in the early design stages:
• PERSONA goes for ad-hoc networking based on existing node discovery solutions, such as UPnP and Bluetooth.
• As, per the definition in [15], middleware is responsible for "hiding the heterogeneity of the various hardware components, operating systems and communication protocols", we decided to rely on the existence of a Java virtual machine providing the needed abstraction.
• In order to facilitate the sharing of the middleware within nodes in such cases as depicted by nodes #2 and #5 in figure 3, the OSGi dynamic module system for Java6 was chosen (see the sketch below for how a component may bind such a shared middleware instance).
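As a rough illustration of this OSGi-based sharing of the middleware within a node, the sketch below shows how a component bundle might look up a shared middleware service; only the OSGi framework API (BundleActivator, BundleContext, ServiceReference) is real, while the service name org.persona.middleware.PersonaMiddleware is a hypothetical placeholder.

    import org.osgi.framework.BundleActivator;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceReference;

    // Hypothetical sketch: a component bundle binding the node's shared middleware instance.
    public class MyComponentActivator implements BundleActivator {

      public void start(BundleContext context) throws Exception {
        // Look up the middleware service published by the middleware bundles on this node
        // (the interface name is assumed for illustration).
        ServiceReference ref =
            context.getServiceReference("org.persona.middleware.PersonaMiddleware");
        if (ref != null) {
          Object middleware = context.getService(ref);
          // ... register this component's roles on the relevant buses via the middleware ...
        }
      }

      public void stop(BundleContext context) throws Exception {
        // Unregister from the buses and release the middleware service here.
      }
    }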
4.2 The Interoperability Framework
According to the SODAPOP model, for the specification of a system it would be sufficient to a) determine the set of its communication buses, b) specify the ontology, protocol, and strategy for each bus, and c) identify the set of components that
6 See www.osgi.org.
connect to them. For a reference-architecture, however, steps 2 and 3 may be taken only partially in order to address a class of systems, which in the PERSONA case is the class of AmI systems for AAL spaces. With this understanding of the targeted reference architecture, it is possible to limit the specification in its second step to the specification of an upper ontology shared in the corresponding class of systems, and in its third step to the identification of only those components that are shared among all instances of the corresponding class of systems as mandatory or recommended components. To be compliant with such a reference-architecture, a concrete system realizing concrete use cases must then employ the shared specifications, enhance the ontology on each bus according to its needs, and add the components that are needed for the realization of its use cases. To initiate the derivation of the targeted architecture, we now refer to another guideline from the beginning of this section, namely the adoption of the SOA-based approach of Amigo. The question that arises is: does service-orientation mean that we should have an abstract view of all functionality as services and consequently define only one bus, say a "service" bus? The answer to this question has already been given in section 3: even Amigo has developed special-purpose brokering frameworks for handling contextual events and user interaction. PERSONA has also decided to abstract all functionality other than that related to the above two topics as a "service" and consequently to provide a set of four communication buses, namely the context, input, output, and service buses. Referring to the SODAPOP model, which provides an event-based class of buses and a call-based one, we defined the first three buses as event-based and the service bus as call-based (cf. figure 4).
Fig. 4 The PERSONA set of communication buses responsible for brokering four broad types of messages
The rationale behind the type of the context and service buses is self-explanatory. In the case of the input and output buses, one may question the sufficiency
of one call-based bus instead of two event-based buses. It seems that, from the viewpoint of the user, an input may be handled like calling a system function, possibly followed by some system output as the corresponding response. Similarly, system output can be viewed as if the system is calling the user to give his/her answer. Although such a solution is not disqualified from our point of view, we chose the other alternative due to the following concerns. In addition to the special timeout handling that comes into play when the user is expected to react to a system output, the fact that the communication between the user and the system is handled by components that bridge the gap between the virtual world of the system and the real world of the user has to be considered. Hence, on both sides of the I/O buses there are virtual components that, in the case of the input bus, are responsible for either capturing user input or processing it in the context of a dialog. Similarly, the components attached to the output bus generate the system output on one side and present it to the user on the other side. Generally, it seems that only components that process user input and / or generate system output in the context of a dialog are able to decide about the direction of the communication, i.e. whether the system is being called by the user or the system is calling the user. Additionally, the strategy for dispatching a captured user input posted to the input bus to the input-processing components interested in that input can be designed in a specific way that differs from the output bus strategy. In the next section we will get back to this issue again. Another discussion point is that, if both the input bus and the context bus are event-based buses that capture some event in the real world, either by sensing or by capturing explicit user intervention, why not merge them into one bus, as was the case in DynAMITE [12]? Here, too, we prefer to separate the two buses in order to have a more modular approach in the development of the bus ontologies, protocols and strategies. The factual concerns for doing so are twofold: explicit user interaction in AmI environments necessitates special treatment because it happens in an openly distributed I/O handling environment with the possibility of merging several modalities while following the user, e.g. from one room to the next; additionally, user input, in contrast to sensory data, may arrive in the context of a dialog. Adding the four identified buses to the simplified physical view of a PERSONA system from the previous section results in the scene depicted by figure 5. Components must be linked with the middleware that realizes the four buses (on the left side of the figure, a node in the ensemble hosts three components that share the same instance of the middleware, connecting to different buses with different roles), and instances of the middleware can find each other and collaborate, resulting in virtually connected communication buses over all of the participating nodes and components. That is, through the cooperation of different instances of the middleware, local pieces of the same bus will find each other and so will be able to cooperate with each other based on strategies specific to each bus. This way, the middleware can hide the physical distribution of functionality within the ensemble. This means that the interoperability framework in PERSONA can be summarized as shown in figure 6.
Fig. 5 The distributed communication buses of PERSONA and the formation of virtually global buses
The communication buses reflect the loose connections needed in a dynamic environment and represent, in a modular way, the need for interface / ontology definitions, protocol specifications for communication, and strategies for "dispatching incoming messages" to an appropriate (set of) receiver(s). A component simply registers to some buses by specifying the role(s) it is going to play on each of them. Possible roles for a registering component on an event bus are publisher and subscriber; possible roles on the service bus are caller (or service client) and callee (or service provider). Based on this terminology, figure 6 furthermore identifies groups of similar functional roles in such an AmI system and their dependencies, without specifying any concrete component. It guarantees the coherence of the system by modeling the basic data flow in AmI systems: the explicit user input through input publishers and the contextual / situational events posted by context publishers trigger dialogs that will finally terminate with explicit system output through output subscribers and / or changes in the environment performed by appropriate service providers. The dialog handlers (formally equivalent to simultaneously playing some of the roles of input and context subscriber, output publisher, and service client) are therefore responsible for the behavior of the whole system. The horizontal arrows attached to the vertical buses symbolize the availability of services and context in the whole system, independent of any layering. For example, the input and output layers may access the s-bus for utilizing transformation services, such as automatic speech recognition or text-to-speech, and the dialog handling layer accesses the s-bus mainly for the procurement of application services to the user.
Fig. 6 The abstract interoperability framework in PERSONA
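The following sketch summarizes the four buses and the roles just introduced; the enum values mirror the text, whereas the BusRegistry interface and its register method are assumptions made purely for illustration and are not part of the PERSONA API.

    // Illustrative only: the four PERSONA buses and the roles components can play on them.
    enum Bus { CONTEXT, INPUT, OUTPUT, SERVICE }        // context/input/output: event-based; service: call-based
    enum Role { PUBLISHER, SUBSCRIBER, CALLER, CALLEE } // event-bus roles vs. service-bus roles

    // Hypothetical registration facade offered by a middleware instance on a node.
    interface BusRegistry {
      void register(Object component, Bus bus, Role role);
    }

    class SimpleDialogHandler {
      void join(BusRegistry registry) {
        // A dialog handler typically combines several roles, as described above:
        registry.register(this, Bus.INPUT, Role.SUBSCRIBER);
        registry.register(this, Bus.CONTEXT, Role.SUBSCRIBER);
        registry.register(this, Bus.OUTPUT, Role.PUBLISHER);
        registry.register(this, Bus.SERVICE, Role.CALLER);
      }
    }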
4.3 The PERSONA Platform Services
A reference-architecture may specify a set of mandatory / recommended components shared in the whole class of systems complying with that architecture. As discussed earlier, the set of the PERSONA platform services is determined not only
by relying on the specific use cases from the PERSONA scenarios but also by checking such results against the platform services in other solutions, like Amigo, ASK-IT, EMBASSI, and RUNES. However, it became clear very quickly that this set should not be equated with the set of all functions that are somehow shared; rather, the number of the related components must be kept manageable, towards the definition of a basic configuration for AAL spaces. Hence, even borderline cases such as contact lists and personal calendars, whose services are widely shared at least in private AAL spaces, are considered as pluggable components in PERSONA. The criteria for considering a component as an integral part of all AAL spaces were therefore twofold: a) the component does something towards producing aggregated added value7, and / or b) it plays a complementary role for the function of the middleware. This way, we came up with the following set of components:
• First of all, an application-independent component is needed that handles the system-wide dialogs and hides the complexity of utilizing the application services from the user (e.g. hierarchical presentation of the services available to the user, supporting some sort of search for services, or handling login dialogs). It should be obvious that this component has to be a dialog handler (in the sense of the functional roles discussed in the previous section); therefore it is called the Dialog Manager. Another important task for the Dialog Manager could be the provision of a mechanism for associating service calls with situations as a means of providing a configurable management of the reactivity of an AmI
7 In a modular and dynamically evolving system, it is essential to be able to utilize the whole potential of the system on a meta-level that does not necessitate any sort of re-compile, re-build, re-deploy, or restart of any other component when new components are added to or old ones are removed from the system.
environment. For this purpose, the Dialog Manager may rely on a configurable repository of rules schematically in the form of "situation → action" (abbreviated as "s[i] → a[j]"); a sketch of such a configuration follows at the end of this section. It must then subscribe to the context bus for all situations s[i] for which it has an associated action a[j] in its repository. The association "s[i] → a[j]" is the heart of controlling system behavior, and hence it is very advantageous to store it in a central configurable repository. An enhanced version of the Dialog Manager can additionally a) initiate dialogs for pro-actively suggesting services that seem to fit the current situation and b) offer services to other components for handling common dialogs with the user, such as ok-cancel, warning and notification dialogs.
• The Context History Entrepôt8 (CHE) gathers the history of all context events in a central repository, not only to fill the gap caused by context publishers that provide no query interface, but also to provide a fallback solution for those that cannot maintain the whole history of the data provided by them. Additionally, it guarantees the essential support to reasoners9 that perform statistical analysis and need context stored over time10. As a singleton component, the CHE takes care of logging every context event that is published on the context bus by specifying a "pass-all" filter when subscribing to the bus. In order to keep the growth of the repository under control, the CHE also implements a deletion policy based on the likelihood of the data being needed further on; the policies consider a time-based threshold as well as the abstraction level of the data. For making the context history available to context consumers, the CHE registers to the service bus as a callee supporting some kinds of context queries, such as full-fledged SPARQL11 queries. The latter is possible just because the CHE relies on such a repository containing data from all possible sources.
• A general-purpose context reasoner, called the Situation Reasoner, uses the database of the CHE and infers new contextual information using the logical power of the RDF query language SPARQL. It stores "situation queries" persistently – as they are not meant as one-time queries that are answered and then forgotten, but must generate related situational events whenever the situation changes – and indexes them based on the context events that must trigger their evaluation. It provides two services on the service bus, one for accepting new situation queries and the other for dropping them. These services are also used by a graphically interactive tool for administrators in order to facilitate the introduction of new relevant situations to the system by providing an overview of existing context
8 For a more detailed discussion of the PERSONA framework for supporting context-awareness, please refer to [8].
9 Context reasoners estimate the state of some context elements by combining different known information and applying certain methods of aggregation, statistical analysis and / or logical deduction.
10 Many of the reasoners that predict context even need long-term histories that pluggable context providers may not be able to maintain, as discussed in [17].
11 SPARQL Protocol and RDF (Resource Description Framework, www.w3.org/RDF) Query Language, specified as part of the Semantic Web technologies in the context of OWL (Web Ontology Language, www.w3.org/2004/OWL/), www.w3.org/2001/sw/DataAccess/.
providers, allowing drag-and-drop interaction using artifacts for accessible context elements, catching logical errors made by the user, and generating the appropriate SPARQL query string, to name a few of its features. The SR takes its name from a modeling theory for situations introduced in [22].
• Services may exist only at a meta-level in terms of "composite" services made by combining actually existing "atomic" services. The Service Orchestrator (SO) is the component in charge of interpreting the metadata describing a composite service and performing the instructions within it. These descriptions are added / removed / modified by a GUI for system administrators. The SO registers the composite services to the bus just as any other service callee would do for its atomic services. This way, whenever a composite service is called on the service bus, the bus will find the SO as the only object that "implements" that service; hence the SO implements the callee interface for handling service requests. At this stage, the SO starts to execute the corresponding composite service by calling the sub-services through its capabilities as a caller until it finishes, and then returns the results to the bus, which will forward them to the original caller. Summarizing the admin tool aspect so far, it is worth mentioning that three repositories must be kept configurable for administrators of AAL spaces: a) the database of the Situation Reasoner regarding "conditions → situation" rules, b) the database of the Dialog Manager regarding "s[i] → a[j]" associations, and c) the database of the SO regarding composite services. This demonstrates the overall power of the PERSONA system regarding the generality, configurability, and adaptability of the solution and also confirms the need, in the PERSONA business model, for service providers that specialize in the installation, configuration, and maintenance of PERSONA-based AAL spaces.
• In order to guarantee the adaptability of an AAL space to the wishes and preferences of its users, it is essential that a special-purpose component is foreseen for the management of the profiles and the provision of the needed shared mechanisms. Almost all of the projects from the STAR analysis suggest such a component as a mandatory one. The analysis of the realization of the PERSONA use cases also confirmed this general assumption. We call this component the Profiling Component.
• The middleware must control the access to services with the help of a component that we call the Privacy-aware Identity & Security Manager (PISM), which is also supposed to act as a service provider. The middleware must verify the legitimacy of components that register to its buses by consulting the PISM. Likewise, trusted services serving anonymous entities must use PISM services in order to decide if the data to be returned can be disclosed. Hence, the main responsibilities of the PISM are: a) management of the entities' identities and credentials, b) management of permissions for accessing "hosted" services, c) providing authentication services, and d) providing a tunable mechanism for deciding on the disclosure of private data.
• In order to facilitate remote access to AAL spaces and, the other way around, to support AAL spaces in notifying an absent native user, as well as to enable
the bridging between AAL spaces and, furthermore, to provide a possibility for external service providers to advertise their services to the occupants of AAL spaces, we suggest employing a special-purpose component called the AAL-Space Gateway. The gateway provides access to the hosted services in the AAL space under a fixed URL. For this purpose, it must act within the AAL space as an input publisher and output subscriber, so that, in case of incoming remote access and after authentication, the remote user can start a dialog with the smart home to access information and services for which he or she has the required access rights. For each of the modalities supported by the gateway, it may register a different input publisher to the input bus (resp. a different output subscriber to the output bus). For example, the gateway can make an SMS notification functionality available to the output bus: if an output event addressed to an absent user is triggered on the output bus, then the SMS output of the gateway may be selected for presenting the output to the user as an SMS. The phone I/O of the gateway can be utilized by the AAL space if the absent user is required to interact with the system rather than just being notified; this way, a voice-based session will be initiated actively by the AAL space. Finally, external entities that would like to advertise their services may use an interface of the gateway provided as a Web Service for this purpose. There must then be some authorization mechanism upon which the gateway can decide to ignore or forward the notification to the user; if the authorization process fails, the message will be ignored. Otherwise, the gateway calls an appropriate service of the Dialog Manager for notifying the user and hence must register to the service bus as a service caller, too.
Figure 7 incorporates the above components into the logical interoperability framework of PERSONA, summarizing the discussions so far. To keep the figure well-arranged, the pluggable components are shown as separate context publishers, special-purpose reasoners, or service clients / providers, although a component may play several of these roles simultaneously. Also, the possibility for them to play the role of a dialog handler is not depicted in this figure at all. Additionally, the PISM is omitted from the scene due to the difficulty of showing its double role as a sub-component of the middleware and as a service provider.
Fig. 7 The PERSONA reference-architecture for AAL spaces
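To illustrate the kind of configuration the three administrator-editable repositories might hold, the sketch below pairs a hypothetical SPARQL situation query (for the Situation Reasoner) with an "s[i] → a[j]" association (for the Dialog Manager); the ontology URIs, the query and the class names are invented for this example and do not reproduce the actual PERSONA ontologies.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of "conditions -> situation" and "situation -> action" entries.
    public class SituationActionExample {

      // A situation query the Situation Reasoner could evaluate against the CHE database
      // whenever matching context events arrive (URIs and properties are invented).
      static final String POSSIBLE_FALL_QUERY =
          "PREFIX ex: <http://example.org/aal#> "
        + "CONSTRUCT { ex:user1 ex:inSituation ex:PossibleFall } "
        + "WHERE { ex:user1 ex:posture ex:LyingOnFloor ; ex:location ex:Kitchen }";

      // The Dialog Manager's repository: situation URI -> service request to be issued.
      static final Map<String, String> SITUATION_TO_ACTION = new HashMap<String, String>();
      static {
        SITUATION_TO_ACTION.put("http://example.org/aal#PossibleFall",
                                "http://example.org/aal#StartEmergencyDialog");
      }
    }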
4.4 The Middleware
The middleware is composed of a set of OSGi bundles organized in three logical layers (cf. figure 8):
• The lowest layer, the abstract connection layer (ACL), is responsible for the peer-to-peer connectivity between instances of the middleware. Different discovery and message transfer protocols, such as UPnP, R-OSGi or Bluetooth, have been used to provide competing solutions that realize an exported interface called P2PConnector (a possible shape of this interface is sketched after figure 8). Listeners can register to such connectors for
discovering peers with a predefined interface. A bridging solution over all available protocols guarantees the overall coherence of the nodes within an ensemble.
• The Sodapop layer implements the peer and listener interfaces from the ACL and registers as the local peer to all connectors found. It introduces the concepts of bus (either event-based or call-based), bus strategy and message, along with an interface for message serialization.
• The PERSONA-specific layer implements the input, output, context, and service buses with their distributed strategies according to the Sodapop protocol, using an RDF serializer for the exchange of messages among peers.
Fig. 8 The internal architecture of the PERSONA middleware
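A possible shape of the ACL contract mentioned in the first bullet above is sketched below; only the name P2PConnector is taken from the text, while the PeerListener type and all method signatures are assumptions about how such a connector could look.

    // Sketch of the abstract connection layer (ACL); names and signatures are assumed.
    interface P2PConnector {
      void addPeerListener(PeerListener listener);     // be notified when peers (dis)appear
      void sendMessage(String peerId, byte[] payload); // protocol-specific message transfer
    }

    interface PeerListener {
      void peerDiscovered(String peerId);
      void peerLost(String peerId);
    }

    // The Sodapop layer would register one listener per available connector
    // (UPnP, R-OSGi, Bluetooth, ...) so that the bridging over all protocols
    // keeps the ensemble coherent.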
The context bus provides a lightweight communication channel which can be used for publishing context events to local and remote end-points. Context publishers must find the OSGi service realizing the context bus and register to it (this is readily done within a corresponding abstract class) in order to be able to publish context events. Context subscribers must additionally specify, as their registration parameters, a filter for the context events they are interested in (see the sketch at the end of this subsection). The context bus strategy is as simple as broadcasting a received event to all peers; each peer then does a local match-making, with limited ontological inference capabilities, to find interested subscribers. Therefore, the registration parameters of the subscribers are stored only locally by each instance of the context bus. Alternatively, one single node could play the role of a permanent coordinator that gathers the registration parameters of all subscribers from all peers; a peer that has received an event from a locally registered context publisher would then forward the event only to the coordinator, which in turn would forward the message only to those peers that had at least one local subscriber interested in that event. Of course, other strategies are also possible, but in the first version we chose the broadcasting strategy to avoid a single point of failure, accepting the relatively increased messaging traffic. On the service bus, callees provide service profiles in OWL-S12 at registration time. Callers request services using "service queries". The service bus finds the appropriate callee(s) by examining its repository of service profiles using the same limited ontological inference as in the case of the context bus, calls it (them) by providing the required input extracted from the original query, gets the output, and prepares it as the response to be returned to the caller. The details of the service bus strategy, however, go beyond the scope of this chapter. Interested readers are invited to read the project deliverable D3.1.1 at www.aal-persona.org. In order to have a clear separation between the application logic and presentation layers of the system, PERSONA distinguishes between handling dialogs and handling I/O:
• A dialog handler is the part of an application that expresses the application output using a device-, modality- and layout-independent language (e.g. XForms, which our experiments so far have found to be a promising means) and publishes it to the output bus together with adaptation parameters fetched from the Profiling Component. It then waits for the user input on the input bus, expecting it at the same abstraction level and in relation to the output previously published to the output bus. The Dialog Manager is a special dialog handler provided by the PERSONA platform.
• Each I/O handler manages a set of I/O channels and subscribes to the output bus by specifying its capabilities, which are used by the output bus in the course of match-making with the adaptation parameters associated with each output event. That is, using the adaptation parameters, the output bus tries to find a best-match I/O handler that receives the content to be presented to the user along with instructions regarding modality and layout derived from the adaptation parameters. The selected I/O handler is then responsible for converting the
12 An OWL-based ontology for describing services. See www.daml.org/services/owl-s/.
application output to the format appropriate for the channel selected in accordance with the received instructions. It then monitors its input channels to catch the related user input. Upon recognizing input, it must convert it to the appropriate format, in accordance with the previously handled application output, and publish it as an event to the input bus.
• I/O handlers are application-independent, pluggable technological solutions that manage their respective I/O channels to concrete devices using one or more of the following alternatives:
– A tight connection using low-level protocols
– A loose connection using device services on the service bus
– A loose connection using contextual events on the context bus
• Context-free user input is handled by the Dialog Manager in terms of service search; if an I/O handler detects user input that has no relation to any previous output, the Dialog Manager receives the input and tries to interpret it as a search for services.
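As a rough illustration of the context-bus usage described at the beginning of this subsection, the sketch below shows a subscriber that registers a filter and reacts to matching events; the ContextSubscriber and ContextEvent classes and their signatures are simplified assumptions, not the actual abstract classes shipped with the middleware.

    // Hypothetical, simplified context subscriber; the real PERSONA classes differ in detail.
    abstract class ContextSubscriber {
      ContextSubscriber(String subjectTypeUri) {
        // The constructor would register this subscriber, together with its filter,
        // to the local context bus; registration parameters are kept locally per node.
      }
      abstract void handleContextEvent(ContextEvent event);
    }

    class ContextEvent {
      String subjectUri;   // e.g. the user or device the event is about
      String predicateUri; // e.g. a "hasLocation" property
      Object value;        // the new value, typically an ontological resource
    }

    class PresenceWatcher extends ContextSubscriber {
      PresenceWatcher() {
        super("http://example.org/aal#User"); // assumed filter: events about users
      }
      void handleContextEvent(ContextEvent event) {
        // React to the event, e.g. trigger a dialog or call a service on the service bus.
      }
    }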
4.5 The Ontological Modeling
Apart from the abstract classes InputPublisher, InputSubscriber, ..., and ServiceCallee shown in figure 8, the interfaces facilitating interoperability in PERSONA are specified in terms of ontological concepts and their counterparts in the Java world. This has been possible because the APIs are defined in such a way that the actual information is exchanged as the content of SODAPOP messages. In order to overcome the complexity of an open distributed system and still enable extensible interoperability, PERSONA has adopted / provided three elementary tools: a) the knowledge representation technologies of the Semantic Web, consisting of RDF and OWL, b) an upper ontology with appropriate programming support, consisting of those concepts that all users of the middleware must know, and c) a general conceptual solution, described in section 5, with certain shared tools for integrating thin devices and embedded sensors and transforming the tapped data into an appropriate ontological representation. Using this framework, two end points that share the same ontological concepts can achieve the needed level of interoperability without the middleware having to know those concrete concepts. Still, the middleware is able to apply ontological reasoning, to some extent, in its brokerage function. Figure 9 summarizes the Java class hierarchy used by the middleware for handling ontological resources. The class ManagedIndividual is the root of all ontology classes that register to the middleware. This way, each instance of the middleware has a repository of the ontological classes that are relevant for the local members of its buses. The repository provides a mapping between class URIs and their Java representation and enables the middleware to infer hierarchical relationships between the registered classes and check class memberships at the Java level. This reveals
the previously mentioned limitations of the ontological reasoning within the middleware: the limited knowledge stored about the underlying ontologies, and match-making only to the extent supported by Java. The reason for realizing the first version of the middleware in this way is to spare the resources needed by the middleware, in order to reach the widest portability by keeping it as small as possible, even at the level of Java 1.3. As a trade-off for these limitations, the middleware supports not only the whole capacity of OWL class expressions but also enhances it by supporting more specific restrictions that can be posed on the properties of classes. The OWL class expressions are used by subscribers to event-based buses as registration parameters for specifying their interest in certain classes of individuals. Having received a concrete event, an event-based bus checks if it matches13 any of the registered class expressions; if yes, the corresponding subscriber will be notified. Service callers also specify their requests in terms of such class expressions, which are used by the service bus in the course of match-making against the registered service profiles.
Fig. 9 The Java class hierarchy used and exported by the PERSONA middleware for handling ontological resources
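To give an impression of what an ontology class rooted in this hierarchy might look like, the sketch below defines a minimal subclass of ManagedIndividual; only the name ManagedIndividual comes from the text, and the shown methods, property and URI are assumptions for illustration.

    // Hypothetical, minimal ontology class registered with a node's local repository.
    abstract class ManagedIndividual {
      abstract String getClassURI(); // maps the Java instance to its OWL class
      // The real class additionally handles RDF properties, restrictions, serialization, etc.
    }

    class LightSource extends ManagedIndividual {
      static final String MY_URI = "http://example.org/aal#LightSource"; // invented URI

      int brightness; // example property; it would be exposed as an RDF property

      String getClassURI() {
        return MY_URI;
      }
    }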
5 Sensing the Environment
The artifacts composing AAL-spaces can be classified into stationary, portable and wearable components. The stationary artifacts run on a desktop PC, Set Top Box,
13 This is where the ontological inferencing comes into play.
Residential Gateway, or they belong to environmental infrastructures like Home Automation sub-systems; portable components are, for instance, medical devices or mobile/smart phones. Garments equipped with sensors are examples of wearable components. The PERSONA middleware targets small but reasonably powerful devices; therefore, not all components can be integrated by using an instance of the PERSONA middleware. Typically, the integration of wearable components as well as nodes of wireless sensor networks (WSN) or Home Automation Systems (HAS) requires a different approach because of their limited computational resources; in such cases a gateway solution can be adopted, which allows the PERSONA application layer to share information by communicating with other application layers resident on different network infrastructures (e.g. ZigBee [4] or Bluetooth applications). Considering the OSGi platform14, it is sufficient to write a bridging component that interacts with the PERSONA middleware on one side, and with the software drivers that access the networked devices on the other side. However, it is worth noticing that many Home Automation Systems have developed their own proprietary solutions to provide external access to the networked devices. Usually it is possible to find off-the-shelf device servers that are themselves gateways, thus exposing the network through standard protocols like TCP/IP or HTTP (e.g. the Siemens EIB-TCP gateway - http://www.eib-home.de/siemens-eib-ip-schnittstelle.htm). In this case an easier integration is achieved through an application proxy that communicates with the remote server rather than with network drivers (see node #3 in figure 3). In the next section, we focus on the integration of wireless sensor networks, generally adopted for sensing the environment, and we introduce the Sensors Abstraction and Integration Layer (SAIL) developed within PERSONA, highlighting different aspects and requirements of sensor network applications. A specific section related to the integration of ZigBee devices concludes the chapter.
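As an illustration of the application-proxy variant (node #3 in figure 3), the sketch below polls a TCP-based device server and forwards the reading towards the context bus; the host, port and payload handling are placeholders and do not reflect the interface of any concrete device server such as the EIB gateway mentioned above.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Socket;

    // Hypothetical proxy: polls a device server and forwards readings as context events.
    public class DeviceServerProxy {

      public void pollOnce(String host, int port) throws Exception {
        try (Socket socket = new Socket(host, port);
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(socket.getInputStream()))) {
          String reading = in.readLine(); // device-server-specific payload (placeholder)
          publishAsContextEvent(reading);
        }
      }

      private void publishAsContextEvent(String reading) {
        // Here the raw value would be mapped to its ontological representation
        // and published on the context bus via the middleware.
      }
    }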
5.1 Wireless Sensor Networks
Wireless Sensor Networks (WSN) [4] are an important technological support for smart environments and ambient assisted applications. Up to now, most applications have been based on ad hoc solutions for WSNs, and efforts to provide uniform and reusable applications are still in their infancy. As general requirements for the use of WSNs, we can consider the following list:
- R1: Integration of various sensor network technologies. AAL-spaces like home environments may be populated by different network technologies typically used in different application domains like Home Automation, Digital Entertainment or Personal Healthcare Systems. As concrete examples we can consider ZigBee or the IEEE 802.15.4 standard, and Bluetooth.
- R2: Sharing of the communication medium by concurrent sensor applications. Sensor applications involving the same network protocol may imply conflicts at the communication level and a waste of resources like energy, bandwidth, and so on.
- R3: Management of different applications on the same WSN. The WSN may support different applications at the same time, for example localization of a number of users, posture detection of users, monitoring of the energy consumption of appliances, etc.
- R4: Management of multiple instances of the same application. Further complications occur when the sensor application is associated with users, so that there are as many instances of the same application as there are users; in this case, the different instances must be discovered and safely integrated into the system.
- R5: Dynamic discovery of sensor applications. Because of the innate dynamism of users' activities, such applications may join and leave the AAL-spaces at any time, according to the user behavior.
- R6: Management of logical sensors. The physical sensors deployed either in the environment or worn by the users could be represented (virtualized) by a different number of logical sensors. For example, consider an application that detects the user's posture by means of a number of accelerometers placed on the body: in this case, the posture can be represented by a single logical sensor, which aggregates the information produced by all the accelerometers to derive the state of the user (sitting, walking, etc.). In this example, the accelerometers may not even be visible to the application layer.
- R7: Configuration and calibration of sensor applications. Many sensor applications need to be configured during the deployment phase, for example to bind the user's metadata to the sensor used to locate him/her.

The requirement R1 is usually satisfied by enabling the dynamic deployment of ad hoc network drivers in the system, while we can assume that requirements R2 to R5 largely depend on the operating systems and middleware used for programming the sensor nodes (e.g. ZigBee [26], TinyOS [24], TinyDB [16], TeenyLIME [6] and others [18]). Unlike requirements R1 to R5, addressing requirement R6 is still an open issue: on the one hand, the sensor nodes should locally pre-process and transmit aggregated data as far as possible in order to reduce power consumption (in this respect, it is worth mentioning SPINE, the Signal Processing in Node Environment open-source framework [10], which provides simple feature extractors directly on TinyOS sensor nodes); on the other hand, arranging such processing on the network is not always possible, due to the limited computational power.

In [5] a declarative language called WADL (Wireless Application Description Language) is used to describe sensor-based applications which combine the sensed data on the host PC side. WADL is based on a producer-consumer pattern constructed according to the OSGi WireAdmin specification: both the producers (sensors) and the consumers are modeled as OSGi services and connected at runtime by "wire" objects. WADL is an XML-based language, which defines a dynamic set of wires that connects different
producers and consumers. This approach is strongly characterized by the assumption that the sensors behave according to a producer-consumer pattern. Although this assumption leads to a simple and elegant solution, it may become too limiting when used to abstract WSNs with more complex behaviors.

In a first attempt to provide a uniform model for sensor applications, we designed the SAIL architecture [9] based on three layers, namely the Access, Abstraction, and Integration layers, constructed over the OSGi framework. Network drivers reside on the access layer for handling the communication protocols specific to each WSN type. The abstraction layer of SAIL defines a shared sensor model over all WSN types integrated through the access layer. The realization of the abstraction layer was based on the Service Provider Interface (SPI) pattern [21] used in different application domains, such as database-oriented applications (e.g. ODBC and JDBC drivers). Each network driver is hereby considered as a service provider hidden by a manager that abstracts the different network models by providing a common API for the upper layer (the integration layer). The sensor model provided in [9] also took into account the data-centric paradigms adopted in many sensor network applications (e.g. the query-oriented TinyDB [16]).

However, smart environments are generally indoor and limited in size; thus the WSN hardly scales up to hundreds of sensors, and the network often has a small diameter; in some cases it may even be a star. Data-centric approaches, on the contrary, have been designed with very general monitoring applications in mind, where the number of sensors can scale (almost) arbitrarily and the sensors are homogeneous in terms of features and monitoring capabilities. For this reason, and because many hardware developers in the PERSONA consortium converged on the ZigBee industry standard, we decided to design an optimized version of SAIL tailored to ZigBee.
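The following sketch illustrates the SPI idea behind the SAIL abstraction layer together with a logical sensor in the sense of requirement R6. All interface and class names are invented for the example and do not correspond to the actual SAIL API; the posture thresholds are likewise purely illustrative.

// Sketch of an SPI-style sensor abstraction in the spirit of SAIL.
import java.util.ArrayList;
import java.util.List;

/** Service Provider Interface implemented by each network driver (ZigBee, HAS, ...). */
interface SensorDriver {
    String technology();                 // e.g. "zigbee"
    List<String> listSensors();          // identifiers of reachable physical sensors
    double read(String sensorId);        // latest sample of a physical sensor
}

/** Manager that hides the concrete drivers behind one common API (abstraction layer). */
class SensorManager {
    private final List<SensorDriver> drivers = new ArrayList<>();
    void register(SensorDriver driver) { drivers.add(driver); }

    double read(String technology, String sensorId) {
        for (SensorDriver d : drivers)
            if (d.technology().equals(technology)) return d.read(sensorId);
        throw new IllegalArgumentException("no driver for " + technology);
    }
}

/** Logical sensor (requirement R6): aggregates several accelerometers into a posture. */
class PostureSensor {
    private final SensorManager manager;
    private final List<String> accelerometerIds;
    PostureSensor(SensorManager m, List<String> ids) { manager = m; accelerometerIds = ids; }

    String posture() {
        double sum = 0;
        for (String id : accelerometerIds) sum += manager.read("zigbee", id);
        double mean = sum / accelerometerIds.size();
        // Purely illustrative threshold -- a real classifier would be far richer.
        return mean > 0.5 ? "walking" : "sitting";
    }
}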
5.2 ZigBee Networks Integration

Maintaining the same three-layered architecture introduced in [9], we developed a ZigBee Base Driver for the SAIL access layer that uses native libraries implementing the ZigBee Application Layer. In fact, while Bluetooth drivers are usually integrated with different operating systems and platforms, the standardization process of the network interfaces for ZigBee is still in progress. Therefore, we chose to adopt USB ZigBee dongles available on the market, which come either with a simple AT-command-like interface or a more elaborate API based on a native driver. The basic operations carried out by the base driver include the following set of functions: a) discovery functions that notify the connection and disconnection of sensors to/from the WSN, b) description functions that retrieve the properties associated with the sensor nodes and the relevant service interfaces, c) control functions for setting the state of the actuators embedded in the sensor nodes, d) access functions that provide read access to the sensors' transducers, and e) eventing functions that enable the subscription to data updates and that notify changes.
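The five function groups can be pictured as a driver interface along the following lines; this is a hedged sketch with invented names, not the actual PERSONA or SAIL API.

// Illustrative Java interface mirroring the five function groups of the base driver.
import java.util.List;
import java.util.Map;

interface ZigBeeBaseDriver {
    // a) discovery: be informed when sensor nodes join or leave the network
    void addDiscoveryListener(DiscoveryListener listener);

    // b) description: properties and service interfaces (ClusterIDs) of a node
    Map<String, String> describe(String nodeId);
    List<Integer> clusterIds(String nodeId);

    // c) control: set the state of actuators embedded in a sensor node
    void setActuator(String nodeId, int clusterId, byte[] command);

    // d) access: read the transducers of a sensor node
    byte[] readAttribute(String nodeId, int clusterId, int attributeId);

    // e) eventing: subscribe to data updates (reporting)
    void subscribe(String nodeId, int clusterId, ReportListener listener);
}

interface DiscoveryListener {
    void nodeJoined(String nodeId);
    void nodeLeft(String nodeId);
}

interface ReportListener {
    void reported(String nodeId, int clusterId, byte[] payload);
}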
The implementation of these functionalities using the ZigBee Application Layer is straightforward; the ZigBee standard allows both device and service discovery, as well as description requests and reporting commands (eventing). However, the implementation of these functionalities is not always mandatory, which often makes the full integration of the network somewhat complex. For example, in the discovery phase, the implementation of a broadcast announcement of the devices that join or leave the network is optional. For describing devices, ZigBee does not employ the descriptive approach of protocols like UPnP; rather, a device responds to a description request by replying with a reference (ClusterID) to the implemented interfaces. This implies that the consumer of a service must know to which message structure (cluster format) the ClusterID refers. The possibility of answering with an additional description in CompactXML does exist (e.g. <deviceUrl>), but even this feature is optional.

However, the ZigBee Alliance, the standardization body defining ZigBee, publishes application profiles that allow multiple OEM vendors (original equipment manufacturers, i.e. companies that use a component made by a second company in their own product, or sell the product of the second company under their own brand) to create interoperable products. The current list of application profiles, already published or still in progress, consists of Home Automation, ZigBee Smart Energy, Telecommunication Applications, and Personal Home and Hospital Care. In general, these profiles use a combination of clusters defined in the ZigBee Cluster Library. Vendors can define their custom library for devices not defined by the ZigBee Alliance. Accordingly, the PERSONA project is now working on the definition of a PERSONA Cluster Library and Profile, in order to integrate the personal healthcare system and special devices like the smart glove or the smart carpet that will be developed by the consortium partners.

A diagram of the current SAIL architecture is depicted in Figure 10. The ZigBee Base Driver is a network driver in charge of scanning the network, getting the description of the various nodes, and registering a proxy service for accessing the discovered remote services. This proxy is a generic ZigBee service, registered with the OSGi platform, which exposes the properties retrieved during the network inquiry. It allows access to the remote service by means of simple primitives that have to be filled with appropriate clusters. Thus, in contrast with the more generic model used by the previous version of SAIL, we have defined a specialized model tailored to an extension of the ZigBee profiles, namely the PERSONA profiles. The components on the upper layers may act as Refinement Drivers (in OSGi terms): by using the OSGi framework facilities, as soon as a service implementing a known cluster is registered, these drivers refine the service by registering another service proxy with a more detailed interface (e.g. action/command based). Thus the second layer is specialized to represent the service according to a specific profile, for instance the Home Automation profile. The upper layer is the final step for integrating the ZigBee services within PERSONA. It is composed of Sensor Technology Exporters (STE), which discover the services implementing standard or extended profiles
and register proxies that are PERSONA-aware components (the same approach can be used to expose the services according to other access technologies, e.g. UPnP). The mapping between services compliant with the ZigBee model and PERSONA OSGi services is realized at this level; these proxies send events or register service callees according to the PERSONA ontologies. In conclusion, the abstraction layer is populated by custom drivers which may combine and process the sensed data to instantiate logical sensor services (see R6; e.g. a sensor providing the user's position by elaborating RSSI (Received Signal Strength Indication) measurements coming from different stationary sensors), as well as refine the cluster-based services. Of course, such an elaboration could also be realized at the PERSONA application level, but, from a practical point of view, the realization at lower layers is expected to be more efficient. In fact, we even plan to implement the user interfaces addressing the last requirement introduced in the previous section (R7) at this layer, in order to provide a uniform way of configuring sensor applications.
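The mapping performed by such an exporter can be pictured as follows; the context-event classes, the ontology URI and the payload decoding are assumptions for the sake of the example (the decoding follows the common convention of a 16-bit value in hundredths of a degree, which a real exporter would take from the relevant cluster specification).

// Hedged sketch of a Sensor Technology Exporter translating a cluster-level
// report into a PERSONA-style context event; all names are illustrative.
class ContextEvent {
    final String subjectURI;
    final String propertyURI;
    final Object value;
    ContextEvent(String subjectURI, String propertyURI, Object value) {
        this.subjectURI = subjectURI;
        this.propertyURI = propertyURI;
        this.value = value;
    }
}

interface ContextBus {
    void publish(ContextEvent event);
}

class TemperatureExporter {
    private final ContextBus bus;
    TemperatureExporter(ContextBus bus) { this.bus = bus; }

    /** Would be wired to the base driver's eventing functions (see the sketch above). */
    void onReport(String nodeId, byte[] payload) {
        if (payload.length < 2) return;             // ignore malformed reports
        // Assumed payload layout: 16-bit value in hundredths of a degree Celsius.
        int raw = ((payload[0] & 0xFF) << 8) | (payload[1] & 0xFF);
        double celsius = raw / 100.0;
        bus.publish(new ContextEvent(
                "urn:device:" + nodeId,                              // illustrative subject URI
                "http://ontology.persona.example/hasTemperature",    // illustrative property URI
                Double.valueOf(celsius)));
    }
}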
Fig. 10 The SAIL layered architecture
6 Conclusions

The AAL service platform developed within the PERSONA project and introduced in this chapter is expected to facilitate the gradual evolution of AAL spaces based on a very compact core. To reach this goal, AAL spaces are modeled in PERSONA as Open Distributed Systems (see [8] for our understanding of ODS). The resulting platform allows for self-organization of both physical nodes and logical components by providing a middleware that supports seamless connectivity and semantic interoperability. For providing aggregated added value, the solution relies on the administrative re-configurability of platform components, such as the Situation Reasoner, the Dialog Manager, and the Service Orchestrator. On the basis of the achieved re-configurability, future research can work in parallel on both simplifying the configuration task towards end-user programming and developing intelligent algorithms towards automatic re-configuration.

The next steps in the project plan foresee a thorough evaluation of the technological results during and after the deployment of the system to the three PERSONA pilot sites in Denmark, Italy, and Spain. We have already started to define the evaluation criteria, methodology, procedures, and tools. The rising awareness in the research community of the importance and difficulties of evaluating AmI systems (see, for example, the summary provided in [20]) shows that the evaluation phase will be one of the most challenging tasks of the project in the near future.

Further research planned by the authors of this chapter relates to breaking out of the boundaries of the home environment towards societal systems, as suggested in [19]. The most intriguing topics in relation to the PERSONA approach are the ad-hoc formation of temporary spaces and the related security concerns, rapid configuration changes in public spaces, and interoperability among different spaces.
References

[1] AAL: The Ambient Assisted Living Joint Programme – www.aal-europe.eu (2007)
[2] ARTEMIS SRA Working Group: The ARTEMIS Strategic Research Agenda. Research Priorities, IST Advanced Research and Technology for Embedded Intelligence in Systems, www.artemis-office.org/DotNetNuke/SRA/tabid/60/Default.aspx (2006)
[3] Avatangelou, E., Dommarco, R.F., Klein, M., Müller, S., Nielsen, C.F., Soriano, M.P.S., Schmidt, A., Tazari, M.R., Wichert, R.: Conjoint PERSONA – SOPRANO Workshop. In: Constructing Ambient Intelligence – AmI 2007 Workshops, pp. 448–464. Springer CCIS Series, Darmstadt, Germany (2007)
[4] Baronti, P., Pillai, P., Chook, V.W.C., Chessa, S., Gotta, A., Hu, Y.F.: Wireless sensor networks: A survey on the state of the art and the 802.15.4 and ZigBee standards. Computer Communications 30(7), 1655–1695 (2007)
[5] Cervantes, H., Donsez, D., Touseau, L.: An Architecture Description Language for Dynamic Sensor-Based Applications. In: Proceedings of the Consumer Communications and Networking Conference (CCNC 2008), pp. 147–151. IEEE, Las Vegas, Nevada, USA (2008)
[6] Costa, P., Mottola, L., Murphy, A.L., Picco, G.P.: Programming Wireless Sensor Networks with the TeenyLIME Middleware. In: Proceedings of the 8th ACM/IFIP/USENIX International Middleware Conference (Middleware 2007). Newport Beach, CA, USA (2007)
[7] Ducatel, K., Bogdanowicz, M., Scapolo, F., Leijten, J., Burgelman, J.C.: Scenarios for Ambient Intelligence in 2010. ISTAG Report, European Commission – IST Advisory Group, ftp.cordis.europa.eu/pub/ist/docs/istagscenarios2010.pdf (2001)
[8] Fides-Valero, A., Freddi, M., Furfari, F., Tazari, M.R.: The PERSONA Framework for Supporting Context-Awareness in Open Distributed Systems. In: To appear in the proceedings of the European conference on Ambient Intelligence – AmI-08. Nuremberg, Germany (2008)
[9] Girolami, M., Lenzi, S., Furfari, F., Chessa, S.: SAIL: a Sensor Abstraction and Integration Layer for Context Aware Architectures. In: Proceedings of the 34th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA 2008), pp. 374–381. IEEE, Parma, Italy (2008)
[10] Gravina, R., Guerrieri, A., et al.: SPINE (Signal Processing in Node Environment) framework for healthcare monitoring applications in Body Sensor Networks. White paper, TILAB, spine.tilab.com/papers/2008/DemoSPINE.pdf (2008)
[11] Heider, T., Kirste, T.: Architecture considerations for interoperable multimodal assistant systems. In: Proceedings of the 9th International Workshop on Design, Specification, and Verification of Interactive Systems, pp. 403–417. Rostock, Germany (2002)
[12] Hellenschmidt, M., Kirste, T.: SODAPOP: A Software Infrastructure Supporting Self-Organization in Intelligent Environments. In: Proceedings of INDIN '04: the 2nd IEEE International Conference on Industrial Informatics, pp. 479–486. IEEE, Berlin, Germany (2004)
[13] Ambient Intelligence: from vision to reality. ISTAG Report, European Commission – IST Advisory Group, cordis.europa.eu/ist/istag-reports.htm (2003)
[14] Janse, M., Vink, P., Georgantas, N.: Amigo Architecture: Service Oriented Architecture for Intelligent Future In-Home Networks. In: Constructing Ambient Intelligence – AmI 2007 Workshops, pp. 371–378. Springer CCIS Series, Darmstadt, Germany (2007)
[15] Krakowiak, S.: Middleware Architecture with Patterns and Frameworks. INRIA Rhône-Alpes, France, sardes.inrialpes.fr/~krakowia/MW-Book/ (2007)
[16] Madden, S.R., Franklin, M.J., Hellerstein, J.M., Hong, W.: TinyDB: an acquisitional query processing system for sensor networks. ACM Trans. Database Syst. 30(1), 122–173 (2005). DOI http://dx.doi.org/10.1145/1061318.1061322
[17] Mayrhofer, R.: Context Prediction based on Context Histories: Expected Benefits, Issues and Current State-of-the-Art. In: Proceedings of ECHISE '05: 1st International Workshop on Exploiting Context Histories in Smart Environments. Munich, Germany (2005)
[18] Molla, M.M., Ahamed, S.: A survey of middleware for sensor network and challenges. In: Proceedings of the International Conference on Parallel Processing Workshop (ICPP 2006) (2006)
[19] Nakashima, H.: Cyber Assist Project for Ambient Intelligence. In: J.C. Augusto, D. Shapiro (eds.) Advances in Ambient Intelligence, pp. 1–20. IOS Press (2007)
[20] Neely, S., Stevenson, G., Kray, C., Mulder, I., Connelly, K., Siek, K.A.: Evaluating pervasive and ubiquitous systems. In: J. Hong (ed.) The "Conferences" column of PERVASIVE Computing 7(3), 85–89. IEEE CS (2008)
[21] Seacord, R.C.: Replaceable Components and the Service Provider Interface. In: ICCBSS '02: Proceedings of the First International Conference on COTS-Based Software Systems, pp. 222–233. Springer-Verlag, London, UK (2002)
[22] Tazari, M.R., Grimm, M.: D11 – Report on Context-Awareness and Knowledge Representation. Public deliverable, MUMMY (IST-2001-37365), www.mummy-project.org/downloads.html (2004)
[23] Thomson, G., Sacchetti, D., Bromberg, Y.D., Parra, J., Georgantas, N., Issarny, V.: Amigo Interoperability Framework: Dynamically Integrating Heterogeneous Devices and Services. In: Constructing Ambient Intelligence – AmI 2007 Workshops, pp. 421–425. Springer CCIS Series, Darmstadt, Germany (2007)
[24] TinyOS Project: www.tinyos.net (2008)
[25] Turunen, M.: Jaspis – A Spoken Dialogue Architecture and its Applications. Ph.D. thesis, University of Tampere, Department of Computer Sciences, Finland (2004)
[26] ZigBee Alliance: ZigBee specifications – www.zigbee.org (2006)
ALADIN - a Magic Lamp for the Elderly?

Edith Maier and Guido Kempter
1 ALADIN - Magic Lighting for the Elderly?

Like Aladdin in the medieval oriental folk-tale, the assistive lighting system developed by ALADIN (Ambient Lighting Assistance for an Ageing Population), a research project co-financed by the European Commission, is expected to bring enchantment to people's lives. This will be achieved not by magic and genies, however, but by exploiting our knowledge about the impact of lighting. Adaptive lighting can contribute considerably to sound sleep and a regular sleep-wake cycle regulated by people's 'inner clock'. This cycle tends to deteriorate with ageing, but is essential for preserving and enhancing comfort and wellbeing, and this is the main goal of the assistive ALADIN lighting system.

So-called 'ambient lighting' with varying color, temperature and brightness has been in use for some time. However, the user has no possibility to interact with the predefined control strategy, and such lighting solutions do not take individual differences into account [3]. The intelligent control system of ALADIN, in contrast, is capable of

• capturing and analysing the individual and situational differences of the psycho-physiological effects of lighting, and
• enabling the users to make adaptations tailored to their specific needs.

Whereas currently most cognitive assessment is done in a clinical setting [15], we use sensor-based monitoring combined with adaptive algorithms to assess people's level of functioning in a continuous way. Besides, we have taken the results obtained
in the lab tests to real-life settings and validated them in extensive field trials lasting three months. The empirical data gathered in the private households of twelve older adults can help us refine the prototype both in terms of hardware and software. We intend to make the results and findings publicly available and thus contribute to the growing pool of knowledge on how to make technology more ageing-friendly.

We start out by giving a brief overview of the impact of light on people's wellbeing and health, especially that of the elderly, and clarify the concept of "wellbeing". We discuss the results of the empirical survey conducted in South Tyrol and the theoretical framework that underlies the emphasis on qualitative methods and on ensuring societal acceptance. We then present the results of the lab tests, which were conducted to define the psycho-physiological target values for both relaxation and activation and to identify the light parameters relevant for these purposes. The next section is devoted to a description of the ALADIN prototype, the modern version of the enchanted lamp. This comprises the hardware prerequisites and the technical architecture as well as the software architecture, the different applications and the algorithms developed for adaptive control. The preliminary results obtained in the course of the field trials carried out in private households are then discussed; given the vast amount of empirical data, the analysis is not yet complete. Finally, we explore the possibilities for exploiting the findings of ALADIN and discuss various application scenarios.
1.1 The Impact of Lighting on People's Health and Wellbeing

Light affects the elderly mainly in terms of sleep quality, changes of mood and cognitive performance as well as the metabolic system. The negative impact of poorly designed or maintained indoor lighting, resulting in unbalanced luminances in the field of view, glare, wrong intensities and light colors (spectra), flicker, etc., has been well documented [8, 16, 18]. These problems intensify existing vision problems, add to eye fatigue and headaches and contribute to a loss of concentration.

Chronic sleep disorders are among the most widespread complaints in old age. A sound sleep-wake cycle enables people to lead an active social life and participate in social and leisure activities. Besides, with advanced age, many people suffer from reduced mobility and are thus limited in their outdoor activities or may even become home-bound. A lack of daylight can be counteracted by increased illumination to prevent seasonal depression or sleep disturbances. The results of studies of nursing-home patients and hospitalized elderly patients suggest that increased daytime light exposure, measured by duration and intensity, has an impact on night-time sleep quality and consolidation [17]. We also have empirical laboratory evidence for the superiority of so-called "dynamic lighting", which provides light output varying over time, for visually demanding activity [1]. As natural lighting is almost always dynamic, there are probably evolutionary mechanisms in humans that make them prefer a certain dynamism of lighting.
Impaired vision is one of the most common problems we face with advancing age, usually caused by physiological and pathological changes in the eye such as cataracts or long-sightedness. The diameter of the pupil, which controls the amount of light entering the eye, has been found to decrease with age. This relative reduction in diameter is even more prominent at low light levels, and it is compounded by the older lens becoming less transparent. The effect of 'normal' ageing can be further exacerbated by various other damages to the retina due to diabetes, the incidence of which also increases with advancing age. To compensate for diminished vision, the level of illumination has to be increased by up to three times for older people, which makes higher lighting standards in older people's homes necessary, e.g. through increasing light levels, improving color perception and increasing contrast.

Despite the abundance of studies on the effects of lighting, many questions remain open. Among the lacunae in our knowledge are the amount of light required, the most effective way of administering light, whether there is any adaptation effect as there is in the visual system, the effect of different spectra, and the influence of the timing of light exposure [3].
1.2 Wellbeing - a Multifaceted Concept

Both the empirical and the desk research about health and wellbeing in advanced age have shown that wellbeing is influenced by a great variety of factors [12]. Leading an autonomous life in one's own home and being capable of performing one's daily activities more or less independently have emerged as crucial factors. Besides, the successful handling of strains and crises and a system of medical care and social support adapted to the individual's needs have to be considered essential prerequisites for achieving and/or preserving a sense of wellbeing among older adults. The advice and support application, which is an important component of the ALADIN system, is aimed at:

• The preservation of an active, independent lifestyle
• The preservation of physical and mental fitness
• The avoidance of physical and psychological diseases
• The maintenance of an adequate supportive system
Our approach to health and wellbeing comprises five dimensions, namely mobility and physical fitness, communication, infrastructure (esp. the home environment), competences and skills, as well as mental health. Most recommendations given in the advice application address several of the above-mentioned dimensions of health, which should be regarded as a complex system whose components are interdependent. It was decided to include biofeedback as a component of the ALADIN system although this was not foreseen in the original project plan. Biofeedback can be considered a useful therapeutic method for treating a multitude of diseases.
Within the scope of ALADIN, biofeedback is regarded as a promising method for mental and physical relaxation. It makes subtle bodily functions perceptible to the user, who receives feedback through acoustic or visual channels. The learning process is reinforced by immediate feedback.
2 ALADIN's Journey

2.1 Theoretical and Methodological Underpinnings

Getting to know one's target group is essential for the future acceptance and use of any assistive system. To find out what users really need and want, it is not sufficient to rely on published data on demographic change and concomitant trends, although these may provide a useful starting point. Instead, empirical research using mostly qualitative and ethnographic methods such as interviews and observation is called for. Also, the lessons learned from previous AmI projects have to be taken into account, such as MavHome [4], the Cyber Assist Project for Ambient Intelligence [14] or the Philips Home Lab [5]. Our approach is very much in line with Philips' vision of ambient intelligence (see http://www.research.philips.com/technologies/projects/ambintel.html), namely:

People living easily in digital environments in which the electronics are sensitive to people's needs, personalized to their requirements, anticipatory of their behavior and responsive to their presence.
A further cornerstone of our approach are the scenarios outlined by the ISTAG Group, according to which AmI is a pervasive and unobtrusive intelligence in the surrounding environment supporting the activities and interactions of the users [9]. The report has identified different characteristics that contribute to the societal acceptance of AmI, namely:

• Facilitate human contact
• Be orientated towards community and cultural enhancement
• Help to build knowledge and skills for work, better quality of work, citizenship and consumer choice
• Inspire trust and confidence
• Be consistent with long-term sustainability at personal, societal and environmental levels
• Be controllable by ordinary people

These ideas lie at the core of our endeavors. Throughout our project we have also kept in mind the barriers that may hinder the acceptance of the AmI approach. When trying to outline an AmI roadmap, the following end-user attitudes have been found to slow down the take-up of assistive systems [6]:

• People do not accept everything that is technologically possible and available.
• People need resources/capabilities to buy and use technologies (money, time, skills, attitudes, language, etc.) that are not evenly distributed in society.
• People make use of new technologies in ways that are very different from the uses intended by suppliers.

The empirical study conducted at the outset of our project to define user requirements, as well as the feedback obtained in the course of the field trials, very much confirm the importance of both the focus on activity support and the contributory factors for user acceptance outlined above.
2.2 Analyzing User Requirements

In the first few months of our project, which started in January 2007, APOLLIS, an Italian social research organization, carried out 196 face-to-face interviews in the Autonomous Province of Bolzano. The reason for choosing this region was its cultural mix of German, Italian and Ladin speakers. The main goal was to describe typical every-day activities of people aged 65+ in order to model one-room/one-person use scenarios for the ALADIN prototype. Not surprisingly, watching TV emerged as the most wide-spread activity in virtually all the surveyed households. Mentally or visually more demanding tasks such as reading, handicrafts, or solving crossword puzzles are much more diffuse and difficult to standardize. Besides, the same activity, e.g. cooking a meal, may be stressful for one person whilst another person may find it relaxing. The consortium therefore decided to concentrate on activation/relaxation effects rather than on supporting specific tasks. At the same time, APOLLIS surveyed user requirements concerning the different components of the ALADIN system, like lighting devices, user interface, and body sensors, and investigated people's attitudes to new technologies in general. The interview data were supplemented by the interviewers' personal observations and photo documentation of people's housing conditions.

The analysis of the detailed data shows that the population aged 65 and over is very heterogeneous: most people are still active; they may be working (e.g. farmers) or engaged in volunteer work, they travel, and they care for younger or older family members. Contrary to wide-spread belief, big multi-generational families turned out not to be a relic of the past: although family members are mostly located in different households due to higher mobility, it is encouraging to learn that up to the age of 75 elderly people tend to be givers rather than receivers of social support in the family and neighborhood network. We have to look at those of very advanced age (e.g. 75+) to find a notable prevalence of age-related disorders, mostly chronic conditions such as rheumatic pains, diabetes or dementia. The "older elderly" with limited mobility, restricted access to daylight and a reduced social environment represent an important target group for ALADIN and other assistive systems. However, given the focus of the project and the overall AAL programme on autonomous living and private households, the majority of people recruited for testing the prototype and
its various components belonged to the active and healthy segment of the older population.

The ability of elderly people to imagine the benefits of new technologies for independent ageing proved rather limited. This, of course, may well change as future cohorts become familiar with such technologies during their working lives. Typical reactions are that technology is useless or too complicated, that it creates dependency or that it threatens to reduce social contact. User expectations towards technology are therefore difficult to ascertain. Overall, the following conclusions can be drawn with a view to increasing future acceptance:

• Light devices must correspond to the sensitivity and limited adaptability of the elderly's eyes. Glare effects and fast light changes should be avoided.
• The whole installation must either be a 'camouflaged' supplement to the existing infrastructure or correspond to individual aesthetic preferences.
• Since the motivation to use sensors is mainly correlated with safety matters (e.g. automatic emergency call devices), safety concerns have to be taken into account.
• As long as we have to deal with a high share of computer non-users within the target group, maximum attention has to be paid to the design of the user interface.
2.3 Tests in a Laboratory Setting

A series of laboratory tests were conducted to determine the psycho-physiological target values for both relaxation and mental activation and to examine the impact of various light parameters on older adults. Prior to the lab tests, we defined the criteria for our test population. The elderly person should live on his or her own, spend most of his or her time in the apartment or house, and regularly watch TV, read newspapers, or perform some pastime activities in a particular room. The daylight situation in this room should ideally be such that the person normally needs artificial light for a specific period of time to perform daily routines. On the one hand, the test persons should be mentally fit enough to make active use of the ALADIN system; on the other hand, they should suffer from certain constraints such as limited mobility that prevent them from engaging in frequent outdoor activity in full sunlight. Exclusion criteria included serious cardiovascular illnesses and serious neurological or psychiatric illnesses such as depression, epilepsy, severe dementia or serious diabetes. With regard to visual disturbances, a serious state of cataract or glaucoma was ruled out. Besides, the excessive consumption of alcohol or caffeine was considered a reason for exclusion, since this might distort the results. As was to be expected, recruiting suitable test persons was quite challenging.
2.3.1 Investigating the Impact of Different Light Parameters

The Bartenbach Light Laboratory, a well-known lighting design company located near Innsbruck, Austria, carried out tests for defining the psycho-physiological
effects of various light parameters such as light intensity and color temperature, and tried to establish correlations between light parameters as independent variables and physiological biosignals (blood pressure, heart rate, respiration rate, skin conductance and muscle tension), cognitive performance, and subjective perception rated by questionnaires. The tests at the Bartenbach Light Laboratory pursued two objectives:

• Investigate the effects of different lighting situations on older adults in terms of psycho-physiological impact, subjective perceptions and performance.
• Produce recommendations for the prototype lighting system to be installed for the field tests.

Fifteen test persons, 9 female and 6 male, participated in two test runs each. The average age was 69 (min. 65, max. 82). The data was collected within a time-span of approximately 2.5 hours by pair-wise random selection of bright and dim light situations. The psychological activation of the elderly people was quantified as the difference in correct answers and reaction times in the continuous performance test: the test was first taken in standard lighting and repeated after five minutes' exposure to the test lighting. The psychological relaxation of the elderly people was measured by collecting blood pressure data every minute and calculating a relaxation parameter from the time series of systolic blood pressure values.

Overall, the different lighting situations showed a clear impact on older adults both in terms of activation and relaxation. The effects of lighting could be shown in:

• the test persons' subjective assessments as reflected in the questionnaires,
• the performance tests which measured the test persons' ability to concentrate, and
• the heart rate variability analyses.

Bright light produced the most significant effects in the performance tests. The relaxing effects of all the dim light situations could be shown in the heart rate variability (HRV) analyses, especially in the very low, high and middle heart rate frequencies. Based on these results we can conclude that bright light of at least 500 cd/m2 and 6500 Kelvin is needed to achieve activation in older test persons. In practical terms this means that the lighting system should aim at providing a color temperature between 3000 and 8000 Kelvin and a brightness level of max. 750 cd/m2. The adaptive lighting system therefore has to be dimmable in the range from 0 to 750 cd/m2.
2.3.2 Defining Psycho-Physiological Target Values

Our Hungarian partner, the Budapest University of Technology and Economics, was responsible for defining the optimal psycho-physiological target values for the successful accomplishment of mentally or visually demanding (VDT) every-day activities. Target values in electrophysiological measurements are the values which represent the desired state of alertness or relaxation. They need not be fixed numeric
values, but can be desired trends (increasing or decreasing) or local extremes (maxima or minima). The psycho-physiological target value indicates the psychological state associated with success or failure in a well-defined situation. The intention was to operationalize mental and physical fitness in terms of alertness or relaxation by means of measurable indicators such as blood pressure, heart rate, respiration rate, skin conductance or ECG.

The final sample consisted of 30 test persons, 24 women and 6 men. The low number of males was due to the poorer health status of men as well as their low representation in this age group; this was a problem we encountered when recruiting test persons in other countries as well, although not to quite the same extent. The age range was between 65 and 84, with an average of 71 years. The total time of data collection per person was approximately 4 hours, between 8.00 a.m. and 12.00 noon. Heart rate, skin conductance, muscle tone as well as respiratory amplitude were measured. The laboratory experiments were conducted under the same lighting conditions as those in the Bartenbach Light Laboratory.

Skin conductance response (SCR) turned out to be the most appropriate psycho-physiological target value for the envisaged automatic lighting adaptation, i.e. the intelligent control loop. This parameter proved to be the most sensitive across different activities, since a very clear and practically important relationship with activation and relaxation levels was found. Moreover, this relationship is not only statistically significant but also shows very high absolute differences in pair-wise comparisons. As a further useful benefit, skin conductance response shows no habituation effect. Heart rate emerged as the second best psycho-physiological target value: it corresponded significantly with the highly activating and relaxing situations, but the absolute differences were very small. None of the other psycho-physiological parameters proved to be sensitive and/or specific enough, due to very high individual differences.
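Since target values may be expressed as desired trends rather than fixed numbers, a control loop only needs to decide whether the monitored signal is moving in the desired direction. The following is a minimal sketch of such a trend check over a window of SCR samples; the window handling and thresholds are illustrative assumptions, not values taken from the ALADIN study.

// Minimal sketch of a target value expressed as a desired trend: relaxation is
// taken to be a decreasing skin conductance response (SCR) over the last window.
class TrendTarget {
    /** Least-squares slope of equally spaced samples (window assumed to have >= 2 samples). */
    static double slope(double[] samples) {
        int n = samples.length;
        double meanX = (n - 1) / 2.0, meanY = 0;
        for (double s : samples) meanY += s;
        meanY /= n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (samples[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    /** Relaxation target reached if SCR is clearly decreasing. */
    static boolean relaxationReached(double[] scrWindow) {
        return slope(scrWindow) < -0.01;   // illustrative threshold
    }

    /** Activation target reached if SCR is clearly increasing. */
    static boolean activationReached(double[] scrWindow) {
        return slope(scrWindow) > 0.01;    // illustrative threshold
    }
}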
2.4 User Testing

In line with our participatory and iterative development approach, end-user testing takes place continuously. Given our specific target group, i.e. people with various impairments such as diminished vision, the user interface of ALADIN has to conform not only to general usability guidelines, e.g. ISO 9241, but also to accessibility guidelines. We designed a simple user interface in line with the widely accepted accessibility guidelines of WCAG 2.0 (published as a W3C Recommendation on 11 December 2008). Throughout the project we carried out expert reviews and end-user tests. One result of this iterative testing is that the software development has led to a very flat user interface hierarchy.

As mentioned before, in the course of the user requirements analysis it emerged that a TV set could be found in almost all households of older adults and that it could therefore be assumed that our target users would be familiar with handling remote
controls. However, in their article "New Trends in the Living Room and Beyond: Results from Ethnographic Studies Using Creative and Playful Probing", Bernhaupt et al. point out that people still want the remote control as an input device, although it is perceived as too complex and difficult to use. They would prefer a universal remote control with only one button which is integrated with a display to inform users about what they have to do next [2]. Given the more or less universal presence of TV in the home environment, the adaptive lighting system was implemented on a computer with added TV functionality.

Interactive TV poses special challenges, since watching TV is normally characterized by a "lean back" attitude whereas working on a computer is normally associated with a so-called "lean forward" attitude. The human-computer interface has been conceived with an active user in mind, whereas TV systems are traditionally aimed at passive consumers. A TV screen is normally placed at a distance of a couple of meters from the viewer and has a lower resolution than a computer monitor. When designing the screen layout we cannot use the full screen because TV sets vary in terms of the screen size used for presentation. The so-called "safe area" can be reckoned to be about 10% smaller than the full screen, or about 576 x 460 pixels.

The interactive TV user interface of the ALADIN prototype was tested with clients of a care facility for the elderly. Users liked the overall graphic design (font, colors, contrast) and found the GUI easy to navigate, provided that the number of navigation elements was kept to a minimum. Several general observations could be derived from the early user tests:

• The majority of users prefer number keys to cursor keys.
• Users expect immediate confirmation or feedback from the system.
• The number of menu items to choose from should not exceed 5 or 6.
• (Semi-)technical terms (e.g. biofeedback) have to be replaced with simple every-day terms (e.g. relaxation exercise).
The following test runs focused on the visualization of the biosignals and on the advice application, including the virtual agents. The results showed that the visual representation was easy to understand, except for the bar indicating the biofeedback: the variations of the values did not match the changes in the avatar's facial expressions. As far as the use of an avatar for giving advice was concerned, our test persons were evenly split between those in favor and those against, which is why we decided that the extra investment of time and effort was not warranted.

As far as sensor technology is concerned, testing the prototype with end-users made us discard the chest belt originally used for capturing biosignals, because test subjects found it too difficult to attach to their bodies. As a result, a special biosensor glove was developed, which entailed a change of methodology: instead of measuring SCR we employ photoplethysmography (PPG) to measure the peripheral pulse.

The usability test results from the field trials show that the iterative testing approach has paid off. The criticisms voiced by the test persons concern mainly the hardware devices, namely the lighting devices, the sensor glove and the remote control. Unlike the screen design of the interactive TV interface, these components have not been subjected to the same rigorous testing with end-users. The partners
responsible for developing the hardware are currently analyzing the results with a view to improving the various components.
3 The Modern Version of the Enchanted Lamp

Both the results of the lab tests and the test runs with end-users fed into the development of the ALADIN prototype. It constitutes an intelligent open-loop control and biofeedback system which can adapt various light parameters such as intensity and color in response to the psycho-physiological data that are continuously registered by the system. It can also be adjusted manually via the graphical interface and allows the resetting of all light parameters to their default values. As shown in Figure 1, it comprises the following components:

• Television (Fernsehen),
• Automatic lighting (Automatisches Licht),
• Manual lighting (Manuelles Licht),
• Exercises (Übungen),
• History (Rückschau), and
• Advice & support (Wohlfühltipps).
The user can select between these options by pressing the number buttons 1 to 6, or by moving the cursor to the desired option and pressing OK on the remote control.
Fig. 1 Main menu of ALADIN system
Each of these applications works independently and has to be started deliberately by the user. “Automatic lighting” adapts the lighting to achieve an activating or relaxing effect. Because of the great diversity of situations, events or individuals, the direction of change or adaptation is unlikely to be known beforehand. We have opted for genetic and simulated annealing algorithms to implement adaptive control
[13]. With "Manual lighting" a user can turn different predefined lighting situations (e.g. reading light) on and off as well as modify lighting situations manually. The application "Exercises" offers the user a variety of activating and relaxing exercises: whereas the former comprise various brain-jogging exercises, the latter are largely based on the biofeedback method, which makes subtle bodily functions perceptible to the user. "Advice & support" allows the user to browse through different recommendations aimed at healthy behavior. Finally, with "History" the user can view the results of the exercises he or she performed during the last five days. The "Advice" and "History" applications derive their input from the results of the exercises. Both the hardware, i.e. the sensor gloves and lighting devices, and the software applications have been developed to allow easy integration into building management systems, thus becoming part of a general assistive environment.
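As a rough illustration of how simulated annealing can be used for such adaptive control, the sketch below searches for lighting levels that improve a score derived from the user's biosignals. The two-channel light model, the cooling schedule and all constants are illustrative assumptions; they are not the algorithms actually deployed in ALADIN.

// Rough sketch of a simulated-annealing loop for adaptive lighting.
import java.util.Random;

class AnnealingLightController {
    /** Objective derived from the biosignals, e.g. an SCR-trend based relaxation score. */
    interface Objective { double score(double[] lightLevels); }

    private final Random rnd = new Random();

    double[] optimize(Objective objective, double[] start) {
        double[] current = start.clone();
        double currentScore = objective.score(current);
        double temperature = 1.0;

        while (temperature > 0.01) {
            double[] candidate = perturb(current);
            double candidateScore = objective.score(candidate);
            double delta = candidateScore - currentScore;
            // Accept improvements always; accept worse candidates with a probability
            // that shrinks as the temperature decreases.
            if (delta > 0 || rnd.nextDouble() < Math.exp(delta / temperature)) {
                current = candidate;
                currentScore = candidateScore;
            }
            temperature *= 0.95;   // illustrative cooling schedule
        }
        return current;
    }

    /** Small random change of each channel, clamped to 0..100% and kept small
     *  so that light changes remain smooth for the user. */
    private double[] perturb(double[] levels) {
        double[] next = levels.clone();
        for (int i = 0; i < next.length; i++) {
            next[i] = Math.max(0, Math.min(100, next[i] + (rnd.nextDouble() - 0.5) * 10));
        }
        return next;
    }
}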
3.1 Technical Architecture - Overview

The general idea of the ALADIN technology is to make as few modifications as possible within the private households of elderly people. This means that we prefer to use devices that already exist and are common in private households, to replace these devices with new technologies offering additional functionality, or to introduce only non-intrusive components for some of ALADIN's novel features. An important requirement of our project is that it should be possible to integrate the ALADIN lighting system into conventional building control systems. There are basically two ways of ensuring that all components of a building system (lighting, heating, ventilation, air-conditioning, security systems, lifts, access control, etc.) work together smoothly and do not interfere with each other:

1. running all components across a single system, or
2. installing a special system for every individual service to achieve maximum functional efficiency in each area.

We opted for the second possibility and decided to use the LITENET system from Zumtobel Lighting GmbH to control the lighting devices. It provides interfaces to other building services and higher-level building management systems in the form of BACnet and OPC as standardised interfaces. BACnet is a networking protocol which has been designed specifically to meet the communication needs of building automation and control systems. OPC stands for "Object Linking and Embedding for Process Control"; the OPC standards aim for open connectivity of industrial automation devices and systems. Thus we can achieve seamless system integration between field buses and computer applications such as the ALADIN software.

The ALADIN software runs on a computer system with interfaces for Ethernet and Bluetooth. It contains a Java-based framework for capturing psycho-physiological data supplied by a biosignal recorder via a Bluetooth connection, analyzing these biosignals with regard to the individual cognitive performance defined
by the performance-resource function (Alvarez et al.), and controlling the lighting devices (and blinds) according to the results of these analysis routines. Feedback is given by smart biosensors, which are attached to the human body and operate with a 32-bit microprocessor including an analogue-to-digital converter with eight input channels and a serial peripheral interface for output. The biosignal recorder used for the prototype is a wireless adaptation of the Varioport device from Becker Meditec, one of our partners. It measures electrocardiographic, electrodermal, respiratory and body movement indicators and continuously transfers them to the computer system via the Bluetooth connection.

Adaptive algorithms for feature detection, optimization, and learning are active continuously, as there is no single optimal lighting condition; the optimum depends on a person's individual psychological state at a particular point in time. For the field trials, the light devices were initially set at 20% of their maximum when using the genetic algorithm; during the phase with the simulated annealing algorithm the initial lighting condition was set randomly. The intelligent control loop is started manually by the TV remote control and is stopped either automatically, by a specific motion pattern detected by the intelligent unit, or by the user.
3.2 Hardware Components

As mentioned in Section 2, almost every private household has a television set, so we decided to use it as the user interface for explicit input and output. Television devices are used by older people in various ways: with a terrestrial antenna, a satellite antenna or cable television, together with video recorders, DVB receivers, amplifiers, etc. So as not to be restricted to one of these technical combinations, we avoid the use of the input and output connectors for Euro scart, antenna, and component video and audio. For ALADIN we used a television device with an RS-232C port for service and control as well as an RGB port. Graphic output is best presented on an HD-ready television with a 16:9 aspect ratio. Television signals can be received via antenna, Euro scart, or component input.

To ensure the accessibility of the ALADIN user interface we recommend a remote control with large keys as well as a tactile and non-slip casing that makes it easier for older people to use. The device should reproduce all the functions of the original TV remote control. The ALADIN prototype was programmed for the Meliconi gum-body remote control, but other universal remote controls could equally be programmed. In the future, this might therefore be replaced by the one-button device currently being developed by Bernhaupt and colleagues in Salzburg [2]. The user navigates the television device as well as the ALADIN software user interface via the remote control. Therefore we need an infrared control system which is able to receive infrared signals from the remote control and send infrared signals to the television device. For ALADIN we selected an infrared control system directly connected with the ALADIN software via USB.
The choice of a specific computer device for ALADIN is closely related to the general idea of ambient technology, i.e. implementing computer systems in such a way that the users are hardly aware of them. The Apple Mac Mini computer is small and stylish and can pass as a common set-top box for television devices. For the benefit of the administrator, the computer can be connected to a keyboard and mouse. The user, however, interacts with the ALADIN software only by means of the remote control.
3.2.1 Sensor Device

As mentioned before, the biosignal recorder used for the prototype is a wireless adaptation of the Varioport device from Becker Meditec. In the beginning, we worked with a chest belt. Since in the early user tests this proved too unwieldy and too difficult for most older people to attach, we had to find an alternative. Becker Meditec, the sensor expert in the consortium, tested various other sensor devices, including so-called "wearable sensor systems", but found them inappropriate for older test persons, or indeed any test persons with a certain amount of "padding" on their bones, since they only seemed to work with very skinny people. Eventually, it was decided to incorporate the sensor device into a glove. For the final biosensor glove we implemented photoplethysmography (PPG) to measure the peripheral pulse on the index finger and electrodermal electrodes to measure skin conductance between index finger and thumb. The user slips the glove on every time he or she wants to use automatic lighting or perform a biofeedback exercise. The smart biosensor glove can be turned on and off by a small power button. The glove contains a rechargeable battery which has to be recharged after twelve hours of operation.

A great deal of research is currently being conducted to develop unobtrusive and truly non-invasive sensor devices. As mentioned before, the idea of incorporating sensors into wearable T-shirts or other garments sounds very appealing but fails to convince when put to the test in real-life situations. A promising avenue may be the one pursued by Prance and his colleagues at the University of Sussex, who have developed sensors that operate by monitoring the displacement current between the body and the sensor input electrode and are thus able to dispense with the usual electrolytic paste contact or gel [7].
3.2.2 Light Installation

For ALADIN we use lighting and control devices from Luxmate, a company that belongs to the Zumtobel Group. The ALADIN system also works with control units of other producers such as Tridonic or Osram. The Zumtobel Luxmate system is addressed via a serial interface which in turn communicates with a DALI system. The general lighting control protocol is DALI (Digital Addressable Lighting Interface), an open standard in accordance with the International Electrotechnical
Commission IEC 60929 standard for fluorescent lamp ballasts. Each piece of operating equipment with a DALI interface can communicate individually via DALI. Using bi-directional data exchange, the DALI controller (e.g. LM-DALIS) can query and set the status of each light. As a stand-alone system, DALI can be operated with a maximum of 64 devices; for the ALADIN prototype we need at least twelve. The control circuits consist of ten lighting channels, with five channels for blue colored light (e.g. 8000 Kelvin) and five channels for red colored light (e.g. 2700 Kelvin). A single lighting device is composed of a red and a blue colored light. Furthermore, the ALADIN prototype also includes a light sensor measuring the light intensity within the room and at least one manual lighting switch to turn the lighting on and off in the usual way.

In Figure 2 you can see that the ALADIN lighting system consists of two different components: a direct or indirect suspended down-light and a ceiling floodlight, both of which can change spectrum, color temperature, light intensity and direction. For test purposes, the lighting solution, including the lighting devices, lighting controls, manual switch, and light sensor, has been assembled into a compact unit or bus box to enable quick installation in households. In everyday circumstances the ALADIN system can work with every digitally addressable device in a building automation system.
Fig. 2 ALADIN light installation
For adaptive lighting, all circuits can use the maximum light variation with all parameters (0-100 %) and in any combination, except the two circuits for the direct lighting (table light): a single circuit for direct light has a maximum of 80 % and both circuits together have a maximum of 120 %. To obtain a smooth change, the light variation must not exceed 10 % per second. Indirect lights are mounted along the whole length of all four sides of the room. The direct light, i.e. the table lamp, produces a great deal of heat, which is why a fan has been added. A bus box contains all the interfaces that are usually included in the building automation system. The light sensor is attached to the table light and normally directed towards a window to check the amount of daylight.
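A minimal sketch of these constraints is given below. It assumes channel values expressed as percentages and, purely for illustration, that the two direct-light circuits occupy the first two positions of the scene vector; the helper names are not taken from the ALADIN code base.

```java
// Hedged sketch of the lighting constraints described above: the two
// direct-light (table light) circuits are capped at 80 % each and 120 %
// combined, and per-second changes are limited to 10 % for smooth
// transitions. Channel indices are an assumption for illustration.
public final class LightingConstraints {

    static final int DIRECT_A = 0, DIRECT_B = 1;   // assumed positions of the direct circuits

    /** Clamps a proposed scene (percentages) so that it respects the circuit limits. */
    static double[] clampScene(double[] proposed) {
        double[] scene = proposed.clone();
        for (int i = 0; i < scene.length; i++) {
            scene[i] = Math.max(0, Math.min(100, scene[i]));
        }
        scene[DIRECT_A] = Math.min(scene[DIRECT_A], 80);
        scene[DIRECT_B] = Math.min(scene[DIRECT_B], 80);
        double combined = scene[DIRECT_A] + scene[DIRECT_B];
        if (combined > 120) {                       // scale both direct circuits down proportionally
            double factor = 120 / combined;
            scene[DIRECT_A] *= factor;
            scene[DIRECT_B] *= factor;
        }
        return scene;
    }

    /** Moves from the current scene toward the target by at most 10 % per second. */
    static double[] stepTowards(double[] current, double[] target, double seconds) {
        double maxDelta = 10.0 * seconds;
        double[] next = current.clone();
        for (int i = 0; i < next.length; i++) {
            double delta = target[i] - current[i];
            next[i] += Math.max(-maxDelta, Math.min(maxDelta, delta));
        }
        return next;
    }
}
```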
3.3 Software Architecture

It is essential that the software architecture of ALADIN be open to allow easy integration with other information and communication systems in buildings, thus becoming part of a general assistive environment. For this reason ALADIN is a platform-independent system which can run on any supported operating system. ALADIN has a client-server architecture in which the applications are clients that may initiate a communication session. ALADIN consists of several applications and requires at least one server and one client application. The applications include “Adaptive Lighting”, “Manual Lighting”, “Exercises”, “History View”, “Advice”, “Television”, and the “Main Menu”. Each of these applications has a user interface and an application area. Data between client and server are exchanged via TCP/IP.
Fig. 3 Software architecture of ALADIN
An important advantage of such a distributed architecture lies in the fact that client applications are device-independent, which means that the biofeedback application could run on a mobile device whereas the other applications could be installed on a desktop computer. Besides, it allows installing, testing and, if necessary, replacing the various software components independently of each other. The applications have been implemented in Java. However, given the above-described approach, it would be possible to implement individual applications in different programming languages if Java should prove inappropriate for any particular application.

Client and server communicate via a specially developed ALADIN protocol at the application level. This means that the client application can instruct the server application to start or terminate recording, to store the results or to retrieve records already stored. The applications “Adaptive Lighting” and “Manual Lighting” make use of the Lighting Control package to control the lighting devices. The applications “Exercises”, “History View”, and “Advice” either save their data via the Data Center package on the server or receive all relevant data via the Data Center package from the server. The application “Exercises” needs the ALADIN Music Player package to play music sequences during the biofeedback exercise. To start a biosignal recording session the Data Center makes use of the Varioport framework, which allows easy access to the data of the smart biosensor glove. The raw data from the biosensor glove are transmitted to the server via Bluetooth and saved immediately in a directory. The server informs the corresponding client about the new raw data records. User inputs from the remote control are communicated via the IRControl package to the server part of ALADIN and from the server to the respective client application. All applications with user interfaces need the UI Framework package to provide outputs to the television screen.

Each application can store data (e.g. biosignal results) in the persistence layer. All data are saved as plain text in a directory whose name records the time and date. Therefore, a client can request raw as well as application data from any particular day. Furthermore, the files can be opened with different text editors. What is difficult with this type of implementation is the performance of complex queries combining different search parameters. If this should be required in the future, it can be realized thanks to the abstraction of storage mechanisms provided by the persistence layer, which allows easy integration of a database.

The ALADIN software includes two Java-based frameworks to facilitate efficient and easy development of further applications. One of the frameworks subsumes all sub-routines into a single command communicated between client and server via TCP/IP. Functions such as connecting, disconnecting, sending requests and receiving the data requested can be performed by extending the client class. The other framework provides reusable graphical elements/components which enable the developers to adapt the graphical user interface (GUI) to different user requirements (e.g. different languages, font sizes, colors etc.). The framework also makes sure that the look-and-feel and navigation of the GUI are consistent across all applications.
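The following sketch illustrates the “extend the client class” pattern described above. The class names and protocol commands are hypothetical; the actual ALADIN protocol vocabulary is not reproduced in this chapter.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Illustrative sketch only: a generic TCP client base class that concrete
// ALADIN-style applications could extend. Names and commands are invented.
abstract class AladinClient implements AutoCloseable {
    private Socket socket;
    private PrintWriter out;
    private BufferedReader in;

    void connect(String host, int port) throws IOException {
        socket = new Socket(host, port);
        out = new PrintWriter(socket.getOutputStream(), true);
        in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
    }

    /** Sends one text command and returns the server's single-line reply. */
    protected String request(String command) throws IOException {
        out.println(command);
        return in.readLine();
    }

    @Override
    public void close() throws IOException {
        if (socket != null) socket.close();
    }
}

// A concrete client, e.g. for the biofeedback exercise application.
class ExerciseClient extends AladinClient {
    String startRecording() throws IOException {
        return request("START_RECORDING");   // hypothetical protocol command
    }

    String stopAndStore() throws IOException {
        return request("STOP_AND_STORE");    // hypothetical protocol command
    }
}
```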
3.3.1 ALADIN Applications

For automatic light adaptation the user has to select either an activating or a relaxing light situation. According to the user’s selection the adaptive algorithm (e.g. simulated annealing or a genetic algorithm) starts looking for the best individual lighting condition. Before the light adaptation starts, the light sensor checks the amount of daylight within the room; the user is asked to darken the room if it is too bright for adaptive lighting. The automatic light adaptation is achieved by an open loop circuit modifying the light parameters through the DALI interface and by interpreting the biosignals (electrodermal activity and heart rate) measured by the biosensor glove as the individual reaction to these light modifications. The result of this interpretation is used by the adaptive algorithms to find the next best light modification. The intelligent control loop is started manually by TV remote control and stopped manually by the user or automatically when the user leaves the room.

After selecting the manual light option in the main menu, the user is offered different light situations on the television screen and can select one of them. There are several predefined lighting situations and one variable light situation where the user can modify the light intensity as he or she wishes. ALADIN includes an administrator tool to define the light situations which are shown on the television screen and/or are available through manual switches, e.g. a romantic dinner light or a bright light for visually demanding tasks.

The exercise application of ALADIN contains a set of different relaxing and activating exercises. After starting this application the user is informed about the number of exercises for that particular day. Relaxation is achieved through continuous biofeedback of the user’s own electrodermal activity or heart rate. For this purpose the user has to slip on the biosensor glove and turn it on. Activation exercises include calculating, text searching, memorizing, and a visual guessing task. The user can select any of these exercises as often as he or she wants to relax or to activate. All the parameters of these exercises can be configured via a configuration file of the ALADIN system. After finishing an exercise the user receives immediate feedback on his or her success, i.e. the degree of activation or relaxation achieved. The data are stored in the data center of ALADIN.

The history view application of ALADIN shows the individual’s success in relaxing and activating in the exercises of the past five days. The underlying database is composed of the results of the exercises, which are automatically analyzed every day at noon by the ALADIN system in a two-step process. The results are stored in the data center of ALADIN. Every time the user selects this option, the history view application takes this information from the data center and shows the results on the television screen.

The advice application selects several recommendations from a set of approximately 250, depending on the results of the exercises performed during the last five days. There are six possible results: too much activation during the morning, too much activation during the afternoon, too much activation during the evening, not enough activation during the morning, not enough activation during the afternoon, and not enough activation during the evening.
3.3.2 Light Adaptation Algorithms

Dynamically finding the specific lighting situations that best support the ALADIN user basically comes down to a trial-and-error procedure. However, with ten individually controlled light circuits, each of which can be set to values between 0 and 255, the number of possible lighting situations is enormous, and it would therefore be impractical to test every single one of them for its effects. Furthermore, one and the same light scene can cause different effects in different situations. One approach consists in implementing uniform search strategies that can find optimal lighting situations by selectively generating new lighting situations and testing them against previously defined target values. ALADIN contains two different algorithms, a simulated annealing algorithm (A1) and a genetic algorithm (A2). In the field trials the algorithms were tested consecutively for three weeks each.

Simulated annealing is a combination of a so-called hill-climbing search algorithm with a random walk within a state space. Each lighting situation is analogous to a state of a physical system, and the psycho-physiological responses that must be improved are analogous to the internal energy of the system in that state. The hill-climbing algorithm is simply a loop that continually searches for a state with a higher value among the immediate neighbors in the state space. Because hill-climbing algorithms can get stuck on a local maximum, they are combined with a random walk within the immediate neighborhood of the state space, which makes it possible to generate different lighting situations successively. By means of this guided technique a search path is built from successive local search steps leading from the initial to the best lighting situation.
Fig. 4 Simulated annealing algorithm
Each step of the simulated annealing algorithm replaces the current solution with a random “nearby” solution, chosen with a probability that depends on the difference between the corresponding function values and on a global parameter T (called “temperature”) that is gradually decreased during the process.
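A minimal sketch of this loop is given below. It assumes a measureResponse() placeholder that applies a candidate scene, waits for the biosignal analysis described in 3.3.3 and returns an “energy” value (distance from the target psycho-physiological state, lower is better); the cooling schedule and perturbation size are illustrative choices, not values taken from the ALADIN implementation.

```java
import java.util.Random;

// Sketch of simulated annealing over a ten-channel lighting scene (values
// 0-255). measureResponse() is a stand-in for applying the scene and rating
// the measured biosignal reaction.
public final class LightAnnealer {
    private static final Random RND = new Random();

    static int[] anneal(int[] initialScene, int iterations) {
        int[] current = initialScene.clone();
        double currentEnergy = measureResponse(current);
        double temperature = 1.0;

        for (int step = 0; step < iterations; step++) {
            int[] candidate = randomNeighbor(current);
            double candidateEnergy = measureResponse(candidate);
            double delta = candidateEnergy - currentEnergy;
            // Accept improvements always, worse scenes with probability exp(-delta/T).
            if (delta < 0 || RND.nextDouble() < Math.exp(-delta / temperature)) {
                current = candidate;
                currentEnergy = candidateEnergy;
            }
            temperature *= 0.98;   // gradual cooling
        }
        return current;
    }

    /** Random walk in the immediate neighborhood: perturb one channel slightly. */
    static int[] randomNeighbor(int[] scene) {
        int[] neighbor = scene.clone();
        int channel = RND.nextInt(neighbor.length);
        int perturbed = neighbor[channel] + RND.nextInt(51) - 25;
        neighbor[channel] = Math.max(0, Math.min(255, perturbed));
        return neighbor;
    }

    /** Placeholder for applying the scene and rating the biosignal reaction. */
    static double measureResponse(int[] scene) {
        return RND.nextDouble();   // stand-in only
    }
}
```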
The other approach uses so-called genetic algorithms, which resemble evolution in nature and are often used to find good solutions within a huge search space, which makes them popular for optimization tasks. A genetic algorithm is a stochastic search in which successor states are generated by combining two parent states. The algorithm begins with a set of randomly generated states, called the population. Within one generation each state of the population is rated by an evaluation or fitness function. According to this fitness function two states are selected for reproducing new states by applying crossover and mutation procedures. In ALADIN a set of different light scenes, determined by the individual brightness levels of the lighting devices, forms the initial population. These light scenes are then tested for their psycho-physiological effects on the person in the ALADIN-controlled room. The more these effects approximate a specific target value, the higher the tested light scenes are rated, and the higher the chance that parts of them will be used in subsequent generations.
Fig. 5 Genetic algorithm
These highly rated scenes become the “parents” and are eventually recombined to form new light scenes. Random mutation can also take place, just like in nature. Once the creation of the new population is finished, the whole process is repeated over and over again. During the course of this process, the generated light scenes should increasingly approximate the desired effect.

Both search algorithms are implemented with different time parameters. The simulated annealing algorithm works with a fixed light transition time of five seconds. The lighting parameters are then held constant and the psycho-physiological effects are analyzed until a trend is found within the biosignal; the maximum time window is two minutes. After two minutes a random change of the parametrization occurs within 20 seconds. The genetic algorithm introduces a new light situation within a dynamic transition time window. The length of the time window depends on the difference between the start and end points of the lighting situation; only modifications of ten percent per second are allowed. The psycho-physiological effects are analyzed within a 20-second time window.
Light variation occurs in a very smooth and gradual way from the beginning to the end of the transition time. The analysis window should not be less than 20 seconds in order to equalize artefacts.
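The sketch below shows one generation step of such a genetic search over ten-channel light scenes, with a placeholder fitness function standing in for the measured psycho-physiological effect; the selection bias, crossover point and mutation rate are illustrative assumptions rather than the parameters used in ALADIN.

```java
import java.util.Random;

// Sketch of one generation of the genetic-algorithm variant: scenes are
// rated by a placeholder fitness function, and well-rated scenes are
// recombined and mutated to form the next population.
public final class LightEvolver {
    private static final Random RND = new Random();
    private static final int CHANNELS = 10;

    static int[][] nextGeneration(int[][] population) {
        // Rate each scene once, then sort indices by fitness, best first.
        double[] scores = new double[population.length];
        Integer[] order = new Integer[population.length];
        for (int i = 0; i < population.length; i++) {
            scores[i] = fitness(population[i]);
            order[i] = i;
        }
        java.util.Arrays.sort(order, (x, y) -> Double.compare(scores[y], scores[x]));

        int upperHalf = Math.max(1, population.length / 2);
        int[][] next = new int[population.length][];
        for (int i = 0; i < next.length; i++) {
            int[] parentA = population[order[RND.nextInt(upperHalf)]]; // bias toward well-rated scenes
            int[] parentB = population[order[RND.nextInt(upperHalf)]];
            next[i] = mutate(crossover(parentA, parentB));
        }
        return next;
    }

    /** One-point crossover of two parent scenes. */
    static int[] crossover(int[] a, int[] b) {
        int cut = 1 + RND.nextInt(CHANNELS - 1);
        int[] child = new int[CHANNELS];
        System.arraycopy(a, 0, child, 0, cut);
        System.arraycopy(b, cut, child, cut, CHANNELS - cut);
        return child;
    }

    /** Occasional random mutation of a single channel. */
    static int[] mutate(int[] scene) {
        if (RND.nextDouble() < 0.1) {
            scene[RND.nextInt(CHANNELS)] = RND.nextInt(256);
        }
        return scene;
    }

    /** Placeholder: higher means the measured effect is closer to the target. */
    static double fitness(int[] scene) {
        return RND.nextDouble();
    }
}
```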
3.3.3 Analyzing Biosignals

The parametrization of electrodermal activity (skin conductance reaction, SCR) is based on the standard deviation curve of the biosignal. For A1, standard deviations are calculated within 1.5-second time windows and the trend of these standard deviations over the 20-second analysis windows is then calculated. A2 measures the time during which the skin conductance values exceed the starting value of the assessment period. For heart rate parametrization the inter-beat interval (IBI) is identified from the peripheral pulse volume signal. A1 calculates a trend of the mean heart rate over the 20-second windows; A2 calculates the difference between the mean heart rates of the first and second halves of the 20-second window. For A1, heart rate variability is calculated as a trend of the standard deviations of the IBI over the 20-second windows, and for A2 as a single standard deviation of the IBI over 20 seconds. The results from analyzing the biosignals are then mapped onto the same standard scale so that the single psycho-physiological channels can be added and weighted. The psycho-physiological reaction is calculated as the sum of the standardized and weighted parameters. No spectrum analysis is performed since this is only possible with ECG and a high sampling rate. Adaptation can be stopped manually or by motion sensors integrated into the Varioport or into the lighting device on the ceiling.

The ALADIN prototype should work without a learning period for the system but should show learning effects over time. The target values for activation or relaxation are defined by psycho-physiological parameters. The lighting effects as well as the psycho-physiological signals may differ for every adaptive lighting session (e.g. due to learning, time of day, conditions and surrounding effects). Therefore a one-time calibration of the target values, i.e. a baseline definition by measuring and rating, seems inadequate. On the other hand, a user-guided calibration with every adaptive lighting session is inefficient and not acceptable to the subjects. In the field trials we calibrated the system with the first biofeedback of each day. For the target value calculation a baseline correction is implemented: for A1 we determine the minimum and maximum of the starting psycho-physiological level of SCR, heart rate (HR), and heart rate variability (HRV) within the biofeedback every morning. For A2 we determine the median of the absolute maxima and minima of SCR, HR, and HRV in the history of psycho-physiological measurements. Relaxation is defined as a 1-20 % decrease of SCR, HR, and HRV depending on the baseline correction; activation is defined as a 1-20 % increase of SCR, HR, and HRV depending on the baseline correction. A combinatory approach is used to interpret skin conductance and heart rate as indicators of psycho-physiological activation and relaxation. In case the test persons are dissatisfied with the adaptive lighting, they can intervene and manually interrupt it. These interruptions are also good indicators for target values and are logged.
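As a concrete illustration of the A1-style SCR parametrization, the sketch below splits a signal window into 1.5-second sub-windows, computes their standard deviations and returns a least-squares slope as the trend. The sampling rate, array layout and the use of a least-squares slope as the trend estimator are assumptions for illustration only.

```java
// Hedged sketch of the A1-style skin-conductance parametrization: standard
// deviations of 1.5-second sub-windows, then the slope over those values.
public final class ScrTrend {

    static double trend(double[] signal, double samplingRateHz) {
        int windowSamples = (int) Math.round(1.5 * samplingRateHz);
        int windows = signal.length / windowSamples;
        double[] stdDevs = new double[windows];
        for (int w = 0; w < windows; w++) {
            stdDevs[w] = stdDev(signal, w * windowSamples, windowSamples);
        }
        return slope(stdDevs);
    }

    static double stdDev(double[] x, int offset, int length) {
        double mean = 0;
        for (int i = 0; i < length; i++) mean += x[offset + i];
        mean /= length;
        double sumSq = 0;
        for (int i = 0; i < length; i++) {
            double d = x[offset + i] - mean;
            sumSq += d * d;
        }
        return Math.sqrt(sumSq / length);
    }

    /** Least-squares slope of y over equally spaced indices 0..n-1. */
    static double slope(double[] y) {
        int n = y.length;
        double meanX = (n - 1) / 2.0, meanY = 0;
        for (double v : y) meanY += v;
        meanY /= n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (y[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }
}
```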
3.3.4 Analyzing Exercise Routines

The results from the exercises are analyzed in a two-step process. In the first step, all exercise results per period of the day (morning, afternoon, evening) are combined into single health status values for the ability to relax and the ability to activate in the morning, afternoon, or evening. In the second step, the results of this classification for the last five days are analyzed for a visible trend in the history of the ability to relax and the ability to activate in the morning, afternoon, or evening. This second step leads to individual recommendations for the user on how to improve his or her health status. The first step of the data analysis is performed with artificial neural network (ANN) classification, the second step with k-nearest neighbor (KNN) classification.

For the non-linear statistical modeling of the complex relationships between exercise data and health status per period of the day (morning, afternoon, evening) ALADIN uses artificial neural network (ANN) classification. A simple feed-forward neural network has been implemented in which the information moves in only one direction, from the input nodes through the hidden nodes to the output nodes; there are no cycles or loops in the network. This multi-layer perceptron with a single hidden layer has 30 input nodes, 35 hidden nodes, and 5 output nodes. The 30 input nodes derive from the six different exercises of the ALADIN system with integer values from 1 to 5 as possible results; each of these possible results corresponds to an input node of the perceptron. The same holds for the output nodes, where each possible result corresponds to a single node. The units of this network apply a sigmoid activation function in the form of a hyperbolic tangent, which allows both positive and negative values of the exercise data to be considered. The ANN classification of ALADIN uses back-propagation as its learning technique: the output values are compared with the correct interpretations of experts (physicians and psychologists) to compute the value of a predefined error function. The correct interpretations are described in training vectors. The error is then fed back through the network, and the algorithm adjusts the weights of each connection in order to reduce the value of the error function by some small amount. After repeating this process for a sufficiently large number of training cycles (10,000 cycles for ALADIN), the network will usually converge to a state where the error of the calculations is small.

For interpreting the time series patterns of the combined exercise data, i.e. the single health status values for morning, afternoon, and evening within the last five days, ALADIN implements k-nearest neighbor (KNN) classification for activation and relaxation. This search algorithm computes the distance from a query point to every other point in a predefined database, keeping track of the “best so far”. The query point is a single health status value for the ability to relax or the ability to activate in the morning, afternoon, or evening. The database consists of expert data, i.e. exemplary time series of combined exercise data for activation and relaxation within five days that have been interpreted by experts such as physicians and psychologists. The idea of this classification is to use nearest neighbor search to find the time series pattern in the database that is closest to the given time series pattern of the user. Within ALADIN k = 1, which means that points are simply assigned to the class of their nearest neighbor. The training phase of the algorithm consists of storing the feature vectors, i.e. the selected time series database, and the class labels of the training samples (the experts’ interpretations). In the actual classification phase, the test sample (whose class is not known) is represented as a vector in the feature space; distances from this new vector to all stored vectors are computed and the closest sample is selected. This algorithm, sometimes referred to as the “naive approach”, works for small databases but quickly becomes intractable with increasing size or complexity of the problem. No search data structures have to be maintained, so the linear search has no space complexity beyond the storage of the database.
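The second step therefore reduces to a linear nearest-neighbor search with k = 1, as sketched below; the Euclidean distance measure and the record layout are assumptions made only for the sake of illustration.

```java
import java.util.List;

// Sketch of the k = 1 nearest-neighbor step: a five-day series of combined
// exercise results is compared against expert-labelled reference series and
// the label of the closest reference is returned. Field names are illustrative.
public final class NearestNeighborClassifier {

    record LabeledSeries(double[] values, String label) {}

    static String classify(double[] query, List<LabeledSeries> references) {
        String bestLabel = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (LabeledSeries ref : references) {        // linear "naive" search
            double d = squaredDistance(query, ref.values());
            if (d < bestDistance) {
                bestDistance = d;
                bestLabel = ref.label();
            }
        }
        return bestLabel;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```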
4 Discussion of Field Test Results and Outlook

4.1 Findings from Field Trials

The prototype has been tested in twelve single households for a period of three months. At regular intervals, various mental and physical performance tests as well as life and sleep quality questionnaires were administered. Besides, the test persons were encouraged to enter into personal diaries anything they considered worth mentioning and which might affect the outcome of the measurements. We also organized two focus group meetings with our test persons to discuss any issues that might not have been covered by the questionnaires or log data. In addition, case histories have been prepared based on the diaries as well as on information obtained from the coaches who accompanied the test persons during the field trials. These illustrate and convey people’s experience with the prototype in a much more accessible way than a list of log data. Case histories can supplement the scientific discussion of the results and are useful for disseminating the results to a wider audience.

The results indicate an overall increase in both mental and physical fitness. On the whole, the test persons enjoyed the exercises and found the system easy to handle. However, they would have liked a bigger choice both with regard to the exercises and to the music tunes that accompany the biofeedback sessions. We can also discern minor improvements in terms of technology acceptance. On the whole, the test persons who evaluated the system in winter and early spring showed a more positive attitude than those using the system in the summer months. This is not surprising since most test persons were active and preferred going for walks outside to watching TV with ALADIN. Quality of life did not seem to have improved significantly, which may be due to the fact that most test persons started out from a very high level, so there was little room for improvement.

The evaluation of the field tests, especially the qualitative data gathered by means of interviews and focus group discussions, appears to confirm the contributory factors to user acceptance outlined in 2.1: technology can only complement but never replace interaction with humans and should, ideally, facilitate human contact. The test persons quite obviously enjoyed the attention they received during the test phase: each test person was assigned a coach who visited them at least once a week. Some of them appreciated the structuring effect that the daily exercises and the regular test runs had on their lives, especially those who had recently retired. These observations have a strong influence on our deliberations regarding ALADIN’s future, as can be seen further below.

The people who might benefit most, e.g. the frail, home-bound older elderly, were rather poorly represented in the test population. This is largely due to the selection criteria as well as the recruiting process, which relied mainly on traditional media channels. To reach the older elderly, different recruitment strategies would have to be employed, such as addressing nursing homes or other care facilities and organisations. However, this was ruled out by the focus of the AAL programme on independent living. On the other hand, one may argue that older people should familiarize themselves with new technologies that will help them prolong the period during which they can remain living at home whilst still in comparatively good health and mobile.
4.2 Outlook on ALADIN’s Future

Now that the ALADIN prototype is complete and has been tested in real-life settings, the Consortium partners have started looking at opportunities for exploiting the system and its components. The challenge consists in turning the research findings into a cost-effective solution that can be marketed successfully to older people across Europe. Anyone who wishes to enhance his or her wellbeing and quality of life by means of intelligent adaptive lighting is a potential customer.

There is ample evidence of the economic potential of ALADIN-related research. Europe’s population is ageing rapidly, and ICT and assistive technologies can help deal with the challenges and opportunities presented by this demographic change. Based on the analysis of published data and demographic trends, we can assume a growing target group of up to 135 million Europeans aged 65+ by the year 2050. With an ageing population, the number of people with various physical, psychological or other impairments, including chronic conditions such as sleep disorders, various forms of dementia and rheumatism, is expected to rise in all EU countries. Given that the average income of the elderly is expected to increase, a host of economic opportunities with great business potential is opening up for age-related products and services, especially in the healthcare field, but also in the housing and building markets. Many houses will need to be adapted to cater for the needs of such people. Homes suitable for older people range from homes that fulfil basic accessibility standards, such as no-step entrances and lifts, to fully serviced accommodation with care provision. Because the ALADIN system is modular and therefore flexible, it can be implemented in all housing categories.

Most older adults prefer to remain in their own homes as they age. Residence in a nursing home or care home is not the norm and is often regarded as a last resort. In the Netherlands, for instance, care of the elderly is increasingly moving away from nursing homes (intramural care) into the community (extramural care). In Austria, recent legislation which makes it possible to officially employ informal carers, mainly from East European countries, has strengthened the extramural trend. From a public policy perspective, this is a cost-effective option provided that the home can be used as a platform to ensure people’s general wellbeing, i.e. the home as a care environment.

So far, intelligent or smart homes have focused on building services (heating, cooling, air conditioning, lighting, sun-blinds), household appliances (electric cooker, refrigerator, etc.), multimedia (telephone, audio and video, TV, etc.) and the associated technological data processing. The integration of human beings, by using the occupants’ data - physiological indicators, lifestyle parameters, movement and location, etc. - to assist in their comfort, health and safety has not really been taken into consideration. However, integrating such data into future AAL applications would help to tailor them to the specific needs, requirements and wishes of particular users.

Apart from taking into account psycho-physiological parameters for personalizing technological applications, we are considering other measures for making AAL solutions more attractive and marketable. One of the measures we intend to implement in a follow-up project is to package the technology with support and advice measures. Although the great majority of the elderly want to live independently at home for as long as possible, they also want to be embedded in a social network. In a follow-up project we would like to investigate how this type of support can best be delivered. One of our partners intends to use SOPHIA, an information and communication platform developed by a housing foundation (Joseph-Stiftung) together with the university and clinic in Bamberg (see www.sophia-tv.de). Via a videophone this ‘virtual nursing home’ connects people to a large variety of health, care and social services in the region.

Another strategy consists in marketing the ambient lighting solution with a voucher for a certain number of hours of technical assistance. In many focus group discussions carried out by our colleagues in the social studies department, the provision of technical support has emerged as a major prerequisite for user acceptance of new technologies. Most people would be quite happy to pay a higher price provided they are guaranteed a reliable and regular support service, if possible delivered by someone from their own neighborhood or community. This would be in line with the results of various evaluation studies of patient self-management initiatives (e.g. the Expert Patient Programme in the UK), which showed how important it was that such initiatives be firmly anchored in the community.
Since the elderly tend to suffer from chronic conditions to a disproportionate extent, the evaluation results of self-management programs could be very valuable [10, 11]. A further avenue we want to pursue is combining lighting solutions with advice on how to make people’s homes ageing-friendly. In the course of ALADIN we have acquired a great deal of knowledge about older people’s needs and preferences with regard to housing. Due to mobility constraints, many older people spend a large proportion of their time indoors, which makes optimizing lighting essential for their wellbeing. When installing lighting systems in the homes of the elderly, quantity, spectrum, timing, duration and spatial distribution are important characteristics to be considered. In addition, special age-related impairments such as impaired vision have to be taken into account. Therefore the basic ALADIN system will have to be supplemented with extra functions such as an automatic daylight supplement to smooth contrasts between light and shade, or automatic corridor illumination triggered by motion sensors. In the empirical research for ALADIN, the risk of falling emerged as one of the most common worries among the elderly. The use of lighting for navigational purposes would therefore clearly respond to older people’s needs. Although the lighting industry already offers a variety of solutions that address safety and security concerns, such as recessed LED luminaires along staircases or corridors, few people are aware of their existence. This is why some partners in our consortium are planning to extend their repertoire of services by offering advice and counseling on these issues.
4.3 Conclusions

From the perspective of a future roll-out of the ALADIN system, and bearing in mind its preventive rather than therapeutic intention, it could be an advantageous strategy to conceive ALADIN as a convenient ambient assistance device addressed to technically open-minded younger end users. In the long run, however, the very elderly (75+) will constitute an equally interesting target group, provided the ALADIN functions related specifically to age-related impairments, such as deteriorating sleep quality, are emphasized. In this case, it would be important to target the intermediaries, i.e. family members and formal and informal care-givers, when marketing the system.

The price of the system is still a major barrier to marketing it on a large scale. Even if we succeed in bringing the price down through economies of scale, it would still be beyond the means of most older people unless they are reimbursed through their insurance. But national regulatory schemes are beyond our scope. This is why we are looking at care facilities for the elderly as a potential target group. This, however, would require a certain measure of adaptation, since at the moment ALADIN is designed for single rooms in private households. Above all, we would have to devise strategies to reach the relevant intermediaries such as health and care professionals, care organizations, building companies and housing associations.

The modular and open architecture of our ambient lighting solution makes it easy to integrate into building management systems and general assistive environments. Therefore it can be marketed both as a stand-alone product and as an add-on or option for existing systems. Besides, the partners will leverage the alliances established during the project for ‘packaging’ the technology with appropriate support measures, which will have to be adapted to the differing social and organizational needs across Europe. This requires localizing the products not just by changing the language of the user interface, but by embedding the whole solution in the respective communities. This could be achieved by linking it up to a platform such as SOPHIA or to local and regional health care services.

For the time being, ALADIN is directed at older adults and has been piloted only in private homes. However, its findings are relevant for many other target groups as well as other application areas which others might want to explore further:

• Shift-workers (e.g. in industry, in the service sector, hospital staff, etc.)
• Assistive lighting systems for navigational support for older adults or people with cognitive impairments.
• Air-traffic controllers or any other personnel responsible for the safety and security of critical installations such as nuclear power plants or chemical factories.
• Office workers in general, who should benefit in terms of significant increases in work productivity.
• Automotive applications, such as the interior lighting of cars, where ambient illumination systems can facilitate the handling of individual functions, reduce driver fatigue and increase safety.

Because the system consists of modular components it can be adapted to different contexts such as hospitals, aeroplanes, vehicles, health centers and offices, and be extended to other contextual variables such as temperature, acoustics or displays. Anybody interested in developing it further is welcome to contact us.
References

[1] Abdou O (1997) Effects of Luminous Environment on Worker Productivity in Building Spaces. Journal of Architectural Engineering 3:124
[2] Bernhaupt R, Obrist M, Weiss A, Beck E, Tscheligi M (2007) Trends in the Living Room and Beyond. Lecture Notes in Computer Science 4471:146
[3] Boyce P, Mcibse F (2006) Education: the key to the future of lighting practice: The Trotter-Paterson memorial lecture presented to the Society of Light and Lighting, London, 21 February 2006. Lighting Research & Technology 38(4):283
[4] Cook D, Youngblood M, Heierman E, Gopalratnam K, Rao S, Litvin A, Khawaja F (2003) MavHome: An Agent-Based Smart Home. In: Proceedings of the IEEE International Conference on Pervasive Computing and Communications, pp 521–524
[5] De Ruyter B (2003) 365 Days of Ambient Intelligence Research in HomeLab
[6] Friedewald M, Da Costa O (2003) Science and Technology Roadmapping: Ambient Intelligence in Everyday Life (AmI@Life). Karlsruhe: Fraunhofer-Institut für System- und Innovationsforschung (FhG-ISI)
[7] Harland C, Clark T, Prance R (2002) Electric potential probes - new directions in the remote sensing of the human body. Measurement Science and Technology 13(2):163–9
[8] Heschong L (2002) Daylighting and human performance. ASHRAE Journal 44(6):65–67
[9] ISTAG (2002) Report of working group 60 of the IST Advisory Group concerning strategic orientations and priorities for IST in FP6. Tech. rep., Bruxelles
[10] Jordan J, Osborne R (2007) Chronic disease self-management education programs: challenges ahead. Medical Journal of Australia 186(2):84
[11] Kennedy A, Gately C, Rogers A, EPP E (2004) National Evaluation of Expert Patients Programme: Assessing the process of embedding EPP in the NHS: Preliminary survey of PCT pilot sites
[12] Mayring P (1987) Subjective well-being in the aged. Status of research and theoretical development. Z Gerontol 20(6):367–76
[13] Mitchell M (1996) An Introduction to Genetic Algorithms. Bradford Books
[14] Nakashima H (2007) Cyber Assist Project for Ambient Intelligence. Frontiers in Artificial Intelligence and Applications 164:1–20
[15] Pollack M (2005) Intelligent Technology for an Aging Population. AI Magazine 26(2):9–24
[16] Schweitzer M, Gilpin L, Frampton S (2004) Healing Spaces: Elements of Environmental Design That Make an Impact on Health. Journal of Alternative & Complementary Medicine 10(Supplement 1):71–83
[17] Shochat T, Martin J, Marler M, Ancoli-Israel S (2000) Illumination levels in nursing home patients: effects on sleep and activity rhythms. Journal of Sleep Research 9(4):373–379
[18] Ulrich R (2001) Effects of healthcare environmental design on medical outcomes. In: Design and Health: Proceedings of the Second International Conference on Health and Design. Stockholm, Sweden: Svensk Byggtjanst, pp 49–59
Japanese Ubiquitous Network Project: Ubila

Masayoshi Ohashi
1 Introduction

Recently, the advent of sophisticated technologies has stimulated ambient paradigms; these technologies include high-performance CPUs, compact real-time operating systems, a variety of devices and sensors, low-power and high-speed radio communications, and, in particular, third-generation mobile phones. In addition, due to the spread of broadband access networks, various ubiquitous terminals and sensors can be closely connected. However, as many researchers argue, the real goal of ubiquitous computing as addressed by Weiser in 1991 [32] remains far beyond current R&D levels. In Japan, a more pragmatic approach has focused on what we call ubiquitous networking. A roadmap of such ubiquitous networking projects in Japan is depicted in Figure 1. Ubiquitous networks cooperatively support user-centric services and applications anywhere and anytime.

The Japanese Ministry of Internal Affairs and Communications (MIC) started a study on ubiquitous networking around 2000 and in 2002 issued a report on “Ubiquitous Networking” [11]. It argued that context-aware computing environments embedded in our real world and interconnected by broadband fixed/mobile networks will greatly enhance services to end users and bring amenity and security to all people. At the same time, the Ubiquitous Networking Forum [29] was founded in June 2002, mainly by Japanese communication industries, to accelerate and promote studies on ubiquitous networking. The forum’s objectives included the following:

1. Create new industries and business markets
2. Enable secure social lifestyles
3. Encourage social participation of the disabled and elderly
4. Support various work environments
ATR Media Information Science Laboratories, Kyoto, Japan, e-mail: [email protected]
The forum conducted extensive studies on various aspects of ubiquitous networking and published a white paper entitled “A book of DOKODEMO Network (ubiquitous network)” in 2003 [30]. In 2003, MIC launched three national research and development projects on ubiquitous networking: a microchip networking project, an authentication and agent networking project, and a network control and administration project. We were assigned to lead the last project, which was named the Ubila project. Following those three projects, MIC launched further ubiquitous networking related projects, including the Network Robots Project in 2004, an RFID project in 2004, and a Sensor Network Project in 2005. In 2004, MIC started a policy roundtable for the future ubiquitous network society, named “u-Japan” [13].

The Ubila project consisted of seven member companies and universities: Kyushu Institute of Technology, KDDI R&D Labs., NEC Corporation, Fujitsu Ltd., Keio University, the University of Tokyo, and KDDI Corporation. The Ubila project covered a large R&D area for supporting ambient services, and its keyword was context awareness. The project aimed to realize the total control and management of an ambient system. During its five years of R&D activities, almost $40 million was invested in the Ubila project. In this section, an overview of the project and its results is presented from various aspects.
Fig. 1 The roadmap of ubiquitous networking projects in Japan
2 Ubila’s Vision

The objective of the Ubila project [28] was to establish key technologies for ubiquitous networking in order to provide optimal applications, network connections, network services, computing/network environments, and real services to end users. A schematic diagram of the Ubila project is shown in Figure 2. Users perform daily activities in real space, which is indicated at the bottom of the picture. A ubiquitous network system tries to recognize a user’s situation, or infer what a user wants or requires. Such a user’s situation is called context, or more specifically, user context. Various monitoring methods such as sensors and RFID tags may be used; this retrieval flow is shown by the arrow on the left. Information about the network situation is simultaneously retrieved. It includes the access situation of a user’s wireless terminal, the connection status of the communication devices around a user, and the relevant core network traffic situation. This is called network context, and it is shown by the arrow on the right. From these collected data, a set of context is eventually created that covers both the user and the network context. Depending on their attributes, some components are named separately, such as profiles, feelings, policies or Service Level Agreements (SLAs). Profiles include the pieces of user information associated with user activities. Feelings are instances of a user’s emotional information. Policies are network controlling rules for a user or user applications. SLAs specify the committed degree of network service quality.

The Ubila system exploits all of this context for a target user, decides on a service scenario, and identifies the necessary network control based on the user and network context. The ubiquitous network then performs various controls, such as selecting an appropriate access network, ensuring quality-guaranteed routing paths, and/or finding the best resource for content downloads. Eventually, services or contents are presented to the target user, who enjoys them through his or her local surrounding computing environment. This environment is called the Access Open Platform, which comprises many home appliances, wireless units, sensors, motes and actuators that provide direct human interfaces.
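As a rough illustration of how these pieces of context might be grouped before a service scenario is decided, the sketch below aggregates user context and network context together with policies and SLA information. The field names are illustrative only and are not taken from Ubila deliverables.

```java
import java.util.List;
import java.util.Map;

// Illustrative grouping of the context types named above (assumed field names).
record UserContext(Map<String, String> profile,      // user information tied to activities
                   Map<String, Double> feelings,     // emotional information instances
                   String location) {}

record NetworkContext(String accessNetwork,          // e.g. the current wireless access
                      double availableBandwidthMbps,
                      List<String> nearbyDevices) {}

record ServiceContext(UserContext user,
                      NetworkContext network,
                      List<String> policies,         // network controlling rules
                      Map<String, String> sla) {}    // committed service quality levels
```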
3 Approach

As described in the introduction, the Ubila project had two major areas: applications/appliances and networks. Even though these two areas are normally considered independent, the Ubila project integrated them for mutual cooperation in order to provide ubiquitous environments to target users. Therefore, having and sharing a common vision across the two areas was crucial. The Ubila project adopted two views: context awareness and user centricity. In addition, the target users were not technical experts but ordinary people, including the elderly and children, to whom services, applications and network services must be provided without specific knowledge or previous experience.
Fig. 2 Schematic picture of Ubila vision
In the next section, some prototype developments of ambient services are introduced.
4 Various Ambient Service Designs and Prototypes

4.1 Basic Component Technology

4.1.1 Indoor Location System

Measuring locations in outdoor environments is now possible with GPS. However, obtaining indoor location information remains an issue for ubiquitous computing. Many indoor positioning systems have been proposed, but they still need initial system configuration. The University of Tokyo developed a fully distributed ultrasonic positioning system called “DOLPHIN” (Fig. 3). It was verified in an actual indoor environment and provided high-accuracy positioning [14] without complicated initial configurations.
4.1.2 Wireless Sensor Network Nodes

Wireless sensor network nodes play important roles in sensing the environment and actuating upon it in ubiquitous computing.
Fig. 3 Block diagram and implementation of DOLPHIN
Although motes with TinyOS are now widely available, they still have issues with respect to real-time support and programming flexibility. The University of Tokyo developed wireless sensor network nodes with a real-time OS, called “PAVENET” and “PAVENET OS” [21] (Fig. 4). “PAVENET OS” provides hard real-time processing and yet can operate with very small resources; for example, exact 100 Hz real-time sampling is ensured even while the node is communicating with other nodes. “ANTH,” a device collaboration framework for sensors and actuators using “PAVENET,” detects events from sensor objects on an event-driven basis and passes them to actuator objects. A binding between objects can be created simply by attaching the two physical objects to each other.
4.2 Pilot Applications and Services

In this subsection, demonstrations of pilot applications and/or field trials are described. Through these demonstrations and field trials, the project partners verified performance and service possibilities.
Fig. 4 PAVENET module: wireless sensor network nodes with real-time OS
4.2.1 Earthquake Sensing System

An earthquake sensing system was developed in collaboration with Kajima Corporation and the University of Tokyo [24] (Fig. 5) to measure vibrations caused by earthquakes using acceleration sensors embedded in buildings or bridges. During an earthquake, all sensors start to measure the acceleration and send the data to a sink node. A “PAVENET” module was used for data collection, and synchronization errors were suppressed to within 1 msec. This system will be useful for checking whether a target building remains robust against earthquakes or needs reinforcement.
4.2.2 Smart Furniture

Smart furniture, an initiative to make everyday living objects smart, was actively developed by Keio University. Such furniture plays an important role in realizing smart spaces and providing ubiquitous services. Among the developed smart furniture, the most useful prototype is “uTexture” [10] (Fig. 6), a kind of panel computer with touch sensors; the panels can be connected with each other and learn their mutual connection topology. “uTexture” is programmable; it could become a tiled display or a CD audio player (Media Shelf), etc. This furniture was embedded in the smart space in Yurakucho, Tokyo, and several key applications were demonstrated with it (see 7.2). “MO@I” is another good example of smart furniture [4] (Fig. 7). It is basically a big screen with a touch sensor, a USB camera, an ultrasonic sensor, an LED, and RFID (passive and active) readers. Since the local interfaces have different coverage, “MO@I” offers different service zones, namely touchable, face-to-face, visible, and ambient zones, and services are provided according to the zone in which the user is registered.
Fig. 5 Earthquake sensing system with “PAVENET” module
Fig. 6 Smart furniture; uTexture
4.2.3 uPhoto

“uPhoto” was designed to add tags to visible spaces [5, 23]. Normally, the services available in an environment are not explicitly indicated. “uPhoto” searches for available services in an environment from a photograph. Services embedded in an environment are marked with special identification tags called “service eyemarks” and registered to a server.
Fig. 7 Smart furniture; MO@I
After a photo is taken, it is sent to the server to search for the available services within the picture. This service resolution is made immediately after the picture is taken, and the available services are indicated on the picture. Therefore, by clicking an icon on the screen, the target service can be invoked. “uPhoto” was developed by Keio University.
Fig. 8 uPhoto: finding available services from a photo
4.3 Collection and Dissemination of User Profiles

If a large amount of information about real spaces or daily objects is collected and disseminated over an entire ubiquitous network, then innovative values can be created.
Among such information, user information or user status information is particularly important (see 9.1). We call this “user context” or “user profiles.” In this subsection, several R&D outputs that exploit user profiles are described.
4.3.1 Kuchikomi Navigator

The “Kuchikomi navigator” collects users’ views or short comments about things, analyzes them with on-demand community network engines, and visually shows the results (“Kuchikomi” means word-of-mouth communication in Japanese). If a user selects a specific topic, related topics are automatically sought by the engine and presented on the PC display based on their similarity. Kuchikomi sources were collected from the Internet, and the Lifelog (see 4.3.3) was also taken as an input source. The community network engine dynamically establishes an on-demand secure link among relevant nodes, called a Closed User Group (CUG). The “Kuchikomi navigator” was developed by NEC.
4.3.2 Home Navigator

The “home navigator” is designed to control intelligent home appliances based on a context aware module. It was jointly developed by the University of Tokyo and NEC. The module consists of “Synapse” and a policy engine. “Synapse” anticipates and suggests subsequent actions for home appliances, based on a previously learned typical sequence of user operations associated with sensor data [22]. The policy engine specifies a set of appliance execution policies; for example, prohibiting the simultaneous operation of both a heater and an air-conditioner in one room is a policy. Combining the two engines makes it possible to offer context aware services that comply with the specified rules.
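A tiny sketch of such a policy check is given below; the appliance names and the method signature are purely illustrative and are not taken from the home navigator implementation.

```java
import java.util.Set;

// Illustrative policy rule: reject a suggested action (e.g. from "Synapse")
// if it would run a heater and an air-conditioner in the same room at once.
final class AppliancePolicy {

    /** Returns true if the suggested appliance may be switched on in this room. */
    static boolean allowed(String suggestedAppliance, Set<String> runningInRoom) {
        boolean heaterVsAircon =
                ("heater".equals(suggestedAppliance) && runningInRoom.contains("air-conditioner"))
             || ("air-conditioner".equals(suggestedAppliance) && runningInRoom.contains("heater"));
        return !heaterVsAircon;
    }

    public static void main(String[] args) {
        System.out.println(allowed("heater", Set.of("air-conditioner"))); // false
        System.out.println(allowed("heater", Set.of("television")));      // true
    }
}
```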
4.3.3 Lifelog with Mobile Phones

Using mobile phones, KDDI developed a lifelog system called “Keitai de Lifelog” (Lifelog with mobile phones). Current mobile phones have many capturing capabilities, such as a camera, a barcode reader, and location acquisition with GPS; even an RFID reader may be feasible. In this sense, a mobile phone is an ideal device for collecting a personal profile of a user’s daily life. This Lifelog system regards incidents that happen around a user as pieces of lifelog and captures them by mobile phone. For example, when a user goes shopping and finds an interesting item, he or she can capture various pieces of information, i.e., when, where (by GPS), and what (by taking a photo and/or by reading a barcode or 2-D barcode (QR code)) [15]. Detailed information about the object (name, vendor, price, etc.) may be resolved from the barcode with the help of a common product database.
Fig. 9 Home navigator: smart controller for intelligent home appliance
After being uploaded, the record is stored semantically using the Resource Description Framework (RDF) format, where all resources in real space are individually assigned a unique URI. At the same time, KDDI developed a mobile phone with a passive RFID reader [9] (Fig. 11). In demonstrations, RFID tags were placed on objects, and when an identification number was read out from the RFID, the object’s attributes (e.g., name, location, price, etc.) were resolved by a server and stored in RDF format. The stored data can be shown to a user in several styles including a blog, a calendar or mashed-up format on a map.
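The sketch below shows how one such record might be written out as RDF in Turtle syntax; the namespace, property names and URI scheme are invented for illustration and do not reproduce the actual vocabulary used by the Lifelog system.

```java
import java.util.Locale;

// Illustrative serialization of a single lifelog record as Turtle.
// The ll: namespace and property names are hypothetical.
public final class LifelogEntry {

    public static String toTurtle(String entryId, String isoTime,
                                  double lat, double lon, String productUri) {
        return String.format(Locale.ROOT,
            "@prefix ll: <http://example.org/lifelog#> .%n"
          + "<http://example.org/lifelog/entry/%s>%n"
          + "    ll:capturedAt \"%s\" ;%n"
          + "    ll:latitude %.5f ;%n"
          + "    ll:longitude %.5f ;%n"
          + "    ll:refersTo <%s> .%n",
            entryId, isoTime, lat, lon, productUri);
    }

    public static void main(String[] args) {
        // Hypothetical shopping record: time, GPS position and a product URI
        // resolved from a scanned barcode.
        System.out.println(toTurtle("0001", "2007-06-15T14:03:00+09:00",
                35.69000, 139.70000, "http://example.org/product/4901234567894"));
    }
}
```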
4.3.4 Real-space Collaborated Presence Service

Presence sharing is one of the most popular applications in the ubiquitous computing environment. If buddies’ presence is shared, they can know more about each other and can use the presence as a trigger for new communication. The University of Tokyo and KDDI R&D Labs. developed a TV presence sharing system between smart spaces and mobile phones. The configuration of the presence service is depicted in Figure 12. Each user has his or her own presence server in which his or her TV presence is collected and kept. The presence information is exchanged between presence servers according to the buddy lists, using a SIP-based protocol. In their implementation, a smart remote control box in a smart space collects the presence information indicating which TV program is being watched by a user and uploads it to a presence server, while a mobile phone can also collect the presence of buddies through the user’s presence server.
Fig. 10 Lifelog: collecting users’ daily information with mobile phone
Fig. 11 Mobile phone with RFID reader: passive RFID tags can be read out with a mobile phone
An example sequence is as follows: when user A publishes the presence “I am watching TV program X” through A’s presence server, user B can retrieve and view user A’s presence through B’s mobile phone. User B may then start to watch the same TV program X on B’s mobile phone, or may power on the TV using the mobile phone as a TV remote controller.
Fig. 12 Real-space collaborated presence service
4.3.5 “Kansei” Service

The “Kansei” (“feelings” in Japanese) service is a prototype system developed by Kyushu Institute of Technology to find users who share similar feelings. In the “Kansei” service, each user is given a set of vectors representing his or her feelings, which is uploaded to a P2P (peer-to-peer) network, PIAX (see 11). During a query process in the P2P network, the system automatically controls and guides query propagation toward network nodes whose feeling vectors resemble that of the original query node. Thus the user who originated the query can easily locate users (P2P network nodes) with similar feelings [17].
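One way such guided propagation can be decided is by a simple similarity test between feeling vectors, as sketched below; the cosine measure, the threshold and the vector layout are assumptions made for illustration and are not PIAX or “Kansei” specifics.

```java
// Illustrative similarity test for deciding whether to forward a query
// carrying the originator's feeling vector to a given peer.
public final class FeelingSimilarity {

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static boolean shouldForward(double[] queryVector, double[] peerVector) {
        return cosine(queryVector, peerVector) > 0.8;   // assumed threshold
    }

    public static void main(String[] args) {
        double[] query = {0.9, 0.1, 0.4};
        double[] peer  = {0.8, 0.2, 0.5};
        System.out.println(shouldForward(query, peer));  // true for these example vectors
    }
}
```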
5 Field Trials

In this section, several field trials using developed prototypes are reported.
5.1 Live Commerce Akiba

“Live Commerce Akiba!” was jointly held with UNS2007 in the city of Akihabara, Tokyo, and operated by Keio University and the University of Tokyo. Keio University developed an extended smart furniture sensor board (see 4.2.2) to measure the popularity of goods in stores. Sensors were installed on goods; if a user touches or picks up an item, a sensor detects the motion and interprets it as a sign of interest (Fig. 13). A signboard showed the accumulated interest value with photos of the goods, allowing users to learn which goods were popular. At the same time, the data were sent to the Point of Sale (POS) system of the store. Mobile phones and figures were the goods displayed for this trial.
Fig. 13 Live Commerce Akiba: measuring the popularity of goods
5.2 Temperature Monitoring with Airy Notes

The “Airy Notes” system monitors environmental conditions using tiny sensor modules with wireless communication capabilities. Each sensor module can measure environmental parameters including temperature, and the data are collected and sent to the sink node. The obtained characteristics of the target area are shown on a map and can also be viewed by mobile phone. The Keio University team conducted a field trial with about 160 sensors at Shinjuku Gyoen Garden, which is located very close to the Shinjuku area, a highly urbanized part of Tokyo. The results of the field trial demonstrated that the temperature in the garden area was lower than at the edge of the garden, verifying the cooling effect of greenery in the urban area [3] (Fig. 14).
5.3 Lifelog Trial

The Lifelog system was introduced in 4.3.3. In 2007, a field trial was conducted in Tochigi prefecture with the Hakuyo Agricultural High School to promote e-agriculture. The Lifelog provided support by using mobile phones to record farming processes, i.e., when crops were planted, when and how much fertilizer or chemicals were used, when the crops were harvested, etc. (Fig. 15).
Fig. 14 Airy Notes field trial in Shinjuku Gyoen
processes transparent to consumers through the Lifelog, consumer confidence in the quality of farm products may increase.
Fig. 15 Lifelog trial for e-agriculture: work records are displayed in a mashed-up format
6 Creation of Videos
6.1 Motivation
The Ubila project created two videos, “Small Stories in 2008” in 2003 and “Aura” in 2006, which are available at http://www.ubila.org/e/e_index.html. The word “ubiquitous” was not well recognized in Japan when the Ubila project started in 2003. To promote the concept of “ubiquitous computing” to the general public by showing how ubiquitous technologies can implicitly and invisibly support daily life, Ubila members joined a special working team of the Society of Ubiquitous Information1 and created promotion videos. The process of making the videos itself played an important role because it created a common vision of ubiquitous computing that could be shared by all members. The rough process consisted of four steps: (1) retrieval of knowledge about the target domain, (2) visualization of the structure of the obtained knowledge, (3) creation of scenarios, and (4) production of the video. In steps (1) and (2), a workshop was held in which 14 participants brainstormed futuristic topics by viewing about 200 randomly chosen news articles. The futuristic topics were treated as scenario seeds, and all participants generated scenarios under specific future situations, for example, how enhanced sensor networks that can monitor even the details of people’s activities would affect young people’s daily lives in Shibuya, a lively district of Tokyo. Those topics and the derived scenarios were integrated into a visionary map indicating their mutual relationships. Finally, a consolidated video scenario was composed based upon the map using a toolkit [18]. The significance of this process was to create an aggregated future vision by converting the informal knowledge of experts into formal expressions. The final scenario reflected such knowledge, i.e., collective intelligence, as well as the authors’ own views.
6.2 First Video: Small Stories in 2008
The first video, made in 2003, was called “Small Stories in 2008” and consists of three short stories based on technologies expected to be available in 2008. The first, “Love Triangle,” featured an advanced urban communication infrastructure for exchanging location and/or presence information among friends or communities. The second, “First Love,” featured community computing in an educational environment. The last one, “The Reunion,” described the extension of communication and service resources in a nomadic environment.
1 Private study group of ubiquitous computing in Japan headed by Tomonori Aoyama, Keio University
Fig. 16 Video: Small Stories in 2008 (2003)
6.3 Second Video: Aura
The second video, “Aura,” was made in 2006. Its intention was to convey to people how we envisage the era of 2015, considering the progress of ubiquitous and sensing technologies. The working group took the same approach as for “Small Stories in 2008.” They chose three topics as the basis of the story: enhancement of personal skill and performance, activation of community communications, and improvement of social welfare. “Aura” depicted convenient and comfortable future lifestyles, but it also showed a life surrounded by many sensors and devices, posing questions about who has the right to capture and use the sensed data and how to protect people’s privacy in such environments.
Fig. 17 Video: Aura (2006)
7 Building and Operating Smart Spaces
To give people chances to see the output of its R&D, the Ubila project set up three smart spaces, two in Tokyo and one in Kokura, Kitakyushu. The spaces ran from 2005 to 2007, and the Ubila project periodically conducted open demonstrations in them to show its latest developments. All smart spaces were interconnected with the Japan Gigabit Network 2 (JGN2), a national open high-speed testbed network [6].
7.1 Akihabara Smart Space
Akihabara, one of the largest areas in Japan for buying various kinds of electronics, computers, software, and animation-related goods, is also known as a place of collaboration between academia and industry. The Ubila project installed a smart space in the Akihabara Daibiru building in the center of Akihabara. The space, which was mainly operated by the University of Tokyo, was designed by Hiroshi Shoji and had a futuristic atmosphere, with mesh-type ceilings and areas separated by many portable poles (Fig. 18). The poles were hollow, so sensors, various devices, and power and network cables could be freely installed inside them. These flexible features of the space enabled various kinds of field trials and demonstrations using the developed R&D outcomes. The space was also used for joint activities such as collaboration with the Ubiquitous Networking Forum, whose Sensor Network Committee jointly conducted demonstrations with the Ubila project [29]. A total of 14 demonstrations were conducted.
7.2 Yurakucho Smart Space (uPlatea)
The Ubila project set up another demonstration and experiment space in Yurakucho, Tokyo, named “uPlatea” [31] (“platea” means a plaza or place in Latin). uPlatea was designed to invisibly embed many computers, sensors, and network devices in harmony with conventional furniture. Since uPlatea could serve as a smart meeting space, a smart public space, or a smart living space, it could host ubiquitous service experiments in various life scenes. Smart panels called “uTextures” (see Section 4.2.2) were placed in this environment; they were flexibly configurable and could work in many ways, such as multi-tiled displays, CD music players, or smart bookshelves. Ten demonstrations were conducted, welcoming hundreds of visitors.
7.3 Kitakyushu Smart Space
The third space was built in Kitakyushu, located hundreds of miles away from the other two spaces. This space served as an experimental network hub station in which JGN2, the Science Information Network (SINET), a commercial ISP, and the Kyushu Institute of Technology were interconnected to realize a multi-homed environment. In the Kitakyushu smart space, several field trials and demonstrations were conducted on network self-configuration, wide-area distributed computing, on-demand network quality control for end-to-end QoS, and so on.
8 Network or Protocol Oriented R&D
Universal accessibility is a crucial feature of ubiquitous networks, even though basic accessibility is now well established through fixed broadband access and third-generation mobile services. However, ubiquitous networks comprise various devices, including sensors, actuators, robots, and terminals with a variety of communication capabilities. Therefore, an open issue is how such ubiquitous nodes can exchange data or information using a small number of protocols. Network control and management were also important research items of the Ubila project. The network status itself may be regarded as “network context.” For
Fig. 18 Akihabara Smart Space
Fig. 19 Yurakucho Smart Space: “uPlatea”
users to enjoy ubiquitous services using remote resources through networks, such network attributes as bandwidth, packet loss rate, and latency need to be controlled based on network context. In this section, we show some Ubila R&D activities from the viewpoint of network and protocol issues.
8.1 Protocol Enhancement: OSNAP and CASTANET
Sensor nodes are normally tiny devices that do not have full network access capability; TCP/IP may not be supported. A protocol called OSNAP (Open Sensor Network Access Protocol), which connects sensor nodes to the Internet, was proposed by the Sensor Network Committee of the Ubiquitous Networking Forum [20]. When TCP/IP is supported either at the sensor nodes or at a gateway, the data of the sensor nodes can be accessed with this protocol. OSNAP is based on the Simple Network Management Protocol (SNMP), and sensor nodes may be regarded as network nodes to be managed. Another concept, called CASTANET, was proposed by the University of Tokyo [8]. It requires that the access methods of all ubiquitous nodes comply with the Representational State Transfer (REST) architectural style, which is becoming popular on the Internet. With CASTANET, sensor data obtained from sensor nodes and commands for driving actuators can be handled with the HyperText Transfer Protocol (HTTP); an illustrative sketch of such REST-style access is given below.
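The sketch below shows what REST-style access to a sensor node over HTTP might look like, in the spirit of the CASTANET concept. The gateway URL, resource paths, and JSON fields are hypothetical; they are not the actual CASTANET interface.

```python
# Illustrative sketch only: REST-style sensor access over HTTP via a gateway.
# All URLs and fields are hypothetical.
import json
import urllib.request

BASE = "http://gateway.example.org/nodes/sensor-42"  # hypothetical gateway URL


def read_temperature() -> float:
    # GET the sensor resource; the gateway translates this into sensor-node access.
    with urllib.request.urlopen(f"{BASE}/temperature") as resp:
        return float(json.loads(resp.read())["celsius"])


def switch_actuator(state: str) -> None:
    # Drive an actuator by PUTting its desired state as a resource representation.
    body = json.dumps({"state": state}).encode()
    req = urllib.request.Request(f"{BASE}/switch", data=body, method="PUT",
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)


# Usage (requires a reachable gateway):
#   print(read_temperature()); switch_actuator("on")
```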
8.2 Adaptive Network Control Based on Network Context
Network context generally reflects the network state. The local network context may easily be obtained by measuring performance locally; however, measuring network context over wide inter-domain networks is difficult. In the Ubila project, the Kyushu Institute of Technology and KDDI R&D Labs. developed a network QoS measurement method based on network tomography. With this method, sparsely placed measurement nodes can estimate the global network performance and identify anomalies when they occur. Fujitsu developed QoS control and routing control systems using a network administration and control server (Fig. 20). The server reserves a certain volume of network resources in advance; when it receives a QoS control request from a user, it searches the reserved pool for an available resource and assigns it adaptively, based on the request level, across the network domains [19]. Fujitsu also implemented a web service interface to control network QoS. The interface is described with the Web Service Description Language (WSDL), and requests are sent via the Simple Object Access Protocol (SOAP). The system was tested and verified to work on the JGN2 network testbed. A minimal sketch of the reserve-then-assign behavior follows.
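The following sketch illustrates the reserve-in-advance, assign-on-request behavior of such a network administration and control server. The bandwidth classes, pool sizes, and request levels are assumptions for illustration, not Fujitsu's actual implementation or its SOAP interface.

```python
# Illustrative sketch only: assign reserved resources based on the request level.
from typing import Dict, Optional

# Bandwidth (Mbit/s) reserved in advance per service class (assumed values).
reserved_pool: Dict[str, float] = {"premium": 100.0, "standard": 300.0}

REQUEST_LEVELS = {"high": ("premium", 20.0), "normal": ("standard", 5.0)}


def handle_qos_request(level: str) -> Optional[str]:
    """Assign a reserved resource matching the requested level, if available."""
    if level not in REQUEST_LEVELS:
        return None
    service_class, needed = REQUEST_LEVELS[level]
    if reserved_pool[service_class] >= needed:
        reserved_pool[service_class] -= needed
        return f"assigned {needed} Mbit/s from the {service_class} pool"
    return None  # no reserved resource left; the request is rejected or downgraded


print(handle_qos_request("high"))    # assigned 20.0 Mbit/s from the premium pool
print(handle_qos_request("normal"))  # assigned 5.0 Mbit/s from the standard pool
```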
Fig. 20 Adaptive network control: network context is collected at the context collection server, and the QoS server controls network routing
8.3 Automatic Network Configuration
Network connection topology is a part of network context. Ubiquitous network nodes may sometimes move nomadically. One necessary feature of ubiquitous networks is that network nodes can be instantly connected and configured to a network, based on an assigned network policy and context, without complicated manual procedures. KDDI R&D Labs. developed a centralized IP router setup protocol called the Router Auto-configuration Protocol (RAP). If an IP router supports RAP, it can be configured into an IP network with few manual settings.
9 Defining Context Aware System Architecture
9.1 Ubiquitous Network Architecture and Context
As readers may know, a formal definition of context awareness remains elusive. In addition, the Ubila project often received requests to clarify the ubiquitous network architecture. Although many researchers discussed ubiquitous networks, few clearly defined context. The Ubila project established a study team on context awareness and network architecture, issued “A report on a ubiquitous network architecture” (in Japanese), and made it public through the Ubiquitous Networking Forum [30]. The proposed network architecture is shown in Figure 21. Its functional architecture consists of a ubiquitous network (in a narrow sense), a network control function, a service control function, context collection and dissemination functions, and an access open platform. The bottom line is that context has a central role in the system: by appropriately applying context to both service and network control functions, optimal service can be offered to users. The architecture was found to be well suited to all Ubila members’ service scenarios. The study then defined context based on Dey’s somewhat recursive definition, “any information used by context-aware services” [1], and classified context into primary and secondary parts. Primary context is raw, computer-readable data typically obtained from sensors or network monitors; RFID identifications, user input profiles, and the amount of network traffic are examples. Secondary context is data semantically inferred from primary context according to the target application; for example, the inventory status estimated from RFID data in a stock control application is secondary context. A minimal sketch of this primary/secondary distinction is given below.
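The sketch below illustrates the primary/secondary context distinction using the stock-control example: raw RFID reads (primary context) are turned into an application-level inventory status (secondary context). The tag identifiers and threshold are hypothetical.

```python
# Illustrative sketch only: deriving secondary context from primary context.
from typing import Dict, List

# Primary context: raw RFID tag reads on a stock shelf (one entry per tagged item).
rfid_reads: List[str] = ["tag-001", "tag-002", "tag-003"]

REORDER_THRESHOLD = 5  # assumed stock-control policy


def infer_inventory_status(reads: List[str]) -> Dict[str, object]:
    """Secondary context: an application-level interpretation of the raw reads."""
    count = len(set(reads))
    return {
        "items_on_shelf": count,
        "status": "low-stock" if count < REORDER_THRESHOLD else "ok",
    }


print(infer_inventory_status(rfid_reads))
# {'items_on_shelf': 3, 'status': 'low-stock'}
```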
Fig. 21 Proposed ubiquitous network architecture: applications sit on an access open platform above service and data control functions (service logic control, service support, session/call control, user administration, and context retrieval, dissemination, security, storage, transport, collection, conversion, and analysis functions), which in turn sit above network control functions (network configuration, analysis, and control) and the ubiquitous network itself; reference points RP0–RP16 interconnect these blocks and link them to other ubiquitous networks
9.2 Contributions to ITU-T
Based on the study of context awareness, Fujitsu, NEC, and KDDI proposed to ITU-T SG13 that “context awareness” be incorporated as a next generation network (NGN release 2) requirement. In the NGN release 1 requirements, “presence” was incorporated to show a user’s situation. The contribution proposed to treat “context awareness” as an expanded and more general concept than presence, one that could be used for various NGN service provisions. After discussion, the concept was understood, and the January 2008 SG13 meeting in Seoul agreed to incorporate this requirement in “Y.NGN-R2-Reqts.”
10 Symposium and Contributions to Academia and Fora
To demonstrate the R&D output of the Ubila project, the Ubiquitous Network Symposium (UNS) was held every year. The first UNS was held in 2004 at Odaiba, Tokyo, with about 300 participants. UNS2005 was held at the Kyoto International Conference Center, Kyoto. UNS2006 was not held independently but consisted of booths and a conference at CEATEC JAPAN 2006 at Makuhari, Chiba (Fig. 22). As a final event, UNS2007 was held in Akihabara in October 2007, where all ongoing ubiquitous national projects met to demonstrate their research results. During its lifetime, the Ubila project contributed to many academic activities. The biggest was Ubicomp2005, held in Tokyo [27]. Prof. Hide Tokuda,
Keio University, served as general chair, and many Ubila members and member organizations supported this successful conference, which was attended by 625 participants, a record for Ubicomp up to that time. In addition to Ubicomp, another international conference emerged from Japan: the International Symposium on Ubiquitous Computing Systems (UCS). UCS2003, the first UCS, was an intimate affair held in Kyoto. UCS2004 was held in Tokyo and welcomed international participants. In 2005, a workshop was held in Cheju, Korea, instead of UCS. In 2007, UCS2007 was jointly held with UNS2007, so participants could see the outcome of the Ubila project. The Ubila project also had close relationships with fora in Japan. The first was the “Ubiquitous Networking Forum,” created in 2002, which supported national projects by collecting research results and documents as candidate material for standardization and making them public. The “Network Robot Forum,” another forum activity, focused on network robot R&D. A protocol called CroSSML, which connects a ubiquitous network and a network robot, was proposed by the Ubila project, and its usefulness was verified within the forum.
Fig. 22 Booth of the Ubila project at CEATEC 2006
11 Related Activities and Future Trends in Japan
11.1 National R&D Projects in Parallel to Ubila
As shown in Figure 1, there are many ubiquitous-computing-related activities in Japan. Alongside the Ubila project, two other projects were operating in parallel. One was the Ubiquitous Authentication and Agent (UAA) project headed by NTT [26]. It conducted R&D on authentication technology in ubiquitous environments and developed a P2P platform named P2P Interactive Agent eXtensions (PIAX), on which many agents may run and exchange information on a peer-to-peer basis [7]. The other project was headed by the YRP Ubiquitous Networking Laboratory and called “A Study on Ultra Small Chip Networking Technology.” Its mission was to develop ultra-small network nodes, i.e., very small active identification tags. The developed tags were named “dice” and were roughly 10 mm cubes. These tags were deployed in a number of ubiquitous field trials. For identification, the project proposed a unique identifier for RFID called “ucode.”
11.2 Other Ubiquitous Computing Related R&D Projects in Japan
There have been several other major ubiquitous-computing-related projects in Japan. Before the Ubila project was launched, there were preceding ubiquitous computing projects. One was the “Cyber Assist Project for Ambient Intelligence,” conducted from 2001 at CARC (Cyber Assist Research Center) of AIST, the National Institute of Advanced Industrial Science and Technology [16]. It tried to realize ambient intelligence using IPT (Information Processing Technology) to resolve the issues of “information overload,” “digital divide,” and “privacy protection.” The other was the “Yaoyorozu” project [33]2, conducted from 2002 to 2005 and supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT). It took an approach of resolving issues through a combination or fusion of engineering, the humanities, and the social sciences. There are also ongoing projects related to ubiquitous computing. The “Free Mobility Assistance Project” has been conducted by the Ministry of Land, Infrastructure, Transport and Tourism (MLIT) since 2003 to support universal mobility; it has run a number of field trials nationwide to verify mobility support using RFID tags, one example being the “2009 Tokyo Ubiquitous Technology Project” performed in Ginza, Tokyo [25]. “Information Grand Voyage” is a large ongoing ICT project that aims to develop next-generation information retrieval and analysis technologies and to
2 The name “Yaoyorozu” is interesting since in classical Japanese it literally means “so many gods are everywhere and protecting us,” see http://www.8mg.jp/en/outline_origin01.htm
verify them through model services. It is supported by the Ministry of Economy, Trade and Industry (METI). The project started in 2007 and is planned to be completed in 2009 [2].
11.3 Next Phase
MIC Japan set 2010 as the target year for u-Japan [12]. By this time, it is expected that technical solutions should be readily available for resolving many current issues, including ensuring public safety, disaster prevention, preserving the environment, conserving energy, and supporting the elderly, the disabled, and children. These wide-ranging requirements led to a second-phase ubiquitous networking project for integrated platforms that provide adaptive ubiquitous services easily customizable by end-users. With these perspectives, a second-phase project called the “Cross UBIQuitous platform project” (CUBIQ) started in 2008 to build an integrated platform in which adaptive ambient capabilities are easily provided to end-users.
12 Conclusion
The Ubila project ended in March 2008. During its five years, more than 100 people were involved, and the project carried out unique activities through a close relationship among industry, academia, and government. The items described in this article are only samples of the Ubila R&D results; many of the network-related R&D achievements have been omitted. So far, 42 journal papers and 640 other articles have been published, and 108 patents have been filed. All Ubila members are convinced that the project was an important and significant step toward the realization of ubiquitous paradigms. Part of this work was performed under a research project of the Ministry of Internal Affairs and Communications (MIC) in Japan.
Acknowledgements The author expresses his sincere thanks to all Ubila project members, in particular the heads of all participating bodies: Tomonori Aoyama, Hideyuki Tokuda, Yuji Oie, Hiroyuki Morikawa, Hitomi Murakami, Tohru Asami, Yoshiaki Kiriha, Masafumi Kato, and Tohru Hasegawa. He also thanks Katsuya Watanabe, Yoshiaki Takeuchi, and Yasuo Tawara for their continuous support from MIC. Hiroshi Tamura and the members of the Society of Ubiquitous Information made significant contributions to creating the videos.
References
[1] Dey AK, Abowd GD (2000) Towards a better understanding of context and context-awareness. In: Proceedings of the CHI 2000, The Hague, Netherlands
[2] Information Grand Voyage (2007) http://www.igvpj.jp/index_en/project-outline/000about-the-information-grand/
[3] Ito M, Katagiri Y, Ishikawa M, Tokuda H (2007) Airy notes: An experiment of microclimate monitoring in Shinjuku Gyoen garden. In: INSS 2007, Braunschweig, Germany, pp 260–266
[4] Iwai M, Komaki A, Tokuda H (2005) Mo@i: A large display service platform enabling multi-zones serviceability. In: Ubicomp 2005 Workshops: W6 The Spaces in-between: Seamful vs. Seamless Interactions, Tokyo, Japan
[5] Iwamoto T, Suzuki G, Aoki S, Kohtake N, Takashio K, Tokuda H (2004) u-photo: A design and implementation of a snapshot based method for capturing contextual information. In: Pervasive 2004 Workshop on Memory and Sharing of Experiences, Vienna, Austria
[6] JGN2 (2004) http://www.jgn.nict.go.jp
[7] Kaneko Y, Harumoto K, Fukumura S, Shimojo S, Nishio S (2005) A location-based peer-to-peer network for context-aware services in a ubiquitous environment. In: SAINT2005 Workshop, Trento, Italy
[8] Kawahara Y, Kawanishi N, Ozawa M, Morikawa H, Asami T (2007) Designing a framework for scalable coordination of wireless sensor networks, context information and web services. In: IWSAWC 2007, Toronto, Canada
[9] KDDI News Release (2006) http://www.kddi.com/corporate/news_release/2006/0927/
[10] Kohtake N, Ohsawa R, Yonezawa T, Matsukura Y, Iwai M, Takashio K, Tokuda H (2005) u-texture: Self-organizable universal panels for creating smart surroundings. In: The Seventh International Conference on Ubiquitous Computing (UbiComp2005), Tokyo, Japan, pp 19–36
[11] MIC report (2002) http://www.soumu.go.jp/joho_tsusin/eng/Releases/Telecommunications/news020611_3.html
[12] MIC u-Japan (2005) http://www.soumu.go.jp/menu_02/ict/u-japan_en/r_s0421.html
[13] MIC u-Japan (2006) http://www.soumu.go.jp/joho_tsusin/eng/Releases/Telecommunications/news060908_1.html
[14] Minami M, Fukuju Y, Hirasawa K, Yokoyama S, Mizumachi M, Morikawa H, Aoyama T (2004) Dolphin: a practical approach for implementing a fully distributed indoor ultrasonic positioning system. In: Ubicomp2004, Nottingham, England
[15] Morikawa D, Honjo M, Yamaguchi A, Ohashi M (2005) A design and implementation of a dynamic mobile portal adaptation system. In: WPMC2005, Aalborg, Denmark, pp 1291–1299
[16] Nakashima H (2007) Advances in Ambient Intelligence, Frontiers in Artificial Intelligence and Applications, vol 164, IOS Press, chap Cyber Assist Project for Ambient Intelligence, pp 1–20
[17] Ohnishi K, Yoshida K, Oie Y (2007) P2P file sharing networks allowing participants to freely assign structured meta-data to files. In: INFOSCALE, Suzhou, China
[18] Ohsawa Y (2003) Chance Discovery, Springer Verlag, chap 18, pp 262–275
[19] Okamura A, Nakatsugawa K, Yamada H, Suga J, Takechi R, Chugo A (2006) Optimum route control method based on access and core network status. In: World Telecommunications Congress 2006
[20] OSNAP (2008) http://www.ubiquitous-forum.jp/documents/osnap/index.html (in Japanese)
[21] Saruwatari S, Kashima T, Minami M, Morikawa H, Aoyama T (2005) Pavenet: A hardware and software framework for wireless sensor networks. Transaction of the Society of Instrument and Control Engineers E-S-1(1):74–84
[22] Si H, Kawahara Y, Morikawa H, Aoyama T (2005) A stochastic approach for creating dynamic context-aware services in smart home environment. In: Demo session of the 3rd International Conference on Pervasive Computing (Pervasive 2005), Munich, Germany
[23] Suzuki G, Aoki S, Iwamoto T, Maruyama D, Koda T, Kohtake N, Takashio K, Tokuda H (2005) u-photo: Interacting with pervasive services using digital still images. In: The 3rd International Conference on Pervasive Computing (Pervasive 2005), Munich, Germany
[24] Suzuki M, Saruwatari S, Kurata N, Morikawa H (2007) Demo abstract: A high-density earthquake monitoring system using wireless sensor networks. In: SenSys2007, Sydney, Australia, pp 373–374
[25] Tokyo Ubiquitous Technology Project (2009) http://www.tokyo-ubinavi.jp/en/tokyo/index.html
[26] UAA (2006) http://www.uaa-project.org/
[27] Ubicomp2005 (2005) http://www.ubicomp.org/ubicomp2005/
[28] Ubila (2003) http://www.ubila.org/
[29] Ubiquitous Networking Forum (2002) http://www.ubiquitous-forum.jp/
[30] Ubiquitous Networking Forum (2003) A book of dokodemo. Tech. rep., Ubiquitous Networking Forum
[31] uPlatea (2005) http://www.ht.sfc.keio.ac.jp/htweb/rg/pamphlet2008.pdf
[32] Weiser M (1991) The computer for the twenty-first century. Scientific American 265:94–104
[33] Yaoyorozu Project (2002) http://www.8mg.jp/en/index.html
Ubiquitous Korea Project
Minkoo Kim, We Duke Cho, Jaeho Lee, Rae Woong Park, Hamid Mukhtar, and Ki-Hyung Kim
1 Introduction
This chapter consists of four main sections, each introducing a major project related to ubiquitous computing technology that has been conducted by the Korean government. In the second section, we introduce the UCN (Ubiquitous Computing Network) project, in which service convergence solutions have been developed to design and manage human-centered composite services. In the third section, we introduce a project on intelligent service robots in ubiquitous environments. The fourth section describes several projects related to ubiquitous health. Finally, we introduce a project that focuses on the architecture of USN (Ubiquitous Sensor Network) technology for nation-wide monitoring.
Minkoo Kim, Division of Information and Computer Engineering, Ajou University, Suwon, Gyeonggi, South Korea, e-mail: [email protected]
We Duke Cho, UCN (foundation of ubiquitous computing and networking), President of UCN
Jaeho Lee, School of Electrical and Computer Engineering, The University of Seoul, Seoul, South Korea, e-mail: [email protected]
Rae Woong Park, Department of Medical Informatics, Ajou University, Suwon, Gyeonggi, South Korea, e-mail: [email protected]
Hamid Mukhtar, Division of Information and Computer Engineering, Ajou University, Suwon, Gyeonggi, South Korea, e-mail: [email protected]
Ki-Hyung Kim, Division of Information and Computer Engineering, Ajou University, Suwon, Gyeonggi, South Korea, e-mail: [email protected]
2 Ubiquitous Computing Network
2.1 Project Overview
UCN (Ubiquitous Computing & Network) is a long-term governmental R&D project supported by the Ministry of Knowledge Economy of the Republic of Korea. It began in 2003 and runs for 10 years, with the Korean government investing $139.5 million. It envisions the “realization of Ubiquitous Smart Space for life care” and aims at developing core fundamental technologies based on Community Computing, along with system solutions in the well-being and safety domains.
2.2 Vision: Ubiquitous Smart Space (USS)
UCN has designed USS as a technological architecture for a knowledge-based u-society. USS is a user-centered convergence space that fuses physical, virtual, and logical spaces to provide goal-oriented services. It is characterized by situation awareness, autonomic collaboration, and self-growing. In USS, users can achieve their own goals through autonomic collaboration among computing entities, and services are continuously optimized by learning changes in situation and process.
2.3 Reference System Architecture and Deployment of USS
Fig. 1 and Fig. 2 depict the reference system architecture and the deployment of USS, respectively.
2.4 R&D Strategy
To achieve the above vision, UCN follows four strategic approaches:
• Multi-disciplinary research to analyze social and economic needs and to develop u-Services for the coming society
• Concentration on core technologies: intelligent system components, smart objects, and a service convergence platform
• Evolution of the intelligence level with a view to situation awareness, autonomic collaboration, and self-growing
• Enhancement of technology competence through collaboration with global networks
Fig. 1 Reference Architecture of USS
Fig. 2 Deployment of USS
2.5 Actual Results
UCN has attained strong results through its efforts and strategy since 2003. As of 2007, it had delivered 489 papers, 235 patents (domestic and overseas), 7 technology transfers, and 39 prototypes.
2.6 Sub-projects
UCN has four categories of sub-projects, as follows.
System Design & Integration
• S1a: UCASE (Space Engineering) — Technology that allows a space to be composed and directed rapidly and cooperatively, and handled by users, by applying pre-provided space contents or following a pre-defined space composition.
• S1b: U-Commerce, U-Media, U-Finance — Value-improved and privacy-enhanced economic activities in a ubiquitous environment using seamless communication.
• S2: CDTK (Community Development Tool Kit) — Generates USS services based on communities consisting of computing elements.
• S3a: USPi (Ubiquitous Smart Space Platform Initiative) — A u-system platform that supports the development of index-based, situation-aware, and autonomous converged services for human life care.
• S3b: Well-being Life Care System — Calculates a well-being index, without constraining the user, from activities of daily living (ADLs) and provides life care services to prevent metabolic syndrome.
• S3c: Intelligent System for Public Safety — A public safety system that supports context-aware and secure management of various dangerous situations using public infrastructure such as CCTV.
Middleware
• B1a: Intelligent middleware platform based on context-aware agents — Provides users with context-aware services using intelligent context-aware agents.
• B1b: Group-Aware Software Infrastructure for Heterogeneous Multiple Smart Spaces — High-performance context management for users and user groups; dynamic service reconfiguration supporting user mobility and collaboration; continuous service interaction for heterogeneity and self-evolution; scalable semantic service discovery considering group context.
Smart Objects
• O1a: UMO (Ubiquitous Mobile Object) Framework — Provides the most suitable service to the customer by sharing resources and context; the final goal is to develop a S/W stack and H/W platform for UMO.
• O1b: CAMAR (Context-Aware Mobile Augmented Reality) Companion — Augments personalized contents and enables users to share them with others through their mobile devices. It manages users’ contextual information, visualizes personalized contents on mobile devices, and enables users to attach their contents to physical objects and share them within CAMAR communities.
• O2: SMeet (Smart Meeting Space) System — Supports efficient and easy collaboration with large, heterogeneous networked displays reflecting users’ intentions.
• O3: Personalized Well-being Life Care Service — Manages physical disease and mental stress levels by analyzing the user’s biological information to promote a well-being life.
• O4a: High-tech human activity recognition system based on integrated WSN — Performs robust and accurate human activity recognition and reasoning in wide, complicated environments by integrating a Visual Sensor Network (VSN) and a Wearable Sensor Network (WSN).
• O4b: Smart-object interactive human activity sensing device — Stores activity data using a small-sized sensor with a radio; the data can later be analyzed to estimate the amount of physical activity.
Networking
• B2: Environment-adaptive scale-free uPAN — Varies the data rate (20 Kbps to 400 Mbps) and transmission parameters according to environmental changes.
• B3: Intelligent Equipment (u-Zone Master) and OS for u-Zone Community Network — Self-constructed to satisfy community members’ requirements (self-configuration, optimization, quality) without an Internet infrastructure.
2.7 Test Bed
UCN operates a test bed, located at Ajou University, which is equipped with the system integration platform and smart objects. It demonstrates a future life focused on intelligent well-being and safety services based on life indices.
3 Intelligent Service Robots in Ubiquitous Environment
3.1 Introduction
The intelligent service robot project is one of the ten next-generation driving forces in industry selected by the Korean government in 2003. The project is being carried out in three stages over a ten-year period ending in 2013, by which time it aims to produce household service robots we can live together with. For the past five years, the focus has been on securing core technology and building foundations for the intelligent service robot. The development of intelligent service robots is characterized by the demand for integration of various technologies, including artificial intelligence, intelligent human-robot interaction, and the ubiquitous network [23]. One of the key technologies we have developed for this grand project is an intelligence integration technology that eases the immense complexity of integrating numerous components and pieces of information into one consistent unit.
An intelligent service robot in a ubiquitous computing environment interacts with humans and the environment to provide services for the user using its sensors and manipulators. As an intelligent system, the robot needs to adapt to a changing environment, and the quality of its services should increase over time. Such a robot is a complex distributed system, as it asynchronously interacts with various elements in the environment using its sensors and manipulators. Furthermore, the robot itself consists of multiple components on distributed computers to provide enough computing power for real-time requirements. As well as the integrated control of distributed functional components, an effective integration of data or information in the robot is also crucial for consistent and synergetic interoperation of the components. Without proper organization of control and data for both internal and external components and orderly interactions with the environment, the overwhelming complexity of the intelligent system would deter effective integration into a unified system.
The complexity of the intelligent service robot can be managed by a proper level of abstraction. Abstraction is the key to coping with the complexity and heterogeneity of intelligent service robots in ubiquitous environments. Although the complexity remains the same, abstraction allows developers to focus on a few important concepts by factoring out details. Abstraction provides a perception of simplicity to developers so that they can conceptualize at a high level and design large complex systems like intelligent service robots.
Abstraction in general can be classified into control abstraction and data abstraction. Control abstraction is the abstraction of actions, while data abstraction is the abstraction of information or knowledge. The result of control abstraction is often represented as a framework, while that of data abstraction is an ontological data model. In this article, we present an agent-based control framework based on the traditional three-layer robot architecture [10] to provide an appropriate level of control abstraction enabling the integration of knowledge and functionalities for intelligent service robots. The framework is designed by borrowing the core concepts of service-oriented modeling [3] and model-driven architecture [14] to provide data abstraction in terms of an ontological definition of shared knowledge. In the following sections, we first motivate our approach to control and data abstraction for an intelligent service robot and then present the framework with its design rationale and experimental results.
3.2 Task and Knowledge Management Framework
The comprehensive range of diverse functionalities and the unavoidable complexity in developing intelligent service robots call for a unified but simple control of the functional components. The diverse functionalities include recognition, speech understanding, manipulation, expression, navigation, reasoning, decision making, planning, and so on. The most important lesson learned from the past five years of development of intelligent service robots is the significance of the framework for handling the complexity of integration. Our framework is designed based on the following observations about intelligent service robots.
• Intelligent system: An intelligent service robot is by definition an intelligent system. According to Erman [7], intelligent systems (1) pursue goals which vary over time, (2) incorporate, use, and maintain knowledge, (3) exploit diverse, ad hoc subsystems embodying a variety of selected methods, (4) interact intelligibly with users and other systems, and (5) need to be introspective and aware of their progress in applying their knowledge and subsystems in pursuit of their goals.
• Heterogeneous system: An intelligent service robot consists of multiple functional components that are heterogeneous in terms of implementation languages and platforms.
• Distributed system: The components and information to be integrated in an intelligent service robot may be distributed over multiple computers, typically single-board computers, to satisfy the required processing power and timeliness and to properly handle parallel and layered activities.
• Evolving system: The performance goals for an intelligent robot system are continually increasing in complexity.
Fig. 3 The Task and Knowledge Manager (TKM) in the deliberative layer of the underlying three-layer robot architecture consists of two parts: the Task Manager (TM) provides an interface to the Control Components in the sequencing layer, giving a consistent and unified control view of the functional components; the Knowledge Manager (KM), on the other hand, provides a data interface as an information integration framework so that the distributed and heterogeneous components can share information and interact consistently.
Our basic control abstraction is based on the three-layer architecture [10], which consists of three components: a reactive feedback control mechanism (reactive layer), a mechanism for performing time-consuming deliberative computations (deliberative layer), and a reactive plan execution mechanism (sequencing layer) that connects the first two layers (Fig. 3). These components run as separate computational processes, which is essential for three-layer architectures. In addition to the control abstraction provided by the three-layer architecture, we developed a simplified abstract control interface reflecting the characteristics of intelligent service robots identified above: an intelligent, heterogeneous, distributed, and evolving system (see Section 3.5). Our architecture is basically a knowledge-based architecture for the integration of both control and information, as shown in Fig. 3. The agent-based Task Manager (TM) works as an integration middleware and provides a consistent and unified control view of the functional components, which may be distributed over a network. The Knowledge Manager (KM) provides an abstract interface, called Knowledge Source, to a common data repository for the distributed components so that they can share information or knowledge. The common data repository is designed based on the blackboard architecture model [2, 7, 4].
3.3 Task Manager
As shown in the left-hand side of Fig. 3, the Task Manager resides in the deliberative layer of the underlying three-layer robot architecture and provides an abstract control interface to the Control Components in the sequencing layer. Control Components are components designed to interact with the Task Manager. Robot software components implement the functionalities required for the intelligent robot to perceive, reason, and act. These components may be assembled as needed to build a Control Component that has an interface to the Task Manager. Several Control Components can run at the same time. The Task Manager may start, stop, suspend, and resume the execution of Control Components as shown in Fig. 4. Control Components, on the other hand, can asynchronously send information on local events to the Task Manager at any time using the notify interface, and can query the Task Manager to obtain information. A minimal sketch of this interface is given below.
Fig. 4 The Task Manager (TM) can start, stop, suspend, and resume a Control Component (CC); a CC can notify the TM of events asynchronously and query the TM to obtain information.
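In the sketch below, the class and method names mirror the operations described in the text (start/stop/suspend/resume on the Control Component side, notify/query toward the Task Manager), but the code is an illustrative assumption, not the project's actual middleware API.

```python
# Illustrative sketch only: the TM/CC control interface described above.
from typing import Any, Dict, List


class TaskManager:
    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []
        self.facts: Dict[str, Any] = {"battery_level": 0.8}

    def notify(self, source: str, event: Dict[str, Any]) -> None:
        """Asynchronous event report from a Control Component."""
        self.events.append({"source": source, **event})

    def query(self, key: str) -> Any:
        """Synchronous information request from a Control Component."""
        return self.facts.get(key)


class ControlComponent:
    def __init__(self, name: str, tm: TaskManager) -> None:
        self.name, self.tm, self.state = name, tm, "stopped"

    # Lifecycle operations invoked by the Task Manager.
    def start(self) -> None: self.state = "running"
    def stop(self) -> None: self.state = "stopped"
    def suspend(self) -> None: self.state = "suspended"
    def resume(self) -> None: self.state = "running"

    def on_local_event(self) -> None:
        # Report a local event and consult the TM before acting.
        self.tm.notify(self.name, {"type": "person_detected"})
        if self.tm.query("battery_level") > 0.2:
            pass  # e.g. trigger navigation toward the person


tm = TaskManager()
navigator = ControlComponent("navigator", tm)
for command in (navigator.start, navigator.on_local_event, navigator.stop):
    command()
print(navigator.state, tm.events)
```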
3.4 Knowledge Manager
As mentioned in the previous section, the Task Manager (TM) provides a consistent and unified control abstraction for the functional components. The Knowledge Manager (KM), on the other hand, provides a consistent and unified data abstraction for the components of the system. The Knowledge Manager works as a framework for integrating the information to be shared among the distributed and heterogeneous components of an intelligent service robot. Our Knowledge Manager is designed based on the blackboard architecture model [2, 7, 4]. In this model, a group of specialists, called Knowledge Sources (KSs), interact using the blackboard as their workspace, a global database containing all information to be shared among Knowledge Sources. The data on the blackboard thus need to be formalized so that Knowledge Sources can understand them, and simple enough for efficient accessing and matching
operations among the Knowledge Sources. To meet these requirements, we use the generalized-list expression (GL-expression). A GL-expression is a predicate consisting of a predicate name and the associated parameters in the form of (predicate-name p1 p2 ...), where the parameters p1 p2 ... can be a value, a variable, or another GL-expression. For the purpose of establishing a framework for data abstraction, we also created our own interface for KSs and interaction protocols on top of the basic blackboard model, as illustrated in Fig. 5.
Fig. 5 The Knowledge Source interface includes abstract data operations such as assert, retract, update, and match, and event operations such as post, subscribe, and unsubscribe. The blackboard can notify a Knowledge Source of the occurrence of a subscribed event.
A Knowledge Source interacts with other Knowledge Sources only through the blackboard. The blackboard in our Knowledge Management Framework provides the following abstract data interfaces for Knowledge Sources to access the blackboard:
• assert asserts a fact in the form of a generalized list.
• retract retracts a fact that matches the generalized list.
• update replaces the matched fact with a new fact.
• match finds a fact on the blackboard that matches the given generalized list to obtain bindings for the variables in the generalized list.
In addition to the data operations, a Knowledge Source can also interact with other Knowledge Sources using event operations through the blackboard:
• post raises an event to the blackboard. The posted event is then delivered directly to the Knowledge Sources that have subscribed to that event.
• subscribe is used by a Knowledge Source to subscribe to an event. When the subscribed event occurs, the blackboard asynchronously notifies the associated Knowledge Sources, releasing them from the burden and overhead of polling the blackboard to detect the occurrences of particular events.
A minimal sketch of these operations is given below.
3.5 Service Agent
Control abstraction and data abstraction can be further generalized to a higher-level interface, that is, an agent-level interface. Most agent interactions employ a message-based protocol such as KQML [8]; the protocol ensures coherence of interaction by imposing constraints on the communicated messages. In order to provide agent-level abstraction, we developed the notion of a service agent that has message-based agent interaction and an ontological domain model to provide the intended services, as shown in Fig. 6.
Fig. 6 A Service Agent (SA) provides data abstraction through an explicit domain model and control abstraction through agent-level interaction using an agent communication language. The types of agent communication include send (to send information to another SA), query (to query information from another SA), and request (to ask another SA to accomplish a goal).
The notion of a service agent enables us to build an open intelligent system in that each service agent has an explicit model and specification of its functions and information. The explicit model defines schema, instances, and services ontologically. The explicitness of the model allows us to incorporate new services into the system dynamically and eases the complexity of integration for intelligent service robots. Abstraction at the agent level provides data abstraction through the explicit domain model and control abstraction through the agent communication language. The overall relationship between Control Component, Knowledge Source, and Service Agent is depicted in Fig. 7. The basic unit implementing the functionalities of an intelligent service robot is the Control Component; a Control Component uses the KS interface to access the shared knowledge on the blackboard and the SA interface to become a Service Agent interacting with other Service Agents, as sketched below.
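The following sketch illustrates the agent-level send/query/request interaction in a simplified dictionary-based message format. The message fields and dispatch logic are assumptions for illustration; the actual system uses an agent communication language and ontological domain models.

```python
# Illustrative sketch only: agent-level messages with send/query/request types.
from typing import Any, Dict


class ServiceAgent:
    def __init__(self, name: str) -> None:
        self.name = name
        self.knowledge: Dict[str, Any] = {}

    def receive(self, msg: Dict[str, Any]) -> Any:
        kind = msg["type"]
        if kind == "send":       # accept information from another agent
            self.knowledge.update(msg["content"])
        elif kind == "query":    # answer an information request
            return self.knowledge.get(msg["content"])
        elif kind == "request":  # try to accomplish a goal on behalf of the sender
            return f"{self.name} working on goal: {msg['content']}"
        return None


fetcher = ServiceAgent("fetch-agent")
fetcher.receive({"type": "send", "from": "planner", "content": {"cup": "kitchen"}})
print(fetcher.receive({"type": "query", "from": "planner", "content": "cup"}))
print(fetcher.receive({"type": "request", "from": "planner", "content": "bring cup"}))
```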
Fig. 7 A Control Component in the sequencing layer is a component that has the control interface to the Task Manager. A Knowledge Source has a data interface (KS) to the blackboard, a shared place for information. A Service Agent has an agent interface (SA) to interact with other Service Agents.
3.6 Results and Conclusion
As an integration framework, the Task Manager has been deployed to the developers in the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Commerce, Industry and Energy of the Republic of Korea. As a framework for control abstraction, the effectiveness of the Task Manager has been experimentally proven; T-Rot, a bartender robot shown at the IT exhibition of the 2005 Asia-Pacific Economic Cooperation (APEC) summit, is an example. In order to efficiently experiment with the Knowledge Manager and Service Agent along with the Task Manager, we also developed a simulator (Fig. 8) and conducted experiments on various errand tasks that are appropriate for intelligent service robots. The experimental results are satisfactory in that the abstractions provided by the Task Manager, Knowledge Manager, and Service Agent together resulted in an effective design and implementation of intelligent service robots as open, evolving, distributed, intelligent systems in a ubiquitous environment. In summary, our framework for intelligent service robots provides the following features.
• Control and data abstraction over the three-layer robot architecture using the blackboard-based asynchronous interaction model.
• Simple high-level abstract interactions using the request and query protocols in the Service Agent.
• A flexible ontological message-based interface for both control and data abstraction.
In this section, we introduced an integration framework to support both control and data abstraction through the agent-based Task Manager (TM) and the blackboard-based Knowledge Manager (KM). While the Task Manager works as a middleware providing a consistent and unified control abstraction for the functional components, the Knowledge Manager provides a consistent and unified data abstraction through a common data repository for the distributed components in a robot. A Service Agent utilizing the control and data abstraction accomplishes agent-level abstraction. We
Fig. 8 A user and an intelligent service robot interact in the simulated ubiquitous environment. In real applications, the simulated environment may reflect the real world and could be used as an interaction medium in which the user can designate the object to fetch and lay out the steps to follow to complete the task.
believe that our abstraction framework will effectively serve the purpose of coping with the complexity of intelligent service robots.
Acknowledgements This work was supported by the University of Seoul.
4 U-Health Projects in Korea
4.1 Concepts of u-Health
A consensus on the definition of ubiquitous healthcare (u-health) is not yet available, and the distinctions from similar terms such as telemedicine, telehealth, e-health, tele-monitoring, and mobile health are somewhat unclear. U-health can be roughly defined as “an intelligent system that enables health and medical services to be provided to consumers anytime and anywhere by making use of sensing, information, and communication technology.” The extensions of time, space, providers, consumers, and services are key characteristics of u-health (Fig. 9).
Fig. 9 Scope of u-health. Source: Modified from SW Kang, Samsung Economic Research Institute. CEO information 602, 2007.
4.2 Why u-Health?
National medical expenditures are rising sharply, especially due to factors such as aging and chronic diseases, and building a cost-effective medical system is an international concern. National health insurance reimbursements for the elderly have increased abruptly, from 18.0% in 2000 to 26.8% in 2006 (National Health Insurance, 2006, 2007). National health insurance reimbursement for diabetes was 168 million USD in 2000 but reached 280 million USD in 2005. Medical demands have also changed with rising incomes and the resulting increase in interest in health: proactive health consumers, rather than passive patients, are increasing the demand for various health services. The Korean annual per capita medical expenditure of 303 USD in 1997 increased to 830 USD in 2004 (OECD, 2006). Annual outpatient visits per person rose from 10.8 days in 2000 to 14.7 days, and annual days of hospitalization per person increased from 0.88 days in 2000 to 1.33 days in 2006 (National Health Insurance, 2007). Nevertheless, people living on islands or in remote places still have difficulty obtaining medical services, and the quality of the medical services available to them remains low. Therefore, there is a strong need for the government to restrain rising national medical expenditure and increase the cost-effectiveness of medical and health services throughout the nation. Korea’s telecommunication infrastructure is well established; the broadband penetration rate and mobile communication infrastructure of Korea rank
among the top in the world (Fig. 11). The hospital information system penetration rate is also high (Fig. 12). From the health industry’s point of view, the purposes of the various u-health projects in Korea are to study and develop u-health service models integrating various technologies [WLAN, WiBro, HSDPA (high-speed downlink packet access)]; to set up commercialization plans for u-health service models by testing their technological and business feasibility; to pursue the development of relevant industries and balanced
Fig. 10 Changes in health service usage in Korea, 2000-2006. Source: National Health Insurance Corporation, Republic of Korea, 2007.
Fig. 11 Household access to broadband Internet in selected OECD countries. Source: Modified from OECD, DSTI/ICCP/IE(2007)4/FINAL
regional development by attracting the participation of medical centers, medical companies, communication service providers, and local governing bodies; create a new IT market; and facilitate technology development.
4.3 U-Health Projects in Korea
There were 54 projects related to u-health in Korea between 1998 and 2007, and thirty were still ongoing as of 2007 (Ministry for Health, Welfare and Family Affairs, 2007). Medical practice through telemedicine is rare in Korea, as medical law allows telemedicine to be practiced only between doctors. Therefore, most of the past pilot projects focused mainly on health counseling or consultations between doctors for second opinions (Fig. 13). Selected subjects of the u-health pilot projects in Korea include tele-PACS, telemedicine, emergency care services, ADHD (attention deficit hyperactivity disorder) screening services using Actiwatch, factory environment and workers’ health status monitoring services, remote health monitoring services for the elderly living alone through infrared sensors and power consumption monitoring, mobile health monitoring through wearable shirts, diabetes monitoring services using mobile phones, RFID-based drug management systems, and visiting nurse systems. Some of them are briefly described below.
• Wearable computer-based health monitoring service, 2006, Daegu City: A mobile vital sign monitoring service using wearable shirts. The heart rate,
Fig. 12 Adoption ratio of hospital information systems. Source: Health Insurance Review and Assessment Service (HIRA), 2005; RAND, 2005
respiration rate, and actinography of the subjects in motion were remotely monitored in real time.
• Mobile diabetes monitoring service: A diabetes management service in which a patient can measure his or her blood glucose level continuously using a glucometer-equipped mobile phone (diabetes phone). The results are transmitted to a central server through CDMA communication, the physician evaluates the change in the patient’s glucose level, and the results of the evaluation are fed back to the patient. Significant improvements were observed in the HbA1c of the subjects.
• RFID-based u-drug information sharing system pilot project, 2006: A 6-month pilot project funded by the Ministry of Information and Communication to facilitate transparent and efficient drug logistics, manage medicine history, guarantee genuine products, and prevent adverse drug events. The total budget was about 0.7 million USD. The project was continued as the “u-Drug total management system construction project” in 2007, through which the Ministry tried to audit the entire drug process, from manufacture and logistics to consumption in hospitals, using RFID technology. An RFID-tag auto-labeling system was introduced to pharmaceutical factories.
• Busan City u-healthcare project for public emergency and home healthcare service (2006-2007): This project was funded by Busan City and the Ministry of Information and Communication, with a total budget of 2.5 million USD. The project consisted of visiting nurse and emergency services.
– Visiting nurse service: The targets for the service were 105 disabled, elderly, and handicapped people. One emergency center, 4 health centers, and 36 visiting nurses participated in the project. Visiting nurses visited their patients with a mobile tele-monitoring device and a UMPC (ultra-mobile PC). The visiting nurse assessed and measured the patient’s
Fig. 13 Service categories of u-health pilot projects during 2003-2007. Source: U-healthcare National Survey, Ministry for Health, Welfare and Family Affairs, Republic of Korea. 2007
vital signs, peripheral blood glucose, and health status. The recorded data on the UMPC were transmitted to the central management server. The CDSS (clinical decision support system) on the UMPC supported the visiting nurse in proper patient management. Doctors in the center could evaluate the patients’ health status in real time, and health management plans could be adjusted to the measured health data. The service reduced administrative work and assisted visiting nurses in the systematic health management of their patients.
– u-Emergency service: For this service, 2 emergency centers, 10 ambulances, 6 medical centers, 30 medical workers, and 14 paramedics were involved. Portable telemedicine and tele-monitoring devices were installed in the 10 ambulances. Paramedics could identify patients with RFID cards, measure their vital signs (blood pressure, heart rate, respiration rate, and ECG), and discuss a patient’s health status with the doctors in an emergency center via videoconferencing over the HSDPA mobile communication network while in transit. A total of 742 patients participated in the project. However, the heavy weight and inaccuracy of the tele-monitoring devices, together with constraints due to medical law, limited the spread of the service.
Fig. 14 Busan City u-healthcare project service diagram.
4.4 Today's Reality to Overcome

There are numerous challenges for u-health to confront: legal restrictions, technical immaturity, system and service management difficulties, the lack of reimbursement by national medical insurance, experience limited to pilot projects, and so on. Korean medical law permits telemedicine to be performed only between doctors, and there is no use of telemedicine between nations. This is due to fears of adverse events caused by telemedicine devices or communication failures. The cost-effectiveness of u-health also remains unproven. Both insurance reimbursement and government incentives are essential for the spread of u-health, but neither is in place. Most pilot projects focus on chronic diseases, especially diabetes and hypertension; trials for other diseases are very rare. In spite of these many challenges, industry and the government are cooperating to promote u-health. A master plan for u-Healthcare R&D was issued in June 2008 by the Ministry for Health, Welfare and Family Affairs. In addition, the "Special Act on u-Health Promotion" is in the planning stages.
5 USN Technology for Nation-wide Monitoring

Wireless sensor networks (WSNs) consist of a large number of small, low-power, intelligent sensor nodes and are envisioned to change the way data are collected from the environment, providing a new paradigm for monitoring and controlling ambient environments. Major applications of WSNs lie in environmental monitoring and control, medicine, the military, and smart spaces. Network management refers to the process of managing, monitoring, and controlling the behavior of a network. A network management system generally provides a set of management functions that integrate the configuration, operation, administration, security, and maintenance of all elements and services of a network. A large number of management solutions exist for traditional networks, but these solutions are not directly applicable to WSNs because of the unique operational and functional attributes of WSNs; the challenges particular to these networks render traditional management techniques impractical. First, the occurrence of a fault is a problem in traditional networks but a "feature" of WSNs. In large-scale WSNs, i.e., networks of hundreds of thousands of nodes with severe resource constraints, faults occur frequently, and component maintenance or battery recharging is not an option. The sensor nodes are often considered disposable and may be used only once. In a few cases, as reported in [13], configuration errors and even environmental interference can cause the loss of an entire WSN before it ever starts to operate. Second, unlike traditional networks, whose primary goals are to minimize response time and provide detailed management information, sensor networks are designed with the primary goal of minimizing energy usage [22]. Optimizing the operational and functional properties of WSNs may require a unique solution for
each application problem [11]. A workable approach is to carry out management with minimal communication between network elements for monitoring purposes. The main task of WSN monitoring is to gather data about various network- and node-state attributes such as battery level, transceiver power consumption, network topology, wireless bandwidth, link state, and the coverage and exposure bounds of the WSN. Based on the monitored data, a sensor network management system should be able to perform a variety of management operations on the network elements, e.g., controlling the sampling frequency, switching nodes on and off (power management), controlling wireless bandwidth usage (traffic management), and reconfiguring the network to recover from node and communication faults (fault management). Third, sensor nodes are typically deployed in remote or harsh conditions, and the configuration of nodes in a WSN changes dynamically. Thus, a sensor network management system should allow the network to self-form, self-organize, and ideally self-configure in the event of failures, without prior knowledge of the network topology. In self-managed WSNs, after the nodes are deployed, say in an ad hoc manner, they wake up, perform a self-test, determine their location [21], and monitor their energy level (operational state), usage state, and administrative state. These activities are performed by management functions at the network-element level. Once the nodes have determined their locations, they can organize themselves into groups. Despite the importance of sensor network management, no generalized solution for WSN management exists [20]. Fourth, ad hoc and traditional networks are designed to run a large number of user applications; their components are therefore installed and configured to support many different kinds of services. WSNs, by contrast, are generally application-oriented: the wireless sensor nodes typically run a common application in a cooperative fashion. A network management system designed for WSNs should provide a set of management functions that addresses these unprecedented network and behavioral features. In this chapter we focus on the provision of a management framework for monitoring and controlling WSNs.
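As a concrete illustration of how monitored node-state attributes can drive such management operations, the minimal Python sketch below applies simple threshold policies to decide on power, traffic, and fault management actions. The attribute names, thresholds, and action labels are illustrative assumptions, not part of any deployed system.

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    """Monitored node-state attributes (a small subset, for illustration only)."""
    node_id: int
    battery_level: float      # remaining energy fraction, 0.0 .. 1.0
    link_quality: float       # 0.0 .. 1.0
    sampling_hz: float        # current sensor sampling frequency

def management_actions(state: NodeState) -> list[str]:
    """Derive management operations from monitored data (hypothetical thresholds)."""
    actions = []
    if state.battery_level < 0.10:
        actions.append("switch-off")                 # power management
    elif state.battery_level < 0.30:
        actions.append("reduce-sampling-frequency")  # prolong node lifetime
    if state.link_quality < 0.40:
        actions.append("reconfigure-route")          # fault/traffic management
    return actions

print(management_actions(NodeState(node_id=7, battery_level=0.25,
                                   link_quality=0.35, sampling_hz=10.0)))
# ['reduce-sampling-frequency', 'reconfigure-route']
```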
5.1 IP-USN as an Integrating Technology

IP-based Ubiquitous Sensor Networks (IP-USNs) are formed by dovetailing WSNs with the Internet Protocol (IP) network. These networks are at once unique and challenging, in the sense that they have acquired a niche that is very different from that of peer wireless technologies (Fig. 15). The management considerations for these networks depend heavily on the distinct characteristics that arise from their unique design. A management architecture for a USN must therefore reflect the differences between USNs, WSNs, and Mobile Ad hoc Networks (MANETs). Fig. 16 illustrates the
architectural, operational, and functional differences between WSNs, MANETs, and USNs. Architecturally, a USN differs from these networks mainly in its network elements and communication paradigm (shown through correspondence in the inset). The operational differences lie in the device roles within a USN, especially owing to user and application heterogeneity. The functional differences arise from application-specific form factors and code footprints. These distinctive characteristics of IP-USNs form the considerations for an IP-USN management system and pose specific requirements (Fig. 17). Each of these requirements has a direct impact on the design of the management system, as shown in Fig. 18. For instance, R.1 calls for support for new query types and formats, with query generation and processing support at devices in the Internet domain, at the intermediate gateway, and at the wireless devices in the WSN domain. R.2 mandates that query and response transmissions between end points be handled intelligently, under the constraints that channel behavior varies over wireless-cum-wired multi-hop paths and that intermediate devices may fail altogether to deliver such transmissions owing to a malfunction. R.3 signifies the importance of optimally mapping the management roles, viz. manager and managed agents, across the available devices in the USN. The differing form factors of devices in the Internet, at the gateway, and at the sensor nodes require (R.4)
Fig. 15 IP-USNs encompass all the scopes of connectivity
Fig. 16 Architectural presentations of WSNs, MANETs and IP-USNs (Inset: Communication paradigm)
Fig. 17 Considerations and requirements for IP-USN management system
the inclusion of proxies to translate complex syntax and bulky queries into lighter, manageable queries while keeping the semantics the same. Lastly, R.5 underpins the need to define new MIBs for low-, medium-, and high-end devices, such as a PAN Information Base (PIB) for sensor nodes and a MIB definition for the adaptation layer at the gateway [15].
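To make R.5 slightly more concrete, the short sketch below shows what a minimal PAN Information Base for a single 6LoWPAN sensor node could look like, together with a trivial local query function. The object names and values are illustrative assumptions and do not come from any standardized PIB or MIB definition.

```python
# A minimal, illustrative PAN Information Base (PIB) for one 6LoWPAN sensor node.
# Object names and values are hypothetical, not taken from a published MIB/PIB.
pan_information_base = {
    "pibPanId":         0x1A2B,   # IEEE 802.15.4 PAN identifier
    "pibShortAddress":  0x0007,   # 16-bit short address of the node
    "pibCoordinator":   False,    # whether the node acts as a coordinator
    "pibBatteryLevel":  0.72,     # remaining energy fraction
    "pibTxPowerDbm":    0,        # current transmit power
    "pibNeighborCount": 4,        # number of one-hop neighbors
}

def get_pib(attribute: str):
    """Tiny 'manager' query against the node-local information base."""
    return pan_information_base.get(attribute)

print(get_pib("pibBatteryLevel"))   # 0.72
```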
Fig. 18 Relationship between the IP-USN management requirements (overlay) and design elements (inlay)
The management purview is a good starting point for stating the goals of a network management system (NMS), including its protocol design. In the next subsection, we present the goals that form the basis of the proposed USN NMS.
5.2 Design Goals of WSN Management

Management applications designed for traditional networks are constrained mainly by performance and response time; when designed for sensor networks, they are constrained above all by the hardware limitations of the nodes. The goals of network management for sensor networks must therefore establish a clear and direct relationship with the mission-oriented design of sensor networks. The goals are defined as follows:
1. Scalability: Sensor nodes are assumed to be deployed in large numbers. The management system should be able to handle a large quantity as well as a high density of nodes.
2. Limited power consumption: Sensor devices are mostly battery-operated; therefore, the management applications should be able to run on the sensor nodes without consuming too much energy. Management operations should be lightweight on node-local resources in order to prolong each node's lifetime and thereby the lifetime of the network as a whole.
3. Memory and processing limitations: Sensor nodes have limited memory and processing power. Management applications need to be aware of these constraints and may impose only minimal overhead on the low-powered nodes for the storage and processing of management information.
4. Limited bandwidth consumption: The energy cost associated with communication is usually higher than that of sensing and processing, so management applications should be designed with this consideration in mind. Moreover, some sensor technologies may also have bandwidth limitations in the presence of strong channel impairments (a minimal sketch after this list illustrates such an energy- and bandwidth-aware policy).
5. Adaptability: The management system should be able to adjust to network dynamics and rapid changes in the network topology. It should be able to gather the reported state of the network and the topology changes, and it should handle node mobility, the addition of new nodes, and the failure of existing nodes.
6. Fault tolerance: WSNs differ from traditional networks. Sensor nodes may run out of energy, causing faults in the network; moreover, a node may enter sleep mode to conserve energy or may be disconnected from the sink node because of network partitioning. The management system should be aware of such dynamics and adjust accordingly.
7. Responsiveness: Network dynamics, such as the movement of nodes, must be reported quickly to the management system. The manager should be able to obtain the current state of the network and the changes in its topology.
8. Cost: The cost of each sensor node has to be kept low, not least by allocating minimal resources to the management system.
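The following minimal sketch shows how goals 2-4 might be embodied in a single polling decision made by the manager. The thresholds and the byte budget are purely illustrative assumptions, not values from any of the systems discussed here.

```python
def should_poll(node_battery: float, bytes_this_hour: int,
                hourly_byte_budget: int = 512) -> bool:
    """
    Decide whether the manager should poll a node now.
    Illustrates goals 2-4: spare energy, node resources, and bandwidth.
    Thresholds and the byte budget are assumptions for this sketch only.
    """
    if node_battery < 0.15:                    # goal 2: do not drain a nearly dead node
        return False
    if bytes_this_hour >= hourly_byte_budget:  # goal 4: cap management traffic
        return False
    return True

print(should_poll(node_battery=0.60, bytes_this_hour=100))  # True
print(should_poll(node_battery=0.10, bytes_this_hour=100))  # False
```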
5.3 LNMP Architecture

The LNMP architecture [12] was proposed to provide a management solution for 6LoWPANs (IPv6 over low-power wireless personal area networks) [15], which are a realization of the IP-USN concept. The system comprises an operational and an informational architecture for such networks. The management operations are carried out by entities within the 6LoWPAN in two successive steps. First, network discovery is performed before any management task in order to take a network-wide snapshot of resources; the architecture must therefore cater for device-state monitoring, since devices may be energy-starved, sleeping, alive but unavailable, and so on. After the devices have been discovered, the second step is the actual management of the available devices. During network discovery, all 6LoWPAN coordinators report the status of their subordinates to the gateway, up through their ancestor coordinators in the hierarchy; this constitutes the network initialization phase. After initialization, only updates regarding changes in subordinate states are reported to the gateway. The updates may be followed by an acknowledgement in order to improve the accuracy of device discovery. Besides network discovery, the operational architecture provides support for implementing the Simple Network Management Protocol (SNMP). On the severely limited bandwidth of 6LoWPANs, only thirty-three bytes of data can be sent, in the worst case, over the User Datagram Protocol (UDP).
Fig. 19 LNMP device monitoring procedure
Therefore, the architecture uses a proxy on the USN gateway in order to reduce the fragmentation incurred when sending management packets and to reduce traffic within the USN by answering requests for constant data directly from the gateway's object identifier (OID) translation table. For variable data, local lightweight management messages are sent to the devices by parsing the incoming SNMP
packets, and the replies are translated back into SNMP when they leave the 6LoWPAN. In Fig. 19, steps 1 and 5 are the SNMP request and response, respectively, and step 4 shows the local lightweight management messages. In the informational architecture, Management Information Bases (MIBs) are defined with the special characteristics of these networks in mind. The architecture focuses on approaches for reusing existing standard MIBs and defines an information base for the special characteristics of this type of USN.
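The proxy behavior described above can be summarized in a few lines: constant objects are answered directly from the gateway's OID translation table, while variable objects trigger a lightweight local exchange inside the 6LoWPAN whose reply is then translated back into an SNMP response. The sketch below is a schematic rendering of that flow with hypothetical OIDs and message formats; it is not the LNMP implementation.

```python
# Schematic sketch of an LNMP-style gateway proxy (hypothetical OIDs and messages).

# Constant data cached in the gateway's OID translation table.
OID_TRANSLATION_TABLE = {
    "1.3.6.1.2.1.1.1.0": "6LoWPAN sensor node, vendor X",   # sysDescr-like, constant
}

# OIDs whose values change and must be fetched from the node itself.
VARIABLE_OIDS = {
    "1.3.6.1.4.1.99999.1.1": "battery-level",   # enterprise OID: hypothetical
}

def query_node(node_address: str, attribute: str) -> str:
    """Placeholder for the local lightweight management exchange inside the 6LoWPAN."""
    return f"<value of {attribute} from {node_address}>"

def handle_snmp_get(oid: str, node_address: str) -> str:
    """Answer an incoming SNMP GET at the gateway."""
    if oid in OID_TRANSLATION_TABLE:
        # Constant data: reply from the gateway, no traffic enters the 6LoWPAN.
        return OID_TRANSLATION_TABLE[oid]
    if oid in VARIABLE_OIDS:
        # Variable data: send a lightweight local message, translate the reply to SNMP.
        return query_node(node_address, VARIABLE_OIDS[oid])
    return "noSuchObject"

print(handle_snmp_get("1.3.6.1.2.1.1.1.0", "node-07"))
print(handle_snmp_get("1.3.6.1.4.1.99999.1.1", "node-07"))
```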
5.4 Nationwide Deployment in Korea
Fig. 20 Deployment of IP-USN with KOREN Project
IP-USNs are deployed throughout Korea under the KOREN project [16]. Fig. 20 shows the deployment sites within Korea, and Fig. 21 shows the data collection mechanism for the IP-USN networks. The data from these networks are currently collected on a centralized server. The deployed testbed consists of IP-USN networks from two different vendors, and interoperability between the two vendor-specific networks has been tested. However, standardization of this technology is still in progress, and further standardization is needed to make networks from different vendors interoperable with one another.
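A minimal sketch of the centralized collection pattern used in the testbed might look as follows; the gateway identifiers, record format, and storage backend (an in-memory list standing in for a database) are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timezone

# In-memory stand-in for the centralized server's data store (illustrative only).
central_store: dict[str, list[dict]] = defaultdict(list)

def collect_reading(gateway_id: str, node_id: int, sensor: str, value: float) -> None:
    """Store one reading forwarded by an IP-USN gateway, tagged with arrival time."""
    central_store[gateway_id].append({
        "node_id": node_id,
        "sensor": sensor,
        "value": value,
        "received_at": datetime.now(timezone.utc).isoformat(),
    })

# Readings arriving from two vendor-specific gateways at different sites.
collect_reading("gateway-busan-01", 3, "temperature_c", 21.4)
collect_reading("gateway-seoul-02", 9, "humidity_pct", 48.0)
print({gw: len(readings) for gw, readings in central_store.items()})
```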
Fig. 21 Data Collection of the Sensor Data
References

[1] RAND. The state and pattern of health information technology adoption. Technical report, 2005.
[2] B. Hayes-Roth. A blackboard architecture for control. Artificial Intelligence, 26(3):251–321, 1985.
[3] M. Bell. Service-Oriented Modeling: Service Analysis, Design, and Architecture. John Wiley and Sons, 2008.
[4] D.D. Corkill. Blackboard systems. AI Expert, 6(9):40–47, 1991.
[5] National Health Insurance Corporation. 2005 national health statistical yearbook. Technical report, Republic of Korea, 2006.
[6] National Health Insurance Corporation. 2006 national health statistical yearbook. Technical report, Republic of Korea, 2007.
[7] L.D. Erman, J.S. Lark, and F. Hayes-Roth. ABE: An environment for engineering intelligent systems. IEEE Transactions on Software Engineering, 14(12):1758–1770, 1988.
[8] T. Finin, Y. Labrou, and J. Mayfield. KQML as an agent communication language. In J.M. Bradshaw, editor, Software Agents, pages 291–316. MIT Press, Cambridge, Mass., 1997.
[9] Ministry for Health, Welfare and Family Affairs. U-healthcare national survey. Technical report, Republic of Korea, 2007.
[10] E. Gat. On three-layer architectures. In Artificial Intelligence and Mobile Robots: Case Studies of Successful Robot Systems, pages 195–210. MIT/AAAI Press, Cambridge, 1997.
[11] H. Song, D. Kim, K. Lee, and J. Sung. UPnP-based sensor network management architecture. In ICMU Conf., Apr. 2005.
[12] Hamid Mukhtar, Shafique Ahmad Chaudhry, Kim Kang-Myo, Ali Hammad Akbar, Kim Ki-Hyung, and Seung-Wha Yoo. LNMP: Management architecture for IPv6-based low-power wireless personal area networks (6LoWPAN). In IEEE/IFIP NOMS, 2008.
[13] L.B. Ruiz, F.B. Silva, T.R.M. Braga, J.M.S. Nogueira, and A.A.F. Loureiro. On the impact of management in wireless sensor networks. In IEEE/IFIP NOMS, Apr. 2004.
[14] J. Miller and J. Mukerji, editors. Technical Guide to Model Driven Architecture: The MDA Guide v1.0.1, 2003.
[15] N. Kushalnagar, G. Montenegro, J. Hui, and D. Culler. 6LoWPAN: Transmission of IPv6 packets over IEEE 802.15.4 networks, September 2007.
[16] Korea Advanced Research Network. http://www.koren2.net.
[17] OECD. OECD Health Data 2006, 2006.
[18] OECD. DSTI/ICCP/IE(2007)4/FINAL, 2007.
[19] Health Insurance Review and Assessment Service (HIRA). National survey on adoption rate of hospital information system. Technical report, 2005.
[20] L.B. Ruiz. MANNA: A Management Architecture for Wireless Sensor Networks. PhD thesis, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil, Dec. 2003.
[21] Seapahn Meguerdichian, Sasa Slijepcevic, Vahag Karayan, and Miodrag Potkonjak. Localized algorithms in wireless ad-hoc networks: Location discovery and sensor exposure. In MobiHoc, Symposium on Mobile Ad Hoc Networking and Computing, pages 106–116, 2001.
[22] M. Welsh and G. Mainland. Programming sensor networks using abstract regions. In USENIX NSDI Conf., Mar. 2004.
[23] Woo-sung Shim, Sang-rok Oh, and Sang-deok Park. International Cooperation Agency for Korean IT, special report: Intelligent robots. IT Korea Journal, 14:16–27, 2005.
Index
3D, 37, 43, 44, 46, 57, 60–62, 456, 494, 537, 566, 827, 860, 882, 896, 947, 949, 952, 954, 993, 994, 999, 1003, 1004, 1077, 1081, 1083, 1086, 1093 3DML, 952, 954 802.11, 326, 682, 833, 838 802.11b, 121, 128 802.15.4, 121, 682, 1192 AAL service platform, 1172, 1173 AAL space, 1179, 1186, 1187 abnormal behavior detection, 685–687, 699 accessibility, 358, 423, 1119, 1128, 1208, 1212, 1224, 1246 action graph, 735, 737–739 Active Badge, 281, 290, 831, 855 Active Bat, 290, 826–828 active RFID, 720, 1234 activity recognition, 751, 752, 989, 993, 998, 999, 1001–1003, 1010, 1082, 1083, 1261 actuator network, 658 Adaboost, 487, 488, 497–501 adaptation, 214, 218–221, 223, 252, 354, 447, 449, 472, 560, 563, 567, 851, 853, 854, 1023, 1029, 1075, 1127, 1154, 1174, 1189, 1203, 1208, 1210, 1212, 1213, 1217, 1220, 1225, 1278 adaptive lighting, 1201, 1207, 1209, 1214, 1217, 1220, 1223 adaptive multimedia transport model, 903 affective computing, 354, 492, 568 agent based simulation, 865, 866 AI, 17, 259, 371, 613, 618, 645, 669, 670, 672, 680, 682, 684, 685, 693, 696, 1015 airy notes, 1241 AISE, 21 ALADIN, 1201
ALARM-NET, 318, 323 algorithm, 35, 37, 38, 40, 41, 43–48, 75, 78, 106, 119, 121, 123, 240, 295, 354, 355, 448, 457, 614, 622, 625–628, 651, 731, 732, 735, 736, 757, 806, 814, 829, 830, 835, 838, 869, 888, 895, 918, 1074, 1077, 1100, 1133, 1149, 1217–1219, 1221 Allen’s temporal logic, 614 ambient assisted living, 344, 639, 691, 1041, 1059, 1171 ambient communication, 796, 798, 800, 801, 813–816 ambient display, 858 Ambient Intelligence, 4, 229, 230, 232, 233, 245, 249, 250, 252, 283, 315, 364, 535, 541, 547, 610, 611, 613, 621, 622, 630, 635, 637, 648, 658, 693, 731, 852, 861, 873, 911, 912, 914, 920, 921, 924, 925, 928, 1039, 1040, 1048, 1051, 1139, 1142, 1171, 1173, 1179, 1204, 1252 ambient telephone, 801–803, 806, 809, 813 Amble Time Project, 922 AmI, 3, 16, 315, 443, 448, 560, 564, 637, 638, 731, 732, 748, 853, 854, 876, 912–915, 919, 924, 925, 928, 931, 1039–1041, 1043, 1139, 1142, 1176, 1178, 1179, 1182, 1204 Android, 262 ANTH, 1233 application model, 222, 223, 243, 681, 684, 686, 690 architecture, 121, 210, 235, 242, 243, 246, 247, 291, 318, 322, 327, 381, 637, 638, 647, 656, 682, 683, 802, 806, 814, 828, 864, 866, 907, 912, 917, 949, 993, 1000, 1091, 1094, 1095, 1121, 1126, 1128,
1286 1129, 1140, 1172, 1175, 1177–1179, 1181, 1183, 1194, 1195, 1202, 1211, 1216, 1226, 1249, 1258, 1263–1265, 1268, 1276, 1280 argumentation, 852, 859, 862, 864, 868, 869, 915, 918 argumentation support system, 864 ART, 239, 1133, 1135 art, 252, 362, 367 articulated body, 82 Artificial Intelligence, 4, 92, 260, 263, 276, 364, 371, 610, 612, 618, 630, 636, 637, 669, 914, 1118, 1262 assisted living, 1050 assistive software, 410, 412, 415, 416, 422 assistive technology, 409, 1118, 1131, 1132, 1135 auction, 717 audio beamforming, 118 audio sensor model, 135 audio-video fusion, 140, 142 audio-visual identification, 884 audio-visual output, 1084 audio-visual tracking, 118, 1077 augmented reality, 233, 288, 362, 540, 545, 798, 1261 authentication, 286, 294, 298, 301, 303, 307, 780, 853, 956, 1123, 1186, 1187, 1230, 1252 automatic speech recognition, 559, 1074, 1085, 1183 autonomous system, 636 autonomous traffic management, 705 autonomous vehicles, 705 aware home, 375, 841, 842 awareness, 230, 232, 250, 354, 435, 479, 610, 621, 797, 857, 858, 918, 920 background subtraction, 63, 64, 444 Bayes, 41, 46, 58, 757, 1081 Bayes filtering, 835 Bayes, naive, 68, 472 Bayesian classifier, 753, 757, 758, 1083 Bayesian estimation, 118, 132, 134, 138, 141, 142 Bayesian network, 482, 487, 489, 624, 767, 900 beamforming, 119–121, 123, 124, 440, 803, 1075 behavior modeling, 670, 671, 676, 687 belief, 45, 46, 963, 968, 1016 bias-free, 704 biofeedback, 1203, 1209–1211, 1213, 1217, 1220, 1222
Index biometrics, 297, 302, 303, 308, 476 biosignal, 1211–1213, 1216, 1219, 1220 blackboard architecture, 246, 247, 1265 Bluetooth, 260, 287, 305, 307, 316, 323, 326, 327, 465, 466, 470, 642, 839, 1139, 1180, 1187, 1192, 1194 building automation, 642, 1211, 1214, 1215 bus systems, 715 CAALYX, 319, 330 Calm computing, 283, 911 CAMAR, 1261 camera calibration, 41–43, 45, 456, 1090 camera mouse, 410 camera network, 36, 39–48, 687, 1003, 1004, 1006 camera-based interface, 421 CASAS, 753, 755 CASTANET, 1247 CDTK, 1260 CHRONIC, 319, 325, 330 clarity, 704 classification, 36, 46, 98, 264, 265, 267, 272, 415, 439, 466, 472, 482, 487, 489, 490, 494, 499, 501, 536, 630, 651, 661, 671, 676–678, 685, 693, 697, 734, 735, 752, 757–761, 765, 829–831, 997, 999, 1002, 1010, 1017, 1044, 1046, 1221, 1222 CLEAR - Classification of Events, Activities, and Relationships, 1086 CLEAR workshops, 1086 COGAIN project, 1136 cognitive psychology, 355 collaboration, 98, 221, 249, 276, 336, 359, 363, 475, 536, 539, 857, 861, 889, 892, 918, 962, 1095, 1100, 1175, 1179, 1233, 1258, 1260, 1261 collaborative context recognition, 260 collaborative spatial work, 89 Collaborative Workspace, 1099, 1100 communication, 36, 37, 39, 45, 94, 129, 202, 206, 208, 211, 212, 229, 237, 238, 243, 250, 260, 269, 275, 316, 320, 325, 335, 357, 358, 360, 362, 364, 409, 422, 423, 433, 435, 436, 479, 493, 502, 559, 560, 564–566, 568, 570, 572, 636–639, 641, 642, 648, 650, 660, 661, 682, 707, 744, 775, 776, 779, 797, 798, 800, 802, 808, 815, 816, 854, 857, 863, 867, 872, 874, 881, 889, 890, 907–909, 917, 923, 924, 939, 941, 942, 944–948, 962, 963, 972, 974, 976, 986, 1046, 1092, 1094, 1095, 1101, 1119, 1121, 1123, 1127, 1135, 1145, 1146, 1174, 1175, 1182, 1183,
Index 1193, 1194, 1203, 1211, 1229, 1231, 1238, 1260, 1273, 1274 communication abstraction, 209 communication interface, 336 communication middleware, 1157, 1158 communication platform, 1224 communication protocol, 249, 334, 1141, 1145, 1146, 1156, 1159–1161, 1164 community computing, 1243, 1258 Community Development Tool Kit (CDTK), 1260 comnmunication infrastructure, 1024, 1243 comnmunication platform, 470 complex systems, 704 computer vision, 35, 36, 128, 362, 413, 421, 493, 550, 840, 842, 993, 1004, 1010 conceptual model, 340, 543, 545, 1147 confidence evaluation, 267 context, 80, 214–217, 230, 238, 246, 259, 260, 262, 264, 265, 267–269, 275, 284, 285, 356, 367, 441, 446, 449, 472, 541, 542, 547–549, 551, 560–563, 566, 567, 569–571, 573, 613, 626, 660–663, 719, 757, 765, 797, 798, 832, 856, 858, 861, 897–901, 965, 1016, 1021, 1040, 1041, 1044, 1059–1061, 1178, 1185, 1187, 1231, 1237, 1246, 1248, 1249, 1261 context analysis, 245–247 context awareness, 100, 214, 230, 232, 233, 245, 247–250, 252, 257–263, 268, 276, 375, 376, 560, 562, 565, 571, 572, 610, 621, 630, 637, 786, 789, 836, 860, 861, 883, 885, 886, 896, 897, 901, 909, 922, 1016, 1039, 1072, 1091, 1093, 1097, 1106, 1142, 1145, 1176, 1178, 1229, 1230, 1237, 1249, 1260 context bus, 1182, 1189 context estimates, 264, 269, 272, 273 context management, 205, 213 context model, 215, 217, 897, 898 context modeling, 213, 216, 217, 897, 1016, 1022, 1091 context recognition, 259, 260, 263–266, 269, 273 Context-Aware Mobile Augmented Reality (CAMAR), 1261 cooperative distributed problem solving, 264 CORBA, 207, 209, 210, 237–239, 250, 1143, 1145 Cricket, 827, 828, 886 critical ambient intelligence, 928 CSCW, 355 customization, 218, 356, 402, 670, 687 Cyber Assist, 1204, 1252
1287 data collection, 755, 756, 839, 840, 1031, 1047, 1049, 1053, 1073, 1086, 1087, 1208, 1281 data mining, 93, 109, 263, 273, 275, 358, 563, 610, 622–624, 672, 734, 1015, 1016, 1019, 1020 DDS, 943 De Digitale Stad, 943 decision making, 259, 264, 265, 269, 316, 341, 375, 672, 685, 733, 851–853, 858, 862–864, 866, 868, 975, 1263 device pairing, 298 dial-a-ride bus systems, 715 Digital City, 940, 941, 943–949, 951, 952, 954–956 Digital City Kyoto, 946, 949 Digital City, AOL, 943 Digital City, Europian, 943 digital ecosystem, 350 digital home, 372, 377, 379, 381, 383, 394, 403 digital short range communication (DSRC), 712 DIPR, 670–672, 675, 676, 680–682, 684, 693, 694, 696 direct manipulation, 772, 885, 889, 955 discriminative method, 80 display, 206, 236, 237, 239, 244, 252, 288, 300, 360, 374, 435, 437, 466, 467, 473, 489, 490, 493, 494, 535, 537, 543, 545, 548, 549, 570, 653, 660, 798, 815, 856, 857, 883, 885, 889, 892, 901, 917, 922, 932, 1027, 1132, 1209, 1237 distance bounding, 297 distance learning, 881, 902, 904 distributed algorithms, 36, 37, 44, 46, 47 distributed computing, 378, 401, 771, 772 distributed smart camera, 989 distributed surveillance, 993 distributed systems, 276, 298, 772, 773, 777, 1175 DOG Domotic OSGi Gateway, 1128 DOLPHIN, 828, 1232 Domotics, 1118, 1129, 1130 dynamic programmin, 263 earthquake sensing, 1234 EasyLiving, 375, 840 echo cancellation, 807, 808 embedded computing, 402 emotional system, 867 end-user programming, 372, 374, 376, 378, 401, 403
1288 environmental control, 917, 928, 931, 1117–1121, 1124, 1125 event calculus, 620 event-based communication, 1178 Expo, 720 eye control, 1133 eye tracking, 360, 1119, 1120, 1132, 1135 Facial Expression Recognition, 495 facial expression recognition, 479, 480, 487–491, 496, 497, 499, 502 falling person detection, 990, 1000, 1001 false positives, 302, 475, 612, 759–761 feature extraction, 479, 483, 682, 1001, 1074 fidelity, 704 FIPA, 1095 first order probabilistic logic, 897 fixed-route bus systems, 715 FM radio, 829 focus of attention, 440, 1080, 1096 FOPL, 897 frame problem, 620 Free-Net, 941 Gaussian Markov Model, 43 gaze tracking, 559 genetic algorithm, 717, 1212, 1217–1219 gesture, 360, 363, 413–415, 422, 452, 489, 492–494, 544, 559, 566, 691, 816, 854, 860, 885, 1080, 1083, 1120 GIS, 946, 955 GPS, 37, 215, 245, 260, 262, 291, 324, 565, 656, 706, 719, 833, 836–839, 912, 922, 927, 948, 1026, 1232, 1237 graceful failure, 277 group decision making, 270, 851, 863, 864, 867, 868 group decision support system, 863, 864, 866 GSM cell, 833, 836 GUI, 374, 383, 397, 789, 854, 1084, 1186, 1216 habituation, 627, 628, 1208 HAVi, 234–236 HCC, 350–356, 360, 362, 366 HCI, 5, 257, 276, 350, 353, 355, 358, 364, 440, 537, 541, 549, 861, 923, 1015, 1132 head pose, 440, 441, 443, 446, 448–450, 484, 490, 502, 1080, 1086, 1090 heart rate, 261, 270, 464, 1027, 1207, 1208, 1217, 1220, 1272, 1274 HEARTS System, 333 heterogeneity, 206, 232, 241, 243, 245, 331, 561, 778, 852, 903, 941, 1142, 1145, 1156, 1180, 1260, 1262
Index Hidden Markov, 767 high level abstraction, 230, 232–235, 247 HMM, 263, 472, 482, 488, 624, 625, 1082 home gateway, 1126, 1127 home navigator, 1237 human behavior, 259, 355, 712, 1048, 1081, 1084 human computer interface, 1209 human spaces, 361 human-centered computing, 350–352, 356, 363, 422 human-computer interaction, 260, 352, 362, 413, 421, 440, 464, 479, 502, 535, 542, 548, 572, 859, 889, 973, 1085 identification, 36, 260, 286, 303, 554, 623, 757, 759, 766, 767, 805, 831, 838, 841, 842, 860, 884, 894, 1079, 1080, 1098, 1181, 1235, 1238 image processing, 35, 855 inferred predicted outcomes, 680, 694, 696 information city, 917, 947 informatization, 940, 944–946 infrared, 91, 236, 261, 281, 300, 412, 413, 435, 465, 540, 646, 662, 721, 724, 826, 836, 838, 842, 855, 1135, 1212, 1272 infrastructure, 92, 93, 206, 213, 214, 230, 232–234, 237, 238, 241–246, 249–251, 253, 290, 291, 319, 353, 469, 549, 550, 640, 670, 672, 680, 681, 683, 684, 707, 712, 801, 827, 834, 856, 859, 883, 906, 913, 916, 921, 923, 925, 942, 945, 946, 1046, 1049, 1050, 1072, 1073, 1091, 1094, 1099, 1129, 1146, 1173–1175, 1203, 1206, 1260, 1261, 1270 input device, 360, 415, 423, 547, 1209 intelligence rules, 674, 684, 686, 691 intelligent, 3, 262, 276, 320, 322, 354, 355, 413, 470, 472, 479, 550, 559, 636, 638, 640, 680–683, 693, 696, 772, 815, 851, 864, 896, 917, 923, 962, 968, 975, 977, 989, 1024, 1042, 1095, 1098, 1119, 1124, 1126, 1176, 1178, 1197, 1201, 1208, 1210, 1223, 1237, 1257, 1258, 1262, 1263, 1265, 1267 intelligent agent, 703 intelligent decision room, 876 intelligent Domotic environment, 1126, 1127, 1131 intelligent environment, 371, 377, 378, 751, 852, 861, 962, 1016, 1043 intelligent meeting room, 852, 858 intelligent sensor, 682, 1275 intelligent sensor network, 681
Index intelligent service robot, 1257, 1262, 1263, 1265, 1267 intelligent state, 670, 675–677, 684, 688 Intelligent Transportation Systems (ITS), 706 intelligibility, 805, 1084 inter-application adaptation, 218–221 interaction, 41, 42, 44, 205–209, 214, 218, 220, 221, 229, 230, 232, 244, 245, 251, 252, 257, 264, 276, 304, 350, 354, 356, 358–360, 377, 382, 433, 434, 436, 451, 479, 535–537, 539–542, 544, 546, 547, 549–553, 555, 560, 563–565, 567, 568, 638, 640, 648, 683, 752, 797, 800, 813, 840, 854, 856, 860, 867, 889, 904, 906, 921, 945–947, 949, 951, 954, 955, 962, 965, 969–971, 1015, 1017, 1023, 1031, 1043, 1044, 1046, 1061, 1062, 1064, 1072, 1074, 1084, 1087, 1088, 1092, 1102, 1120, 1125, 1127, 1144, 1156, 1158, 1172, 1174, 1176, 1178, 1179, 1181, 1182, 1186, 1223, 1267 interactive surface, 537–539, 546 interface, 5, 109, 208, 216–218, 232, 235–239, 243, 249–252, 292, 352–354, 360–362, 374, 381, 410–413, 415, 419–422, 437, 539–544, 546–548, 554, 555, 559, 564, 565, 570–573, 630, 653, 658, 684, 799, 828, 830, 907, 917–919, 943, 949, 952, 955, 967, 968, 976, 993, 1028, 1123, 1124, 1129, 1144, 1150, 1157, 1182, 1186–1188, 1190, 1194, 1211, 1214, 1216, 1217, 1248, 1265–1267 interoperability, 207, 208, 229, 387, 565, 714, 852, 1120, 1123, 1140–1142, 1144, 1145, 1147, 1160, 1161, 1163, 1164, 1175, 1177–1180, 1182, 1187, 1190, 1197 interpersonal distance, 797, 802, 809, 811, 815, 816 intersection calculus, 618 intra-application adaptation, 218, 221 IP based Ubiquitous Sensor Networks (IP-USNs), 1276 IP-USN, 1276, 1277, 1280, 1281 Jini, 208, 1143 joint context estimates, 264, 272 Kalman filtering, 48, 411 KM, 1264, 1265, 1268 Knowledge Manager, 1264, 1265, 1268 knowledge representation and reasoning, 661 LAID, 852, 859, 861, 872
1289 laser pointer tracking, 885, 889, 892 Latent Variable Model, 71 Lifelog, 1237, 1241 light, 43, 65, 297, 413, 549, 560, 570, 800, 815, 826, 830, 858, 930, 1135, 1202, 1203, 1206, 1207, 1214, 1217, 1219 light adaptation, 1217 light intensity, 1207, 1214 light parameters, 1202, 1206, 1207, 1210, 1217 LNMP Architecture, 1280 Local Binary Patterns, 480, 495 localization, 35, 37, 38, 42–45, 119, 138, 444, 643, 644, 647, 656, 660, 661, 811, 813, 814, 826, 828, 829, 834, 860, 894, 1075–1077, 1098, 1193, 1276 location, 36, 37, 43, 44, 46, 47, 90, 91, 101, 123, 203, 214, 217, 232, 238–240, 248, 249, 251, 252, 257, 262, 287, 289–291, 297, 324, 359, 440, 465, 549, 559, 561, 565, 613, 650, 663, 693, 694, 719, 722, 762, 779, 786, 787, 796, 798, 800–802, 813, 816, 825–830, 832, 833, 835–841, 844, 857, 884–888, 894, 895, 898, 912, 922, 927, 928, 942, 947, 948, 951, 975, 1001, 1016, 1029, 1086, 1090, 1125, 1131, 1132, 1144, 1157, 1232, 1243, 1276 location awareness, 836 location estimation, 722, 829, 831, 833, 835 location privacy, 290, 291 location tracking, 290, 885, 886, 894, 895 location-aware content delivery, 720 locationing, 838 locomotion activities, 92, 109 locomotion behaviours, 89, 90, 94, 95, 98, 100
machine learning, 265, 350, 375, 488, 498, 612, 621, 624, 626, 627, 753, 756, 853, 1020 Markov Model, 40, 69, 73, 757, 764, 838, 854 Markov Model, Gaussian, 43 Markov Model, hidden, 482, 624–626, 854, 1105 meeting, 258, 276, 436, 437, 439, 453, 455, 490, 550, 552, 638, 795, 814, 862, 863, 865, 941, 1072, 1074, 1075, 1090, 1102, 1119 meeting room, 216, 361, 437, 440, 799, 837, 860, 861, 894, 895 Meetings, 1087 mental model, 252, 469, 555, 570, 919 metadata, 262, 289, 356, 364, 1186, 1193
1290 metaphor, 244, 282, 376, 383, 542, 543, 545, 559, 641, 882, 920, 939, 943, 948, 949, 972, 1084, 1085 microphone array, 118, 124, 435, 860, 884, 894, 896 middleware, 202–210, 218–221, 223, 230, 232–235, 237, 241–247, 249–253, 327, 334, 375, 542, 564, 565, 570, 774, 813, 815, 852, 856, 857, 860, 862, 901, 1092, 1098, 1128, 1140, 1145, 1146, 1175–1180, 1182, 1184, 1186, 1187, 1190, 1192, 1260, 1264, 1268 middleware infrastructure, 230, 233, 234, 245, 249, 394 MIT, 1176 mixed reality, 237–240, 249, 799, 800 mobile context, 260, 262 mobile device, 48, 203, 221, 241, 258, 259, 261, 268, 273, 275, 276, 290, 354, 361, 364, 365, 463, 550, 552, 715, 836, 840, 919, 923, 928, 947, 948, 1021, 1123, 1146, 1216, 1261 mobile interaction, 541 mobile robotics, 636, 653 Mobile Sensing Platform, 465 mobile technology, 947 mobile user, 564, 951, 1016 Monte Carlo, 73, 78, 79, 118, 136, 137, 625 motion analysis, 90–92 motion impairments, 411, 415, 416, 421 motion sensor, 754, 763, 1220, 1225 mouse-replacement, 409, 411, 422 moving objects, 36, 43, 44, 64, 89, 95, 128, 135, 842, 949, 997 multi-agent, 322, 331, 354, 375, 562, 636, 641, 703, 704, 720, 731–735, 746, 748, 751, 865, 866, 907, 908, 1092 multi-agent simulation, 705 multi-agent social simulation, 703, 704 multimedia analysis, 358, 359 multimedia interaction, 359 multimedia production, 357, 358 multimodal fusion, 123 multimodal interface, 354, 560, 565, 571, 856, 883 multiple residents, 751, 752, 765 multiple target tracking, 144 MyHeart, 320, 323 navigation, 92, 93, 214, 216, 260, 262, 638, 644, 647, 656, 704, 788, 833, 837, 947, 1027, 1124, 1209, 1216, 1263 network, 35–41, 46, 129, 208, 209, 211, 234, 236, 260, 373, 383, 397, 470, 628, 642,
Index 644, 646, 648, 682, 683, 687, 688, 771, 775, 782, 785, 798, 801, 802, 806–808, 813–815, 833, 838, 903, 912, 923, 925, 927, 941–945, 990, 993, 1004, 1049, 1077, 1127, 1128, 1140, 1142, 1143, 1159, 1160, 1193, 1195, 1230, 1231, 1237, 1240, 1245, 1248–1250, 1257, 1258, 1261, 1262, 1274–1277, 1279, 1280 network appliance, 374 neural network, 263, 267, 415, 488, 490, 626, 628, 629, 853, 1081, 1221 NIST, 860 nonverbal communication, 433, 436, 451, 453, 493, 1080 novelty detection, 626, 629 occlusion, 57, 58, 120, 1079 occupancy, 61, 920 occupancy sensing, 832 OLDES, 321, 326 ontology, 330, 375, 387, 392, 402, 562, 663, 664, 897–899, 1022, 1023, 1093, 1094, 1128–1131, 1140, 1180, 1181, 1183, 1190 Open Agent Architecture, 871 open distributed system, 1179, 1190 open-loop control, 1210 optical flow, 484, 490, 1082 OSGi, 1126, 1128, 1129, 1146, 1180, 1187, 1189, 1192, 1195, 1196 OSNAP, 1247 OWL, 375, 387, 395, 588, 601, 897, 900, 1024, 1130, 1141, 1189, 1191 Oxygen, 1176 paralysis, 412 ParcTab, 536, 549 particle filtering, 46, 73–75, 120, 835 password, 301, 302, 305, 1004 pattern classification, 264, 265, 272, 415 pattern recognition, 265, 267, 317, 479, 672 PAVENET, 1233, 1234 peephole, 545 peer-to-peer, 121, 714, 796, 857, 1187, 1252 performance evaluation, 453, 1077 PERSONA, 1171, 1172, 1175, 1176, 1179–1183, 1185–1190, 1192, 1195, 1196 personalization, 469, 1016, 1176 pervasive healthcare system, 316, 322, 323 pervasive sensing, 913 phonelet, 802, 807, 808 PhotoHelix, 554
Index physical distribution, 1182 physical interaction, 379, 383, 549, 552, 635, 638 PlaceLab, 842 planning, 99, 100, 111, 620, 642, 647, 650, 658, 662, 664, 706, 744, 853, 856, 917, 1176, 1225, 1263 point algebra, 614, 615 political philosophy, 911, 913 portability, 233, 243, 886, 1191 pose estimation, 61, 63, 68, 71, 81, 444, 449, 450 positioning, 91, 216, 260, 262, 263, 290, 297, 412, 422, 539, 654, 656, 719, 826, 828, 830, 833, 842, 886, 887, 1232 power efficiency, 275 prediction, 263, 422, 612, 629, 670, 676, 751, 767, 829, 838, 854, 888, 925, 930 price discrimination, 307 privacy, 250, 252, 287–290, 294, 295, 303, 307, 308, 318, 330, 366, 372, 378, 610, 707, 813, 835, 844, 852, 858, 919–921, 924, 985, 986, 989, 991, 992, 994, 1004, 1006, 1023, 1027, 1106, 1176–1178, 1186, 1244 privacy design, 919 programming, 120, 234, 235, 246, 247, 249–252, 353, 372, 374, 376, 377, 379, 383, 397, 399, 401, 403, 416, 540, 777, 875, 918, 1119, 1146, 1157, 1190, 1216, 1233 programming by example, 376, 401 programming, graphical, 376 proxemics, 797 psycho-physiological target value, 1202, 1206, 1207 public space, 914 public and private spaces, 362 public life, 913, 917, 918 public space, 920, 925, 927, 1245 qualitative representation, 91, 614 quantum communication, 300 RAP, 1249 reaction, 680, 681, 686, 687, 1207, 1217 reasoner, 1185, 1186, 1197 recognition accuracy, 264, 266, 272, 450, 489, 916, 1082 region connection calculus, 101 relaxation, 1204, 1206, 1207, 1217, 1220, 1221 relay attack, 297 remote communication, 437
1291 remote meeting, 435, 437, 1100 remote monitoring, 316, 333 remote procedure call, 775, 1159 repository, 220, 244, 918, 1093, 1144, 1147, 1185, 1189, 1190, 1264, 1268 resident interaction, 754 resource discovery, 861 Resurrecting Duckling, 299 resurrecting duckling, 299 RFID, 91, 103, 108, 260, 291–297, 302, 307, 564, 565, 720, 722, 723, 786, 813, 841, 858, 921, 1230, 1237, 1238, 1249, 1252, 1272–1274 risk management, 284, 285 RoboCup, 732, 734, 736, 738, 744, 746, 749 robot, 43, 635–638, 640, 642, 646–648, 650, 652, 653, 656, 660, 661, 928, 963–965, 968, 976, 1118, 1251, 1262, 1263, 1265, 1267–1269 Robotic Room, 843 room, intelligent, 852, 858 rough locations, 110 RPC, 775, 1128, 1145, 1156–1159, 1161–1163, 1177 SAIL, 1192, 1194, 1195 SAPHIRE, 322, 327 security, 250, 251, 282, 284–288, 290, 293, 294, 298, 299, 302–304, 306–308, 341, 479, 642, 643, 645, 779, 796, 852, 853, 913, 926, 956, 985, 992, 1119, 1123, 1174, 1177, 1178, 1186, 1225, 1226 security holes, 307 semantic service, 1140, 1141, 1153, 1155, 1260 Semantic Web, 276, 900, 1127, 1130, 1141, 1190 sensing, 123, 134, 229, 243, 261, 262, 464, 465, 550, 555, 638, 754, 813, 858, 886, 911, 919–921, 927, 1182, 1191, 1192, 1232, 1234, 1261, 1269, 1279 sensor event, 752, 755, 758, 759 sensor fusion, 45, 118, 119, 122, 126, 133, 141, 216, 364, 366, 639, 643, 833, 844 sensor network, 35–40, 46, 47, 128, 328, 830, 921, 1050, 1192, 1194, 1232, 1233, 1247, 1257, 1276 service, 109, 131, 207, 208, 211, 212, 217, 232, 237, 238, 240–243, 245, 250, 290, 295, 372, 765, 801, 837, 926, 943–945, 948, 955, 969, 972, 974, 1023, 1085, 1091, 1092, 1095, 1099, 1128, 1173, 1174, 1177–1179, 1181, 1184, 1186, 1187, 1189–1191, 1194, 1195, 1211, 1212,
1292 1231, 1235, 1240, 1249, 1257, 1258, 1261, 1262, 1267, 1268, 1271–1273, 1275 service agent, 1094 service aggregation, 375 service communication, 337, 1141, 1156 service communication protocol, 1145 service composition, 1176, 1178 service description language, 1141, 1248 service description model, 1142 service description translation, 1141 service discovery, 211–213, 786, 852, 899, 1140, 1141, 1143, 1144, 1147, 1156, 1175, 1178, 1179, 1195 service discovery protocol, 1143 service matching, 1141 service oriented architecture, 1140, 1142 service platform, 1141, 1173, 1175 service repository, 1147 service science, 5 SETI, 923 short-range link, 260 SimCity, 912 simplicity, 704 simulated annealing, 1212, 1218, 1219 situation awareness, 989, 1017, 1258 situation calculus, 618, 620, 621 situation modeling, 1092, 1098 skin conductance, 1207, 1208, 1213, 1220 skin response, 464 SLAT, 44 smart camera, 36, 48, 986, 990, 993, 995, 997–999, 1001, 1003 smart camera network, 993 Smart Classroom, 881–884, 886, 893 Smart Environment, 4, 89, 90, 92, 93, 96, 98, 100, 101, 103, 106, 202, 376, 413–415, 422, 423, 463, 535, 545–547, 612, 637, 638, 641, 647, 691, 758, 765, 766, 771, 825, 836, 840, 857–859, 883, 889, 896, 897, 906, 919, 1050, 1121 smart furniture, 1234, 1240 smart home, 296, 377, 378, 401, 609–612, 621, 625, 628, 752, 753, 757, 766, 948, 1117, 1118, 1121, 1123, 1139, 1187 smart hospital, 94, 97 smart object, 241–245, 1261 smart office, 852, 854, 875, 1106 Smart Platform, 883, 906–909 SOAP, 207, 249, 1157, 1160, 1248 soccer, 732, 733, 736, 740, 744 social and cultural awareness, 354 social interaction, 963 social network, 358, 1045, 1059
Index Sonic City, 922 spatial reasoning, 92, 615, 618 spatial speech reproduction, 809 spatio-temporal reasoning, 89, 610, 613, 624 speaker tracking, 440, 893, 1086 speakerphone, 797, 801, 802, 813 speaking activity, 433–435, 451–453, 456 speech enhancement, 803, 807, 808 strategic modeling, 731 subspace learning, 479, 487 successive best insertion, 717 Support Vector Machine, 487 surveillance, 39, 44, 261, 288, 357, 361, 363, 479, 490, 642, 643, 645, 646, 653, 664, 687, 916, 917, 920, 926, 928, 931, 985, 986, 988–993, 1004 sustainability, 351, 912, 913, 1204 SVM, 487, 488, 501, 1000–1002 T-pattern, 623 tangible interaction, 552 tangible interface, 376 tangible user interface, 552 targeted audio, 1095 Task and Knowledge Manager (TKM), 1264 task computing, 374 Task Manager, 1264, 1265, 1268 taxonomy, 36, 95, 301, 374, 377, 638, 736, 737, 745, 813 TeleCARE, 319 telephony, 260, 796–798, 800, 801, 808–810, 812, 815, 816 The Blacksburg Electronic Village, 942 The Seattle Community Network, 942 threat, 284, 289, 916, 992 time delta, 761 time synchronization, 118, 120, 122, 131, 829 TM, 1265, 1268 token, 251, 286, 297, 302, 303, 610, 613, 621–623, 625 topology estimation, 36, 38–41, 44 tour tend analysis, 720 TPSN, 120 tracking, 35–38, 40, 41, 46–48, 58, 59, 63, 67, 70, 72, 96, 107, 118–121, 126, 132, 136, 141, 144, 291, 295, 411, 413, 416, 483, 486, 540, 751, 754, 766, 767, 779, 796, 803, 807, 810, 813, 815, 831, 834, 835, 840, 843, 855, 857, 858, 860, 986, 988, 991, 993, 997–999, 1005, 1010, 1076–1079, 1082 tractability, 704 tuple space, 209, 246 turbocharging detection classifier, 672
Index u-health, 1269, 1271, 1272, 1275 Ubila, 1230, 1231, 1245 Ubiquitous Computing, 260, 281, 282, 350, 464, 549, 641, 863, 911, 922, 1016, 1074, 1232, 1243, 1258, 1262 Ubiquitous Computing Network (UCN), 1257 ubiquitous device, 361 ubiquitous group decision support system, 864 ubiquitous healthcare, 1269 Ubiquitous Mobile Object, 1261 ubiquitous smart space, 1258, 1260 Ubiquitous smart Space Platform Initiative (USPi), 1260 UCN, 1258–1260, 1262 ultra-wideband, 814, 830 ultrasonic, 260, 826–828, 1232 UMO, 1261 unobtrusive monitoring, 611 uPhoto, 1235, 1236 UPnP, 381, 388, 1141, 1148, 1149, 1151, 1180 urbanism, 912, 913, 928 usability, 229, 300, 304, 305, 355, 387, 394, 469, 471, 548, 715, 716, 852, 943, 977, 1028, 1046, 1052, 1053, 1102, 1174, 1208, 1209 user experience, 201, 362, 473, 883, 971 user interface, 203, 210, 220, 235, 236, 262, 283, 304, 328, 355, 360, 421, 469, 476, 542–544, 547, 548, 630, 647, 649, 815, 861, 916, 1015, 1072, 1121, 1125, 1131, 1174, 1196, 1205, 1208, 1209, 1212, 1216, 1226 user interface, graphical, 854 user tracking, 796, 807, 860 user-centered design, 355, 415, 542, 546, 975, 976, 1016, 1046, 1096 users with disabilities, 362, 1029, 1118 USN, 1257, 1275–1278, 1280, 1281 USPi, 1260
1293 USS, 1258 uTexture, 1234 utility value, 223, 265 vehicle to infrastructure (ITS), 706 vehicle to infrastructure (V2I) communication, 712 video analysis, 450, 986, 987 video sensor model, 135 virtual appliance, 373, 378, 379 virtual city, 912, 944, 954, 955 virtual environment, 362, 798 virtual world, 537, 545, 993, 1005, 1007, 1182 vision, 35, 36, 39–42, 44, 45, 47, 48, 293, 362, 421, 814, 830, 855, 892, 988–990, 1043, 1132, 1203, 1208 vision sensor, 691, 692 visual communication, 796, 800, 815, 816 visual dominance ratio, 452–454 visual sensor network, 35–38, 41, 43, 1261 visual tracking, 35, 47, 73, 1076 voting, 264–266, 269, 272, 549, 766, 869, 872 wayfinding, 89, 90, 92, 97, 100, 106 wearable computing, 288–290 Web service, 288, 917, 1141, 1146, 1151, 1159, 1160, 1187, 1248 weighted majority voting, 266, 272 WELL, 941 WiFi, 290, 307, 470, 829, 833, 838, 844, 882, 947 WiMax, 470 wireless network, 130, 138, 328, 813, 882 wireless sensor network, 36, 37, 45, 117, 128, 287, 307, 330, 1192, 1275 WSN, 118, 119, 307, 1192–1194, 1261, 1275–1277, 1279 ZigBee, 682, 1192, 1194, 1195