The Handbook of Research on Scalable Computing Technologies Kuan-Ching Li Providence University, Taiwan Ching-Hsien Hsu Chung Hua University, Taiwan Laurence Tianruo Yang St. Francis Xavier University, Canada Jack Dongarra University of Tennessee, USA Hans Zima Jet Propulsion Laboratory, California Institute of Technology, USA and University of Vienna, Austria
Information Science Reference
Hershey • New York
Director of Editorial Content: Kristin Klinger
Senior Managing Editor: Jamie Snavely
Assistant Managing Editor: Carole Coulson
Publishing Assistant: Sean Woznicki
Typesetter: Carole Coulson, Dan Wilson, Daniel Custer, Kait Betz
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.
Published in the United States of America by Information Science Reference (an imprint of IGI Global) 701 E. Chocolate Avenue Hershey PA 17033 Tel: 717-533-8845 Fax: 717-533-8661 E-mail:
[email protected] Web site: http://www.igi-global.com/reference
Copyright © 2010 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data

Handbook of research on scalable computing technologies / Kuan-Ching Li ... [et al.], editors.
p. cm.
Includes bibliographical references and index.
Summary: "This book presents, discusses, shares ideas, results and experiences on the recent important advances and future challenges on enabling technologies for achieving higher performance"--Provided by publisher.
ISBN 978-1-60566-661-7 (hardcover) -- ISBN 978-1-60566-662-4 (ebook)
1. Computational grids (Computer systems) 2. System design. 3. Parallel processing (Electronic computers) 4. Ubiquitous computing. I. Li, Kuan-Ching.
QA76.9.C58H356 2009
004--dc22
2009004402
British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library. All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.
Editorial Advisory Board
Minyi Guo, The University of Aizu, Japan
Timothy Shih, Tamkang University, Taiwan
Ce-Kuen Shieh, National Cheng Kung University, Taiwan
Liria Matsumoto Sato, University of Sao Paulo, Brazil
Jeffrey Tsai, University of Illinois at Chicago, USA
Chia-Hsien Wen, Providence University, Taiwan
Yi Pan, Georgia State University, USA
List of Contributors
Allenotor, David / University of Manitoba, Canada .......... 471
Altmann, Jorn / Seoul National University, South Korea .......... 442
Alves, C. E. R. / Universidade Sao Judas Tadeu, Brazil .......... 378
Bertossi, Alan A. / University of Bologna, Italy .......... 645
Buyya, Rajkumar / The University of Melbourne, Australia .......... 191, 517
Cáceres, E. N. / Universidade Federal de Mato Grosso do Sul, Brazil .......... 378
Cappello, Franck / INRIA & UIUC, France .......... 31
Chang, Jih-Sheng / National Dong Hwa University, Taiwan .......... 1
Chang, Ruay-Shiung / National Dong Hwa University, Taiwan .......... 1
Chen, Jinjun / Swinburne University of Technology, Australia .......... 396
Chen, Zizhong / Colorado School of Mines, USA .......... 760
Chiang, Kuo / National Taiwan University, Taiwan .......... 123
Chiang, Shang-Feng / National Taiwan University, Taiwan .......... 123
Chiu, Kenneth / University at Binghamton, State University of NY, USA .......... 471
Dai, Yuan-Shun / University of Electronic Science and Technology of China, China & University of Tennessee, Knoxville, USA .......... 219
de Assunção, Marcos Dias / The University of Melbourne, Australia .......... 517
de Mello, Rodrigo Fernandes / University of São Paulo – ICMC, Brazil .......... 338
Dehne, F. / Carleton University, Canada .......... 378
Dodonov, Evgueni / University of São Paulo – ICMC, Brazil .......... 338
Dongarra, Jack / University of Tennessee, Knoxville, USA; Oak Ridge National Laboratory, USA; & University of Manchester, UK .......... 219
Doolan, Daniel C. / Robert Gordon University, UK .......... 705
Dou, Wanchun / Nanjing University, P. R. China .......... 396
Dümmler, Jörg / Chemnitz University of Technology, Germany .......... 246
Eskicioglu, Rasit / University of Manitoba, Canada .......... 486
Fahringer, Thomas / University of Innsbruck, Austria .......... 89
Fedak, Gilles / LIP/INRIA, France .......... 31
Ferm, Tore / Sydney University, Australia .......... 354
Gabriel, Edgar / University of Houston, USA .......... 583
Gaudiot, Jean-Luc / University of California, Irvine, USA .......... 552
Gentzsch, Wolfgang / EU Project DEISA and Board of Directors of the Open Grid Forum, Germany .......... 62
Graham, Peter / University of Manitoba, Canada .......... 486
Grigg, Alan / Loughborough University, UK .......... 606
Grigoras, Dan / University College Cork, Ireland .......... 705
Guan, Lin / Loughborough University, UK .......... 606
Gunturu, Sudha / Oklahoma State University, USA .......... 841
Guo, Minyi / Shanghai Jiao Tong University, China .......... 421
Gupta, Phalguni / Indian Institute of Technology Kanpur, India .......... 645
He, Xiangjian / University of Technology, Sydney (UTS), Australia .......... 739, 808
Jang, Yong J. / Yonsei University, Seoul, Korea .......... 276
Ji, Yanqing / Gonzaga University, USA .......... 874
Jiang, Hai / Arkansas State University, USA .......... 874
Jiang, Hong / University of Nebraska–Lincoln, USA .......... 785
Kondo, Derrick / ENSIMAG - antenne de Montbonnot, France .......... 31
Lam, King Tin / The University of Hong Kong, Hong Kong .......... 658
Li, Xiaobin / Intel® Corporation, USA .......... 552
Li, Xiaolin / Oklahoma State University, USA .......... 841
Liu, Chen / Florida International University, USA .......... 552
Liu, Shaoshan / University of California, Irvine, USA .......... 552
Malécot, Paul / Université Paris-Sud, France .......... 31
Malyshkin, V.E. / Russian Academy of Sciences, Russia .......... 295
March, Verdi / National University of Singapore, Singapore .......... 140
Mihailescu, Marian / National University of Singapore, Singapore .......... 140
Nadeem, Farrukh / University of Innsbruck, Austria .......... 89
Nanda, Priyadarsi / University of Technology, Sydney (UTS), Australia .......... 739
Oh, Doohwan / Yonsei University, Seoul, Korea .......... 276
Ou, Zhonghong / University of Oulu, Finland .......... 682
Parashar, Manish / Rutgers, The State University of New Jersey, USA .......... 14
Pierson, Jean-Marc / Paul Sabatier University, France .......... 14
Pinotti, M. Cristina / University of Perugia, Italy .......... 645
Prodan, Radu / University of Innsbruck, Austria .......... 89
Quan, Dang Minh / International University in Germany, Germany .......... 442
Ranjan, Rajiv / The University of Melbourne, Australia .......... 191
Rauber, Thomas / University Bayreuth, Germany .......... 246
Rautiainen, Mika / University of Oulu, Finland .......... 682
Rezmerita, Ala / Université Paris-Sud, France .......... 31
Rizzi, Romeo / University of Udine, Italy .......... 645
Ro, Won W. / Yonsei University, Seoul, Korea .......... 276
Rünger, Gudula / Chemnitz University of Technology, Germany .......... 246
Shen, Haiying / University of Arkansas, USA .......... 163
Shen, Wei / University of Cincinnati, USA .......... 718
Shorfuzzaman, Mohammad / University of Manitoba, Canada .......... 486
Song, S. W. / Universidade de Sao Paulo, Brazil .......... 378
Sun, Junzhao / University of Oulu, Finland .......... 682
Tabirca, Sabin / University College Cork, Ireland .......... 705
Tang, Feilong / Shanghai Jiao Tong University, China .......... 421
Teo, Yong Meng / National University of Singapore, Singapore .......... 140
Thulasiram, Ruppa K. / University of Manitoba, Canada .......... 312, 471
Thulasiraman, Parimala / University of Manitoba, Canada .......... 312
Tian, Daxin / Tianjin University, China .......... 858
Tilak, Sameer / University of California, San Diego, USA .......... 471
Wang, Cho-Li / The University of Hong Kong, Hong Kong .......... 658
Wang, Sheng-De / National Taiwan University, Taiwan .......... 123
Wu, Qiang / University of Technology, Australia .......... 808
Xiang, Yang / Central Queensland University, Australia .......... 858
Xu, Meilian / University of Manitoba, Canada .......... 312
Yang, Laurence Tianruo / St. Francis Xavier University, Canada .......... 442, 841
Yi, Jaeyoung / Yonsei University, Seoul, Korea .......... 276
Ylianttila, Mika / University of Oulu, Finland .......... 682
Yu, Ruo-Jian / National Taiwan University, Taiwan .......... 123
Zeng, Qing-An / University of Cincinnati, USA .......... 718
Zhou, Jiehan / University of Oulu, Finland .......... 682
Zhu, Yifeng / University of Maine, USA .......... 785
Zomaya, Albert Y. / Sydney University, Australia .......... 354
Table of Contents
Foreword .......... xxxi
Preface .......... xxxiii
Acknowledgment .......... xxxiv
Volume I

Section 1
Grid Architectures and Applications

Chapter 1
Pervasive Grid and its Applications .......... 1
Ruay-Shiung Chang, National Dong Hwa University, Taiwan
Jih-Sheng Chang, National Dong Hwa University, Taiwan

Chapter 2
Pervasive Grids: Challenges and Opportunities .......... 14
Manish Parashar, Rutgers, The State University of New Jersey, USA
Jean-Marc Pierson, Paul Sabatier University, France

Chapter 3
Desktop Grids: From Volunteer Distributed Computing to High Throughput Computing Production Platforms .......... 31
Franck Cappello, INRIA & UIUC, France
Gilles Fedak, LIP/INRIA, France
Derrick Kondo, ENSIMAG - antenne de Montbonnot, France
Paul Malécot, Université Paris-Sud, France
Ala Rezmerita, Université Paris-Sud, France

Chapter 4
Porting Applications to Grids .......... 62
Wolfgang Gentzsch, EU Project DEISA and Board of Directors of the Open Grid Forum, Germany
Chapter 5
Benchmarking Grid Applications for Performance and Scalability Predictions .......... 89
Radu Prodan, University of Innsbruck, Austria
Farrukh Nadeem, University of Innsbruck, Austria
Thomas Fahringer, University of Innsbruck, Austria

Section 2
P2P Computing

Chapter 6
Scalable Index and Data Management for Unstructured Peer-to-Peer Networks .......... 123
Shang-Feng Chiang, National Taiwan University, Taiwan
Kuo Chiang, National Taiwan University, Taiwan
Ruo-Jian Yu, National Taiwan University, Taiwan
Sheng-De Wang, National Taiwan University, Taiwan

Chapter 7
Hierarchical Structured Peer-to-Peer Networks .......... 140
Yong Meng Teo, National University of Singapore, Singapore
Verdi March, National University of Singapore, Singapore
Marian Mihailescu, National University of Singapore, Singapore

Chapter 8
Load Balancing in Peer-to-Peer Systems .......... 163
Haiying Shen, University of Arkansas, USA

Chapter 9
Decentralized Overlay for Federation of Enterprise Clouds .......... 191
Rajiv Ranjan, The University of Melbourne, Australia
Rajkumar Buyya, The University of Melbourne, Australia
Section 3
Programming Models and Tools

Chapter 10
Reliability and Performance Models for Grid Computing .......... 219
Yuan-Shun Dai, University of Electronic Science and Technology of China, China & University of Tennessee, Knoxville, USA
Jack Dongarra, University of Tennessee, Knoxville, USA; Oak Ridge National Laboratory, USA; & University of Manchester, UK

Chapter 11
Mixed Parallel Programming Models Using Parallel Tasks .......... 246
Jörg Dümmler, Chemnitz University of Technology, Germany
Thomas Rauber, University Bayreuth, Germany
Gudula Rünger, Chemnitz University of Technology, Germany

Chapter 12
Programmability and Scalability on Multi-Core Architectures .......... 276
Jaeyoung Yi, Yonsei University, Seoul, Korea
Yong J. Jang, Yonsei University, Seoul, Korea
Doohwan Oh, Yonsei University, Seoul, Korea
Won W. Ro, Yonsei University, Seoul, Korea

Chapter 13
Assembling of Parallel Programs for Large Scale Numerical Modeling .......... 295
V.E. Malyshkin, Russian Academy of Sciences, Russia

Chapter 14
Cell Processing for Two Scientific Computing Kernels .......... 312
Meilian Xu, University of Manitoba, Canada
Parimala Thulasiraman, University of Manitoba, Canada
Ruppa K. Thulasiram, University of Manitoba, Canada
Section 4
Scheduling and Communication Techniques

Chapter 15
On Application Behavior Extraction and Prediction to Support and Improve Process Scheduling Decisions .......... 338
Evgueni Dodonov, University of São Paulo – ICMC, Brazil
Rodrigo Fernandes de Mello, University of São Paulo – ICMC, Brazil

Chapter 16
A Structured Tabu Search Approach for Scheduling in Parallel Computing Systems .......... 354
Tore Ferm, Sydney University, Australia
Albert Y. Zomaya, Sydney University, Australia

Chapter 17
Communication Issues in Scalable Parallel Computing .......... 378
C.E.R. Alves, Universidade Sao Judas Tadeu, Brazil
E. N. Cáceres, Universidade Federal de Mato Grosso do Sul, Brazil
F. Dehne, Carleton University, Canada
S. W. Song, Universidade de Sao Paulo, Brazil
Chapter 18
Scientific Workflow Scheduling with Time-Related QoS Evaluation .......... 396
Wanchun Dou, Nanjing University, P. R. China
Jinjun Chen, Swinburne University of Technology, Australia

Section 5
Service Computing

Chapter 19
Grid Transaction Management and Highly Reliable Grid Platform .......... 421
Feilong Tang, Shanghai Jiao Tong University, China
Minyi Guo, Shanghai Jiao Tong University, China

Chapter 20
Error Recovery for SLA-Based Workflows Within the Business Grid .......... 442
Dang Minh Quan, International University in Germany, Germany
Jorn Altmann, Seoul National University, South Korea
Laurence T. Yang, St. Francis Xavier University, Canada

Chapter 21
A Fuzzy Real Option Model to Price Grid Compute Resources .......... 471
David Allenotor, University of Manitoba, Canada
Ruppa K. Thulasiram, University of Manitoba, Canada
Kenneth Chiu, University at Binghamton, State University of NY, USA
Sameer Tilak, University of California, San Diego, USA

Volume II

Chapter 22
The State of the Art and Open Problems in Data Replication in Grid Environments .......... 486
Mohammad Shorfuzzaman, University of Manitoba, Canada
Rasit Eskicioglu, University of Manitoba, Canada
Peter Graham, University of Manitoba, Canada

Chapter 23
Architectural Elements of Resource Sharing Networks .......... 517
Marcos Dias de Assunção, The University of Melbourne, Australia
Rajkumar Buyya, The University of Melbourne, Australia
Section 6
Optimization Techniques

Chapter 24
Simultaneous MultiThreading Microarchitecture .......... 552
Chen Liu, Florida International University, USA
Xiaobin Li, Intel® Corporation, USA
Shaoshan Liu, University of California, Irvine, USA
Jean-Luc Gaudiot, University of California, Irvine, USA

Chapter 25
Runtime Adaption Techniques for HPC Applications .......... 583
Edgar Gabriel, University of Houston, USA

Chapter 26
A Scalable Approach to Real-Time System Timing Analysis .......... 606
Alan Grigg, Loughborough University, UK
Lin Guan, Loughborough University, UK

Chapter 27
Scalable Algorithms for Server Allocation in Infostations .......... 645
Alan A. Bertossi, University of Bologna, Italy
M. Cristina Pinotti, University of Perugia, Italy
Romeo Rizzi, University of Udine, Italy
Phalguni Gupta, Indian Institute of Technology Kanpur, India

Section 7
Web Computing

Chapter 28
Web Application Server Clustering with Distributed Java Virtual Machine .......... 658
King Tin Lam, The University of Hong Kong, Hong Kong
Cho-Li Wang, The University of Hong Kong, Hong Kong

Chapter 29
Middleware for Community Coordinated Multimedia .......... 682
Jiehan Zhou, University of Oulu, Finland
Zhonghong Ou, University of Oulu, Finland
Junzhao Sun, University of Oulu, Finland
Mika Rautiainen, University of Oulu, Finland
Mika Ylianttila, University of Oulu, Finland

Section 8
Mobile Computing and Ad Hoc Networks

Chapter 30
Scalability of Mobile Ad Hoc Networks .......... 705
Dan Grigoras, University College Cork, Ireland
Daniel C. Doolan, Robert Gordon University, UK
Sabin Tabirca, University College Cork, Ireland

Chapter 31
Network Selection Strategies and Resource Management Schemes in Integrated Heterogeneous Wireless and Mobile Networks .......... 718
Wei Shen, University of Cincinnati, USA
Qing-An Zeng, University of Cincinnati, USA

Section 9
Fault Tolerance and QoS

Chapter 32
Scalable Internet Architecture Supporting Quality of Service (QoS) .......... 739
Priyadarsi Nanda, University of Technology, Sydney (UTS), Australia
Xiangjian He, University of Technology, Sydney (UTS), Australia

Chapter 33
Scalable Fault Tolerance for Large-Scale Parallel and Distributed Computing .......... 760
Zizhong Chen, Colorado School of Mines, USA

Section 10
Applications

Chapter 34
Efficient Update Control of Bloom Filter Replicas in Large Scale Distributed Systems .......... 785
Yifeng Zhu, University of Maine, USA
Hong Jiang, University of Nebraska–Lincoln, USA

Chapter 35
Image Partitioning on Spiral Architecture .......... 808
Qiang Wu, University of Technology, Australia
Xiangjian He, University of Technology, Australia

Chapter 36
Scheduling Large-Scale DNA Sequencing Applications .......... 841
Sudha Gunturu, Oklahoma State University, USA
Xiaolin Li, Oklahoma State University, USA
Laurence Tianruo Yang, St. Francis Xavier University, Canada

Chapter 37
Multi-Core Supported Deep Packet Inspection .......... 858
Yang Xiang, Central Queensland University, Australia
Daxin Tian, Tianjin University, China

Chapter 38
State-Carrying Code for Computation Mobility .......... 874
Hai Jiang, Arkansas State University, USA
Yanqing Ji, Gonzaga University, USA
Compilation of References ............................................................................................................... 895
Detailed Table of Contents
Foreword .......... xxxi
Preface .......... xxxiii
Acknowledgment .......... xxxiv
Volume I

Section 1
Grid Architectures and Applications

Chapter 1
Pervasive Grid and its Applications .......... 1
Ruay-Shiung Chang, National Dong Hwa University, Taiwan
Jih-Sheng Chang, National Dong Hwa University, Taiwan

With advances in computer systems and communication technologies, Grid computing has emerged as a key technology driving the next generation of distributed computing applications. For general users, however, grid middleware is complex to set up and demands a steep learning curve, so transparent access to the grid system from the users' point of view becomes a critical issue. Various challenges also arise from incomplete system design when coordinating existing computing resources to achieve a pervasive grid environment. We investigate current research on pervasive grids and analyze the most important factors and components for constructing a pervasive grid system. Finally, to improve the efficiency of teaching and research within a campus, we introduce our pervasive grid platform.

Chapter 2
Pervasive Grids: Challenges and Opportunities .......... 14
Manish Parashar, Rutgers, The State University of New Jersey, USA
Jean-Marc Pierson, Paul Sabatier University, France

The Pervasive Grid is motivated by advances in Grid technologies and the proliferation of pervasive systems, and is leading to the emergence of a new generation of applications that use pervasive and ambient information as an integral part of how they manage, control, adapt, and optimize. However, the inherent scale and complexity of Pervasive Grid systems fundamentally impact how applications are formulated, deployed, and managed, and present significant challenges that permeate all aspects of the systems software stack. In this chapter, the authors present some use cases of Pervasive Grids and highlight their opportunities and challenges. They then explain why semantic knowledge and autonomic mechanisms are seen as foundations for conceptual and implementation solutions that can address these challenges.

Chapter 3
Desktop Grids: From Volunteer Distributed Computing to High Throughput Computing Production Platforms .......... 31
Franck Cappello, INRIA & UIUC, France
Gilles Fedak, LIP/INRIA, France
Derrick Kondo, ENSIMAG - antenne de Montbonnot, France
Paul Malécot, Université Paris-Sud, France
Ala Rezmerita, Université Paris-Sud, France

Desktop Grids, literally Grids made of desktop computers, are very popular in the context of "Volunteer Computing" for large-scale "Distributed Computing" projects like SETI@home and Folding@home. They are also very appealing as "Internet Computing" platforms for scientific projects seeking a huge amount of computational resources for massive high throughput computing, like the EGEE project in Europe. Companies are also interested in using cheap computing solutions that do not add extra hardware and cost of ownership. A very recent argument for Desktop Grids is their ecological impact: by scavenging unused CPU cycles without excessively increasing power consumption, they reduce the waste of electricity. This book chapter presents the background of Desktop Grids, their principles and essential mechanisms, the evolution of their architectures, their applications, and the research tools associated with this technology.

Chapter 4
Porting Applications to Grids .......... 62
Wolfgang Gentzsch, EU Project DEISA and Board of Directors of the Open Grid Forum, Germany

The aim of this chapter is to guide developers and users through the most important stages of implementing software applications on Grid infrastructures, and to discuss important challenges and potential solutions. Those challenges come from the underlying grid infrastructure, such as security, resource management, and information services; the application data, data management, and the structure, volume, and location of the data; and the application architecture, monolithic or workflow, serial or parallel. As a case study, we present DEISA, the Distributed European Infrastructure for Supercomputing Applications, and describe its DEISA Extreme Computing Initiative (DECI) for porting and running scientific grand challenge applications. The chapter concludes with an outlook on Compute Clouds, and suggests ten rules for building a sustainable grid as a prerequisite for the long-term sustainability of grid applications.
Chapter 5
Benchmarking Grid Applications for Performance and Scalability Predictions .......... 89
Radu Prodan, University of Innsbruck, Austria
Farrukh Nadeem, University of Innsbruck, Austria
Thomas Fahringer, University of Innsbruck, Austria

Application benchmarks can play a key role in analyzing and predicting the performance and scalability of Grid applications, serve as an evaluation of the fitness of a collection of Grid resources for running a specific application or class of applications (Tsouloupas & Dikaiakos, 2007), and help in implementing performance-aware resource allocation policies for real-time job schedulers. However, application benchmarks have been largely ignored due to the diversified types of applications, multi-constrained executions, dynamic Grid behavior, and heavy computational costs. To remedy these issues, we present an approach taken by the ASKALON Grid environment that computes application benchmarks considering variations in the problem size of the application and the machine size of the Grid site. Our system dynamically controls the number of benchmarking experiments for individual applications and manages the execution of these experiments on different Grid sites. We present experimental results of our method for three real-world applications in the Austrian Grid environment.
Section 2
P2P Computing

Chapter 6
Scalable Index and Data Management for Unstructured Peer-to-Peer Networks .......... 123
Shang-Feng Chiang, National Taiwan University, Taiwan
Kuo Chiang, National Taiwan University, Taiwan
Ruo-Jian Yu, National Taiwan University, Taiwan
Sheng-De Wang, National Taiwan University, Taiwan

In order to improve the scalability and reduce the traffic of Gnutella-like unstructured peer-to-peer networks, index caching and controlled flooding mechanisms have been an important research topic in recent years. In this chapter we describe the current state of the art in index management schemes, interest groups, and data clustering for unstructured peer-to-peer networks. Index caching mechanisms are an approach to reducing the traffic of keyword querying. However, the cached indices may incur redundant replications across the whole network, leading to less efficient use of storage and increased traffic. We propose a multilayer index management scheme that actively diffuses the indices in the network and groups indices according to their request rate. The peers of the group that hold indices with a higher request rate are placed in layers that receive queries earlier. Our simulations show that the proposed approach can maintain a high query success rate as well as reduce the flooding size.
Chapter 7
Hierarchical Structured Peer-to-Peer Networks .......... 140
Yong Meng Teo, National University of Singapore, Singapore
Verdi March, National University of Singapore, Singapore
Marian Mihailescu, National University of Singapore, Singapore

Structured peer-to-peer networks are scalable overlay network infrastructures that support Internet-scale network applications. A globally consistent peer-to-peer protocol maintains the structural properties of the network while peers dynamically join, leave, and fail in the network. In this chapter, we discuss hierarchical distributed hash tables (DHTs) as an approach to reducing the overhead of maintaining the overlay network. In a two-level hierarchical DHT, the top-level overlay consists of groups of nodes where each group is distinguished by a unique group identifier. In each group, one or more nodes are designated as supernodes and act as gateways to nodes at the second level. Collisions of groups occur when concurrent node joins result in the creation of multiple groups with the same group identifier. This has the adverse effects of increasing the lookup path length, due to a larger top-level overlay, and the overhead of overlay network maintenance. We discuss two main approaches to addressing the group collision problem: collision detection-and-resolution, and collision avoidance. As an example, we describe an implementation of a hierarchical DHT that extends Chord as the underlying overlay graph.

Chapter 8
Load Balancing in Peer-to-Peer Systems .......... 163
Haiying Shen, University of Arkansas, USA

Structured peer-to-peer (P2P) overlay networks like Distributed Hash Tables (DHTs) map data items to the network based on a consistent hashing function. Such mapping for data distribution has an inherent load balance problem. Thus, a load balancing mechanism is an indispensable part of a structured P2P overlay network for high performance. The rapid development of P2P systems has posed challenges in load balancing due to their features characterized by large scale, heterogeneity, dynamism, and proximity. An efficient load balancing method should be flexible and resilient enough to deal with these characteristics. This chapter first introduces P2P systems and load balancing in P2P systems. It then introduces the current technologies for load balancing in P2P systems, and provides a case study of a dynamism-resilient and proximity-aware load balancing mechanism. Finally, it indicates future and emerging trends in load balancing, and concludes the chapter.

Chapter 9
Decentralized Overlay for Federation of Enterprise Clouds .......... 191
Rajiv Ranjan, The University of Melbourne, Australia
Rajkumar Buyya, The University of Melbourne, Australia

This chapter describes Aneka-Federation, a decentralized and distributed system that combines enterprise Clouds, overlay networking, and structured peer-to-peer techniques to create scalable wide-area networking of compute nodes for high-throughput computing. The Aneka-Federation integrates numerous small-scale Aneka Enterprise Cloud services and nodes that are distributed over multiple control and enterprise domains as parts of a single coordinated resource leasing abstraction. The system is designed with the aim of making distributed enterprise Cloud resource integration and application programming flexible, efficient, and scalable. The system is engineered such that it: enables seamless integration of existing Aneka Enterprise Clouds as part of a single wide-area resource leasing federation; self-organizes the system components based on a structured peer-to-peer routing methodology; and presents end-users with a distributed application composition environment that can support a variety of programming and execution models. This chapter describes the design and implementation of a novel, extensible, and decentralized peer-to-peer technique that helps to discover, connect, and provision the services of Aneka Enterprise Clouds among users who can use different programming models to compose their applications. Evaluations of the system with applications programmed using the Task and Thread execution models on top of an overlay of Aneka Enterprise Clouds are also described.
Section 3
Programming Models and Tools

Chapter 10
Reliability and Performance Models for Grid Computing .......... 219
Yuan-Shun Dai, University of Electronic Science and Technology of China, China & University of Tennessee, Knoxville, USA
Jack Dongarra, University of Tennessee, Knoxville, USA; Oak Ridge National Laboratory, USA; & University of Manchester, UK

Grid computing is a newly developed technology for complex systems with large-scale resource sharing, wide-area communication, and multi-institutional collaboration. It is hard to analyze and model Grid reliability because of its scale, complexity, and stiffness. This chapter therefore introduces Grid computing technology, presents different types of failures in grid systems, models grid reliability with star and tree structures, and finally studies optimization problems for grid task partitioning and allocation. The chapter then presents models for the star topology considering data dependence and for the tree structure considering failure correlation. Evaluation tools and algorithms are developed, evolved from the universal generating function and graph theory. Failure correlation and data dependence are then incorporated into the model, and numerical examples illustrate the modeling and analysis.

Chapter 11
Mixed Parallel Programming Models Using Parallel Tasks .......... 246
Jörg Dümmler, Chemnitz University of Technology, Germany
Thomas Rauber, University Bayreuth, Germany
Gudula Rünger, Chemnitz University of Technology, Germany

Parallel programming models using parallel tasks have proven successful for increasing scalability on medium-size homogeneous parallel systems. Several investigations have shown that these programming models can be extended to the hierarchical and heterogeneous systems that will dominate in the future. In this chapter, we discuss parallel programming models with parallel tasks and describe these programming models in the context of other approaches for mixed task and data parallelism. We discuss compiler-based as well as library-based approaches for task programming and present extensions to the model which allow a flexible combination of parallel tasks and an optimization of the resulting communication structure.

Chapter 12
Programmability and Scalability on Multi-Core Architectures .......... 276
Jaeyoung Yi, Yonsei University, Seoul, Korea
Yong J. Jang, Yonsei University, Seoul, Korea
Doohwan Oh, Yonsei University, Seoul, Korea
Won W. Ro, Yonsei University, Seoul, Korea

In this chapter, we describe today's technological trends in building multi-core microprocessors and the associated programmability and scalability issues. Ever since multi-core processors were commercialized, we have seen many different multi-core designs. However, the issues related to how to utilize the physical parallelism of cores for software execution have not been suitably addressed so far. Compared to implementing multiple identical cores on a single chip, separating an originally sequential program into multiple running threads has been an even more challenging task. In this chapter, we introduce several software programs which can be successfully ported to future multi-core processors and describe how they could benefit from multi-core systems. Towards the end, future trends in multi-core systems are reviewed.

Chapter 13
Assembling of Parallel Programs for Large Scale Numerical Modeling .......... 295
V.E. Malyshkin, Russian Academy of Sciences, Russia

The main ideas of the Assembly Technology (AT), as applied to the parallel implementation of large-scale realistic numerical models on a rectangular mesh, are considered and demonstrated through the parallelization (fragmentation) of a Particle-In-Cell (PIC) application for solving the problem of energy exchange in a plasma cloud. The implementation of numerical models with the assembly technology is based on the construction of a fragmented parallel program. Assembling a numerical simulation program under AT automatically provides the target program with useful dynamic properties, including dynamic load balancing based on the migration of fragments from overloaded to underloaded processor elements of a multicomputer. The parallel program assembly approach can also be considered a combination and adaptation, for parallel programming, of the well-known modular programming and domain decomposition techniques, supported by system software for assembling fragmented programs.

Chapter 14
Cell Processing for Two Scientific Computing Kernels .......... 312
Meilian Xu, University of Manitoba, Canada
Parimala Thulasiraman, University of Manitoba, Canada
Ruppa K. Thulasiram, University of Manitoba, Canada

This chapter uses two scientific computing kernels to illustrate the challenges of designing parallel algorithms for one heterogeneous multi-core processor, the Cell Broadband Engine processor (Cell/B.E.). It describes the limitations of current parallel systems that use single-core processors as building blocks. These limitations degrade the performance of applications with data-intensive and computation-intensive kernels such as the Finite Difference Time Domain (FDTD) method and the Fast Fourier Transform (FFT). FDTD is a regular problem with a nearest-neighbour communication pattern under a synchronization constraint. FFT based on the indirect swap network (ISN) modifies the data mapping of the traditional Cooley-Tukey butterfly network to improve data locality, thereby reducing communication and synchronization overhead. The authors aim to unleash the Cell/B.E. and design parallel FDTD and parallel FFT based on ISN by taking into account unique features of the Cell/B.E., such as its eight SIMD processing units on a single chip and its high-speed on-chip bus.
Section 4
Scheduling and Communication Techniques

Chapter 15
On Application Behavior Extraction and Prediction to Support and Improve Process Scheduling Decisions .......... 338
Evgueni Dodonov, University of São Paulo – ICMC, Brazil
Rodrigo Fernandes de Mello, University of São Paulo – ICMC, Brazil

Knowledge of application behavior allows predicting an application's expected workload and future operations. Such knowledge can be used to support, improve, and optimize scheduling decisions by distributing data accesses and minimizing communication overheads. Different techniques can be used to obtain such knowledge, ranging from simple source code analysis and sequential access pattern extraction to history-based approaches and on-line behavior extraction methods. The extracted behavior can later be classified into different groups, representing process execution states, and then used to predict future process events. This chapter describes different approaches, strategies, and methods for application behavior extraction and classification, and also how this information can be used to predict new events, focusing on distributed process scheduling.

Chapter 16
A Structured Tabu Search Approach for Scheduling in Parallel Computing Systems .......... 354
Tore Ferm, Sydney University, Australia
Albert Y. Zomaya, Sydney University, Australia

Task allocation and scheduling are essential for achieving the high performance expected of parallel computing systems. However, there are serious issues pertaining to the efficient utilization of computational resources in such systems that need to be resolved, such as achieving a balance between system throughput and execution time. Moreover, many scheduling techniques involve massive task graphs with complex precedence relations, processing costs, and inter-task communication costs. In general, there are two main issues that should be highlighted: problem representation and finding an efficient solution in a timely fashion. In the work proposed here, we attempt to overcome the first problem by using a structured model which offers a systematic method for the representation of the scheduling problem. The model used can encode almost all of the parameters involved in a scheduling problem in a very systematic manner. To address the second problem, a Tabu Search algorithm is used to allocate tasks to processors in a reasonable amount of time. The use of Tabu Search has the advantage of obtaining solutions to more general instances of the scheduling problem in reasonable time spans. The efficiency of the proposed framework is demonstrated by using several case studies. A number of evaluation criteria are used to optimize the schedules. Communication- and computation-intensive task graphs are analyzed, as are a number of different task graph shapes and sizes.

Chapter 17
Communication Issues in Scalable Parallel Computing .......... 378
C.E.R. Alves, Universidade Sao Judas Tadeu, Brazil
E. N. Cáceres, Universidade Federal de Mato Grosso do Sul, Brazil
F. Dehne, Carleton University, Canada
S. W. Song, Universidade de Sao Paulo, Brazil

In this book chapter, we discuss some important communication issues in obtaining a highly scalable computing system. We consider the CGM (Coarse-Grained Multicomputer) model, a realistic computing model for obtaining scalable parallel algorithms. The communication cost is modeled by the number of communication rounds, and the objective is to design algorithms that require the minimum number of communication rounds. We discuss some important issues and considerations of practical importance, based on our previous experience in the design and implementation of parallel algorithms. The first issue is the amount of data transmitted in a communication round. For a practical implementation to be successful, we should attempt to minimize this amount, even when it is already within the limit allowed by the CGM model. The second issue concerns the trade-off between the number of communication rounds, which the CGM model attempts to minimize, and the overall communication time taken in the communication rounds. Sometimes a larger number of communication rounds may actually reduce the total amount of data transmitted in the communication rounds. These two issues have guided us in presenting efficient parallel algorithms for the string similarity problem, used as an illustration.
Chapter 18
Scientific Workflow Scheduling with Time-Related QoS Evaluation .......... 396
Wanchun Dou, Nanjing University, P. R. China
Jinjun Chen, Swinburne University of Technology, Australia

This chapter introduces a scheduling approach for cross-domain scientific workflow execution with time-related QoS evaluation. Generally, scientific workflow execution often spans self-managing administrative domains to achieve a global collaboration advantage. In practice, it is infeasible for a domain-specific application to disclose its process details for privacy or security reasons. Consequently, it is a challenging endeavor to coordinate scientific workflows and their distributed domain-specific applications from a service invocation perspective. Therefore, in this chapter, we propose a collaborative scheduling approach, with time-related QoS evaluation, for navigating cross-domain collaboration. Under this collaborative scheduling approach, a private workflow fragment can maintain temporal consistency with a global scientific workflow in resource sharing and task enactments. Furthermore, an evaluation is presented to demonstrate the scheduling approach.
Section 5
Service Computing

Chapter 19
Grid Transaction Management and Highly Reliable Grid Platform .......... 421
Feilong Tang, Shanghai Jiao Tong University, China
Minyi Guo, Shanghai Jiao Tong University, China

As Grid technology expands from scientific computing to business applications, open grid platforms increasingly need the support of transaction services. This chapter proposes a grid transaction service (GridTS) and a GridTS-based transaction processing model, and defines two kinds of grid transactions: atomic grid transactions for short-lived reliable applications and long-lived transactions for business processes. We also present solutions for managing these two kinds of transactions to meet different consistency requirements. Moreover, this chapter investigates a mechanism for the automatic generation of compensating transactions during the execution of long-lived transactions through GridTS. Finally, we discuss future trends in reliable grid platform research.

Chapter 20
Error Recovery for SLA-Based Workflows Within the Business Grid .......... 442
Dang Minh Quan, International University in Germany, Germany
Jorn Altmann, Seoul National University, South Korea
Laurence T. Yang, St. Francis Xavier University, Canada

This chapter describes the error recovery mechanisms in a system handling Grid-based workflows within the Service Level Agreement (SLA) context. It classifies the errors into two main categories. The first comprises large-scale errors, in which one or several Grid sites are detached from the Grid system at a time. The second comprises small-scale errors, which may happen inside a resource management system (RMS). For each type of error, the chapter introduces a recovery mechanism, with the SLA context imposing the goal on each mechanism. The authors believe that it is very useful to have an error recovery framework to avoid or eliminate the negative effects of the errors.

Chapter 21
A Fuzzy Real Option Model to Price Grid Compute Resources .......... 471
David Allenotor, University of Manitoba, Canada
Ruppa K. Thulasiram, University of Manitoba, Canada
Kenneth Chiu, University at Binghamton, State University of NY, USA
Sameer Tilak, University of California, San Diego, USA

A computational grid is a geographically dispersed heterogeneous computing facility owned by dissimilar organizations with diverse usage policies. As a result, guaranteeing the availability of grid resources, as well as pricing them, raises a number of challenging issues ranging from security to the management of grid resources. In this chapter we design and develop a grid resource pricing model using a fuzzy real option approach and show that finance models can be effectively used to price grid resources.
Volume II

Chapter 22
The State of the Art and Open Problems in Data Replication in Grid Environments .......... 486
Mohammad Shorfuzzaman, University of Manitoba, Canada
Rasit Eskicioglu, University of Manitoba, Canada
Peter Graham, University of Manitoba, Canada

Data Grids provide services and infrastructure for distributed data-intensive applications that need to access, transfer, and modify massive datasets stored at distributed locations around the world. For example, the next generation of scientific applications, such as many in high-energy physics, molecular modeling, and earth sciences, will involve large collections of data created from simulations or experiments. The size of these data collections is expected to be of multi-terabyte or even petabyte scale in many applications. Ensuring efficient, reliable, secure, and fast access to such large data is hindered by the high latencies of the Internet. The need to manage and access multiple petabytes of data in Grid environments, as well as to ensure data availability and access optimization, poses challenges that must be addressed.

Chapter 23
Architectural Elements of Resource Sharing Networks .......... 517
Marcos Dias de Assunção, The University of Melbourne, Australia
Rajkumar Buyya, The University of Melbourne, Australia

This chapter first presents taxonomies of approaches for resource allocation across resource sharing networks such as Grids. It then examines existing systems and classifies them according to their architectures, operational models, support for the life-cycle of virtual organisations, and resource control techniques. Resource sharing networks have been established and used for various scientific applications over the last decade. The early ideas of Grid computing foresaw a global and scalable network that would provide users with resources on demand. In spite of the extensive literature on resource allocation and scheduling across organisational boundaries, these resource sharing networks mostly work in isolation, thus contrasting with the original idea of Grid computing. Several efforts have been made towards providing architectures, mechanisms, policies, and standards that may enable resource allocation across Grids. A survey and classification of these systems is relevant for understanding the different approaches utilised for connecting resources across organisations and the virtualisation techniques involved. In addition, such a classification also sets the ground for future work on the inter-operation of Grids.
Section 6 Optimization Techniques Chapter 24 Simultaneous MultiThreading Microarchitecture ............................................................................... 552 Chen Liu, Florida International University, USA Xiaobin Li, Intel® Corporation, USA Shaoshan Liu, University of California, Irvine, USA Jean-Luc Gaudiot, University of California, Irvine, USA Due to the conventional sequential programming model, the Instruction-Level Parallelism (ILP) that modern superscalar processors can explore is inherently limited. Hence, multithreading architectures have been proposed to exploit Thread-Level Parallelism (TLP) in addition to conventional ILP. By issuing and executing instructions from multiple threads at each clock cycle, Simultaneous MultiThreading (SMT) achieves some of the best possible system resource utilization and accordingly higher instruction throughput. In this chapter, we describe the origin of SMT microarchitecture, comparing it with other multithreading microarchitectures. We identify several key aspects for high-performance SMT design: fetch policy, handling long-latency instructions, resource sharing control, synchronization and communication. We also describe some potential benefits of SMT microarchitecture: SMT for fault-tolerance and SMT for secure communications. Given the need to support sequential legacy code and emerge of new parallel programming model, we believe SMT microarchitecture will play a vital role as we enter the multi-thread multi/many-core processor design era. Chapter 25 Runtime Adaption Techniques for HPC Applications ........................................................................ 583 Edgar Gabriel, University of Houston, USA This chapter discusses runtime adaption techniques targeting high-performance computing applications. In order to exploit the capabilities of modern high-end computing systems, applications and system software have to be able to adapt their behavior to hardware and application characteristics. Using the Abstract Data and Communication Library (ADCL) as the driving example, the chapter shows the advantage of using adaptive techniques to exploit characteristics of the network and of the application. This allows to reduce the execution time of applications significantly and to avoid having to maintain different architecture dependent versions of the source code. Chapter 26 A Scalable Approach to Real-Time System Timing Analysis............................................................. 606 Alan Grigg, Loughborough University, UK Lin Guan, Loughborough University, UK This Chapter describes a real-time system performance analysis approach known as reservation-based analysis (RBA). The scalability of RBA is derived from an abstract (target-independent) representation of system software components, their timing and resource requirements and run-time scheduling policies. The RBA timing analysis framework provides an evolvable modeling solution that can be
instigated in early stages of system design, long before the software and hardware components have been developed, and continually refined through successive stages of detailed design, implementation and testing. At each stage of refinement, the abstract model provides a set of best-case and worst-case timing ‘guarantees’ that will be delivered subject to a set of scheduling ‘obligations’ being met by the target system implementation. An abstract scheduling model, known as the rate-based execution model then provides an implementation reference model with which compliance will ensure that the imposed set of timing obligations will be met by the target system. Chapter 27 Scalable Algorithms for Server Allocation in Infostations ................................................................. 645 Alan A. Bertossi, University of Bologna, Italy M. Cristina Pinotti, University of Perugia, Italy Romeo Rizzi, University of Udine, Italy Phalguni Gupta, Indian Institute of Technology Kanpur, India The server allocation problem arises in isolated infostations, where mobile users going through the coverage area require immediate high-bit rate communications such as web surfing, file transferring, voice messaging, email and fax. Given a set of service requests, each characterized by a temporal interval and a category, an integer k, and an integer hc for each category c, the problem consists in assigning a server to each request in such a way that at most k mutually simultaneous requests are assigned to the same server at the same time, out of which at most hc are of category c, and the minimum number of servers is used. Since this problem is computationally intractable, a scalable 2-approximation on-line algorithm is exhibited. Generalizations of the problem are considered, which contain bin-packing, multiprocessor scheduling, and interval graph coloring as special cases, and admit scalable on-line algorithms providing constant approximations.
Section 7 Web Computing Chapter 28 Web Application Server Clustering with Distributed Java Virtual Machine ...................................... 658 King Tin Lam, The University of Hong Kong, Hong Kong Cho-Li Wang, The University of Hong Kong, Hong Kong Web application servers, being today’s enterprise application backbone, have warranted a wealth of J2EE-based clustering technologies. Most of them however need complex configurations and excessive programming effort to retrofit applications for cluster-aware execution. This chapter proposes a clustering approach based on distributed Java virtual machine (DJVM). A DJVM is a collection of extended JVMs that enables parallel execution of a multithreaded Java application over a cluster. A DJVM achieves transparent clustering and resource virtualization, extolling the virtue of single-system-image (SSI). We evaluate this approach through porting Apache Tomcat to our JESSICA2 DJVM and identify scalability issues arising from fine-grain object sharing coupled with intensive synchronizations among distributed threads. By leveraging relaxed cache coherence protocols, we are able to conquer the scalability barri-
ers and harness the power of our DJVM’s global object space design to significantly outstrip existing clustering techniques for cache-centric web applications. Chapter 29 Middleware for Community Coordinated Multimedia ....................................................................... 682 Jiehan Zhou, University of Oulu, Finland Zhonghong Ou, University of Oulu, Finland Junzhao Sun, University of Oulu, Finland Mika Rautiainen, University of Oulu, Finland Mika Ylianttila, University of Oulu, Finland Community Coordinated Multimedia (CCM) envisions a novel paradigm that enables the user to consume multiple media through requesting multimedia-intensive Web services via diverse display devices, converged networks, and heterogeneous platforms within a virtual, open and collaborative community. These trends yield new requirements for CCM middleware. This chapter aims to systematically and extensively describe middleware challenges and opportunities to realize the CCM paradigm by reviewing the activities of middleware with respect to four viewpoints, namely mobility-aware, multimedia-driven, service-oriented, and community-coordinated.
Section 8 Mobile Computing and Ad Hoc Networks Chapter 30 Scalability of Mobile Ad Hoc Networks ............................................................................................. 705 Dan Grigoras, University College Cork, Ireland Daniel C. Doolan, Robert Gordon University, UK Sabin Tabirca, University College Cork, Ireland This chapter addresses scalability aspects of mobile ad hoc networks management and clusters built on top of them. Mobile ad hoc networks are created by mobile devices without the help of any infrastructure for the purpose of communication and service sharing. As a key supporting service, the management of mobile ad hoc networks is identified as an important aspect of their exploitation. Obviously, management must be simple, effective, consume least of resources, reliable and scalable. The first section of this chapter discusses different incarnations of the management service of mobile ad hoc networks considering the above mentioned characteristics. Cluster computing is an interesting computing paradigm that, by aggregation of network hosts, provides more resources than available on each of them. Clustering mobile and heterogeneous devices is not an easy task as it is proven in the second part of the chapter. Both sections include innovative solutions for the management and clustering of mobile ad hoc networks, proposed by the authors.
Chapter 31 Network Selection Strategies and Resource Management Schemes in Integrated Heterogeneous Wireless and Mobile Networks............................................................ 718 Wei Shen, University of Cincinnati, USA Qing-An Zeng, University of Cincinnati, USA The integrated heterogeneous wireless and mobile network (IHWMN) is introduced by combining different types of wireless and mobile networks (WMNs) in order to provide more comprehensive service, such as high bandwidth with wide coverage. In an IHWMN, a mobile terminal equipped with multiple network interfaces can connect to any available network, even multiple networks at the same time. The terminal can also change its connection from one network to other networks while still keeping its communication alive. Although the IHWMN is very promising and a strong candidate for future WMNs, it brings many issues because different types of networks or systems need to be integrated to provide seamless service to mobile users. In this chapter, we focus on some major issues in IHWMNs. Several novel network selection strategies and resource management schemes are also introduced for IHWMNs to provide better resource allocation for this new network architecture.
Section 9 Fault Tolerance and QoS Chapter 32 Scalable Internet Architecture Supporting Quality of Service (QoS) ................................................. 739 Priyadarsi Nanda, University of Technology, Sydney (UTS), Australia Xiangjian He, University of Technology, Sydney (UTS), Australia The evolution of the Internet and its successful technologies has brought tremendous growth in business, education, research, etc. over the last four decades. With the dramatic advances in multimedia technologies and the increasing popularity of real-time applications, Quality of Service (QoS) support in the Internet has recently been in great demand. With the deployment of such applications over the Internet in recent years, and the need to manage them efficiently with a desired QoS in mind, researchers have been pursuing a major shift from the Internet's Best Effort (BE) model to a service-oriented model. Such efforts have resulted in Integrated Services (IntServ), Differentiated Services (DiffServ), Multi Protocol Label Switching (MPLS), Policy Based Networking (PBN) and many more technologies. In reality, however, such models have been implemented only in certain areas of the Internet, not everywhere, and many of them also face scalability problems when dealing with huge numbers of traffic flows with varied priority levels. As a result, an architecture that addresses the scalability problem while satisfying end-to-end QoS remains a big issue for the Internet. In this chapter we propose a policy-based architecture which we believe can achieve scalability while offering end-to-end QoS in the Internet.
Chapter 33 Scalable Fault Tolerance for Large-Scale Parallel and Distributed Computing ................................. 760 Zizhong Chen, Colorado School of Mines, USA Today’s long running scientific applications typically tolerate failures by checkpoint/restart in which all process states of an application are saved into stable storage periodically. However, as the number of processors in a system increases, the amount of data that need to be saved into stable storage also increases linearly. Therefore, the classical checkpoint/restart approach has a potential scalability problem for large parallel systems. In this chapter, we introduce some scalable techniques to tolerate a small number of process failures in large parallel and distributed computing. We present several encoding strategies for diskless checkpointing to improve the scalability of the technique. We introduce the algorithm-based checkpoint-free fault tolerance technique to tolerate fail-stop failures without checkpoint or rollback recovery. Coding approaches and floating-point erasure correcting codes are also introduced to help applications to survive multiple simultaneous process failures. The introduced techniques are scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. Experimental results demonstrate that the introduced techniques are highly scalable.
Section 10 Applications Chapter 34 Efficient Update Control of Bloom Filter Replicas in Large Scale Distributed Systems ................... 785 Yifeng Zhu, University of Maine, USA Hong Jiang, University of Nebraska–Lincoln, USA This chapter discusses the false rates of Bloom filters in a distributed environment. A Bloom filter (BF) is a space-efficient data structure to support probabilistic membership query. In distributed systems, a Bloom filter is often used to summarize local services or objects and this Bloom filter is replicated to remote hosts. This allows remote hosts to perform fast membership query without contacting the original host. However, when the services or objects are changed, the remote Bloom replica may become stale. This chapter analyzes the impact of staleness on the false positive and false negative for membership queries on a Bloom filter replica. An efficient update control mechanism is then proposed based on the analytical results to minimize the updating overhead. This chapter validates the analytical models and the update control mechanism through simulation experiments. Chapter 35 Image Partitioning on Spiral Architecture .......................................................................................... 808 Qiang Wu, University of Technology, Australia Xiangjian He, University of Technology, Australia Spiral Architecture is a relatively new and powerful approach to image processing. It contains very useful geometric and algebraic properties. Based on the abundant research achievements in the past decades, it is shown that Spiral Architecture will play an increasingly important role in image processing and
computer vision. This chapter presents a significant application of Spiral Architecture for distributed image processing. It demonstrates the impressive characteristics of spiral architecture for high performance image processing. The proposed method tackles several challenging practical problems during the implementation. The proposed method reduces the data communication between the processing nodes and is configurable. Moreover, the proposed partitioning scheme has a consistent approach: after image partitioning each sub-image should be a representative of the original one without changing the basic object, which is important to the related image processing operations. Chapter 36 Scheduling Large-Scale DNA Sequencing Applications .................................................................... 841 Sudha Gunturu, Oklahoma State University, USA Xiaolin Li, Oklahoma State University, USA Laurence Tianruo Yang, St. Francis Xavier University, Canada This chapter studies a load scheduling strategy with near-optimal processing time that is designed to explore the computational characteristics of DNA sequence alignment algorithms, specifically, the Needleman-Wunsch Algorithm. Following the divisible load scheduling theory, an efficient load scheduling strategy is designed in large-scale networks so that the overall processing time of the sequencing tasks is minimized. In this study, the load distribution depends on the length of the sequence and number of processors in the network and, the total processing time is also affected by communication link speed. Several cases have been considered in the study by varying the sequences, communication and computation speeds, and number of processors. Through simulation and numerical analysis, this study demonstrates that for a constant sequence length as the numbers of processors increase in the network the processing time for the job decreases and minimum overall processing time is achieved. Chapter 37 Multi-Core Supported Deep Packet Inspection .................................................................................. 858 Yang Xiang, Central Queensland University, Australia Daxin Tian, Tianjin University, China Network security applications such as intrusion detection systems (IDSs), firewalls, anti-virus/spyware systems, anti-spam systems, and security visualisation applications are all computing-intensive applications. These applications all heavily rely on deep packet inspection, which is to examine the content of each network packet’s payload. Today these security applications cannot cope with the speed of broadband Internet that has already been deployed, that is, the processor power is much slower than the bandwidth power. Recently the development of multi-core processors brings more processing power. Multi-core processors represent a major evolution in computing hardware technology. While two years ago most network processors and personal computer microprocessors had single core configuration, the majority of the current microprocessors contain dual or quad cores and the number of cores on die is expected to grow exponentially over time. The purpose of this chapter is to discuss the research on using multi-core technologies to parallelize deep packet inspection algorithms, and how such an approach will improve the performance of deep packet inspection applications. 
This will eventually provide a security system the capability of real-time packet inspection thus significantly improve the overall status of security on current Internet infrastructure.
Chapter 38 State-Carrying Code for Computation Mobility ................................................................................. 874 Hai Jiang, Arkansas State University, USA Yanqing Ji, Gonzaga University, USA Computation mobility enables running programs to move around among machines and is the essence of performance gain, fault tolerance, and system throughput increase. State-carrying code (SCC) is a software mechanism to achieve such computation mobility by saving and retrieving computation states during normal program execution in heterogeneous multi-core/many-core clusters. This chapter analyzes different kinds of state saving/retrieving mechanisms for their pros and cons. To achieve a portable, flexible and scalable solution, SCC adopts the application-level thread migration approach. Major deployment features are explained and one example system, MigThread, is used to illustrate implementation details. Future trends are given to point out how SCC can evolve into a complete lightweight virtual machine. New high productivity languages might step in to raise SCC to language level. With SCC, thorough resource utilization is expected.
Compilation of References ............................................................................................................... 895
Foreword
I am delighted to write the Foreword to this book, as it is a very useful resource in a time where change is dramatic and guidance on how to proceed in the development and use of scalable computing technology is in demand. The book is timely, as it comes at the meeting point of two major challenges and opportunities. Information technology, having grown at an increasingly rapid pace since the construction of the first electronic computer, has now reached a point where it represents an essential, new and transformational enabler of progress in science, engineering, and the commercial world. The performance of today's computer hardware and the sophistication of their software systems yield a qualitatively new tool for scientific discovery, industrial engineering, and business solutions. The solutions complement and promise to go beyond those achievable with the classical two pillars of science - theory and real-world experiments. The opportunity of the third pillar is substantial; however, building it is still a significant challenge. The second challenge and opportunity lies in the current transformation that the computer industry is undergoing. The emergence of multicore processors has been called "the greatest disruption information technology has seen." As several decades of riding Moore's law to easily accelerate clock speeds have come to an end, parallel hardware and software solutions must be developed. While the challenge of creating such solutions is formidable, it also represents an opportunity that is sure to create food for thought and work for new generations of scientists, engineers, students, and practitioners. Scalable computing technologies are at the core of both challenges; they help create the hardware and software architectures underlying the third pillar of science and they help create the parallel computing solutions that will make or break the multicore revolution. The book addresses many issues related to these challenges and opportunities. Among them is the question of the computer model of the future. Will we continue to obtain computer services from local workstations and personal computers? Will the compute power be concentrated in servers? Will these systems be connected in the form of Grids? The book also discusses the Cloud model, where the end user obtains all information services via networks from providers "out there" - possibly via small hand-held devices. Embedded and real-time computer systems are another factor in this equation, as information technology continues to penetrate all appliances, equipment, and wearables in our daily lives. While computer systems evolve, the question of the relevant new applications continues to boggle our minds. Classical performance-thirsty programs are those in the area of science and engineering. Future scalable applications are likely to include business and personal software, such as web and mobile applications, tools running on ad-hoc networks, and a myriad of entertainment software. Among the grandest challenges is the question of programming tools and environments for future, scalable software. In the past, parallel programming has been a niche for a small number of scientists and geeks. With multicores and large-scale parallel systems, this technology now must quickly be learned by masses of software engineers. Many new models are being proposed. They include those where multiple
cores and computers communicate by exchanging messages as well as those that share a global address space. The book also discusses mixed models, which will likely have an important role in bridging and integrating heterogeneous computer architectures. The book touches on both classical and newly emerging issues to reach for the enormous opportunities ahead. Among the classical issues are those of performance analysis and modeling, benchmarking, development of scalable algorithms, communication, and resource management. While many solutions to these issues have been proposed in the past, evolving them to true scalability is likely to lead to many more doctoral dissertations at universities and technologies in industries. Among the chief pressing new issues is the creation of scalable hardware and software solutions. Tomorrow's high-performance computers may contain millions of processors; even their building blocks may contain tens of cores within foreseeable time. Today's hardware and software solutions are simply inadequate to deal with this sheer scale. Managing power and energy is another issue that has emerged as a major concern. On one hand, power dissipation of computer chips is the major reason that clock speeds can no longer increase; on the other hand, the overall consumption of information technology's power has risen to a political issue - we will soon use more energy for information processing than for moving matter! Furthermore, as computer systems scale to a phenomenal number of parts, their dependability is of increasing concern; failures and their tolerance may need to be considered as part of standard operating procedures. Among the promising principles underlying many of these technologies is that of dynamic adaptation. Future hardware and software systems may no longer be static. They may change, adapting to new data, environments, faults, resource availability, power, and user demands. They may dynamically incorporate newly available technology, possibly creating computer solutions that evolve continually. The large number of challenges, opportunities, and solutions presented herein will benefit a broad readership from students, to scientists, to practitioners. I am pleased to be able to recommend this book to all those who are looking to learn, use, and contribute to future scalable computing technologies.
Rudolf Eigenmann Professor of Electrical and Computer Engineering and Technical Director for HPC, Computing Research Institute Purdue University November 2008
Rudolf Eigenmann is a professor at the School of Electrical and Computer Engineering and Technical Director for HPC of the Computing Research Institute at Purdue University. His research interests include optimizing compilers, programming methodologies and tools, performance evaluation for high-performance computers and applications, and Internet sharing technology. Dr. Eigenmann received a Ph.D. in Electrical Engineering/Computer Science in 1988 from ETH Zurich, Switzerland.
Preface
There is a constantly increasing demand for computational power for solving complex computational problems in science, engineering and business. The past decade has witnessed a proliferation of more and more high-performance scalable computing systems. This impressive progress is mainly due to the availability of enabling technologies in hardware, software and networks. High-end innovations in such enabling technologies have been fundamental and provide cost-effective tools for exploiting the currently available high-performance systems to make further progress. To that end, this Handbook of Research on Scalable Computing Technologies presents and discusses ideas, results and experiences on recent important advances and future challenges in such enabling technologies. This handbook is directed to those interested in developing programming tools and environments for academic or research computing, extracting inherent parallelism, and achieving higher performance. It will also be useful for upper-level undergraduate and graduate students studying this subject. The main topics covered in this book span a wide array of areas in scalable computing:
• Architectures and systems
• Software and middleware
• Data and resource management paradigms
• Programming models, tools, problem solving environments
• Trust and security
• Service-oriented computing
• Data-intensive computing
• Cluster and Grid computing
• Community and collaborative computing networks
• Scheduling and load balancing
• Economic and utility computing models
• Peer-to-Peer systems
• Multi-core/Many-core based computing
• Parallel and distributed techniques
• Scientific, engineering and business computing
This book is a valuable source for those interested in the development of the field of grid engineering for academic or enterprise computing. It is aimed at computer scientists, researchers and technical managers working in all areas of science, engineering and economy, in academia, research centers and industry.
Acknowledgment
Of course, the areas and topics represented in this handbook are not an exhaustive representation of the world of current scalable computing. Nonetheless, they represent the rich and many-faceted knowledge that we have the pleasure of sharing with the readers. The editors would like to acknowledge all of the authors for their insights and excellent contributions to this handbook, and the help of all involved in the collaboration and review process, without whose support the project could not have been satisfactorily completed. Most of the authors of chapters included in this handbook also served as referees for chapters written by other authors. Thanks go to all those who provided constructive and comprehensive reviews. Special thanks also go to the publishing team at IGI Global, whose contributions throughout the whole process, from inception of the initial idea to final publication, have been invaluable. In particular, we thank Rebecca Beistline, who continuously prodded us via e-mail to keep the project on schedule, and Joel A. Gamon, who helped us complete the book's production professionally.
Kuan-Ching Li Ching-Hsien Hsu Laurence Tianruo Yang Jack Dongarra Hans Zima
Section 1
Grid Architectures and Applications
Chapter 1
Pervasive Grid and its Applications Ruay-Shiung Chang National Dong Hwa University, Taiwan Jih-Sheng Chang National Dong Hwa University, Taiwan
ABSTRACT With advances in computer systems and communication technologies, Grid computing has become a popular technology, bringing about a significant revolution for the next generation of distributed computing applications. For general users, however, a grid middleware is complex to set up and requires a steep learning curve. How users can access the grid system transparently therefore becomes a critical issue. Various challenges also arise from incomplete system designs when coordinating existing computing resources to achieve a pervasive grid environment. The authors investigate current research on pervasive grids and analyze the most important factors and components for constructing a pervasive grid system. Finally, in order to improve the efficiency of teaching and research within a campus, they introduce their pervasive grid platform.
DOI: 10.4018/978-1-60566-661-7.ch001
INTRODUCTION Current scientific problems are becoming more and more complex for computers. With advances in the computing power of hardware and the diversification of Internet services, distributed computing applications are becoming more and more important and widespread. However, earlier technologies such as cluster and parallel computing are insufficient for data-intensive or computing-intensive applications involving large amounts of data file transmission. In addition, from the perspective of most users, a secure and powerful computing environment is beneficial for a tremendous number of computing jobs and data-intensive applications. Fortunately, a new technology
called grid computing (Reed, 2003; Foster, 2002; Foster et al., 2001) has recently been developed to provide the powerful computing ability needed to support such distributed computing applications. Grid is a burgeoning technology capable of integrating a variety of computing resources and scheduling jobs across various sites, in order to supply many users with breakthrough computing power at low cost. Most current grid systems in operation are built on a middleware-based approach, and a number of grid middleware projects have been developed so far, such as Globus, Legion, UNICORE, and SRB. For general users, however, a grid middleware is complex to set up and requires a steep learning curve. Take Globus as an example: although it is now in widespread use for deploying grid middleware, only a command-line environment is provided for users, and working with Globus effectively requires strong knowledge of grid functions and system architecture. To a general user, manipulating grid middleware therefore seems rather complex, and the overhead of managing and maintaining a grid middleware limits its popularization. In addition, it is hard to integrate various computing resources such as mobile devices, handsets, and laptops into a ubiquitous computing platform because of insufficient system support for the underlying heterogeneous resources. How users can access the grid system transparently thus becomes a critical issue. From a programmer's point of view, on the other hand, the lack of programming modules increases the complexity of system development for pervasive grids, and limited support for application-level components also restricts programmers in developing pervasive services. Various challenges therefore arise from incomplete system designs when coordinating existing computing resources to achieve a pervasive grid environment. In this chapter we investigate current research on pervasive grids and analyze the most important factors and components for constructing a pervasive grid system. In addition, in order to improve the efficiency of teaching and research within a campus, we introduce our pervasive grid platform, which aims to make resources available as conveniently as possible. The pervasive grid platform integrates wired and mobile devices into a uniform resource on top of the existing grid infrastructure, so that resources can be accessed easily anytime and anywhere.
CURRENT AND FUTURE RESEARCH TRENDS Cannataro and Talia (2003) proposed an architecture for the pervasive grid that draws on diverse grid technologies, as indicated in Figure 1(a). For example, the knowledge grid is able to extract interesting information from huge data sources by means of data mining technology. The semantic grid is an emerging technology aiming to translate semantic jobs into corresponding grid jobs or commands. The grid fabric provides various grid services, including the data grid and the information grid. The data grid processes data-intensive jobs by way of a powerful distributed storage system and data management technology, in order to deliver superior performance with minimal job execution time. The information grid provides the job broker with complete system information for job dispatch. The interconnection between diverse computing resources is achieved via P2P technology coupled with efficient management strategies, leading toward a more complete architecture. Several works (Ali et al., 2006; Padala & Wilson, 2003; Vazhkudai et al., 2002) have attempted to develop a high-performance framework for a grid-enabled operating system.
Figure 1.
A modular architecture called GridOS (Padala & Wilson, 2003) was proposed to provide a layered infrastructure. Four design principles are considered: modularity, policy neutrality, universality of infrastructure, and minimal core operating system changes. Figure 1(b) presents the system framework of GridOS from the point of view of modular design. At the kernel level, GridOS focuses on a high-performance I/O processing engine. For data-intensive applications, in which large amounts of data are distributed and transported across the Internet, how to process these requests efficiently deserves a great deal of thought. Two aspects need to be taken into consideration: internal disk I/O processing and TCP transmission throughput. Internal I/O processing is improved by integrating the user-level FTP service into a kernel-level one, so that the overhead of copying data from system space to user space can be avoided. To improve TCP transmission throughput, the optimal buffer size is calculated to maximize the throughput. In addition, three modules are built on top of this I/O processing engine to support, respectively, multi-thread communication with different quality-of-service requirements, resource allocation management, and process communication management:
• Communication module
• Resource management module
• Process management module
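A common rule of thumb for the TCP buffer sizing mentioned above is the bandwidth-delay product: a connection stays fully utilized when the socket buffer holds roughly bandwidth × round-trip time bytes. GridOS's exact heuristic is not reproduced here; the following minimal Java sketch, with assumed bandwidth and RTT figures, only illustrates the idea.

```java
// Illustrative only: size socket buffers to the bandwidth-delay product.
// The bandwidth and RTT values are assumed to come from measurement.
public final class TcpBufferSizing {
    static int bufferSizeBytes(double bandwidthBitsPerSec, double rttSeconds) {
        double bytes = (bandwidthBitsPerSec / 8.0) * rttSeconds; // bandwidth-delay product
        return (int) Math.min(bytes, Integer.MAX_VALUE);
    }

    public static void main(String[] args) throws Exception {
        int buf = bufferSizeBytes(100e6, 0.05); // e.g. a 100 Mb/s link with 50 ms RTT
        java.net.Socket s = new java.net.Socket();
        s.setReceiveBufferSize(buf);            // a hint to the OS; the kernel may cap it
        s.setSendBufferSize(buf);
        System.out.println("Requested socket buffers of " + buf + " bytes");
        s.close();
    }
}
```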
Ali et al. (2006) suggested several design points for developing a P2P-based grid operating system. A centralized system may not be appropriate for supporting plug-and-play behavior: if many external computing resources attempt to join a pervasive grid environment, a centralized system finds it hard to manage the join and leave processes dynamically and efficiently. Hence, enabling the grid operating system to discover distributed resources and share resources transparently in a P2P fashion may be a proper alternative. Figure 1(c) shows the overall architecture of the P2P-based grid operating system. Existing grid middleware supports only a few types of applications and excludes interactive ones; the grid-enabled process management layer is intended to support grid-enabled interactive applications rather than only batch ones. Further, the process management layer also governs process migration, which is the transfer of a process between two grid nodes. With regard to the underlying connectivity, each node connects with others through P2P communication in a self-organizing way: all nearby peers are organized into a sub-Grid, while each sub-Grid is a member of a RootGrid. In addition, in order to provide all inter-communicating processes with a grid-wide shared memory space for accessing required data, a virtual file system is used to emulate such a global data access system. A proxy-based clustered architecture was proposed by Phan et al. (2002) for integrating mobile devices into a grid computing environment, as shown in Figure 1(d). A dedicated node called the interlocutor, running a grid middleware such as Globus, is responsible for job management and resource aggregation on behalf of the mobile devices. All requests from users are handled and decomposed by the interlocutor for further job dispatch or resource requests. This is a scalable way to help mobile devices join a grid computing environment, since most of them lack the capacity to install and run a grid middleware. In Hwang and Arvamudham (2004), a proxy-based wireless grid architecture was proposed. A proxy component is deployed as an interface between computing resources and mobile devices for service management and QoS requirements. Once the proxy is in place, a mobile user can connect to a grid environment with ease, without having to take care of the differences between various mobile devices, thereby attaining heterogeneous interworking and pervasive computing. Registry and discovery mechanisms are deployed via Web Services, so that all non-grid wireless devices are capable of accessing the grid system, as illustrated in Figure 2(a). A conception of the pervasive wireless grid was put forward by Srinivasan (2005). The whole computing environment consists of a backbone grid and access grids, as depicted in Figure 2(b). Mobile devices are considered terminals for connecting to the backbone grid, and most computing jobs are dispatched to the backbone grid. In addition, the impact of service handoff on mobile users is discussed in that paper. In conclusion, in light of the above discussion, whether to implement a pervasive grid system at the OS level or the middleware level depends on the requirements. As indicated in Figure 2(c), the middleware level is suitable for developing a pervasive computing system for mobile devices because of its scalability. Since mobile devices cannot afford the overhead of running a middleware system, a proxy-based approach may be a proper solution.
A dedicated proxy server handles the interconnection between mobile devices and grids, so that joining a grid environment becomes easier for mobile devices. As for fixed devices with powerful computing ability and storage resources, an implementation at the OS level is an efficient way to bring all available resources into full play.
Figure 2.
APPLICATION OF PERVASIVE GRID In this section, we introduce an application based on the pervasive grid concept. In our implementation, we have made use of the Globus Toolkit as our system infrastructure. It provides several fundamental grid technologies along with an open source implementation of a series of grid services and libraries. A few critical components of the Globus Toolkit are listed below:
• Security: GSI (Grid Security Infrastructure) provides the authentication and authorization mechanisms for system protection based on X.509 proxy certificates.
• Data management: Tools for manipulating data, including GridFTP and RLS (Replica Location Service). RLS maintains the location information of replicas, mapping logical file names (LFN) to physical file names (PFN).
• Execution management: GRAM (the Grid Resource Allocation and Management service) provides a series of uniform interfaces to simplify access to remote grid resources for job execution. A job is defined in RSL (Resource Specification Language) in terms of the binary executable file, arguments, standard output, and so forth (a brief sketch follows this list).
• Information services: MDS (Monitoring and Discovery System) enables monitoring and discovery services for grid resources.
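As promised above, the fragment below assembles a minimal RSL job description as a plain Java string; the executable path, argument and output file name are placeholders invented for this example, not paths used by the actual system.

```java
// Minimal RSL job description built as a plain string (placeholder values only).
public class RslExample {
    public static void main(String[] args) {
        String rsl = "&(executable=/home/grid/bin/render)"  // binary to run
                   + "(arguments=scene.cfg)"                 // command-line arguments
                   + "(stdout=render.out)";                  // where standard output is written
        System.out.println(rsl);
    }
}
```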
We have developed a client-side portal program by means of the CoG Toolkit. The Java CoG Toolkit provides a series of programming interfaces as well as reusable objects for grid services such as GSI, GRAM, GridFTP, and MDS. It presents programmers with a mapping between the Globus Toolkit and Java APIs, so as to reduce programming complexity. Figure 2(d) shows the overall architecture of the pervasive grid with its hierarchical components. The underlying grid middleware is deployed with the Globus Toolkit, and the pervasive grid platform is implemented on top of it. A service-oriented provider, consisting of data, computation and information services, offers users a comprehensive computing environment. Data transmission and replication are the main operations of the data service, and we utilize GridFTP as the underlying transmission protocol. The computation service provides the computing resources for job execution. The information service gives up-to-date resource information such as CPU frequency, available memory space and average system load; such information can be utilized by the region job dispatcher during job submission in order to decide on a proper grid site for execution. E-campus applications are built on top of the platform and services. The pervasive grid system takes advantage of the pervasive grid to provide students and teachers with a digitalized education system. From the perspective of most users, a friendly interface without complicated manipulations is necessary. In order to simplify interconnection and operation, we have developed a user portal by means of the Java CoG Toolkit. Thanks to the cross-platform nature of Java, our portal can run on various operating systems, and a user can connect to and access the e-campus services in a straightforward way via the client portal.
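Submitting such an RSL description from a portal amounts to handing it to GRAM. The outline below uses the GramJob class that, to our knowledge, ships with the Java CoG Kit / JGlobus; the gatekeeper contact string is a placeholder and class or method names may vary between toolkit versions, so this should be read as an assumption-laden sketch rather than the portal's actual code.

```java
import org.globus.gram.GramJob;  // assumed: GRAM client API from the Java CoG Kit / JGlobus

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder job description and gatekeeper contact.
        String rsl = "&(executable=/home/grid/bin/render)(stdout=render.out)";
        GramJob job = new GramJob(rsl);                        // uses the user's GSI proxy credential
        job.request("gridnode.example.edu/jobmanager-fork");   // submit to the (placeholder) gatekeeper
        System.out.println("Job handed to the gatekeeper");
    }
}
```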
System Overview This research builds on grid technology to support pervasive computing for a digitalized campus platform. We attempt to develop a pervasive grid environment, based on grid computing technology, that coordinates all wireless and wired computing devices within a grid computing environment. From the standpoint of users, all resources are treated uniformly regardless of their type, and a user can access a variety of resources conveniently through the Web Services deployed in our system. We have adopted a layered design approach to implement the pervasive grid system; the design framework appears in Figure 3(a). The layered design makes the pervasive grid system more flexible when new services are added as needed. Based on the Web Services architecture, we develop a service-oriented provider which offers users a comprehensive grid computing service, including data, computation and information services, and provides flexibility for future services that support the pervasive grid system. There are five components within the pervasive grid system, as shown in Figure 3(b):
• Core computing infrastructure
• Edge grid node
• Web services
• Pervasive grid platform
• Applications
Figure 3.
The core computing infrastructure is the main computing and storage resource. It provides a computing platform with storage elements, a scheduling system, and workflow management. In contrast to the core computing infrastructure, the edge grid node is a terminal, such as a notebook, PDA, or personal computer, that connects to the core grid infrastructure. An edge grid node can access the grid services as well as publish services to the public. Web Services is a popular technology based on XML and HTTP for constructing distributed computing applications. It has an open architecture capable of bridging any programming language, computing hardware or operating system. Accordingly, we adopt Web Services as our software interface, in order to build a uniform entry point between an edge grid node and our grid services. As for the pervasive grid platform, it is
a middleware to provide the basic grid services for users, such as location management, service handoff, and personal information management. In the applications layer, we develop some useful applications based on the pervasive grid platform, including e-Classroom and e-Ecology.
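As an illustration of the Web Services entry point just described, the sketch below publishes a toy grid service using the JAX-WS annotations bundled with Java SE 6 through 8; the operation names, types, and URL are assumptions made for the example and are not the interfaces actually published by the platform.

```java
import javax.jws.WebMethod;
import javax.jws.WebService;
import javax.xml.ws.Endpoint;

// Hypothetical service contract: submit a job and query a resource's load.
@WebService
public class GridEntryService {
    @WebMethod
    public String submitJob(String rslDescription) {
        // In the real platform this would be forwarded to the job dispatcher.
        return "job-id-placeholder";
    }

    @WebMethod
    public double queryLoad(String hostName) {
        return 0.0; // would be answered from the information service (MDS)
    }

    public static void main(String[] args) {
        // Publish the endpoint on an assumed local URL.
        Endpoint.publish("http://localhost:8080/gridEntry", new GridEntryService());
    }
}
```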
Pervasive Grid Platform The core computing infrastructure contains the computing power and storage capability to provide mobile and wired users with grid services, while an edge grid node is simply a terminal between a user and the core computing infrastructure. The core computing infrastructure must offer users an efficient interface in a seamless and transparent way, so it is essential to develop a high-performance platform to process users' requests. Several issues need to be addressed by the pervasive grid platform, as listed below:
• Processing the join and leave of edge computing nodes: Our system follows GSI (Grid Security Infrastructure) to design a user authentication/authorization mechanism adapted to our environment.
• Managing the interconnection between the core computing infrastructure and edge grid nodes: There are several differences and limitations among the various edge grid devices. The pervasive grid platform must manage these differences as well as satisfy the user's QoS (Quality of Service).
• The interconnection between the pervasive grid platform and the core computing infrastructure: As presented in Figure 6, we implement the interface that handles the interconnection between the pervasive grid platform and the core computing infrastructure through the Globus APIs. The corresponding algorithm for handling users' jobs via Globus is developed as well.
• Job dispatch, management, and QoS: We are concerned with developing a flexible, high-performance, and reliable dispatcher and scheduler within the pervasive grid platform, in order to satisfy users' requirements. Users with different priorities obtain corresponding service levels (a schematic dispatcher sketch follows this list).
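To make the dispatching requirement in the last item concrete, the sketch below drains jobs from a priority queue so that higher-priority users are served first; the job fields and the dispatch action are assumptions for illustration, not the platform's actual scheduler.

```java
import java.util.concurrent.PriorityBlockingQueue;

// Minimal priority-aware dispatcher: higher-priority users are served first.
public class PriorityDispatcher {
    static class Job implements Comparable<Job> {
        final String rsl;    // job description (placeholder)
        final int priority;  // larger value = more important user
        Job(String rsl, int priority) { this.rsl = rsl; this.priority = priority; }
        public int compareTo(Job other) { return Integer.compare(other.priority, this.priority); }
    }

    private final PriorityBlockingQueue<Job> queue = new PriorityBlockingQueue<>();

    public void submit(Job job) { queue.put(job); }

    public void start() {
        Thread worker = new Thread(() -> {
            while (true) {
                try {
                    Job next = queue.take();                 // blocks until a job is available
                    System.out.println("Dispatching " + next.rsl);
                    // here the job would be handed to GRAM on a chosen grid site
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }
}
```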
Grid Service Provider On the basis of the pervasive grid platform, we implement a service-oriented provider in a modular way based on Web Services technology, so that it is easy to add or remove a service without great pains to maintain the system services. For example, the Data Grid service (Chervenak et al., 2000; Hoschek et al., 2000) is intended to provide large amounts of storage resources and distributed access technology for data-intensive applications. There are three grid service modules within our system: the Data-Grid service, the Computational service, and the Information service. The computational service supplies users with computing services for job execution, and the information service gathers information about hardware resources. Beyond the core grid infrastructure, an edge node can also publish and provide specific services. For instance, a PDA (Personal Digital Assistant) may publish a GPS service to the public, and other edge grid nodes can then access the GPS service provided by that PDA. Through this sharing of services, our system has not merely a better service-oriented architecture but a complete and diverse service provider.
Figure 4.
Therefore, as shown in Figure 3(c), it is reasonable to deploy and build a service repository system that maintains all registered services dynamically, supporting operations such as querying, joining, and removing services.
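One minimal, in-memory reading of such a repository is sketched below; the entry fields are invented for illustration, and a real deployment would persist entries and expose these operations as Web Service calls.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy service repository supporting register (join), query and remove.
public class ServiceRepository {
    public static final class Entry {
        public final String name;        // e.g. "GPS"
        public final String endpointUrl; // where the publishing edge node can be reached
        public Entry(String name, String endpointUrl) {
            this.name = name;
            this.endpointUrl = endpointUrl;
        }
    }

    private final Map<String, Entry> services = new ConcurrentHashMap<>();

    public void register(Entry e)          { services.put(e.name, e); }
    public Entry query(String serviceName) { return services.get(serviceName); }
    public void remove(String serviceName) { services.remove(serviceName); }
}
```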
Applications In light of the development of the pervasive grid platform and the service-oriented provider system, we study academic applications within a campus environment, collectively called the e-Campus system, which offers teachers and students comprehensive services for research and teaching. There are two applications in the e-Campus system: e-Ecology and e-Classroom. National Dong Hwa University (NDHU) has a widespread natural ecosystem, which is a precious treasure for teaching and education. In addition, visitors may wish to understand and observe the natural environment within NDHU. For these reasons, we developed the e-Ecology application, shown in Figure 3(d), which keeps records of the daily activities of the natural ecology system as video files over a long period of time at NDHU. The data size of these video files is obviously tremendous.
As presented in Figure 4(a), in order to cope with such a large amount of data, we have implemented a storage broker based on Data-Grid technology in support of the e-Ecology system. The overall components of the storage broker are presented in Figure 4(b). The file mover uses GridFTP as its transmission protocol to copy files between two grid nodes. The upload processing engine obtains the space information of each storage node from MDS. We adapt the roulette wheel algorithm (Goldberg, 1989) in the storage broker for choosing a node to which a file is uploaded: nodes with larger storage capacity have a larger probability of being chosen, with a view to balancing the storage resources across the system. The download processing engine is an agent distributed to each node. When the broker receives a download request, it retrieves information from the RLS database and redirects the request to the node that contains the file; the download agent on that node receives the request and starts transferring the file. The storage broker distributes download jobs across the storage nodes in order to shorten the download time, so users can browse the digital files smoothly without significant delay. The search engine helps to look for specific files by keywords or properties. With this efficient storage broker system, visitors can join the pervasive grid and access our ecological data via the e-Ecology system as long as they are authorized, and students or teachers can also investigate the ecology within NDHU for their research. With respect to e-Classroom, the video data for a course can be digitalized and stored in our system via Data-Grid technology. Students can review a course by browsing its video data via e-Classroom, and with such multimedia course review, teaching efficiency can be improved. In addition, as shown in Figure 4(c), the teaching data can be shared among various universities for distance learning, so as to achieve the objective of sharing educational resources.
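The roulette-wheel selection used by the upload processing engine can be expressed in a few lines: each candidate node receives a slice of the wheel proportional to its free space, so better-provisioned nodes are chosen more often without excluding any node. The node structure below is a simplification assumed for the example.

```java
import java.util.List;
import java.util.Random;

// Roulette-wheel selection of an upload target, weighted by free storage space.
public class RouletteWheelSelector {
    public static class StorageNode {
        final String host;
        final long freeBytes; // as reported by the information service
        public StorageNode(String host, long freeBytes) { this.host = host; this.freeBytes = freeBytes; }
    }

    private final Random random = new Random();

    public StorageNode choose(List<StorageNode> nodes) {
        long total = 0;
        for (StorageNode n : nodes) total += n.freeBytes;
        double spin = random.nextDouble() * total; // pick a point on the wheel
        double cumulative = 0;
        for (StorageNode n : nodes) {
            cumulative += n.freeBytes;
            if (spin < cumulative) return n;
        }
        return nodes.get(nodes.size() - 1);        // guard against rounding at the edge
    }
}
```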
Implementation We have implemented an integrated portal for the pervasive grid system with a friendly user interface, called the NDHU Grid Client, as shown in Figure 4(d). The NDHU Grid Client is easy to use for students and teachers even if they have little knowledge of computers. Several functions are integrated in this portal, including a user certification tool, a GridFTP transmission tool, a grid job tool, and the e-Campus applications. Each application is created as an internal frame running as an independent thread, so that, thanks to the multi-threading programming model, one job does not influence the others. As for the e-Campus applications, take e-Ecology as an example: to browse an ecological video file via e-Ecology, we first connect to the storage broker, as shown in Figure 5(a), and then input the LFN (Logical File Name) of the video file. The storage broker searches for an optimal site containing this file and downloads it via GridFTP. GridFTP supports parallel data transfer using multiple TCP streams to improve the bandwidth over that of a single stream, and we make use of parallel data transfer to shorten the waiting time for users. After the transmission, the ecological file is presented through the e-Ecology interface, as shown in Figure 5(b).
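The "one internal frame per application, one thread per job" structure described above corresponds roughly to the standard Swing pattern sketched below; the window titles and the work performed are placeholders, and the actual NDHU Grid Client is naturally more elaborate.

```java
import javax.swing.JDesktopPane;
import javax.swing.JFrame;
import javax.swing.JInternalFrame;
import javax.swing.SwingUtilities;

// Each e-Campus application opens in its own internal frame and runs its work
// on a separate thread so that one job does not block the others.
public class PortalSketch {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame portal = new JFrame("Grid Client (sketch)");
            JDesktopPane desktop = new JDesktopPane();
            portal.setContentPane(desktop);

            JInternalFrame eEcology = new JInternalFrame("e-Ecology", true, true, true, true);
            eEcology.setSize(400, 300);
            eEcology.setVisible(true);
            desktop.add(eEcology);

            new Thread(() -> {
                // placeholder for the application's work, e.g. a GridFTP download
            }, "e-Ecology-worker").start();

            portal.setSize(800, 600);
            portal.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            portal.setVisible(true);
        });
    }
}
```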
Performance Evaluation and Analyses GridFTP supports parallel data transfer using multiple TCP streams for better performance, and we adapt parallel data transfer to our system in order to shorten the download time. Although increasing the parallelism of transmissions seems to achieve better performance, it may also lead to more computing overhead on account of too many working threads in the system.
Figure 5.
We experimented with the number of parallel TCP data streams, varying it from one to six, for downloading a 700-megabyte video file, with a view to determining an appropriate parallelism value. The result is shown in Figure 5(c): three data streams proved superior to the other settings, so we adopt a parallelism of three data streams in our implementation. We also compared conventional single-stream transmission with parallel transmission. As shown in Figure 5(d), the results indicate that our transmission model outperforms the conventional one, so users obtain excellent browsing quality for large video data via the e-Campus system.
CONCLUSION In this chapter, we have surveyed current research on pervasive grids and analyzed the most important factors and components for constructing a pervasive grid system. Two main implementation approaches have been explored so far: the OS level and the middleware level. Which one to choose depends on the system requirements and environment. Finally, we have introduced applications of our pervasive grid system.
REFERENCES Ali, A., McClatchey, R., Anjum, A., Habib, I., Soomro, K., Asif, M., et al. (2006). From grid middleware to a grid operating system. In Proceedings of the Fifth International Conference on Grid and Cooperative Computing (pp. 9-16). China: IEEE Computer Society. Cannataro, M., & Talia, D. (2003). Towards the next-generation grid: A pervasive environment for knowledge-based computing. In Proceedings of the International Conference on Information Technology: Computers and Communications (pp. 437-441), Italy. Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2000). The data grid: Towards an architecture for the distributed management and analysis of large scientific data sets. Journal of Network and Computer Applications, 23(3), 187–200. doi:10.1006/jnca.2000.0110 CoG Toolkit (n.d.). Retrieved from http://www.cogkit.org/ Foster, I. (2002). The grid: A new infrastructure for 21st century science. Physics Today, 55, 42–47. doi:10.1063/1.1461327 Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the grid: Enabling scalable virtual organizations. The International Journal of Supercomputer Applications, 15(3), 200–222. Globus: Grid security infrastructure (GSI) (n.d.). Retrieved from http://www.globus.org/security/ Globus: The grid resource allocation and management (GRAM) (n.d.). Retrieved from http://www.globus.org/toolkit/docs/3.2/gram/ Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. New York: Addison-Wesley. IBM Grid Computing (n.d.). Retrieved from http://www-1.ibm.com/grid/ GridFTP (n.d.). Retrieved from http://www.globus.org/toolkit/docs/4.0/data/gridftp/ GSI (Globus Security Infrastructure) (n.d.). Retrieved from http://www.globus.org/Security/ Hoschek, W., Jaen-Martinez, J., Samar, A., Stockinger, H., & Stockinger, K. (2000). Data management in an international data grid project. In Grid Computing - GRID 2000 (pp. 333-361). UK. Hwang, J., & Arvamudham, P. (2004). Middleware services for P2P computing in wireless grid networks. IEEE Internet Computing, 8(4), 40–46. doi:10.1109/MIC.2004.19 Information Services (n.d.). Retrieved from http://www.globus.org/toolkit/mds/ Legion (n.d.). Retrieved from http://www.legion.virginia.edu/ Padala, P., & Wilson, J. N. (2003). GridOS: Operating system services for grid architectures. In High Performance Computing (pp. 353-362). Berlin: Springer. Phan, T., Huang, L., & Dulan, C. (2002). Challenge: Integrating mobile wireless devices into the computational grid. In Proceedings of the 8th Annual International Conference on Mobile Computing and Networking (pp. 271-278), USA.
Reed, D. A. (2003). Grids, the TeraGrid, and beyond. IEEE Computer, 36(1), 62–68. Replica Location Service (RLS) (n.d.). Retrieved from http://www.globus.org/toolkit/docs/4.0/data/rls/ Siagri, R. (2007). Pervasive computers and the GRID: The birth of a computational exoskeleton for augmented reality. In 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (pp. 1-4), Croatia. SRB (Storage Resource Broker) (n.d.). Retrieved from http://www.sdsc.edu/srb/index.php/Main_Page Srinivasan, S. H. (2005). Pervasive wireless grid architecture. In Proceedings of the Second Annual Conference on Wireless On-demand Network Systems and Services (pp. 83-88), Switzerland. The EU Data Grid Project (n.d.). Retrieved from http://www.eu-datagrid.org/ The Globus Alliance (n.d.). Retrieved from http://www.globus.org/ Unicore (n.d.). Retrieved from http://unicore.sourceforge.net Vazhkudai, S., Syed, J., & Maginnis, T. (2002). PODOS - The design and implementation of a performance oriented Linux cluster. Future Generation Computer Systems, 18(3), 335–352. doi:10.1016/S0167-739X(01)00055-3
KEY TERMS AND DEFINITIONS
Grid Computing: A technology developed to provide powerful computing capability for supporting distributed computing applications.
Grid Middleware: A software toolkit between grid applications and grid fabrics that provides a series of functionalities including grid security infrastructure, data management, job management, and information services.
The Grid Resource Allocation and Management (GRAM): GRAM provides a series of uniform interfaces that simplify access to remote grid resources for job execution.
Grid Security Infrastructure (GSI): Provides the authentication and authorization mechanisms for system protection based on X.509 proxy certificates.
GridFTP: A secure file transfer protocol used in grid computing.
Pervasive Grid: A novel grid architecture that enables users to manipulate grid services transparently.
Replica Location Service (RLS): RLS maintains the location information of replicas, mapping logical file names (LFNs) to physical file names (PFNs).
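As a purely illustrative sketch of the replica location idea (not the actual RLS interface), the following fragment keeps an in-memory mapping from logical file names to the physical file names of their replicas; the names and URLs are invented.

```python
class ToyReplicaCatalog:
    """Minimal in-memory mapping from logical file names (LFNs)
    to the physical file names (PFNs) of their replicas."""

    def __init__(self):
        self._catalog = {}          # LFN -> set of PFNs

    def register(self, lfn, pfn):
        self._catalog.setdefault(lfn, set()).add(pfn)

    def lookup(self, lfn):
        return sorted(self._catalog.get(lfn, set()))

# Hypothetical usage with invented identifiers.
catalog = ToyReplicaCatalog()
catalog.register("lfn://campus/video001", "gsiftp://siteA.example.org/data/video001.mpg")
catalog.register("lfn://campus/video001", "gsiftp://siteB.example.org/mirror/video001.mpg")
print(catalog.lookup("lfn://campus/video001"))
```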
Chapter 2
Pervasive Grids
Challenges and Opportunities
Manish Parashar, Rutgers, The State University of New Jersey, USA
Jean-Marc Pierson, Paul Sabatier University, France
ABSTRACT
The Pervasive Grid is motivated by the advances in Grid technologies and the proliferation of pervasive systems, and is leading to the emergence of a new generation of applications that use pervasive and ambient information as an integral part of how they manage, control, adapt and optimize. However, the inherent scale and complexity of Pervasive Grid systems fundamentally impact how applications are formulated, deployed and managed, and present significant challenges that permeate all aspects of the systems software stack. In this chapter, the authors present some use-cases of Pervasive Grids and highlight their opportunities and challenges. They then present why semantic knowledge and autonomic mechanisms are seen as foundations for conceptual and implementation solutions that can address these challenges.
DOI: 10.4018/978-1-60566-661-7.ch002
INTRODUCTION
Grid computing has emerged as the dominant paradigm for wide-area distributed computing (Parashar & Lee, 2005). The goal of the original Grid concept is to combine resources spanning many organizations into virtual organizations that can more effectively solve important scientific, engineering, business and government problems. Over the last decade, significant resources and research efforts have been devoted towards making this vision a reality and have led to the development and deployment of a number of Grid infrastructures targeting a variety of applications. However, recent technical advances in computing and communication technologies and associated cost dynamics are rapidly enabling a ubiquitous and pervasive world - one in which the everyday objects surrounding us have embedded computing and communication capabilities and form a seamless Grid of
information and interactions. As these technologies weave themselves into the fabric of everyday life (Weiser, 1991), they have the potential to fundamentally redefine the nature of applications and how they interact with and use information. This leads to a new revolution in the original Grid concept and the realization of a Pervasive Grid vision. The Pervasive Grid vision is driven by the advances in Grid technologies and the proliferation of pervasive systems, and seamlessly integrates sensing/actuating instruments and devices together with classical high performance systems as part of a common framework that offers the best immersion of users and applications in the global environment. This is, in turn, leading to the emergence of a new generation of applications using pervasive and ambient information as an integral part to manage, control, adapt and optimize (Pierson, 2006; Matossian et al., 2005; Bangerth, Matossian, Parashar, Klie, & Wheeler, 2005; Parashar et al., 2006). These applications span a range of areas, including crisis management, homeland security, personal healthcare, predicting and managing natural phenomena, monitoring and managing engineering systems, optimizing business processes, etc. (Baldridge et al., 2006). Note that it is reasonable to argue that, in concept, the vision of Pervasive Grids was inherent in the visions of "computing as a utility" expressed originally by Corbató et al. (Corbató & Vyssotsky, 1965) and later by Foster et al. (Foster, Kesselman, & Tuecke, 2001). In this sense, Pervasive Grids are the next significant step towards realizing the metaphor of the power grid. Furthermore, while Foster et al. defined a computational Grid in (Foster & Kesselman, 1999) as "... a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities", the term pervasive in this definition refers to the transparent access to resources rather than the nature of the resources themselves. Pervasive Grids focus on the latter and essentially address an extreme generalization of the Grid concept where the resources are pervasive and include devices, services, information, etc. The aim of this chapter is to introduce the vision of Pervasive Grid computing and to highlight its opportunities and challenges. In this chapter, we first describe the nature of applications in a Pervasive Grid and outline their requirements. We then describe key research challenges, and motivate semantic knowledge and autonomic mechanisms as the foundations for conceptual and implementation solutions that can address these challenges.
PERVASIVE GRID APPLICATIONS AND THEIR REQUIREMENTS
The applications enabled by Pervasive Grid systems can be classified along three broad axes based on their programming and runtime requirements. Opportunistic applications can discover and use available pervasive information and resources to potentially adapt, optimize, improve QoS, provide a better user experience, etc. For example, a navigation system may use real-time traffic information (possibly obtained from other vehicles) to reduce or avoid congested routes. Similarly, a vehicle safety system may use information from oncoming vehicles to appropriately warn the driver of possible hazards. A key characteristic of these applications is that they do not depend on the availability of the information, but can opportunistically use information if it is available. Note that these applications may consume raw information and process it locally. Alternatively, they may "outsource" the processing of information using available resources at the source of the information or within the pervasive environment. While the above applications are centered on a single user, in cooperative applications, multiple application entities (possibly wireless devices) cooperate with each other, each providing partial information, to make collective decisions in an autonomous manner. An example is a swarm of wireless robotic devices cooperatively
exploring a disaster site or a group of cars sharing information to estimate the overall traffic situation. Finally, certain control applications provide autonomic control capabilities using actuation devices in addition to sensors; for example, a car may anticipate traffic/road conditions and appropriately apply the brakes. As an illustration, consider the somewhat futuristic use-case scenario presented below, which describes how an international medical emergency may be handled using the "anytime-anywhere" access to information and services provided by a Pervasive Grid. This scenario shares some of the views of (Akogrimo, 2004) while adding the semantic dimension to the process. Mr. Smith lives in Toulouse, France, and leaves for a few days for Vienna, Austria. Unfortunately, on the way, he is involved in an accident that leaves him lying unconscious on the road. When help arrives, the responders find only a single piece of information on Mr. Smith, i.e., a numerical identifier (for example on a smart card), which allows them to quickly access Mr. Smith's medical file (which is at least partially in France, perhaps in Toulouse) and to find important information (for example, details of drug allergies and of his surgical history - has he already been anesthetized, and with which product? Has he had an operation? Are there records available such as an operation report or x-rays?) that will allow the responders to adapt and customize the care given to Mr. Smith.
Let us consider this use-case in detail. First, let us assume (unrealistically) that the problem of the single identifier is solved (this particular point is a political and ethical issue, and is far from being solved, even at the European scale), and that Mr. Smith has a health card that encodes his identifier. Pervasive sensors are already embedded with Mr. Smith to monitor his blood pressure and his blood sugar level. These data are available through a specific application available for a range of devices (Palm, notebooks, ...) and transmitted via WiFi from the sensors to the application devices. Further, Mr. Smith's medical data is distributed across various medical centers. The contents of the medical files must be accessible in a protected way: only authorized individuals should be able to access relevant parts of the file, and ideally these authorizations are given by Mr. Smith himself. Note that all the documents would naturally be in French and possibly in different formats and modalities.
Now, the Austrian responder, who only speaks German, has a Palm with a WiFi connection. The WiFi hot spot is located in the ambulance and allows the responder to consult patient medical records through a public hospital network. The intervention by the responder begins at the site of the accident and continues on the road towards the hospital. Please note that at this stage, the responder has no idea of the pervasive presence of the sensors embedded with Mr. Smith. When the responder wants to access information about allergies to certain medication, he must first know where this information resides. From both the identifier of Mr. Smith and the request itself (allergies?), the system seeks the storage centers likely to have some information about Mr. Smith. The responder contacts these centers. He also needs to obtain authorization to enter the French information systems, which he obtains starting from his certificate of membership in the Austrian health system. Trust certificates are established to allow him to access the network of care where the required data are.
An integration service must transform the responder's request to be compatible with the schema of the databases containing the relevant information, and negotiate, according to his profile and the presented request, the parts of the database accessible to him. The request is expressed using a common vocabulary and semantic representation (an ontology of the medical field) to get around the language issue. To reach the data itself, the responder presents the mandatory certificates to read the files. Mr. Smith must have previously created certificates for standard accesses to some of his data; for example, anyone able to assume the responder's role can access information about drug allergies. A repository of the standard certificates for Mr. Smith must be accessible online. The responder presents the retrieved certificates, which authorize the access, and the data are returned.
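A drastically simplified sketch of this flow, with a request expressed in shared ontology terms mapped onto a local schema and filtered by the role carried in a presented certificate, might look as follows. The ontology terms, field names, roles and record contents are invented for illustration; a real deployment would rely on standardized medical vocabularies and X.509 proxy certificates.

```python
# Shared ontology term -> local database field (hypothetical names).
ONTOLOGY_TO_SCHEMA = {"med:DrugAllergy": "allergies", "med:BloodGlucose": "glucose_mmol_l"}

# Which ontology terms each role may read (hypothetical policy).
ROLE_PERMISSIONS = {"first-responder": {"med:DrugAllergy"},
                    "surgeon": {"med:DrugAllergy", "med:BloodGlucose"}}

def query_record(record, ontology_terms, certificate):
    """Return only the fields that the certificate's role is allowed to read."""
    allowed = ROLE_PERMISSIONS.get(certificate["role"], set())
    result = {}
    for term in ontology_terms:
        if term in allowed and term in ONTOLOGY_TO_SCHEMA:
            result[term] = record[ONTOLOGY_TO_SCHEMA[term]]
    return result

# Hypothetical patient record and responder certificate.
patient_record = {"allergies": ["penicillin"], "glucose_mmol_l": 6.1}
responder_cert = {"subject": "CN=Responder42", "role": "first-responder"}
print(query_record(patient_record, ["med:DrugAllergy", "med:BloodGlucose"], responder_cert))
# Only the allergy information is released to the first responder.
```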
After this interaction, two kinds of information are available. First, the system alerts the responder to the presence of sensors with Mr. Smith, and starts the download of the appropriate application (the graphic and language interface must be adapted) onto his Palm. Thanks to the retrieved information, the responder knows the patient's blood sugar level. The second kind of information is related to the medical records of Mr. Smith. The metadata of the documents are analyzed to determine their nature and to see how the Palm can exploit them. An adaptation service is probably required to create a chain of transformations from the original documents (in written and spoken French) into documents that can be used by the responder in the moving ambulance, where he can only read and not listen (due to the noisy environment). Appropriate services include a service for audio-to-text transformation, a French-German translation service, etc. Finally, the first-aid worker gets the relevant data and administers the appropriate medication to the patient.
During the transportation, information about the patient (drugs, known allergies, identifier of the patient) is transmitted to the hospital. In the hospital, even before the arrival of the ambulance, a surgeon can recover, using similar mechanisms but with different conditions (a less constrained terminal, a higher role in the care network, etc.), more complete information (surgical history, scans, etc.) in order to be able to intervene appropriately. The surgeon can decide to start some more complex computation on the data he retrieved, such as comparing this patient's characteristics (and data, such as images, analysis, etc.) to a patient database, to better suit this particular patient's case and provide personalized help. This may lead to the use of utility computing facilities on a stable infrastructure. In the scenario, the responder is very active, interacting with the local sensors and the global infrastructure. One should understand that many of these tasks should be automated, delegated and performed transparently by his device. The pervasive grid ecosystem, which integrates computers, networks, data archives, instruments, observatories, experiments, and embedded sensors and actuators, is also enabling new paradigms in science and engineering - ones that are information/data-driven and that symbiotically and opportunistically combine computations, experiments, observations, and real-time information to understand and manage natural and engineering systems. For example, an Instrumented Oil-Field can (theoretically) achieve efficient and robust control and management of diverse subsurface and near-subsurface geo-systems by completing the symbiotic feedback loop between measured data and a set of computational models, and can provide efficient,
cost-effective and environmentally safe production of oil reservoirs. Similar strategies can be applied to CO2 sequestration, contaminated site cleanup, bio-landfill optimization, aquifer management and fossil fuel production. Another example application is the modelling and understanding of complex marine and coastal phenomena, and the associated management and decision-making processes. This involves an observational assessment of the present state, and a scientific understanding of the processes that will evolve the state into the future, and requires combining surface remote sensing mechanisms (satellites, radar) and spatially distributed in situ subsurface sensing mechanisms to provide a well-sampled blueprint of the ocean, and coupling this real-time data with modern distributed computational models and experiments. Such a pervasive information-driven approach is essential to address important national and global challenges such as (1) safe and efficient navigation and marine operations, (2) efficient oil and hazardous material spill trajectory prediction and clean-up, (3) monitoring, predicting and mitigating coastal hazards, (4) military operations, (5) search and rescue, and (6) prediction of harmful algal blooms, hypoxic conditions, and other ecosystem or water quality phenomena. For example, underwater and aerial robots and oceanic observatories can provide real-time data which, coupled with online satellite, radar and historical data, advanced models and computational and data-management systems, can be used to predict and track extreme weather and coastal behaviours, manage atmospheric pollutants and water contaminants (oil spills), perform underwater surveillance, study coastal changes, track hydrothermal plumes (black smokers), and study the evolution of marine organisms and microbes. An area where pervasive grids can potentially have a dramatic impact is crisis management and response, where immediate and intelligent responses to a rapidly changing situation could mean the difference between life and death for people caught up in a terrorist or other crisis situation. For example, a prototype disaster response test bed, which combines information and data feeds from an actual evolving crisis event with a realistic simulation framework (where the on-going event data are continually and dynamically integrated with the on-line simulations), can provide the ability for decision support and crisis management of real situations as well as more effective training of first-responders. Similarly, one can conceive of a fire management application where computational models use streaming information from sensors embedded in the building along with real-time and predicted weather information (temperature, wind speed and direction, humidity) and archived historical data to predict the spread of the fire and to guide fire-fighters, warning of potential threats (blowback if a door is opened) and indicating the most effective options. This information can also be used to control actuators in the building to manage the fire and reduce damage.
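The symbiotic feedback loop between streaming measurements and a running model can be caricatured in a few lines. The sketch below assumes a hypothetical one-dimensional fire-front model whose spread rate is continuously re-estimated from incoming sensor readings; it illustrates the coupling pattern only, not any real fire model or sensor network.

```python
def assimilate(spread_rate, observed_front, predicted_front, gain=0.5):
    """Nudge the model's spread-rate estimate toward what sensors observed."""
    error = observed_front - predicted_front
    return spread_rate + gain * error

def forecast(front_position, spread_rate, steps=3):
    """Project the fire front forward a few time steps."""
    return [front_position + spread_rate * (k + 1) for k in range(steps)]

# Hypothetical stream of sensed front positions (metres) at each time step.
sensor_stream = [2.0, 4.5, 7.5, 11.0]

front, rate = 0.0, 1.0
for observed in sensor_stream:
    predicted = front + rate            # one-step model prediction
    rate = assimilate(rate, observed, predicted)
    front = observed                    # accept the measurement as current state
    print(f"front={front:4.1f} m, estimated spread rate={rate:4.2f} m/step,"
          f" forecast={forecast(front, rate)}")
```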
CROSSCUTTING CHALLENGES
The Pervasive Grid environment is inherently large, heterogeneous and dynamic, globally aggregating large numbers of independent computing and communication resources, data stores, instruments and sensing/actuating devices. The result is an unprecedented level of uncertainty that is manifested in all aspects of the Pervasive Grid: System, Information and Application (Parashar & Browne, 2005; Parashar, 2006).
• System uncertainty reflects in its structure (e.g., flat, hierarchical, P2P, etc.), in the dynamism of its components (entities may enter, move or leave independently and frequently), in the heterogeneity of its components (their connectivity, reliability, capabilities, cost, etc.), in the lack of guarantees, and, more importantly, in the lack of common knowledge of the numbers, locations, capacities, availabilities and protocols used by its constituents.
• Information uncertainty is manifested in its quality, availability, compliance with common understanding and semantics, as well as the trust in its source.
• Finally, application uncertainty is due to the scale of the applications, the dynamism in application behaviours, and the dynamism in their compositions, couplings and interactions (services may connect to others in a dynamic and opportunistic way).
The scale, complexity, heterogeneity, and dynamism of Pervasive Grid environments and the resulting uncertainty thus require that the underlying technologies, infrastructures and applications be able to detect and dynamically respond during execution to changes in the state of the execution environment, the state and requirements of the application, and the overall context of the applications. This requirement suggests that (Parashar & Browne, 2005):
1. Applications should be composed from discrete, self-managing components which incorporate separate specifications for all of their functional, non-functional and interaction-coordination behaviours.
2. The specifications of computational (functional) behaviours, interaction and coordination behaviours, and non-functional behaviours (e.g., performance, fault detection and recovery, etc.) should be separated so that their combinations are composable.
3. The interface definitions of these components should be separated from their implementations to enable heterogeneous components to interact and to enable dynamic selection of components.
Given these features, a Pervasive Grid application requiring a given set of computational behaviours may be integrated with different interaction and coordination models or languages (and vice versa) and different specifications for non-functional behaviours such as fault recovery and QoS to address the dynamism and heterogeneity of the application and the underlying environments.
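A minimal sketch of these separations, with a component interface decoupled from its implementation and a non-functional (fault-recovery) policy specified outside both, might look as follows; the class and policy names are illustrative only, not part of any existing Pervasive Grid system.

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Interface definition, separated from any implementation."""
    @abstractmethod
    def read(self) -> float: ...

class WirelessSensor(DataSource):
    """One possible implementation; others can be selected dynamically."""
    def __init__(self, readings):
        self._readings = iter(readings)
    def read(self) -> float:
        return next(self._readings)

def with_retry(operation, attempts=3):
    """Non-functional (fault-recovery) behaviour, kept separate from the
    functional component and composable with any of its implementations."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as err:       # e.g. a transient communication failure
            last_error = err
    raise last_error

source: DataSource = WirelessSensor([21.5, 21.7])
print(with_retry(source.read))       # functional call wrapped by the separate policy
```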
RESEARCH OPPORTUNITIES IN PERVASIVE GRID COMPUTING
We believe that addressing the challenges outlined above requires a new paradigm for realizing the Pervasive Grid infrastructure and its technologies, one founded on semantic knowledge and autonomic mechanisms (Parashar & Browne, 2005; Parashar, 2006). Specifically, it requires:
1. Static (defined at the time of instantiation) application requirements and system and application behaviours to be relaxed,
2. The behaviours of elements and applications to be sensitive to the dynamic state of the system and the changing requirements of the application, and to be able to adapt to these changes at runtime,
3. Common knowledge to be expressed semantically (ontology and taxonomy) rather than in terms of names, addresses and identifiers (a minimal illustration follows this list),
4. The core enabling middleware services (e.g., discovery, coordination, messaging, security) to be driven by such semantic knowledge. Further, the implementations of these services must be resilient and must scalably support asynchronous and decoupled behaviours.
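As a toy illustration of points 3 and 4, the following sketch contrasts name-based lookup with discovery driven by a drastically simplified ontology: a request names a concept, and any resource advertising that concept or one of its sub-concepts matches. The taxonomy and resource descriptions are invented.

```python
# Toy taxonomy: concept -> its sub-concepts (a tiny ontology fragment).
TAXONOMY = {"Sensor": {"TemperatureSensor", "GlucoseSensor"},
            "ComputeResource": {"Cluster", "Workstation"}}

def subsumes(concept, candidate, taxonomy=TAXONOMY):
    """True if `candidate` is `concept` or a (transitive) sub-concept of it."""
    if candidate == concept:
        return True
    return any(subsumes(child, candidate, taxonomy)
               for child in taxonomy.get(concept, set()))

# Resources advertise semantic descriptions rather than just names/addresses.
registry = [{"id": "node-17", "concept": "TemperatureSensor", "location": "building-B"},
            {"id": "node-42", "concept": "Cluster", "location": "machine-room"}]

def discover(concept):
    """Semantic discovery: match by concept, not by name or address."""
    return [r for r in registry if subsumes(concept, r["concept"])]

print(discover("Sensor"))            # finds node-17 without knowing its name
```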
Key research challenges include:
• Programming models, abstractions and systems: Applications targeted to emerging Pervasive Grids must be able to address high levels of uncertainty inherent in these environments, and require the ability to discover, query, interact with, and control instrumented physical systems using semantically meaningful abstractions. As a result, they require appropriate programming models and systems that support notions of dynamic space-time context, as well as enable applications capable of correctly and consistently adapting their behaviours, interactions and compositions in real time in response to dynamic data and application/system state, while satisfying real-time, functional, performance, reliability, security, and quality of service constraints. Furthermore, since these behaviours and adaptations are context dependent, they need to be specified separately and at runtime, and must consistently and correctly orchestrate appropriate mechanisms provided by the application components to achieve autonomic management.
• Data/information quality/uncertainty management: A key issue in pervasive systems is the characterization of the quality of information and the need to estimate its uncertainty, so that it can effectively drive the decision making process. This includes algorithms and mechanisms to synthesize actionable information with dynamic qualities and properties from streams of data from the physical environment, and to address issues of data quality assurance, statistical synthesis and hypothesis testing, in-network data assimilation, spatial and/or temporal multiplexing, clustering and event detection (a minimal numerical illustration of such fusion follows this list). Work done in the field of data management (Dong, Halevy, & Yu, 2007; Benjelloun, Sarma, Halevy, Theobald, & Widom, 2008) gives some hints on how to handle data integration when the certainty of individual sources is not assured. Another related aspect is providing mechanisms for adapting the level and frequency of sensing based on this information. Achieving this in an online and in-network manner (as opposed to post-processing stored data) with strict space-time constraints presents significant challenges, which are not addressed by most existing systems. Note that, since different in-network data processing algorithms will have different cost/performance behaviours, strategies for adaptive management of tradeoffs so as to optimize overall application requirements are required.
• Systems software and runtime and middleware services: Runtime execution and middleware services have to be extended to support context-/content-/location-aware and dynamic, data/knowledge-driven and time-constrained executions, adaptations, interactions and compositions of application elements and services, while guaranteeing reliable and resilient execution and/or predictable and controllable performance. Furthermore, data acquisition, assimilation and transport services have to support seamless acquisition of data from varied, distributed and possibly unreliable data sources, while addressing stringent real-time, space and data quality constraints. Similarly, messaging and coordination services must support content-based, scalable and asynchronous interactions with different service qualities and guarantees. Finally, sensor system management techniques are required for the dynamic management of sensor systems, including capacity- and energy-aware topology management, runtime management including adaptations for computation/communication/power tradeoffs, dynamic load-balancing, and sensor/actuator system adaptations.
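As a concrete, deliberately simplistic instance of synthesizing actionable information from uncertain streams, the sketch below fuses several readings of the same quantity by weighting each one by the inverse of its reported variance, so that noisier sources contribute less to the result; the readings themselves are invented.

```python
def fuse(readings):
    """Inverse-variance weighted fusion of (value, variance) pairs.
    Returns the fused estimate and its (reduced) variance."""
    weights = [1.0 / var for _, var in readings]
    fused_value = sum(w * v for (v, _), w in zip(readings, weights)) / sum(weights)
    fused_variance = 1.0 / sum(weights)
    return fused_value, fused_variance

# Three sensors report the same temperature with different confidence levels.
readings = [(21.4, 0.10),   # well-calibrated sensor
            (22.1, 0.50),   # noisier sensor
            (20.9, 0.25)]
value, variance = fuse(readings)
print(f"fused estimate: {value:.2f} +/- {variance ** 0.5:.2f}")
```

This weighting is the standard combination rule for independent estimates with known variances; richer schemes would also account for bias, correlation between sources and trust in their provenance.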
RELATED WORK
Research Landscape in Grid and Autonomic Computing
Grid computing research efforts over the last decade can be broadly divided into efforts addressing the realization of virtual organizations and those addressing the development of Grid applications. The former set of efforts has focused on the definition and implementation of the core services that enable the specification, construction, operation and management of virtual organizations and the instantiation of the virtual machines that are the execution environments of Grid applications. Services include:
• Security services to enable the establishment of secure relationships between a large number of dynamically created subjects and across a range of administrative domains, each with its own local security policy,
• Resource discovery services to enable discovery of hardware, software and information resources across the Grid,
• Resource management services to provide uniform and scalable mechanisms for naming and locating remote resources, support the initial registration/discovery and ongoing monitoring of resources, and incorporate these resources into applications,
• Job management services to enable the creation, scheduling, deletion, suspension, resumption, and synchronization of jobs,
• Data management services to enable accessing, managing, and transferring of data, and providing support for replica management and data filtering.
Efforts in this class include Globus (The Globus Alliance), Unicore (Unicore Forum), Condor (Thain, Tannenbaum, & Livny, 2002) and Legion (Grimshaw & Wulf, 1997). Other efforts in this class include the development of common APIs, toolkits and portals that provide high-level uniform and pervasive access to these services. These efforts include the Grid Application Toolkit (GAT) (Allen et al., 2003), DVC (Taesombut & Chien, 2004) and the Commodity Grid Kits (CoG Kits) (Laszewski, Foster, & Gawor, 2000). These systems often incorporate programming models or capabilities for utilizing programs written in some distributed programming model. For example, Legion implements an object-oriented programming model, while Globus provides a capability for executing programs utilizing message passing. The second class of research efforts deals with the formulation, programming and management of Grid applications. These efforts build on the Grid implementation services and focus on programming models, languages, tools and frameworks, and application runtime environments. Research efforts in this class include GrADS (Berman et al., 2001), GridRPC (Nakada et al., 2003), GridMPI (Ishikawa, Matsuda, Kudoh, Tezuka, & Sekiguchi, 2003), Harness (Migliardi & Sunderam, 1999), Satin/IBIS (Nieuwpoort, Maassen, Wrzesinska, Kielmann, & Bal, 2004) (Nieuwpoort et al., n.d.), XCAT (Govindaraju et al., 2002) (Krishnan & Gannon, 2004), Alua (Ururahy & Rodriguez, 2004), G2 (Kelly, Roe, & Sumitomo, 2002), J-Grid (Mathe, Kuntner, Pota, & Juhasz, 2003), Triana (Taylor, Shields, Wang, & Philp, 2003), and ICENI (Furmento, Hau, Lee, Newhouse, & Darlington, 2003). These systems have essentially built on, combined and extended existing models for parallel and distributed computing. For example, GridRPC extends the traditional RPC model to address system dynamism. It builds on Grid system services to combine resource discovery, authentication/authorization, resource allocation and task
scheduling with remote invocations. Similarly, Harness and GridMPI build on the message-passing parallel computing model, while Satin supports divide-and-conquer parallelism on top of the IBIS communication system. GrADS builds on the object model and uses reconfigurable objects and performance contracts to address Grid dynamics, while XCAT and Alua extend the component-based model. G2, J-Grid, Triana and ICENI build on various service-based models. G2 builds on .Net (Microsoft .Net), J-Grid builds on Jini (Jini Network Technology), and current implementations of Triana and ICENI build on JXTA (Project JXTA, 2001). While this is natural, it also implies that these systems implicitly inherit the assumptions and abstractions that underlie the programming models of the systems upon which they are based, and thus in turn inherit their assumptions, capabilities and limitations.
In recent years, the semantic grid paradigm has gained much interest from researchers and at the Global Grid Forum. In (De Roure, Jennings, & Shadbolt, 2005), De Roure and Jennings propose a view on the semantic grid, its past, present and future. They identify some key requirements of the semantic grid: resource description, discovery and use, process description and enactment, security and trust, annotation to enrich the description of digital content, information integration and fusion (potentially on the fly), context awareness, communities, smart environments, etc. Ontologies and semantic web services are expected to help achieve a semantic grid. Work on the semantic grid can be enlarged to encompass pervasive computing (Roure, 2003). In this work, the author defines where the semantic grid can benefit from pervasive devices, and vice versa: on one side, the semantic grid can contribute to the processing of the data acquired, for instance, by sensors; on the other hand, the semantic grid benefits from potential metadata coming from the pervasive appliances themselves, allowing the automatic creation of annotations describing them.
There has also been research by the authors and others on applying Autonomic Computing (Kephart & Chess, 2003; Parashar & Hariri, 2006) concepts to Grid systems and applications. The autonomic computing paradigm is inspired by biological systems, and aims at developing systems and applications that can manage and optimize themselves using only high-level guidance. The key concept is a separation of (management, optimization, fault-tolerance, security) policies from enabling mechanisms, allowing a repertoire of mechanisms to operate at runtime to respond to the heterogeneity and dynamics of both the applications and the infrastructure. This enables undesired changes in operation to trigger changes in the behaviour of the computing system, so that the system continues to operate (or possibly degrades) in a conformant manner - for example, the system may recover from faults, reconfigure itself to match its environment, and maintain its operations at near-optimal performance. Autonomic techniques have been applied to various aspects of Grid computing such as application runtime management, workload management and data distribution, data streaming and processing, etc. (Parashar & Hariri, 2006). As we will see in the next part, these works on semantically enhanced grids and autonomic computing are complementary to other works directly related to the presence of mobile and context-aware appliances in the environment. Most of these works do not deal with all the specificities of Pervasive Grids.
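The separation of high-level policy from enabling mechanisms that underlies autonomic computing is commonly described as a monitor-analyze-plan-execute loop. The sketch below is a generic caricature of such a loop for runtime self-management; the latency metric, threshold and adaptation actions are placeholders rather than parts of any of the systems cited above.

```python
import random

POLICY = {"max_latency_ms": 200}          # high-level guidance only

def monitor():
    """Mechanism: observe the managed element (here, a fake latency probe)."""
    return {"latency_ms": random.uniform(50, 400)}

def analyze(state, policy):
    """Decide whether the observed state violates the policy."""
    return state["latency_ms"] > policy["max_latency_ms"]

def plan(state):
    """Choose an adaptation; a real manager would pick among many mechanisms."""
    return "add_replica" if state["latency_ms"] > 300 else "reroute_requests"

def execute(action):
    print(f"executing adaptation: {action}")

def autonomic_manager(iterations=5):
    for _ in range(iterations):
        state = monitor()
        if analyze(state, POLICY):        # policy violated?
            execute(plan(state))
        else:
            print(f"latency {state['latency_ms']:.0f} ms within policy, no action")

autonomic_manager()
```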
We now detail some works in these specific directions.
Pervasive Grid Efforts
Davies, Storz and Friday (Storz, Friday, & Davies, 2003; Davies, Friday, & Storz, 2004) were among the first to introduce the concept of a "Ubiquitous Grid", which is close to our Pervasive Grid vision. The purpose of their research is to compare the notion of Grid Computing (definition of I. Foster (Foster,
Kesselman, Nick, & Tuecke, 2002)) and the notion of Pervasive Systems (definition of M. Weiser (Weiser, 1991)). They identify similar interests: heterogeneity, interoperability, scalability, adaptability and fault tolerance, resource management, service composition, discovery, security, communication, audit, payment. They then briefly present a use-case for a ubiquitous Grid, which they develop using Globus Toolkit 3 (GT3). Lack of details makes it difficult to evaluate exactly what has been done to make GT3 behave as a ubiquitous Grid, and what aspects of ubiquity have been addressed. Hingne et al. (Hingne, Joshi, Finin, Kargupta, & Houstis, 2003) propose a multi-agent approach to realize a P-Grid. They are primarily interested in communication, heterogeneity, discovery and service composition, and the scheduling of tasks between the different devices constituting the P-Grid. McKnight et al. (McKnight, Howison, & Bradner, 2004) introduce the concept of a Wireless Grid. Their interest is in the mobile and nomadic issues, which they compare with traditional computing Grids, P2P networks and web services. An interesting aspect of this article is that it investigates the relationships between these actors. In the article, the authors focus on services that they identify as the most important, i.e., resource description and discovery, coordination, trust management and access control. In (Srinavisan, 2005), S.H. Srinivasan details a Wireless Pervasive Grid architecture. The author separates the Grid into two parts: the "backbone grid", physically linked and analogous to network backbones, and the wireless "access grid". Agents act as proxies between the two grids, and act on behalf of mobile devices in the "access grid" on the "backbone grid". Interesting aspects of this effort are the pro-activity and context-awareness of the presentation to end-users. Coulson et al. (Coulson et al., 2005) present a middleware structured using a lightweight run-time component model (OpenCom) that enables appropriate profiles to be configured on a wide range of device types, and facilitates runtime reconfiguration (as required to adapt to dynamic environments). More recently, Coronato and De Pietro (Coronato & Pietro, 2007) describe MiPEG, a middleware consisting of a set of services (compliant with the OGSA grid standard) enhancing classic Grid environments (namely the Globus Toolkit) with mechanisms for handling mobility, context-awareness, users' sessions and the distribution of tasks on the users' computing facilities. Complementary to these, existing research efforts have tackled aspects of integrating pervasive systems with computing Grids, primarily in the fields of mobile computing and pervasive computing. They include works on interaction, mobility and context adaptation. Research presented in (Allen et al., 2003; Graboswki, Lewandowski, & Russell, 2004; Gonzalez-Castano, Vales-Alonso, Livny, Costa-Montenegro, & Anido-Rifo, 2003) focused on the use of light devices to interact with computing Grids, e.g., submitting jobs and visualizing results. A closer integration of mobile devices with the Grids is addressed in (Phan, Huang, & Dulan, 2002; Park, Ko, & Kim, 2003), which propose proxy services to distribute and organize jobs among a pool of light devices. The research presented in (Kurkovsky, Bhagyavati, Ray, & Yang, 2004) solicits surrounding devices to participate in a problem-solving environment.
Mobile Grids have received much interest in recent years with the development of ad hoc networks and/or IPv6 and the work in the mobile computing field. Some researchers (Chu & Humphrey, 2004; Clarke & Humphrey, 2002) have investigated how a Grid middleware (Legion, OGSI.NET) can be adapted to tackle mobility issues. In (Litke, Skoutas, & Varvarigou, 2004), the authors present the opportunities and research challenges in resource management in mobile grid environments, namely resource discovery and selection, job management from scheduling, replication and migration to monitoring, and replica management. (Li, Sun, & Ifeachor, 2005) gives some challenges of mobile ad-hoc networks and adds a Quality of Service dimension to previous works, including provisioning and continuity of service, latency, energy constraints, and fault tolerance in general. The authors map their observations to a mobile
healthcare scenario. (Oh, Lee, & Lee, 2006) proposes, in a wireless setting, to dynamically allocate tasks to surrounding resources, taking into account the context of these resources (their capabilities in terms of energy, network, CPU power, ...). In (Waldburger & Stiller, 2006), the authors focus on the provisioning of services in mobile grids and compare business and technical metrics between Grid Computing, Service Grids, mobile and knowledge grids, SOA and P2P systems. They extend the vision of classical Virtual Organizations to Mobile Dynamic VOs. Mobile agents are used in (Guo, Zhang, Ma, & Zhang, 2004; Bruneo, Scarpa, Zaia, & Puliafito, 2003; Baude, Caromel, Huet, & Vayssiere, 2000) to migrate objects and code among the nodes, while (Wang, Yu, Chen, & Gao, 2005) applies mobile agents to MANETs with dynamic and ever-changing neighbors. (Wong & Ng, 2006) focuses on security while combining mobile agents and the Globus grid middleware to handle mobile grid services. (Akogrimo, 2004; Jiang, O'Hanlon, & Kirstein, 2004) are interested in the advantages of the mobility features of IPv6 in the notification and adaptation of Grids. The authors of (Messig & Goscinski, 2007) relate their work on autonomic system management in mobile grid environments, encompassing self-discovery, self-configuration, dynamic deployment and self-healing for fault tolerance. Context-awareness is the primary focus of the work presented in (Jean, Galis, & Tan, 2004). The authors present an extension of virtual organizations to context, providing personalization of the services. In (Zhang & Parashar, 2003), the authors propose context-aware access control in grids. (Yamin et al., 2003; Otebolaku, Adigun, Iyilade, & Ekabua, 2007) include mobility and context-awareness in their presentation.
CONCLUSION
The proliferation of pervasive sensing/actuating devices, coupled with advances in computing and communication technologies, is rapidly enabling the next revolution in Grid computing - the emergence of Pervasive Grids. This, in turn, is enabling a new generation of applications that use pervasive information and services to manage, control, adapt and optimize natural and engineering real-world systems. However, the inherent scale and complexity of Pervasive Grid systems fundamentally impact the nature of applications and how they are formulated, deployed and managed, and present significant challenges that permeate all aspects of the systems software stack, from applications to programming models and systems to middleware and runtime services. This chapter outlined the vision of Pervasive Grid Computing along with its opportunities and challenges, and presented a research agenda for enabling this vision.
REFERENCES
Allen, G., Davis, K., Dolkas, K. N., Doulamis, N. D., Goodale, T., Kielmann, T., et al. (2003). Enabling applications on the grid: A Gridlab overview. International Journal of High Performance Computing Applications: Special issue on Grid Computing: Infrastructure and Applications. Baldridge, K., Biros, G., Chaturvedi, A., Douglas, C. C., Parashar, M., How, J., et al. (2006, January). National Science Foundation DDDAS Workshop Report. Retrieved from http://www.dddas.org/nsfworkshop-2006/wkshp report.pdf.
Bangerth, W., Matossian, V., Parashar, M., Klie, H., & Wheeler, M. (2005). An autonomic reservoir framework for the stochastic optimization of well placement. Cluster Computing, 8(4), 255–269. doi:10.1007/s10586-005-4093-3 Baude, F., Caromel, D., Huet, F., & Vayssiere, J. (2000, May). Communicating mobile active objects in java. In R. W. Marian Bubak Hamideh Afsarmanesh & B. Hetrzberger (Eds.), Proceedings of HPCN Europe 2000 (Vol. 1823, p. 633-643). Berlin: Springer. Retrieved from http://www-sop.inria.fr/oasis/ Julien.Vayssiere/publications/18230633.pdf Benjelloun, O., Sarma, A. D., Halevy, A. Y., Theobald, M., & Widom, J. (2008). Databases with uncertainty and lineage. The VLDB Journal, 17(2), 243–264. doi:10.1007/s00778-007-0080-z Berman, F., Chien, A., Cooper, K., Dongarra, J., Foster, I., & Gannon, D. (2001). The grads project: Software support for high-level grid application development. International Journal of High Performance Computing Applications, 15(4), 327–344. doi:10.1177/109434200101500401 Bruneo, D., Scarpa, M., Zaia, A., & Puliafito, A. (2003). Communication paradigms for mobile grid users. In CCGRID 03 (p. 669). Chu, D., & Humphrey, M. (2004, November 8). Bmobile ogsi.net: Grid computing on mobile devices. In Grid computing workshop (associated with supercomputing 2004), Pittsburgh, PA. Clarke, B., & Humphrey, M. (2002, April 19). Beyond the ”device as portal”: Meeting the requirements of wireless and mobile devices in the legion grid computing system. In 2nd International Workshop On Parallel And Distributed Computing Issues In Wireless Networks And Mobile Computing (associated with ipdps 2002), Ft. Lauderdale, FL. Corbat, F. J., & Vyssotsky, V. A. (1965). Introduction and overview of the multics system. FJCC, Proc. AFIPS, 27(1), 185–196. Coronato, A., & Pietro, G. D. (2007). Mipeg: A middleware infrastructure for pervasive grids. Journal of Future Generation Computer Systems. Coulson, G., Grace, P., Blair, G., Duce, D., Cooper, C., & Sagar, M. (2005, April). A middleware approach for pervasive grid environments. In Uk-ubinet/ uk e-science programme workshop on ubiquitous computing and e-research. Davies, N., Friday, A., & Storz, O. (2004). Exploring the grid’s potential for ubiquitous computing. IEEE Pervasive Computing / IEEE Computer Society [and] IEEE Communications Society, 3(2), 74–75. doi:10.1109/MPRV.2004.1316823 De Roure, D., Jennings, N., & Shadbolt, N. (2005, March). The semantic grid: Past, present, and future. Proceedings of the IEEE, 93(3), 669–681. doi:10.1109/JPROC.2004.842781 Dong, X., Halevy, A. Y., & Yu, C. (2007). Data integration with uncertainty. In Vldb ’07: Proceedings of the 33rd International Conference on Very Large Data Bases (pp. 687–698). VLDB Endowment. Foster, I., & Kesselman, C. (Eds.). (1999). The grid: Blueprint for a new computing infrastructure. San Francisco: Morgan Kaufmann Publishers, Inc.
Foster, I., Kesselman, C., Nick, J., & Tuecke, S. (2002). The physiology of the grid: An open grid services architecture for distributed systems integration. Retrieved from citeseer.nj.nec.com/foster02physiology. html Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the grid: Enabling scalable virtual organizations. The International Journal of Supercomputer Applications, 15(3), 200–222. Furmento, N., Hau, J., Lee, W., Newhouse, S., & Darlington, J. (2003). Implementations of a serviceoriented architecture on top of jini, jxta and ogsa. In Proceedings of uk e-science all hands meeting. Gonzalez-Castano, F. J., Vales-Alonso, J., Livny, M., Costa-Montenegro, E., & Anido-Rifo, L. (2003). Condor grid computing from mobile handheld devices. SIGMOBILE Mobile Comput. Commun. Rev., 7(1), 117–126. doi:10.1145/881978.882005 Govindaraju, M., Krishnan, S., Chiu, K., Slominski, A., Gannon, D., & Bramley, R. (2002, June). Xcat 2.0: A component-based programming model for grid web services (Tech. Rep. No. Technical ReportTR562). Dept. of C.S., Indiana Univ., South Bend, IN. Graboswki, P., Lewandowski, B., & Russell, M. (2004). Access from j2me-enabled mobile devices to grid services. In Proceedings of Mobility Conference 2004, Singapore. Grimshaw, A. S., & Wulf, W. A. (1997). The legion vision of a worldwide virtual computer. Communications of the ACM, 40(1), 39–45. doi:10.1145/242857.242867 Guo, S.-F., Zhang, W., Ma, D., & Zhang, W.-L. (2004, Aug.). Grid mobile service: using mobile software agents in grid mobile service. Machine learning and cybernetics, 2004. In Proceedings of 2004 International Conference on, 1, 178-182. Hingne, V., Joshi, A., Finin, T., Kargupta, H., & Houstis, E. (2003). Towards a pervasive grid. In International parallel and distributed processing symposium (ipdps’03) (p. 207). Ishikawa, Y., Matsuda, M., Kudoh, T., Tezuka, H., & Sekiguchi, S. (2003). The design of a latency-aware mpi communication library. In Proceedings of swopp03. Jean, K., Galis, A., & Tan, A. (2004). Context-aware grid services: Issues and approaches. In Computational science–iccs 2004: 4th international conference Krak’ow, Poland, June 6–9, 2004, proceedings, part iii (LNCS Vol. 3038, p. 1296). Berlin: Springer. Jiang, S., O’Hanlon, P., & Kirstein, P. (2004). Moving grid systems into the ipv6 era. In Proceedings of Grid And Cooperative Computing 2003 (LNCS 3033, pp. 490–499). Heidelberg, Germany: SpringerVerlag. Kelly, W., Roe, P., & Sumitomo, J. (2002). G2: A grid middleware for cycle donation using. net. In Proceedings of the 2002 International Conference on Parallel and Distributed Processing Techniques and Applications. Kephart, J. O., & Chess, D. M. (2003). The vision of autonomic computing. Computer IEEE Computer Society, 36(1), 41–50.
Krishnan, S., & Gannon, D. (2004). Xcat3: A framework for cca components as ogsa services. In Proceedings of Hips 2004, 9th International Workshop on High-Level Parallel Programming Models and Supportive Environments. Kurkovsky, S. Bhagyavati, Ray, A., & Yang, M. (2004). Modeling a grid-based problem solving environment for mobile devices. In ITCC (2) (p. 135). New York: IEEE Computer Society. Laszewski, G. v., Foster, I., & Gawor, J. (2000). Cog kits: A bridge between commodity distributed computing and high-performance grids. In ACM 2000 Conference on java grande (p.97 - 106). San Francisco, CA: ACM Press. Li, Z., Sun, L., & Ifeachor, E. (2005). Challenges of mobile ad-hoc grids and their applications in ehealthcare. In Proceedings of Second International Conference on Computational Intelligence in Medicine And Healthcare (cimed’ 2005). Litke, A., Skoutas, D., & Varvarigou, T. (2004). Mobile grid computing: Changes and challenges of resource management in a mobile grid environment. In Proceedings of Practical Aspects of Knowledge Management (PAKM 2004), Austria. Mathe, J., Kuntner, K., Pota, S., & Juhasz, Z. (2003). The use of jini technology in distributed and grid multimedia systems. In MIPRO 2003, Hypermedia and Grid Systems (p. 148-151). Opatija, Croatia. Matossian, V., Bhat, V., Parashar, M., Peszynska, M., Sen, M., & Stoffa, P. (2005). Autonomic oil reservoir optimization on the grid. [John Wiley and Sons.]. Concurrency and Computation, 17(1), 1–26. doi:10.1002/cpe.871 McKnight, L., Howison, J., & Bradner, S. (2004, July). Wireless grids, distributed resource sharing by mobile, nomadic and fixed devices. IEEE Internet Computing, 8(4), 24–31. doi:10.1109/MIC.2004.14 Messig, M., & Goscinski, A. (2007). Autonomic system management in mobile grid environments. In Proceedings of the Fifth Australasian Symposium on ACSW Frontiers (ACSW’ 07), (pp. 49–58). Darlinghurst, Australia: Australian Computer Society, Inc. Migliardi, M., & Sunderam, V. (1999). The harness metacomputing framework. In Proceedings of Ninth Siam Conference on Parallel Processing for Scientific Computing. San Antonio, TX: SIAM. Nakada, H., Matsuoka, S., Seymour, K., Dongarra, J., Lee, C., & Casanova, H. (2003). Gridrpc: A remote procedure call api for grid computing. Nieuwpoort, R. V. v., Maassen, J., Wrzesinska, G., Hofman, R., Jacobs, C., & Kielmann, T. (2005). Ibis: a flexible and efficient Java-based Grid programming environment. Concurrency and Computation, 17(7/8), 1079-1108. Nieuwpoort, R. V. v., Maassen, J., Wrzesinska, G., Kielmann, T., & Bal, H. E. (2004). Satin: Simple and efficient Java-based grid programming. Journal of Parallel and Distributed Computing Practices. Oh, J., Lee, S., & Lee, E. (2006). An adaptive mobile system using mobile grid computing in wireless network. In Computational Science And Its Applications - ICCSA 2006 (LNCS Vol. 3984, pp. 49-57). Berlin: Springer.
Otebolaku, A., Adigun, M., Iyilade, J., & Ekabua, O. (2007). On modeling adaptation in context-aware mobile grid systems. In Icas ’07: Proceedings of the Third International Conference on Autonomic And Autonomous Systems (p. 52). Washington, DC: IEEE Computer Society. Parashar, M., & Browne, J. (2005, Mar). Conceptual and implementation models for the grid. Proceedings of the IEEE, 93(3), 653–668. doi:10.1109/JPROC.2004.842780 Parashar, M., & Hariri, S. (Eds.). (2006). Autonomic grid computing concepts, requirements, infrastructures, autonomic computing: Concepts, infrastructure and applications, (pp. 49–70). Boca Raton, FL: CRC Press. Parashar, M., & Hariri, S. (Eds.). (2006). Autonomic computing: Concepts, infrastructure and applications. Boca Raton, FL: CRC Press. Parashar, M., & Lee, C. A. (2005, March). Scanning the issue: Special isssue on grid-computing. In Proceedings of the IEEE, 93 (3), 479-484. Retrieved from http://www.caip.rutgers.edu/TASSL/Papers/ proc-ieee-intro-04.pdf Parashar, M., Matossian, V., Klie, H., Thomas, S. G., Wheeler, M. F., Kurc, T., et al. (2006). Towards dynamic data-driven management of the ruby gulch waste repository. In V. N. Alexandrox & et al. (Eds.), Proceedings of the Workshop on Distributed Data Driven Applications and Systems, International Conference on Computational Science 2006 (ICCS 2006) (Vol. 3993, pp. 384–392). Berlin: Springer Verlag. Park, S.-M., Ko, Y.-B., & Kim, J.-H. (2003, December). Disconnected operation service in mobile grid computing. In First International Conference on Service Oriented Computing (ICSOC’2003), Trento, Italy. Phan, T., Huang, L., & Dulan, C. (2002). Challenge: integrating mobile wireless devices into the computational grid. In Mobicom ’02: Proceedings of the 8th annual international conference on mobile computing and networking (pp. 271–278). New York: ACM Press. Pierson, J.-M. (2006, June). A pervasive grid, from the data side (Tech. Rep. No. RR-LIRIS-2006-015). LIRIS UMR 5205 CNRS/INSA de Lyon/Universit Claude Bernard Lyon 1/Universit Lumire Lyon 2/ Ecole Centrale de Lyon. Retrieved from http://liris.cnrs.fr/publis/?id=2436 Roure, D. D. (2003). Semantic grid and pervasive computing. http://www.semanticgrid.org/GGF/ggf9/gpc/ Srinavisan, S. (2005). Pervasive wireless grid architecture. In Second annual conference on wireless on-demand network systems and services (wons’05). Storz, O., Friday, A., & Davies, N. (2003, October). Towards ‘ubiquitous’ ubiquitous computing: an alliance with ‘the grid’. In Proceedings of the First Workshop On System Support For Ubiquitous Computing Workshop (UBISYS 2003) in association with Fifth International Conference On Ubiquitous Computing, Seattle, WA. Retrieved from http://ciae.cs.uiuc.edu/ubisys/papers/alliance-w-grid.pdf Taesombut, N., & Chien, A. (2004). Distributed virtual computer (dvc): Simplifying the development of high performance grid applications. In Workshop on Grids and Advanced Networks (GAN ’04), IEEE Cluster Computing and the Grid (ccgrid2004) Conference, Chicago.
Taylor, I., Shields, M., Wang, I., & Philp, R. (2003). Distributed p2p computing within triana: A galaxy visualization test case. In International Parallel and Distributed Processing Symposium (IPDPS’03). Nice, France: IEEE Computer Society Press. Thain, D., Tannenbaum, T., & Livny, M. (2002). Condor and the grid. John Wiley & Sons Inc. Ururahy, C., & Rodriguez, N. (2004). Programming and coordinating grid environments and applications. In Concurrency and computation: Practice and experience. Waldburger, M., & Stiller, B. (2006). Toward the mobile grid:service provisioning in a mobile dynamic virtual organization. In. Proceedings of the IEEE International Conference on Computer Systems and Applications, 2006, (pp.579–583). Wang, Z., Yu, B., Chen, Q., & Gao, C. (2005). Wireless grid computing over mobile ad-hoc networks with mobile agent. In Skg ’05: Proceedings of the first international conference on semantics, knowledge and grid (p. 113). Washington, DC: IEEE Computer Society. Weiser, M. (1991, February). The computer for the 21st century. Scientific American, 265(3), 66–75. Wong, S.-W., & Ng, K.-W. (2006). Security support for mobile grid services framework. In Nwesp’06: Proceedings of the international conference on next generation web services practices (pp.75–82). Washington, DC: IEEE Computer Society. Yamin, A., Augustin, I., Barbosa, J., da Silva, L., Real, R., & Cavalheiro, G. (2003). Towards merging context-aware, mobile and grid computing. International Journal of High Performance Computing Applications, 17(2), 191–203. doi:10.1177/1094342003017002008 Zhang, G., & Parashar, M. (2003). Dynamic context-aware access control for grid applications. In 4th international workshop on grid computing (grid 2003), (pp. 101 – 108). Phoenix, AZ: IEEE Computer Society Press. Retrieved from citeseer.ist.psu.edu/zhang03dynamic.html
KEY TERMS AND DEFINITIONS
Autonomic Computing: Describes a system that does not need human intervention to operate, repair, adapt and optimize itself. Autonomous entities must adapt to their usage context to find the best fit for their execution.
Grid: The goal of the original Grid concept is to combine resources spanning many organizations into virtual organizations that can more effectively solve important scientific, engineering, business and government problems.
Pervasive: A term that covers the ubiquity of the system. A pervasive system is transparent to its users, who use it without noticing it. It is often linked with mobility, since it helps provide anywhere/anytime resource access for nomadic users.
Pervasive Grid: A pervasive grid combines grid resource sharing with anywhere/anytime access to these resources, whether data or computing resources.
Quality of Service: Designates the achievable performance that a system, an application or a service is expected to deliver to its consumers.
Semantic Knowledge: Designates the enriched value of information. Raw information coming from sensors or monitored by the system is not enough to achieve ubiquitous access to resources; only higher-level abstractions allow the system to be handled seamlessly.
Uncertainty: The doubt that can be cast on the system, the application or the information in a pervasive grid. Information cannot be accepted without question; double checking and redundancy are often the rule.
Chapter 3
Desktop Grids
From Volunteer Distributed Computing to High Throughput Computing Production Platforms
Franck Cappello, INRIA and UIUC, France
Gilles Fedak, LIP/INRIA, France
Derrick Kondo, ENSIMAG - antenne de Montbonnot, France
Paul Malécot, Université Paris-Sud, France
Ala Rezmerita, Université Paris-Sud, France
ABSTRACT
Desktop Grids, literally Grids made of desktop computers, are very popular in the context of "Volunteer Computing" for large-scale "Distributed Computing" projects like SETI@home and Folding@home. They are also very appealing as "Internet Computing" platforms for scientific projects seeking a huge amount of computational resources for massive high throughput computing, like the EGEE project in Europe. Companies are also interested in using cheap computing solutions that do not add extra hardware and cost of ownership. A very recent argument for Desktop Grids is their ecological impact: by scavenging unused CPU cycles without excessively increasing power consumption, they reduce the waste of electricity. This chapter presents the background of Desktop Grids, their principles and essential mechanisms, the evolution of their architectures, their applications, and the research tools associated with this technology.
DOI: 10.4018/978-1-60566-661-7.ch003
ORIGINS AND PRINCIPLES
Nowadays, Desktop Grids are very popular and are among the largest distributed systems in the world: the BOINC platform is used to run over 60 Internet Computing projects and scales up to 4 million participants. To arrive at this outstanding result, theoretical and experimental projects have investigated how to take advantage of idle CPUs and derived the principles of Desktop Grids.
Origins of Desktop Grids
The very first paper discussing a Desktop Grid-like system (Shoch & Hupp, 1982) presented the Worm programs and several key ideas that are currently investigated in autonomic computing (self-replication, migration, distributed coordination, etc.). Several projects preceded the very popular SETI@home. One of the first applications of Desktop Grids was cracking RSA keys. Another early system, in 1997, gave the name of “distributed computing”, sometimes used for Desktop Grids: distributed.net. The aim of this project was finding prime numbers using the Mersenne algorithm. The folding@home project was, together with SETI@home, one of the first projects to gather thousands of participants in the early 2000s. At that time folding@home used the COSM technology. The growing popularity of Desktop Grids has raised significant interest in industry. Companies like Entropia (Chien, Calder, Elbert, Bhatia, 2003), United Devices1, Platform2, Mesh Technologies3 and Data Synapse have proposed Desktop Grid middleware. Performance-demanding users are interested in these platforms, considering their cost-performance ratio, which is even lower than that of clusters. As a mark of success, several Desktop Grid platforms are used daily in production by large companies in the domains of pharmacology, petroleum, aerospace, etc.
The origin of Desktop Grids comes from the association of several key concepts: 1) cycle stealing, 2) computing over several administration domains and 3) the Master-Worker computing paradigm. Desktop Grids inherit the principle of aggregating inexpensive, often already in place, resources from past research on cycle stealing. Roughly speaking, cycle stealing consists of using the CPU cycles of other computers. This concept is particularly relevant when the target computers are idle. Mutka et al. demonstrated in 1987 that the CPUs of workstations are mostly unused (M. W. Mutka & Livny, 1987), opening the opportunity for demanding users to scavenge these cycles for their applications. Due to its high attractiveness, cycle stealing has been studied in many research projects like Condor (Litzkow, Livny, Mutka, 1988), Glunix (Ghormley, Petrou, Rodrigues, Vahdat, Anderson, 1998) and Mosix (Barak, Guday, 1993), to cite a few. In addition to the development of these computing environments, much research has focused on theoretical aspects of cycle stealing (Bhatt, Chung, Leighton, Rosenberg, 1997).
Early cycle stealing systems were bound to the limits of a single administration domain. To harness more resources, techniques were proposed to cross the boundaries of administration domains. A first approach was proposed by Web Computing projects such as Jet (Pedroso, Silva, Silva, 1997), Charlotte (Baratloo, Karaul, Kedem, Wyckoff, 1996), Javelin (P. Cappello et al., 1997), Bayanihan (Sarmenta & Hirano, 1999), SuperWeb (Alexandrov, Ibel, Schauser, Scheiman, 1997), ParaWeb (Brecht, Sandhu, Shan, Talbot, 1996) and PopCorn (Camiel, London, Nisan, Regev, 1997). These projects emerged with Java, taking advantage of the virtual machine properties: high portability across heterogeneous hardware and operating systems, wide availability of the virtual machine in Web browsers and a strong security model associated with bytecode execution. Performance and functionality limitations are some of the fundamental
motivations of the second generation of Global Computing systems like COSM4, BOINC (Anderson, 2004) and XtremWeb (Fedak, Germain, Néri, Cappello, 2001). These systems use firewall and NAT traversal protocols to transport the required communications.
The Master-Worker paradigm is the third enabling concept of Desktop Grids. The concept of Master-Worker programming is quite old (Mattson, Sanders, Massingill, 2004), but its application to large-scale computing over many distributed resources emerged a few years before 2000 (Sarmenta & Hirano, 1999). The Master-Worker programming approach essentially allows implementing non-trivial (bag-of-tasks) parallel applications on loosely coupled computing resources. Because it can be combined with simple fault detection and tolerance mechanisms, it fits extremely well with Desktop Grid platforms, which are very dynamic by nature.
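To make the Master-Worker idea concrete, the sketch below shows a minimal pull-style master and worker loop in Python. It is purely illustrative: the queue layout, task format and function names (run_task, worker) are assumptions for this sketch and are not taken from any existing Desktop Grid middleware.

```python
# Minimal, illustrative Master-Worker sketch (not taken from any real middleware).
# Workers *pull* tasks from the master, which suits volatile desktop resources:
# a worker that disappears simply stops asking for work.
import queue
import threading

task_queue = queue.Queue()     # bag of independent tasks held by the master
result_queue = queue.Queue()   # completed results sent back to the master

def run_task(task):
    # Placeholder for the real computation (assumption: tasks are integers).
    return task * task

def worker(worker_id):
    while True:
        try:
            task = task_queue.get(timeout=1)   # pull a task from the master
        except queue.Empty:
            return                             # no more work: the worker leaves
        result_queue.put((worker_id, task, run_task(task)))
        task_queue.task_done()

# Master side: submit a bag of tasks, start a few workers, collect results.
for t in range(10):
    task_queue.put(t)
threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for th in threads:
    th.start()
task_queue.join()
while not result_queue.empty():
    print(result_queue.get())
```

In a real Desktop Grid the master and workers would of course be separate machines communicating over the network, but the pull structure of the loop stays the same.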
Main Principles
Desktop Grids emerged while the community was considering clustering and hierarchical designs as good performance-cost trade-offs. However, several parameters distinguish Desktop Grids from clusters: scale, communication, heterogeneity and volatility. Moreover, Desktop Grids share with Grids a common objective: to extend the size and accessibility of a computing infrastructure beyond the limit of a single administration domain. In (Foster & Iamnitchi, 2003), the authors present the similarities and differences between Grids and Desktop Grids. Two important distinguishing parameters are the user community (professional or not) and the resource ownership (who owns the resources and who is using them). From the system architecture perspective, we consider two main differences: the system scale and the lack of control over the participating resources.
The notion of large scale is linked to a set of features that has to be taken into account. An example is the system dynamicity caused by node volatility: in Internet Computing platforms (also called Desktop Grids), an unpredictable number of nodes may leave the system at any time. Some research even considers that nodes may quit the system without any prior notice and reconnect in the same way. The lack of control over the participating nodes has a direct consequence on node connectivity. Desktop Grid designers cannot assume that an external administrator is able to intervene in the network settings of the nodes, especially their connection to the Internet via NATs and firewalls. This means that the system has to deal with the in-place infrastructure in terms of performance, heterogeneity, dynamicity and connectivity. Large scale and lack of control have many consequences, at least on the architecture of system components, the deployment methods, programming models, security (trust) and, more generally, on the theoretical properties achievable by the system. These characteristics established a new research context in distributed systems.
From these considerations, Desktop Grid designers arrived at a set of properties that any Desktop Grid system should fulfill: resource connectivity across administrative boundaries, resilience to high resource volatility, job scheduling that is efficient for heterogeneous resources, and standalone, self- and automatically-managed resource applications. Several extra properties have been considered and integrated in some Desktop Grids: resource security, result certification, etc.
Figure 1 presents the simple architecture of basic Desktop Grids. A typical Desktop Grid consists of three components: clients that submit requests, servers that accept requests and return results, and a coordinator that schedules the client requests to the servers. Desktop Grids have applications in High Throughput Computing as well as in data access and communication. Thus, for the sake of simplicity, the requests and results presented in the figure can be either computing or data operations. Clients may send requests with some specific requirements, such as CPU
Figure 1. General architecture of desktop Grids
architecture, OS version, or the availability of some applications and libraries. Because only some servers may provide the required environment, the task of the coordinator is generally extended to perform matchmaking between client requests and server capabilities. Clients and servers are PCs belonging to different administrative domains. They are protected by firewalls and may sit behind a NAT. By default, there is no possibility of direct communication between them. As a consequence, any Desktop Grid should implement protocols to cross administrative domain boundaries. The communication between the components of the Desktop Grid concerns data, job descriptions, job parameters and results, but also application codes. If the application is not available on the servers, it is transmitted by the client or the coordinator to the servers prior to the execution.
The coordinator can be implemented in various ways. The simplest organization consists of a central node. This architecture can be extended to tolerate the central node's failure by using replicated nodes. Other designs consider using a distributed architecture where several nodes handle and manage the client requests and server results. In addition to scheduling and matchmaking, the coordinator must implement fault detection and fault tolerance mechanisms, because it is expected that some servers fail or quit the Desktop Grid (permanently or not) without prior notification.
The lack of control over the servers implies that Desktop Grids rely on humans (in most cases, the owners of the PCs) for the installation of the server software on participating PCs. However, Desktop Grid systems must not rely on PC owners for the management and maintenance of the software. Thus the server software is designed to allow remote upgrade and remote management. The server software, as well as all other Desktop Grid related software components, is managed remotely by the Desktop Grid administrator.
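As a rough illustration of the matchmaking step performed by the coordinator, the Python sketch below pairs a client request with servers advertising compatible capabilities. The attribute names (cpu_arch, os, ram_mb) and the scoring rule are invented for the example and do not correspond to any particular middleware.

```python
# Illustrative matchmaking between a client request and server capabilities.
# Attribute names and the selection rule are assumptions made for this sketch.

servers = [
    {"name": "pc-042", "cpu_arch": "x86_64", "os": "linux",   "ram_mb": 4096},
    {"name": "pc-107", "cpu_arch": "x86_64", "os": "windows", "ram_mb": 2048},
    {"name": "pc-311", "cpu_arch": "arm64",  "os": "linux",   "ram_mb": 8192},
]

request = {"cpu_arch": "x86_64", "os": "linux", "min_ram_mb": 2048}

def matches(req, srv):
    # A server is eligible if it satisfies every hard requirement of the request.
    return (srv["cpu_arch"] == req["cpu_arch"]
            and srv["os"] == req["os"]
            and srv["ram_mb"] >= req["min_ram_mb"])

eligible = [s for s in servers if matches(request, s)]
# Simple ranking: prefer the server with the most memory among the eligible ones.
best = max(eligible, key=lambda s: s["ram_mb"], default=None)
print("eligible:", [s["name"] for s in eligible])
print("selected:", best["name"] if best else None)
```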
CLASSIFICATION OF DESKTOP GRIDS
In this section, we propose a classification of Desktop Grid systems.
Figure 2. Overview of the OurGrid platform architecture
Local Desktop Grids
An Enterprise Desktop Grid consists of desktop PC hosts within a LAN. LANs are often found within a corporation or university, and several companies such as Entropia and United Devices have specifically targeted these LANs as a platform for supporting Desktop Grid applications. Enterprise Desktop Grids are an attractive platform for large-scale computation because the hosts usually have better connectivity (for example, 100 Mbps Ethernet) and relatively less volatility and heterogeneity than Desktop Grids that span the entire Internet. Nevertheless, compared to dedicated clusters, Enterprise Desktop Grids are volatile and heterogeneous platforms, and so the main challenge is to develop fault-tolerant, scalable, and efficient scheduling.
Enterprises also provide commercial Desktop Grids. Their source code is most of the time unavailable and there is little documentation about their internal components. The server part may be available for use inside an enterprise. There are several industrial Desktop Grid platforms, from Entropia (Chien et al., 2003) (which ceased commercial operations in 2004), United Devices, Platform and Mesh Technologies.
Collaborative Desktop Grids
Collaborative Desktop Grids consist of several Local Desktop Grids which agree to aggregate their resources for a common goal. The OurGrid project (Andrade, Cirne, Brasileiro, Roisenberg, 2003; Cirne et al., 2006) is a typical example of such systems. It proposes a mechanism for laboratories to put together their local Desktop Grids. This mechanism allows the local resource managers to construct a P2P network (Figure 2). This solution is attractive because the utilization of computing power by scientists is usually not constant. When scientists need extra computing power, this setup allows them to easily access the resources of friendly universities. In exchange, when their own resources are idle, they can be given or rented to other universities. This requires cooperation of the local Desktop Grid systems, usually at the resource manager level, and mechanisms to schedule several applications. A similar approach has been proposed by the Condor team under the term “flock of Condors” (Pruyne & Livny, 1996).
Internet Volunteer Desktop Grids
For over a decade, the largest distributed computing platforms in the world have been Internet Volunteer Desktop Grids (IVDG), which use the idle computing power and free storage of a large set of networked (and often shared) hosts to support large-scale applications. In this kind of Grid, the owners of the resources are end-user Internet volunteers who provide their personal computers for free. IVDGs are an extremely attractive platform because they offer huge computational power at relatively low cost. Currently, many projects, such as SETI@home (Anderson, Cobb, Korpela, Lebofsky, Werthimer, 2002), FOLDING@home (Shirts & Pande, 2000), and EINSTEIN@home5, use TeraFLOPS of computing power from hundreds of thousands of desktop PCs to execute large, high-throughput applications from a variety of scientific domains, including computational biology, astronomy, and physics.
Single-Application Internet Volunteer Desktop Grids. At the beginning of Internet Volunteer Desktop Grids, most of the largest projects ran only one application. Only data were automatically distributed, most of the time using a simple CGI script on a web server. Upgrading the application required volunteers to manually download and install the new version. In this section, we describe some of these projects.
The Great Internet Mersenne Prime Search (GIMPS)6 is one of the oldest computations using resources provided by volunteer Desktop Grid users. It started in 1996 and is still running. The 44th known Mersenne prime was found in September 2006. Each client connects to a central server (PrimeNet) to get work. Resources are divided into three classes based on the processor model, and each class gets a different type of task. The program uses only 8 MB of RAM and 10 MB of disk space and communicates very little with the servers (a permanent connection is not required). The program checkpoints every half hour.
Since 1997, Distributed.net7 has been trying to solve cryptographic challenges. RC5 and several DES challenges have been solved.
The first version of SETI@Home was released in May 1999. There were already 400,000 pre-registered volunteers, and 200,000 clients registered during the first week. Between July 2001 and July 2002, the platform computed workunits at an average rate of 27.36 TeraFLOPS. The program performs some processing on a signal recorded by a radio telescope and then searches for particular artificially made signals in it. The original record is split into workunits both by time (107 s long) and by frequency (10 kHz).
The Electric Sheep8 (Draves, 2005) screen-saver “realizes the collective dream of sleeping computers”. It harnesses the power of idle computers (idle because they are running the screen-saver) to render, using a genetic algorithm, the fractal animation that it displays. The computation uses the volunteers to decide which animations are beautiful and should be improved. This system consists of only one application but, as the project website claims, about 30,000 unique IP addresses contact the server each day and 2 TB are transferred. At the time of writing, the single centralized server was the bottleneck of this system.
XtremWeb. XtremWeb (Fedak et al., 2001; Cappello et al., 2004) is an open source research project at LRI and LAL that belongs to the family of lightweight Grid systems. It is primarily designed to explore scientific issues in Desktop Grid, Global Computing and Peer-to-Peer distributed systems, but it has also been used in real computations, especially in physics. The first version was released in 2001.
Figure 3. Overview of the XtremWeb platform architecture
The architecture (Figure 3) is similar to most well-known platforms. It is a three-tier architecture with clients, servers and workers. Several instances of these components may be used at the same time. Clients allow the platform's users to interact with the platform by submitting stand-alone jobs, retrieving results and managing the platform. Workers are responsible for executing jobs. The server is a coordination service that connects clients and workers. The server accepts tasks from clients, distributes them to workers according to the scheduling policy, provides the applications needed to run them, and supervises the execution by detecting worker crashes or disconnections. If needed, tasks are restarted on other available workers. At the end, the server retrieves and stores the results before clients download them. Clients and workers are the initiators of all connections to the server; as a consequence, only the server needs to be reachable, while clients and workers may remain behind firewalls. Multiple protocols are supported and can be used depending on the type of workload. Communications may also be secured both by encryption and authentication.
Since its first version, XtremWeb has been deployed over networks of common desktop PCs, providing an efficient and cost-effective solution for a wide range of application domains: bioinformatics, molecular synthesis, high-energy physics, numerical analysis and many more. At the same time, there has been much research around XtremWeb: XtremWeb-CH9 (Abdennadher & Boesch, 2006), funded by the University of Applied Sciences in Geneva, is an enrichment of XtremWeb aimed at better matching P2P concepts. Communications are distributed, i.e., direct communications between workers are possible. It provides a distributed scheduler that takes into account the heterogeneity and volatility of workers. The optimal task granularity is detected automatically according to the number of available workers and scheduled tasks. There is also a monitoring tool for visualizing the execution of the applications.
BOINC
All these mono-application projects share many common components, so there was a need for a generic platform that would provide all these components for an easy integration and deployment of these
Figure 4. Overview of the BOINC platform architecture
projects. Only the part that really does the computation needs to be changed for each project. The Berkeley Open Infrastructure for Network Computing (BOINC) (Anderson, 2004) is the largest volunteer computing platform. More than 900,000 users from nearly all countries participate with more than 1,300,000 computers. More than 40 projects, not including private projects, are available, including the popular SETI@Home project. Projects usually last several months, mainly because of the time needed to attract volunteers and set up a user community.
Each client (computing node) is manually attached by the user to one or more projects (servers). Each project runs a central server (Figure 4) and most of the scheduling is done by the clients. Projects have the ability to run a small number of different applications, which can be updated (jobs have to be very homogeneous). The BOINC server is composed of several daemons which execute the management tasks: first, workunits are produced by a generator. Then the transitioner, the daemon that takes care of the different states of the workunit life cycle, replicates (for redundancy) the workunit into several results (instances of workunits). Each result will be executed on a different client. Then, back on the server, each result will be checked by the validator before being stored in the project science database by the assimilator. All communications are done using CGI programs on the project server, so only port 80 and client-to-server connections are needed. Each user is rewarded with credits, a virtual currency, for the CPU cycles used on their computer.
The client maintains a cache of results to be executed between connections to the Internet. The scheduler tries to enforce many constraints: first, the user may choose to run the applications according to their activity (screen-saver), working hours and available resources. Second, the user assigns a resource share ratio to the projects. Third, sometimes some projects may run out of work to distribute.
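The sketch below mimics, in a much simplified form, the server-side pipeline just described (generator → transitioner → validator → assimilator). Class names, the replication factor and the majority check are illustrative assumptions; the real BOINC daemons are separate processes operating on a shared database.

```python
# Simplified sketch of a BOINC-like server pipeline; all names and the
# majority-based validation are assumptions made for illustration only.
from collections import Counter

REPLICATION = 3  # each workunit is turned into several redundant "results"

def generator(n):
    # Produce workunits (here: just payload integers).
    return [{"id": i, "payload": i} for i in range(n)]

def transitioner(workunit):
    # Replicate the workunit into several results, each sent to a different client.
    return [{"wu": workunit["id"], "payload": workunit["payload"]}
            for _ in range(REPLICATION)]

def fake_client(result):
    # Stand-in for the computation performed by a volunteer's computer.
    return result["payload"] ** 2

def validator(outcomes):
    # Accept the value returned by a majority of the redundant results, if any.
    value, count = Counter(outcomes).most_common(1)[0]
    return value if count >= (REPLICATION // 2 + 1) else None

science_db = {}   # assimilator target: validated result per workunit
for wu in generator(5):
    outcomes = [fake_client(r) for r in transitioner(wu)]
    validated = validator(outcomes)
    if validated is not None:
        science_db[wu["id"]] = validated   # assimilator stores the canonical result
print(science_db)
```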
Some other projects were inspired by the BOINC platform. SLINC10 (Baldassari, Finkel, & Toth, 2006) addresses the main limitations of BOINC by simplifying the project creation process. This software is also operating-system independent, as it runs on the Java platform. It is also database independent (it uses Hibernate), while BOINC works only with MySQL. All communications between components are done with XML-RPC and, to simplify the architecture, the validator component has been removed. Users' applications are also programming-language independent, although only Java and C++ are available for now. Two versions of the same application, the first written in Java and the second written in C++, have almost the same performance. Some BOINC issues have not been fixed here, such as the time needed to have all the volunteers register their resources.
POPCORN (Nisan, London, Regev, Camiel, 1998) is a platform for global distributed computing over the Internet. It was available from mid-1997 until mid-1998. Today, only the papers and documentation are still available. This platform runs on the Java platform, and tasks are executed on workers as “computelets”, a system similar to usual Java applets. Computelets only need to be instantiated for a task to be distributed. Error handling and verification are left to the application level. The platform provides a debugging tool that shows the tree of spawned computelets (for debugging concurrency issues). There is also a market system that enables users to sell their CPU time. The currency works almost like BOINC credits. Some applications have been tested on the platform: brute-force key breaking, genetic algorithms, etc. At the implementation level, the authors had some issues with Java's immaturity (in 1997-1998).
Bayanihan (Sarmenta & Hirano, 1999) is another platform for volunteer computing over the Internet. It is written in Java and uses HORB, a package similar to Sun's RMI, for communications. Many clients (applets started from a web browser or command-line applications) connect to one or more servers.
Korea@Home (Jung, 2005) is a Korean volunteer computing platform. Work management is centralized on one server but, since version 2, a P2P mechanism allows direct communication between computing nodes (agents). This platform harnesses more than 36,000 agents, about 300 of which are available at the same time.
EVOLUTION OF MIDDLEWARE ARCHITECTURE
Job Management
The functionality required for job management includes job submission, resource discovery, resource selection and scheduling, and resource binding. With respect to job submission, most systems, like XtremWeb or Entropia, have an interface similar to batch systems such as PBS, where a job's executable and inputs are specified. Recently, there have been efforts to provide higher-level programming abstractions, such as Map-Reduce (Dean & Ghemawat, 2004).
After a job is submitted to the system, the job management system must identify a set of available resources. Resource discovery is the process of identifying which resources are currently available; it is challenging given the dynamicity and large scale of these systems. There have been both centralized and distributed approaches. The classic method is via matchmaking (Raman, Livny, Solomon, 1998), where application requirements are paired with compatible resource offers via ClassAds. A number of works have addressed the scalability and fault-tolerance issues of this type of centralized matchmaking system.
Several distributed approaches have been proposed. The challenges of building a distributed resource discovery system are the overheads of distributing queries, guaranteeing that queries can be satisfied, being able to support a range of application constraints specified through queries, and being able to handle dynamic loads on nodes. In (Zhou & Lo, 2006), the authors propose distributed resource discovery using a distributed hash table (DHT) in the context of a P2P system. This was one of the first P2P resource discovery mechanisms ever proposed. However, the characteristics of resources can be heavily skewed, such that the query load is heavily imbalanced. In (Iamnitchi, Foster, Nurmi, 2002), the authors propose a P2P approach where the overheads of a query are limited with a time-to-live (TTL). The drawback of this approach is that there is no guarantee that a resource that meets the constraints of the application will be found. In (Kim et al., 2006), the authors proposed a rendezvous-node tree (RNT) where load is balanced using random application assignment. The RNT deals with load dynamics by conducting a random walk (of limited length) after the mapping. In (Lee, Ren, Eigenmann, 2008), the authors use a system where information is summarized hierarchically, and a Bloom filter is used to reduce the overheads of storage and maintenance.
After a set of suitable resources has been determined, the management system must then select a subset of the resources and determine how to schedule tasks among them. We discuss this issue in depth in the next section. Once resources have been selected and a schedule has been determined, the tasks must then be deployed across resources, i.e., bound. In systems such as the Condor Matchmaker, binding occurs last, in a separate step between the consumer and provider (without the matchmaker as the middleman), to allow for the detection of any change in state. If a change in state occurs (for example, the resource is no longer available), then the renegotiation of selected resources can occur.
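To give a feel for the Bloom-filter-based summarization mentioned above, the sketch below builds a tiny Bloom filter over resource attribute strings; membership tests may give false positives but never false negatives, which is why the structure is attractive as a compact summary to push up a hierarchy. The sizes, hash construction and attribute strings are arbitrary choices for the example, not those used by the cited work.

```python
# Tiny illustrative Bloom filter over resource attributes (sizes and hashing
# are arbitrary; this is not the data structure of any cited system).
import hashlib

M = 128  # number of bits in the filter
K = 3    # number of hash functions

def _positions(item):
    # Derive K bit positions from K salted SHA-256 digests of the item.
    return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % M
            for i in range(K)]

class BloomFilter:
    def __init__(self):
        self.bits = [False] * M

    def add(self, item):
        for p in _positions(item):
            self.bits[p] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p] for p in _positions(item))

# A node summarizes its resources' attributes and ships only the bit vector upward.
summary = BloomFilter()
for attr in ["os=linux", "cpu=x86_64", "ram>=2048MB"]:
    summary.add(attr)
print(summary.might_contain("os=linux"))    # True
print(summary.might_contain("os=darwin"))   # False (or, rarely, a false positive)
```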
Resource Scheduling
At the application and resource management level, most research assumes that a centralized scheduler maintains a queue of tasks to be scheduled and a ready queue of available workers. As workers become available, they notify the server, and the scheduler on the server places the corresponding task requests of the workers in the ready queue. During resource selection, the scheduler examines the ready queue to determine the possible choices for task assignment. Because the hosts are volatile and heterogeneous, the size of the host ready queue changes dramatically during application execution as workers are assigned tasks (and thus removed from the ready queue), and as workers of different speeds and availability complete tasks and notify the server. The host ready queue is usually only a small subset of all the workers, since workers only notify the server when they are available for task execution.
At the worker level, most research assumes that the worker running on each host periodically sends a heartbeat to the server that indicates the state of the task. In the XtremWeb system (Fedak et al., 2001), a worker sends a heartbeat every minute to indicate whether the task is running or has failed. With respect to recovery from failures, some works assume local checkpointing abilities. However, remote checkpointing is still work in progress in real Internet-wide systems such as BOINC (Anderson, 2004) and XtremWeb (Fedak et al., 2001). Also, most works do not assume that the server can cancel a task once it has been scheduled on a worker. The reason for this is that resource access is limited, as firewalls are usually configured to block all incoming connections, precluding incoming RPCs, and to allow only outgoing connections (often on a restricted set of ports like port 80). As such, the heuristics cannot preempt a task once it has been
assigned, and workers must take the initiative to request tasks from the server. This platform model deviates significantly from traditional Grid scheduling models (Berman, Wolski, Figueira, Schopf, Shao, 1996; Casanova, Legrand, Zagorodnov, Berman, 2000; Casanova, Obertelli, Berman, Wolski, 2000; Foster & Kesselman, 1999).
The pull nature of work distribution and the random behavior of resources in Desktop Grids place several limitations on scheduling operations. First, advance planning with sophisticated Gantt charts becomes difficult, as resources may not be available for task execution at the scheduled time slot. Second, as task requests are typically handled in a centralized fashion and a (web) server can handle at most a few hundred connections, the choice of available resources is always a small subset of the whole. Nevertheless, we focus below on scheduling solutions applicable in current centralized systems.
The majority of application models in Desktop Grid scheduling have focused on jobs requiring either high throughput (Sonnek, Nathan, Chandra, Weissman, 2006) or low latency (Heien, Fujimoto, Hagihara, 2008; Kondo, Chien, 2004). These jobs are typically compute-intensive. There are four complementary strategies for scheduling in Desktop Grid environments, namely resource selection, resource prioritization, task replication, and host availability prediction. In practice, these strategies are often combined in heuristics.
With respect to resource selection, hosts can be prioritized according to various static or dynamic criteria. Surprisingly, simple criteria such as clock rates have been shown to be effective with real-world traces (Kondo, Chien, 2004). Other studies (Kondo, Chien, Casanova, 2007; Sonnek et al., 2006) have used probabilistic techniques based on a host's history of unavailability to distinguish more stable hosts from others. With respect to resource exclusion, hosts can be excluded using various criteria, as slow hosts (either due to failures, slow clock rates, or other host load) are often the bottleneck of the computation. Thus, excluding them from the resource pool can improve performance dramatically. With respect to task replication, schedulers often replicate a fixed number of times. The authors of (Kondo et al., 2007) and (Sonnek et al., 2006) investigated the use of probabilistic methods for varying the level of replication according to a host's volatility. With respect to host availability prediction, the authors in (Andrzejak, Domingues, Silva, 2006) have recently shown that simple prediction methods (in particular a naive Bayes classifier) can give guarantees on host availability. In particular, in that study, the authors show how to predict that N hosts will be available for T time.
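As an illustration of how prioritization, exclusion and fixed replication might be combined, the sketch below ranks hosts by clock rate, drops the slowest ones, and assigns each task to a fixed number of replicas. Thresholds, field names and the replication factor are invented for the example; they are not the heuristics evaluated in the cited studies.

```python
# Illustrative combination of resource prioritization, exclusion and replication.
# All thresholds and field names are assumptions made for this sketch.

hosts = [
    {"name": "h1", "clock_ghz": 3.2}, {"name": "h2", "clock_ghz": 1.1},
    {"name": "h3", "clock_ghz": 2.6}, {"name": "h4", "clock_ghz": 2.9},
]
tasks = ["t1", "t2"]
REPLICAS = 2           # each task is sent to two different hosts
MIN_CLOCK_GHZ = 1.5    # exclusion criterion: drop hosts slower than this

# Prioritization: fastest hosts first; exclusion: remove the slowest ones.
candidates = sorted((h for h in hosts if h["clock_ghz"] >= MIN_CLOCK_GHZ),
                    key=lambda h: h["clock_ghz"], reverse=True)

# Replication: assign each task to REPLICAS hosts, round-robin over the candidates.
assignment = {t: [] for t in tasks}
i = 0
for t in tasks:
    for _ in range(REPLICAS):
        assignment[t].append(candidates[i % len(candidates)]["name"])
        i += 1
print(assignment)
```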
Volatility Tolerance
There are several issues with volatility, concerning both detection and resolution. With respect to detection, systems such as XtremWeb (Fedak et al., 2001) and Entropia (Chien et al., 2003) use heartbeats. In BOINC, where a centralized (web) server can only handle a few hundred connections simultaneously, the use of heartbeats with millions of resources is not an option. Moreover, heartbeats cannot be used when BOINC operates without a network connection. Instead, BOINC uses job deadlines as an indication of whether the job has permanently failed or not.
When a failure has been detected, one can resolve it in a number of ways. Task checkpointing is one means of dealing with task failures, since the task state can be stored periodically either on the local disk or on a remote checkpointing server; in the event that a failure occurs, the application can be restarted from the last checkpoint. In combination with checkpointing, process migration can be
used to deal with CPU unavailability, or when a “better” host becomes available, by moving the process to another machine. The authors in (Araujo, Domingues, Kondo, Silva, 2008; Domingues, Araujo, Silva, 2006) develop a distributed checkpoint system where checkpoints are stored on peers in a P2P fashion using a DHT, or using a clique. Thus, when a failure occurs, a checkpoint can potentially be used to restart the computation on another node in a scalable way.
Another common solution for masking failures is replication. The authors in (Ghare & Leutenegger, 2004; Kondo et al., 2007; Sonnek et al., 2006) use probabilistic models to analyze various replication issues. The platform model used in (Ghare & Leutenegger, 2004) assumes that the resources are shared, task preemption is disallowed, and checkpointing is not supported. One application model was based on tightly-coupled applications, while the other was based on loosely-coupled applications, which consist of task-parallel components before each barrier synchronization. The authors then assume that the probability of task completion follows a geometric distribution. The work in (Leutenegger & Sun, 1993) examines analytically the costs of executing task-parallel applications in Desktop Grid environments. The model assumes that after a machine is unavailable for some fixed number of time units, at least one unit of work can be completed. Thus, the estimates for execution time are lower bounds. The assumption is restrictive, especially since the sizes of availability intervals can be correlated in time (Mutka & Livny, 1991); that is, a short availability interval (which would likely cause a task failure) will most likely be followed by another short availability interval.
In terms of proactively avoiding failures, the authors in (Andrzejak, Kondo, Anderson, 2008) use prediction methods to avoid resources likely to fail. They show the existence of long stretches of availability on certain Internet hosts and that such patterns can be modeled efficiently with basic classification algorithms. Simple and computationally cheap metrics are reliable indicators of predictability, and resources can be divided into high- and low-predictability groups based on such indicators. They thus show that a deployment of enterprise services in a pool of volatile resources is possible and incurs reasonable overheads.
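The fragment below sketches the checkpoint/restart idea in its simplest local form: a long-running loop periodically saves its state to disk and, after a simulated failure, resumes from the last saved iteration. The file name, checkpoint interval and pickled state layout are arbitrary choices made for this illustration.

```python
# Minimal local checkpoint/restart sketch; file name and interval are arbitrary.
import os
import pickle

CHECKPOINT = "task.ckpt"
INTERVAL = 100  # iterations between checkpoints

def load_state():
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "acc": 0}

def save_state(state):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

def run(total=1000, fail_at=None):
    state = load_state()
    for i in range(state["i"], total):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated host failure")
        state["acc"] += i
        state["i"] = i + 1
        if state["i"] % INTERVAL == 0:
            save_state(state)          # persist progress periodically
    os.remove(CHECKPOINT)              # computation finished: clean up
    return state["acc"]

try:
    run(fail_at=350)                   # first attempt dies mid-way
except RuntimeError:
    pass
print(run())                           # second attempt resumes from the last checkpoint
```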
Data Management
Despite the attractiveness of Desktop Grids, little work has been done to support data-intensive applications in this context of massively distributed, volatile, heterogeneous, and network-limited resources. Most Desktop Grid systems, like BOINC (Anderson, 2004), XtremWeb (Fedak et al., 2001), Condor (Litzkow et al., 1988) and OurGrid (Andrade et al., 2003), rely on a centralized architecture for indexing and distributing data, and thus potentially face issues with scalability and fault tolerance. Data management is therefore still a challenging issue.
Parameter-sweep applications, composed of a large set of independent tasks sharing large data, are the first class of applications which has driven a lot of effort in the area of data distribution. The authors in (Wei, Fedak, Cappello, 2005) have shown that using the collaborative data distribution protocol BitTorrent instead of FTP can improve the execution time of parameter-sweep applications. In contrast, it has also been observed that the BitTorrent protocol suffers a higher overhead than FTP when transferring small files. Thus, one must be able to select the correct distribution protocol according to the size of the file and the level of “sharability” of data among the task inputs. Recently, a similar approach has been followed in (Costa, Silva, Fedak, Kelley, 2008), where the BitTorrent protocol has been integrated within the BOINC platform.
This work confirms that the basic building blocks for Data Management components can be found in P2P systems. Recently, a subsystem dedicated to data management for Desktop Grids, named BitDew, has been proposed in (Fedak, He, Cappello, 2008). It could easily be integrated into systems like BOINC, OurGrid or XtremWeb. It offers programmers (or an automated agent that works on behalf of the user) a simple API for creating, accessing, storing and moving data with ease, even in highly dynamic and volatile environments. Research on DHTs (Distributed Hash Tables) (Stoica, Morris, Karger, Kaashoek, Balakrishnan, 2001; Maymounkov & Mazières, 2002; Rowstron & Druschel, 2001), collaborative data distribution (Cohen, 2003; Gkantsidis & Rodriguez, 2005; Fernandess & Malkhi, 2006), storage over volatile resources (Bolosky, Douceur, Ely, Theimer, 2000; Butt, Johnson, Zheng, Hu, 2004; Vazhkudai, Ma, Strickland, Tammineedi, Scott, 2005) and wide-area network storage (Bassi et al., 2002; Rhea et al., 2003) offers various tools that could be of interest for Data Grids. To build Data Grids from these tools and to utilize them effectively, one needs to bring the components together into a comprehensive framework. BitDew suits this purpose by providing an environment for data management and distribution in Desktop Grids.
Large data movement across wide-area networks can be costly in terms of performance, because bandwidth across the Internet is often limited, variable and unpredictable. Caching data on the local storage of the desktop PC (Iamnitchi, Doraimani, Garzoglio, 2006; Otoo, Rotem, Romosan, 2004; Vazhkudai et al., 2005), with adequate scheduling strategies (Santos-Neto, Cirne, Brasileiro, Lima, 2004; Wei et al., 2005) to minimize data transfers, can improve overall application execution performance. Long-running applications are challenging due to the volatility of the executing nodes. Completing such executions requires local or remote checkpoints to avoid losing the intermediate computational state when a failure occurs. In the context of Desktop Grids, these applications also have to cope with replication and sabotage. An idea proposed in (Kondo, Araujo, et al., 2006) is to compute a signature of the checkpoint images and use signature comparison to eliminate diverging executions. Thus, indexing data with their checksum, as commonly done by DHT and P2P software, permits basic sabotage tolerance even without retrieving the data.
BitDew leverages the use of metadata, a technique widely used in Data Grids (Jin, Xiong, Wu, Zou, 2006), but in a more directive style. It defines five different types of metadata: i) replication indicates how many occurrences of a datum should be available at the same time in the system, ii) fault tolerance controls the resilience of data in the presence of machine crashes, iii) lifetime is a duration, absolute or relative to the existence of other data, which indicates when a datum is obsolete, iv) affinity drives the movement of data according to dependency rules, v) transfer protocol gives the runtime environment hints about the file transfer protocol appropriate to distribute the data. Programmers tag each datum with these simple attributes and simply let the BitDew runtime environment manage the operations of data creation, deletion, movement and replication, as well as fault tolerance.
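As a purely illustrative analogue of this attribute-driven style, the sketch below tags data items with the five kinds of attributes listed above and lets a toy "runtime" decide how many copies to place and which protocol to use. The class layout and the placement rule are invented; this is not the BitDew API.

```python
# Toy illustration of attribute-driven data management in the style described
# above; the class and the placement rule are invented, not the BitDew API.
from dataclasses import dataclass

@dataclass
class DataAttributes:
    replication: int        # how many copies should exist at the same time
    fault_tolerance: bool   # should lost copies be re-created after a crash
    lifetime_s: int         # seconds after which the datum becomes obsolete
    affinity: str | None    # name of another datum this one should follow
    protocol: str           # hint: "ftp" for small files, "bittorrent" for large

def place(name, size_mb, attrs, hosts):
    # Pick the hinted protocol and spread the requested number of copies.
    copies = hosts[:attrs.replication]
    return {"data": name, "size_mb": size_mb,
            "protocol": attrs.protocol, "copies": copies}

hosts = ["node-a", "node-b", "node-c", "node-d"]
genome = DataAttributes(replication=3, fault_tolerance=True,
                        lifetime_s=86_400, affinity=None, protocol="bittorrent")
print(place("genome.db", 900, genome, hosts))
```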
Security Model
In this section we review the security models of several Desktop Grid systems.
The BOINC (Anderson, 2004) middleware is a popular Volunteer Computing system which makes it possible to aggregate huge computing power from thousands of Internet users. A key point is the asymmetry of its security model: there are few projects, well identified and belonging to established institutions (for example, the University of California, Berkeley for the SETI@Home project), while volunteers are numerous and anonymous. Of course the notion of user exists in BOINC, because volunteers need to receive a
reward for their contribution. However, the definition of a user is close to that of an avatar: it allows users to participate in forums and receive credits according to the computing time and power given to the project. Despite anonymity, the security model is based on trust: volunteers trust the project they are contributing to. The security mechanism is simple and based on asymmetric cryptography; it aims at enforcing the trust between volunteers and the project they participate in. At installation time, the owners of a project produce a public/private key pair and store those keys in a safe place, typically, as recommended on the BOINC web site, on a machine isolated from the network. When volunteers contribute to the project for the first time, they obtain the public key of the project. Project owners have to digitally sign the application files of the project, so that volunteers can verify that the binary codes downloaded by the BOINC client really belong to the project. This mechanism ensures that, if a pirate gains access to one of the BOINC servers, he will not be able to push malicious code to hundreds of thousands of users. While volunteers trust the projects, the reverse is not true. To protect against malicious users, BOINC implements a result certification mechanism (Sarmenta, 2002) based on redundant computation. BOINC gives project administrators the ability to write their own custom result certification code according to their application.
XtremWeb is an Internet Desktop Grid middleware which also enables public-resource computing. It differs from BOINC by the ability given to every participant to submit new applications and tasks to the system. XtremWeb is a P2P system in the sense that every participant can provide computing resources but also utilize other participants' computing resources. XtremWeb is organized as a three-tier architecture where clients consume resources, workers provide resources, and the coordinator is a central agent which manages the system by performing the scheduling and fault-tolerance tasks. Even if BOINC defines users in its implementation, they are anonymous and are only used to facilitate platform management. They cannot be trusted; only project owners can be trusted. In contrast, because everyone can submit applications in XtremWeb, there cannot be any form of trust between users, applications, results and even the coordinator itself. Thus the XtremWeb security model is based on autonomous mechanisms which aim at protecting each component of the platform from the other elements. For instance, to protect volunteers' computers from malicious code, a sandbox mechanism is used to isolate and monitor the running application and prevent it from damaging the volunteer's system. A public/private key mechanism is also used to authenticate the coordinator and prevent results from being uploaded to another coordinator.
The XGrid system, proposed by Apple, is a Desktop Grid designed to run in a local network environment. XGrid features ease of use and ease of deployment. To work, the XGrid system needs an XGrid server, which can be configured with or without a password. If the server runs without a password, then every user in the local environment can submit jobs and applications; otherwise, only those who can authenticate to the server are granted this authorization. Computing nodes in the XGrid system can accept jobs or not; this property is set on the computing nodes themselves.
Thus there is no real distinction between users, and there is no possibility for a user or a machine to accept or refuse other users' applications or work. While this solution is acceptable within a single organization (a lab or small company), it would not scale to a large Grid setup, which typically aims at making several institutions cooperate.
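To illustrate the signing scheme used between a BOINC-style project and its volunteers, the sketch below signs an application file with a private key and verifies it with the corresponding public key, using the third-party Python `cryptography` package. It is a generic RSA-PSS example, not BOINC's actual signing code or key format.

```python
# Generic sign/verify example in the spirit of the scheme described above;
# this uses the third-party "cryptography" package and is NOT BOINC's own code.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.exceptions import InvalidSignature

# Project side (done once, on an isolated machine): create the key pair.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

application_binary = b"... bytes of the project's application file ..."

# Project side: sign the application file before publishing it.
signature = private_key.sign(
    application_binary,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Volunteer side: verify the downloaded binary against the project's public key.
try:
    public_key.verify(
        signature,
        application_binary,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                    salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )
    print("signature valid: the binary really comes from the project")
except InvalidSignature:
    print("signature invalid: reject the binary")
```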
Figure 5. Bridging service Grid and desktop Grid, the superworker approach vs. the gliding-in approach
Bridging Service Grids and Desktop Grids
There exist two main approaches to bridging Service Grids and Desktop Grids (see Figure 5). In this section we present the principles of these two approaches and discuss them from a security perspective.
The Superworker Approach
A first solution, used by the Lattice project (Myers, Bazinet, Cummings, 2008) and the SZTAKI Desktop Grid (Balaton et al., 2007), is to build a superworker which enables several Grid or cluster resources to compute for a Desktop Grid. The superworker is a bridge implemented as a daemon between the Desktop Grid server and the Service Grid resources. From the Desktop Grid server point of view, the Grid or cluster appears as one single resource with large computing capabilities. The superworker continuously fetches tasks or work units from the Desktop Grid server, wraps the tasks and submits them to the local Grid or cluster resource manager. When the computations are finished on the SG computing nodes, the superworker sends the results back to the Desktop Grid server. Thus, the superworker is itself a scheduler, which needs to continuously scan the queues of the computing resources and watch for available resources to launch jobs.
Since the superworker is a centralized agent, this solution has several drawbacks: i) the superworker can become a bottleneck when the number of computing resources increases, ii) the round trip of a work unit is increased because it has to be marshalled/unmarshalled by the superworker, iii) it introduces a single point of failure, which lowers the fault tolerance of the system. On the other hand, this centralized solution provides better security properties concerning the integration with the Grid. First, the superworker does not require modification of the infrastructure; it can be run under any user identity as long as that user has the right to submit jobs on the Grid. Next, as work units are wrapped by the superworker, they are run under the user's identity, which conforms to regular security usage, in contrast with the approach described in the next section.
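A superworker is essentially a bridging loop between two schedulers. The Python skeleton below shows that loop; every function it calls (fetch_work_units, submit_to_cluster, collect_finished, send_result) is a placeholder for a Desktop Grid server API or a local batch-system command, not a real interface.

```python
# Skeleton of a superworker bridging loop; every function called here is a
# placeholder standing in for a real Desktop Grid server API or batch system.
import time

def fetch_work_units(dg_server, max_units):
    """Placeholder: pull pending work units from the Desktop Grid server."""
    return []

def submit_to_cluster(batch_system, work_unit):
    """Placeholder: wrap the work unit and submit it to the local resource manager."""
    return f"job-for-{work_unit}"

def collect_finished(batch_system):
    """Placeholder: return (work_unit, result) pairs for completed cluster jobs."""
    return []

def send_result(dg_server, work_unit, result):
    """Placeholder: push a result back to the Desktop Grid server."""

def superworker_loop(dg_server, batch_system, free_slots=10, poll_s=30):
    while True:
        # 1. Fetch as many work units as the cluster currently has room for.
        for wu in fetch_work_units(dg_server, free_slots):
            submit_to_cluster(batch_system, wu)
        # 2. Return results of finished cluster jobs to the Desktop Grid server.
        for wu, result in collect_finished(batch_system):
            send_result(dg_server, wu, result)
        time.sleep(poll_s)   # 3. Poll again after a while.
```

The polling structure also makes the drawbacks listed above visible: all work units and results funnel through this single loop.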
The Gliding-In Approach
The gliding-in approach, which gathers cluster resources spread across different Condor pools using the Global Computing system XtremWeb, was first introduced in (Lodygensky et al., 2003). The main principle consists in wrapping the XtremWeb worker as a regular Condor task and submitting this task to the Condor pool. Once the worker executes on a Condor resource, it pulls jobs from the Desktop Grid server, executes the XtremWeb task and returns the result to the XtremWeb server. As a consequence, the Condor resources communicate directly with the XtremWeb server. Similar mechanisms are now commonly employed in Grid computing (Thain & Livny, 2004). For example, DIRAC (Tsaregorodtsev, Garonne, Stokes-Rees, 2004) uses a combination of push/pull mechanisms to execute jobs on several Grid clusters. The generic approach on the Grid is called a pilot job. Instead of submitting jobs directly to the Grid's gatekeeper, this system submits so-called pilot jobs. When executed, the pilot job fetches jobs from an external job scheduler.
The gliding-in or pilot job approach has several advantages. While simple, this mechanism efficiently balances the load between heterogeneous computing sites. It benefits from the fault tolerance provided by the Desktop Grid server: if Grid nodes fail, jobs get rescheduled to the next available resources. Finally, as the performance study of the Falkon (Raicu, Zhao, Dumitrescu, Foster, Wilde, 2007) system shows, it gives better performance, because series of jobs do not have to go through the gatekeeper queues, which are generally characterized by long waiting times, and communications are direct between the computing element and the Desktop Grid server, without an intermediate agent such as the superworker. From the security point of view, this approach breaks the Grid security rules, because the job owner may differ from the pilot job owner. This is a well-known issue with pilot jobs, and new solutions such as gLExec (Sfiligoi et al., 2007) have been proposed to circumvent this security hole.
Result Certification
Result certification in Desktop Grids is essential for several reasons. First, malicious users can report erroneous results. Second, hosts can unintentionally report erroneous results because of viruses that corrupt the system or because of hardware problems (for example, overheating of the CPU). Third, differences in system or hardware configuration can lead to different computational results. We discuss three of the most common state-of-the-art methods (Sarmenta, 2002; Zhao & Lo, 2001; Taufer, Anderson, Cicotti, 2005) for result certification, namely spot-checking, majority voting, and credibility-based techniques, and emphasize the issues related to each method.
The majority voting method detects erroneous results by sending identical workunits to multiple workers. After the results are retrieved, the result that appears most often is assumed to be correct. In (Sarmenta, 2002), the author determines the amount of redundancy for majority voting needed to achieve a bound on the frequency of voting errors, given the probability that a worker returns an erroneous result. Let the error rate φ be the probability that a worker is erroneous and returns an erroneous result, and let ε be the percentage of final results (after voting) that are incorrect. Let m be the number of identical results out of 2m−1 required before a vote is considered complete and a result is decided upon. Then the probability of an incorrect result being accepted after a majority
vote is given by:

majv(φ, m) = Σ_{j=m}^{2m−1} C(2m−1, j) · φ^j · (1 − φ)^(2m−1−j)     (1)

where C(2m−1, j) denotes the binomial coefficient.
The redundancy of majority voting is m/(1 − φ). The main issues for majority voting are the following. First, the error bound assumes that error rates are not correlated among hosts. Second, majority voting is most effective when error rates are relatively low (≤1%); otherwise the required redundancy could be too high.
A more efficient method for error detection is spot-checking, whereby a workunit with a known correct result is distributed at random to workers. The workers' results are then compared to the previously computed and verified result. Any discrepancy causes the corresponding worker to be blacklisted, i.e., any past or future results returned from the erroneous host are discarded (perhaps unknowingly to the host). Erroneous workunit computation was modelled as a Bernoulli process (Sarmenta, 2002) to determine the error rate of spot-checking, given the portion of work contributed by the host and the rate at which incorrect results are returned. The model uses a work pool that is divided into equally sized batches. Assuming the model excludes coordinated attacks, let q be the frequency of spot-checking, let n be the amount of work contributed by the erroneous worker, let f be the fraction of hosts that commit at least one error, and let s be the error rate per erroneous host. Then (1 − qs)^n is the probability that an erroneous host is not discovered after processing n workunits. The rate at which spot-checking with blacklisting will fail to catch bad results is given by:

scbl(q, n, f, s) = [s · f · (1 − qs)^n] / [(1 − f) + f · (1 − qs)^n]     (2)
The amount of redundancy of spot-checking is given by 1/(1 − q). There are several critical issues related to spot-checking with blacklisting. First, it assumes that blacklisting will effectively remove erroneous hosts, in spite of the possibility of hosts registering with new identities or high host churn, as shown by (Anderson & Fedak, 2006). Without blacklisting, the upper bound on the error rate is much higher and does not decrease inversely with n. Second, spot-checking is effective only if error rates are consistent over time. Third, spot-checking is most
effective when error rates are high (>1%); otherwise, the number of workunits to be computed per worker, n, must be extremely high.
To address the potential weaknesses of majority voting and spot-checking, credibility-based systems were proposed (Sarmenta, 2002), which use the conditional probabilities of errors given the history of host result correctness. The idea is based on the assumption that hosts that have computed many results with relatively few errors have a higher probability of errorless computation than hosts with a history of returning erroneous results. Workunits are assigned to hosts such that more attention is given to the workunits distributed to higher-risk hosts. To determine the credibility of each host, any error detection method, such as majority voting, spot-checking, or various combinations of the two, can be used. The credibilities are then used to compute the conditional probability of a result's correctness.
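The two error-rate expressions above are easy to evaluate numerically. The short sketch below implements equations (1) and (2) directly and prints the resulting error rates for a few parameter values; the sample values of φ, m, q, n, f and s are arbitrary.

```python
# Direct evaluation of equations (1) and (2); parameter values are arbitrary samples.
from math import comb

def majv(phi, m):
    # Eq. (1): probability that majority voting accepts an incorrect result.
    return sum(comb(2 * m - 1, j) * phi**j * (1 - phi) ** (2 * m - 1 - j)
               for j in range(m, 2 * m))

def scbl(q, n, f, s):
    # Eq. (2): error rate of spot-checking with blacklisting.
    survive = (1 - q * s) ** n          # erroneous host escapes n spot-checks
    return s * f * survive / ((1 - f) + f * survive)

print(majv(phi=0.01, m=2))              # 2-out-of-3 voting with a 1% error rate
print(majv(phi=0.01, m=3))              # 3-out-of-5 voting: errors drop further
print(scbl(q=0.1, n=50, f=0.05, s=0.2)) # spot-checking 10% of the work
```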
RESEARCH AND EXPLORATION TOOLS
Platform Observations
Prior to improving the algorithms of existing platforms and simulating them, it is necessary to observe the real behavior of existing software on the Internet. First, in (Kondo, Taufer, Brooks, Casanova, Chien, 2004; Kondo, Fedak, Cappello, Chien, Casanova, 2006), hundreds of desktop PCs have been measured and characterized at the University of California at San Diego and the University of Paris-Sud. In (Anderson & Fedak, 2006), the authors measured aggregate statistics gathered through BOINC. A limitation of this work is that the measurements do not describe the temporal structure of the availability of individual resources. Recently, the XtremLab11 (Malécot, Kondo, Fedak, 2006) project, running on BOINC, has collected over 15 months of traces from 15,000 hosts. It runs active measurement software that gives the exact amount of computing power obtained by the project over time for each node. After minimal processing, these traces are used by simulators.
DG Simulation: SimBOINC
There are several challenges for desktop grid simulation. First, one needs ways for abstracting failures and handling them. In simulation toolkits, one needs a way to specify the type of failure (permanent or transient) and how the system reacts to the failures (for example, restart after the system becomes available again). Simulators, such as SimGrid, are beginning to use exception handling for cleanly dealing with failures. Second, one needs to be able to deal with scaling issues. In some respects, building a trace-driven simulator using 50,000 resources is trivial when resources are not shared and they are interconnected with trivial network models. However, when resources are shared by a number of competing entities, issues of scale arise because one must recompute the allocation of the resource for each entity whenever the resource state changes. Third, as desktop grids and volunteer computing systems are invariably distributed over wide-area networks, one needs accurate network models that scale to hundreds and thousands of resources. The open issue is to get the speed of flow-based network models and at the same time the accuracy of packet-level simulation. Below we describe one recent approach for desktop grid simulation.
SimBOINC is a simulator for heterogeneous and volatile desktop grids and volunteer computing systems. The goal of this project is to provide a simulator by which to test new scheduling strategies in BOINC, and other desktop and volunteer systems, in general. SimBOINC is based on the SimGrid simulation toolkit for simulating distributed and parallel systems, and uses SimGrid (Casanova, Legrand, Quinson) to simulate BOINC (in particular, the client CPU scheduler, and eventually the work fetch policy) by implementing a number of required functionalities.
Simulator Overview. SimBOINC simulates a client-server platform where multiple clients request work from a central server. In particular, we have implemented a client class that is based on the BOINC client and uses (almost exactly) the client's CPU scheduler source code. The characteristics of the client (for example, speed, project resource shares, and availability), of the workload (for example, the projects, the size of each task, and the checkpoint frequency), and of the network connecting the client and server (for example, bandwidth and latency) can all be specified as simulation inputs. With those inputs, the simulator will execute and produce an output file that gives the values of a number of scheduler performance metrics, such as effective resource shares and task deadline misses.
The current simulator can simulate a single client that downloads workunits from multiple projects and uses its CPU scheduler to decide when to schedule each workunit. The server in SimBOINC is different from the typical BOINC server in that there is one server for multiple projects, and so requests for work from multiple projects are channeled to a single server. The server consists of a request_handler that basically uses the work_req_seconds and project_id parameters sent in the scheduler_request to determine the amount of work from a specific project to send to a client.
We understand that for testing new work-fetch policies and CPU schedulers, only a single client that downloads work for multiple projects is needed. But we wanted SimBOINC to be a general-purpose volunteer computing simulator that could simulate new uses of BOINC by different kinds of applications. For example, people should be able to use SimBOINC to simulate the scheduling of low-latency jobs or to simulate large peer-to-peer file distribution; in both these cases, simulating multiple clients would be essential.
Execution. SimBOINC expects the following inputs in the form of XML files:
• Platform file: specifies the hosts in the platform and the network connecting the hosts.
• Host availability trace files: specified within the platform file.
• Workload file: specifies the jobs, i.e., projects, to be executed on the clients.
• Client states file: specifies the configuration of the BOINC clients.
• Simulator file: specifies the configuration of the specific simulator execution.
The platform file is where one describes the computing and network resources on which the BOINC client and server run. In particular, SimBOINC expects a set of CPU resources and a set of network links that connect those resources. For each resource, one can specify a set of attributes. For
example, for CPU resources, one can specify the power and the corresponding availability trace files; for network resources, one can specify their bandwidth and latency. The workload file specifies the projects to be executed over the BOINC platform. In particular, it specifies, for each project, the name, the total number of tasks to execute, the task size in terms of computation, the task size in terms of communication, the checkpoint frequency for each task, and the delay_bound and rsc_fpops_est BOINC task attributes.
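To illustrate the shape of these inputs, the following C++ sketch models one project entry of the workload file with the attributes just listed; the field names are illustrative and do not reproduce the exact XML schema used by the simulator.

#include <string>

// Illustrative in-memory representation of one project entry of the workload
// file; the real XML element and attribute names may differ.
struct ProjectWorkload {
    std::string name;                // project name
    int         total_tasks;         // total number of tasks to execute
    double      task_flops;          // task size in terms of computation
    double      task_bytes;          // task size in terms of communication
    double      checkpoint_period_s; // checkpoint frequency for each task
    double      delay_bound_s;       // BOINC delay_bound task attribute
    double      rsc_fpops_est;       // BOINC rsc_fpops_est task attribute
};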
Client States File. The client states input file is based on the client states format exported by the BOINC client to store persistent state. The idea is that the client states files could be collected and assembled to produce a client_states input file to SimBOINC, which would allow the simulation of BOINC clients using realistic settings.
Simulation File. The simulation input file specifies the type of simulation to be conducted (e.g., BOINC), the maximum simulation time after which the simulation is terminated, and the output file name.
Using Availability Traces. In SimGrid, the availability of network and CPU resources can be specified through traces. For CPU resources, one specifies a CPU availability file that gives the availability of the CPU as a percentage over time. One also specifies a failure file that indicates when the CPU fails; a CPU is considered to have failed when it is no longer available for computation. In SimGrid, a CPU failure causes all processes running on that CPU to terminate. In BOINC, at least three things can cause an executing task to fail. First, the task can be preempted by the BOINC client because of the client scheduling policy. Second, the task can be preempted by the BOINC client because of user activity, according to the user's preferences. Third, the host can fail (for example, due to a machine crash or shutdown). In SimBOINC, the host failures specified in the CPU trace files represent failures resulting from the latter two causes. That is, when a CPU fails as specified in the traces, all processes on the CPU terminate. However, their state is maintained and persists through the failure, so that when the host becomes available again the processes are restarted in the same state: the tasks that had been executing before the failure are restarted from the last checkpoint, and the client state data structure is the same as before the failure.
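The restart-from-checkpoint behaviour can be pictured with the following small C++ sketch, which advances a task through periods of availability and rolls its progress back to the last checkpoint when a failure interval from the trace interrupts it. It is an illustration of the mechanism only; the structures and the trace handling are assumptions, not simulator code.

#include <iostream>

struct Task {
    double total_work;        // seconds of CPU time needed
    double done_work;         // progress so far
    double checkpointed;      // progress saved at the last checkpoint
    double checkpoint_period; // how often the task checkpoints
};

// Advance a task by 'dt' seconds of available CPU time, taking every
// checkpoint that falls inside this interval.
void advance(Task& t, double dt) {
    t.done_work += dt;
    while (t.checkpointed + t.checkpoint_period <= t.done_work)
        t.checkpointed += t.checkpoint_period;
}

// A host failure terminates the process; on restart, progress resumes
// from the last checkpoint.
void fail(Task& t) { t.done_work = t.checkpointed; }

int main() {
    Task t{100.0, 0.0, 0.0, 10.0};
    advance(t, 37.0);   // 37 s of computation, checkpoints at 10, 20, 30
    fail(t);            // host fails: progress rolls back to 30 s
    advance(t, 70.0);   // host available again, the task finishes
    std::cout << "done=" << t.done_work << " of " << t.total_work << "\n";
}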
Logging. SimBOINC uses the XBT logging facility provided by SimGrid, which is similar in spirit to log4j (and, in turn, log4cxx, etc.). It allows runtime configuration of message output and of the level of detail; however, it does not yet support appenders. We chose XBT instead of BOINC's message logger because XBT is integrated with SimGrid and, as such, can show more informative messages by default (such as the name of the process and the simulation time).
Simulator Output and Performance Metrics. The simulator output file must be specified in the simulation input file. The simulator then writes the following metrics to that file in XML, for each client and for each project that the client participates in:
• the total number of tasks completed;
• the resource share and the effective resource share, the latter calculated from the CPU time of each completed task relative to the total;
• the number and percentage of report deadlines missed for completed tasks;
• the number and percentage of report deadlines met for completed tasks.
Also, for each CPU specified in the platform.xml file, the simulator outputs a corresponding .trace file, which records information about the execution of tasks on that CPU. In particular, the trace file gives, in its columns, the simulation time, the task name, the event (START, COMPLETED, CANCELLED, or FAILED), the CPU name, and the completion time when applicable.
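As an illustration of how such metrics can be derived from per-task records, the C++ sketch below computes, for one client, the effective resource share of each project from accumulated CPU time, together with the percentage of missed deadlines; the record layout is an assumption and not the simulator's actual output format.

#include <map>
#include <string>
#include <vector>

struct CompletedTask {
    std::string project;
    double cpu_seconds;      // CPU time consumed on this client
    bool   missed_deadline;
};

struct ProjectMetrics {
    int    tasks = 0, misses = 0;
    double cpu = 0.0, effective_share = 0.0, miss_pct = 0.0;
};

std::map<std::string, ProjectMetrics>
compute_metrics(const std::vector<CompletedTask>& tasks) {
    std::map<std::string, ProjectMetrics> m;
    double total_cpu = 0.0;
    for (const auto& t : tasks) {
        auto& p = m[t.project];
        ++p.tasks;
        p.cpu += t.cpu_seconds;
        if (t.missed_deadline) ++p.misses;
        total_cpu += t.cpu_seconds;
    }
    // Effective share: fraction of total CPU time obtained by each project.
    for (auto& entry : m) {
        ProjectMetrics& p = entry.second;
        p.effective_share = total_cpu > 0 ? p.cpu / total_cpu : 0.0;
        p.miss_pct = p.tasks > 0 ? 100.0 * p.misses / p.tasks : 0.0;
    }
    return m;
}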
Use of SimGrid. We chose to implement the BOINC simulator using SimGrid for a number of reasons. First, SimGrid provides a number of abstractions and tools that simplify the simulation of complex parallel and distributed systems; for example, it provides abstractions for processes, computing elements, and network links. These abstractions and tools greatly simplified the implementation of the BOINC simulator. Second, we can leverage the proven accuracy of SimGrid's resource models; for example, SimGrid models the allocation of network bandwidth among competing data transfers using a flow-based TCP model that has been shown to be reasonably accurate. Third, SimGrid is implemented in C, and using it with BOINC's C++ source code is straightforward.
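A flow-based network model of this kind shares each link's capacity among the transfers crossing it instead of simulating individual packets. The sketch below computes a simple max-min fair allocation for a single shared link; it only illustrates the idea and is not SimGrid's actual model, which additionally accounts for TCP behaviour and multi-link routes.

#include <iostream>
#include <vector>

// Max-min fair sharing of one link's capacity among flows with given demands.
std::vector<double> maxmin_share(double capacity, std::vector<double> demand) {
    std::vector<double> alloc(demand.size(), 0.0);
    std::vector<size_t> active(demand.size());
    for (size_t i = 0; i < demand.size(); ++i) active[i] = i;

    while (!active.empty() && capacity > 1e-12) {
        double fair = capacity / active.size();
        std::vector<size_t> still_active;
        for (size_t i : active) {
            if (demand[i] <= fair) {       // flow is satisfied below the fair share
                alloc[i] = demand[i];
                capacity -= demand[i];
            } else {
                still_active.push_back(i);
            }
        }
        if (still_active.size() == active.size()) {  // all remaining flows are bottlenecked
            for (size_t i : still_active) alloc[i] = fair;
            break;
        }
        active.swap(still_active);
    }
    return alloc;
}

int main() {
    // Three competing transfers on a 10 MB/s link with demands 2, 8 and 8 MB/s.
    for (double a : maxmin_share(10.0, {2.0, 8.0, 8.0}))
        std::cout << a << " ";   // prints: 2 4 4
    std::cout << "\n";
}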
APPLICATIONS
Bag of Tasks Applications
Applications composed of a set of independent tasks are the most common class of application that one can execute on a Desktop Grid. This class of application is straightforward to schedule and simple to execute when there is little I/O, yet it is very popular and is used in many scientific domains. In particular, it permits multi-parametric studies, in which one application, typically a simulation code, is run against a large set of parameters in order to explore a range of possible solutions.
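Such a multi-parametric study boils down to generating one independent task per parameter combination. The following C++ sketch, with purely illustrative names, enumerates a two-dimensional parameter grid into a list of tasks that a desktop grid scheduler could then distribute.

#include <string>
#include <vector>

struct Task {
    std::string command;   // simulation binary plus its parameters
};

// Build one independent task per (alpha, beta) combination of the sweep.
std::vector<Task> build_sweep(const std::vector<double>& alphas,
                              const std::vector<double>& betas) {
    std::vector<Task> tasks;
    for (double a : alphas)
        for (double b : betas)
            tasks.push_back({"./simulate --alpha=" + std::to_string(a) +
                             " --beta=" + std::to_string(b)});
    return tasks;
}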
Data Intensive
Enabling Data Grids is one of the fundamental efforts of the computational science community, as emphasized by projects such as EGEE (Enabling Grids for E-Science in Europe) and PPDG (2006). This effort is driven by the new requirements of e-Science: large communities of researchers collaborate to extract knowledge and information from huge amounts of scientific data. This has led to the emergence of a new class of application, called data-intensive applications, which require secure and coordinated access to large datasets, wide-area transfers, and broad distribution of terabytes of data while keeping track of multiple data replicas. The Data Grid aims at providing such an infrastructure and services to enable data-intensive applications. Despite the attractiveness of Desktop Grids, little work has
been done to support data-intensive applications in this context of massively distributed, volatile, shared and heterogeneous resources. Most Desktop Grid systems, like BOINC (Anderson, 2004), XtremWeb (Fedak et al., 2001) and OurGrid (Andrade et al., 2003), rely on a centralized architecture for indexing and distributing the data, and thus potentially face issues with scalability and fault-tolerance. Large data movement across wide-area networks can be costly in terms of performance because bandwidth across the Internet is often limited, variable and unpredictable. Caching data on local workstation storage (Iamnitchi et al., 2006; Otoo et al., 2004; Vazhkudai et al., 2005) with adequate scheduling strategies (Santos-Neto et al., 2004; Wei et al., 2005) to minimize data transfers can improve overall application execution time. Implementing even a simple execution principle like "owner computes" still requires the system to efficiently locate data and to provide a model for the cost of moving data. Moreover, accurate modeling (Qiu & Srikant, 2004) and forecasting of P2P communication is still a challenging and open issue, and it will be required before one can efficiently execute more demanding types of applications, such as those that require real-time or stream processing.
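The owner-computes idea can be approximated by a scheduler that prefers hosts already holding a replica of the input data and otherwise charges an estimated transfer cost. The C++ sketch below ranks candidate hosts with such a toy cost model; the structures and names are assumptions for illustration and are not taken from any of the systems cited above.

#include <limits>
#include <string>
#include <unordered_set>
#include <vector>

struct Host {
    std::string name;
    double flops;                             // compute speed
    double bandwidth;                         // bytes/s from the data source
    std::unordered_set<std::string> cached;   // datasets already present locally
};

// Estimated completion time: transfer cost (zero if the data is cached) plus compute time.
double estimate(const Host& h, const std::string& dataset,
                double data_bytes, double task_flops) {
    double transfer = h.cached.count(dataset) ? 0.0 : data_bytes / h.bandwidth;
    return transfer + task_flops / h.flops;
}

// Pick the host with the smallest estimated completion time for this task.
const Host* pick_host(const std::vector<Host>& hosts, const std::string& dataset,
                      double data_bytes, double task_flops) {
    const Host* best = nullptr;
    double best_t = std::numeric_limits<double>::max();
    for (const auto& h : hosts) {
        double t = estimate(h, dataset, data_bytes, task_flops);
        if (t < best_t) { best_t = t; best = &h; }
    }
    return best;
}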
Long Running Applications
Long-running applications are challenging because of the volatility of the executing nodes. Completing them typically requires local or remote checkpointing services, so that their computational state is not lost when a failure occurs.
Real-Time Applications
In this section, we focus on enabling soft real-time applications to execute on enterprise desktop Grids; soft real-time applications often have a deadline associated with each task but can afford to miss some of these deadlines. A number of soft real-time applications, ranging from information processing for sensor networks (Sensor Networks) and real-time video encoding (Rodriguez, Gonzalez, & Malumbres) to interactive scientific visualization (Lopez et al., 1999; Smallen, Casanova, & Berman, 2001), could potentially utilize desktop Grids. An example of such an application with soft real-time requirements is on-line parallel tomography (Smallen et al., 2001). Tomography is the construction of 3-D models from 2-D projections, and it is common in electron microscopy to use tomography to create 3-D images of biological specimens. On-line parallel tomography applications are embarrassingly parallel, as each 2-D projection can be decomposed into independent slices that are distributed to a set of resources for processing. Each slice is on the order of kilobytes or megabytes in size, and there are typically hundreds or thousands of slices per projection, depending on the size of the projection. Ideally, the processing of a single projection can be done while the user is acquiring the next image from the microscope, which typically takes several minutes (Hsu, 2005). As such, on-line parallel tomography could potentially be executed on desktop Grids if there were an effective method for meeting the application's relatively stringent time demands.
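A common baseline for such soft real-time workloads is to dispatch work in earliest-deadline-first order to the fastest currently available hosts. The sketch below orders tomography slices in that way; it is a generic heuristic for illustration, not the scheduling policy of any system cited above.

#include <algorithm>
#include <utility>
#include <vector>

struct Slice { int id; double deadline; double flops; };
struct Host  { int id; double flops_per_s; };

// Assign slices in earliest-deadline-first order to the fastest free hosts.
// Returns (slice id, host id) pairs; slices that cannot meet their deadline
// on any remaining host are simply skipped here.
std::vector<std::pair<int, int>>
edf_assign(std::vector<Slice> slices, std::vector<Host> hosts, double now) {
    std::sort(slices.begin(), slices.end(),
              [](const Slice& a, const Slice& b) { return a.deadline < b.deadline; });
    std::sort(hosts.begin(), hosts.end(),
              [](const Host& a, const Host& b) { return a.flops_per_s > b.flops_per_s; });
    std::vector<std::pair<int, int>> plan;
    size_t h = 0;
    for (const auto& s : slices) {
        if (h >= hosts.size()) break;
        if (now + s.flops / hosts[h].flops_per_s <= s.deadline)
            plan.emplace_back(s.id, hosts[h++].id);
    }
    return plan;
}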
Network-Intensive Applications
A few desktop Grid applications are neither CPU- nor data-intensive; they use other resources available on the compute node. Their execution time is limited not by processing speed, the amount of available memory, or communication times, but by the availability of these other resources.
The network is one of these resources. Malicious distributed applications (zombie PCs) use it to send huge amounts of data, for example spam or distributed attacks targeting a given host. But the network may also be useful for web spiders. For example, YaCy is a P2P-based search engine: on each volunteer resource, a web crawler collects data from the web, which is locally indexed and stored, and a local client retrieves search results from other computing nodes through a DHT. Such tasks often require a special scheduling policy from the desktop Grid because the usual criteria cannot be applied. For example, BOINC has support for non-CPU-intensive tasks (a special mode that applies to a whole project), but some limitations are imposed. First, the client does not maintain a cache of tasks to run; there is only one task present on the client at a given time, because BOINC cannot estimate completion time by measuring CPU usage as it does for normal projects. Second, non-CPU-intensive applications have to restrict their CPU usage to the minimum, because other CPU-intensive tasks are running at the same time: BOINC does not mix scheduling policies.
CONCLUSION
Throughout this chapter, we have presented a historical review of Desktop Grid systems as well as the state of the art of scientific research and the most recent technological innovations. The history of Desktop Grid systems started in the late 1990s with simple computational applications featuring trivial and massive parallelism. Systems were based on common, rough-and-ready technologies, such as Web servers with server-side scripts and Java applets. Despite, or because of, this seeming architectural simplicity, these systems grew rapidly to rank among the largest distributed applications. In the early 2000s, the challenge of gathering TeraFlops from volunteers' PCs was met, attracting the attention of the mainstream media, and several high-tech companies were built up to sell services and commercial systems. In some sense, Desktop Grid systems turned out to be the most successful of the Grid applications in popularizing and democratizing Grids for people at large. During the first decade of research on Desktop Grid systems, a huge effort was made to turn this paradigm into a common facility usable for a broad range of scientific and industrial applications. This effort has led to an impressive set of innovations which have improved Desktop Grid systems in terms of reliability (for instance, fault-tolerant communication libraries and distributed checkpointing), data management (use of P2P protocols to distribute and manage data), security (result certification, sandboxing), and performance (new classes of scheduling heuristics based on replication and on the evaluation of host availability). What are the perspectives for Desktop Grid systems? The singularity of DG systems is where they are located: at the frontier between Grid systems and the Internet. As DG systems become more efficient and more reliable, they will be incorporated more deeply into Grid systems. On the one hand, this will enable more scientists to benefit from the technology; on the other hand, the price will be increased complexity in terms of management. As such, the future of DG systems will certainly follow the evolution of the Internet towards more user-provided content, social networks, distributed intelligence, and so on. Desktop Grid computing may also have a role to play in the context of Cloud computing. Currently, the service infrastructure envisioned for Clouds is built from large-scale data centers. However, as for P2P systems, an approach to Cloud computing based on communities of users sharing resources for free may counterbalance the current trend toward commercial service infrastructures. Of course, using
Desktop Grids as an underlying technology and infrastructure for Cloud computing raises many research issues and opens exciting perspectives for Desktop Grids.
REFERENCES Abdennadher, N., & Boesch, R. (2006, August). A scheduling algorithm for high performance peer-topeer platform. In W. Lehner, N. Meyer, A. Streit, & C. Stewart (Eds.), Coregrid Workshop, Euro-Par 2006 (p. 126-137). Dresden, Germany: Springer. Alexandrov, A. D., Ibel, M., Schauser, K. E., & Scheiman, C. (1997, April). SuperWeb: Towards a global web-based parallel computing infrastructure. In Proceedings of the 11th IEEE International Parallel Processing Symposium (IPPS). Anderson, D. (2004). BOINC: A system for public-resource computing and storage. In Proceedings of the 5th IEEE/ACM International Grid Workshop, Pittsburgh, PA. Anderson, D., & Fedak, G. (2006). The computational and storage potential of volunteer computing. In Proceedings of The IEEE International Symposium on Cluster Computing and The Grid (CCGRID’06). Anderson, D. P., Cobb, J., Korpela, E., Lebofsky, M., & Werthimer, D. (2002, November). Seti@ home: An experiment in public-resource computing. Communications of the ACM, 45(11), 56–61. doi:10.1145/581571.581573 Andrade, N., Cirne, W., Brasileiro, F., & Roisenberg, P. (2003, June). OurGrid: An approach to easily assemble grids with equitable resource sharing. In Proceedings of the 9th Workshop on Job Scheduling Strategies for Parallel Processing. Andrzejak, A., Domingues, P., & Silva, L. (2006). Predicting Machine Availabilities in Desktop Pools. In IEEE/IFIP Network Operations and Management Symposium (pp. 225–234). Andrzejak, A., Kondo, D., & Anderson, D. P. (2008). Ensuring collective availability in volatile resource pools via forecasting. In 19th Ifip/Ieee Distributed Systems: Operations And Management (DSOM 2008). Samos Island, Greece. Araujo, F., Domingues, P., Kondo, D., & Silva, L. M. (2008, April). Using cliques of nodes to store desktop grid checkpoints. In Coregrid Integration Workshop, Crete, Greece. Balaton, Z., Gombas, G., Kacsuk, P., Kornafeld, A., Kovacs, J., & Marosi, A. C. (2007, March 26-30). Sztaki desktop grid: a modular and scalable way of building large computing grids. In Proceedings of the 21st International Parallel And Distributed Processing Symposium, Long Beach, CA. Baldassari, J., Finkel, D., & Toth, D. (2006, November 13-15). Slinc: A framework for volunteer computing. In Proceedings of the 18th Iasted International Conference On Parallel And Distributed Computing And Systems (PDCS 2006). Dallas, TX. Barak, A., Guday, S., & R., W. (1993). The MOSIX Distributed Operating System, Load Balancing for UNIX (Vol. 672). Berlin: Springer-Verlag.
Baratloo, A., Karaul, M., Kedem, Z., & Wyckoff, P. (1996). Charlotte: Metacomputing on the Web. In Proceeidngs of the 9th International Conference On Parallel And Distributed Computing Systems (PDCS-96). Bassi, A., Beck, M., Fagg, G., Moore, T., Plank, J. S., & Swany, M. (2002). The Internet BackPlane Protocol: A Study in Resource Sharing. In Second ieee/acm international symposium on cluster computing and the grid, Berlin, Germany. Berman, F., Wolski, R., Figueira, S., Schopf, J., & Shao, G. (1996). Application-Level Scheduling on Distributed Heterogeneous Networks. In Proc. of supercomputing’96, Pittsburgh, PA. Bhatt, S. N., Chung, F. R. K., Leighton, F. T., & Rosenberg, A. L. (1997). An optimal strategies for cycle-stealing in networks of workstations. IEEE Transactions on Computers, 46(5), 545–557. doi:10.1109/12.589220 Bolosky, W., Douceur, J., Ely, D., & Theimer, M. (2000). Feasibility of a Serverless Distributed file System Deployed on an Existing Set of Desktop PCs. In Proceedings of sigmetrics. Brecht, T., Sandhu, H., Shan, M., & Talbot, J. (1996). Paraweb: towards world-wide supercomputing. In Ew 7: Proceedings of the 7th workshop on acm sigops european workshop (pp. 181–188). New York: ACM. Butt, A. R., Johnson, T. A., Zheng, Y., & Hu, Y. C. (2004). Kosha: A Peer-to-Peer Enhancement for the Network File System. In Proceeding of International Symposium On Supercomputing SC’04. Camiel, N., London, S., Nisan, N., & Regev, O. (1997, April). The PopCorn Project: Distributed computation over the Internet in Java. In Proceedings of the 6th international world wide web conference. Cappello, F., Djilali, S., Fedak, G., Herault, T., Magniette, F., & Néri, V. (2004). Computing on large scale distributed systems: Xtremweb architecture, programming models, security, tests and convergence with grid. Future Generation Computer Science (FGCS). Cappello, P., Christiansen, B., Ionescu, M., Neary, M., Schauser, K., & Wu, D. (1997). Javelin: InternetBased Parallel Computing Using Java. In Proceedings of the sixth acm sigplan symposium on principles and practice of parallel programming. Casanova, H., Legrand, A., & Quinson, M. SimGrid: a Generic Framework for Large-Scale Distributed Experimentations. In Proceedings of the 10th ieee international conference on computer modelling and simulation (uksim/eurosim’08). Casanova, H., Legrand, A., Zagorodnov, D., & Berman, F. (2000, May). Heuristics for Scheduling Parameter Sweep Applications in Grid Environments. In Proceedings of the 9th heterogeneous computing workshop (hcw’00) (pp. 349–363). Casanova, H., Obertelli, G., Berman, F., & Wolski, R. (2000, Nov.). The AppLeS Parameter Sweep Template: User-Level Middleware for the Grid. In Proceedings of supercomputing 2000 (sc’00). Chien, A., Calder, B., Elbert, S., & Bhatia, K. (2003). Entropia: Architecture and performance of an enterprise desktop grid system. Journal of Parallel and Distributed Computing, 63, 597–610. doi:10.1016/ S0743-7315(03)00006-6
Cirne, W., Brasileiro, F., Andrade, N., Costa, L., Andrade, A., & Novaes, R. (2006, September). Labs of the world, unite!!! Journal of Grid Computing, 4(3), 225–246. doi:10.1007/s10723-006-9040-x Cohen, B. (2003). Incentives build robustness in BitTorrent. In Workshop on economics of peer-to-peer systems, Berkeley, CA. Costa, F., Silva, L., Fedak, G., & Kelley, I. (2008, in press). Optimizing the Data Distribution Layer of BOINC with BitTorrent. In 2nd workshop on desktop grids and volunteer computing systems (pcgrid 2008), Miami, FL. Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. In Osdi’04: Sixth symposium on operating system design and implementation, (pp. 137–150). San Francisco, CA. Domingues, P., Araujo, F., & Silva, L. M. (2006, December). A dht-based infrastructure for sharing checkpoints in desktop grid computing. In Conference on e-science and grid computing (escience ’06), Amsterdam, The Netherlands. Draves, S. (2005, March). The electric sheep screen-saver: A case study in aesthetic evolution. In 3rd european workshop on evolutionary music and art. Fedak, G., & Germain, C. N’eri, V., & Cappello, F. (2001, May). XtremWeb: A Generic Global Computing System. In Proceedings of the ieee international symposium on cluster computing and the grid (ccgrid’01). Fedak, G., He, H., & Cappello, F. (2008, November). BitDew: A Programmable Environment for LargeScale Data Management and Distribution. In Proceedings of the acm/ieee supercomputing conference (sc’08), Austin, TX. Federation, M. D. The Biomedical Informatics Research Network (2003). In I. Foster & C. Kesselman (Eds.), The grid, blueprint for a new computing infrastructure (2nd ed.). San Francisco: Morgan Kaufmann. Fernandess, Y., & Malkhi, D. (2006). On Collaborative Content Distribution using Multi-Message Gossip. In Proceedings of the international parallel and distributed processing symposium. Rhodes Island, Greece: IEEE. Foster, I., & Kesselman, C. (Eds.). (1999). The Grid: Blueprint for a New Computing Infrastructure. San Francisco, USA: Morgan Kaufmann Publishers, Inc. Foster, I. T., & Iamnitchi, A. (2003). On death, taxes, and the convergence of peer-to-peer and grid computing. 2735, 118-128. Ghare, G., & Leutenegger, L. (2004, June). Improving Speedup and Response Times by Replicating Parallel Programs on a SNOW. In Proceedings of the 10th workshop on job scheduling strategies for parallel processing. Ghormley, D., Petrou, D., Rodrigues, S., Vahdat, A., & Anderson, T. (1998, July). GLUnix: A global layer unix for a network of workstations. Software, Practice & Experience, 28(9), 929. doi:10.1002/ (SICI)1097-024X(19980725)28:9<929::AID-SPE183>3.0.CO;2-C
Gkantsidis, C., & Rodriguez, P. (2005, March). Network Coding for Large Scale Content Distribution. In Proceedings of ieee/infocom 2005, Miami, USA. Heien, E., Fujimoto, N., & Hagihara, K. (2008). Computing low latency batches with unreliable workers in volunteer computing environments. In Pcgrid. Hsu, A. (2005, March). Personal communication. Iamnitchi, A., Doraimani, S., & Garzoglio, G. (2006). Filecules in High-Energy Physics: Characteristics and Impact on Resource Management. In proceeding of 15th ieee international symposium on high performance distributed computing hpdc 15, Paris. Iamnitchi, A., Foster, I. T., & Nurmi, D. (2002). A peer-to-peer approach to resource location in grid environments. In Hpdc (p. 419). Jin, H., Xiong, M., Wu, S., & Zou, D. (2006). Replica Based Distributed Metadata Management in Grid Environment. Computational Science (LNCS 3944, pp. 1055-1062). Berlin: Springer-Verlag. Jung, E. B., Choi, S.-J., Baik, M.-S., Hwang, C.-S., Park, C.-Y., & Young, S. (2005). Scheduling scheme based on dedication rate in volunteer computing environment. In Third international symposium on parallel and distributed computing (ispdc 2005), Lille, France. Kim, J.-S., Nam, B., Keleher, P. J., Marsh, M. A., Bhattacharjee, B., & Sussman, A. (2006). Resource discovery techniques in distributed desktop grid environments. In Grid (pp. 9-16). Kondo, D., Araujo, F., Malecot, P., Domingues, P., Silva, L. M., & Fedak, G. (2006). Characterizing result errors in internet desktop grids (Tech. Rep. No. INRIA-HALTech Report 00102840), INRIA, France. Kondo, D., Chien, A., & H., C. (2004, November). Rapid Application Turnaround on Enterprise Desktop Grids. In Acm conference on high performance computing and networking, sc2004. Kondo, D., Chien, A. A., & Casanova, H. (2007). Scheduling task parallel applications for rapid turnaround on enterprise desktop grids. Journal of Grid Computing, 5(4), 379–405. doi:10.1007/s10723007-9063-y Kondo, D., Fedak, G., Cappello, F., Chien, A. A., & Casanova, H. (2006, December). On Resource Volatility in Enterprise Desktop Grids. In Proceedings of the 2nd IEEE International Conference On E-Science And Grid Computing (eScience’06) (pp. 78–86). Amsterdam, Netherlands. Kondo, D., Taufer, M., Brooks, C., Casanova, H., & Chien, A. (2004, April). Characterizing and evaluating desktop grids: An empirical study. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’04). Lee, S., Ren, X., & Eigenmann, R. (2008). Efficient content search in ishare, a p2p based internet-sharing system. In PCGRID. Leutenegger, S., & Sun, X. (1993). Distributed computing feasibility in a non-dedicated homogeneous distributed system. In Proceedings of SC’93, Portland, OR. Litzkow, M., Livny, M., & Mutka, M. (1988). Condor - A hunter of idle workstations. In Proceedings of the 8th International Conference Of Distributed Computing Systems (ICDCS).
Lodygensky, O., Fedak, G., Cappello, F., Neri, V., Livny, M., & Thain, D. (2003). XtremWeb & Condor: Sharing resources between Internet connected condor pools. In Proceedings of CCGRID’2003, Third International Workshop On Global And Peer-To-Peer Computing (GP2PC’03) (pp. 382–389). Tokyo, Japan. Lopez, J., Aeschlimann, M., Dinda, P., Kallivokas, L., Lowekamp, B., & O’Hallaron, D. (1999, June). Preliminary report on the design of a framework for distributed visualization. In Proceedings of the international conference on parallel and distributed processing techniques and applications (PDPTA’99) (pp. 1833–1839). Las Vegas, NV. Malécot, P., Kondo, D., & Fedak, G. (2006, June). Xtremlab: A system for characterizing internet desktop grids. In Poster in the 15th ieee international symposium on high performance distributed computing hpdc’06. Paris, France. Mattson, T., Sanders, B., & Massingill, B. (2004). Patterns for parallel programming. New York: Addison-Wesley. Maymounkov, P., & Mazières, D. (2002). Kademlia: A Peer-to-peer Information System Based on the XOR Metric. In Proceedings of the 1st international workshop on peer-to-peer systems (iptps’02) (pp. 53–65). Mutka, M., & Livny, M. (1991, July). The available capacity of a privately owned workstation environment. Performance Evaluation, 4(12). Mutka, M. W., & Livny, M. (1987). Profiling workstations’ available capacity for remote execution. In Proceedings of performance-87, the 12th ifip w.g. 7.3 international symposium on computer performance modeling, measurement and evaluation. Brussels, Belgium. Myers, D. S., Bazinet, A. L., & Cummings, M. P. (2008). Expanding the reach of grid computing: combining globus- and boinc-based systems. In Grids for Bioinformatics and Computational Biology. New York: Wiley. Nisan, N., London, S., Regev, O., & Camiel, N. (1998). Globally distributed computation over the internet - the popcorn project. In International conference on distributed computing systems 1998 (p. 592). New York: IEEE Computer Society. Otoo, E., Rotem, D., & Romosan, A. (2004). Optimal File-Bundle Caching Algorithms for Data-Grids. In Sc ’04: Proceedings of the 2004 acm/ieee conference on supercomputing (p. 6). Washington, DC: IEEE Computer Society. Pedroso, J., Silva, L., & Silva, J. (1997, June). Web-based metacomputing with JET. In Proc. of the acm ppopp workshop on java for science and engineering computation. PPDG. (2006). From fabric to physics (Tech. Rep.). The Particle Physics Data Grid. Pruyne, J., & Livny, M. (1996). A Worldwide Flock of Condors: Load Sharing among Workstation Clusters. Journal on Future Generations of Computer Systems, 12. Qiu, D., & Srikant, R. (2004). Modeling and performance analysis of bittorrent-like peer-to-peer networks. Computer Communication Review, 34(4), 367–378. doi:10.1145/1030194.1015508
Raicu, I., Zhao, Y., Dumitrescu, C., Foster, I., & Wilde, M. (2007). Falkon: a fast and light-weight task execution framework. In Ieee/acm supercomputing. Raman, R., Livny, M., & Solomon, M. H. (1998). Matchmaking: Distributed resource management for high throughput computing. In Hpdc (p. 140). Rhea, S. C., Eaton, P. R., Geels, D., Weatherspoon, H., Zhao, B. Y., & Kubiatowicz, J. (2003). Pond: The oceanstore prototype. In Fast. Rodriguez, A., Gonzalez, A., & Malumbres, M. P. Performance evaluation of parallel mpeg-4 video coding algorithms on clusters of workstations. International Conference on Parallel Computing in Electrical Engineering (PARELEC’04), 354-357. Rowstron, A., & Druschel, P. (2001, November). Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the 18th ifip/acm international conference on distributed systems platforms (middleware 2001), Heidelberg, Germany. Santos-Neto, E., Cirne, W., Brasileiro, F., & Lima, A. (2004). Exploiting Replication and Data Reuse to Efficiently Schedule Data-intensive Applications on Grids. In Proceedings of the 10th workshop on job scheduling strategies for parallel processing. Sarmenta, L. F. G. (2002). Sabotage-tolerance mechanisms for volunteer computing systems. Future Generation Computer Systems, 18(4), 561–572. doi:10.1016/S0167-739X(01)00077-2 Sarmenta, L. F. G., & Hirano, S. (1999). Bayanihan: Building and studying volunteer computing systems using Java. Future Generation Computer Systems, 15(5/6), 675-686. Sensor Networks. Retrieved from http://www.sensornetworks.net.au/network.html Sfiligoi, K. O., Venekamp, G., Yocum, D., Groep, D., & Petravick, D. (2007). Addressing the Pilot security problem with gLExec (Tech. Rep. No. FERMILAB-PUB-07-483-CD). Fermi National Laboratory, Batavia, IL. Shirts, M., & Pande, V. (2000). Screen savers of the world, unite! Science, 290, 1903–1904. doi:10.1126/ science.290.5498.1903 Shoch, J. F., & Hupp, J. A. (1982). 03). The “worm” programs - early experience with a distributed computation. Communications of the ACM, 3(25). Smallen, S., Casanova, H., & Berman, F. (2001, Nov.). Tunable on-line parallel tomography. In Proceedings of Supercomputing’01, Denver, CO. Sonnek, J. D., Nathan, M., Chandra, A., & Weissman, J. B. (2006). Reputation-based scheduling on unreliable distributed infrastructures. In ICDCS (p. 30). Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001, August). Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the ACM SIGCOMM ’01 Conference, San Diego, CA. Taufer, M., Anderson, D., Cicotti, P., & III, C. L. B. (2005). Homogeneous redundancy: a technique to ensure integrity of molecular simulation results using public computing. In Proceedings of The International Heterogeneity In Computing Workshop.
Thain, D., & Livny, M. (2004). Building reliable clients and services. In The grid2 (pp. 285–318). San Francisco: Morgan Kaufman. The seti@home project. Retrieved from http://setiathome.ssl.berkeley.edu/ Tsaregorodtsev, A., Garonne, V., & Stokes-Rees, I. (2004). Dirac: A scalable lightweight architecture for high throughput computing. In Fifth IEEE/ACM International Workshop On Grid Computing (Grid’04). Vazhkudai, S., & Ma, X. V. F., Strickland, J., Tammineedi, N., & Scott, S. (2005). Freeloader:scavenging desktop storage resources for scientific data. In Proceedings of Supercomputing 2005 (SC’05), Seattle, WA. Wei, B., Fedak, G., & Cappello, F. (2005). scheduling independent tasks sharing large data distributed with BitTorrent. In The 6th IEEE/ACM International Workshop On Grid Computing, 2005, Seattle, WA. Yacy - distributed p2p-based Web indexing. Zhao, S., & Lo, V. (2001, May). Result Verification and Trust-based Scheduling in Open Peer-to-Peer Cycle Sharing Systems. In Proceedings of Ieee Fifth International Conference on Peer-To-Peer Systems. Zhou, D., & Lo, V. M. (2006). Wavegrid: A scalable fast-turnaround heterogeneous peer-based desktop grid system. In IPDPS.
KEY TERMS AND DEFINITIONS
Cycle Stealing: Consists in using the unused cycles of desktop workstations. Participating workstations also donate some supporting amount of disk storage space, RAM, and network bandwidth, in addition to raw CPU power. The volunteers must get back full usage of their resources, with no delay, when they request them.
Desktop Grid: A computing environment making use of desktop computers connected via the Internet or via non-dedicated network connections. Desktop Grids are used not only for volunteer computing projects but also as enterprise Grids.
Master-Worker Paradigm: Consists of two kinds of entities: one master and several workers. The master decomposes the problem into smaller tasks and distributes them among the workers. Each worker receives a task from the master, executes it, and sends the result back to the master.
Result Certification: In distributed computing, result certification is a mechanism that aims to validate the results computed by volatile and possibly malicious hosts. The most common mechanisms for result validation are majority voting, spot-checking, and credibility-based techniques.
Volunteer Computing: An arrangement in which computer owners provide their computing resources to one or more projects that use them for distributed computing. The resulting Desktop Grids are made up of a multitude of tiny, uncontrollable administrative domains.
ENDNOTES
1 United Devices Inc., http://www.ud.com/
2 Platform Computing Inc., http://www.platform.com/
3 Mesh Technologies, http://www.meshtechnologies.com/
4 The COSM project, http://www.mithral.com/projects/cosm/
5 EINSTEIN@home, http://einstein.phys.uwm.edu
6 The Great Internet Mersenne Prime Search, http://www.mersenne.org/
7 Distributed.net, www.distributed.net
8 Electric Sheep, http://electricsheep.org/
9 XtremWeb-CH's website, http://www.xtremwebch.net/
10 Simple Light-weight Infrastructure for Network Computing, http://slinc.sourceforge.net/
11 XtremLab: A System for Characterizing Internet Desktop Grids, http://xtremlab.lri.fr
Chapter 4
Porting Applications to Grids1 Wolfgang Gentzsch EU Project DEISA and Board of Directors of the Open Grid Forum, Germany
ABSTRACT
The aim of this chapter is to guide developers and users through the most important stages of implementing software applications on Grid infrastructures, and to discuss important challenges and potential solutions. Those challenges come from the underlying grid infrastructure, including security, resource management, and information services; from the application data, data management, and the structure, volume, and location of the data; and from the application architecture, be it monolithic or workflow-based, serial or parallel. As a case study, the author presents DEISA, the Distributed European Infrastructure for Supercomputing Applications, and describes its DEISA Extreme Computing Initiative (DECI) for porting and running scientific grand-challenge applications. The chapter concludes with an outlook on Compute Clouds and suggests ten rules for building a sustainable grid as a prerequisite for the long-term sustainability of grid applications.
INTRODUCTION
Over the last 40 years, the history of computing has been deeply marked by the affliction of application developers, who are continuously porting and optimizing their application codes for the latest and greatest computing architectures and environments. After the von Neumann mainframe came the vector computer, then the shared-memory parallel computer, the distributed-memory parallel computer, the very-long-instruction-word computer, the workstation cluster, the meta-computer, and the Grid (and, never fear, it continues with SOA, Clouds, virtualization, many-core, and so on). There is no easy solution to this, and the real solution would be a separation of concerns between discipline-specific content and domain-independent software and hardware infrastructure. However, this often comes along with a loss
of performance stemming from the overhead of the infrastructure layers. Recently, users and developers have been facing another wave of complex computing infrastructures: the Grid. Let us start by answering the question: what is a Grid? Back in 1998, Ian Foster and Carl Kesselman (1998) attempted the following definition: "A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities." In a subsequent article, "The Anatomy of the Grid" (Foster, 2002), Ian Foster, Carl Kesselman, and Steve Tuecke changed this definition to include social and policy issues, stating that Grid computing is concerned with "coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations." The key concept is the ability to negotiate resource-sharing arrangements among a set of participating parties (providers and consumers) and then to use the resulting resource pool for some purpose. They continued: "The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering. This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form what we call a virtual organization." This author's concern, from the beginning (Gentzsch, 2002), was that the new definition seemed very ambitious, and, as history has shown, many of the Grid projects focused on these ambitious objectives have so far not led to sustainable grid production environments. We can only repeat that the simpler the grid infrastructure, the easier it is to use, and the sharper its focus, the bigger its chance of success. It is for a good reason (which we will explain in the following) that the so-called Clouds are currently becoming more and more popular (Amazon, 2007). Over the last ten years, hundreds of applications in science, industry, and enterprises have been ported to Grid infrastructures, mostly prototypes in the early sense of Foster & Kesselman (1998). Each application is unique in that it solves a specific problem, based on modeling, for example, a specific phenomenon in nature (physics, chemistry, biology, etc.), presented as a mathematical formula together with appropriate initial and boundary conditions, represented by its discrete analogue using sophisticated numerical methods, translated into a programming language computers can understand, adjusted to the underlying computer architecture, embedded in a workflow, and accessible remotely by the user through a secure, transparent, and application-specific portal. In just these few words, this summarizes the wide spectrum and complexity we face in problem solving on grid infrastructures. The user (and especially the developer) faces several layers of complexity when porting applications to a computing environment, especially to a compute or data grid of distributed networked nodes ranging from desktops to supercomputers. These nodes usually consist of several to many loosely or tightly coupled processors and, more and more, these processors contain a few to many cores.
To run efficiently on such systems, applications have to be adjusted to these different layers, taking into account different levels of granularity, from the fine-grain structures deployed on multi-core architectures at the processor level to the coarse granularity found in application workflows representing, for example, multi-physics applications. In addition, the user has to take into account the specific requirements of the grid, coming from the different components of the grid services architecture, such as security, resource management, information services, and data management. Obviously, in this chapter it is impossible to present and discuss the complete spectrum of applications and their adaptation and implementation on Grids. Therefore, we restrict ourselves in the following to a brief description of the different application classes and present a checklist (or classification) with
respect to grouping applications according to their appropriate grid-enabling strategy. For lack of space, we are not able to include here a discussion of mental, social, or legal aspects, which sometimes may be the knock-out criteria for running applications on a grid. Other show-stoppers, such as sensitive data, security concerns, licensing issues, and intellectual property, were discussed in some detail in Gentzsch (2007a). In the following, we consider the three main areas of impact on porting applications to grids: infrastructure issues, data management issues, and application architecture issues. These issues can have an impact on the effort and success of porting, on the resulting performance of the grid application, and on user-friendly access to the resources, the grid services, the application, the data, and the final processing results, among others.
APPLICATIONS AND THE GRID INFRASTRUCTURE
As mentioned before, the successful porting of an application to a grid environment depends highly on the underlying distributed resource infrastructure. The main service components offered by a grid infrastructure are security, resource management, information services, and data management. Bart Jacob et al. suggest that each of these components can affect the application architecture, its design, deployment, and performance. Therefore, the user has to go through the process of matching the application (its structure and requirements) with these components of the grid infrastructure, as described here, closely following the description in Jacob et al. (2003).
Applications and Security
The security functions within the grid architecture are responsible for the authentication and authorization of the user and for secure communication between the grid resources. Fortunately, these functions are an inherent part of most grid infrastructures and usually do not affect the applications themselves, provided the user (and thus the user's application) is authorized to use the required resources. Security may, however, have to be taken into account from an application point of view when sensitive data is passed to a resource to be processed by a job and written to the local disk in a non-encrypted format, where other users or applications might have access to it.
Applications and Resource Management
The resource management component provides the facilities to allocate a job to a particular resource, provides a means to track the status of the job while it is running as well as its completion information, and provides the capability to cancel a job or otherwise manage it. In conjunction with the Monitoring and Discovery Service (described below), the application must ensure that the appropriate target resource(s) are used. This requires that the application accurately specifies the required environment (operating system, processor, speed, memory, and so on). The more the application developer can do to eliminate specific dependencies, the better the chance that an available resource can be found and that the job will complete. If an application includes multiple jobs, the user must understand (and maybe reduce) their interdependencies; otherwise, logic has to be built in to handle items such as inter-process communication, sharing of data, and concurrent job submissions. Finally, job management provides mechanisms to
query the status of the job as well as perform operations such as canceling the job. The application may need to utilize these capabilities to provide feedback to the user or to clean up or free up resources when required. For instance, if one job within an application fails, other jobs that may be dependent on it may need to be cancelled before needlessly consuming resources that could be used by other jobs.
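This clean-up amounts to a reachability walk over the job dependency graph: every job that transitively depends on the failed job is a candidate for cancellation. A minimal C++ sketch, with assumed data structures, is shown below.

#include <queue>
#include <vector>

// Jobs are numbered 0..n-1; children[i] lists the jobs that depend on job i.
// Returns every job that should be cancelled because 'failed' can no longer
// deliver its results.
std::vector<int> jobs_to_cancel(int failed,
                                const std::vector<std::vector<int>>& children) {
    std::vector<bool> seen(children.size(), false);
    std::vector<int> cancel;
    std::queue<int> q;
    q.push(failed);
    seen[failed] = true;
    while (!q.empty()) {
        int j = q.front(); q.pop();
        for (int c : children[j]) {
            if (!seen[c]) {
                seen[c] = true;
                cancel.push_back(c);   // transitively dependent on the failed job
                q.push(c);
            }
        }
    }
    return cancel;
}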
Applications and Resource Information Services
An important part of the process of grid-enabling an application is to identify the appropriate (if not optimal) resources needed to run the application, i.e., the resources to submit the respective jobs to. The service that maintains and provides the knowledge about the grid resources is the Grid Information Service (GIS), also known as the Monitoring and Discovery Service (e.g., MDS in Globus [xx]). MDS provides access to static and dynamic information about resources. Basically, it contains the following components:
• Grid Resource Information Service (GRIS): the repository of local resource information derived from information providers.
• Grid Index Information Service (GIIS): the repository that contains indexes of resource information registered by the GRIS and other GIISs.
• Information providers: translate the properties and status of local resources to the format defined in the schema and configuration files.
• MDS client: initially performs a search for information about resources in the grid environment.
Resource information is obtained by the information provider and it is passed to GRIS. GRIS registers its local information with the GIIS, which can optionally also register with another GIIS, and so on. MDS clients can query the resource information directly from GRIS (for local resources) and/or a GIIS (for grid-wide resources). It is important to fully understand the requirements for a specific job so that the MDS query can be correctly formatted to return resources that are appropriate. The user has to ensure that the proper information is in MDS. There is a large amount of data about the resources within the grid that is available by default within the MDS. However, if the application requires special resources or information that is not there by default, the user may need to write her own information providers and add the appropriate fields to the schema. This may allow the application or broker to query for the existence of the particular resource/requirement.
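Independently of the concrete MDS query syntax, the matching step reduces to filtering the published resource records against the job's stated requirements. The C++ sketch below shows that filtering with hypothetical attribute names; it does not use the Globus MDS API.

#include <string>
#include <vector>

// Simplified resource record as it might be published by an information service.
struct Resource {
    std::string name, os, arch;
    double cpu_mhz;
    double free_memory_mb;
};

// Requirements the application states for its jobs.
struct Requirements {
    std::string os, arch;
    double min_cpu_mhz;
    double min_memory_mb;
};

// Return every published resource that satisfies the job's requirements.
std::vector<Resource> match(const std::vector<Resource>& pool,
                            const Requirements& req) {
    std::vector<Resource> hits;
    for (const auto& r : pool)
        if (r.os == req.os && r.arch == req.arch &&
            r.cpu_mhz >= req.min_cpu_mhz && r.free_memory_mb >= req.min_memory_mb)
            hits.push_back(r);
    return hits;
}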
Applications and Data Management
Data management is concerned with collectively maximizing the use of the limited storage space, networking bandwidth, and computing resources. The application has built-in data requirements that determine how data will be moved around the infrastructure or otherwise accessed in a secure and efficient manner. Standardizing on a set of grid protocols allows communication with any data source that is available within the software design. Data-intensive applications in particular often use a federated database to create a virtual data store; other options include storage area networks, network file systems, and dedicated storage servers. Middleware such as the Globus Toolkit provides the GridFTP and Global Access to Secondary Storage (GASS) data transfer utilities in the grid environment. The
GridFTP facility (an extension of FTP, the File Transfer Protocol) provides secure and reliable data transfer between grid hosts. Developers and users face a few important data management issues that need to be considered in application design and implementation. For large datasets, for example, it is not practical and may be impossible to move the data to the system where the job will actually run. Using data replication, or otherwise copying a subset of the entire dataset to the target system, may provide a solution. If the grid resources are geographically distributed with limited network connection speeds, design considerations around slow or limited data access must be taken into account. Security, reliability, and performance become issues when moving data across the Internet. Where data access may be slow or prevented, one has to build the required logic to handle the situation. To assure that the data is available at the appropriate location by the time the job requires it, the user should schedule the data transfer in advance. One should also be aware of the number and size of any concurrent transfers to or from any one resource at the same time. Besides the above-described main requirements for applications to run efficiently on a grid infrastructure, there are a few more issues which are discussed in Jacob (2003), such as scheduling, load balancing, grid brokers, inter-process communication, and portals for easy access, as well as non-functional requirements such as performance, reliability, topology aspects, and consideration of mixed platform environments.
The Simple API for Grid Applications (SAGA)
Among the many efforts in the grid community to develop tools and standards which simplify the porting of applications to Grids by enabling the application to make easy use of the Grid middleware services described above, one of the more predominant ones is SAGA, a high-level Application Programming Interface (API), or programming abstraction, defined by the Open Grid Forum (OGF, 2008), an international committee that coordinates the standardization of Grid middleware and architectures. SAGA intends to simplify the development of grid-enabled applications, even for scientists without any background in computer science or grid computing. Historically, SAGA was influenced by the work on the GAT (Grid Application Toolkit), a C-based API developed in the EU-funded project GridLab (GAT, 2005). The purpose of SAGA is two-fold:
1. Provide a simple API that can be used with much less effort compared to the interfaces of existing grid middleware.
2. Provide a standardized, portable, common interface for the various grid middleware systems.
According to Goodale (2008), SAGA facilitates rapid prototyping of new grid applications by allowing developers a means to concisely state very complex goals using a minimum amount of code. SAGA provides a simple, POSIX-style API to the most common Grid functions at a sufficiently high level of abstraction so as to be independent of the diverse and dynamic Grid environments. The SAGA specification defines interfaces for the most common Grid-programming functions, grouped as a set of functional packages. Version 1.0 (Goodale, 2008) defines the following packages:
• File package: provides methods for accessing local and remote file systems, browsing directories, moving, copying, and deleting files, setting access permissions, as well as zero-copy reading and writing.
• Replica package: provides methods for replica management, such as browsing logical file systems, moving, copying, and deleting logical entries, adding and removing physical files from a logical file entry, and searching logical files based on attribute sets.
• Job package: provides methods for describing, submitting, monitoring, and controlling local and remote jobs. Many parts of this package were derived from the widely adopted DRMAA [11] specification.
• Stream package: provides methods for authenticated local and remote socket connections, with hooks to support authorization and encryption schemes.
• RPC package: an implementation of the OGF GridRPC API definition; provides methods for unified remote procedure calls.
Two critical aspects of SAGA are its simplicity of use and the fact that it is well on the road to becoming a community standard; these two properties provide the added value of using SAGA for Grid application development. Simplicity arises from limiting the scope to only the most common and important grid functionality required by applications. Standardization means that the interface is derived from a wide range of applications using a collaborative approach, and that the result is endorsed by the broader community. More information about the SAGA C++ reference implementation (developed at the Center for Computation and Technology at Louisiana State University) and various aspects of grid-enabling toolkits is available on the SAGA implementation home page (SAGA, 2006).
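To give a flavour of the API's level of abstraction, the snippet below submits and waits for a single job through the SAGA job package. It follows the style of the C++ reference implementation, but the header path, the attribute constant, and the use of a local "fork://" adaptor are assumptions that should be checked against the implementation's documentation.

// Sketch only: the header name, attribute constant and "fork://" scheme are
// assumptions based on the SAGA C++ reference implementation and may differ.
#include <saga/saga.hpp>

int main() {
    try {
        saga::job::description jd;
        jd.set_attribute(saga::job::attributes::description_executable,
                         "/bin/hostname");

        saga::job::service js("fork://localhost");  // local adaptor as an example
        saga::job::job j = js.create_job(jd);

        j.run();    // submit the job
        j.wait();   // block until the job has finished
    } catch (const saga::exception&) {
        // Middleware-specific failures surface as SAGA exceptions.
        return 1;
    }
    return 0;
}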
GRID APPLICATIONS AND DATA Any e-science application at its core has to deal with data, from input data (e.g. in the form of output data from sensors, or as initial or boundary data), to processing data and storing of intermediate results, to producing final results (e.g. data used for visualization). Data has a strong influence on many aspects of the design and deployment of an application and determines whether a grid application can be successfully ported to the grid. Therefore, in the following, we present a brief overview of the main data management related aspects, tasks and issues which might affect the process of grid-enabling an application, such as data types and size, shared data access, temporary data spaces, network bandwidth, time-sensitive data, location of data, data volume and scalability, encrypted data, shared file systems, databases, replication, and caching. For a more in-depth discussion of data management related tasks, issues, and techniques, we refer to Bart Jacob’s tutorial on application enabling with Globus (Jacob, 2003).
Shared Data Access Sharing data access can occur with concurrent jobs and other processes within the network. Access to data input and the data output of the jobs can be of various kinds. During the planning and design of the grid application, potential restrictions on the access of databases, files, or other data stores for either read or write have to be considered. The installed policies need to be observed and sufficient access rights have to be granted to the jobs. Concerning the availability of data in shared resources,
it must be assured that at run time of the individual jobs the required data sources are available in the appropriate form and at the expected service level. Potential data access conflicts need to be identified up front and planned for: individual jobs should not try to update the same record at the same time, nor deadlock each other. Care has to be taken with situations of concurrent access, and resolution policies must be imposed. The use of federated databases may be useful in data grids where jobs must handle large amounts of data held in various different data stores. They offer a single interface to the application and are capable of accessing data in large heterogeneous environments. Federated database systems contain information about the location (node, database, table, record) and access methods (SQL, VSAM, privately defined methods) of the connected data sources. A simplified interface to the user (a grid job or other client) therefore requires that the essential information for a request should not include the data source; rather, a discovery service is used to determine the relevant data source and access method.
Data Topology Issues about the size of the data, network bandwidth, and time sensitivity of data determine the location of data for a grid application. The total amount of data within the grid application may exceed the amount of data input and output of the grid application, as there can be a series of sub-jobs that produce data for other sub-jobs. For permanent storage the grid user needs to be able to locate where the required storage space is available in the grid. Other temporary data sets that may need to be copied from or to the client also need to be considered. The amount of data that has to be transported over the network is restricted by available bandwidth. Less bandwidth requires careful planning of the data traffic among the distributed components of a grid application at runtime. Compression and decompression techniques are useful to reduce the data amount to be transported over the network. But in turn, it raises the issue of consistent techniques on all involved nodes. This may exclude the utilization of scavenging for a grid, if there are no agreed standards universally available. Another issue in this context is time-sensitive data. Some data may have a certain lifetime, meaning its values are only valid during a defined time period. The jobs in a grid application have to reflect this in order to operate with valid data when executing. Especially when using data caching or other replication techniques, it has to be assured that the data used by the jobs is up-to-date, at any given point in time. The order of data processing by the individual jobs, especially the production of input data for subsequent jobs, has to be carefully observed. Depending on the job, the authors Jacob at al. (2003) recommend to consider the following datarelated questions which refer to input as well as output data of the jobs within the grid application: • • • • •
• Is it reasonable that each job or set of jobs accesses the data via the network?
• Does it make sense to transport a job or set of jobs to the data location?
• Is there any data access server (for example, implemented as a federated database) that allows access by a job locally or remotely via the network?
• Are there time constraints for data transport over the network, for example, to avoid busy hours and transport the data to the jobs in a batch job during off-peak hours?
• Is there a caching system available on the network to be exploited for serving the same data to several consuming jobs?
• Is the data only available in a unique location for access, or are there replicas that are closer to the executable within the grid?
Data Volume

The ability of a grid job to access the data it needs will affect the performance of the application. When the data involved is either a large amount of data or a subset of a very large data set, moving the data set to the execution node is not always feasible. The considerations as to what is feasible include the volume of the data to be handled, the bandwidth of the network, and logical interdependencies on the data between multiple jobs.

Data volume issues: A grid application requires transparent access to its input and output data. In most cases the relevant data is permanently located at remote locations and the jobs are likely to process local copies. This access to the data incurs a network cost, which must be carefully quantified. Data volume and network bandwidth play an important role in determining the scalability of a grid application.

Data splitting and separation: Data topology considerations may require the splitting, extraction, or replication of data from the data sources involved. There are two general approaches that are suitable for higher scalability in a grid application: independent tasks per job, and a static input file for all jobs. In the case of independent tasks, the application can be split into several jobs that are able to work independently on disjoint subsets of the input data. Each job produces its own output data, and gathering all of the job results yields the overall output. The scalability of such a solution depends on the time required to transfer the input data, and on the processing time to prepare the input data and generate the final result. In this case the input data may be transported to the individual nodes on which the corresponding jobs are to be run. Preloading of the data might be possible, depending on other criteria like the timeliness of the data or the size of the separated data subsets in relation to the network bandwidth. In the case of a static input file, each job repeatedly works on the same static input data, but with different parameters, over a long period of time, generating differing results. A major improvement in the performance of the grid application may be obtained by transferring the input data ahead of time, as close as possible to the compute nodes.

Other cases of data separation: More unfavorable cases may appear when jobs have dependencies on each other. The application flow must be carefully checked in order to determine the level of parallelism that can be reached. The number of jobs that can be run simultaneously without dependencies is important in this context. For such simultaneously running jobs, synchronization mechanisms need to be in place to handle the concurrent access to the data.

Synchronizing access to one output file: Here all jobs work with common input data and generate their output to be stored in a common data store. Generating the output data therefore requires software that provides synchronization between the jobs. Another way to handle this case is to let each job generate individual output files, and then run a post-processing program to merge all these output files into the final result. A similar case is that each job has its own individual input data set to consume, while all jobs produce output data to be stored in a common data set.
As described above, the synchronization of the output into the final result can be done through software designed for the task. Hence, a thorough evaluation of the input and output data of the jobs in the grid application is needed in order to handle it properly. One should also weigh the available data tools, such as federated databases, data joiners, and related products and technologies, in case the grid application is highly data-oriented or the data shows a complex structure.
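The "independent tasks per job" and "merge individual output files" patterns described above can be illustrated with a small, self-contained Python sketch. The splitting granularity, file names, and per-job computation are assumptions chosen only for illustration; in a real grid, each process_chunk call would be a separate job handled by the scheduler.

# Sketch of the split / independent-jobs / merge pattern (illustrative only).

def split_input(records, n_jobs):
    """Partition the input into n_jobs disjoint subsets."""
    return [records[i::n_jobs] for i in range(n_jobs)]

def process_chunk(job_id, chunk):
    """Stand-in for one grid job working on its own subset of the input."""
    result = sum(x * x for x in chunk)           # placeholder computation
    out_name = f"out_{job_id}.txt"
    with open(out_name, "w") as f:               # each job writes its own file
        f.write(str(result))
    return out_name

def merge_outputs(files):
    """Post-processing step that merges the per-job outputs into one result."""
    total = 0.0
    for name in files:
        with open(name) as f:
            total += float(f.read())
    return total

if __name__ == "__main__":
    data = list(range(1000))
    outputs = [process_chunk(i, c) for i, c in enumerate(split_input(data, 4))]
    print("final result:", merge_outputs(outputs))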
PORTING AND PROGRAMMING GRID APPLICATIONS

Besides taking into account the underlying grid resources and the application's data handling, as discussed in the previous two sections, another challenge is the porting of the application program itself. In this context, developers and users face mainly two different approaches when implementing their application on a grid. Either they port an existing application code onto a set of distributed grid resources; often, such an application has previously been developed and optimized with a specific computer architecture in mind, for example mainframes or servers, single- or multiple-CPU vector computers, shared- or distributed-memory parallel computers, or loosely coupled distributed systems such as workstation clusters. Or they start from scratch and design and develop a new application program with the grid in mind, often such that the application architecture and its inherent numerical algorithms are optimally mapped onto the best-suited (set of) resources in a grid. In both scenarios, the effort of implementing an application can be huge. It is therefore important to perform a careful analysis beforehand of: the user requirements for running the application on a grid (e.g. cost, time); the application type (e.g. compute- or data-intensive); the application architecture and algorithms (e.g. explicit or implicit) and the application components and how they interact (e.g. loosely or tightly coupled, or workflows); the best way to map the application onto a grid; and the best-suited grid architecture to run the application in an optimally performing way. In the following, we therefore summarize the most popular strategies for porting an existing application to a grid, and for designing and developing a new grid application. Many scientific papers and books deal with the issues of designing, programming, and porting grid applications, and it is difficult to recommend the best suited among them. Here, we mainly follow the books by Ian Foster and Carl Kesselman (1999 & 2004), the IBM Redbook (Jacob, 2003), the SURA Grid Technology Cookbook (SURA, 2007), several research papers on programming models and environments, e.g. Soh (2006), Badia (2003), Karonis (2002), Seymour (2002), Buyya (2000), Venugopal (2004), Luther (2005), Altintas (2004), and Frey (2005), and our own experience at Sun Microsystems and MCNC (Gentzsch, 2004), RENCI (Gentzsch, 2007), D-Grid (Gentzsch, 2008, and Neuroth, 2007), and currently in DEISA-2 (2008).
Grid Programming Models and Environments

Our own experience in porting applications to distributed resource environments is very similar to that of Soh et al. (2006), who present a useful discussion of grid programming models and environments, which we briefly summarize in the following. In their paper, they start by differentiating application porting into resource composition and program composition. Resource composition, i.e. matching the application to the grid resources needed, has already been discussed in the previous two sections. Concerning program composition, there is a wide spectrum of strategies for distributing an application onto the available grid resources. This spectrum ranges from the ideal situation of simply distributing a list of, say, n parameters together with n identical copies of the application program onto the Grid, to the other end of the spectrum where one has to compose or parallelize the program into chunks or components
that can be distributed to the grid resources for execution. In the latter case, Soh (2006) differentiates between implicit parallelism, where programs are automatically parallelized by the environment, and explicit parallelism, which requires the programmer to be responsible for most of the parallelization effort, such as task decomposition, mapping tasks to processors, and inter-task communication. However, implicit approaches often lead to non-scalable parallel performance, while explicit approaches are often complex and work- and time-consuming. In the following we summarize and update the approaches and methods discussed in detail in Soh (2006):

Superscalar (or StarSs). Sequential applications composed of tasks are automatically converted into parallel applications in which the tasks are executed on different parallel resources. The parallelization takes into account the existing data dependencies between the tasks, building a dependence graph. The runtime takes care of the task scheduling and data handling between the different resources, and takes into account, among other aspects, the locality of the data. Several implementations are available, such as GRID Superscalar (GRIDSs) for computational Grids (Badia, 2003), which is also used in production on the MareNostrum supercomputer at BSC in Barcelona; Cell Superscalar (CellSs) for the Cell processor (Perez, 2007); and SMP Superscalar (SMPSs) for homogeneous multicores or shared-memory machines.

Explicit Communication, such as Message Passing and Remote Procedure Call (RPC). A message-passing example is MPICH-G2 (Karonis, 2002), a Grid-enabled implementation of the Message Passing Interface (MPI), which defines standard functions for communication between processes and groups of processes, extended by the Globus Toolkit. An RPC example is GridRPC, an API for Grids (Seymour, 2002), which offers a convenient, high-level abstraction whereby many interactions with a Grid environment can be hidden.

Bag of Tasks, i.e. sets of independent tasks which can easily be distributed on grid resources. An example is the Nimrod-G Broker (Buyya, 2000), a Grid-aware version of Nimrod, a specialized parametric modeling system. Nimrod uses a simple declarative parametric modeling language and automates the task of formulating, running, monitoring, and aggregating results. Another example is the Gridbus Broker (Venugopal, 2004), which gives users transparent access to heterogeneous Grid resources.

Distributed Objects, as in ProActive (2005), a Java-based library that provides an API for the creation, execution and management of distributed active objects. ProActive is composed of only standard Java classes and requires no changes to the Java Virtual Machine (JVM), allowing Grid applications to be developed using standard Java code.

Distributed Threads, for example Alchemi (Luther, 2005), a Microsoft .NET Grid computing framework consisting of service-oriented middleware and an application program interface (API). Alchemi features a simple and familiar multithreaded programming model.

Grid Workflows. Many workflow environments have been developed for Grids in recent years, such as Triana, Taverna, Simdat, P-GRADE, and Kepler. Kepler, for example, is a scientific workflow management system along with a set of Application Program Interfaces (APIs) for heterogeneous hierarchical modeling (Altintas, 2004). Kepler provides a modular, activity-oriented programming environment with an intuitive GUI to build complex scientific workflows.
Grid Services. An example is the Open Grid Services Architecture (OGSA) (Frey, 2005), an ongoing effort that aims to enable interoperability between heterogeneous resources by aligning Grid technologies with established Web services technology. The concept of a Grid service is introduced as a Web service that provides a set of well-defined interfaces that follow specific conventions. These Grid services can be composed into more sophisticated services to meet the needs of users.
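To make the explicit message-passing model above more concrete, here is a minimal sketch using the mpi4py Python bindings (plain MPI rather than the Grid-enabled MPICH-G2 itself); the parameter names, values, and tags are arbitrary illustrative assumptions.

# Minimal message-passing sketch with mpi4py (run with: mpiexec -n 4 python sweep.py).
# Rank 0 sends one parameter set to each worker; each worker computes a result
# and sends it back. MPICH-G2 would run the same kind of MPI program across
# Grid sites; this is only a local illustration.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # one parameter set per worker (placeholder values)
    for dest in range(1, size):
        comm.send({"viscosity": 0.01 * dest}, dest=dest, tag=1)
    results = [comm.recv(source=src, tag=2) for src in range(1, size)]
    print("collected:", results)
else:
    params = comm.recv(source=0, tag=1)
    result = params["viscosity"] ** 2            # stand-in for the real simulation
    comm.send((rank, result), dest=0, tag=2)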
Grid-Enabling Application Programs and Numerical Algorithms

In many cases, restructuring (grid-enabling, decomposing, parallelizing) the core algorithm(s) within a single application program does not make sense, especially when a more powerful, higher-level grid-enabling strategy is available. Examples are parameter jobs (see below), where many identical copies of the application program together with different data sets can easily be distributed onto many grid nodes; applications whose program components can be mapped onto a workflow; or applications that are simply too small (in granularity, run time, spatial dimension, etc.) to run efficiently on a grid, because grid latencies and management overhead become dominant. In other cases, however, where e.g. just one very long run has to be performed, grid-enabling the application program itself can lead to dramatic performance improvements and, thus, time savings. In an effort to better guide the reader through this complex field, in the following we briefly present a few popular application codes and their algorithmic structure, and provide recommendations for some meaningful grid-enabling strategies.

General Approach. First, we have to make sure that we gain an important benefit from running our application on a grid. We should start by asking a few general questions, top-down: Has this code been developed in-house, or is it a third-party code developed elsewhere? Will I submit many jobs (as e.g. in a parameter study), or is the overall application structure a workflow, or is it a single monolithic application code? In the case of the latter, are the core algorithms within the application program of explicit or of implicit nature? In many cases, grid-enabling these kinds of applications can build on experience gained in the past with parallelizing them for moderately or massively parallel systems, see e.g. Fox et al. (1994) and Dongarra et al. (2003).

In-house Codes. In the case of an application code developed in-house, the source code of the application is often still available, and ideally the code developers are still around. Then we have the possibility to analyze the structure of the code, its components (subroutines), dependencies, data handling, core algorithms, etc. With older codes, this analysis has sometimes already been done, especially for the vector and parallel computer architectures of the 1980s and 1990s. Some of this knowledge can be re-used for the grid-enabling process, and often only minor adjustments are needed to port such a code to the grid.

Third-Party Codes licensed from so-called Independent Software Vendors (ISVs) cannot be grid-enabled without the support of these ISVs. In this case, therefore, we recommend contacting the ISV. If the ISV receives similar requests from other customers as well, there is a real chance that it will either provide a grid-enabled code, completely change its sales strategy and sell its software as a service, or develop its own application portal to provide access to the application and the computing resources. But, obviously, this requires patience and is thus not a solution if you are under a time constraint.

Parameter Jobs. In science and engineering, the application often has to run many times: same code, different data. Only a few parameters have to be modified for each individual job, and at the end of the many job runs, the results are analyzed with statistical or stochastic methods to find a certain optimum; a minimal sketch of such a parameter sweep is given below.
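The following hypothetical sketch enumerates the parameter combinations and hands one job description per combination to a submit function. The submit_job helper and the crash_solver command line are placeholders for whatever broker or DRM (e.g. Nimrod-G or Grid Engine) and application are actually used.

# Parameter-sweep sketch: same code, different data (illustrative only).
# submit_job() is a hypothetical placeholder; in practice it would call a
# resource broker or the submit tool of a distributed resource manager.

import itertools

def submit_job(job_name, command):
    # Placeholder: print instead of really submitting to a grid scheduler.
    print(f"submitting {job_name}: {command}")

def parameter_sweep():
    speeds = [40, 56, 64]            # km/h, example crash speeds
    angles = [0, 15, 30]             # degrees, example impact angles
    for i, (v, a) in enumerate(itertools.product(speeds, angles)):
        submit_job(
            job_name=f"crash_{i:03d}",
            command=f"crash_solver --speed {v} --angle {a} --out result_{i:03d}.dat",
        )

if __name__ == "__main__":
    parameter_sweep()   # 9 independent jobs; results are analyzed afterwards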
For example, during the design of a new car model, many crash simulations have to be performed, with the aim of finding the best-suited material and geometry for a specific part of the wire-frame model of the car.

Application Workflows. It is very common in so-called Problem Solving Environments that the application program consists of a set of components or modules which interact with each other. This can be modeled in grid workflow environments which support the design and the execution of the workflow
representing the application program. Usually, these grid workflow environments contain a middleware layer which maps the application modules onto the different resources in the grid. Many workflow environments have been developed for Grids in recent years, such as Triana (2003), Taverna (2008), Simdat (2008), P-GRADE (2003), and Kepler (Altintas, 2004). One application which is well suited for such a workflow is climate simulation. Today's climate codes consist of modules for simulating the weather on the continent with mesoscale meteorology models, and include other modules for taking into account the influence of the ocean and ocean currents, snow and ice, sea ice, wind, clouds and precipitation, solar and terrestrial radiation, absorption, emission, and reflection, land surface processes, volcanic gases and particles, and human influences. Interactions happen between all these components, e.g. air-ocean, air-ice, ice-ocean, ocean-land, etc., resulting in a quite complex workflow which can be mapped onto the underlying grid infrastructure.

Highly Parallel Applications. Amdahl's Law states that the scalar portion of a parallel program becomes a dominant factor as the processor number increases, leading to a loss in application scalability with a growing number of processors. Gustafson (1988) proved that this holds only for fixed problem size, and that in practice, with an increasing number of processors, the user increases the problem size as well, always trying to solve the largest possible problem on any given number of CPUs. Gustafson demonstrated this on a 1024-processor parallel system for several applications; for example, he was able to achieve a speed-up factor of over 1000 for a Computational Fluid Dynamics application with 1024 parallel processes on the 1024-processor system. Porting these highly parallel applications to a grid, however, has shown that many of them degrade in performance simply because the overhead of communication for message-passing operations (e.g. send and receive) increases from a few microseconds on a tightly coupled parallel system to a few milliseconds on a (loosely coupled) workstation cluster or grid. In this case, therefore, we recommend implementing a coarse-grain Domain Decomposition approach, i.e. dynamically partitioning the overall computational domain into sub-domains (each consisting of as many parallel processes, volumes, finite elements, etc. as possible), such that each sub-domain completely fits onto the available processors of the corresponding parallel system in the grid. Thus, only moderate performance degradation from the reduced amount of inter-system communication can be expected. A prerequisite for this to work successfully is that the subset of selected parallel systems is of a homogeneous nature, i.e. the architecture and operating system of these parallel systems should be identical. One Grid infrastructure which offers this feature is the Distributed European Infrastructure for Supercomputing Applications (DEISA, 2008), which (among others) provides a homogeneous cluster of parallel AIX machines distributed over several of the 11 European supercomputing centers which are part of DEISA (see also Section 5 in this Chapter).
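A minimal sketch of the coarse-grain domain decomposition recommended above, under the assumption of a 1-D range of cells and invented processor counts, might look as follows; real codes would of course partition 2-D or 3-D meshes and exchange boundary data between the sub-domains.

# Coarse-grain domain decomposition sketch (illustrative assumptions only).
# Cut a 1-D range of cells into one contiguous sub-domain per parallel system,
# proportional to the processors each system contributes, so that fine-grain
# (microsecond) communication stays inside a system and only sub-domain
# boundaries are exchanged over the slow (millisecond) wide-area links.

def decompose(total_cells, processors_per_system):
    total_procs = sum(processors_per_system)
    subdomains, start = [], 0
    for i, procs in enumerate(processors_per_system):
        if i == len(processors_per_system) - 1:
            end = total_cells                      # last system takes the remainder
        else:
            end = start + round(total_cells * procs / total_procs)
        subdomains.append((start, end))
        start = end
    return subdomains

if __name__ == "__main__":
    # three homogeneous systems in the grid, e.g. contributing 512, 256 and 256 processors
    parts = decompose(total_cells=1_000_000, processors_per_system=[512, 256, 256])
    for i, (lo, hi) in enumerate(parts):
        print(f"system {i}: cells {lo}..{hi - 1}")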
Moderately Parallel Applications. These applications, which have been parallelized in the past, often using MPI message-passing library functions for the inter-process communication on workstation clusters or on small parallel systems, are well suited for parallel systems with perhaps a few dozen to a few hundred processors, but they won't scale easily to a large number of parallel processes (and processors). The reasons are a significant scalar portion of the code which cannot run in parallel and/or a relatively high ratio of inter-process communication to computation, resulting in relatively high idle times of the CPUs waiting for data. Many commercial codes fall into this category, for example finite-element codes such as Abaqus, Nastran, or Pamcrash. Here we recommend checking whether the main goal is to analyze many similar scenarios with one and the same code but on different data sets; if so, run as many code instances in parallel as possible, on as many moderately parallel sub-systems as possible (these could be virtualized sub-systems on one large supercomputer, for example).
Explicit vs. Implicit Algorithms. Discrete analogues of systems of partial differential equations, stemming from numerical methods such as finite difference, finite volume, or finite element discretizations, often result in large sets of explicit or implicit algebraic equations for the unknown discrete variables (e.g. velocity vectors, pressure, temperature). The explicit methods are usually slower (in convergence to the exact solution vector of the algebraic system) than the implicit ones, but they are also inherently parallel, because the solution variables do not depend on each other within an update step, and therefore no recursive algorithms are involved. In the case of the more accurate implicit methods, however, the solution variables are highly inter-dependent, leading to recursive sparse-matrix systems of algebraic equations which cannot easily be split (parallelized) into smaller systems. Again, here we recommend introducing a Domain Decomposition approach as described in the section on Highly Parallel Applications above: solve an implicit sparse-matrix system within each domain, and bundle sets of 'neighboring' domains into super-sets to be submitted to the (homogeneous) grid.

Domain Decomposition. This has been discussed in the paragraphs on Highly Parallel Applications and on Explicit vs. Implicit Algorithms.

Job Mix. Last but not least, one of the most trivial but most widely used scenarios, often found in university and research computer centers, is the general job mix, stemming from hundreds or thousands of daily users, with hundreds or even thousands of different applications, with varying requirements for computer architecture, data handling, memory and disk space, timing, priority, etc. This scenario is ideal for a grid which is managed by an intelligent Distributed Resource Manager (DRM), for example GridWay (2008) for a global grid, Sun Grid Engine Enterprise Edition (Chaubal, 2003) for an enterprise grid, or the open source Grid Engine (2001) for a departmental grid or a simple cluster. These DRMs are able to balance the overall job load evenly across the distributed resource environment and always submit the jobs to the best-suited and least-loaded resources. This can result in an overall resource utilization of 90% and higher.
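Returning to the explicit/implicit distinction above: the following minimal Python sketch shows one explicit update step for a 1-D diffusion-type stencil, illustrating why explicit schemes parallelize so naturally (each new value depends only on values from the previous step). The grid size and coefficient are arbitrary illustrative choices, not taken from any particular code.

# Explicit update step for a 1-D diffusion-type stencil (illustrative only).
# u_new[i] depends only on values from the previous time step, so disjoint
# index ranges can be updated in parallel on different processors; an implicit
# scheme would instead couple all unknowns in one sparse linear system.

def explicit_step(u, alpha=0.1):
    u_new = u[:]                       # keep the boundary values fixed
    for i in range(1, len(u) - 1):
        u_new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return u_new

if __name__ == "__main__":
    u = [0.0] * 50
    u[25] = 1.0                        # initial heat spike in the middle
    for _ in range(100):
        u = explicit_step(u)
    print("centre value after 100 steps:", round(u[25], 4))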
Applications and Grid Portals

Grid portals are an important part of the process of grid-enabling, composing, manipulating, running, and monitoring applications. Even after all the lower layers of the grid-enabling process have been completed (as described in the previous paragraphs), the user is often still exposed to many details of the grid services and even has to take care of configuring, composing, and provisioning the application and the services 'by hand'. This, however, can be drastically simplified and mostly hidden from the user through a Grid portal: a Web-based portal able to expose Grid services and resources through a browser, allowing users remote, ubiquitous, transparent and secure access to grid services (computers, storage, data, applications, etc.). The main goal of a Grid portal is to hide the details and complexity of the underlying Grid infrastructure from the user in order to improve the usability and utilization of the Grid, greatly simplifying the use of Grid-enabled applications through a user-friendly interface. Grid portals have become popular in research and industry communities. Using Grid portals, computational and data-intensive applications such as genomics, financial modeling, crash test analysis, oil and gas exploration, and many more can be provided over the Web as traditional services. Examples of existing scientific application portals are the GEONgrid (2008) and CHRONOS (2004) portals, which provide a platform for the Earth Science community to study and understand the complex dynamics of Earth systems; the NEESgrid project (2008), which focuses on earthquake engineering research; the BIRN portal (2008), which targets biomedical informatics researchers; and the MyGrid portal (2008), which provides access
to bioinformatics tools running on a back-end Grid infrastructure. As it turns out, scientific portals are usually developed inside specific research projects; as a result, they are specialized for the specific applications and services that satisfy the requirements of that particular research area. In order to rapidly build customized Grid portals in a flexible and modular way, several more generic toolkits and frameworks have been developed. These frameworks are designed to meet the diverse needs and usage models arising from both research and industry. One of these frameworks is EnginFrame, which simplifies the development of highly functional Grid portals exposing computing services that run on a broad range of different computational Grid systems. EnginFrame (Beltrame, 2006) has been adopted by many industrial companies, and by organizations in research and education.
Example: The EnginFrame Portal Environment

EnginFrame (2008) is a Web-based portal technology that enables the access and exploitation of grid-enabled applications and infrastructures. It allows organizations to provide application-oriented computing and data services to both users (via Web browsers) and in-house or ISV applications (via SOAP/WSDL-based Web services), thus hiding the complexity of the underlying Grid infrastructure. Within a company or department, an enterprise portal aggregates and consolidates the services and exposes them to the users through the Web. EnginFrame can be integrated as a Web application in a J2EE-standard application server or as a portlet in a JSR168-compliant portlet container. As a Grid portal framework, EnginFrame offers a wide range of functionalities to IT developers facing the task of providing application-oriented services to end users. EnginFrame's plug-in mechanism allows its set of functionalities and services to be extended easily and dynamically. A plug-in is a self-contained software bundle that encapsulates XML (Extensible Markup Language) service descriptions, custom layout in XSL (Extensible Stylesheet Language), and the scripts or executables involved in the service actions. A flexible authentication delegation offers a wide set of pre-configured authentication mechanisms: OS/NIS/PAM, LDAP, Microsoft Active Directory, MyProxy, Globus, etc.; it can also be extended through the plug-in mechanism. Besides authentication, EnginFrame provides an authorization framework that allows groups of users and Access Control Lists (ACLs) to be defined, and ACLs to be bound to resources, services, service parameters and service results. The Web interface of the services provided by the portal can thus be authorized and tailored to the specific users' roles and access rights. EnginFrame supports a wide variety of compute Grid middleware such as LSF, PBS, Sun Grid Engine, Globus, gLite and others. An XML virtualization layer invokes specific middleware commands and translates results, jobs and Grid resource descriptions into a portable XML format called GridML, which abstracts from the actual underlying Grid technology. For GridML, as for the service description XML, the framework provides pre-built XSLs to translate GridML into HTML. EnginFrame's data management allows data to be browsed and handled on the client side or remotely archived in the Grid, and hosts service working environments in file system areas called spoolers. The EnginFrame architecture is structured into three tiers: Client, Resource, and Server. The Client Tier normally consists of the user's Web browser and provides an easy-to-use interface based on established Web standards like XHTML and JavaScript; it is independent of the specific software and hardware environment used by the end user. When needed, the client tier also provides integration with desktop virtualization technologies like Citrix Metaframe (ICA), VNC, X, and NoMachine NX. The Resource Tier consists of one or more Agents deployed on the back-end Grid infrastructure, whose role is to control
and provide distributed access to the actual computing resources. The Server Tier consists of a server component that provides resource brokering to manage resource activities in the back-end. The EnginFrame server authenticates and authorizes incoming requests from the Web and asks an Agent to execute the required actions. Agents can perform different kinds of actions, ranging from the execution of a simple command on the underlying operating system to the submission of a job to the Grid. The results of the executed action are gathered by the Agent and sent back to the Server, which applies post-processing transformations, filters the output according to the ACLs, and transforms the results into a format suitable to the nature of the client: HTML for Web browsers and XML in a SOAP message for Web services client applications.
CASE STUDY: APPLICATIONS ON THE DEISA INFRASTRUCTURE

As one example, in the following we discuss DEISA, the Distributed European Infrastructure for Supercomputing Applications. DEISA (2008) is different from many other Grid initiatives which aim at building a general-purpose grid infrastructure and therefore have to cope with many (almost) insurmountable barriers such as complexity, resource sharing, crossing administrative (and even national) domains, handling IP and legal issues, dealing with sensitive data, working on interoperability, and the need to expose every little detail of the underlying infrastructure services to the grid application user. DEISA avoids most of these barriers by staying very focused: its main goal is to provide the European supercomputer user with a flexible, dynamic, user-friendly supercomputing ecosystem (one could say a Supercomputing Cloud, see the next section) for easily handling, submitting, and monitoring long-running jobs on the best-suited and least-loaded supercomputer(s) in Europe. In addition, DEISA offers application-enabling support. For a similar European-funded initiative focusing especially on enterprise applications, we refer the reader to the BEinGRID project (2008), which consists of 18 so-called business experiments, each dealing with a pilot application that addresses a concrete business case and is represented by an end user, a service provider, and a Grid service integrator. The experiments come from key business sectors such as multimedia, finance, engineering, chemistry, gaming, environmental science, and logistics, and are based on different Grid middleware solutions (BEinGRID, 2008).
The DEISA Project

DEISA is the Distributed European Infrastructure for Supercomputing Applications, funded by the EU in Framework Program 6 (DEISA1, 2004 – 2008) and Framework Program 7 (DEISA2, 2008 – 2011). The DEISA Consortium consists of 11 partners, MPG-RZG (Germany, consortium lead), BSC (Spain), CINECA (Italy), CSC (Finland), ECMWF (UK), EPCC (UK), FZJ (Germany), HLRS (Germany), IDRIS (France), LRZ (Germany), and SARA (Netherlands), and 3 associated partners, KTH (Sweden), CSCS (Switzerland), and JSCC (Russia). DEISA develops and supports a distributed high performance computing infrastructure and a collaborative environment for capability computing and data management. The resulting infrastructure enables the operation of a powerful Supercomputing Grid built on top of national supercomputing services, facilitating Europe's ability to undertake world-leading computational science research. DEISA is certainly instrumental for advancing computational sciences in scientific and industrial disciplines within
Europe and is paving the way towards the deployment of a cooperative European HPC ecosystem. The existing infrastructure is based on the coupling of eleven leading national supercomputing centers, using dedicated network interconnections (currently 10 Gb/s) of GÉANT2 and the NRENs. DEISA2 develops activities and services relevant for applications enabling, operation, and technologies, as these are indispensable for the effective support of computational sciences in the area of supercomputing. The service provisioning model is extended from one that supports a single project (in DEISA1) to one supporting Virtual European Communities (now in DEISA2). Collaborative activities will be carried out with new European and other international initiatives. Of strategic importance is the cooperation with the PRACE (2008) initiative, which is preparing for the installation of a limited number of leadership-class Tier-0 supercomputers in Europe.
The DEISA Infrastructure Services

The essential services to operate the infrastructure and support its efficient usage are organized in the three Service Activities Operations, Technologies, and Applications.

Operations refers to operating the infrastructure, including all existing services, adopting approved new services from the Technologies activity, and advancing the operation of the DEISA HPC infrastructure towards a turnkey solution for the future European HPC ecosystem by improving the operational model and integrating new sites.

Technologies covers monitoring of technologies in use in the project, identifying and selecting technologies of relevance for the project, evaluating technologies for pre-production deployment, and planning and designing specific sub-infrastructures to upgrade existing services or deliver new ones based on approved technologies. User-friendly access to the DEISA Supercomputing Grid is provided by the DEISA Services for Heterogeneous management Layer (DESHL, 2008) and the Uniform Interface to Computing Resources (UNICORE, 2008).

Applications covers the areas of applications enabling and extreme computing projects, environment- and user-related application support, and benchmarking. Applications enabling focuses on enhancing scientific applications from the DEISA Extreme Computing Initiative (DECI), Virtual Communities and EU projects. Environment- and user-related application support addresses the maintenance and improvement of the DEISA application environment and interfaces, and DEISA-wide user support in the applications area. Benchmarking refers to the provision and maintenance of a European Benchmark Suite for supercomputers.

In DEISA2, two Joint Research Activities (JRAs) complement the portfolio of service activities. JRA1 (Integrated DEISA Development Environment) aims at an integrated environment for scientific application development, based on a software infrastructure for tools integration, which provides a common user interface across multiple computing platforms. JRA2 (Enhancing Scalability) aims at enabling supercomputer applications to exploit current and future supercomputers efficiently, in order to cope with a production infrastructure characterized by aggressive parallelism on heterogeneous HPC architectures at a European scale.
DECI – DEISA Extreme Computing Initiative for Supercomputing Applications

The DEISA Extreme Computing Initiative (DECI, 2008) was launched in May 2005 by the DEISA Consortium as a way to enhance its impact on science and technology. The main purpose
of this initiative is to enable a number of "grand challenge" applications in all areas of science and technology. These leading, ground-breaking applications must deal with complex, demanding and innovative simulations that would not be possible without the DEISA infrastructure, and which benefit from the exceptional resources provided by the Consortium. The DEISA applications are expected to have requirements that cannot be fulfilled by the national services alone. In DEISA2, the single-project-oriented activities (DECI) will be qualitatively extended towards persistent support of Virtual Science Communities. This extended initiative will benefit from and build on the experiences of the DEISA scientific Joint Research Activities, where selected computing needs of various scientific communities and a pilot industry partner were addressed. Examples of structured science communities with which close relationships are planned to be established are EFDA and the European climate community. DEISA2 will provide a computational platform for them, offering integration via distributed services and web applications, as well as managing data repositories.
Applications Adapted to the DEISA Grid Infrastructure

In the following, we describe examples of application profiles and use cases that are well suited for the DEISA supercomputing Grid and that can benefit from the computational resources made available by the DEISA Extreme Computing Initiative (DECI).

International collaboration. Scientific teams that access the nodes of the AIX super-cluster in different countries can benefit from a common data repository and a unique, integrated programming and production environment (via common global file systems). Imagine, for example, that team A in France and team B in Germany have allocated resources at IDRIS in Paris and FZJ in Juelich, respectively. They can benefit from a shared directory in the distributed super-cluster, and for all practical purposes it looks as if they were accessing a single supercomputer.

Extreme computing demands of a challenging project requiring a dominant fraction of a single supercomputer. Rather than spreading a huge, tightly coupled parallel application over two or more supercomputers, DEISA can organize the management of its distributed resource pool such that it is possible to allocate a substantial fraction of a single supercomputer to this project, which is obviously more efficient than splitting the application and distributing it over several supercomputers.

Workflow applications involving at least two different HPC platforms. Workflow applications are simulations where several independent codes act successively on a stream of data, the output of one code being the input of the next one in the chain. Often, this chain of computations is more efficient if each code runs on the best-suited HPC platform (e.g. scalar, vector, or parallel supercomputers), where it develops the best performance. Support for these applications via UNICORE (2008), which allows the whole simulation chain to be treated as a single job, is one of the strengths of the DEISA Grid.

Coupled applications involving more than one platform. In some cases, it does make sense to spread a complex application over several computing platforms. This is the case for multi-physics, multi-scale application codes involving several computing modules, each dealing with one particular physical phenomenon, which only need to exchange a moderate amount of data in real time. DEISA has already developed a few applications of this kind, and is ready to consider new ones, providing substantial support for their development. This activity is more prospective, because systematic production runs of coupled applications require a co-allocation service which is currently being implemented.
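To illustrate the chained "output of one code is the input of the next" pattern from the workflow use case above, here is a minimal, self-contained Python sketch; the three stage functions are placeholders for the real pre-processing, solver, and post-processing codes that a workflow system such as UNICORE would route to the best-suited platforms within a single job.

# Workflow-chain sketch: three stages, each consuming its predecessor's output
# file (the stage functions stand in for real pre-/main-/post-processing codes).

def preprocess(infile, outfile):
    with open(infile) as f:
        data = [float(x) for x in f]
    with open(outfile, "w") as f:
        f.write("\n".join(str(x * 2.0) for x in data))

def solve(infile, outfile):
    with open(infile) as f:
        total = sum(float(x) for x in f)
    with open(outfile, "w") as f:
        f.write(str(total))

def postprocess(infile, outfile):
    with open(infile) as f, open(outfile, "w") as g:
        g.write(f"final result: {f.read()}\n")

if __name__ == "__main__":
    with open("raw.dat", "w") as f:
        f.write("\n".join(str(i) for i in range(10)))
    preprocess("raw.dat", "prep.dat")        # stage 1, e.g. on a scalar platform
    solve("prep.dat", "solved.dat")          # stage 2, e.g. on a parallel platform
    postprocess("solved.dat", "report.txt")  # stage 3, e.g. on a post-processing node
    with open("report.txt") as f:
        print(f.read().strip())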
APPLICATIONS IN THE CLOUD

For several years now, driven by the increasing demand for higher performance, efficiency, productivity, agility, and lower cost, Information and Communication Technologies (ICT) have been changing dramatically from static silos with manually managed resources and applications towards dynamic virtual environments with automated and shared services, i.e. from silo-oriented to service-oriented architectures. With sciences and businesses turning global and competitive, applications, products and services becoming more complex, and research and development teams being distributed, ICT is in transition again. Global challenges require global approaches: on the horizon, so-called virtual organizations and partner grids will provide the necessary communication and collaboration platform, with grid portals for secure access to resources, applications, data, and collaboratories. One component which will certainly foster this next-generation scenario is Cloud Computing, as recently offered by companies like Sun (2006) with Network.com, IBM (2008), Amazon (2007) with the Elastic Compute Cloud, and Google (2008) with the App Engine (see also Google Groups, 2008, and CloudCamp, 2008), and many more in the near future. Clouds will become important dynamic components of research and enterprise grids, adding a new 'external' dimension of flexibility by enhancing their 'home' resource capacity whenever needed, on demand. Existing businesses will use them for their peak demands and for new projects, service providers will host their applications on them and provide Software as a Service (SaaS), start-ups will integrate them in their offerings without the need to buy resources upfront, and setting up new Web 2.0 communities will become very easy. Cloud-enabling applications will follow strategies similar to grid-enabling, as discussed in the previous sections. Similarly challenging as with Grids, though, are the cultural, mental, legal, and political aspects in the Cloud context. Building trust and reputation among the users and the providers will help in some scenarios, but it is currently difficult to imagine that users will easily entrust their corporate core assets and sensitive data to Cloud service providers. Today (in October 2008) the status of Clouds seems similar to the status of Grids in the early 2000s: a few simple and well-suited application scenarios run on Clouds, but by far most of the more complex and demanding applications in research and enterprises will face many barriers on Clouds which still have to be removed, one by one. One example of an early innovative Cloud system came from Sun, which truly built its SunGrid (2005) from scratch, based on the vision that the network is the computer. As with other early technologies in the past, Sun paid a high price for being first and doing all the experiments and the evangelization, but its reputation as an innovator is here to stay. Its successor, Sun Network.com (2008), is very popular among its few die-hard clients. This is because of an easy-to-use technology (Grid Engine, Jini, JavaSpaces), but especially because of its innovative early users, such as CDO2 (2008), and because of the instant support users get from the Sun team. A similar promising example in the future might be DEISA, the Distributed European Infrastructure for Supercomputing Applications, with its DEISA Extreme Computing Initiative (DECI).
Why is DECI currently so successful in offering millions of supercomputing cycles to the European e-Science community and helping scientists gain new scientific insights? Several reasons, in my opinion: because DEISA has a very targeted focus on specific (long-running) supercomputing applications, and most of the applications just run on one best-suited system; because of its user-friendly access, through technology like DESHL (2008) and UNICORE (2008); because it stays away from the more ambitious general-purpose Grid efforts; because of its coordinating function, which leaves the consortium partners (the European supercomputer centers) fully independent; and, similar to Network.com, because of
ATASKF (DECI, 2008), the application task force of application experts who help users port their applications to the DEISA infrastructure. If all this is here to stay, and the (currently funded) activities are taken over by the individual supercomputer centers, DEISA will have a good chance to exist for a long time, even after the funding runs dry. And then we might end up with a DEISA Cloud which becomes an (external) HPC node within your Grid application workflow. With this sea change ahead of us, it will remain strategically important for sciences and businesses to support the work of the Open Grid Forum (OGF, 2008), because only standards will make it possible to build e-infrastructures and grid-enabled applications easily from different components and to transition towards an agile platform for federated services. Standards developed in the OGF guarantee, up front, the interoperation of the components best suited for your applications, thus reducing dependency on proprietary building blocks, keeping cost under control, and increasing research and business flexibility.
CONCLUSION: 10 RULES FOR BUILDING A SUSTAINABLE GRID FOR SUSTAINABLE APPLICATIONS

Sustainable grid-enabled applications require sustainable grid infrastructures. It does not make any sense, for example, in a three-year funded Grid project, to develop or port a complex application to a Grid which will be shut down after the project ends. We have to make sure that we are able to build sustainable grid infrastructures which will last for a long time. Therefore, in the following, the author offers 'his' 10 rules for building a sustainable grid, available also from the OGF Thought Leadership Series (Gentzsch, 2008). These rules are derived mainly from four sources: my research on major grid projects published in a RENCI report (Gentzsch, 2007a), the e-IRG Workshop on "A Sustainable Grid Infrastructure for Europe" (Gentzsch, 2007b), the 2nd International Workshop on Campus and Community Grids at OGF20 in Manchester (McGinnis, 2007), and my personal experience with coordinating the German D-Grid Initiative (D-Grid, 2008). The rules presented here are mainly non-technical, because I believe most of the challenges in building and operating a grid are in the form of cultural, legal and regulatory barriers.

Rule 1: Identify your specific benefits. Your first thought should be about your users and your organization. What's in it for them? Identify the benefits which fit best: transparent access to and better utilization of resources; almost infinite compute and storage capacity; flexibility, adaptability and automation through dynamic and concerted interoperation of networked resources; cost reduction through a utility model; shorter time-to-market because of more simulations at the same time on the grid. Grid technology helps to adjust an enterprise's IT architecture to real business requirements (and not vice versa). For example, global companies will be able to decompose their highly complex processes into modular components of a workflow which can be distributed around the globe such that on-demand availability of and access to a suitable workforce and resources are assured, productivity is increased, and cost is reduced. The application of grid technology in these processes guarantees seamless integration of and communication among all distributed components and provides transparent and secure access to sensitive company information and other proprietary assets, world-wide. Grid computing is especially of great benefit for those research and business groups which cannot afford expensive IT resources. It enables engineers to remotely access any IT resource as a utility, and to simulate any process and any product (and product life cycle) before it is built, resulting in higher quality, increased functionality, and cost and risk reduction.
Rule 2: Evangelize your decision makers first. They give you the money and authority for your grid project. The more they know about the project and the more they believe in it (and in you), the more money and time you will get, and the easier it becomes to lead and motivate your team and to get things done. Present a business case: current deficiencies, the specific benefits of the grid (see Rule #1), how much it will cost and how much it will return, etc. The decision makers might also have to modify existing policies, top-down, to make it easier for users (and providers) to cope with the challenges of the new services and to accept and use them. For example, why would a researcher (or a department in an enterprise) stop buying computers when money continues to be allocated for buying them? Such a policy should be changed to support a utility model instead of an ownership model. If you are building a national grid, for example, convincing your government to modify its research funding model is a tough task.

Rule 3: Don't re-invent wheels. In the early grid days, many grid projects tried to develop the whole software stack themselves: from the middleware layer, to the software tools, to grid-enabling the applications, to the portal and Web layer... and got troubled by the next technology change. Today, so many grid technologies, products and projects exist that you should start by looking for similar projects, select your favorite (successful) ones which best fit your users' needs, and 'copy' what they have built; that will be your prototype. Then you might still have some time and money left to optimize it so it fully matches the requirements of your users. Consider, however, that all grids are different. For example, research grids are mainly about sharing (e.g. sharing resources, knowledge, data), while commercial enterprise grids are about cost and revenue (e.g. TCO, ROI, productivity). Therefore, if your community is academic, look for academic use cases; if it is commercial, look for commercial use cases in your respective business field.

Rule 4: KISS (Keep It Simple and Stupid). It took your users years to get acquainted with their current working environment and tools. Ideally, you won't change that. Try hard to stick with what they have and how they do things. Plan for an incremental approach and lots of time listening and talking. Social effects dominate in grids. Join forces with the system people to change or modify mainly the lower layers of the architecture. Your users are your customers; they are king. Differentiate between two groups of users: the end users, who are designing and developing the products (or the research results) which account for all the earnings of your company (or the reputation, and therefore funding, of your research institute), and the system experts, who are eager to support the end users with the best possible services. You can only succeed if you demonstrate a handful of clear benefits to these two user groups.

Rule 5: Evolution, not revolution. As the saying goes, "never change a running system". We all hate changes in our daily lives, except when we are sure that things will drastically improve. Your users and their applications deeply depend on a reliable infrastructure. So, whenever you have to change the user layer in particular, change it only in small steps and in long time cycles. Start with enhancing existing service models moderately, and first test suitable utility models as pilots.
Very importantly, part of your business plan has to be an excellent training and communications strategy.

Rule 6: Establish a governance structure. Define clear responsibilities and dependencies for specific tasks, duties and people during and after the project. An advisory board should include representatives of your end users as well as application and system experts. In the case of more complex projects, e.g. consisting of an integration project and several application or community projects, an efficient management board should lead and steer coordination and collaboration among the projects and the working groups. The management board (Steering Committee) should consist of the leaders of the sub-projects. Regular face-to-face meetings are very important.

Rule 7: Money, money, money. Don't have unrealistic expectations that grid computing will save you
money initially. In their early stage, grid projects need enough funding to get over the early-adopter phase into a mature state with a rock-solid grid infrastructure, such that other user communities can join easily. In research grids, for example, we currently estimate this funding phase to be in the order of 3-5 years, with more funding in the beginning for the grid infrastructure, and later more funding for the application communities. In larger (e.g. global) research grids, funding must cover Teams or Centers of Excellence for building, managing and operating the grid infrastructure, and for middleware tools, application support, and training. Also, today's funding models in research and education are often project-based and thus not ready for a utility approach where resource usage is paid for on a pay-as-you-go basis. Old funding models first have to be adjusted accordingly before a utility model can be introduced successfully. For example, today's existing government funding models are often counter-productive when establishing new and efficient forms of utility services (see Rule #2). In the long run, grid computing will save you money through a much more efficient, flexible and productive infrastructure.

Rule 8: Secure some funding for after the end of the project. Continuity, especially for maintenance and support, is extremely important for the sustainability of your grid infrastructure. Make sure at the beginning of your project that additional funding will be available after the end of the project, to guarantee service and support and continuous improvement and adjustment of the infrastructure.

Rule 9: Try not to grid-enable your applications in the first place. Adjusting your application to changing technologies costs a lot of effort and money, and takes a lot of your precious time. Did you macro-assemble, vectorize, multitask, parallelize, or multithread your application yourself in the past? Then grid-enabling that code is relatively easy, as we have seen in this chapter. But doing this from scratch is not what the user should do. It is better to use the money to buy (lease, rent, subscribe to) software as a service, or to hire a few consultants who grid-enable your application and/or (even better) help you enable your grid architecture to dynamically cope with the applications and user requirements (instead of vice versa). Today, in grids, we are looking more at chunks of independent jobs (or chunks of transactions), and we let our schedulers and brokers decide how to distribute these chunks onto the best-suited and least-loaded servers in the grid, or let the servers themselves decide to share the chunks with their neighbors automatically whenever they become overloaded.

Rule 10: Adopt a 'human' business model. Don't invent new business models; this usually increases the risk of failure. Learn from the business models we have with our other service infrastructures: water, gas, telephony, electricity, mass transportation, the Internet, and the World Wide Web. Despite this wide variety of areas, there is only a handful of successful business models: on one end of the spectrum, you pay the total price, and the whole thing is yours. Or you pay only a share of it, but pay the other share on a per-usage basis. Or you rent everything, and pay chunks back on a regular basis, like a subscription fee or leasing. Or you pay just for what you use. Sometimes, however, there are 'hidden' or secondary applications. For example, electrical power alone doesn't help.
It’s only useful if it generates something, e.g. light, or heat, or cold, etc. And this infrastructure is what creates a whole new industry of new appliances: light bulbs, heaters, refrigerators, etc. Back to grids: providing the right (transparent) infrastructure (services) and the right (simple) business model will most certainly create a new set of services which most probably will improve our quality of life in the future.
REFERENCES

Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludascher, B., & Mock, S. (2004). Kepler: An extensible system for design and execution of scientific workflows. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM), Santorini Island, Greece.

Amazon Elastic Compute Cloud. (2007). Retrieved from www.amazon.com/ec2

Badia, R. M., Labarta, J. S., Sirvent, R. L., Perez, J. M., Cela, J. M., & Grima, R. (2003). Programming grid applications with GRID superscalar. Journal of Grid Computing, 1, 151–170. doi:10.1023/B:GRID.0000024072.93701.f3

Baker, S. (2007). Google and the wisdom of clouds. Business Week, Dec. 13. Retrieved from www.businessweek.com/magazine/content/07_52/b4064048925836.htm

BEinGRID. (2008). Business experiments in grids. Retrieved from www.beingrid.com

Beltrame, F., Maggi, P., Melato, M., Molinari, E., Sisto, R., & Torterolo, L. (2006, February 2-3). SRB data grid and compute grid integration via the EnginFrame grid portal. In Proceedings of the 1st SRB Workshop, San Diego, CA. Retrieved from www.sdsc.edu/srb/Workshop/SRB-handout-v2.pdf

BIRN. (2008). Biomedical informatics research network. Retrieved from www.nbirn.net/index.shtm

Buyya, R., Abramson, D., & Giddy, J. (2000). Nimrod/G: An architecture for a resource management and scheduling system in a global computational grid. In Proceedings of the 4th International Conference on High Performance Computing in the Asia-Pacific Region. Retrieved from www.csse.monash.edu.au/~davida/nimrod/nimrodg.htm

CDO2. (2008). CDOSheet for pricing and risk analysis. Retrieved from www.cdo2.com

Chaubal, Ch. (2003). Sun grid engine, enterprise edition: Software configuration guidelines and use cases. Sun Blueprints. Retrieved from www.sun.com/blueprints/0703/817-3179.pdf

CloudCamp. (2008). Retrieved from http://www.cloudcamp.com/

D-Grid. (2008). Retrieved from www.d-grid.de/index.php?id=1&L=1

DECI. (2008). DEISA extreme computing initiative. Retrieved from www.deisa.eu/science/deci

DEISA. (2008). Distributed European infrastructure for supercomputing applications. Retrieved from www.deisa.eu

DESHL. (2008). DEISA services for heterogeneous management layer. Retrieved from http://forge.nesc.ac.uk/projects/deisa-jra7/

Dongarra, J., Foster, I., Fox, G., Gropp, W., Kennedy, K., Torczon, L., & White, A. (2003). Sourcebook of parallel computing. San Francisco: Morgan Kaufmann Publishers.

EnginFrame. (2008). Grid and cloud portal. Retrieved from www.nice-italy.com

Foster, I. (2000). Internet computing and the emerging grid. Nature. Retrieved from www.nature.com/nature/webmatters/grid/grid.html
Foster, I. (2002). What is the Grid? A three point checklist. Retrieved from http://www-fp.mcs.anl. gov/~foster/Articles/WhatIsTheGrid.pdf Foster, I. Kesselman, & C., Tuecke, S. (2002). The anatomy of the Grid: Enabling scalable virtual organizations. Retrieved from www.globus.org/alliance/publications/papers/anatomy.pdf Foster, I., & Kesselman, C. (Eds.). (1999). The Grid: Blueprint for a new computing infrastructure. San Francisco: Morgan Kaufmann Publishers. Foster, I., & Kesselman, C. (Eds.). (2004). The Grid 2: Blueprint for a new computing infrastructure. San Francisco: Morgan Kaufmann Publishers. Fox, G., Williams, R., & Messina, P. (1994). Parallel computing works! San Francisco: Morgan Kaufmann Publishers. Frey, J., Mori, T., Nick, J., Smith, C., Snelling, D., Srinivasan, L., & Unger, J. (2005). The open grid services architecture, Version 1.0. Retrieved from www.ggf.org/ggf_areas_architecture.htm GAT. (2005). Grid application toolkit. www.gridlab.org/WorkPackages/wp-1/ Gentzsch, W. (2002). Response to Ian Foster’s “What is the Grid?” GRIDtoday, August 5. Retrieved from www.gridtoday.com/02/0805/100191.html Gentzsch, W. (2004). Grid computing adoption in research and industry. In A. Abbas (Ed.), Grid computing: A practical guide to technology and applications (pp. 309 – 340). Florence, KY: Charles River Media Publishers. Gentzsch, W. (2004). Enterprise resource management: Applications in research and industry. In I. Foster & C. Kesselman (Eds.), The Grid 2: Blueprint for a new computing infrastructure (pp. 157 – 166). San Francisco: Morgan Kaufmann Publishers. Gentzsch, W. (2007a). Grid initiatives: Lessons learned and recommendations. RENCI Report. Retrieved from www.renci.org/publications/reports.php Gentzsch, W. (Ed.). (2007b). A sustainable Grid infrastructure for Europe, Executive Summary of the e-IRG Open Workshop on e-Infrastructures, Heidelberg, Germany. Retrieved from www.e-irg.org/ meetings/2007-DE/workshop.html Gentzsch (2008). Top 10 rules for building a sustainable Grid. In Grid thought leadership series. Retrieved from www.ogf.org/TLS/?id=1 GEONgrid. (2008). Retrieved from www.geongrid.org Goodale, T., Jha, S., Kaiser, H., Kielmann, T., Kleijer, P., Merzky, A., et al. (2008). A simple API for Grid applications (SAGA). Grid Forum Document GFD.90. Open Grid Forum. Retrieved from www. ogf.org/documents/GFD.90.pdf Google (2008). Google App Engine. Retrieved from http://code.google.com/appengine/ Google Groups. (2008). Cloud computing. Retrieved from http://groups.google.ca/group/cloud-computing
Grid Engine. (2001). Open source project. Retrieved from http://gridengine.sunsource.net/ GridSphere (2008). Retrieved from www.gridsphere.org/gridsphere/gridsphere GridWay. (2008). Metascheduling technologies for the grid. Retrieved from www.gridway.org/ Gustafson, J. (1988). Reevaluating Amdahl’s law. Communications of the ACM, 31, 532–533. doi:10.1145/42411.42415 Jacob, B., Ferreira, L., Bieberstein, N., Gilzean, C., Girard, J.-Y., Strachowski, R., & Yu, S. (2003). Enabling applications for Grid computing with Globus. IBM Redbook. Retrieved from www.redbooks. ibm.com/abstracts/sg246936.html?Open Jha, S., Kaiser, H., El Khamra, Y., & Weidner, O. (2007, Dec. 10-13). Design and implementation of network performance aware applications using SAGA and Cactus. 3rd IEEE Conference on eScience and Grid Computing, (pp. 143- 150). Bangalore, India. Karonis, N. T., Toonen, B., & Foster, I. (2003). MPICH-G2: A Grid-enabled implementation of the message passing interface. [JPDC]. Journal of Parallel and Distributed Computing, 63, 551–563. doi:10.1016/S0743-7315(03)00002-9 Lee, C. (2003). Grid programming models: Current tools, issues and directions. In G. F. Fran Berman, T. Hey, (Eds.), Grid computing (pp. 555–578). New York: Wiley Press. Luther, A., Buyya, R., Ranjan, R., & Venugopal, S. (2005). Peer-to-peer grid computing and a. NETbased alchemi framework. high performance computing: Paradigm and Infrastructure. In M. Guo, (Ed.). New York: Wiley Press. Retrieved from www.alchemi.net McGinnis, L., Wallom, D., & Gentzsch, W. (Eds.). (2007). 2nd International Workshop on Campus and Community Grids. retrieved from http://forge.gridforum.org/sf/go/doc14617?nav=1 MyGrid. (2008). Retrieved from www.mygrid.org.uk NEESgrid. (2008). Retrieved from www.nees.org/ Neuroth, H., Kerzel, M., & Gentzsch, W. (Eds.). (2007). German Grid Initiative D-Grid. Göttingen, Germany: Universitätsverlag Göttingen Publishers. Retrieved from www.d-grid.de/index.php?id=4&L=1 OGF. (2008). Open Grid Forum. Retrieved from www.ogf.org P-GRADE. (2003). Parallel grid run-time and application development environment. Retrieved from www.lpds.sztaki.hu/pgrade/ Perez, J.M., Bellens, P., Badia, R.M., & Labarta, J. (2007, August). CellSs: Programming the Cell/ B.E. made easier. IBM Journal of R&D, 51(5). Portal, C. H. R. O. N. O. S. (2004). Retrieved from http://portal.chronos.org/gridsphere/gridsphere PRACE. (2008). Partnership for advanced computing in Europe. Retrieved from www.prace-project. eu/
Proactive (2005). Proactive manual REVISED 2.2., Proactive, INRIA. Retrieved from http://www-sop. inria.fr/oasis/Proactive/ Saara Väärtö, S. (Ed.). (2008). Advancing science in Europe. DEISA – Distributed European Infrastructure for Supercomputing Applications. EU FP6 Project. Retrieved from www.deisa.eu/press/DEISAAdvancingScienceInEurope.pdf SAGA. (2006). SAGA implementation home page Retrieved from http://fortytwo.cct.lsu. edu:8000/SAGA Seymour, K., Nakada, H., Matsuoka, S., Dongarra, J., Lee, C., & Casanova, H. (2002). Overview of GridRPC: A remote procedure call API for Grid computing. In Proceedings of the Third International Workshop on Grid Computing, Baltimore, MD (LNCS 2536, pp. 274–278). Berlin: Springer. SIMDAT. (2008). Grids for industrial product development. Retrieved from www.scai.fraunhofer.de/ about_simdat.html Soh, H., Shazia Haque, S., Liao, W., & Buyya, R. (2006). Grid programming models and environments. In Yuan-Shun Dai, et al. (Eds.) Advanced parallel and distributed computing (pp. 141–173). Hauppauge, NY: Nova Science Publishers. Sun Network. com (2008). Retrieved from www.network.com/ SunGrid. (2005). Sun utility computing. Retrieved from www.sun.com/service/sungrid/ SURA Southeastern Universities Research Association. (2007). The Grid technology cookbook: Programming concepts and challenges. Retrieved from www.sura.org/cookbook/gtcb/ TAVERNA. (2008). The Taverna Workbench 1.7. Retrieved from http://taverna.sourceforge.net/ TRIANA. (2003). The Triana Project. Retrieved from www.trianacode.org/ UNICORE. (2008). UNiform Interface to COmputing Resources. Retrieved from www.unicore.eu/ Venugopal, S., Buyya, R., & Winton, L. (2004). A grid service broker for scheduling distributed dataoriented applications on global grids. Proceedings of the 2nd workshop on Middleware for grid computing, Toronto, Canada, (pp. 75–80). Retrieved from www.Gridbus.org/broker
KEY TERMS AND DEFINITIONS
Cloud Computing: Computing paradigm focusing on provisioning of metered services related to the use of hardware, software platforms, and applications, billed on a pay-per-use basis, and pushed by vendors such as Amazon, Google, Microsoft, Salesforce, Sun, and others. Accordingly, there are many different (but similar) definitions (as with Grid Computing).
DECI: The purpose of the DEISA Extreme Computing Initiative (DECI) is to enhance the impact of the DEISA research infrastructure on leading European science and technology. DECI identifies, enables, deploys and operates "flagship" applications in selected areas of science and technology. These leading, ground breaking applications must deal with complex, demanding, innovative simulations that
would not be possible without the DEISA infrastructure, and which would benefit from the exceptional resources of the Consortium. DEISA: The Distributed European Infrastructure for Supercomputing Applications is a consortium of leading national supercomputing centres that currently deploys and operates a persistent, production quality, distributed supercomputing environment with continental scope. The purpose of this EU funded research infrastructure is to enable scientific discovery across a broad spectrum of science and technology, by enhancing and reinforcing European capabilities in the area of high performance computing. This becomes possible through a deep integration of existing national high-end platforms, tightly coupled by a dedicated network and supported by innovative system and grid software. Grid: A service for sharing computer power and data storage capacity over the Internet, unlike the Web which is a service just for sharing information over the Internet. The Grid goes well beyond simple communication between computers, and aims ultimately to turn the global network of computers into one vast computational resource. Today, the Grid is a “work in progress”, with the underlying technology still in a prototype phase, and being developed by hundreds of researchers and software engineers around the world. Open Grid Forum: The Open Grid Forum is a community of users, developers, and vendors leading the global standardisation effort for grid computing. OGF accelerates grid adoption to enable business value and scientific discovery by providing an open forum for grid innovation and developing open standards for grid software interoperability. The work of OGF is carried out through community-initiated working groups, which develop standards and specifications in cooperation with other leading standards organisations, software vendors, and users. The OGF community consists of thousands of individuals in industry and research, representing over 400 organisations in more than 50 countries. Globus Toolkit: A software toolkit designed by the Globus Alliance to provide a set of tools for Grid Computing middleware based on standard grid APIs. Its latest development version, GT4, is based on standards currently being drafted by the Open Grid Forum. Grid Engine: An open source batch-queuing and workload management system. Grid Engine is typically used on a compute farm or compute cluster and is responsible for accepting, scheduling, dispatching, and managing the remote execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, disk space, and software licenses. Grid Portal: A Grid Portal provides a single secure web interface for end-users and administrators to computational resources (computing, storage, network, data, applications) and other services, while hiding the complexity of the underlying hardware and software of the distributed computing environment. An example is the EnginFrame cluster, grid, and cloud portal which for example in DEISA serves as the portal for the Life Science community. OGSA: The Open Grid Services Architecture, describes an architecture for a service-oriented grid computing environment for business and scientific use, developed within the Open Grid Forum. OGSA is based on several Web service technologies, notably WSDL and SOAP. 
Briefly, OGSA is a distributed interaction and computing architecture based around services, assuring interoperability on heterogeneous systems so that different types of resources can communicate and share information. OGSA has been described as a refinement of the emerging Web Services architecture, specifically designed to support Grid requirements.
Web Service: A software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL).
Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialisation in conjunction with other Web-related standards.
UNICORE: The Uniform Interface to Computing Resources offers a ready-to-run Grid system including client and server software. UNICORE makes distributed computing and data resources available in a seamless and secure way in intranets and the internet. The UNICORE project created software that allows users to submit jobs to remote high performance computing resources without having to learn details of the target operating system, data storage conventions and techniques, or administrative policies and procedures at the target site.
Virtual Organization: A group of people with similar interests that primarily interact via communication media such as newsletters, telephone, email, online social networks etc. rather than face to face, for social, professional, educational or other purposes. In Grid Computing, a VO is a group who shares the same computing resources.
ENDNOTE 1
Another version of this chapter was published in the International Journal of Grid and High Performance Computing, Volume 1, Issue 1, edited by Emmanuel Udoh, pp. 55-76, copyright 2009 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 5
Benchmarking Grid Applications for Performance and Scalability Predictions
Radu Prodan, University of Innsbruck, Austria
Farrukh Nadeem, University of Innsbruck, Austria
Thomas Fahringer, University of Innsbruck, Austria
ABSTRACT
Application benchmarks can play a key role in analyzing and predicting the performance and scalability of Grid applications, serve as an evaluation of the fitness of a collection of Grid resources for running a specific application or class of applications (Tsouloupas & Dikaiakos, 2007), and help in implementing performance-aware resource allocation policies of real-time job schedulers. However, application benchmarks have been largely ignored due to diversified types of applications, multi-constrained executions, dynamic Grid behavior, and heavy computational costs. To remedy these, the authors present an approach taken by the ASKALON Grid environment that computes application benchmarks considering variations in the problem size of the application and machine size of the Grid site. Their system dynamically controls the number of benchmarking experiments for individual applications and manages the execution of these experiments on different Grid sites. They present experimental results of their method for three real-world applications in the Austrian Grid environment.
INTRODUCTION
Grid infrastructures provide an opportunity to the scientific and business communities to exploit the powers of heterogeneous resources in multiple administrative domains under a single umbrella (Foster & Kesselman, The Grid: Blueprint for a Future Computing Infrastructure, 2004). Proper characterization
of Grid resources is of key importance in effective mapping and scheduling of the jobs in order to minimize execution time of complex workflows and utilize maximum power of these resources. Benchmarking has been used for many years to characterize a large variety of resources ranging from CPU architectures to file systems, databases, parallel systems, internet infrastructures, or middleware (Dikaiakos, 2007). There have always been issues regarding optimized mapping of jobs to the Grid resources on the basis of available benchmarks (Tirado-Ramos, Tsouloupas, Dikaiakos, & Sloot, 2005). Existing Grid benchmarks (or their combinations) do not suffice to measure/predict application performance and scalability, and give a quantitative comparison of different Grid sites for individual applications while taking into effect variations in the problem size. In addition, there are no integration mechanisms and common units available for existing benchmarks to make meaningful inferences about the performance and scalability of individual Grid applications on different Grid sites. Application benchmarking on the Grid can provide a basis for users and Grid middleware services (like meta-schedulers (Berman, et al., 2005) and resource brokers (Raman, Livny, & Solomon, 1999)) for optimized mapping of jobs to the Grid resources by serving as an evaluation of fitness to compare different computing resources in the Grid. The performance results obtained from real application benchmarking are much more useful for scheduling these applications on a highly distributed Grid infrastructure than the regular resource information provided by the standard Grid information services (Tirado-Ramos, Tsouloupas, Dikaiakos, & Sloot, 2005) (Czajkowski, Fitzgerald, Foster, & Kesselman, 2001). Application benchmarks are also helpful in predicting the performance and scalability of Grid applications, studying the effects of variations in application performance for different problem sizes, and gaining insights into the properties of computing nodes architectures. However, the complexity, heterogeneity, and the dynamic nature of Grids raise serious questions about the overall realization and applicability of application benchmarking. Moreover, diversified types of applications, multi-constrained executions, and heavy computational costs make the problem even harder. Above all, mechanizing the whole process of controlling and managing benchmarking experiments and making benchmarks available to users and Grid services in an easy and flexible fashion makes the problem more challenging. To overcome this situation, we present a three layered Grid application benchmarking system that produces benchmarks for Grid applications taking into effect the variations in problem size and machine size of the Grid sites. Our system provides the necessary support for conducting controlled and reproducible experiments, for computing performance benchmarks accurately, and for comparing and interpreting benchmarking results in the context of application performance and scalability predictions. It takes the specifications of executables, set of problem sizes, pre-execution requirements and the set of available Grid sites in an input in XML format. These XML specifications, along with the available resources are parsed to generate jobs to be submitted to different Grid sites. 
At first, the system completes pre-experiment requirements like the topological order of activities in a workflow, and then runs the experiments according to the experimental strategy. The benchmarks are computed from experimental results and archived in a repository for later use. Performance and scalability prediction and analysis from the benchmarks are available through a graphical user interface and Web Service Resource Framework (WSRF) (Banks, 2006) service interfaces. We do not require complex integration/analysis of measurements, or new metrics for interpretation of benchmarking results. Among our considerations for the design of Grid application benchmarks were conciseness, portability, easy computation and adaptability for different Grid users/services. We have implemented a
prototype of the proposed system as a WSRF service in the context of the ASKALON Grid application development and computing environment (Fahringer, et al., 2006). The rest of the chapter is organized as follows. The next section presents the Grid resource, application, and execution models that serve as foundation for our work. Then, we summarize the requirements of a prediction system, followed by a detailed architecture design. Afterwards we present our experimental design method for benchmarking and prediction of Grid applications. Experimental results that validate our work on real-world applications in a real Grid environment are presented in the second half of this chapter, followed by a related work summary and an outlook into the future work. The last section concludes the chapter.
BACKGROUND In this section we first review the relevant related work in the area of Grid application benchmarking, and then define the general Grid resource, application, and execution models that represent the foundation for our benchmarking and prediction work.
Related Work
There have been several significant efforts that targeted benchmarking of individual Grid resources such as (Hockney & Berry, 1994) (Bailey, et al., 1991) (Dixit, 1991) (Dongarra, Luszczek, & Petitet, 2003). The discussion presented in (Van der Wijngaart & Frumkin, 2004) shows that the configuration, administration, and analysis of NAS Grid Benchmarks requires an extensive manual effort like other benchmarks. Moreover, these benchmarks lack some integration mechanism needed to make meaningful inferences about the performance of different Grid applications. A couple of comprehensive tools like (Tsouloupas & Dikaiakos, 2007) are also available for benchmarking a wide range of Grid resources. These provide easy means of archiving and publishing of results. Likewise, GrenchMark (Iosup & Epema, GRENCHMARK: A Framework for Analyzing, Testing, and Comparing Grids, 2006) is a framework for analyzing, testing, and comparing Grid settings. Its main focus is the generation and submission of synthetic Grid workloads. In contrast, our work focuses on single application benchmarks which are extensively supported. Individual benchmarks have been successfully used for resource allocation (Afgan, Velusamy, & Bangalore, 2005) (Jarvis & Nudd, 2005) and application scheduling (Heymann, Fernandez, Senar, & Salt, 2003). A good work for resource selection is presented in (Jarvis & Nudd, 2005) by building models from resource performance benchmarks and application performance details. Authors in (Afgan, Velusamy, & Bangalore, 2005) present resource filter, resource ranker and resource MakeMatch on the basis of benchmarks, and user provided information. Though this work provides good accuracy, it requires much user intervention during the whole process. Moreover, these benchmarks do not support cross-platform performance translations of different Grid applications while considering variations in problem sizes. A similar work has been presented in (Tirado-Ramos, Tsouloupas, Dikaiakos, & Sloot, 2005). The authors present a tool for resource selection for different applications while considering variations in performance due to different machine sizes. Importance of application-specific benchmarks is
also described by (Seltzer, Krinsky, & Smith, 1999). In this work, the authors present three different methodologies to benchmark Grid applications by modeling application and Grid site information and require much manual intervention. The distinctive part of our work is that we focus on controlling and specifying the total number of experiments needed for benchmarking process. Our proposed benchmarks are flexible regarding variations in machine size as well as problem sizes required for real-time scheduling and application performance prediction. Moreover, we support a semi-automatic benchmarking process. The cross-platform interoperability of our benchmarks allows trade-off analysis and translation of performance information between different platforms.
Grid Resource Model We consider the Grid as an aggregation of heterogeneous Grid sites. A Grid site consists of a number of compute and storage systems that share same local security, network, and resource management policies. Our experimental Grid environment comprises homogeneous parallel computers within a Grid site, including cache coherent Non-Uniform Memory Architectures (ccNUMA), Clusters of Workstations (COW), and Networks of desktop Workstations (NOW). Each parallel computer is utilized as a single computing resource using a local resource management system such as Sun Grid Engine (SGE), Portable Batch System (PBS) or its Maui and Torque derivatives. To simplify the presentation and without losing any generality, we assume in the remainder of the paper that a Grid site is a homogeneous parallel computer. A heterogeneous Grid consists of an aggregation of homogeneous sites.
Grid Workflow Model
The workflow model based on loosely-coupled coordination of atomic activities has emerged as one of the most attractive paradigms in the Grid community for programming Grid applications. Despite this, most existing Grid application development environments provide the application developer with a nontransparent Grid. Commonly, application developers are explicitly involved in tedious tasks such as selecting software components deployed on specific sites, mapping applications onto the Grid, or selecting appropriate computers for their applications. In this section we propose an abstract Grid workflow model that is completely decoupled from the underlying Grid technologies such as Globus toolkit (Foster & Kesselman, Globus: A Metacomputing Infrastructure Toolkit, 1997) or Web services ((W3C), World Wide Web Consortium). We define a workflow as a Directed Acyclic Graph (DAG):

W = (Nodes, C\text{-}edges, D\text{-}edges, IN\text{-}ports, OUT\text{-}ports),

where Nodes is the set of activities, C\text{-}edges = \bigcup_{A_1, A_2 \in Nodes} (A_1, A_2) is the set of control flow dependencies, D\text{-}edges = \bigcup_{A_1, A_2 \in Nodes} (A_1, A_2, D\text{-}port) is the set of data flow dependencies, IN-ports is the set of workflow input data ports, and OUT-ports is the set of output data ports. An activity A \in Nodes is a mapping from a set of input data ports IN\text{-}ports_A to a set of output data ports OUT\text{-}ports_A:

A: IN\text{-}ports_A \rightarrow OUT\text{-}ports_A.
A data port D-port ∈ IN-ports_A × OUT-ports_A is an association between a unique identifier (within the workflow representation) and a well-defined activity type: D-port = (identifier, type). The type of a data port is instantiated by the type system supported by the underlying implementation language, e.g. the XML schema. The most important data type in our experience that shall be supported for Grid workflows is file, alongside other basic types such as integer, float, or string. An activity N ∈ Nodes can be of two kinds:
1. Computational activity or atomic activity represents an atomic unit of computation such as a legacy sequential or parallel application;
2. Composite activity is a generic term for an activity that aggregates multiple (atomic and composite) activities according to one of the following four patterns:
   a. parallel loop activity allows the user to express large-scale workflows consisting of hundreds or thousands of atomic activities in a compact manner;
   b. sequential loop activity defines repetitive computations with possibly unknown number of iterations (e.g. dynamic convergence criteria that depend on the runtime output data port values computed within one iteration);
   c. conditional activity models if and switch-like statements that activate one from its multiple successor activities based on the evaluation of a boolean condition;
   d. workflow activity is introduced for modularity and reuse purposes, and is recursively defined according to this definition.
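To make the abstract model concrete, the following minimal Python sketch shows one possible in-memory representation of activities, data ports, and control/data dependencies. The class and field names are our own illustration and are not part of AGWL or the ASKALON implementation.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)
class DataPort:
    identifier: str          # unique within the workflow representation
    type: str                # e.g. "file", "integer", "float", "string"

@dataclass
class Activity:
    name: str
    in_ports: List[DataPort] = field(default_factory=list)
    out_ports: List[DataPort] = field(default_factory=list)

@dataclass
class Workflow:
    nodes: List[Activity] = field(default_factory=list)
    # control-flow dependencies (A1, A2): A2 may start only after A1
    c_edges: List[Tuple[str, str]] = field(default_factory=list)
    # data-flow dependencies (A1, A2, port): A2 consumes a data port produced by A1
    d_edges: List[Tuple[str, str, str]] = field(default_factory=list)

# Example: a two-activity fragment resembling the WIEN2k LAPW1 -> LAPW2 chain
w = Workflow()
w.nodes = [Activity("LAPW1", out_ports=[DataPort("vector_file", "file")]),
           Activity("LAPW2", in_ports=[DataPort("vector_file", "file")])]
w.c_edges = [("LAPW1", "LAPW2")]
w.d_edges = [("LAPW1", "LAPW2", "vector_file")]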
In the remainder of this paper we will use the terms activity and application interchangeably. In this paper we only deal with the benchmarking and prediction of computational activities, while data transfer prediction has been addressed in related work such as (Wolski, 2003). We designed and implemented our approach within the ASKALON Grid application development and computing environment that allows the specification of workflows according to this model at two levels of abstraction (see Figure 1):
• graphical, based on the standard Unified Modeling Language (UML);
• XML-based using the Abstract Grid Workflow Language (AGWL) which can be automatically generated from the graphical UML representation.
Grid Execution Model
The XML-based AGWL representation of a workflow represents the input to the ASKALON WSRF-based (Banks, 2006) middleware services for execution on the Grid (see Figure 1). To support this requirement transparently, a set of sophisticated services whose functionality is not directly exposed to the end-user is essential:
• Resource Manager (Siddiqui, Villazon, Hoffer, & Fahringer, 2005) is responsible for negotiation, reservation, allocation of resources, and automatic deployment of services required for executing Grid applications. In combination with AGWL, the Resource Manager shields the user from the low-level Grid infrastructure;
• Scheduler (Wieczorek, Prodan, & Fahringer, 2005) determines effective mappings of single or multiple workflows onto the Grid using graph-based heuristics and single or bi-criteria optimization algorithms such as dynamic programming, game theory, or genetic algorithms;
• Performance prediction supports the scheduler with information about the expected execution time of activities on individual Grid resources. The design and implementation of this service is the scope of this paper;
• Enactment Engine (Duan, Prodan, & Fahringer, 2006) targets scalable, reliable and fault-tolerant execution of workflows;
• Data repository is a relational database used by the Enactment Engine to log detailed workflow execution events required for post-mortem analysis and visualization;
• Performance analysis (Prodan & Fahringer, 2008) supports automatic instrumentation and bottleneck detection through online monitoring of a broad set of high-level workflow overheads (over 50), systematically organized in a hierarchy comprising job management, control of parallelism, communication, load imbalance, external load, or other middleware overheads.

Figure 1. The ASKALON Grid application development and computing environment architecture
PREDICTION REQUIREMENTS
The performance of an application is dependent upon a number of inter-related parameters at different levels of the Grid infrastructure (e.g. Grid site, computing nodes, processor architecture, memory hierarchy, I/O, storage node, network (LAN or WAN), network topology), as shown in Figure 2 adapted from (Dikaiakos, 2007). Practically it is almost impossible to characterize the effects of all these individual components to shape the overall performance behavior of different Grid applications. Even benchmarks of different resource components cannot be put together to describe application performance because application properties must also be taken into effect (Hockney & Berry, 1994). In such a case, application performance benchmarks with some mechanism of application performance translation across the heterogeneous Grid sites can help to describe its performance in the Grid. Application benchmarks include effects of different resource components, in particular their combinational varying effects specific to individual applications.

Figure 2. Factors affecting Grid application performance

Our solution to performance prediction is therefore to benchmark scientific applications according to a well-thought experimental design strategy across heterogeneous Grid sites. These benchmarks are flexible to application problem size and number of processors (machine size) and are thus called soft benchmarks. More specifically there is a need for benchmarks, which:
• Represent the performance of Grid applications on different Grid sites;
• Incorporate the individual effects of different Grid resources specific to different applications (like memory, caching, etc.);
• Can be used for performance and scalability predictions of the application;
• Are portable to different platforms (De Roure & Surridge, 2003);
• Are flexible regarding variations in problem and machine sizes;
• Support fast and simplified computation and management;
• Are comprehensively understandable and usable by different users and services.
On the other hand, it is also necessary to address the high cost of Grid benchmarking administration, benchmarking computation, and analysis which requires a comprehensive system with a visualization and analysis component.
ARCHITECTURE DESIGN
The design of our prediction framework illustrated in Figure 3 consists of a set of tools organized in three layers that perform and facilitate the benchmarking process (the benchmarking experiments, computation, and storage of results) in a flexible way, and later publish the results and perform analysis. In the first layer, the application details for benchmarking experiments are specified in an XML-based language, which is parsed by a small compiler that produces the job descriptions in the Globus Resource Specification Language (RSL) (Foster & Kesselman, Globus: A Metacomputing Infrastructure Toolkit, 1997). Later, resource specifications describing where these jobs are to be launched are added to the job descriptions, producing the final jobs used for executing the benchmark experiments. In this layer, the total number of benchmarking experiments for individual applications is controlled with respect to different parameters. In layer 2, the Experiment Execution Engine executes the benchmark experiments on available Grid sites provided by the Resource Manager (Siddiqui, Villazon, Hoffer, & Fahringer, 2005). A Grid site is considered at both the micro-level (the individual Grid nodes) and the macro-level (the entire parallel computer) by taking machine size as a variable in the benchmark measurements. Such application benchmarks therefore incorporate the variations in application performance associated with different problem and machine sizes.
Figure 3. The prediction framework architecture
The monitoring component watches the execution of the benchmarking experiments and alerts the Orchestrator component in layer 3 to collect the data and coordinate the start-up of the Benchmarks Computation component to compute the benchmarks. The Archive component stores this information in the benchmarks repository for future use. The Benchmarks Visualization Browser publishes the benchmarks in a graphical user interface for user analysis, and Information Service component is an interface to other services.
EXPERIMENTAL DESIGN
To support the automatic application execution time prediction, benchmarking experiments need to be made against some experimental design and the generated data be archived automatically for later use. Specifically in our work, the general purpose of the experimental design phase is to set a strategy for generation and execution of a minimum number of benchmarking experiments for an application to support its performance prediction later on. Among others, our key objectives for this phase are to:
• Reduce/minimize training phase time;
• Minimize/eliminate the heavy modeling requirements after the training phase;
• Develop and maintain the efficient scalability of experimental design with respect to the Grid size;
• Make it generalizable to a wide range of applications on heterogeneous Grid sites.
To address these objectives, we formulate our experimental design in the light of the guidelines given by Montgomery (Montgomery, 2004):
1. Recognition of statement of problem: We describe our problem statement as to obtain maximum execution time information of the application at different problem sizes on all heterogeneous Grid sites with different possible machine sizes in a minimum number of experiments;
2. Selection of response variables: In our work the response variable is the execution time of the application;
3. Choice of factors, levels, and ranges: The factors affecting the response variable are the problem size of the application, the Grid size, and the machine size of one parallel computer;
4. Choice/formulation of experimental design: In our experimental design strategy, we minimize first the combinations of Grid size with problem size, and then the combinations of Grid size with machine size. By minimizing the combinations of Grid size with problem size, we minimize the number of experiments against different problem sizes across the heterogeneous Grid sites. Similarly, by minimizing the Grid size combinations with the machine size factor, we minimize the number of experiments against different problem sizes across different numbers of processors. We designed this to eliminate the need for the next two steps presented by Montgomery, called statistical analysis and modeling, and conclusions, to minimize the serving costs on the fly;
5. Performing of experiments: We address performing of experiments under the automatic training phase as described later in Section 0.
Benchmarking Grid Applications for Performance and Scalability Predictions
Experiment Specification To describe application specifications we created a small language called Grid Application Description Language (GADL). A GADL definition specifies the application to be executed, its exact paths retrieved from the Resource Manager, the problem size ranges, and pre-requisites of execution (e.g. predecessor activities within a workflow, environment variables), if any. More precisely, every GADL instance is described by: •
Application name with a set of problems sizes given either as enumerations or as value ranges using a start: end: step pattern:
<parameter> •
Resource manager (Siddiqui, Villazon, Hoffer, & Fahringer, 2005) URI used to retrieve the available Grid sites and location of the application executables implementing the activity types:
•
A set of pre-requisites, comprising the activities which must be finished before the execution (of some components of) the application:
<prerequisites> •
A set of input files required for executing the application:
•
98
An executable needed to change the problem size in some peculiar input files characteristic to scientific applications:
Benchmarking Grid Applications for Performance and Scalability Predictions
<probsizechange> <probsizechange/>
Training Experiments The training or benchmarking phase for an application consists of the experiment executions for different selections of problem and machine sizes on different Grid sites to obtain the corresponding execution times referred as training set or historical data. Automatic performance prediction based on such historical data needs enough amounts of data present in data repository in order to deliverable accurate results. In general, there is a tradeoff between the number of experiment conducted and the accuracy of the prediction. The historical data needs to be generated for every new application ported onto the Grid and/or for every new machine (different from existing machines) added to the Grid. Conducting automatic benchmarking for application execution time predictions on the Grid is a complex problem due to variety of the factors involved. More formally, the automatic training phase benchmarking comprises: •
A set of A activities of different activity types belonging to a workflow W = (Nodes, C-edges, D-edges, IN-ports, OUT-ports); A set of distributed heterogeneous Grid sites; A set of Grid sites or homogeneous parallel computers; A set PA of different workflow problem sizes every workflow activity type A.
• • •
The total number of experiments N produced by this parameter set is: N =
å
ANodes
|site| ö÷ æ PA × ççç å å m ÷÷, çè site Grid m =1 ÷ø
where |PA| denotes the number of problem sizes in the set PA (or the set cardinality) and |sites| the number of processors in a Grid site. The cardinality of Nodes, PA, site, as well as the number of Grid sites and CPU types have a significant effect on the number of experiments and, therefore on the overall duration of the automatic training phase. The goal is to compute a set of experiments such that N is minimized and the prediction accuracy is maximized.
Performance Sharing and Translation Computing the full cross product of the parameters involved in the benchmarking process may lead to a huge number of experiments that cannot be executed exhaustively (or is not necessary). Therefore, controlling the number of experiments is of key importance in the efficiency of the whole benchmarking process. Our focus is to reduce the total number of benchmarking experiments and to maximize the utility of benchmarking results. To reduce the experimental space, we introduce a Performance Sharing and Translation (PST) mechanism based on several multi-parameter performance relativity properties, experimentally observed for
99
Benchmarking Grid Applications for Performance and Scalability Predictions
our case study applications. We normalize the execution times against that of a reference problem size selected by default as the largest problem size to take in effect of inter process communication in the set of problem sizes specified by the user. The normalization mechanism not only makes the performance of different machines comparable, but also provides a basis for translating different performance values across different Grid sites. The normalization of values is based on the observation that for many computeintensive applications, and in particular the embarrassingly pilot parallel applications that scale linearly with the machine and problem sizes and that drive our experimental work, the normalized execution times for different problem and machine sizes are the same on all the Grid sites with 90% accuracy. This allows cross-platform interoperability. For example, the normalized execution on a Grid site g for a certain problem size and machine size will be equal to that of another Grid site h. We define in the following sections the inter- and intra-platform performance relativity properties.
Inter-Platform PST Inter-platform PST specifies that the performance behavior Tg(A,p) of an application A for a problem size p relative to another problem size q on a Grid site g is the same as that of the same problem sizes on another Grid site h: Tg (A, p) Tg (A, q )
»
Th (A, p) Th (A, q )
.
This phenomenon is based on the fact that rate of change in execution time of an application across different problem sizes is preserved on different Grid sites, i.e. the rate of change in execution time of an application A for the problem size p (the target problem size) with respect to another problem size q (the reference problem size) on Grid site g is equal to the rate of change in execution time for the problem size p with respect to the problem size q on Grid site h: DTg (A, p) DTg (A, q )
»
DTh (A, p) DTh (A, q )
.
Intra-Platform PST Similarly, intra-platform PST specifies that the performance behavior of an embarrassingly parallel application A on a Grid site g for a machine size m relative to another machine size n for a problem size p is similar to that for another problem size q: Tg (A, p, m ) Tg (A, p, n )
»
Tg (A, q, m ) Tg (A, q, n )
.
This phenomenon is based on the fact that rate of change in execution time of an application across different problem sizes is preserved for different machine sizes, i.e. the rate of change in execution time of an application for the problem size p and machine size m on Grid site g with respect to that for
100
Benchmarking Grid Applications for Performance and Scalability Predictions
machine size n will be equal to the rate of change in execution time for the problem size q and machine size m with respect to that for a machine size n on the same Grid site: DTg (A, p, m ) DTg (A, p, n )
»
DTg (A, q, m ) DTg (A, q, n )
.
Similarly, the rate of change in executions time of the application across different machine sizes is also preserved for different problem sizes: DTg (A, p, m ) DTg (A, q, m )
»
DTg (A, p, n ) DTg (A, q, n )
.
We use this phenomenon to share execution times for scalability within one Grid site. The accuracy of inter- and intra-platform similarity of normalized behaviors does not depend upon the selection of reference point for embarrassingly parallel applications. However, for parallel applications exploiting inter-process communications during their executions, this accuracy increases as the reference point gets closer to the target point. Usually, the closer the reference point, the greater the similarity (of interprocess communication) it encompasses. Thus, the accuracy increases as the reference problem size gets closer to the target problem size in case of inter-platform PST, respectively the reference problem and machine sizes get closer to the target problem and machine sizes in case of intra-platform PST. More formally, for inter-platform PST: éT (A, p) T (A, p) ù ú = 0, lim êê g - h ú q ® p T (A, q ) T A q ( , ) êë g úû h and similarly for intra-platform PST: éT (A, p, m ) T (A, q, m ) ù ú = 0. lim êê g - g ú n ®m T (A, p, n ) T A q n ( , , ) êë g úû g For normalization from the minimum training set only, we select the maximum problem size (in normal practice of user of the application) and maximum machine size as reference point, to incorporate the maximum effects of inter-process communications in the normalization. The distance between the target point and the reference point for inter- and intra-platform PST on one Grid site is calculated respectively as: 2 ïìï 2 ïï (T (p) - T (q )) + (p - q ) , d =í 2 ïï 2 ïïî (T (m ) - T (n )) + (m - n ) ,
for nearest problem size; for nearest machine size.
101
Benchmarking Grid Applications for Performance and Scalability Predictions
Reduced Experiment Set Our methodology of reducing the number of benchmarking experiments is to enable sharing of the benchmarking information across the Grid sites, as we explained in Section 0. The performance ratios of different Grid sites are different for different applications and they also vary for different problem and machine sizes. First of all, for each application we make one experiment for the reference problem size on each of the non-identical Grid sites. Afterwards, we make a full factorial design of benchmarking experiments on the “fastest” (in terms of processor speed) available Grid site considering the problem and machine size as parameters for that application. We select the largest problem size as the reference problem size whose benchmarks are used to share information across the Grid sites. In the next prediction phase, the process of normalization helps in completing the benchmarks computation for all the Grid sites. For scalability analysis and prediction, one benchmark experiment for each of different machine sizes is also made for the reference problem size. By means of inter-platform PST, the total number of experiments reduces for an activity A with PA problem sizes on a Grid with G sites and an average of M different machine sizes per site from PA ∙ G ∙ M to PA ∙ M + G − 1 and for single parallel computers from PA ∙ M to PA + M − 1. By introducing intraplatform PST, we reduce total number of experiments for parallel machines as Grid sites further to a linear complexity of PA + (M − 1) + (G − 1 Later, we employ prediction mechanism to derive performance values for the problem and machine size combinations that were not effectively benchmarked. We argue that this reduction in the number of performed benchmarks is a reasonable trade-off between duration of the benchmarking process and accuracy. In Section 5, we show experimentally that predictions based on our approach are within 90% accuracy. A similar or better accuracy can be achieved with either more benchmarks, or by using analytical modeling techniques. However, both these alternatives are time-consuming. In addition, analytical modeling requires a separate model and expert knowledge for each new type of application. With current Grid environments hosting hundreds to thousands of different applications (Iosup & Epema, Build-and-Test Workloads for Grid Middleware: Problem, Analysis, and Applications, 2007), analytical modeling for individual application performance and scalability (which requires manual efforts) is impractical, whereas, benchmarking requires only one generic setup.
Experiment Execution Once the set of experiments has been computed, the next phase towards prediction is to execute them according to the experimental design strategy. We employ the opportunistic load balancing algorithm (Schwiegelshohn & Yahyapour, 2000) for scheduling benchmarking experiments in the Grid. The algorithm for automatically conducting benchmarking experiments is shown in Figure 4. We schedule the benchmarking experiments for each workflow activity type in topological order. For every activity type, we make one experiment on every Grid site for one reference problem size and one sequential machine size, which we use later on in the normalization process. Afterwards, we perform the full factorial design of experiments for one processor as machine size. Finally, for scalability predictions we make one experiment for each of the different machine sizes for the reference problem size on the fastest available Grid site. Jobs within one Grid site are submitted in parallel to the local queuing system that executes them according to the local system administration policies.
102
Benchmarking Grid Applications for Performance and Scalability Predictions
Figure 4. The automatic application benchmarking algorithm algorithm benchmark_scheduling; input: W = (Nodes, C-edges, D-edges, IN-ports, OUT-ports); Set of problem sizes: PA, A Nodes; Set of Grid sites; output: TS = execution time set; TS = ; for A Nodes pred(A) Nodes p = reference problem size of A (p PA); for site Grid do in parallel Tsite(A, p) = time(execution of A on site for reference problem size p); TS = TS Tsite(A, p); end for; site = the idle Grid site with the fastest processor CPU; for r PA do in parallel Tsite(A, r, n) = time(execution of A on site for problem size r on n reference processors); TS = TS Tsite(A, r, n); end for; for m = 1 to |site| do in parallel Tsite(A, p, m) = time(execution of A on site for problem size p with machine size m); TS = TS Tsite(A, p, m); end for; Nodes = Nodes – A; end for; return TS; end algorithm.
The algorithm returns the execution times of these experiments which are then archived and later used by Benchmarks Computation component to calculate the benchmarks (see Figure 3).
Background Load Sometimes the background load, that is, the applications run by external users, severely affects the performance of some (or even all) the applications in the system, especially on ccNUMA SMP parallel computers. This happens mostly when several applications contend for the same network or processor shares, or when resource utilization is very high and the resource manager is ineffective (Arpaci-Dusseau, Arpaci-Dusseau, Vahdat, Liu, Anderson, & Patterson, 1995). However, our benchmarking procedure does not take into account the background load, at least for the moment. The reason is threefold. First, our goal is to quantify the best achievable performance of an application on a Grid platform without the contention generated by additional users. Work in (Arpaci-Dusseau, Arpaci-Dusseau, Vahdat, Liu, Anderson, & Patterson, 1995) helps quantifying the ratio between the maximum achievable performance and the performance achieved in practice. Second, work in hotspot or symbiotic scheduling (Snavely & Weinberg, 2006) helps scheduling applications with overlapping resource requirements such that the overlap is minimized. Third, while mechanisms for ensuring the background load on the resources have been proposed (Mohamed & Epema, 2005), a better understanding of the structure and of the impact of the background load is needed. We plan to investigate aspects of this problem in future work.
103
Benchmarking Grid Applications for Performance and Scalability Predictions
Performance and Scalability Predictions The benchmarks are computed from the results of experiments and archived in a data repository for future references. This is done in a manner that facilitates the comparisons between the benchmarks for different Grid sites, problem sizes, and machine sizes, along with the performance and scalability predictions. Benchmarks can be browsed through a graphical user interface (see Figure 5) for application performance and scalability predictions for different problem sizes on different Grid sites. In this section we explain how the benchmarks are used for performance and scalability predictions and Grid site comparisons. The performance of an application A can be predicted for any problem size p on any Grid site g from another Grid site h (for which execution time for problem size p exists) from the benchmarks using the normalization method, as follows: Tg (A, p) =
Th (A, p) Th (A, q )
× Tg (A, q ).
Figure 5. Graphical user interface for application benchmarks and predictions
104
Benchmarking Grid Applications for Performance and Scalability Predictions
where Tg(A, p) represents the execution time of an activity A, for a problem size p, on a Grid site g. Similarly, for scalability analysis and prediction taking machine size as a parameter, the performance of the parallel applications for different number of CPUs can be predicted from the benchmarks as follows: Tg (A, p, m ) =
Tg (A, q, m ) Tg (A, q, n )
× Tg (A, p, n ).
where Tg(A, p, m) represents execution time of an application for problem size p on a Grid site g for a machine size m. For execution time and scalability predictions, normalization is done based on execution time for the closest set of parameters (problem size and machine size). At the start, this is made based on the only common set of parameters in the benchmark repository and later, if some other performance values are available (after adding some experimental values from real runs), calculated based on the closer performance value, as it increases accuracy in the cross platform performance and scalability predictions. For our prediction results we obtained a minimum accuracy of 90% from our proposed number of experiments as we will demonstrate in Section 0.
Grid Site Comparisons Our training benchmarks help facilitating the comparisons of applications’ performance for different values of problem and machine sizes on different Grid sites, as the second key use. This can guide the Grid site selection policies for real time schedulers, resource brokers and different Grid users. Furthermore, these comparisons provide application developers with information about the systems capabilities in terms of application performance, so that they can develop and tune their applications for high-quality implementations.
ExPERIMENTS We have conducted experiments to validate our experimental design method in a heterogeneous subset of the Austrian Grid environment summarized in Table 1.
Workflow Applications We used three real-world workflow applications to validate our method: WIEN2k, MeteoAG, and Invmod, which we describe in the next subsections.
WIEN2k WIEN2k (Schwarz, Blaha, & Madsen, 2002) is a program package for performing electronic structure calculations of solids using density functional theory based on the full potential (linearized) augmented
105
Benchmarking Grid Applications for Performance and Scalability Predictions
Table 1. The Austrian Grid testbed Site Name
Architecture
No. CPUs
Processor Architecture
Gigahertz
RAM [megabytes]
Location
altix1.jku
ccNUMA, SGI Altix 3000
64
Itanium 2
1.6
14000
Linz
altix1.uibk
ccNUMA, SGI Altix 350
16
Itanium 2
1.6
16000
Innsbruck
schafberg
ccNUMA, SGI Altix 350
16
Itanium 2
1.6
14000
Salzburg
agrid1
NOW, Fast Ethernet
20
Pentium 4
1.8
1800
Innsbruck
hydra
COW, Fast Ethernet
16
AMD Athlon
2.0
1600
Linz
hc-ma
NOW, Fast Ethernet
16
AMD Opteron 2.2
2.2
4000
Innsbruck
zid-cc
NOW, Fast Ethernet
22
Intel Xeon
2.2
2000
Innsbruck
karwendel
COW, Infiniband
80
AMD Opteron
2.4
16000
Innsbruck
Figure 6. Simplified WIEN2k workflow representation
plane wave ((L)APW) and local orbital (lo) method. We first ported the application onto the Grid by splitting the monolithic code into several course grain activity types coordinated in a workflow as illustrated in Figure 6. The LAPW1 and LAPW2 activities can be solved in parallel by a fixed number of so called k-points. A final activity called Converged applied on several output files tests whether the problem convergence criterion is fulfilled. The number of sequential loop iterations is statically unknown.
106
Benchmarking Grid Applications for Performance and Scalability Predictions
MeteoAG We designed MeteoAG (Schüller, Qin, Nadeem, Prodan, Fahringer, & Mayr, 2006) as a Grid workflow application for meteorological simulations based on the RAMS (Cotton, et al., 2003) numerical atmospheric model. The simulations produce spatial and temporal fields of heavy precipitation cases over the western part of Austria to resolve most alpine watersheds and thunderstorms. A database of reanalyzed heavy precipitation cases is generated in order to study various aspects of objective analysis algorithms for rain gauge networks and the impact of weather radar on the analysis. Figure 7 illustrates the workflow structure in which a large set of simulation cases modeled as a parallel loop. For each simulation, another nested parallel loop is executed with a different so called akmin parameter value. For each individual akmin value, the activities rams makevfile, rams init, revu compare and raver are processed sequentially. Based on the results of the raver activity, a conditional
Figure 7. The simplified MeteoAG workflow
107
Benchmarking Grid Applications for Performance and Scalability Predictions
activity decides whether the activity rams_hist or the parallel loop pforTimeStep (in which the activity revu_dump is enclosed as iteration) are executed.
Invmod Invmod is a hydrological application designed at the University of Innsbruck for calibration of parameters of the WaSiM tool developed at the Swiss Federal Institute of Technology Zurich. Invmod uses the Levenberg-Marquardt algorithm to minimize the least squares of the differences between the measured and the simulated runoff for a determined time period. We re-engineered the monolithic Invmod application into a Grid-enabled scientific workflow consisting of two levels of parallelism, as depicted in Figure 8:

• Each iteration of the outermost parallel loop called random run performs a local search optimization starting from an arbitrarily chosen initial solution;
• Alternative local changes are examined separately for each calibrated parameter, which is done in parallel in the inner nested parallel loop.
Figure 8. The simplified Invmod workflow
Figure 9. Experiment reduction with problem, machine, and Grid sizes
The number of sequential loop iterations is variable and depends on the actual convergence of the optimization process. However, it is usually equal to the input maximum iteration number.
Experiment Set Reduction We analyzed the scalability of our experimental design strategy by varying the problem size of our applications from 10 to 200 for fixed values of the remaining factors: ten Grid sites with machine size 20, and 50 single-processor machines. We observed a reduction in the total number of experiments from 96% to 99%, as shown in Figure 9. A reduction from 77% to 97% in the total number of experiments was observed when we varied the machine size from one to 80, for the fixed factors of 10 parallel machines, 50 single-processor Grid sites and a problem size of five. From another perspective, we observed that the total number of experiments increased by only 7% to 9% when the Grid size was increased from 15 to 155, for the fixed factors of five parallel machines with machine size 10 and problem size 10. We observed an overall reduction of 78% to 99% when we varied all factors simultaneously: five parallel machines with machine size from 1 to 80, single-processor Grid sites from 10 to 95, and problem size from 10 to 95.
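A back-of-the-envelope sketch of this reduction follows, under our reading of the design (a full factorial sweep only on one reference site, one machine-size sweep for a reference problem size, and one extra run per additional Grid site); the exact bookkeeping in the authors' tool may differ, and the counts below are hypothetical.

```python
# Hypothetical experiment counts: full factorial versus the reduced design.
def full_factorial(n_problem, n_machine, n_sites):
    return n_problem * n_machine * n_sites

def reduced(n_problem, n_machine, n_sites):
    # one problem-size sweep + one machine-size sweep + one run per extra site
    return n_problem + n_machine + (n_sites - 1)

for p in (10, 100, 200):
    full, red = full_factorial(p, 20, 60), reduced(p, 20, 60)
    print(p, full, red, f"{100 * (1 - red / full):.1f}% fewer experiments")
```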
Normalized Benchmarks Due to space limitations, we report results on benchmarking one activity from each of the three previously introduced workflows: LAPW1 from WIEN2k, rams_hist from MeteoAG, and wasim_b2c from Invmod. The training benchmarks for the LAPW1 activity type of the WIEN2k workflow on different Grid sites of the Austrian Grid are shown in Figure 10. We made a total of 45 benchmark experiments for
41 different problem sizes on 5 different Grid sites. The total execution time of our reduced LAPW1 benchmarking phase was 4203.24 seconds, while the full factorial set would have needed 5.6 times longer (23614.8 seconds). We repeated every experiment five times and took the average to reduce anomalies in the computations due to external factors. For LAPW1, we took the execution time for problem size 9.0 as the base performance value for normalization. The similar benchmark curves (for different values of problem size) on different machines show the realization of normalized performance behavior of the Grid benchmarks across heterogeneous platforms. Performance and scalability benchmarks for different numbers of CPUs for the rams_hist activity type of the MeteoAG workflow on the zid-cc and hc-ma Grid sites are shown in Figure 11 and Figure 12, respectively. A total of 30 benchmarking experiments were made for 19 problem sizes and 12 machine sizes on zid-cc, and 32 experiments for 19 problem sizes and 14 machine sizes on hc-ma. In these experiments, we have used a machine size of one for normalization. The identical scalability curves demonstrate the realization of normalized performance behavior of application benchmarks with respect to problem and machine size on one platform. We observed similar results for the wasim_b2c activity of Invmod for different problem sizes on different Grid sites, as shown in Figure 13. Here, the reduced training phase took 2190.62 seconds, while the full factorial set would have needed about 10711.5 seconds (4.8 times longer).
Figure 10. Normalized LAPW1 benchmarks for 41 problem sizes and 5 Grid sites

Grid Site Comparison A comparison of different Grid sites for LAPW1 and wasim_b2c is shown in Figure 14 and Figure 15, respectively. The scalability comparison of MeteoAG for different problem sizes on two different platforms, the 32-bit zid-cc and the 64-bit hc-ma, is shown in Figure 16 and Figure 17, respectively.
Figure 11. Normalized rams_hist benchmarks on zid-cc with 19 problem sizes and 12 machine sizes
Figure 12. Normalized rams_hist benchmarks on hc-ma with 19 problem sizes and 14 machine sizes
Figure 13. Normalized wasim_b2c benchmarks on hc-ma with 9 problem sizes and 5 Grid sites
A comparison of two different versions of LAPW1 (32-bit versus 64-bit) on karwendel is presented in Figure 18. These graphs were generated from the application benchmarks when only one benchmark measurement for the 64-bit version was made. To give a glimpse of the variability in the quantitative comparisons of different Grid sites for different applications, we present our experimental results in Figure 19. As shown in this figure, the agrid1 and altix1.uibk Grid sites yielded different execution time ratios for the three different applications. For WIEN2k this ratio is 2.37, for Invmod 10.37, and for MeteoAG 1.71. It is noteworthy that these ratios are independent of the total execution times on these Grid sites. This is the reason why benchmarks for individual resources (e.g. CPU, memory) do not suffice for application performance and scalability predictions. Furthermore, considering one application, the comparison of execution times on Grid sites yields different ratios for different problem sizes. This performance behavior of Grid applications urged us to make a full factorial design of experiments on the Grid, rather than modeling individual applications analytically, which is complex and inefficient. The execution time ratios of the two Grid sites altix1.uibk and agrid1 for 41 different problem sizes are shown in Figure 20.
Prediction Accuracy Figure 21 and Figure 22 show the comparison between the measured and predicted values for LAPW1 and wasim_b2c, respectively. The lowest curves in both figures represent the execution values on agrid1, taken as the reference Grid site. The execution times of these activities on four other Grid sites were calculated through the normalization process. We have used the reference point available from the training set for normalization.
Figure 14. Performance benchmark and Grid site comparison for 41 problem sizes of LAPW1 and five Grid sites
Figure 15. Performance benchmark and Grid site comparison for 9 problem sizes of wasim_b2c and five Grid sites
Figure 16. Performance benchmark and Grid site comparison for 18 problem sizes of rams_hist and 12 machine sizes on zid-cc
Figure 17. Performance benchmark and Grid site comparison for 18 problem sizes of rams_hist and 14 machine sizes on hc-ma
Figure 18. 32 bit versus 64 bit performance benchmark and comparison for different problem sizes on karwendel
Figure 19. Quantitative performance comparison of altix1 and agrid1 for three workflow applications
Figure 20. Execution times ratios for altix1.uibk and agrid1 for 41 problem sizes
Each pair of measured and predicted curves is very similar. However, we can see that they are closest near the reference problem size and show small differences as the distance from the reference problem size increases. For this reason, we always take the reference problem size as close as possible to the target problem size. We observed a maximum average variation of 10% from the actual values (obtained from real runs) in our performance and scalability predictions, which means 90% accuracy in our predictions with a maximum standard deviation of 2%. As we gather more data during the actual runs, the probability of finding problem sizes closer than the one obtained during the benchmarking phase increases; using those for normalization can push the accuracy even beyond 90%.
Figure 21. Comparison of real and predicted values for LAPW1

Figure 22. Comparison of real and predicted values for wasim_b2c

FUTURE TRENDS In the future we plan to enhance our present work in multiple dimensions. First, we aim to refine the experimental design phase to further reduce the number of experiments on one Grid site (with full factorial experiments) by applying intelligent space search methods. In the beginning this will be done with the help of end-users, and later we plan to automate it for different applications. Second, we intend to make another set of benchmarks by keeping track of the memory used for different problem sizes of an application. This will help in translating application performance across machines with different memory capacities, including performance variations due to paging in the case of data-intensive applications. Third, we are also extending our present work towards application benchmarking at the level of Grid constellations comprising multiple sites spread across multiple Virtual Organizations. Fourth, we plan to incorporate application throughput information for performance translation across platforms and to learn from previous errors in predictions. Last but not least, we want to study the effect of prediction inaccuracies on the scheduling of workflows.
CONCLUSION Application benchmarks provide a concrete basis for performance analysis and predictions incorporating variations in the problem and machine sizes on different platforms, and for real quantitative comparison of different Grid sites. Efficient and reliable design of experiments to support automatic benchmarking for the training set of application performance prediction on the Grid is of crucial importance. We proposed in this chapter an effective experimental design through a step-by-step controlling mechanism that reduces the combinations of the factors affecting the application performance prediction. Our scalable approach is based on two performance sharing and translation mechanisms, intra-platform and inter-platform, that reduce the number of benchmarking experiments in the training phase to a complexity linear in the number of problem sizes, the size of one Grid site, and the number of Grid sites. Benchmarking an application with our method requires executing a full factorial set of experiments on one Grid site, and a scalability analysis for different machine sizes for a reference problem size. Using this information, predicting the performance of the application for an arbitrary problem size and machine size on another Grid site requires performing a single additional benchmarking experiment and then applying the inter- and intra-platform translation mechanisms. We demonstrated experimental results for three real-world applications in the Austrian Grid environment. In our experiments we achieved a 77% to 99% reduction in the number of experiments while maintaining 90% accuracy in the prediction results.
REFERENCES Afgan, E., Velusamy, V., & Bangalore, P. (2005). Grid resource broker using application benchmarking. European Grid Conference, (LNCS 3470, pp. 691-701). Amsterdam: Springer Verlag. Arpaci-Dusseau, R. H., Arpaci-Dusseau, A. C., Vahdat, A., Liu, L. T., Anderson, T. E., & Patterson, D. A. (1995). The interaction of parallel and sequential workloads on a network of workstations. SIGMETRICS, (pp. 267-278). Bailey, D., Barszcz, E., Barton, J. T., Browning, D. S., Carter, R. L., & Dagum, L. (1991). The NAS parallel benchmarks. The International Journal of Supercomputer Applications, 5(3), 63–73. Banks, T. (2006). Web services resource framework (WSRF). Organization for the Advancement of Structured Information Standards (OASIS). Berman, F., Casanova, H., Chien, A. A., Cooper, K. D., Dail, H., & Dasgupta, A. (2005). New Grid scheduling and rescheduling methods in the GrADS project. International Journal of Parallel Programming, 33(2-3), 209–229. doi:10.1007/s10766-005-3584-4 Cotton, W., Pielke, R., Walko, R., Liston, G., Tremback, C., & Jiang, H. (2003). RAMS 2001: Current status and future directions. Meteorology and Atmospheric Physics, 82(1-4), 5–29. doi:10.1007/s00703001-0584-9
Czajkowski, K., Fitzgerald, S., Foster, I., & Kesselman, C. (2001). Grid information services for distributed resource sharing. 10th International Symposium on High Performance Distributed Computing (pp. 181-194). San Francisco: IEEE Computer Society Press. De Roure, M., & Surridge, D. (2003). Interoperability challenges in Grid for industrial applications. GGF9 Semantic Grid Workshop, Chicago. Dikaiakos, M. D. (2007). Grid benchmarking: vision, challenges, and current status. [New York: Wiley InterScience.]. Concurrency and Computation, 19, 89–105. doi:10.1002/cpe.1086 Dixit, K. M. (1991). The SPEC benchmarks. Parallel Computing, 17(10-11), 1195–1209. doi:10.1016/ S0167-8191(05)80033-X Dongarra, J., Luszczek, P., & Petitet, A. (2003, August). The LINPACK Benchmark: past, present and future. Concurrency and Computation, 15(9), 803–820. doi:10.1002/cpe.728 Duan, R., Prodan, R., & Fahringer, T. (2006). Run-time optimization for Grid workflow applications. International Conference on Grid Computing. Barcelona, Spain: IEEE Computer Society Press. Fahringer, T., Prodan, R., Duan, R., Hofer, J., Nadeem, F., Nerieri, F., et al. (2006). ASKALON: A development and grid computing environment for scientific workflows. In I. J. Taylor, E. Deelman, D. G. Ganon, & M. Shields (Eds.), Workflows for e-Science (p. 530). Berlin: Springer Verlag. Foster, I., & Kesselman, C. (1997). Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications and High Performance Computing, 11(2), 115–128. doi:10.1177/109434209701100205 Foster, I., & Kesselman, C. (2004). The Grid: Blueprint for a future computing infrastructure (2 Ed.). San Francisco: Morgan Kaufmann. Heymann, E., Fernandez, A., Senar, M. A., & Salt, J. (2003). The EU-Crossgrid approach for grid application scheduling. European Grid Conference, (LNCS 2970, pp. 17-24). Amsterdam: Springer Verlag. Hockney, R., & Berry, M. (1994). PARKBENCH report: public international benchmarks for parallel computers. Science Progress, 3(2), 101–146. Iosup, A., & Epema, D. H. (2006). GRENCHMARK: A framework for analyzing, testing, and comparing grids. International Conference on Cluster Computing and the Grid (pp. 313-320). Singapore: IEEE Computer Society Press. Iosup, A., & Epema, D. H. (2007). Build-and-test workloads for Grid middleware: Problem, analysis, and applications. International Conference on Cluster Computing and the Grid (pp. 205-213). Rio de Janeiro, Brazil: IEEE Computer Society Press. Jarvis, S. A., & Nudd, G. R. (2005, February). Performance-based middleware for Grid computing. Concurrency and Computation: Practactice and Experience, 17(2-4), 215–234. doi:10.1002/ cpe.925
Mohamed, H. H., & Epema, D. H. (2005). Experiences with the KOALA co-allocating scheduler in multiclusters. International Conference of Cluster Computing and the Grid (pp. 784-791). Cardiff, UK: IEEE Computer Society Press. Montgomery, D. C. (2004). Design and analysis of experiments (6 ed.). New York: Wiley. Prodan, R., & Fahringer, T. (2008, March). overhead analysis of scientific workflows in grid environments. Transactions on Parallel and Distributed Systems, 19(3), 378–393. doi:10.1109/ TPDS.2007.70734 Raman, R., Livny, M., & Solomon, M. H. (1999). Matchmaking: An extensible framework for distributed resource management. Cluster Computing, 2(2), 129–138. doi:10.1023/A:1019022624119 Schüller, F., Qin, J., Nadeem, F., Prodan, R., Fahringer, T., & Mayr, G. (2006). Performance, scalability and quality of the meteorological grid workflow MeteoAG. In Austrian Grid Symposium. Innsbruck, Austria: OCG Verlag. Schwarz, K., Blaha, P., & Madsen, G. K. (2002). Electronic structure calculations of solids using the WIEN2k package for material sciences. Computer Physics Communications, 147(71). Schwiegelshohn, U., & Yahyapour, R. (2000). Fairness in parallel job scheduling. Journal of Scheduling, 3(5), 297–320. doi:10.1002/1099-1425(200009/10)3:5<297::AID-JOS50>3.0.CO;2-D Seltzer, M. I., Krinsky, D., & Smith, K. A. (1999). The case for application-specific benchmarking. Workshop on Hot Topics in Operating Systems (pp. 102-109). Rio Rico, AZ: IEEE Computer Society Press. Siddiqui, M., Villazon, A., Hoffer, J., & Fahringer, T. (2005). GLARE: A Grid activity registration, deployment, and provisioning framework. Supercomputing Conference. Seattle, WA: IEEE Computer Society Press. Snavely, A., & Weinberg, J. (2006). Symbiotic space-sharing on SDSC’s datastar system. Job Scheduling Strategies for Parallel Processing. (LNCS 4376, pp.192-209). St. Malo, France: Springer Verlag. Theiner, D., & Rutschmann, P. (2005). An inverse modelling approach for the estimation of hydrological model parameters. (I. Publishing, Ed.) Journal of Hydroinformatics. Tirado-Ramos, A., Tsouloupas, G., Dikaiakos, M. D., & Sloot, P. M. (2005). Grid resource selection by application benchmarking: A computational haemodynamics case study. International Conference on Computational Science. (LNCS 3514, pp. 534-543). Atlanta, GA: Springer Verlag. Tsouloupas, G., & Dikaiakos, M. D. (2007). GridBench: A tool for the interactive performance exploration of Grid infrastructures. Journal of Parallel and Distributed Computing, 67(9), 1029–1045. doi:10.1016/j.jpdc.2007.04.009 Van der Wijngaart, R. F., & Frumkin, M. A. (2004). Evaluating the information power Grid using the NAS Grid benchmarks. International Parallel and Distributed Processing Symposium. Santa Fe, NM: IEEE Computer Society Press.
Wieczorek, M., Prodan, R., & Fahringer, T. (2005). Scheduling of scientific workflows in the ASKALON Grid environment. SIGMOD Record, 09. Wolski, R. (2003). Experiences with predicting resource performance on-line in computational grid settings. ACM SIGMETRICS Performance Evaluation Review, 30(4), 41–49. doi:10.1145/773056.773064 World Wide Web Consortium (W3C). (n.d.). Web services activity. Retrieved from http://www. w3.org/2002/ws/
KEY TERMS AND DEFINITIONS

Benchmark: A measurement to be used as a reference value for future calculations such as performance predictions.

Experimental Design: The design of all information gathering exercises where variation is present, whether under the full control of the experimenter or not.

Grid: A geographically distributed hardware and software infrastructure that integrates high-end computers, networks, databases, and scientific instruments from multiple sources to form a virtual supercomputer on which users can work collaboratively within virtual organizations.

Performance Prediction: Estimation of the execution time of an application for a certain problem size in a certain configuration (e.g. machine size) on the target computer architecture.

Scalability: The ability of a system either to handle growing amounts of work without losing processing speed, or to be readily enlarged to accommodate that growth.

Scheduling: The process of assigning an appropriate execution resource to each atomic activity of a large application; scheduling is usually employed for parallel applications, bags of tasks, and workflows and is an NP-complete problem for certain objective functions such as execution time.

Scientific Workflow: A large-scale loosely coupled application consisting of a set of commodity off-the-shelf software components (also called tasks or activities) interconnected in a directed graph through control flow and data flow dependencies.
Section 2
P2P Computing
Chapter 6
Scalable Index and Data Management for Unstructured Peer-To-Peer Networks

Shang-Feng Chiang, National Taiwan University, Taiwan
Kuo Chiang, National Taiwan University, Taiwan
Ruo-Jian Yu, National Taiwan University, Taiwan
Sheng-De Wang, National Taiwan University, Taiwan
ABSTRACT In order to improve the scalability and reduce the traffic of Gnutella-like unstructured peer-to-peer networks, index caching and controlled flooding mechanisms have been an important research topic in recent years. In this chapter the authors describe and present the current state of the art of index management schemes, interest groups and data clustering for unstructured peer-to-peer networks. Index caching mechanisms are an approach to reducing the traffic of keyword querying. However, the cached indices may incur redundant replications in the whole network, leading to less efficient use of storage and an increase in traffic. They propose a multilayer index management scheme that actively diffuses indices in the network and groups indices according to their request rate. The peers of the group that holds indices with a higher request rate are placed in layers that receive queries earlier. Their simulation shows that the proposed approach can keep a high query success rate as well as reduce the flooding size.
DOI: 10.4018/978-1-60566-661-7.ch006
INTRODUCTION With the growth of the Internet, peer-to-peer (P2P) systems have become an important paradigm in designing large-scale distributed systems. Peer-to-peer systems (Androutsellis-Theotokis, & Spinellis, 2004) provide effective ways of sharing data and can be based on overlay networks, which are classified by the degree of decentralization. The three categories are as follows: purely decentralized, partially decentralized, and hybrid decentralized architectures. Supporting efficient search of desired documents has been the most important issue in a decentralized peer-to-peer network. The overlay for decentralized peer-to-peer networks can be either unstructured or structured based on some distributed hash functions. Gnutella and Napster are pioneers in peer-to-peer file sharing systems and belong to the unstructured ones. A class of structured peer-to-peer networks uses DHT (Distributed Hash Table) mechanisms to maintain the shared documents. Distributed hash tables (DHTs) make use of hashing functions to provide distribution and lookup services. In this way, any participating node can efficiently retrieve the value associated with a given key. Responsibility for maintaining the mapping from names to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. DHTs are typically designed to scale to a large number of nodes and to handle node arrivals and departures. With a routing table, all the participating nodes only need to communicate with a small fraction of all the nodes in a structured overlay network. On the other hand, unstructured peer-to-peer networks often rely on flooding mechanisms to search for the desired objects. As a result, they need to use techniques like index caching, active replication, or controlled flooding to reduce the query traffic. The search algorithm of Gnutella uses a kind of flooding method to discover objects, which sends queries to all nodes within a given TTL value. However, the mechanism is not scalable, since the query messages will grow exponentially due to its blind search method. In this chapter, we will discuss some scalable techniques for index and data management for unstructured peer-to-peer networks. The concepts of interest group and data clustering will also be addressed. We also propose an index diffusion scheme to maintain a high query success rate and reduce the traffic load for unstructured peer-to-peer systems.
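The exponential growth of blind flooding can be seen from a rough estimate: with an average node degree d, the source contacts d neighbours and every later hop forwards to roughly d - 1 new peers. The sketch below is only illustrative and ignores revisited peers and cycles.

```python
# Rough message-count estimate for TTL-limited blind flooding (illustrative).
def flood_messages(degree, ttl):
    messages, frontier = 0, 0
    for hop in range(ttl):
        frontier = degree if hop == 0 else frontier * (degree - 1)
        messages += frontier
    return messages

for ttl in range(1, 8):
    print(ttl, flood_messages(4, ttl))   # degree 4: 4, 16, 52, 160, ...
```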
BACKGROUND AND RELATED WORK BitTorrent BitTorrent is a peer-to-peer communication protocol (Cohen 2002) that can distribute large amounts of data widely without the original distributor incurring the entire costs of hardware, hosting, and bandwidth resources. Instead, when data is distributed using the BitTorrent protocol, each recipient supplies pieces of the data to newer recipients, reducing the cost and burden on any given individual source, providing redundancy against system problems, and reducing dependence on the original distributor.
Blind Search Methods Most search methods in Gnutella-like peer-to-peer networks can be categorized into blind search methods and informed search methods (Tsoumakos, & Rousseopoulos, 2006). In blind search methods, there is no mechanism to keep the query results, to judge the best query path, or to use some other information to reduce traffic. Blind search confines each node to transmitting messages to some of its adjacent nodes instead of all adjacent nodes, without using the information in the messages or choosing the best path for transmission. Typical systems include Gnutella, Modified-BFS (Kalogeraki, Gunopulos, & Zeinalipour-Yazti 2002), Random Walks (Lv, Cao, Cohen, Li, & Shenker 2002), and Dynamic Query (Fisk, 2003) (Ripeanu, Foster, & Iamnitchi 2002).
Index Caching Mechanisms Some informed search methods are efficient in reducing flooding messages by using a class of caching mechanisms. In a normal Uniform Index Caching (UIC) mechanism, all peers on the path of a query record the query hit results in their caches. DiCAS records the query hit results in a multilayer peer-to-peer network (Wang, Xiao, Liu, & Zheng, 2004). In DiCAS with m layers, each peer randomly takes an initial value in a certain range, for example 0 to m-1, as a group id when it joins a peer-to-peer network. A query qr matches a peer if and only if the peer's group id matches the following equation: id = hash(qr) mod m
(1)
Unstructured peer-to-peer networks often use a partially centralized architecture to reduce the query traffic. Some nodes in the networks play specific roles, for example, managing network configuration, monitoring network status, and forwarding messages to other peers. A distributed caching (Ambastha, Beak, Gokhale, & Mohr, 2003) and adaptive search (DiCAS) algorithm has been proposed to reduce network search traffic with the help of a small cache space contributed by each individual peer. With indices passively cached in a group of peers based on a predefined hash function, the DiCAS protocol can significantly reduce network search traffic. Based on the DiCAS algorithm, we will propose an index management scheme to enhance the search performance.
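Equation (1) can be sketched as follows; the concrete hash function (MD5 here) is our own choice for illustration and is not specified by DiCAS.

```python
import hashlib

# Sketch of the DiCAS-style group assignment of equation (1): a query is
# cached and answered by peers whose group id equals hash(query) mod m.
def group_id(query: str, m: int) -> int:
    digest = hashlib.md5(query.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % m

m = 3                                         # number of groups/layers
print(group_id("mountain weather dataset", m))
```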
Interest Group and Data Clustering The concept of interest groups (Chao 2006) can play an important role in designing peer-to-peer systems. Everyone has his or her own interests, and the communities on the Internet also reflect this fact. If we place files with the same attribute, or files of interest to the same group of users, in the same cluster, the users can easily find the files by searching for them within that cluster. It is worth noticing that the resource types themselves present the properties of fractals in both coarse and fine classifications of resources. To further improve efficiency, locality should be considered by the multicast or routing algorithms of the application layer when constructing overlay networks (Zhang et al 2004). It is noted that even better performance can be achieved if we combine locality and the concept of interest groups.
Popularity of Content: The shared contents in a peer-to-peer network in general follow a kind of probability distribution that reflects some form of popularity. We would like to model this kind of distribution, which is close to the realistic cases, in peer-to-peer networks. In many cases, keywords of queries may follow a Zipf distribution. The distribution model of content popularity described in (Saleh, Hefeeda, 2006) follows a Mandelbrot-Zipf distribution (Silagadze, 1997). The Mandelbrot-Zipf distribution, a general form of the Zipf-like distribution with an extra parameter, defines the probability of accessing an object at rank i out of N available objects as:

p(i) = H_{N,s,q} / (i + q)^s,  where  H_{N,s,q} = ( Σ_{i=1}^{N} 1/(i + q)^s )^{-1},  q ≥ 0    (2)
Here, s is the skewness factor, and q is the plateau factor. In (Saleh, Hefeeda, 2006), it is observed that the typical value for s is between 0.4 and 0.7, and the typical value for q is between 5 and 60. The Mandelbrot-Zipf distribution degenerates to a Zipf-like distribution with a skewness factor s if q = 0. In our simulation, we will use these parameters to set up the simulation environment in order to make the simulation model close to real cases.
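A small sketch of equation (2) follows, with the normalization constant computed so that the probabilities sum to one; the parameter values are those used later in the chapter's simulations.

```python
# Mandelbrot-Zipf popularity model of equation (2): p(i) = H / (i + q)^s.
def mandelbrot_zipf(n_objects, s, q):
    weights = [1.0 / (i + q) ** s for i in range(1, n_objects + 1)]
    h = 1.0 / sum(weights)                    # normalization constant H_{N,s,q}
    return [h * w for w in weights]

popularity = mandelbrot_zipf(1268, s=0.7, q=5)
print(popularity[0], popularity[-1], sum(popularity))   # head, tail, sums to ~1.0
```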
AN INDEX DIFFUSION MECHANISM Request Popularity and Disposition In order to group indices, we cluster the indices of files with similar properties into the same groups. In structured peer-to-peer networks, each keyword is exactly matched by a single node. The structured peer-to-peer architecture may therefore be the best choice for grouping the indices by keywords. Search in structured peer-to-peer networks is also more efficient than in unstructured peer-to-peer networks. For an unstructured peer-to-peer network, we group indices by request popularity instead of keywords. The request popularity is a special property of any shared file, and it cannot be calculated only from the information of the shared files. To measure the request popularity of the shared files, the peer that owns a file must update its hit counts by listening to the queries or to the messages that update indices. This is similar to webpage ranking: a popular webpage will be marked with a higher rank value by the search engine, and the search engine can sort the search results by the rank of each webpage. So, we can assign a higher rank value to a more popular file. The indices of the more popular files can be distributed to peers that lie in earlier locations so that a query can reach them sooner. We design a hierarchical multilayer peer-to-peer network in which the priorities of the layers differ. The top layer has the highest priority, and queries always reach the peers in the top layer first. Indices of the most popular files are placed in the top layer. We move those file indices whose request rates are growing up to an upper layer so that they can be reached earlier in subsequent queries. The indices of files that become unpopular will be replaced by more popular ones. One situation needs to be considered: when new files are inserted into the peer-to-peer network, we suppose they are popular, so we insert their indices into the top layer to make them reachable as early as possible.
Figure 1. The multilayer structure
Network Architecture In order to maintain our multilayer architecture easily and balance the load in each layer, we divide the network into three layers (0 to 2) and 7 groups (0 to 6). The architecture is shown in Figure 1. Each peer randomly selects a value (0 to 6) as its group id when it joins the network. The top layer, Layer-0, contains only one group (Group-0), and the second layer contains 2 groups (Group-1 and Group-2). The other 4 groups are assigned to the third layer. Because only a few files are popular, Layer-0 has only one group. The index diffusion will be carried out by the owner of the file according to the request rate, which is characterized by the query hit counts that occur in some period, T. The owner of a file will diffuse the index in Layer-0 if the hit count is larger than 4 in T. If the hit counts are 2 or 3 in T, the indices will be randomly diffused to one group in Layer-1. In order to control the traffic, indices in Layer-2 will never expire. The indices in Layer-2 will be removed if and only if the procedure of index update fails. So, when a new file is inserted into the network, the file owner will not only place the file index in Layer-0, but will also randomly select one group in Layer-2 to keep the file index.
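Our reading of this diffusion rule can be summarized in a few lines; the group numbering follows Figure 1, the function name is ours, and the fallback for hit counts below two is our own simplification rather than a rule stated by the authors.

```python
import random

# Sketch of the layer/group selection rule: group 0 is Layer-0, groups 1-2 are
# Layer-1, and groups 3-6 are Layer-2 (see Figure 1).
def target_group(hits_in_period_T):
    if hits_in_period_T > 4:
        return 0                               # very popular: diffuse to the top layer
    if hits_in_period_T in (2, 3):
        return random.choice([1, 2])           # moderately popular: one Layer-1 group
    return random.choice([3, 4, 5, 6])         # otherwise keep the index in a Layer-2 group

print(target_group(6), target_group(3), target_group(0))
```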
Control of Indices Most index caching mechanisms are based on passively overhearing the traffic or on actively querying and collecting information about shared files from the neighbors. Overhearing incurs no additional traffic overhead for caching the results of query hits because it does not affect the original flooding algorithms. However, actively querying and collecting the locations of shared files will produce overheads due to the update mechanism. The overhead is significant when peers frequently leave or join the overlay network. Obviously, the hit counts of shared files that contain more popular keywords will be higher than those of others. As a result, passive caching methods will cache more copies of indices of popular items than of unpopular ones. The advantage is that the popular files will be found easily and quickly because these files are cached widely over the network. In this case, a flooding algorithm with a low TTL value is sufficient to find such an item. However, there is no method to control the amount of indices of an item in the index caching mechanisms. Also, it produces at least two disadvantages: (1)
it is hard to find new shared files that contain popular keywords; (2) the cached information may be invalid in peer-to-peer networks with a high churn rate. In order to limit the traffic incurred by flooding, some search methods assign a smaller TTL value in the process of search. Flooding with a small TTL will only cover a small part of the network, so the peers that have newer files may not be included in the range of flooding. On the other hand, in a peer-to-peer network, the destination peer that was pointed to by an index in the index cache may have left the network. In the DiCAS algorithm, peers use a hash method to redirect query messages in the multilayer peer-to-peer network. By this method, search is more efficient than the original uniform index caching mechanism in terms of traffic. But there is one issue about the number of layers in the multilayer peer-to-peer network. If we insert more layers in the peer-to-peer network, the probability of finding a desired file in a matched layer will be lower. This may limit the query success rate in the peer-to-peer network, and the amount of results may not be enough to meet the preset expected number of matched results. The proposed active index diffusion (AID) scheme can overcome these problems of index caching mechanisms. In our scheme, we group the peers that share the same files into a cluster, and they can diffuse indices actively. There are four main strategies in the AID (Active Index Diffusion) scheme:
1. Clustering peers that share the same file names: The main concept of our AID scheme is similar to some existing peer-to-peer applications, such as BitTorrent. In BitTorrent, there are several trackers that record the owners of the files and the peers that are downloading the files. These trackers will provide the information when peers request the files. Peer-to-peer networks like BitTorrent are efficient in getting peer lists from trackers, so we group peers that share the same file names into a cluster. In the AID scheme, each peer knows the other peers that share files with the same file names. In other words, every peer can play a role just like the trackers in BitTorrent. If a peer requests the file, the owner of the file will provide the information about the file. Another reason for grouping files with the same file names is that there are more and more fake files appearing in peer-to-peer networks. Fake files are detected in KaZaA, while files may be infected by viruses in Winny and Share. By grouping these fake files and real files together, we can apply some method to differentiate between fake and real files with a tag. The tag may help peers tell whether the file is real.

2. Index updated when an index hit occurs: As we have discussed, request popularity characterizes our index diffusion scheme. We define the request popularity by the number of index hit counts over a timeout period, T. In each cluster, the owners will elect one peer as the master of the file. The master may be the one that has the longest online time in the cluster. When the master leaves the network, the peers will re-elect one peer as the master. When an index hit occurs at a peer, the peer will send an index update message to the file owner. The owner of the file then randomly selects a peer in its cluster to carry out the index update. The file owner will also send an index update message to the master. The master in each cluster keeps the hit count obtained from the index update messages.

3. Actively diffusing indices: According to the hit rate, the master will actively diffuse its indices to the peers in an appropriate layer when the indices expire. We denote the value of the timeout as T. The value of T should not be too small; otherwise, masters may update indices too often. In order to control the traffic of actively diffusing indices, T must be assigned a value larger than a certain number that may be related to the network size and the amount of shared files.
Figure 2. Index copy when peer joins
Assume that a network has N peers, the index replication rate is r, a total of I different files are shared, and the maximal allowed number of update messages per unit time is S. If we want to limit the number of messages to S, the following inequality must be satisfied:

S ≥ (total popular indices that need updates in T) / T = N × r × I × k / T    (3)

So we can determine the value of T by:

T ≥ N × r × I × k / S    (4)
where k is the probability that a file is popular. In (Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., & Shenker, S., 2003), it is estimated that 32% to 42% of queries are repeated. According to the Zipf distribution, these repeated queries only request a small set of files. Thus, we assign a small value to k.
4. Copy indices when peer joins: The basic procedure of joining our network is the same as in Gnutella/LimeWire (Fisk 2003). Because we wish to control the flooding size and keep the hit rate steady, the index replication rate should be kept at a constant value. In a peer-to-peer network, we cannot control the departure of each peer, but we can do something when peers join. When peers receive copy request messages, they will replicate a part of their indices to the newly-joining peer. This method keeps the replication ratio at a stable value. After finishing the joining procedure, the newcomer will request index replication by sending n copy request messages to n connected peers. If a peer receives a copy request message, it will reply with 1/n of the indices in its index cache to the newcomer. An example is shown in Figure 2. As a result, the index replication ratio will
keep a constant value, because the new peer will keep n × 1/n = 1 table of indices.

Table 1. p'(F) and F' in different network sizes

Groups | p'(F) (10000 peers) | F' (10000 peers) | p'(F) (100000 peers) | F' (100000 peers)
6 groups | 95.57% | 30.26303 | 95.48% | 30.31680
7 groups | 97.43% | 29.01158 | 97.35% | 29.06710
8 groups | 98.52% | 28.04401 | 98.46% | 28.10261
9 groups | 99.15% | 27.30189 | 99.11% | 27.35940
10 groups | 99.52% | 26.73488 | 99.49% | 26.78876
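To illustrate the join-time replication of strategy 4 above (each of the n contacted neighbours returns 1/n of its cached indices, so the newcomer collects roughly one table of indices), here is a minimal sketch with hypothetical cache contents; the function name is ours.

```python
import random

# Sketch of strategy 4: the newcomer gathers 1/n of each neighbour's cache.
def indices_for_newcomer(neighbour_caches):
    n = len(neighbour_caches)
    collected = []
    for cache in neighbour_caches:
        share = max(1, len(cache) // n)                 # roughly 1/n of this cache
        collected.extend(random.sample(cache, min(share, len(cache))))
    return collected

caches = [[f"idx-{p}-{i}" for i in range(50)] for p in range(4)]   # 4 neighbours
print(len(indices_for_newcomer(caches)))                # about 50, i.e. one table of indices
```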
Flooding of Query In peer-to-peer networks, if the file indices are diffused over the whole network and no intelligent method (just blind flooding) is applied to the search, the query success rate can be described by a hypergeometric distribution. By definition, if a random variable X follows the hypergeometric distribution with parameters N, D, and n, the probability of getting exactly k successes is given by:

f(X = k; N, D, n) = C(D, k) · C(N - D, n - k) / C(N, n)    (5)
where N is the whole sample space, D is the amount of total desired items, and n is the amount of samples. Now, let a simple peer-to-peer network consist of N peers with index replication rate r. Then, the expression of the probability model is:

f(X = k; N, rN, m) = C(rN, k) · C((1 - r)N, m - k) / C(N, m)    (6)
If the query messages are sent to m peers, the success rate hm is equal to:

hm = 1 - f(X = 0; N, rN, m)    (7)
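Equations (5)-(7) can be evaluated directly; in the example below, the values of N, r and m are illustrative and are not taken from the chapter.

```python
from math import comb

# Hypergeometric hit model of equations (5)-(7).
def hypergeom_pmf(k, N, D, n):
    return comb(D, k) * comb(N - D, n - k) / comb(N, n)

def success_rate(N, r, m):
    D = int(r * N)                              # peers holding the desired index
    return 1.0 - hypergeom_pmf(0, N, D, m)      # equation (7): at least one hit

print(success_rate(10000, 0.01, 25))            # illustrative parameters only
```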
Generally speaking, a higher flooding size incurs more traffic overhead, but the success rate can also be higher. We wish our flooding method in each group to be efficient in terms of flooding size and success rate, so we modify the dynamic query search method to suit our architecture. The concept is to satisfy the demand for files as in dynamic query. In our search method, we expect the success rate to be higher than a certain value if the indices of the desired items are in the group. To make our network scalable, we set the basic flooding size F of each group to the same value. Searching in each group consists of two iterations and begins with sending query messages to F peers in the first iteration.
Figure 3. Example of routing path 1 with the modified dynamic query
The message will be flooded to another F peers if and only if there is no hit message returned in the first iteration. When the network is divided into g groups, the success rate of a search in a group that holds the indices of the desired files is:

hm = 1 - f(X = 0; N, rN, m)    (8)
The final success rate with two iterations can be written as:

p'(F) = p(F) + (1 - p(F)) · p(F) = 2p(F) - p(F)^2    (9)
Figure 4. Example of routing path 1 with the modified random walk
Figure 5. Example of routing path 2 with the modified random walk
The value of p'(F) is very close to p(2F), and the average flooding size F' is equal to:

F' = F + (1 - p(F)) · F = 2F - F · p(F)    (10)
Because p(F) is close to 1, the average flooding size F' approaches F. If we flooded a query to F' peers directly, the value of p(F') would be lower than p'(F). We set the value of F to 25 and then calculate p'(F) and F' for different network sizes. The results are shown in Table 1. Our architecture with 10000 peers reaches a success rate of 97.43%, with F' = 29.01158, for the 7-group peer-to-peer system. When the network size is ten times larger, the success rate is 97.35%, with F' = 29.06710. This shows that the average flooding sizes and success rates for different network sizes are quite similar, so there is no problem with the scalability of our network.
Figure 6. Flooding size
Figure 7. Average success rate
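The two-iteration bookkeeping of equations (9) and (10) is easy to reproduce; the per-iteration success probability below is an assumed value, chosen only to show numbers of the same magnitude as Table 1.

```python
# Equations (9) and (10) for an assumed per-iteration success probability p(F).
def two_iteration(p_F, F):
    p_prime = 2 * p_F - p_F ** 2                # equation (9)
    f_prime = 2 * F - F * p_F                   # equation (10)
    return p_prime, f_prime

print(two_iteration(p_F=0.84, F=25))            # roughly p' = 0.974, F' = 29.0
```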
Modified Search Methods Because our peer-to-peer network is based on a multilayer architecture, the existing search methods are not directly suitable for it. We therefore adapt some existing blind search methods to our peer-to-peer network.

1. Dynamic query: A requesting peer randomly selects a peer from each group as the forwarder; then, the query starts by sending a query message to the forwarder of Group-0. This forwarder will send the query in Layer-0. If the hit results are not enough, the query messages will be sent to the Group-1 and Group-2 peers in Layer-1. If the results returned from Layer-0 and Layer-1 do not satisfy the desired amount, the query will reach the final layer, Layer-2. Figure 3 illustrates an example of the modified dynamic query.

2. Random walks: We adopt an F-walker random walk algorithm, where F is set to 25. There are 25 walkers sent by a requesting peer, and the walkers start walking in Group-0. Walkers will send queries in each group at most 2 times and end in Group-3 or Group-6 of Layer-2. Examples of the routing path are shown in Figure 4 and Figure 5. There are two cases in which a walker will stop: one is when the walker has finished the query, and the other is when a hit occurs.
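A simplified sketch of the modified dynamic query is given below; the per-group search is reduced to a callback, and the data structures are our own illustration rather than the authors' implementation.

```python
# Layer-ordered dynamic query: stop as soon as enough results are collected.
LAYERS = [[0], [1, 2], [3, 4, 5, 6]]            # group ids per layer, as in Figure 1

def dynamic_query(search_group, wanted=4):
    hits = []
    for layer in LAYERS:
        for group in layer:
            hits.extend(search_group(group))
        if len(hits) >= wanted:                 # enough hits: do not flood deeper layers
            break
    return hits

toy_index = {2: ["peerA", "peerB"]}             # pretend only Group-2 holds two replicas
print(dynamic_query(lambda g: toy_index.get(g, []), wanted=2))   # stops after Layer-1
```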
Figure 8. Average hit files
SIMULATION RESULTS In our simulations, there are 10000 peers in the unstructured peer-to-peer network, and these peers are divided into 3 layers and 7 groups. The basic flooding size F is set to 25 peers for each group. The timeout value T is 2500. File names in our simulations are the titles of 1268 distinct books, and the popularity of file names is modeled by the Mandelbrot-Zipf distribution. Each book has an average of 7 keywords. Finally, we issue a total of 20000 query iterations in each simulation, and the Mandelbrot-Zipf parameters of the query keywords are the same as those used for the file names. As suggested by Share, we tried 2 different experimental setups of the Mandelbrot-Zipf distribution, where one is s = 0.4 and q = 60, and the other is s = 0.7 and q = 5. In the following experiments, the results are slightly different. We use m-zipf(0.7, 5) and m-zipf(0.4, 60) to distinguish these two setups. In the configuration with m-zipf(0.7, 5), we inserted a total of 3415 files into the whole network, and we assigned 1866 files to the network configured with m-zipf(0.4, 60). In (Yinglian and O'Hallaron 2002), the authors only experimented on the DiCAS peer-to-peer network with a 2-layer configuration. We compare the query success rates of DiCAS in different layer configurations. The network degree is varied from 5 to 25, and the cache size is set to 50. The dynamic query search method was implemented and modified for DiCAS. First, we want to know how the flooding size is reduced when we insert more layers in DiCAS. The simulations run in 2- to 4-layer configurations. Due to the use of the index caching mechanism and the dynamic query search method, the flooding sizes decrease as more query iterations are submitted. The number of desired files for dynamic query is set to 4.
Figure 9. Average duplicated rate per hit
After the 10000th query iteration, the flooding sizes become more stable than before, so we calculate the average flooding size from the 10000th query iteration to the 20000th. Figure 6 shows the average flooding sizes of different multilayer configurations of DiCAS. The average flooding sizes in DiCAS are larger than 850 peers in the original 2-layer configuration with m-zipf(0.7, 5) and m-zipf(0.4, 60). When the number of layers grows, the flooding sizes decrease noticeably. In the 4-layer configuration, the flooding size is only half of that of the 2-layer configuration. We can also notice in this figure that the traffic with m-zipf(0.4, 60) is larger than that with m-zipf(0.7, 5).
Figure 10. Average messages of DiCAS
Figure 11. Average messages of the AID
Compared with the results of DiCAS, the proposed method has a smaller flooding size. Furthermore, more layers in DiCAS cause a reduction of the query success rate, as the simulation results in Figure 7 show. In order to make a fair comparison, the success rates are also calculated from the successful hit counts between the 10000th query iteration and the 20000th. During that period, the success rates are more stable than in the earlier iterations. In the 2-layer configuration, hit rates are close to 95%, but the query success rate in the 4-layer configuration with m-zipf(0.4, 60) is lower than 85%. According to this figure, the query success rate decreases when the number of layers increases. For our AID scheme with two different search methods, dynamic query (DQ) and random walks (RW), the query success rates of each search method can also be seen in Figure 7. The success rates of each method are all higher than 98%, and some of them reach 99%. Obviously, the success rate is closer to 1 in the proposed AID scheme than in DiCAS. Figure 6 also shows that the average flooding sizes of each search method in the AID scheme are smaller than in DiCAS. Figure 8 shows the number of hit files per query for DiCAS and the AID scheme. Another problem for index caching mechanisms like DiCAS is that we can get a lot of redundant indices from the hit messages, but these indices only cover a few different files. In other words, there are too many indices replicated in the system. Although more layers in the network could reduce the replicated indices per query, the success rate becomes lower. We also calculated the average replicated rate per query in Figure 9 to compare DiCAS and our AID scheme. The ratios in the AID scheme are much smaller than in DiCAS, lying between 1.8 and 2.6. This shows that our AID scheme reduces the hit messages for indices that are replicated in each search. We can also compare the total messages in DiCAS and in our AID scheme. The results are shown in Figure 10 and Figure 11. When the networks become stable, the average number of messages per query in the DiCAS approach is larger than 1000, but our AID scheme only costs 270 to 340 messages. Finally, the Mandelbrot-Zipf distribution with s = 0.7 and q = 5 in DiCAS could obtain more files and a higher success rate with a smaller flooding size. The main difference between our AID scheme and the DiCAS scheme is that our modified search methods try to maximize the search success rate, so the success rates of the AID scheme are quite similar in the different Mandelbrot-Zipf configurations.
DISCUSSIONS AND FUTURE TRENDS We understand that the popularity of files cannot be determined only by a single period of time T. In other words, judging into which layer to diffuse a file will be more reliable if we also take the hit counts of the past into consideration. Otherwise, if the popularity of files varies rapidly, they will bounce between different layers after each time period T, even though this is unnecessary. Hence we introduce the concept of a moving average, which smooths short-term oscillations and makes the value representing the popularity of a file more robust in the case of short-term abnormal bursts or lulls. Assume HN is the hit count observed in the time interval [(N-1)*T, N*T], and W is the weighting we assign to HP(N-1), which is the value representing the popularity of the file in the time interval [(N-2)*T, (N-1)*T]; HN*(1-W) means that we assign the weighting (1-W) to HN, relative to the weight W given to HP(N-1). We can use an exponential moving average to estimate the popularity of the file HP(N) in the time interval [(N-1)*T, N*T] as follows: HP(N) = HP(N-1) * W + HN * (1-W)
(11)
Furthermore, we can also make a clear definition of "popular files" on our own by using some learning algorithm. By training on a large amount of data, the method can become more effective at deciding into which layer to diffuse files.
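Equation (11) is a standard exponential moving average; a minimal sketch follows, where the weighting W = 0.7 is an arbitrary illustrative choice rather than a value given by the authors.

```python
# Exponential moving average of equation (11): HP(N) = HP(N-1)*W + HN*(1-W).
def update_popularity(previous_hp, new_hits, w=0.7):
    return previous_hp * w + new_hits * (1 - w)

hp = 0.0
for hits in [5, 6, 0, 1, 7]:                    # hit counts in successive periods T
    hp = update_popularity(hp, hits)
    print(round(hp, 2))
```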
CONCLUSION The experimental results show that our architecture is efficient in keeping a high query success rate and reducing the flooding size. The proposed approach also results in an index table with fewer redundant indices than the existing DiCAS system, making efficient use of storage without incurring too much traffic. Flooding and caching according to hashed keywords, as in DiCAS, may cause flooding in layers that do not contain the desired files, resulting in a lower query hit rate. This problem becomes even worse when the system is configured with more layers. Compared with DiCAS, the query success rate improves from 95% to 98%, and the message traffic is reduced by 73% to 80%. The number of replicated indices per query has also been decreased to 13%. This shows that the proposed unstructured peer-to-peer system is scalable and efficient in reducing traffic and increasing the hit rate.
REFERENCES Ambastha, N., Beak, I., Gokhale, S., & Mohr, A. (2003). A cache-based resource location approach for unstructured P2P network architectures. Graduate Research Conference, Department of Computer Science, Stony Brook University, NY. Androutsellis-Theotokis, S., & Spinellis, D. (2004). A survey of peer-to-peer content distribution technologies. ACM Computing Surveys, 36(4), 335–371. doi:10.1145/1041680.1041681 Chao, C.-H. (2006, April). An Interest-based architecture for peer-to-peer network systems. In Proceedings of the International Conference AINA.
Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., & Shenker, S. (2003). Making gnutella-like p2p systems scalable. In Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (pp. 407-418). Cheng, A. H., & Joung, Y. J. (2006). Probabilistic file indexing and searching in unstructured peer-topeer networks. Computer Networks, 50(1), 106–127. doi:10.1016/j.comnet.2005.12.008 Cohen, B. (2002). BitTorrent Protocol 1.0. Retrieved from BitTorrent.org. Fisk, A. (2003). Gnutella dynamic query protocol v. 0.1. Retrieved from http://www9.limewire.com/ developer/dynamic query.html. Kalogeraki, V., Gunopulos, D., & Zeinalipour-Yazti, D. (2002). A local search mechanism for peer-topeer networks. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (pp. 300-307). Lv, C., Cao, P., Cohen, E., Li, K., & Shenker, S. (2002). Search and replication in unstructured peer-topeer networks. In Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems (pp.258-259). Ripeanu, M., Foster, I., & Iamnitchi, A. (2002). Mapping the gnutella network: properties of large-scale peer-to-peer systems and implications for system design. IEEE Internet Computing, 6(1), 50-57. Saleh, O., & Hefeeda, M. (2006). Modeling and caching of peer-to-peer traffic. In Proc. of 14th IEEE International Conference on Network Protocols (ICNP’06), (pp. 249-258). Silagadze, Z. (1997). Citations and the Mandelbrot-Zipf’s law. Complex Systems, 11, 487–499. Stoica, R. Morris, Karger, D., Kaashoek, M. F., & Balakrishnan, H., (2001). Chord: A scalable peer-topeer lookup service for Internet applications. In ACM SIGCOMM, August, (pp. 149-160). Tsoumakos, D., & Rousseopoulos, N. (2006). Analysis and comparison of p2p search methods. In Proceedings of the 1st International Conference on Scalable Information Systems (INFOSCALE 2006), No. 25. Wang, C., Xiao, L., Liu, Y., & Zheng, P. (2004). Distributed caching and adaptive search in multilayer P2P networks. In International Conference on Distributed Computing Systems (ICDCS’04) (pp. 219-226). Yang, B., & Garcia-Molina, H. (2002). Improving search in peer-to-peer networks. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS’02) (pp. 5). Yinglian, X. O’Hallaron D. (2002). Locality in search engine queries and its implications for caching. In Proceedings of the IEEE Infocom (pp. 1238-1247). Zhang, X. Y., Zhang, Q., Zhang, Z., Song, G., & Zhu, W. (2004). A construction of locality-aware overlay network: mOverlay and its performance. IEEE Journal on Selected Areas in Communications, 22(1), 18–28. doi:10.1109/JSAC.2003.818780
KEY TERMS AND DEFINITIONS

Dynamic Query: The search technique that Gnutella adopted to reduce traffic overhead and make searches more efficient, where a search reaches only those clients which are likely to have the files and stops as soon as the program has acquired enough search results.

Peer-to-Peer Networks: A network with equal peer nodes that simultaneously function as both clients and servers to the other nodes on the network.

Random Walk: A search algorithm used in a peer-to-peer network where the query is forwarded up and down the given list until a match is found, the query is aborted, or it reaches the limits of the list.

Structured Peer-to-Peer Networks: A peer-to-peer network that employs a consistent protocol, such as a distributed hashing function, to ensure that any node can efficiently route a search to some peer that has the desired file. The consistent hashing is also used to distribute the files to peers.

Uniform Index Caching: A cache scheme where query results are cached in all peers along the query path.

Unstructured Peer-to-Peer Networks: A peer-to-peer network that is formed when the overlay links are established arbitrarily.
Chapter 7
Hierarchical Structured Peer-to-Peer Networks
Yong Meng Teo, National University of Singapore, Singapore
Verdi March, National University of Singapore, Singapore
Marian Mihailescu, National University of Singapore, Singapore
ABSTRACT
Structured peer-to-peer networks are scalable overlay network infrastructures that support Internet-scale network applications. A globally consistent peer-to-peer protocol maintains the structural properties of the network as peers dynamically join, leave, and fail. In this chapter, the authors discuss hierarchical distributed hash tables (DHT) as an approach to reducing the overhead of maintaining the overlay network. In a two-level hierarchical DHT, the top-level overlay consists of groups of nodes, where each group is distinguished by a unique group identifier. In each group, one or more nodes are designated as supernodes and act as gateways to nodes at the second level. Collisions of groups occur when concurrent node joins result in the creation of multiple groups with the same group identifier. This has the adverse effects of increasing the lookup path length, due to a larger top-level overlay, and increasing the overhead of overlay network maintenance. The authors discuss two main approaches to address the group collision problem: collision detection-and-resolution, and collision avoidance. As an example, they describe an implementation of a hierarchical DHT that extends Chord as the underlying overlay graph.
INTRODUCTION
Structured peer-to-peer systems or distributed hash tables (DHT) are self-organizing distributed systems designed to support efficient and scalable lookups under dynamic network topology changes. Nodes are organized as structured overlay networks, and data is mapped to nodes in the overlay network based on their identifiers. There are two main types of structured peer-to-peer architectures: flat and hierarchical.
DOI: 10.4018/978-1-60566-661-7.ch007
A flat DHT (Alima, 2003; Ratnasamy, 2001; Stoica, 2001; Rowstron, 2001; Maymounkov, 2002; Zhao, 2001) organizes nodes into one overlay network, in which each node has the same responsibility and uses the same rules for routing messages. On the other hand, a hierarchical DHT organizes nodes into a multi-level overlay network with the primary aim of reducing the maintenance overhead of its overlay network. In a peer-to-peer system, peers join and leave the system dynamically. A process called stabilization updates the routing information maintained in each peer so as to keep the overlay network up to date (Ghinita, 2006).
A hierarchical DHT employs a multi-level overlay network where the top-level overlay consists of logical groups (Garcés-Erice, 2003; Harvey, 2003; Karger, 2004; Mislove, 2004; Tian, 2005; Xu, 2003; Zhao, 2003). Each group, which consists of a number of nodes, is assigned a group identifier with a specific objective such as improving administrative autonomy (Harvey, 2003; Mislove, 2004; Zhao, 2003), reducing network latency (Tian, 2005; Xu, 2003), or integrating various services into one system (Karger, 2004). Within a group, one or more nodes are selected as supernodes to act as gateways to the other nodes in the group. Within each group, nodes can further form a second-level overlay network.
In this chapter, we discuss the organization of a hierarchical DHT with the aim of reducing its overlay maintenance overhead. Using a two-level hierarchical Chord as an example, the top-level overlay network consists of groups with distinct group identifiers. However, a collision of groups occurs when two or more groups are created with the same group identifier. Collisions increase stabilization overhead and degrade lookup performance. To address the collision problem, we discuss two main approaches: collision detection-and-resolution, and collision avoidance.
The rest of this chapter is organized as follows. Section 2 presents an overview of flat DHTs using Chord as the example (Stoica, 2001). Three main approaches to reduce routing maintenance overhead are introduced: hierarchical DHT, varying the frequency of stabilization, and varying the number of routing states. Extending Chord into a hierarchical Chord DHT, Section 3 discusses two differing approaches to addressing the collision problem, namely collision detection-and-resolution and collision avoidance. Section 4 summarizes this chapter and discusses open issues.
DISTRIBUTED HASH TABLES
A distributed hash table (Gummadi, 2003; Hsiao, 2003; Ratnasamy, 2002; Stribling, 2004) is a decentralized lookup scheme designed to provide scalable lookups, i.e., a short lookup path length with a high result guarantee and a reduced number of false negative answers. The DHT protocol provides an interface to store and retrieve key-value pairs. A key is an identifier assigned to a resource; traditionally this key is a hash value associated with the resource. A value is an object to be stored in the DHT; this could be the shared resource itself, such as a file, an index or a pointer to a resource, or resource metadata. An example of a key-value pair is <SHA1(file name), http://peer-id/file>, where the key is the SHA1 hash of the file's name and the value is the address (location) of the file. To support scalable lookups with a high result guarantee, DHT exploits the following:
1. Key-to-node mapping: Assuming that keys and nodes share the same identifier space, DHT maps key k to node n, where n is the node closest to k in the identifier space; we refer to n as the responsible node of k. The key-to-node mapping improves the result guarantee because searching for a key-value pair amounts to locating the node responsible for the key (Loo, 2004).
Figure 1. Chord lookup
2. Data-item distribution: Key-value pairs, also called data items, with key equal to k are stored at node n independently of the owners of these key-value pairs. This is implemented in DHT by a store operation (Dabek, 2003; Rhea, 2005). The concept of data-item distribution has been further exploited for various optimizations, including load balancing (Godfrey, 2004; Godfrey, 2005; Karger, 2004) and high availability (Dabek, 2001; Ghodsi, 2005a; Kubiatowicz, 2000; Landers, 2004; Leslie, 2006).
3. Structured overlay network: Searching for and storing a key-value pair requires routing the request to the responsible node. To achieve scalable routing, nodes are organized as a structured overlay network. A structured overlay network exhibits two main properties: (i) it resembles a graph and is organized into a network topology such as a ring (Rowstron, 2001; Stoica, 2001), a torus (Ratnasamy, 2001), or a tree (Aberer, 2003; Maymounkov, 2002), and (ii) each node uses its identifier to position itself in the structured overlay network. The tradeoffs among different overlay topologies lie in routing performance and the overhead of maintaining routing states.
As an example of a DHT implementation, we discuss Chord, which supports O(log N)-hop lookup path length and maintains O(log N) routing states per node, where N denotes the total number of nodes (Stoica, 2001). Chord organizes nodes as a ring that represents an m-bit one-dimensional circular identifier space, and as a consequence, all arithmetic is modulo 2^m. To form a ring overlay, each node n maintains two pointers to its immediate neighbors, as shown in Figure 1(a). The successor pointer points to successor(n), the immediate clockwise neighbor of n. Similarly, the predecessor pointer points to predecessor(n), the immediate counter-clockwise neighbor of n.
Figure 2. Join operation in chord
In Chord, every piece of data is assigned an m-bit identifier called a key. Key k is then mapped onto successor(k), the first node whose identifier is equal to or greater than k in the identifier space (Figure 1(b)). Thus, node n is responsible for keys in the range (predecessor(n), n], i.e., keys that are greater than predecessor(n) but smaller than or equal to n. For example, node 32 is responsible for all keys in (21, 32]. All key-value pairs whose key equals k are then stored on successor(k), regardless of who owns the key-value pairs. This distribution of keys is called data-item distribution. Finding key k implies that we route a request to successor(k).
To achieve scalable routing, each node n maintains a finger table of m entries, as shown in Figure 1(c). Each entry in this table is also called a finger. The ith finger of n is denoted as n.finger[i] and points to successor(n + 2^(i−1)), where 1 ≤ i ≤ m. Note that the first finger is also the successor pointer, while the largest finger divides the circular identifier space into two halves. When N < 2^m, the finger table consists of only O(log N) unique entries (Stoica, 2001). By utilizing finger tables, Chord locates successor(k) in O(log N) hops with high probability (Stoica, 2001). Intuitively, the process resembles a binary search where each step halves the distance to successor(k). Thus, each node n forwards a request to the nearest known node preceding k. This is repeated until the request arrives at predecessor(k), the node whose identifier precedes k, which forwards the request to successor(k). Figure 1(d) shows an example of finding successor(54) initiated by node 8. Node 8 forwards the request to its sixth finger, which points to node 48. Node 48 is the predecessor of key 54 because its first finger points to node 56 and 48 < 54 ≤ 56. Finally, node 48 forwards the request to node 56.
Figure 2 illustrates the construction of a Chord ring. A new node n joins a Chord ring by locating its own successor. Then, n inserts itself between successor(n) and the predecessor of successor(n), as illustrated in Figure 2(a). The key-value pairs stored on successor(n) whose keys are less than or equal to n are migrated to node n (Figure 2(b)). Because the join operation invalidates the ring overlay, every node periodically invokes a maintenance process called stabilization to correct its successor and predecessor pointers (Figure 2(c)), and its remaining fingers.
A number of approaches have been proposed to reduce the maintenance overhead of DHTs. We classify these approaches into three main categories: hierarchical DHT, varying the frequency of stabilization, and varying the number of routing states. The last two approaches are directly applicable to both flat and hierarchical DHTs.
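To make the finger-table mechanics concrete, the following is a minimal, single-process Python sketch of the lookup just described. The ring of node identifiers is hypothetical (chosen to be consistent with the worked example above), intervals are treated on the identifier circle, and no networking, stabilization, or failure handling is modelled.

```python
M = 6                                    # identifiers are m = 6 bits: 0 .. 63
RING = 2 ** M
NODES = sorted([1, 8, 14, 21, 32, 38, 48, 56])   # hypothetical ring

def successor(k):
    """First node whose identifier is equal to or greater than k (mod 2^m)."""
    k %= RING
    return next((n for n in NODES if n >= k), NODES[0])

def between(x, lo, hi):
    """True if x lies in the clockwise half-open interval (lo, hi] of the circle."""
    return (x - lo) % RING <= (hi - lo) % RING and x != lo

def fingers(n):
    """The i-th finger of n points to successor(n + 2^(i-1)), 1 <= i <= m."""
    return [successor(n + 2 ** (i - 1)) for i in range(1, M + 1)]

def find_successor(start, key):
    """Route a lookup for `key` from node `start`, as in Figure 1(d)."""
    n, route = start, [start]
    while True:
        succ = successor((n + 1) % RING)          # immediate successor of n
        if between(key, n, succ):                 # key in (n, successor(n)] -> done
            return succ, route
        for f in reversed(fingers(n)):            # closest finger preceding the key
            if between(f, n, key) and f != key:
                n = f
                break
        route.append(n)

print(find_successor(8, 54))   # -> (56, [8, 48]): node 8 forwards to 48, which finds 56
```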
HIERARCHICAL DHT
In a hierarchical DHT, nodes are organized as a two-level overlay network. The top-level overlay consists of logical groups of nodes, where each group is identified by a group identifier (gid). In each group, one or more nodes are designated as supernodes and act as gateways to the nodes at the second level. Each node is assigned an identifier consisting of two subfields: a unique node identifier, as is common in DHTs, to distinguish different peers, and a group identifier to reflect the node's group. For example, in compute-cycle sharing, a group identifier denotes the type of shared resource or processor type (March, 2007). Grouping shared resources by processor type facilitates resource discovery and allocation. Figure 3 shows a hierarchical Chord system (Garcés-Erice, 2003), where nodes with the same gid form a group and the groups are organized in the top-level overlay network. Routing in the top-level and second-level overlays is based on the group identifier and the node identifier, respectively. A hierarchical DHT groups nodes based on various properties to achieve specific objectives. Examples include:
1. Grouping by administrative domains improves administrative autonomy and reduces latency (Harvey, 2003; Mislove, 2004; Zhao, 2003);
2. Grouping by physical proximity reduces network latency (Tian, 2005; Xu, 2003);
3. Grouping by services promotes the integration of services into one system (Karger, 2004).
In terms of topology maintenance, the hierarchical structure has the following advantages compared to the flat structure:
1. Lower overhead of overlay maintenance: Maintenance of a structured overlay network involves correcting nodes' routing states to adapt to the dynamic events of nodes joining, leaving, or failing. Since the hierarchical structure partitions nodes into multiple overlays, each of which is smaller than a flat overlay, maintenance messages are routed only within one of these smaller overlays. This speeds up the correction of routing states while reducing the number of stabilization messages processed by each node.
2. Isolation of churn: Topology changes within a group due to churn, i.e., continuous changes due to node joins, leaves, or failures, do not affect the top-level overlay or other groups. Stable overlay topologies improve the result guarantee of DHT lookups.
However, when new nodes join such a hierarchical DHT system, collisions of groups may occur. Collisions result in the top-level overlay containing two or more groups with the same group identifier, and increase the size of the overlay. For example, in a join operation, a new node first requests a bootstrap node to locate an existing group identified by gid. However, when the bootstrap node belongs to another group gid' and some routing states in the top-level overlay are incorrect, the bootstrap node may fail to locate group gid. Thus, instead of joining group gid, the new node creates a new group with the same gid. Collisions increase the size of the top-level overlay, which in turn increases the lookup path length and the total number of stabilization messages. In the worst case, collisions lead to the degeneration of the hierarchical structure into the flat structure, where every node occupies the top-level overlay. If the number of groups is c times larger than the number of ideal groups¹, the lookup path length is increased by O(log c) hops, but the total number of stabilization messages is increased by Ω(c) times.
Figure 3. Hierarchical structured chord
There are two main approaches to address the problem of collisions in hierarchical DHT systems:
1. Collision detection and resolution: With this approach, collisions are allowed to occur, but it is the responsibility of the hierarchical DHT system to detect collisions and merge the colliding groups into a single group (March, 2005). In systems such as the hierarchical Chord-based DHT (Garcés-Erice, 2003), Diminished Chord (Karger, 2004), Hieras (Xu, 2003), and HONet (Tian, 2005), collisions can occur but the problem is not directly addressed. They assume that collisions can be resolved by mechanisms inherent in the system structure, and the extent of collisions is not studied.
2. Collision avoidance: In hierarchical DHT systems, schemes can be devised to ensure that collisions do not occur. This can be achieved through collision-free join protocols (Teo, 2008) or collision-free grouping policies (Harvey, 2003; Karger, 2004; Mislove, 2004; Xu, 2003; Zhao, 2003). A collision-free join protocol such as that in (Teo, 2008) uses the predecessor node to serialize the join lookup operation. All nodes in the overlay network maintain accurate fingers, and new groups are reflected instantaneously by the predecessor supernode. The leave protocol is also modified to ensure the correctness of the finger table, the successor pointers, and the predecessor pointers. Thus, a departing supernode notifies its successor and predecessor to update their pointers accordingly. As long as the fingers are maintained in an accurate state, collisions do not occur.
    ◦ In hierarchical DHTs such as Brocade (Zhao, 2003), SkipNet (Harvey, 2003), and hierarchical Scribe (Mislove, 2004), collisions do not occur because a new node always chooses a bootstrap node from the same group. In such systems, nodes are grouped by their administrative domain; therefore, it is natural for the new node to choose a bootstrap node from the same administrative domain. This grouping policy guarantees that multiple groups with the same group identifier are not created. However, such systems do not address other grouping policies that can introduce collisions, i.e., when a new node is bootstrapped from a node in a different group.
    ◦ In (Karger, 2004; Xu, 2003), all nodes in a group are assumed to be supernodes; hence, collisions do not occur. However, the size of the top-level overlay, with or without collisions, is the same. In addition, the top-level overlay is larger than in systems where only a subset of nodes becomes supernodes. Thus, the total number of stabilization messages is increased because more supernodes have to perform stabilization.
VARYING FREQUENCY OF STABILIZATION
Frequency-based approaches such as adaptive stabilization (Castro, 2004; Ghinita, 2006), piggybacking stabilization with lookups (Alima, 2003; Li, 2005), and reactive stabilization (Alima, 2003) reduce the maintenance overhead by reducing the frequency with which routing-state correction procedures are invoked. Adaptive stabilization adjusts the frequency based on the churn rate and the importance of each routing state to lookup performance². Systems such as DKS (Alima, 2003) and Accordion (Li, 2005) piggyback stabilization with lookups to reduce the need for dedicated periodic stabilization; DKS refers to this as correction-on-use. Reactive stabilization, such as DKS's correction-on-change (Ghodsi, 2005), does away with periodic stabilization altogether. Instead, changes to the overlay network due to membership changes are propagated immediately when membership-change events are detected. However, Rhea et al. reported that reactive stabilization can increase maintenance overhead under a high churn rate and constrained bandwidth availability (Rhea, 2004).
As an example, we discuss the stabilization mechanism in DKS (Distributed k-ary Search). DKS is proposed as a framework that generalizes different DHT implementations as a k-ary search, e.g., Chord is an instance of DKS when k = 2 (Alima, 2003). Rather than periodic stabilization, DKS maintains its overlay network based on three main principles: local atomic operations, correction-on-use, and correction-on-change. With local atomic operations, DKS serializes concurrent node insertions/leaves between two existing adjacent nodes. This reduces the number of incorrect successor and predecessor pointers during churn. However, the local atomic join does not correct other routing states, such as fingers, affected by the churn. These routing states are corrected by correction-on-use and correction-on-change.
The correction-on-use technique piggybacks stabilization on lookup processing. If the number of lookup messages is high, then the overlay network can be maintained without the need for dedicated stabilization. Essentially, a routing table entry is not corrected until it is used during lookups. To realize correction-on-use, every lookup message contains information about the position of the receiver from the sender's perspective³. If the receiver determines that this information (i.e., the sender's perspective regarding the position of the receiver) is wrong, then the receiver advises the sender of the correct information (to the best of the receiver's knowledge). The disadvantage of correction-on-use is that the speed at which the overlay network is corrected depends on the amount of lookup traffic. To address this disadvantage, DKS also employs correction-on-change: after a new node joins, it notifies all nodes that need to be updated.
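The correction-on-use idea can be illustrated with a small sketch. The message fields and function names below are invented for the illustration and are not DKS's actual interface; the point is only that a lookup carries the sender's belief about the receiver's position and the receiver piggybacks a correction when that belief is stale.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Lookup:
    key: int                    # key being resolved
    sender: int                 # identifier of the forwarding node
    expected_owner_of: int      # sender thinks the receiver is successor(this id)

def on_lookup(my_id: int, my_predecessor: int, msg: Lookup, ring: int = 64) -> Optional[int]:
    """Receiver-side check used for correction-on-use: if the sender's view
    ('you are successor(expected_owner_of)') is wrong, advise it of the
    receiver's real predecessor so it can repair the routing entry it used."""
    def in_interval(x, lo, hi):                 # x in (lo, hi] on the ring
        return 0 < (x - lo) % ring <= (hi - lo) % ring
    if in_interval(msg.expected_owner_of, my_predecessor, my_id):
        return None                             # sender's entry is still accurate
    return my_predecessor                       # hint: a closer node now exists

# Node 56 (predecessor 48) receives a lookup whose sender still believes 56 is
# successor(40); node 48 has joined since, so 56 advises the sender about 48.
print(on_lookup(56, 48, Lookup(key=54, sender=8, expected_owner_of=40)))   # -> 48
```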
VARYING SIZE OF ROUTING TABLES
This approach reduces the size of routing tables so that the number of routing states to correct becomes smaller. Examples of DHTs that implement this approach include CAN (Ratnasamy, 2001), Koorde (Kaashoek, 2003), and Accordion (Li, 2005). However, reducing the size of routing tables potentially increases the lookup path length (Xu, 2003).
In Accordion (Li, 2005), the size of routing tables is controlled through the acquisition and eviction of routing states. The rate of state acquisition is determined by a specified bandwidth budget, while the rate of state eviction is influenced by the churn rate. During acquisition, new states are added into a routing table. Accordion couples DKS's correction-on-use approach with explicit stabilization, and the frequency of explicit stabilization is constrained by the bandwidth budget. During eviction, a node removes routing entries that point to nodes perceived to be non-existent. In addition, Accordion favors routing states that point to nodes with longer lifetimes; pointers to relatively newer nodes have a higher probability of being evicted. Thus, a higher bandwidth budget increases the routing-table size, whereas a higher churn rate reduces it.
Besides reducing the size of routing tables, a DHT can also partition each routing table into two parts: one part consisting of entries that are corrected through stabilization, and the other part consisting of cached entries. This reduces the maintenance overhead while achieving a shorter lookup path length. For example, in the latest implementation of Chord, a finger table consists of O(log N) fingers and a number of location-cache entries maintained by an LRU replacement policy (Stoica, 2001).
Hierarchical Chord
A hierarchical Chord partitions its nodes into a multi-level overlay network. Because nodes join a smaller overlay network than in a flat structure, each node maintains and corrects a smaller number of routing states. Figure 4 shows an example of hierarchical Chord. In hierarchical Chord, each node is assigned a group identifier (gid) and a unique node identifier (nid). We use the notation gid|nid to denote the group identifier and node identifier of each node. Nodes with the same gid form a group, and groups are organized in the top level as a Chord overlay network. Within each group, nodes are organized as a second-level overlay using the node identifier; the topology and stabilization mechanism of this second level can differ from the top level's. In each group, one or more nodes designated as supernodes act as gateways to the other nodes in the group. In Figure 4, node 0|5, node 2|7, node 4|2, and node 6|4 are respectively the supernodes of groups g0, g2, g4, and g6.
In hierarchical Chord, a lookup request for key k implies locating the group responsible for k. Figure 5 illustrates the process. Firstly, a lookup request for key k is routed to the supernode of the initiating group. Secondly, using the Chord lookup algorithm (Stoica, 2001), the lookup request is further routed to the supernode of the group whose group identifier is successor(k). Thirdly, the lookup request can be further forwarded to one of the second-level nodes in that group based on additional criteria. As shown in Figure 5, a lookup request for key 2, initiated by second-level node 6|6, is forwarded to its supernode 6|4 (step 1). In the top-level overlay, the lookup request is routed to supernode 2|7 of group 2 (step 2). Finally, supernode 2|7 can further forward the request to its second-level nodes (step 3), e.g., a lookup for compute resources of type 2 across multiple administrative domains (Teo, 2005).
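The three lookup steps can be traced in a toy, single-process sketch. The group memberships and the 3-bit group-identifier space below are assumptions made only for the illustration; the supernode assignments follow the example in the text, and a real deployment would use Chord routing between supernodes rather than a sorted list.

```python
GROUPS = {                              # gid -> (supernode nid, member nids)
    0: (5, [5]),                        # only the supernode nids come from the text;
    2: (7, [7, 2]),                     # the other members are invented
    4: (2, [2]),
    6: (4, [4, 6]),
}

def top_level_successor(gid):
    """Group responsible for gid: first group identifier >= gid (mod 8)."""
    gids = sorted(GROUPS)
    return next((g for g in gids if g >= gid % 8), gids[0])

def lookup(src_gid, src_nid, key):
    route = [f"{src_gid}|{src_nid}"]
    route.append(f"{src_gid}|{GROUPS[src_gid][0]}")   # step 1: to own supernode
    g = top_level_successor(key)
    route.append(f"{g}|{GROUPS[g][0]}")               # step 2: top-level routing
    route.append(f"{g}|{GROUPS[g][1][-1]}")           # step 3: into the target group
    return route

print(lookup(6, 6, key=2))    # -> ['6|6', '6|4', '2|7', '2|2'], as in Figure 5
```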
Figure 4. A two-level overlay network consisting of four groups
If new nodes join a hierarchical Chord while some routing states in the top-level overlay are incorrect, i.e., yet to be updated, then the top-level overlay may end up with two or more groups with the same group identifier. This is called a collision of groups. In the following subsections, we discuss how collisions occur and present a collision detection-and-resolution scheme and a collision avoidance scheme.
To avoid sending additional overhead messages, collision detection is performed together with successor stabilization, i.e., the process of correcting successor pointers. This is because successful collision detection requires the successor pointers in the top-level Chord overlay to be correct, and the correctness of the successor pointers is maintained by stabilization. In presenting our algorithms, we assume that each node maintains the list of variables shown in Table 1. The algorithms adopt the same convention as in (Stoica, 2001), where remote procedure calls or variables are preceded by the remote node identifier, while local procedure calls and variables omit the local node identifier.
Figure 5. Example of lookup in hierarchical Chord
Table 1. Variables maintained by node n in hierarchical Chord

Variable       Description
gid            m-bit group identifier
nid            m-bit node identifier
successor      pointer to successor(gid) if n is a supernode, nil otherwise
predecessor    pointer to predecessor(gid) if n is a supernode, nil otherwise
is_super       true if n is a supernode, false otherwise
supernode      pointer to the supernode of group gid if n is not a supernode, nil otherwise
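The per-node state in Table 1 maps naturally onto a small record type; the following sketch is only a convenience for illustration and is not code from the chapter.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HNode:
    gid: int                               # m-bit group identifier
    nid: int                               # m-bit node identifier
    is_super: bool = False                 # True if this node is a supernode
    successor: Optional["HNode"] = None    # successor(gid) if supernode, else None
    predecessor: Optional["HNode"] = None  # predecessor(gid) if supernode, else None
    supernode: Optional["HNode"] = None    # gateway supernode for ordinary nodes
```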
COLLISIONS OF GROUP IDENTIFIERS
Collisions of group identifiers arise from join operations invoked by nodes. Figure 6 shows the node-join algorithm for hierarchical Chord. Node n, whose group identifier is denoted as n.gid, makes a request to join group g through bootstrap node n'. In a hierarchical Chord, this means finding successor(g|0) in the top-level overlay. If n' successfully finds an existing group g, then n joins this group using a group-specific protocol (lines 5–9). However, if n' returns g' > g, then n creates a new group with identifier g (lines 11–15). A collision occurs if the new group is created even though a group with identifier g already exists. This happens when n and bootstrap node n' are in two different groups and the top-level overlay has not fully stabilized, i.e., some supernodes' successor pointers are yet to be updated. Figure 7 illustrates a collision scenario in which node 1|2 and node 1|3, belonging to the same group g1, join concurrently. Due to the concurrent joins, find_successor() invoked by both nodes returns node 2|7. As a result, the two node joins create two groups with the same group identifier g1.
Collisions increase the maintenance overhead in the top-level Chord ring by Ω(c) times. Let K denote the number of groups and N denote the number of nodes. Assuming that each group designates one supernode, the ideal size of the top-level overlay is K supernodes. Without collisions, the total number of stabilization messages is denoted as S. With collisions, the size of the top-level overlay is increased by c times, i.e., to cK groups. As each group performs periodic stabilization, the cost of stabilization with collisions (S_C) is Ω(cS). The stabilization cost ratio, with and without collisions, is shown in Equation 1.
S_C / S = (cK log2 cK) / (K log2 K) = (c log2 cK) / (log2 K) = Ω(c)    (1)
Collisions also increase the lookup path length in the top-level Chord by O(log c) hops. Without collisions, the top-level Chord ring consists of K supernodes, and hence, the lookup path length is O(log K). With collisions, the size of the top-level overlay becomes cK and the lookup path length is O(log cK) = O(log c + log K) hops.
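As a concrete illustration, with numbers chosen only for the example: for K = 64 ideal groups and a collision factor of c = 4, the top-level overlay grows to 256 supernodes; lookups lengthen by only about log2 4 = 2 hops, whereas the periodic stabilization traffic grows by roughly (256 log2 256) / (64 log2 64) ≈ 5.3 times, consistent with the Ω(c) bound.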
Figure 6. Join operation
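A minimal sketch of this collision-prone join behaviour follows; the actual algorithm is the one in Figure 6, and the data layout and helper names here are invented for the illustration. A stale top-level lookup stands in for the not-yet-stabilized routing state.

```python
def join(new_gid, find_successor_group, top_level):
    """top_level: a list of (gid, members) pairs standing in for the top-level ring."""
    found = find_successor_group(new_gid)            # lookup of successor(g|0)
    if found == new_gid:
        return "joined existing group %d" % new_gid  # lines 5-9 of Figure 6
    top_level.append((new_gid, ["new node"]))        # lines 11-15: create the group
    return "created group %d" % new_gid

top_level = [(0, ["0|5"]), (2, ["2|7"])]
stale = lambda gid: 2            # stale routing state: the lookup always answers group 2
print(join(1, stale, top_level))                     # node 1|2 joins and creates a group 1
print(join(1, stale, top_level))                     # node 1|3 joins and creates another group 1
print(sum(1 for gid, _ in top_level if gid == 1))    # -> 2: two groups share gid 1, a collision
```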
COLLISION DETECTION AND RESOLUTION SCHEME
Collisions can be detected during successor stabilization. This is achieved by extending Chord's stabilization so that it not only checks and corrects the successor pointer of supernode n, but also detects whether n and its new successor should be in the same group. Figure 8 presents a collision detection algorithm. It first ensures that the successor pointer of a node is valid (lines 4–5). It then checks for a potential collision (lines 8–10), before updating the successor pointer to point to the correct node (lines 11–13).
Figure 7. Collision at the top-level overlay
Figure 8. Collision detection algorithm
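The check that is piggybacked on successor stabilization can be sketched as follows. The class and the pointer wiring below recreate the state reached in Figure 9(b); the chapter's actual routine is the one shown in Figure 8, so treat this only as an illustration of the same-group test.

```python
class Supernode:
    def __init__(self, gid, nid):
        self.gid, self.nid = gid, nid
        self.successor = self.predecessor = None
    def __repr__(self):
        return f"{self.gid}|{self.nid}"

def stabilize_and_detect(n):
    """Successor stabilization step: examine successor.predecessor. If that
    node carries n's own group identifier, the two groups collide."""
    x = n.successor.predecessor
    collision = x is not None and x is not n and x.gid == n.gid
    if x is not None and x is not n:
        n.successor = x            # adopt the closer node (interval check omitted here)
    return collision

n05, n12, n13, n27 = Supernode(0, 5), Supernode(1, 2), Supernode(1, 3), Supernode(2, 7)
n12.successor = n27                # 1|2 joined believing 2|7 is its successor
n13.successor = n27
n27.predecessor = n13              # steps 1-3 of Figure 9(b)
n05.successor = n13
n13.predecessor = n05

print(stabilize_and_detect(n12), "-> merge", n12, "into", n12.successor)   # True -> merge 1|2 into 1|3
```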
Figure 9 illustrates the collision detection process. In Figure 9(a), a collision occurs when nodes 1|2 and 1|3, belonging to the same group, group 1, join concurrently. In Figure 9(b), node 1|3 stabilizes and causes node 2|7 to set its predecessor pointer to node 1|3 (step 1). Then, the stabilization by node 0|5 causes 0|5 to set its successor pointer to node 1|3 (step 2), and node 1|3 to set its predecessor pointer to node 0|5 (step 3).
Figure 9. Collision detection piggybacks successor stabilization
In Figure 9(c), the stabilization by node 1|2 causes 1|2 to set its successor pointer to node 1|3. At this point, a collision is detected by node 1|2 and is resolved by merging 1|2 into 1|3.
If each group contains more than one supernode, then the is_collision routine shown in Figure 8 may incorrectly detect collisions. Consider the example in Figure 10(a). When node n stabilizes, it incorrectly detects a collision with node n' because n.successor.predecessor = n' and n.gid = n'.gid. An approach to avoiding this problem is for each group to maintain a set of its supernodes (Garcés-Erice, 2003; Gupta, 2003) so that each supernode can accurately decide whether a collision has occurred. The modified collision detection algorithm is shown in Figure 10(b).
To resolve collisions, groups associated with the same gid are merged. After the merging, some supernodes, depending on the group policy, become ordinary nodes. Before a supernode changes its state into a second-level node, the supernode notifies its successors and predecessors to update their pointers (see Figure 11). Nodes in the second level also need to be merged into the new group. We discuss two methods to merge groups, namely supernode-initiated and node-initiated.
Figure 10. Collision detection for groups with several supernodes
Supernode Initiated
To merge two groups n.gid and n'.gid, supernode n notifies its second-level nodes to join group n'.gid (Figure 12). The advantage of this approach is that second-level nodes join a new group as soon as a collision is detected. However, n needs to keep track of its group membership. If n has only partial knowledge of the group membership, some nodes in the second level can become orphans.
Figure 11. Announce leave to preceding and succeeding supernodes
Figure 12. Collision resolution: supernode-initiated approach
Node Initiated
In node-initiated merging, each second-level node periodically checks that its known supernode n' is still a valid supernode (Figure 13). If n' is no longer a supernode, then the second-level node asks n' to find the current supernode. These second-level nodes then join the new group through the new supernode. This approach does not require supernodes to track group membership. However, it introduces additional overhead on the second-level nodes, as they periodically check the status of their supernode.
COLLISION AVOIDANCE SCHEME
Avoiding collisions has the following advantages:
1. Lower overhead: Runaway collisions are very costly, and detecting and resolving collisions is highly difficult in a decentralized and dynamic peer-to-peer system with a high churn rate (Teo, 2008).
2. Reduced bootstrap time: New peers can join the network at a faster rate because the time between the join event and the update of the underlying overlay network states is reduced.
3. Improved lookup performance: Without collisions, the top-level overlay is maintained at the ideal size.
4. Faster resource availability: As costly collision resolution is not necessary, resources are available as soon as the nodes join the network.
Figure 13. Collision resolution: node-initiated approach
In the join operation in Figure 6, a node performs a lookup for the group identifier, which is handled by the supernode of the successor group. If the joining node and the supernode that responds to the lookup have the same group identifier, the node joins the second-level overlay. Collisions occur when concurrent joins create multiple new groups with the same group identifier in the first-level overlay. This scenario arises because, before the routing states are updated, each joining node is unaware of the existence of the other joining nodes. To avoid collisions due to join requests, the join protocol is modified such that the predecessor node handles the join lookup request instead of the successor node. The rationale behind this change is that all join requests are serialized at the predecessor. If the group identifier of the successor's supernode is different from the group identifier of the joining node, then the predecessor immediately changes its successor pointer to reflect the new group created by the joining node. Thus, this modification allows the overlay network to reveal new groups to subsequent joining nodes and make them available to incoming lookups.
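A sketch of this predecessor-serialized join follows. The classes and the lock-based serialization are invented for the illustration (the full protocol is the one in Figure 14); the point is that the predecessor repoints its successor at the newly created group before answering, so a second concurrent join already sees it.

```python
import threading

class SupernodeEntry:
    def __init__(self, gid):
        self.gid = gid
        self.successor = None

class Predecessor(SupernodeEntry):
    def __init__(self, gid):
        super().__init__(gid)
        self._lock = threading.Lock()           # serializes join lookups

    def handle_join(self, joining_gid):
        with self._lock:                        # one join handled at a time
            if self.successor.gid == joining_gid:
                return ("join-second-level", self.successor)
            new_group = SupernodeEntry(joining_gid)
            new_group.successor = self.successor
            self.successor = new_group          # new group is visible immediately
            return ("became-supernode", new_group)

pred = Predecessor(gid=0)
pred.successor = SupernodeEntry(gid=2)
print(pred.handle_join(1)[0])   # first node of group 1 -> becomes its supernode
print(pred.handle_join(1)[0])   # concurrent second join -> joins the existing group
```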
Figure 14. Collision-free join operation
Join Protocol
The detailed join algorithm shown in Figure 14 is divided into the following steps:
1. A joining node performs a lookup for group gid, which is routed in the top-level overlay to the supernode whose identifier is successor(gid|0) (line 3).
2. If a group for the resource type exists, a supernode has already been created for the resource type and the joining node becomes a member of the second-level overlay (lines 5–7).
3. If a group for the resource type does not exist, the joining node becomes the supernode of a newly created group. The joining node then sets its predecessor and successor pointers accordingly (lines 9–11). In addition, the supernode in step 1 updates its successor pointer to the joining node.
4. Stabilization is used by the new supernode to build a finger table (line 12).
Figure 15. Leave operation
Leave Protocol
When a supernode leaves, its group becomes an orphan group if the supernode is the only one in the group. If a new node attempts to join the orphan group, then a collision occurs because the new node cannot locate the orphan group in the top-level overlay. Hence, a new group is created in the top-level overlay whose group identifier is the same as the orphan group's. To prevent this type of collision, the departing supernode notifies its first-level overlay successor and predecessor to update their finger tables. Furthermore, a new supernode needs to be elected for the orphan group to prevent collisions during subsequent node joins. Figure 15 presents a simple-but-costly leave protocol that reuses our collision-free join operation (Figure 14) to elect new supernodes. In this protocol, the orphan group is disbanded and all its members are forced to rejoin the system. Thus, the node that completes its join operation first becomes the new supernode.
Failures
A more complex case that leads to collisions is supernode failure. A supernode failure invalidates other nodes' successor pointers and finger tables. While inaccurate finger tables only degrade lookup performance, inaccurate successor pointers lead to collisions. However, avoiding collisions due to supernode failures is a challenging problem. Unlike departures (Section 3.3.2), where supernodes leave the overlay network gracefully, failures can be viewed as supernodes leaving the overlay network silently. This means that there is no notification to the overlay network to indicate that any collision avoidance procedure should be triggered. Hence, it is necessary for the system to detect the presence of supernode failures so that corrective measures can be initiated, e.g., the collision detection-and-resolution scheme presented in Section 3.2.
SUMMARY AND OPEN ISSUES
Efficient lookup is an essential service in peer-to-peer applications. In structured peer-to-peer systems, the dynamic joining, leaving, and failing of peer nodes change the structural properties of the overlay network. Stabilization, the process of overlay network maintenance, is a necessary overhead and impacts lookup performance. In this chapter, we discuss three main approaches to reducing overlay maintenance overhead, namely hierarchical DHT, varying the frequency of stabilization, and varying the number of routing states. We discuss in more detail hierarchical DHT, where nodes are organized as multi-level overlay networks. In hierarchical DHT, collisions of groups occur when concurrent node joins result in multiple groups with the same group identifier being created at the top-level overlay. Collisions increase the size of the top-level overlay by a factor c, which increases the lookup path length by only O(log c) hops but increases the total number of stabilization messages by Ω(c) times. To address the collision problem, we first present a collision detection-and-resolution scheme and two approaches to merge colliding groups, namely supernode-initiated and node-initiated. Though the effect of collisions can be reduced by collision detection and correction, the message overhead cost is high. A collision avoidance scheme in which join and leave operations are collision-free is then discussed.
The open issues of group collisions in hierarchical DHT include:
1. Current experimental results on both the collision detection-and-resolution and the avoidance schemes assume that node joins, leaves, and failures occur in isolation (March, 2005; Teo, 2008). However, in practice, these three events are interleaved, and this matters when the network churn rate is high. Thus, in addition to the frequency of the top-level overlay's stabilization during collision detection (March, 2005), churn also affects how often second-level nodes should check the status of their supernode in the node-initiated collision resolution approach. An adaptive method similar to (Ghinita, 2006) is a possible direction; however, this has not been studied in detail.
2. When a supernode leaves, the current collision-free leave protocol uses a simple but naïve approach to deal with orphan groups, in which all the second-level nodes are forced to rejoin the hierarchical DHT. A more efficient approach is required. For example, an efficient distributed election scheme can be used to select a supernode among the second-level nodes, so that only the elected supernode joins the top-level overlay.
3. Node failures are unplanned, and collisions that arise due to node failures are therefore harder to address. Avoiding collisions due to supernode failures is a challenge. We envisage two possible solutions, both using multiple supernodes. Firstly, each group can employ a number of backup supernodes so that the collision-free join protocol is able to resolve the orphan-group problem before redirecting new nodes to the group. Alternatively, each group can have multiple supernodes in the top-level overlay, but this comes at the expense of a larger top-level overlay.
REFERENCES Aberer, K., Cudr-Mauroux, P., Datta, A., Despotovic, Z., Hauswirth, M., & Punceva, M. (2003). P-Grid: A self-organizing structured p2p system. SIGMOD Record, 32(3), 29–33. doi:10.1145/945721.945729 Alima, L. O., El-Ansary, S., Brand, P., & Haridi, S. (2003). DKS (N, k, f): A Family of Low Communication, Scalable and Fault-tolerant Infrastructures for P2P Applications. In Proceedings of the 3rd IEEE Intl. Symp. on Cluster Computing and the Grid (pp. 344-350). New York: IEEE Computer Society Press. Androutsellis-Theotokis, S., & Spinellis, D. (2004). A survey of peer-to-peer content distribution technologies. ACM Computing Surveys, 36(4), 335–371. doi:10.1145/1041680.1041681 Castro, M., Costa, M., & Rowstron, A. (2004). Performance and Dependability of Structured Peer-toPeer Overlays. In Proceedings of the 2004 Intl. Conf. on Dependable Systems and Networks (pp. 9-18). New York: IEEE Computer Society Press. Dabek, F., Kaashoek, M. F., Karger, D., Morris, R., & Stoica, I. (2001). Wide-Area Cooperative Storage with CFS. In Proceedings of the 11th ACM Symp. on Operating Systems Principles (pp. 202-215). New York: ACM Press. Dabek, F., Zhao, B. Y., Druschel, P., Kubiatowicz, J., & Stoica, I. (2003). Towards a Common API for Structured Peer-to-Peer Overlays. In Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 33-44). Berlin: Springer-Verlag. Garcés-Erice, L., Biersack, E. W., Felber, P. A., Ross, K. W., & Urvoy-Keller, G. (2003). Hierarchical Peer-to-Peer Systems. In Proceedings of the 9th Intl. Euro-Par Conf. (pp. 1230-1239). Berlin: SpringerVerlag. Ghinita, G., & Teo, Y. M. (2006). An adaptive stabilization framework for distributed hash tables. In Proceedings of the 20th IEEE Intl. Parallel and Distributed Processing Symp. New York: IEEE Computer Society Press. Ghodsi, A., Alima, L. O., & Haridi, S. (2005). Low-bandwidth topology maintenance for robustness in structured overlay networks. In Proceedings of 38th Hawaii Intl. Conf. on System Sciences (p. 302). New York: IEEE Computer Society Press. Ghodsi, A., Alima, L. O., & Haridi, S. (2005a). Symmetric replication for structured peer-to-peer systems. In Proceedings of the 3rd Intl. Workshop on Databases, Information Systems and Peer-to-Peer Computing (p. 12). Berlin: Spinger-Verlag. Godfrey, B., Lakshminarayanan, K., Surana, S., Karp, R., & Stoica, I. (2004). Load balancing in dynamic structured p2p systems. In Proceedings of INFOCOM (pp. 2253- 2262). New York: IEEE Press. Godfrey, P. B., & Stoica, I. (2005). Heterogeneity and load balance in distributed hash tables. In Proceedings of INFOCOM (pp. 596-606). New York: IEEE Press. Gummadi, K., Gummadi, R., Gribble, S., Ratnasamy, S., Shenker, S., & Stoica, I. (2003). The impact of dht routing geometry on resilience and proximity. In Proceedings of ACM SIGCOMM (pp. 381-394). New York: ACM Press.
Gupta, I., Birman, K., Linga, P., Demers, A., & Renesse, R. V. (2003). Kelips: Building an efficient and stable P2P DHT through increased memory and background overhead. In Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 160-169). Berlin: Springer-Verlag. Harvey, N. J., Jones, M. B., Saroiu, S., Theimer, M., & Wolman, A. (2003). SkipNet: A scalable overlay network with practical locality properties. In Proceedings of the 4th USENIX Symp. on Internet Technologies and Systems (pp. 113-126). USENIX Association. Hsiao, H.-C., & King, C.-T. (2003). A tree model for structured peer-to-peer protocols. In Proceedings of the 3rd IEEE Intl. Symp. on Cluster Computing and the Grid (pp. 336-343). New York: IEEE Computer Society Press. Kaashoek, M. F., & Karger, D. R. (2003). Koorde: A simple degree-optimal distributed hash table. In Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 98-107). Berlin: Springer-Verlag. Karger, D. R., & Ruhl, M. (2004). Diminished chord: A protocol for heterogeneous subgroup. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 288-297). Berlin: Springer-Verlag. Karger, D. R., & Ruhl, M. (2004). Simple, efficient load balancing algorithms for peer-to-peer systems. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 131-140). Berlin: SpringerVerlag. Kubiatowicz, J., Bindel, D., Chen, Y., Eaton, P., Geels, D., Gummadi, R., et al. (2000). OceanStore: An Architecture for Global-Scale Persistent Storage. In Proceedings of the 9th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (pp. 190-201). New York: ACM Press. Landers, M., Zhang, H., & Tan, K.-L. (2004). PeerStore: Better performance by relaxing in peer-to-peer backup. In Proceedings of the 4th Intl. Conf. on Peer-to-Peer Computing (pp. 72-79). New York: IEEE Computer Society Press. Leslie, M., Davies, J., & Huffman, T. (2006). replication strategies for reliable decentralised storage. In Proceedings of the 1st Workshop on Dependable and Sustainable Peer-to-Peer Systems (pp. 740-747). New York: IEEE Computer Society Press. Li, J., Stribling, J., Gil, T. M., Morris, R., & Kaashoek, M. F. (2004). Comparing the performance of distributed hash tables under churn. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 87-99). Berlin: Springer-Verlag. Li, J., Stribling, J., Morris, R., & Kaashoek, M. F. (2005). Bandwidth-efficient management of dht routing tables. In Proceedings of 2nd Symp. on Networked Systems Design and Implementation (pp. 99-114). USENIX Association. Loo, B. T., Huebsch, R., Stoica, I., & Hellerstein, J. M. (2004). The case for a hybrid p2p search infrastructure. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 141-150). Berlin: Springer-Verlag. March, V., Teo, Y. M., Lim, H. B., Eriksson, P., & Ayani, R. (2005). Collision detection and resolution in hierarchical peer-to-peer systems. In Proceedings of the 30th IEEE Conf. on Local Computer Networks (pp. 2-9). New York: IEEE Computer Society Press.
March, V., Teo, Y. M., & Wang, X. (2007). DGRID: A DHT-based resource indexing and discovery scheme for computational grids. In Proceedings of the 5th Australasian Symp. on Grid Computing and e-Research (pp. 41-48). Australian Computer Society, Inc. Maymounkov, P., & Mazieres, D. (2002). Kademlia: A peer-to-peer information system based on the XOR metric. In Proceedings of the 1st Intl. Workshop on Peer-to-Peer Systems (pp. 53-65). Berlin: Springer-Verlag. Mislove, A., & Druschel, P. (2004). Providing administrative control and autonomy in structured peerto-peer overlays. Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 162-172). Berlin: Springer-Verlag. Oram, A. (2001). Peer-to-Peer: Harnessing the power of disruptive technologies. O’Reilly. Rao, A., Lakshminarayanan, K., Surana, S., Karp, R., & Stoica, I. (2003). Load Balancing in structured P2P systems. Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 68-79). Berlin: Springer-Verlag. Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (2001). A scalable content-addressable network. Proceedings of ACM SIGCOMM (pp. 161-172). New York: ACM Press. Ratnasamy, S., Stoica, I., & Shenker, S. (2002). Routing algorithms for DHTs: Some open questions. Proceedings the 1st Intl. Workshop on Peer-to-Peer Systems (pp. 45-52). Berlin: Springer-Verlag. Rhea, S., Geels, D., Roscoe, T., & Kubiatowicz, J. (2004). Handling Churn in a DHT. Proceedings of the USENIX (pp. 127-140). USENIX Association. Rhea, S., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S., et al. (2005). OpenDHT: A public DHT service and its uses. In Proceedings of ACM SIGCOMM (pp. 73-84). New York: ACM Press. Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, distributed object location and routing for largescale peer-to-peer systems. In Proceedings of IFIP/ACM Intl. Conf. on Distributed Systems Platforms (pp. 329-350). Berlin: Springer-Verlag. Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A scalable peerto-peer lookup service for Internet applications. In Proceedings of ACM SIGCOMM (pp. 149-160). New York: ACM Press. Teo, Y. M., & Mihailescu, M. (2008). Collision avoidance in hierarchical peer-to-peer systems. In Proceedings of 7th Intl. Conf. on Networking (pp. 336-341). New York: IEEE Computer Society Press. Tian, R., Xiong, Y., Zhang, Q., Li, B., Zhao, B. Y., & Li, X. (2005). Hybrid Overlay Structure Based on Random Walks. In Proceedings of the 4th Intl. Workshop on Peer-to-Peer Systems (pp. 152-162). Berlin: Springer-Verlag. Xu, J. (2003). On the fundamental tradeoffs between routing table size and network diameter in peerto-peer networks. In Proceedings of INFOCOM (pp. 2177-2187). New York: IEEE Press.
Xu, Z., Min, R., & Hu, Y. (2003). HIERAS: A DHT based hierarchical p2p routing algorithm. In Proceedings of the 2003 Intl. Conf. on Parallel Processing (pp. 187-194). New York: IEEE Computer Society Press. Zhao, B. Y., Duan, Y., Huang, L., Joseph, A., & Kubiatowicz, J. (2003). Brocade: landmark routing on overlay networks. In Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 34-44). Berlin: Springer-Verlag. Zhao, B. Y., Kubiatowicz, J., & Joseph, A. D. (2001). Tapestry: An infrastructure for fault-tolerant widearea location and routing, (Tech. Rep.). UC Berkeley, Computer Science Department, Berkeley, CA.
KEY TERMS AND DEFINITIONS
Chord: A structured overlay network with nodes organized as a logical ring.
Churn: Changes in overlay networks due to dynamic node joins, leaves, or failures.
Collision of Groups: An occurrence in which two or more groups with the same group identifier occupy the top-level overlay network.
Distributed Hash Table: A class of distributed systems where keys are mapped onto nodes and nodes are organized as a structured overlay network to support a scalable lookup service.
Finger: An entry in each node's routing table (finger table) in Chord.
Key-Value Pair: A tuple consisting of a unique identifier (key) and an object (value) to be stored in the DHT.
Predecessor: The immediate counter-clockwise neighbor of a node in Chord.
Successor: The immediate clockwise neighbor of a node in Chord.
Supernode: A gateway node to a second-level hierarchical overlay network.
Stabilization: A procedure to keep the routing information in each peer node updated.
ENDNOTES
1. Size of the top-level overlay without collisions.
2. Routing states with higher importance, such as successor pointers in Chord (Stoica, 2001) and leaf sets in Pastry (Rowstron, 2001), are refreshed/corrected more frequently.
3. This is possible due to the k-ary model.
Chapter 8
Load Balancing in Peer-to-Peer Systems
Haiying Shen, University of Arkansas, USA
ABSTRACT
Structured peer-to-peer (P2P) overlay networks like Distributed Hash Tables (DHTs) map data items to the network based on a consistent hashing function. Such mapping for data distribution has an inherent load balance problem. Thus, a load balancing mechanism is an indispensable part of a structured P2P overlay network for high performance. The rapid development of P2P systems has posed challenges in load balancing due to their features characterized by large scale, heterogeneity, dynamism, and proximity. An efficient load balancing method should be flexible and resilient enough to deal with these characteristics. This chapter first introduces P2P systems and load balancing in P2P systems. It then introduces the current technologies for load balancing in P2P systems, and provides a case study of a dynamism-resilient and proximity-aware load balancing mechanism. Finally, it indicates the future and emerging trends of load balancing, and concludes the chapter.
1. INTRODUCTION
A peer-to-peer (P2P) overlay network is a logical network on top of a physical network in which peers are organized without any centralized coordination. Each peer has equivalent responsibilities and offers both client and server functionalities to the network for resource sharing. Over the past years, the immense popularity of P2P resource sharing services has produced a significant stimulus to content-delivery overlay network research (Xu, 2005). An important class of overlay networks is structured P2P overlays, i.e., distributed hash tables (DHTs), which map keys to the nodes of a network based on a consistent hashing function (Karger, 1997).
DOI: 10.4018/978-1-60566-661-7.ch008
Representatives of the DHTs include CAN (Ratnasamy, 2001), Chord (Stoica, 2003), Pastry (Rowstron, 2001), Tapestry (Zhao, 2001), Kademlia (Maymounkov, 2002), and Cycloid (Shen, 2006); see (Shen, 2007) and references therein for details of these representative DHTs. In a DHT overlay, each node and key has a unique ID, and each key is mapped to a node according to the DHT definition. The ID space of each DHT is partitioned among the nodes, and each node is responsible for those keys whose IDs are located in its portion of the space. For example, in Chord, a key is stored at the node whose ID is equal to or succeeds the key's ID. However, a downside of consistent hashing is uneven load distribution. In theory, consistent hashing produces a bound of O(log n) imbalance of keys between nodes, where n is the number of nodes in the system (Karger, 1997).
Load balancing is an indispensable part of DHTs. The objective of load balancing is to prevent nodes from being overloaded by distributing the application load among the nodes in proportion to their capacities. Although the load balancing problem has been studied extensively in the general context of parallel and distributed systems, the rapid development of P2P systems has posed new challenges in load balancing due to their features characterized by large scale, heterogeneity, dynamism/churn, and proximity. An efficient load balancing method should be flexible and resilient enough to deal with these characteristics. Network churn represents a situation where a large percentage of nodes and items join, leave, and fail continuously and rapidly, leading to unpredictable P2P network size. Effective load balancing algorithms should work for DHTs with and without churn and, meanwhile, be capable of exploiting the physical proximity of the network nodes to minimize operation cost. By proximity, we mean that the logical proximity abstraction derived from DHTs does not necessarily match the physical proximity information in reality. In the past, numerous load balancing algorithms have been proposed with different characteristics (Stoica, 2003; Rao, 2003; Godfrey, 2006; Zhu, 2005; Karger, 2006). This chapter is dedicated to providing the reader with a complete understanding of load balancing in P2P overlays.
The rest of this chapter is organized as follows. In Section 2, we give an in-depth background of load balancing algorithms in P2P overlays. We move on to present the load balancing algorithms, discussing their goals, properties, initialization, and classification, in Section 3. Also, we present a case study of a dynamism-resilient and locality-aware load balancing algorithm. In Section 4, we discuss the future and emerging trends in the domain of load balancing, and present the current open problems in load balancing from the P2P overlay network perspective. Finally, in Section 5 we conclude this chapter.
2. BACKGROUND
Over the past years, the immense popularity of the Internet has produced a significant stimulus to P2P file sharing systems. A recent study of large-scale traffic characterization (Saroiu, 2002) shows that more than 75% of Internet traffic is generated by P2P applications. Load balancing is an inherent problem in DHTs based on consistent hashing functions. Karger et al. proved that the consistent hashing function in Chord (Karger, 1997) leads to a bound of O(log n) imbalance of keys between the nodes. Load imbalance adversely affects system performance by overloading some nodes, while preventing a P2P overlay from taking full advantage of all resources. One main goal of P2P overlays is to harness all available resources such as CPU, storage, and bandwidth in the P2P network so that users can efficiently and effectively access files. Therefore, load balancing is crucial to achieving high performance of a P2P overlay. It helps to avoid overloading nodes and to make full use of all available resources in the P2P overlay.
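The imbalance can be observed in a toy simulation; the parameters below (128 nodes, 50,000 keys, SHA-1 identifiers on a 32-bit circle) are invented for the illustration.

```python
import bisect, hashlib

def h(s, bits=32):
    """Hash a string onto a 2^bits identifier circle."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** bits)

nodes = sorted(h(f"node-{i}") for i in range(128))
load = dict.fromkeys(nodes, 0)
for k in range(50_000):
    kid = h(f"key-{k}")
    owner = nodes[bisect.bisect_left(nodes, kid) % len(nodes)]   # successor mapping
    load[owner] += 1

print(max(load.values()) / (sum(load.values()) / len(load)))    # typically well above 1
```

The printed ratio of the heaviest node's load to the mean is typically several times greater than one, which is the kind of skew that the load balancing techniques surveyed below aim to remove.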
Load balancing in DHT networks remains challenging because of their two unique features:
1. Dynamism. A defining characteristic of DHT networks is dynamism/churn. A great number of nodes join, leave, and fail continually and rapidly, leading to unpredictable network size. A load balancing solution should be able to deal with the effect of churn. The popularity of items may also change over time. A load balancing solution that works for static situations does not necessarily guarantee good performance in dynamic scenarios. Skewed query patterns may also result in a considerable number of visits to hot spots, hindering efficient item access.
2. Proximity. A load balancing solution tends to utilize proximity information to reduce the load balancing overhead. However, the logical proximity abstraction derived from DHTs does not necessarily match the physical proximity information in reality. This mismatch becomes a big obstacle for the deployment and performance optimization of P2P applications.
In addition, DHT networks are often highly heterogeneous. With the increasing emergence of diversified end devices on the Internet equipped with various computing, networking, and storage capabilities, the heterogeneity of the participating peers of a practical P2P system is pervasive. This requires a load balancing solution to distribute not only the application load (e.g., file size, access volume), but also the load balancing overhead, among the nodes in proportion to their capacities. Recently, numerous load balancing methods have been proposed. They can be classified into three categories: virtual server, load transfer, and ID assignment or reassignment. Virtual server methods (Stoica, 2003; Godfrey, 2005) map keys to virtual servers, whose number is much larger than the number of real nodes. Each real node runs a number of virtual servers, so that each real node is responsible for an O(1/n) fraction of the key ID space with high probability. Load transfer methods (Rao, 2003; Karger, 2004; Zhu, 2005) move load from heavily loaded nodes to lightly loaded nodes to achieve load balance. ID assignment or reassignment methods (Bienkowski, 2005; Byers, 2003) assign a key to a lightly loaded node among a number of options, or reassign a key from a heavily loaded node to a lightly loaded node.
3. LOAD BALANCING METHODS
3.1 Examples of Load Balancing Methods
In this section we will review various load balancing methods that have been proposed for structured P2P overlays over the last few years. For each method, we will review its goals, algorithms, properties, and pros and cons.
3.1.1 Virtual Server
Basic virtual server method. Consistent hashing leads to a bound of O(log n) imbalance of keys between nodes. Karger et al. (1997) pointed out that the O(log n) imbalance can be reduced to an arbitrarily small constant by having each node run O(log n) virtual nodes, each with its own identifier. If each real node runs v virtual nodes, all bounds should be multiplied by v. Based on this principle, Stoica et al. (2003) proposed an abstraction of virtual servers for load balancing in Chord. With the virtual server method, Chord makes the number of keys per node more uniform by associating keys with virtual nodes, and mapping
multiple virtual nodes (with unrelated identifiers) to each real node. This provides a more uniform coverage of the identifier space. For example, if log n randomly chosen virtual nodes are allocated to each real node, with high probability each of the n bins will contain O(log n) virtual nodes (Motwani, 1995). The virtual server-based approach to load balancing is simple in concept, and there is no need to change the underlying DHT. However, the abstraction incurs a large space overhead and compromises lookup efficiency. The storage for each real server increases from O(log n) to O(log² n), and the network traffic increases considerably, by a factor of Ω(log n). In addition, node joins and departures generate high overhead for nodes to update their routing tables. In short, the abstraction of virtual servers simplifies the treatment of the load balancing problem at the cost of higher space overhead and compromised lookup efficiency. Moreover, the original concept of virtual servers ignores node heterogeneity. Y0 DHT protocol. Godfrey et al. (2005) addressed the problems of the virtual server method by assigning each real server a set of consecutive virtual IDs in the ID space. This reduces the load imbalance from O(log n) to a constant factor. The authors developed a DHT protocol based on Chord, called Y0, that achieves load balancing with minimal overhead under the assumption that the load is uniformly distributed in the ID space. They proved that Y0 achieves near-optimal load balancing with low overhead, and that it increases the size of the routing tables by at most a constant factor. Y0 is based on the concept of virtual servers, but with a twist: instead of picking k virtual servers with random IDs, a node clusters those IDs in a random fraction Θ(k/n) of the ID space. This allows the node to share a single set of overlay links among all k virtual servers. As a result, the number of links per physical node is still Θ(log n), even with Θ(log n) virtual servers per physical node. To deal with node heterogeneity, Y0 arranges for higher-capacity nodes to have a denser set of overlay links, and allows lower-capacity nodes to be less involved in routing. This results in a reduced lookup path length compared to the homogeneous case in which all nodes have the same number of overlay links. Y0 achieves a more significant improvement than Chord with the original concept of virtual servers, because its placement of virtual servers provides more control over the topology. Simulation results show that Y0 reduces the load imbalance of Chord from O(log n) to less than 3.6 without increasing the number of links per node. In addition, the average path length is significantly reduced as node capacities become increasingly heterogeneous. For a real-world distribution of node capacities, the path length in Y0 is asymptotically less than half the path length in the case of a homogeneous system. Y0 operates under the uniform load assumption that the load of each node is proportional to the size of the ID space it owns. This is reasonable when all objects generate similar load (e.g., have the same size), the object IDs are randomly chosen (e.g., are computed as a hash of the object's content), and the number of objects is large compared to the number of nodes (e.g., Ω(n log n)). However, some of these conditions may not hold in reality. Virtual node activation. In virtual server methods, to maintain the connectivity of the network, every virtual node needs to periodically check its neighbors to ensure their status is up to date.
More virtual nodes lead to higher overhead for neighbor maintenance. Karger and Ruhl (2004) coped with this problem by arranging for each real node to activate only one of its O(log n) virtual servers at any given time. The real node occasionally checks its inactive virtual servers and may migrate to one of them if the distribution of load in the system has changed. Since only one virtual node is active, the overhead for neighbor information storage and neighbor maintenance is not increased in a real node. As in Chord with the original virtual server method, this scheme gives each real node a small number of addresses on the Chord ring, preserving Chord's protection against address spoofing by malicious
nodes trying to disrupt the routing layer. Combining the virtual node activation load-balancing scheme with the Koorde routing protocol (Kaashoek, 2003), the authors obtained a protocol that simultaneously offers (i) O(log n) degree per real node, (ii) O(log n / log log n) lookup hops, and (iii) constant-factor load balance. The authors claimed that previous protocols could achieve any two of these properties but not all three. Generally speaking, achieving (iii) required operating O(log n) virtual nodes, which pushed the degree to O(log² n) and failed to achieve (i).
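As a rough illustration of the basic virtual server idea reviewed above, the sketch below gives each physical node log2 n virtual IDs on the ring and sums the keys landing on all of a node's virtual servers. The hash function, ring size, and counts are made-up assumptions; compared with one ID per node, the per-node key count becomes noticeably more uniform.

```python
import hashlib
import math
from bisect import bisect_left

ID_SPACE = 2 ** 32

def ring_id(name: str) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % ID_SPACE

n_phys = 64
v = int(math.log2(n_phys))                 # log2(n) virtual servers per physical node
owner = {ring_id(f"node{i}/vs{j}"): i for i in range(n_phys) for j in range(v)}
ring = sorted(owner)                       # all virtual IDs on the ring, in order

def physical_owner(key_id: int) -> int:
    """Find the successor virtual ID on the ring, then resolve it to its physical node."""
    return owner[ring[bisect_left(ring, key_id) % len(ring)]]

load = [0] * n_phys
for k in range(4096):
    load[physical_owner(ring_id(f"key-{k}"))] += 1
print("max/avg keys per physical node:", max(load) / (4096 / n_phys))
```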
3.1.2 ID Assignment or Reassignment
In this category of load balancing methods, most proposals are similar in that they consider a number of (typically Θ(log n)) locations for a node and select the one which gives the best load balance. The proposals differ in which locations should be considered and when the selection should be conducted (Godfrey, 2005). Some proposals arrange for a newly joined node to select a location, while others let a node re-select a location when it is overloaded. Naor and Wieder (2003) proposed a method in which a node checks Θ(log n) random IDs when joining, and chooses the ID which leads to the best load balance. They show that this method produces a maximum share of 2 if there are no node deletions. Share is an important metric for evaluating the performance of a load balancing method (Godfrey, 2005). Node v's share is defined as:

share(v) = fv / (cv / n),
where fv is the fraction of the ID space assigned to node v, and cv is the normalized capacity of node v, such that the average capacity is 1 and Σv cv = n. To handle the load imbalance incurred by node departures, nodes are divided into groups of Θ(log n) nodes and periodically reposition themselves within each group. Adler et al. (2003) proposed to let a joining node randomly contact an existing node already in the DHT. The joining node then chooses an ID in the longest interval owned by one of the contacted node's O(log n) neighbors, dividing that interval in half. As a result, the intervals owned by nodes have almost the same length, leading to an O(1) maximum share. Manku (2004) proposed a load balancing algorithm in which a newly joined node randomly chooses a node and splits in half the largest interval owned by one of the Θ(log n) nodes adjacent to the chosen node in the ID space. This achieves a maximum share of 2 while moving at most one node ID for each node arrival or departure. It extends to balancing within a factor of 1 + ε, but moves Θ(1/ε) IDs, for any ε > 0. As mentioned above, Karger and Ruhl (2004) proposed an algorithm in which each node has O(log n) virtual nodes, i.e., IDs, and periodically selects one of them as its active ID. This achieves a maximum share of 2 + ε, but requires the reassignment of O(log log n) IDs per arrival or departure. Bienkowski et al. (2005) proposed a node departure and re-join strategy to balance the key ID intervals across the nodes. In this algorithm, lightly loaded nodes leave the system and rejoin with a new ID to share the load of heavily loaded ones. The strategy reduces the number of reassignments to a constant, but guarantees only an O(1) maximum share. Byers et al. (2003) proposed the use of the "power of two choices" algorithm. In this algorithm, each object is hashed to d ≥ 2 different IDs, and is placed in the least loaded node among the nodes
responsible for those IDs. The other nodes are given a redirection pointer to the destination node so that searching is not slowed significantly. For homogeneous nodes and objects and a static system, picking d = 2 achieves a load balance within a log log n factor of optimal, and when d = Θ(log n), the load balance is within a constant factor of optimal. The ID assignment or reassignment methods reassign IDs to nodes in order to maintain load balance as nodes arrive and depart. The object transfers and neighbor updates involved in ID rearrangement incur a high overhead. Moreover, few of these methods directly take into account the heterogeneity of file load.
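Below is a minimal sketch of the "power of two choices" placement just described: each object is hashed with d salts to derive d candidate nodes and is stored on the least-loaded candidate (the redirection pointers mentioned above are omitted). The hashing scheme, node names, and counts are illustrative assumptions rather than the exact construction of Byers et al.

```python
import hashlib
from collections import Counter

def candidates(obj: str, nodes: list[str], d: int = 2) -> list[str]:
    """Derive d candidate nodes for an object by hashing it with d different salts."""
    out = []
    for i in range(d):
        h = int(hashlib.sha1(f"{obj}#{i}".encode()).hexdigest(), 16)
        out.append(nodes[h % len(nodes)])
    return out

nodes = [f"node-{i}" for i in range(64)]
load = Counter()
placement = {}
for o in range(4096):
    obj = f"object-{o}"
    cands = candidates(obj, nodes, d=2)
    target = min(cands, key=lambda n: load[n])   # store on the least-loaded candidate
    placement[obj] = target
    load[target] += 1
print("max/avg objects per node:", max(load.values()) / (4096 / len(nodes)))
```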
3.1.3 Load Transfer
The virtual server methods and the ID assignment or reassignment methods ignore the heterogeneity of file load. Further load imbalance may result from a non-uniform distribution of files in the identifier space and a high degree of heterogeneity in file loads and node capacities. In addition, few of the methods are able to deal with both network churn and proximity. In general, DHT churn should be dealt with by randomized matching between heavily loaded nodes and lightly loaded nodes. Load transfer methods, which move load from heavily loaded nodes to lightly loaded nodes, can deal with these problems. Rao et al. (2003) proposed three algorithms to rearrange load based on nodes' different capacities: one-to-one, many-to-many, and one-to-many. Their basic idea is to move virtual servers, i.e., load, from heavily loaded nodes to lightly loaded nodes so that each node's load does not exceed its capacity. Specifically, the method periodically collects information on servers' load status, which is used to rearrange load between heavily loaded nodes and lightly loaded nodes. The algorithms differ primarily in the amount of information used to decide load rearrangement. In the one-to-one algorithm, each lightly loaded server randomly probes nodes for a match with a heavily loaded one. In the many-to-many algorithm, each heavily loaded server sends its excess virtual nodes to a global pool, which executes load rearrangement periodically. The one-to-one scheme produces too many probes, while the many-to-many scheme increases the overhead of load rearrangement. As a trade-off, in the one-to-many algorithm each heavily loaded server randomly chooses a directory which contains information about a number of lightly loaded servers, and moves its virtual servers to lightly loaded servers until it is no longer overloaded. In a DHT overlay, a node's load may vary greatly over time, since the system can be expected to experience continuous insertions and deletions of objects, skewed object arrival patterns, and continuous arrival and departure of nodes. To cope with this problem, Godfrey et al. (2006) extended Rao's work (Rao, 2003) to dynamic DHT networks with rapid arrivals and departures of items and nodes. In their approach, if a node's capacity utilization exceeds a predetermined threshold, its excess virtual servers are moved to a lightly loaded node immediately, without waiting for the next periodic balancing. The authors studied this algorithm using extensive simulations over a wide set of system scenarios and algorithm parameters.
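The following is a hedged sketch of the one-to-many idea from Rao et al. (2003) described above: an overloaded node consults a directory holding lightly loaded nodes and sheds virtual servers until it is no longer overloaded. The Node class, the smallest-first shedding order, and the fit test are illustrative assumptions rather than the authors' exact algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    capacity: float
    virtual_servers: dict = field(default_factory=dict)   # vs_id -> load

    @property
    def load(self) -> float:
        return sum(self.virtual_servers.values())

    @property
    def overloaded(self) -> bool:
        return self.load > self.capacity

def one_to_many(heavy: Node, directory: list[Node]) -> None:
    """Move the heavy node's virtual servers onto lightly loaded directory entries."""
    lights = sorted(directory, key=lambda n: n.capacity - n.load, reverse=True)
    # shed the smallest virtual servers first so each move is cheap (a heuristic)
    for vs_id, vs_load in sorted(heavy.virtual_servers.items(), key=lambda kv: kv[1]):
        if not heavy.overloaded:
            break
        for light in lights:
            if light.load + vs_load <= light.capacity:
                light.virtual_servers[vs_id] = heavy.virtual_servers.pop(vs_id)
                break

# usage sketch with made-up loads
heavy = Node("h", capacity=10, virtual_servers={"vs1": 6, "vs2": 4, "vs3": 3})
pool = [Node("a", 8, {"vs9": 2}), Node("b", 5)]
one_to_many(heavy, pool)
print(heavy.load, [(n.name, n.load) for n in pool])
```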
Most recently, Karger and Ruhl (2004) proved that the virtual server method cannot be guaranteed to handle item distributions in which a key ID interval carries more than a certain fraction of the load. As a remedy, they proposed two schemes with provable properties, moving items and moving nodes, which achieve equal load between a pair of nodes and thereby converge to a system-wide load balance state. In the moving items scheme, every node occasionally contacts a random other node. If one of the two nodes has a much larger load than the other, items are moved from the heavily loaded node to the lightly loaded node until their loads become equal. In the moving nodes scheme, if a pair of nodes has very uneven loads, the load of the heavier node is split between the two nodes by changing their addresses. However, this scheme breaks the DHT mapping and cannot support key location as usual. Karger and Ruhl (2004) provided a theoretical treatment of the load balancing problem and proved that good load balance can be achieved by moving items if the fraction of the address space covered by every node is O(1/n) (Karger, 2004). Almost all of these algorithms assume the objective of minimizing the amount of moved load. The algorithms treat all nodes equally in random probing, and neglect the effect of physical proximity on the effectiveness of load balancing. With proximity taken into consideration, load transfer and communication should occur between physically close heavy and light nodes. One of the first works to utilize proximity information to guide load balancing is due to Zhu and Hu (2005). They presented a proximity-aware algorithm that takes node proximity information into account in load balancing. The authors suggested building a K-nary tree (KT) structure on top of a DHT overlay. Each KT node is planted in a virtual server. A KT node reports the load information of its real server to its parent, until the tree root is reached. The root then disseminates the final information to all the virtual nodes. Using this information, each real server can determine whether it is heavily loaded or not. Lightly loaded and heavily loaded nodes report their free capacity and excess virtual node information, respectively, to their KT leaf nodes. The leaf nodes propagate the information upwards along the tree. When the total amount of information reaches a certain threshold, the KT node executes load rearrangement between heavily loaded nodes and lightly loaded nodes. The KT structure helps to use proximity information to move load between physically close heavily and lightly loaded nodes. However, the construction and maintenance of the KT are costly, especially under churn. In churn, a KT will be destroyed without timely fixes, degrading load balancing efficiency. For example, when a parent fails or leaves, the load imbalance of its children in the subtree cannot be resolved before its recovery. Therefore, although the network is self-organized, the algorithm is hardly applicable to DHTs with churn. Besides, the tree needs to be reconstructed after every virtual server transfer, and such transfers are an integral part of load balancing. Second, a real server cannot start determining its load condition until the tree root has accumulated the information from all nodes. This centralized process is inefficient and hinders the scalability of P2P systems.
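Below is a minimal sketch of the pairwise "moving items" scheme of Karger and Ruhl reviewed above: two randomly paired nodes compare loads and, if the imbalance exceeds a threshold, the heavier node ships items to the lighter one. The imbalance threshold, the stopping rule, and the dict-based item layout are illustrative assumptions, not the authors' exact algorithm.

```python
def equalize_pair(a_items: dict, b_items: dict, threshold: float = 2.0) -> None:
    """Move items (item -> load) from the heavier of the two nodes to the lighter one."""
    heavy_items, light_items = a_items, b_items
    heavy_load, light_load = sum(heavy_items.values()), sum(light_items.values())
    if heavy_load < light_load:
        heavy_items, light_items = light_items, heavy_items
        heavy_load, light_load = light_load, heavy_load
    if heavy_load <= threshold * max(light_load, 1e-9):
        return                                          # not uneven enough to act
    for item, load in sorted(heavy_items.items(), key=lambda kv: kv[1]):
        # stop if the move would leave the sender lighter than the receiver
        if heavy_load - load < light_load + load:
            break
        light_items[item] = heavy_items.pop(item)
        heavy_load -= load
        light_load += load

a = {"D1": 5.0, "D2": 4.0, "D3": 3.0}
b = {"D9": 1.0}
equalize_pair(a, b)
print(a, b)
```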
3.2 Case Study: Locality-Aware Randomized Load Balancing Algorithms
This section presents the Locality-Aware Randomized load balancing algorithms (LAR) (Shen, 2007), which take proximity information into account in load balancing while dealing with network dynamism. The algorithms take advantage of the proximity information of the DHT in node probing and distribute application load among the nodes according to their capacities. The LAR algorithms introduce a factor of randomness into the probing of lightly loaded nodes within a range of proximity, so as to make the probing process robust in DHTs with churn. The LAR algorithms further improve efficiency by allowing multiple candidates to be probed at a time. Such a probing process is referred to as d-way probing, d ≥ 1. The algorithms are implemented in Cycloid (Shen, 2006), based on the concept of "moving items" (Karger, 2004) for retaining DHT network efficiency and scalability. The algorithms are also applicable to virtual server methods. The performance of the LAR load balancing algorithms is evaluated via comprehensive simulations. Simulation results demonstrate the superiority of a locality-aware 2-way randomized load balancing algorithm in comparison with purely random approaches and locality-aware sequential algorithms. In DHTs with churn, it performs no worse than the best churn-resilient algorithm.
Table 1. Routing table of a Cycloid node (4, 101-1-1010)
NodeID: (4, 101-1-1010)
Routing Table:
  Cubical neighbor: (3, 101-0-xxxx)
  Cyclic neighbor: (3, 101-1-1100)
  Cyclic neighbor: (3, 101-1-0011)
Leaf Sets (half smaller, half larger):
  Inside Leaf Set: (3, 101-1-1010), (6, 101-1-1010)
  Outside Leaf Set: (7, 101-1-1001), (6, 101-1-1011)
In the following, the Cycloid DHT is first introduced before the LAR algorithms are presented.
3.2.1 Cycloid: A Constant-Degree DHT
Cycloid (Shen, 2006) is a lookup-efficient, constant-degree DHT that we recently proposed. In a Cycloid system with n = d · 2^d nodes, each lookup takes O(d) hops with O(1) neighbors per node. In this section, we give a brief overview of the Cycloid architecture and its self-organization mechanism, focusing on the structural features related to load balancing. ID and structure. In Cycloid, each node is represented by a pair of indices (k, ad−1 ad−2 . . . a0), where k is a cyclic index and ad−1 ad−2 . . . a0 is a cubical index. The cyclic index is an integer ranging from 0 to d − 1, and the cubical index is a binary number between 0 and 2^d − 1. Each node keeps a routing table and two leaf sets, an inside leaf set and an outside leaf set, with a total of 7 entries to maintain its connectivity to the rest of the system. Table 1 shows the routing state of node (4, 101-1-1010) in an 8-dimensional Cycloid, where x indicates an arbitrary binary value. Its corresponding links in both the cubical and cyclic aspects are shown in Figure 1.
Figure 1. Cycloid node routing links state
In general, a node (k, ad−1 ad−2 . . . a0), k ≠ 0, has one cubical neighbor (k − 1, ad−1 ad−2 . . . ak x x . . . x), where x denotes an arbitrary bit value, and two cyclic neighbors (k − 1, bd−1 bd−2 . . . b0) and (k − 1, cd−1 cd−2 . . . c0). The cyclic neighbors are the first larger and smaller nodes with cyclic index (k − 1) mod d whose most significant bit differing from the current node's cubical index is no higher than bit (k − 1). That is, (k−1, bd−1 . . . b1b0) = min{∀(k−1, yd−1 . . . y1y0) | yd−1 . . . y0 ≥ ad−1 . . . a1a0}, and (k−1, cd−1 . . . c1c0) = max{∀(k−1, yd−1 . . . y1y0) | yd−1 . . . y0 ≤ ad−1 . . . a1a0}. A node with cyclic index k = 0 has no cubical or cyclic neighbors. The node with cubical index 0 has no smaller cyclic neighbor, and the node with cubical index 2^d − 1 has no larger cyclic neighbor. The nodes with the same cubical index are ordered by their cyclic index (mod d) on a local circle. The inside leaf set of a node points to the node's predecessor and successor in the local circle. The node with the largest cyclic index in a local circle is called the primary node of the circle. All local circles together form a global circle, ordered by cubical index (mod 2^d). The outside leaf set of a node points to the primary nodes in its preceding and succeeding small circles in the global circle. The Cycloid connection pattern is resilient in the sense that even if many nodes are absent, the remaining nodes are still capable of being connected. The Cycloid DHT assigns keys onto its ID space by the use of a consistent hashing function. For a given key, the cyclic index of its mapped node is set to its hash value modulo d, and the cubical index is set to the hash value divided by d. If the target node of an item key (k, ad−1 . . . a1a0) is not present in the system, the key is assigned to the node whose ID is first numerically closest to ad−1 ad−2 . . . a0 and then numerically closest to k. Self-organization. P2P systems are dynamic in the sense that nodes frequently join and depart from the network. Cycloid deals with this dynamism in a distributed manner. When a new node joins, it initializes its routing table and leaf sets, and notifies the nodes in its inside leaf set of its participation. It also needs to notify the nodes in its outside leaf set if it becomes the primary node of its local circle. Before a node leaves, it likewise notifies its inside leaf set nodes. Because a Cycloid node has no incoming connections for cubical and cyclic neighbors, a leaving node cannot notify those nodes that take it as their cubical or cyclic neighbor. Whether it needs to notify the nodes in its outside leaf set depends on whether the leaving node is a primary node. Updating cubical and cyclic neighbors is the responsibility of system stabilization, as in Chord.
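A small sketch of the key-to-ID mapping just described (cyclic index = hash mod d, cubical index = hash divided by d). The hash function and the "numerically closest live node" rule below are simplified assumptions; the real Cycloid rule works on ring distance and the ordered leaf sets.

```python
import hashlib

D = 8                                   # Cycloid dimension: ID space size is d * 2^d

def cycloid_id(key: str, d: int = D) -> tuple[int, int]:
    """Map a key to (cyclic index, cubical index) as described above."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16) % (d * 2 ** d)
    return h % d, h // d

def responsible_node(key: str, live_nodes: list[tuple[int, int]]) -> tuple[int, int]:
    """Simplified owner selection: closest cubical index first, then closest cyclic index."""
    k, a = cycloid_id(key)
    return min(live_nodes, key=lambda node: (abs(node[1] - a), abs(node[0] - k)))

nodes = [(4, 0b10111010), (7, 0b10111001), (3, 0b00101100)]
print(cycloid_id("some-file.mp3"), responsible_node("some-file.mp3", nodes))
```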
3.2.2 Load Balancing Framework
This section presents a framework for load balancing based on item movement on Cycloid. It takes advantage of Cycloid's topological properties and conducts a load balancing operation in two steps: local load balancing within a local circle, and global load balancing between circles. A general approach that takes node heterogeneity into consideration is to partition the nodes into super nodes with high capacity and regular nodes with low capacity (Fasttrack, 2001; Yang, 2003). Each super node, together with a group of regular nodes, forms a cluster in which the super node operates as a server to the others. All the super nodes operate as equals in a network of super-peers. Super-peer networks strike a balance between the inherent efficiency of centralization and distribution, and take advantage of capacity heterogeneity as well. Recall that each local circle in Cycloid has a primary node. We regard Cycloid as a quasi-super-peer network by assigning each primary node as the leading super node in its circle.
Table 2. Donating and starving sorted lists (load information in a primary node)
Donating sorted list: <δLj, Aj>, ..., <δLm, Am>
Starving sorted list: <Li,1, Di,1, Ai>, ..., <Li,k, Di,k, Ai>
A node is designated as a supernode if its capacity is higher than a pre-defined threshold. The Cycloid rules for node join and leave are slightly modified to ensure that every primary node meets the capacity requirement of supernodes. If the cyclic ID selected by a regular node is the largest in its local circle, the node needs to make another choice unless it is the bootstrap node of the circle. In the case of primary node departure or failure, a supernode needs to be found to take the primary node's place if the node with the second largest cyclic ID in the circle is not a super node. This operation can be regarded as the new supernode leaving and re-joining the system with the ID of the leaving or failing primary node. Let Li,k denote the load of item k in node i. It is determined by the item size Si,k and the number of visits Vi,k to the item during a certain time period; that is, Li,k = Si,k × Vi,k. The actual load of a real server i, denoted by Li, is the total load of all of its items:

Li = Σ_{k=1}^{mi} Li,k,
assuming the node has mi items. Let Ci denote the capacity of node i; it is defined as a pre-set target load which the node is willing to hold. We refer to a node whose actual load is no larger than its target load (i.e., Li ≤ Ci) as a light node, and otherwise as a heavy one. We define the utilization of a node i, denoted by NUi, as the fraction of its target capacity that is occupied, that is, NUi = Li / Ci. System utilization, denoted by SU, is the ratio of the total actual load to the total node capacity. Each node contains a list of data items, labeled Dk, k = 1, 2, .... To make full use of node capacity, the excess items chosen for transfer should have minimum load. We define the excess items of a heavy node as a subset of its resident items satisfying the following condition. Without loss of generality, we assume the excess items are {D1, D2, . . ., Dm'}, 1 ≤ m' ≤ m, with corresponding loads {Li,1, . . ., Li,m'}. The set of excess items is determined in such a way that it

minimizes Σ_{k=1}^{m'} Li,k   (1)

subject to (Li − Σ_{k=1}^{m'} Li,k) ≤ Ci   (2)
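A hedged sketch of excess item selection under constraint (2): following the text's preference for low-load items, the lightest items are shed greedily until the remaining load fits the target capacity Ci. A greedy pass is only an approximation; it does not necessarily minimize objective (1) exactly.

```python
def excess_items(item_loads: dict[str, float], capacity: float) -> list[str]:
    """Pick items to shed until the node's remaining load is within its target capacity."""
    total = sum(item_loads.values())
    excess, shed = [], 0.0
    # consider the lightest items first so each transferred item is as small as possible
    for item, load in sorted(item_loads.items(), key=lambda kv: kv[1]):
        if total - shed <= capacity:
            break
        excess.append(item)
        shed += load
    return excess

print(excess_items({"D1": 5.0, "D2": 2.0, "D3": 1.0, "D4": 4.0}, capacity=8.0))
# -> ['D3', 'D2', 'D4']: 7.0 of load is shed, leaving 5.0 <= 8.0 on the node
```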
Each primary node has a pair of sorted donating and starving lists which store the load information of all nodes in its local circle. A donating sorted list (DSL) is used to store the load information of light nodes, and a starving sorted list (SSL) is used to store the load information of heavy nodes, as shown in Table 2. The free capacity of light node i is defined as δLi = Ci − Li. The load information of heavy node i includes
the information of its excess items in a set of 3-tuples: <Li,1, Di,1, Ai>, . . ., <Li,k, Di,k, Ai>, in which Ai denotes the IP address of node i. The load information of light node j is represented in the form <δLj, Aj>. An SSL is sorted in descending order of Li,k; min Li,k represents the item with the minimum load in the primary node's starving list. A DSL is sorted in ascending order of δLj; max δLj represents the maximum δLj in the primary node's donating list. Load rearrangement is executed between a pair of DSL and SSL, as shown in Algorithm 1. This scheme guarantees that heavier items have a higher priority to be reassigned to a light node, which means faster convergence to a system-wide load balance state. A heavy item Li,k is assigned to the most-fit light node, i.e., the one with the minimum free capacity δLj left after the heavy item Li,k is transferred to it. This makes full use of the available capacity. Our load balancing framework is based on item movement, which transfers items directly instead of virtual servers, in order to save cost. Cycloid maintains two pointers for each transferred item. When an item D is transferred from heavy node i to light node j, node i keeps a forward pointer in D's original location pointing to item D in j's place, and item D keeps a backward pointer to node i indicating its original host. When queries for item D reach node i, they are redirected to node j with the help of the forward pointer. If item D needs to be transferred from node j to another node, say g, for load balancing, node j notifies node i of the item's new location via the backward pointer.

Algorithm 1: A primary node periodically performs load rearrangement between a pair of DSL and SSL
  for each item k in SSL do
    for each light node j in DSL do
      if Li,k ≤ δLj then
        item k is arranged to be transferred from i to j
        if δLj − Li,k > 0 then
          put <(δLj − Li,k), Aj> back into the DSL

We use a centralized method for local load balancing, and a decentralized method for global load balancing. Each node (k, ad−1 ad−2 . . . a0) periodically reports its load information to the primary node in its local circle. Unlike a real super-peer network, Cycloid has no direct link between a node and the primary node. The load information needs to be forwarded using the Cycloid routing algorithm, which ensures that the information reaches the current primary node. Specifically, the information is targeted at the node (d − 1, ad−1 ad−2 . . . a0). By the routing algorithm, the destination it reaches, say node i, may be the primary node or its successor, depending on which one is closer to the ID. If the cyclic index of successor(i) is larger than the cyclic index of i, then the load information is forwarded to predecessor(i), which is the primary node; otherwise, node i is the primary node. According to the Cycloid routing algorithm, each report takes d/2 steps in the worst case. A Cycloid circle contains a primary node at all times. Since the load information is guaranteed to reach the current primary node, primary node updates have no serious adverse effect on load balancing. After receiving the load information, the primary node puts it into its own DSL and SSL accordingly. A primary node with a nonempty starving list (PNS) first performs local load rearrangement between its DSL and SSL. Afterwards, if its SSL is still not empty, it probes other primary nodes' DSLs for global load rearrangement, one by one, until its SSL becomes empty.
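A runnable sketch of the load rearrangement in Algorithm 1, with ordinary Python lists standing in for the sorted DSL and SSL: SSL entries are (item load, item ID, heavy-node address) and DSL entries are (free capacity, light-node address); heavier items are matched first to the most-fit light node, whose leftover free capacity is put back into the DSL. The tuple layouts and addresses are illustrative assumptions.

```python
def rearrange(ssl, dsl):
    """Return planned transfers [(item_id, from_addr, to_addr)]; mutates dsl in place."""
    transfers = []
    # heaviest excess items first, as in the chapter (descending order of Li,k)
    for load, item, heavy in sorted(ssl, reverse=True):
        # most-fit light node: smallest free capacity that can still hold the item
        fits = [(free, light) for free, light in dsl if free >= load]
        if not fits:
            continue
        free, light = min(fits)
        dsl.remove((free, light))
        if free - load > 0:
            dsl.append((free - load, light))        # put the remainder back into the DSL
        transfers.append((item, heavy, light))
    return transfers

ssl = [(6.0, "D7", "10.0.0.5"), (2.0, "D2", "10.0.0.9")]
dsl = [(3.0, "10.0.0.2"), (7.0, "10.0.0.3")]
print(rearrange(ssl, dsl))
```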
When a primary node doesn't have enough capacity for load balancing, it can search for a high-capacity node to replace itself. We arrange for the PNS to initiate probing because the probing process stops once the node is no longer overloaded. If a node with a nonempty donating list initiated probing, the probing process could proceed indefinitely, incurring many more communication messages and much more bandwidth cost. Because primary nodes are super peers with high capacities, they are less likely to be
overloaded during load balancing. This avoids the situation in which heavy nodes become overloaded when they perform probing themselves, as in the schemes in (Rao, 2003). This scheme can be extended to perform load rearrangement between one SSL and multiple DSLs for further improvement.
3.2.3 Locality-Aware Randomized Load Balancing Algorithms
The load balancing framework in the preceding section facilitates the development of load balancing algorithms with different characteristics. A key difference between the algorithms is how a PNS chooses another primary node for a global load rearrangement between their SSL and DSL. This choice affects the efficiency and overhead of reaching a system-wide load balance state. D-way randomized probing. A general approach to dealing with the churn of DHTs is randomized probing. In this policy, each PNS probes other primary nodes randomly for load rearrangement. A simple form is one-way probing, in which a PNS, say node i, probes other primary nodes one by one to execute load rearrangement between SSLi and DSLj, where j is a probed node. We generalize the one-way randomized probing policy to d-way probing, in which d primary nodes are probed at a time, and the primary node with the most total free capacity in its DSL is chosen for load rearrangement. A critical performance issue is the choice of an appropriate value of d. The randomized probing in our load balancing framework is similar to the load balancing problem in other contexts: competitive online load balancing and the supermarket model. Competitive online load balancing assigns each task to a server on-line with the objective of minimizing the maximum load on any server, given a set of servers and a sequence of task arrivals and departures. Azar et al. (1994) proved that in competitive online load balancing, allowing each task two server choices, placing it on the less loaded of the two instead of giving it just one choice, can exponentially reduce the maximum server load and result in a more balanced load distribution. The supermarket model allocates each randomly arriving task, modeled as a customer with service requirements, to a processor (or server) with the objective of reducing the time each customer spends in the system. Mitzenmacher (1997) proved that allowing a task two server choices, serving it at the server with less workload instead of giving it just one choice, leads to exponential improvements in the expected execution time of each task, but that a poll size larger than two gains much less substantial extra improvement. The randomized probing between SSLs and DSLs is similar to the above competitive load balancing and supermarket models if we regard SSLs as tasks and DSLs as servers. However, random probing in P2P systems has more general workload and server models. Servers are dynamically composed, with new ones joining and existing ones leaving. Servers are heterogeneous with respect to their capacities. Tasks are of different sizes and arrive at different rates. In (Fu, 2008), we proved that the random probing is equivalent to a generalized supermarket model and showed the following result. Theorem 5.1: Assume servers join according to a Poisson distribution. For any fixed time interval [0, T], the length of the longest queue in the supermarket model with d = 1 is (ln n / ln ln n)(1 + O(1)) with high probability; the length of the longest queue in the model with d ≥ 2 is ln ln n / ln d + O(1), where n is the number of servers. The theorem implies that 2-way probing can achieve a more balanced load distribution at a faster speed even in churn, because 2-way probing has a higher probability of reaching an active node than 1-way probing, whereas d-way probing with d > 2 may not yield much additional improvement.
Table 3. Simulation settings and algorithm parameters

Environmental Parameter: Default Value
Object arrival location: Uniform over ID space
Number of nodes: 4906
Node capacity: Bounded Pareto; shape 2, lower bound 2500, upper bound 2500 × 10
Number of items: 20480
Existing item load: Bounded Pareto; shape 2, lower bound mean item actual load / 2, upper bound (mean item actual load / 2) × 10
Locality-aware probing. One goal of load balancing is to effectively keep each node lightly loaded with minimum load balancing overhead. Proximity is one of the most important performance factors. The mismatch between the logical proximity abstraction and the physical proximity of nodes in reality is a big obstacle to the deployment and performance optimization of P2P applications. Techniques to exploit topology information in overlay routing include geographic layout, proximity routing, and proximity-neighbor selection (Castro, 2002). The proximity-neighbor selection and topologically-aware overlay construction techniques in (Xu, 2003; Castro, 2002; Waldvogel, 2002) are integrated into Cycloid to build a topology-aware Cycloid. As a result, the topology-aware connectivity of Cycloid ensures that a message reaches its destination with minimal overhead. Details of topology-aware Cycloid construction will be presented in Section 3.2.4. In a topology-aware Cycloid network, the cost of communication and load movement can be reduced if a primary node contacts other primary nodes in its routing table or the primary nodes of its neighbors. In general, the primary nodes of a node's neighbors are closer to the node than randomly chosen primary nodes in the entire network, so load is moved between closer nodes. To our knowledge, this is the first work that handles the load balancing issue using the information already maintained for efficient routing. There are two methods for locality-aware probing, a randomized method and a sequential method (a sketch of the randomized d-way variant in code follows this list):
1. Locality-aware randomized probing (LAR): In LAR, each PNS contacts primary nodes, in a random order, from its routing table or the primary nodes of its neighbors, except the nodes in its inside leaf set. After all these primary nodes have been tried, if the PNS's SSL is still nonempty, global random probing is started over the entire ID space.
2. Locality-aware sequential probing (Lseq): In Lseq, each PNS contacts its larger outside leaf set node, Successor(PNS). After load rearrangement, if its SSL is still nonempty, the larger outside leaf set node of Successor(PNS), i.e., Successor(Successor(PNS)), is tried. This process is repeated until the SSL becomes empty. The distances between a node and its sequential nodes are usually smaller than the distances between the node and randomly chosen nodes in the entire ID space.
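A hedged sketch of locality-aware d-way probing for a PNS, as referenced above: probe candidates are drawn first from nearby primary nodes, falling back to random primaries from the whole ID space, and the SSL is sent to the probed primary whose DSL has the most total free capacity. The data structures, the greedy rearrangement helper, and the round limit are illustrative assumptions, not the exact LAR protocol.

```python
import random

def rearrange_against(ssl, dsl):
    """Greedy item-to-free-capacity matching; returns the items still unplaced."""
    remaining = []
    for load, item in sorted(ssl, reverse=True):
        fits = [entry for entry in dsl if entry[0] >= load]
        if fits:
            free, light = min(fits)
            dsl.remove((free, light))
            dsl.append((free - load, light))
        else:
            remaining.append((load, item))
    return remaining

def lar_probe(ssl, nearby, all_primaries, dsl_of, d=2, max_rounds=16):
    """d-way locality-aware probing for a primary node with a nonempty SSL."""
    tried = set()
    for _ in range(max_rounds):
        if not ssl:                                         # SSL empty: balance achieved
            break
        pool = [p for p in nearby if p not in tried] or \
               [p for p in all_primaries if p not in tried]
        if not pool:
            break
        cands = random.sample(pool, min(d, len(pool)))
        tried.update(cands)
        # choose the probed primary with the most total free capacity in its DSL
        best = max(cands, key=lambda p: sum(free for free, _ in dsl_of[p]))
        ssl = rearrange_against(ssl, dsl_of[best])
    return ssl

dsl_of = {"P1": [(3.0, "a")], "P2": [(9.0, "b")], "P3": [(1.0, "c")]}
leftover = lar_probe([(6.0, "D7"), (2.0, "D2")], nearby=["P1", "P3"],
                     all_primaries=["P1", "P2", "P3"], dsl_of=dsl_of, d=2)
print(leftover, dsl_of)
```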
3.2.4 Performance Evaluation
We designed and implemented a simulator in Java to evaluate the load balancing algorithms on topology-aware Cycloid. Table 3 lists the parameters of the simulation and their default values. The simulation model and parameter settings are not necessarily representative of real DHT applications; they are set in a way similar to related studies in the literature for fair comparison. We will compare the
different load balancing algorithms in Cycloid without churn in terms of the following performance metrics; the algorithms in Cycloid with churn will also be evaluated. (A small computational sketch of two of these metrics follows the list.)
1. Load movement factor: Defined as the total load transferred due to load balancing divided by the system actual load, which is the system target capacity times SU. It represents the load movement cost.
2. Total time of probing: Defined as the time spent on primary node probing, assuming that probing one node takes 1 time unit and probing a number of nodes simultaneously also takes 1 time unit. It represents the speed of the probing phase of load balancing in achieving a system-wide load balance state.
3. Total number of load rearrangements: Defined as the total number of load rearrangements between a pair of SSL and DSL. It represents the efficiency of probing for light nodes.
4. Total probing bandwidth: Defined as the sum of the bandwidth consumed by all probing operations. The bandwidth of a probing operation is the sum of the bandwidth of all involved communications, each of which is the product of the message size and the physical path length the message travels. It is assumed that the size of a message asking for or replying with information is 1 unit. This metric represents the traffic burden caused by probing.
5. Moved load distribution: Defined as the cumulative distribution function (CDF) of the percentage of moved load versus moving distance. It represents the load movement cost of load balance. The more load moved over shorter distances, the lower the load balancing cost.
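A small illustrative sketch of computing two of the metrics above, the load movement factor (metric 1) and the moved load distribution (metric 5), from a log of transfer events of the form (moved load, physical hop distance). The event log and system load figures below are made-up inputs, not the chapter's data.

```python
def load_movement_factor(events, system_actual_load):
    """Total load moved by load balancing divided by the system actual load."""
    return sum(load for load, _ in events) / system_actual_load

def moved_load_cdf(events, max_hops):
    """Fraction of moved load transferred within each hop distance (metric 5)."""
    total = sum(load for load, _ in events) or 1.0
    return [sum(l for l, h in events if h <= d) / total for d in range(1, max_hops + 1)]

events = [(4.0, 2), (1.5, 7), (2.5, 12)]      # (moved load, hop distance)
print(load_movement_factor(events, system_actual_load=100.0))
print(moved_load_cdf(events, max_hops=12))
```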
Topology-aware Cycloid construction. GT-ITM (transit-stub and tiers) (Zegura, 1996) is a network topology generator widely used for the construction of topology-aware overlay networks (Ratnasamy, 2002; Xu, 2003; Gummadi, 2003). We used GT-ITM to generate transit-stub topologies for Cycloid and obtained the physical hop distance for each pair of Cycloid nodes. Recall that we use the proximity-neighbor selection method to build topology-aware Cycloid; that is, we select the routing table entries pointing to the physically nearest among all nodes with a nodeID in the desired portion of the ID space. We use landmark clustering and Hilbert numbers (Xu, 2003) to cluster Cycloid nodes. Landmark clustering is based on the intuition that close nodes are likely to have similar distances to a few landmark nodes. The Hilbert number converts the d-dimensional landmark vector of each node into a one-dimensional index while still preserving the closeness of nodes. We selected 15 nodes as landmark nodes to generate the landmark vector and a Hilbert number for each node's cubical ID. Because the nodes in a stub domain have close (or even the same) Hilbert numbers, their cubical IDs are also close to each other. As a result, physically close nodes are close to each other in the DHT's ID space, and the nodes in one circle are physically close to each other. For example, assume nodes i and j are very close to each other in physical location but far away from node m. Nodes i and j will get approximately equal landmark vectors, which are different from m's. As a result, nodes i and j will get the same or very close cubical IDs and be assigned to a circle different from m's. In the landmark approach, for each topology, we choose landmarks at random with the only condition that the landmarks are separated from each other by four hops. More sophisticated placement schemes, as described in (Jamin, 2000), would only serve to improve our results. Our experiments are built on two transit-stub topologies, "ts5k-large" and "ts5k-small", with approximately 5,000 nodes each. In the topologies, nodes are organized into logical domains. We classify the domains into two types: transit domains and stub domains. Nodes in a stub domain are typically endpoints of a network flow; nodes in transit domains are typically intermediate hops in a network flow. "ts5k-large" has 5 transit domains, 3 transit nodes per transit domain, 5 stub domains attached to each transit node, and 60 nodes in each stub domain on average.
Figure 2. Effect of load balancing
"ts5k-small" has 120 transit domains, 5 transit nodes per transit domain, 4 stub domains attached to each transit node, and 2 nodes in each stub domain on average. "ts5k-large" has a larger backbone and a sparser edge network (stub) than "ts5k-small". "ts5k-large" represents a situation in which the Cycloid overlay consists of nodes from several big stub domains, while "ts5k-small" represents a situation in which the Cycloid overlay consists of nodes scattered across the entire Internet and only a few nodes from the same edge network join the overlay. To account for the fact that interdomain routes have higher latency, each interdomain hop counts as 3 units of latency while each intradomain hop counts as 1 unit of latency. Effectiveness of LAR algorithms. In this section, we show the effectiveness of the LAR load balancing algorithm. First, we present the impact of the LAR algorithm on the alignment of the skews in load distribution and node capacity when the system is fully loaded.
Figure 3. Effect of load balancing due to different probing algorithms
Figure 2(a) shows the initial node utilization of each node. Recall that node utilization is the ratio of a node's actual load to its target (desired) load. Many of the nodes were overloaded before load balancing. Load balancing operations drove all node utilizations down below 1 by transferring excess items between the nodes, as shown in Figure 2(b). Figure 2(c) shows the scatterplot of loads according to node capacity. It confirms the capacity-aware load balancing feature of the LAR algorithm. Recall that the LAR algorithm is based on item movement, using forward pointers to preserve the DHT lookup protocol. We calculated the fraction of items that are pointed to by forward pointers in systems at different utilization levels. We found that the fraction increased linearly with the system load, but that it was no higher than 45% even when the system became fully loaded. This cost is reasonably low compared to the extra space, maintenance cost, and efficiency degradation of the virtual server load balancing approach. We measured the load movement factors due to different load balancing algorithms, one-way random (R1), two-way random (R2), LAR1, LAR2, and Lseq, on systems with different loads, and found that the algorithms led to almost the same amount of total load movement at any given utilization level. This is consistent with the observation by Rao et al. (2003) that the load moved depends only on the distribution of loads and the target to be achieved, not on the load balancing algorithm. This result suggests that an effective load balancing algorithm should strive to move the same amount of load over shorter distances and in less time, so as to reduce load balancing overhead. In the following, we examine the performance of the various load balancing algorithms in terms of the other performance metrics. Because metrics (2) and (3) are not affected by topology, their results for "ts5k-small" are sometimes omitted. Comparison with other algorithms. Figure 3(a) shows that the probing process in Lseq takes much more time than in R1 and LAR1. This implies that the random algorithm is better than the sequential algorithm in probing efficiency. Figure 3(b) shows that the numbers of rearrangements of the three algorithms are almost the same. This implies that they need almost the same number of load rearrangements to achieve load balance. However, the long probing time of Lseq suggests that it is not as efficient as random probing. This is consistent with the observation of Mitzenmacher (1997) that simple randomized load balancing schemes can balance load effectively. Figures 3(c) and (d) show the performance of the algorithms in "ts5k-large". From Figure 3(c), we can observe that, unlike in lightly loaded systems, in heavily loaded systems R1 takes more bandwidth than LAR1 and Lseq, and the performance gap increases as the system load increases. This is because many fewer probings are needed in a lightly loaded system, so probing distance has less effect on bandwidth consumption. The bandwidth results of LAR and Lseq are almost the same when SU is under 90%; when SU goes beyond 0.9, LAR consumes more bandwidth than Lseq. This is because, in a more heavily loaded system, more nodes need to be probed in the entire ID space, leading to longer load transfer distances. Figure 3(d) shows the moved load distribution in load balancing as SU approaches 1.
We can see that LAR1 and Lseq are able to transfer about 60% of the globally moved load within 10 hops, while R1 transfers only about 15%, because R1 is locality-oblivious. Figures 3(e) and (f) show the performance of the algorithms in "ts5k-small". These results also confirm that LAR1 achieves better locality-aware performance than R1, although the improvement is not as significant as in "ts5k-large". This is because in the "ts5k-small" topology, nodes are scattered across the entire network, and the neighbors of a primary node may not be physically closer than other nodes.
Figure 4. Breakdown of probed nodes
Figures 3(d) and (f) also include the results of two other popular load balancing approaches, the proximity-aware K-nary tree (KTree) algorithm (Zhu, 2005) and the churn resilient algorithm (CRA) (Godfrey, 2006), for comparison. From the figures, we can see that LAR performs as well as KTree, and outperforms the proximity-oblivious CRA, especially in "ts5k-large". The performance gap between the proximity-aware and proximity-oblivious algorithms is smaller in "ts5k-small", because the nodes in "ts5k-small" are scattered across the entire Internet with less locality.
Figure 5. Effect of load balancing due to different LAR algorithms
In summary, the results in Figure 3 suggest that the randomized algorithm is more efficient than the sequential algorithm in the probing process. The locality-aware approaches can effectively assign and transfer load between neighboring nodes first, thereby reducing network traffic and improving load balancing efficiency. The LAR algorithm performs no worse than the proximity-aware KTree algorithm. In Section 3.2.5, we will show that LAR works much better for DHTs with churn. Effect of D-way random probing (Figure 4). We tested the performance of the LARd algorithms with different probing concurrency degrees d. Figure 5(a) shows that LAR2 takes much less probing time than LAR1. This implies that LAR2 reduces the probing time of LAR1 at the cost of a larger number of probings. Unlike LAR1, in LAR2 a probing node sends its SSL only to the one of the two probed nodes with more total free capacity in its DSL. The more items transferred in one load rearrangement, the less the probing time. This leads to fewer SSL-sending operations in LAR2 than in LAR1, resulting in fewer load rearrangements, as shown in Figure 5(b). Therefore, simultaneous probings to find a node with more total free capacity in its DSL can save load balancing time and reduce network traffic load. Figures 4(a) and (b) show the breakdown of the total number of probed nodes, in percentage, that are from neighbors or randomly chosen from the entire ID space in LAR1 and LAR2, respectively. The label "one neighbor and one random" represents the case in which there is only one neighbor in the routing table, so the other probed node is chosen randomly from the ID space. We can see that probed neighbor primary nodes constitute the largest part, which means that neighbors can absorb most of the system's excess items in load balancing. As SU increases, the percentage of neighbor primary nodes decreases because the neighbors' DSLs do not have enough free capacity for the larger number of excess items, so randomly chosen primary nodes must be resorted to. Figures 5(a) and (b) show that the probing efficiency of LARd (d > 2) is almost the same as that of LAR2, though these algorithms probe more nodes than LAR2. The results are consistent with the expectation in Section 3.2.3 that two-way probing leads to an exponential improvement over one-way probing, but that d-way probing with d > 2 yields much less substantial additional improvement. In the following, we analyze whether the improvement of LARd (d ≥ 2) over LAR1 comes at the cost of more bandwidth consumption or degraded locality-aware performance. We can observe from Figure 5(c) that the probing bandwidth of LAR2 is almost the same as that of LAR1. Figure 5(d) shows the moved load distribution in global load balancing due to the different algorithms. We can see that LAR2 leads to an approximately identical distribution to LAR1, and both cause slightly less global load movement cost than LAR4 and LAR6. This is because the more nodes probed simultaneously, the less likely it is that the best primary node is a close neighbor. These observations demonstrate that LAR2 improves on LAR1 at no extra bandwidth cost, and it retains the advantage of locality-aware probing. Figures 5(e) and (f) show the performance of the different algorithms in "ts5k-small". Although the performance gap is not as wide as in "ts5k-large", the relative performance of the algorithms is retained. In practice, nodes and items continuously join and leave P2P systems. It is hard to achieve the objective of load balance in networks with churn.
We conducted a comprehensive evaluation of the LAR algorithm in dynamic situations and compared the algorithm with CRA, which was designed for DHTs with churn. The performance factors we considered include load balancing frequency, item arrival/departure rate, non-uniform item arrival patterns, and network scale and node capacity heterogeneity. We adopted the same metrics as in (Godfrey, 2006):
Figure 6. Effect of load balancing with churn
1. The 99.9th percentile node utilization (99.9th NU): We measure the maximum 99.9th percentile of the node utilizations after each load balancing period T in the simulation and take the average of these results over a period as the 99.9th NU. The 99.9th NU represents the efficiency of LAR in minimizing load imbalance.
2. Load moved/DHT load moved (L/DHT-L): Defined as the total load moved due to load balancing divided by the total load of items moved due to node joins and departures in the system. This metric represents the efficiency of LAR in minimizing the amount of load moved. (A small computational sketch of these two metrics follows this list.)
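An illustrative sketch of the two churn metrics defined above: the 99.9th percentile node utilization, sampled per balancing period and averaged, and the L/DHT-L ratio. The crude percentile helper and the sample inputs are assumptions for illustration only.

```python
import math

def percentile(values, p):
    """A crude percentile helper for small samples."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, math.ceil(p * len(ordered)) - 1)]

def nu_999(per_period_utilizations):
    """Average over periods of the 99.9th percentile node utilization."""
    peaks = [percentile(us, 0.999) for us in per_period_utilizations]
    return sum(peaks) / len(peaks)

def l_over_dht_l(balancing_load_moved, dht_load_moved):
    """Load moved by load balancing relative to load moved by node joins/departures."""
    return balancing_load_moved / dht_load_moved

print(nu_999([[0.2, 0.9, 1.1, 0.5], [0.4, 0.7, 0.95]]))
print(l_over_dht_l(12.0, 60.0))
```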
Unless otherwise indicated, we ran each trial of the simulation for 20T simulated seconds, where T is a parameterized load balancing period whose default value was set to 60 seconds in our tests. The item and node join/departure rates were modeled by Poisson processes. The default item join/departure rate was 0.4; that is, there was one item join and one item departure every 2.5 seconds. We varied the node interarrival time from 10 to 90 seconds in 10-second increments. A node's lifetime is computed as the node interarrival time times the number of nodes in the system. The default system utilization SU was set to 0.8. Performance comparison with CRA in Churn. Figure 6 plots the performance of LAR1 and CRA versus node interarrival time during a period T. By comparing the results of LAR1 and CRA, we can make a number of observations. First, the 99.9th NUs of LAR1 and CRA are kept no higher than 1 and 1.25, respectively.
Figure 7. Impact of system utilization under continual node joins and departures
Figure 8. Impact of load balancing frequency
This implies that, on average, LAR1 is comparable with CRA in achieving the load balancing goal under churn. Second, LAR1 moves up to 20% and CRA up to 45% of the system load to achieve load balance for SU as high as 80%. Third, the load moved due to load balancing is small compared with the load moved due to node joins and departures; it is up to 40% of the latter for LAR1 and 53% for CRA. When the node interarrival time is 10, the L/DHT-L is the highest, because faster node joins and departures generate much higher load imbalance, so more load transfer is needed to achieve load balance. The fact that the results of LAR1 are comparable to those of CRA implies that the LAR algorithm is as efficient as CRA in handling churn while moving a small amount of load. The results in Figure 6 are for the default item join/departure rate of 0.4. Figure 7 plots the 99.9th NU, the load movement factor, and the L/DHT-L as functions of SU for different node interarrival times. We can observe that all three metrics increase as SU increases, because nodes are prone to becoming overloaded in a heavily loaded system, resulting in more load transferred to achieve load balance. We can also observe that the metrics increase as the interarrival time decreases, though the increases are not pronounced. This is because, with faster node joins and departures, nodes become overloaded more easily, leading to increases in the 99.9th NU and the load moved in load balancing. The low NUs across different SUs and node interarrival times mean that LAR is effective in keeping each node lightly loaded in a dynamic DHT with different node join/departure rates and different SUs, and confirm the churn-resilient feature of the LAR algorithm. Impact of load balancing frequency in Churn. It is known that very frequent load balancing ensures system load balance at a high cost, while infrequent load balancing can hardly guarantee load balance at all times. In this simulation, we varied the load balancing interval T from 60 to 600 seconds at a step size of 60, and we conducted the test in systems with SU varying from 0.5 to 0.9 at a step size of 0.1. Figures 8(a) and (b) show the 99.9th NU and the load movement factor for different system utilizations and time intervals. We can see that the 99.9th NU and the load movement factor increase as SU increases. This is because nodes are more likely to be overloaded in a highly loaded system, leading to a high maximum NU and a large amount of load that needs to be transferred for load balance. Figure 8(a) shows that all the 99.9th NUs are less than 1, and when the actual load of a system constitutes more than 60% of its target load, the 99.9th NU quickly converges to 1.
Figure 9. Impact of item arrival/departure rate
This implies that the LAR algorithm is effective in keeping every node light, and that it can quickly transfer the excess load of heavy nodes to light nodes even in a highly loaded system. Observing Figures 8(a) and (b), we find that at a given SU, the more load moved, the lower the 99.9th NU. This is consistent with our expectation that more load moved leads to a more balanced load distribution. Intuitively, a higher load balancing frequency should lead to a lower 99.9th NU and more load moved. Our observation from Figure 8 is counter-intuitive: the 99.9th NU increases and the load movement factor decreases as load balancing is performed more frequently. Recall that the primary objective of load balancing is to keep each node from being overloaded, rather than to keep the application load evenly distributed between the nodes. Whenever a node's utilization is below 1, it does not need to transfer its load to others. With a high load balancing frequency, few nodes are likely to be overloaded; they may have high utilizations below 1, and end up with less load movement and high node utilization. Figure 8(b) reveals a linear relationship between the load movement factor and system utilization, and shows that the slope for low frequency is larger than that for high frequency because of the impact of load balancing frequency on highly loaded systems. Impact of item arrival/departure rate in Churn. Continuous and fast item arrivals increase the probability of generating overloaded nodes. Item departures generate nodes with available capacity for excess items. An efficient load balancing algorithm must quickly find nodes with sufficient free capacity for excess items in order to maintain a load balance state under churn. In this section, we evaluate the efficiency of the LAR algorithm in the face of rapid item arrivals and departures. In this test, we varied the item arrival/departure rate from 0.05 to 0.45 at a step size of 0.1, varied SU from 0.5 to 0.9 at a step size of 0.05, and measured the 99.9th NU and the load movement factor in each condition. Figures 9(a) and (b), respectively, plot the 99.9th NU and the load movement factor as functions of the item arrival/departure rate. As expected, the 99.9th NU and the load movement factor increase with system utilization, consistent with the results of the load balancing frequency test. Figure 9(a) shows that all the 99.9th NUs are less than 1, which means that LAR is effective in assigning excess items to light nodes under rapid item arrivals and departures.
Figure 10. Impact of non-uniform item arrival patterns
rate increases, the 99.9th NU decreases in a heavily loaded system, unlike in a lightly loaded system. This is due to efficient LAR load balancing, in which more load rearrangements are initiated in a timely manner by overloaded nodes with high item arrival rates. In a lightly loaded system, on the other hand, although the loads of nodes accumulate quickly with a high item arrival rate, most nodes are still light and have no need to move out load, which leads to an increase of the 99.9th NU. This is confirmed by the observation in Figure 9(b) that more load is moved in a heavily loaded system than in a lightly loaded system, and that the movement factor drops faster in a highly loaded system, which means that faster item departures lead to less load moved for load balance. Figure 9(b) demonstrates that the load movement factor drops as the item arrival/departure rate increases, because the total system load (the denominator of the load movement factor) grows quickly with a high item arrival/departure rate. In summary, the item arrival/departure rate has a direct effect on the NU and the load movement factor in load balancing, and LAR is effective in achieving load balance under rapid item arrivals and departures. Impact of Non-uniform Item Arrivals in Churn. Furthermore, we tested the LAR algorithm to see whether it is churn-resilient enough to handle skewed load distributions. We define an "impulse" of items as a group of items that suddenly join the system and whose IDs are distributed over a contiguous interval of the ID space. We set their total load to 10% of the total system load, and varied the spread of the interval from 10% to 90% of the ID space. Figure 10(a) shows that for different impulses and SUs, the LAR algorithm kept the 99.9th NU below 1.055, which implies that the LAR algorithm can almost always resolve the impulses successfully. The 99.9th NU is high at high SU and low impulse spread. Except when SU equals 0.8, impulses with a spread larger than 0.3 can be successfully resolved by the LAR algorithm. When the impulse is assigned to a small ID space interval of less than 0.3, the load of the nodes in that ID space interval accumulates quickly, leading to higher NUs. The situation becomes worse at higher SU, because there is already less available capacity left in the system for the impulse. That the curve of SU=0.8 lies largely above the others is mainly due to the item load and node capacity distributions, and to the impulse load relative to the SU. In that case, it is hard to find nodes with large enough capacity to support the excess items because of the fragmentation of the 20% capacity left in the system. The results are consistent with those reported in (Godfrey, 2006). Figure 10(b) shows that the load movement factor decreases with an increase of the impulse spread and with a decrease of SU. At a low impulse spread, a large amount of load assigned to a small region generates
Figure 11. Impact of the number of nodes in the system
a large number of overloaded nodes, so the LAR load balancing algorithm cannot handle them quickly. This situation becomes worse when SU increases to 0.8, due to the little available capacity left. Therefore, the 99.9th NU and the load movement factor are high in a highly loaded system with a low impulse interval. In summary, the LAR algorithm can generally resolve non-uniform item arrivals. It can deal with a sudden increase of 10% of the load in 10% of the ID space in a highly loaded system with SU equal to 0.8, achieving a 99.9th NU close to 1. Impact of Node Number and Capacity Heterogeneity in Churn. The consistent hashing function adopted in DHTs leads to an O(log n) imbalance of keys among the nodes, where n is the number of nodes in the system. Node heterogeneity in capacity makes the load balancing problem even more severe. In this section, we study the effects of the number of nodes and of a heterogeneous capacity distribution on load balancing. We varied the number of nodes from 1000 to 8000 at a step size of 1000, and measured the NU and load movement factor with heterogeneous and homogeneous node capacities. Homogeneous node capacities are equal capacities set to 50000, and heterogeneous node capacities are determined by the default Pareto node capacity distribution. Figure 11(a) shows that in the heterogeneous case, the 99.9th NUs all stay around 1. This means that LAR can keep nodes light at different network scales when node capacities are heterogeneous. In the homogeneous case, the 99.9th NU stays around 1 when the number of nodes is no more than 5000, but grows linearly as the number of nodes increases beyond 5000. It is somewhat surprising that LAR achieves better load balance in a large-scale network when node capacities are heterogeneous than when they are homogeneous. Intuitively, this is because in the heterogeneous case very high-load items can be accommodated by large-capacity nodes, whereas in the homogeneous case there is no node with a capacity large enough to handle them. The results are consistent with those in (Godfrey, 2006). Figure 11(b) shows that in both cases the load movement factors increase as the number of nodes grows. A larger system scale generates a higher key imbalance, so more load needs to be transferred for load balance. The figure also shows that the factor in the homogeneous case is noticeably smaller than in the heterogeneous case. This is due to the heterogeneous capacity distribution, in which some nodes have very small capacities but are assigned a much higher load, which must be moved out for load balance. The results show that node heterogeneity helps, rather than hurts, the scalability of the LAR algorithm. The LAR algorithm can achieve good load balance even in a large-scale network by arranging load transfers
in a timely manner.
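To make the reported metrics concrete, the following minimal sketch (written in Java purely for illustration; the class, method and variable names are not part of the simulator used in these experiments) computes the three quantities from a simulation snapshot, assuming node utilization is measured as load divided by capacity: the 99.9th percentile node utilization (NU), the load movement factor (load moved by load balancing divided by the total system load), and L/DHT-L (load moved by load balancing relative to the load moved by the underlying DHT due to node joins and departures).

import java.util.Arrays;

public class LoadBalanceMetrics {

    // 99.9th percentile of node utilization, where NU = load / capacity.
    // Assumes at least one node is present.
    static double nu999(double[] loads, double[] capacities) {
        double[] nu = new double[loads.length];
        for (int i = 0; i < loads.length; i++) {
            nu[i] = loads[i] / capacities[i];
        }
        Arrays.sort(nu);
        int idx = (int) Math.ceil(0.999 * nu.length) - 1;
        return nu[Math.max(idx, 0)];
    }

    // Load movement factor: load moved by load balancing over the total system load.
    static double loadMovementFactor(double movedByLoadBalancing, double totalSystemLoad) {
        return movedByLoadBalancing / totalSystemLoad;
    }

    // L/DHT-L: load moved by load balancing relative to the load moved by the
    // DHT itself because of node joins and departures.
    static double lOverDhtL(double movedByLoadBalancing, double movedByDhtChurn) {
        return movedByLoadBalancing / movedByDhtChurn;
    }
}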
3.2.6. Summary

This section has presented the LAR load balancing algorithms, which deal with both the proximity and the dynamism of DHTs simultaneously. The algorithms distribute application load among the nodes by "moving items" according to node capacities and proximity information in topology-aware DHTs. The LAR algorithms introduce a factor of randomness into the probing process within a range of proximity in order to deal with DHT churn. The efficiency of the randomized load balancing is further improved by d-way probing. Simulation results show the superiority of locality-aware 2-way randomized load balancing in DHTs with and without churn. The algorithm saves bandwidth in comparison with plain randomized load balancing because of its locality-aware feature. Owing to the randomness factor in node probing, it can achieve load balance for SU as high as 90% in dynamic situations by moving at most 20% of the system load, which is no more than 40% of the load moved by the underlying DHT due to node joins and departures. The LAR algorithm has been further evaluated with respect to a number of performance factors, including load balancing frequency, arrival/departure rate of items and nodes, skewed item ID distribution, and node number and capacity heterogeneity. Simulation results show that the LAR algorithm can effectively achieve load balance by moving a small amount of load, even under a skewed distribution of items.
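As a rough illustration of the probing step summarized above (and not the authors' implementation), the sketch below shows locality-aware d-way randomized probing: an overloaded node samples d candidates restricted to its proximity range and picks the probed node with the largest spare capacity, so that excess items can be moved to a physically close, lightly loaded node; d = 2 corresponds to the 2-way scheme evaluated in this section. All class and method names are hypothetical.

import java.util.List;
import java.util.Random;

class CandidateNode {
    double load, capacity;
    double spare() { return capacity - load; }
}

public class LarProbing {
    private static final Random RNG = new Random();

    // Probe d random nodes drawn from the physically close candidates and
    // return the one with the largest spare capacity (null if none can help).
    static CandidateNode probe(List<CandidateNode> proximityRange, int d) {
        CandidateNode best = null;
        for (int i = 0; i < d && !proximityRange.isEmpty(); i++) {
            CandidateNode c = proximityRange.get(RNG.nextInt(proximityRange.size()));
            if (c.spare() > 0 && (best == null || c.spare() > best.spare())) {
                best = c;
            }
        }
        return best;
    }
}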
4. FUTURE TRENDS

Though a great deal of research has been conducted on load balancing in parallel and distributed systems, load balancing methods are still in their incubation phase when it comes to P2P overlay networks. In this section, we discuss future and emerging trends and present a number of open issues in the domain of load balancing in P2P overlay networks. P2P overlay networks are characterized by heterogeneity, dynamism and proximity. With heterogeneity in mind, a load balancing method should allocate load among nodes based on the actual file load rather than the number of files. A dynamism-resilient load balancing method should not generate high overhead when nodes join, leave or fail continuously and rapidly. A proximity-aware load balancing method moves load between physically close nodes so as to reduce the overhead of load balancing. However, few of the current load balancing methods take all three factors into account to improve the efficiency and effectiveness of load balancing. Virtual server methods and ID assignment and reassignment methods aim only to distribute the number of files among nodes in balance, and are therefore unable to account for file heterogeneity. In addition, these methods incur high overhead due to neighbor maintenance and the varying ID intervals owned by nodes in churn. These two categories of methods can be complementary to the load transfer methods, which have the potential to deal with all three features of P2P overlay networks. Thus, combining the three types of load balancing strategies to overcome each other's drawbacks and exploit the benefits of each will be a promising future direction. The LAR algorithms were built on the Cycloid structured DHT. Importantly, the LAR algorithms are applicable to other DHT networks as well. They must be complemented by node clustering, which groups DHT nodes according to their physical locations, to facilitate LAR's probing within a range of proximity. The work in (Shen, 2006) presents a way of clustering physically close nodes in a general DHT network, which can be applied to generalize LAR to other DHT networks.
Currently, most heterogeneity-unaware load balancing methods measure load by the number of files stored on a node, and heterogeneity-aware load balancing methods consider only file size when determining a node's load. In addition to the storage required, the load incurred by a file also includes the bandwidth consumption caused by file queries. Frequently queried files generate high load, while infrequently queried files generate low load. Since the files stored in the system often have different popularities, and the access patterns to the same file may vary with time, a file's load changes dynamically. However, most load balancing methods are unable to cope with the load variance caused by non-uniform and time-varying file popularity. Thus, an accurate method to measure a file's load that considers all factors affecting load is required. On the other hand, node capacity heterogeneity should also be identified. As far as the author knows, all current load balancing methods assume that there is one bottleneck resource, even though there are various resources including CPU, memory, storage and bandwidth. For highly effective load balancing, the various loads, such as bandwidth and storage, should be differentiated, and the various node resources should be differentiated as well. Rather than mapping a generalized node capacity to a generalized load, each kind of load should be mapped to the corresponding node resource in load balancing. These improvements would significantly enhance the accuracy and effectiveness of a load balancing method. Most load balancing algorithms balance only the key distribution among nodes. In file sharing P2P systems, a main function of nodes is to handle key location queries, so query load balancing is a critical part of P2P load balancing; that is, the number of queries that nodes receive, handle and forward should correspond to their capacities. A highly effective load balancing method will balance the distribution of both key load and query load.
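As a rough sketch of the kind of load measure argued for here, a file's load could combine its storage footprint with the bandwidth its queries consume; the linear form and the weights below are illustrative assumptions only, not an established metric.

public class FileLoadEstimate {
    // Illustrative composite load: a storage term plus a bandwidth term driven by
    // the (time-varying) query rate observed for the file. wStorage and
    // wBandwidth are hypothetical weights that map both terms onto one scale.
    static double estimate(double sizeBytes, double queriesPerSecond,
                           double wStorage, double wBandwidth) {
        double storageLoad = wStorage * sizeBytes;
        double bandwidthLoad = wBandwidth * queriesPerSecond * sizeBytes;
        return storageLoad + bandwidthLoad;
    }
}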
5. CONCLUSION

A load balancing method is indispensable to a high performance P2P overlay network. It helps to avoid overloading nodes and to take full advantage of the node resources in the system. This chapter has provided a detailed introduction to load balancing in P2P overlay networks, and has examined all aspects of load balancing methods, including their goals, properties, strategies and classification. A comprehensive review of research focusing on load balancing in DHT networks has been presented, along with an in-depth discussion of the pros and cons of each approach. Furthermore, a load balancing algorithm that overcomes the drawbacks of previous methods has been presented in detail. Finally, the future and emerging trends and open issues in load balancing in P2P overlay networks have been discussed.
REFERENCES

Adler, M., Halperin, E., Karp, R. M., & Vazirani, V. (2003, June). A stochastic process on the hypercube with applications to peer-to-peer networks. In Proc. of STOC.

Azar, Y., Broder, A., et al. (1994). Balanced allocations. In Proc. of STOC (pp. 593–602).

Bienkowski, M., Korzeniowski, M., & auf der Heide, F. M. (2005). Dynamic load balancing in distributed hash tables. In Proc. of IPTPS.
Brighten Godfrey, P., & Stoica, I. (2005). Heterogeneity and load balance in distributed hash tables. In Proc. of IEEE INFOCOM.

Byers, J., Considine, J., & Mitzenmacher, M. (2003, February). Simple load balancing for distributed hash tables. In Proc. of IPTPS.

Castro, M., Druschel, P., Hu, Y. C., & Rowstron, A. (2002). Topology-aware routing in structured peer-to-peer overlay networks. In Future Directions in Distributed Computing.

FastTrack product description. (2001). http://www.fasttrack.nu/index.html

Fu, S., Xu, C. Z., & Shen, H. (2008, April). Random choices for churn resilient load balancing in peer-to-peer networks. In Proc. of the IEEE International Parallel and Distributed Processing Symposium.

Godfrey, B., Lakshminarayanan, K., Surana, S., Karp, R., & Stoica, I. (2006). Load balancing in dynamic structured P2P systems. Performance Evaluation, 63(3).

Gummadi, K., Gummadi, R., Gribble, S., Ratnasamy, S., Shenker, S., & Stoica, I. (2003). The impact of DHT routing geometry on resilience and proximity. In Proc. of ACM SIGCOMM.

Jamin, S., Jin, C., Jin, Y., Raz, D., Shavitt, Y., & Zhang, L. (2000). On the placement of Internet instrumentation. In Proc. of INFOCOM.

Kaashoek, F., & Karger, D. R. (2003). Koorde: A simple degree-optimal distributed hash table. In Proc. of IPTPS.

Karger, D., Lehman, E., Leighton, T., Levine, M., et al. (1997). Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proc. of STOC (pp. 654–663).

Karger, D. R., & Ruhl, M. (2004). Simple efficient load balancing algorithms for peer-to-peer systems. In Proc. of IPTPS.

Manku, G. (2004). Balanced binary trees for ID management and load balance in distributed hash tables. In Proc. of PODC.

Maymounkov, P., & Mazières, D. (2002). Kademlia: A peer-to-peer information system based on the XOR metric. In Proc. of the 1st International Workshop on Peer-to-Peer Systems (IPTPS).

Mitzenmacher, M. (1997). On the analysis of randomized load balancing schemes. In Proc. of SPAA.

Mondal, A., Goda, K., & Kitsuregawa, M. (2003). Effective load-balancing of peer-to-peer systems. In Proc. of IEICE DEWS DBSJ Annual Conference.

Motwani, R., & Raghavan, P. (1995). Randomized Algorithms. New York: Cambridge University Press.

Naor, M., & Wieder, U. (2003, June). Novel architectures for P2P applications: The continuous-discrete approach. In Proc. of SPAA.

Rao, A., Lakshminarayanan, K., et al. (2003). Load balancing in structured P2P systems. In Proc. of IPTPS.
Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (2001). A scalable content-addressable network. In Proc. of ACM SIGCOMM (pp. 329–350).

Ratnasamy, S., Handley, M., Karp, R., & Shenker, S. (2002). Topologically aware overlay construction and server selection. In Proc. of INFOCOM.

Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Proc. of the 18th IFIP/ACM Int'l Conf. on Distributed Systems Platforms (Middleware).

Saroiu, S., et al. (2002). A measurement study of peer-to-peer file sharing systems. In Proc. of MMCN.

Shen, H., & Xu, C. (2006, April). Hash-based proximity clustering for load balancing in heterogeneous DHT networks. In Proc. of IPDPS.

Shen, H., Xu, C., & Chen, G. (2006). Cycloid: A scalable constant-degree P2P overlay network. Performance Evaluation, 63(3), 195–216. doi:10.1016/j.peva.2005.01.004

Shen, H., & Xu, C.-Z. (2007). Locality-aware and churn-resilient load balancing algorithms in structured peer-to-peer networks. IEEE Transactions on Parallel and Distributed Systems (TPDS), 18(6), 849–862. doi:10.1109/TPDS.2007.1040

Stoica, I., Morris, R., et al. (2003). Chord: A scalable peer-to-peer lookup protocol for Internet applications. IEEE/ACM Transactions on Networking.

Waldvogel, M., & Rinaldi, R. (2002). Efficient topology-aware overlay network. In Proc. of HotNets-I.

Xu, C. (2005). Scalable and Secure Internet Services and Architecture. Boca Raton, FL: Chapman & Hall/CRC Press.

Xu, Z., Mahalingam, M., & Karlsson, M. (2003). Turning heterogeneity into an advantage in overlay routing. In Proc. of INFOCOM.

Xu, Z., Tang, C., & Zhang, Z. (2003). Building topology-aware overlays using global soft-state. In Proc. of ICDCS.

Yang, B., & Garcia-Molina, H. (2003). Designing a super-peer network. In Proc. of ICDE.

Zegura, E., Calvert, K., et al. (1996). How to model an internetwork. In Proc. of INFOCOM.

Zhao, B. Y., Kubiatowicz, J., & Joseph, A. D. (2001). Tapestry: An infrastructure for fault-tolerant wide-area location and routing (Tech. Rep. UCB/CSD-01-1141). University of California at Berkeley, Berkeley, CA.

Zhu, Y., & Hu, Y. (2005). Efficient, proximity-aware load balancing for DHT-based P2P systems. IEEE Transactions on Parallel and Distributed Systems, 16(4).
KEY TERMS AND DEFINITIONS

Dynamism/Churn: A great number of nodes join, leave and fail continually and rapidly, leading to an unpredictable network size.

Heterogeneity: The intrinsic properties of participating peers, including computing ability, differ widely and deserve serious consideration in the construction of a truly efficient, widely deployed application.

Load Balancing Method: A method that keeps the load on each node at or below the node's capacity.

Peer: A peer (or node) is an abstract notion of a participating entity. It can be a computer process, a computer, an electronic device, or a group of them.

Peer-to-Peer Network: A peer-to-peer network is a logical network on top of physical networks in which peers are organized without any centralized coordination.

Proximity: The mismatch between the logical proximity abstraction derived from DHTs and the physical proximity of nodes in reality, which is a major obstacle to the deployment and performance optimization of P2P applications.

Structured Peer-to-Peer Network/Distributed Hash Table: A peer-to-peer network that maps keys to nodes based on a consistent hashing function.
Chapter 9
Decentralized Overlay for Federation of Enterprise Clouds

Rajiv Ranjan, The University of Melbourne, Australia
Rajkumar Buyya, The University of Melbourne, Australia

DOI: 10.4018/978-1-60566-661-7.ch009
ABSTRACT

This chapter describes Aneka-Federation, a decentralized and distributed system that combines enterprise Clouds, overlay networking, and structured peer-to-peer techniques to create scalable wide-area networking of compute nodes for high-throughput computing. The Aneka-Federation integrates numerous small-scale Aneka Enterprise Cloud services and nodes that are distributed over multiple control and enterprise domains as parts of a single coordinated resource leasing abstraction. The system is designed with the aim of making distributed enterprise Cloud resource integration and application programming flexible, efficient, and scalable. The system is engineered such that it: enables seamless integration of existing Aneka Enterprise Clouds as part of a single wide-area resource leasing federation; self-organizes the system components based on a structured peer-to-peer routing methodology; and presents end-users with a distributed application composition environment that can support a variety of programming and execution models. This chapter describes the design and implementation of a novel, extensible and decentralized peer-to-peer technique that helps to discover, connect and provision the services of Aneka Enterprise Clouds among users who can use different programming models to compose their applications. Evaluations of the system with applications programmed using the Task and Thread execution models on top of an overlay of Aneka Enterprise Clouds are also described.
INTRODUCTION

Wide-area overlays of enterprise Grids (Luther, Buyya, Ranjan, & Venugopal, 2005; Andrade, Cirne, Brasileiro, & Roisenberg, 2003; Butt, Zhang, & Hu, 2003; Mason & Kelly, 2005) and Clouds (Amazon
Elastic Compute Cloud, 2008; Google App Engine, 2008; Microsoft Live Mesh, 2008; Buyya, Yeo, & Venugopal, 2008) are an appealing platform for the creation of high-throughput computing resource pools and cross-domain virtual organizations. An enterprise Cloud1 is a type of computing infrastructure that consists of a collection of inter-connected computing nodes, virtualized computers, and software services that are dynamically provisioned among the competing end-users' applications based on their availability, performance, capability, and Quality of Service (QoS) requirements. Various enterprise Clouds can be pooled together to form a federated infrastructure of resource pools (nodes, services, virtual computers). In a federated organisation: (i) every participant gets access to much larger pools of resources; (ii) the peak-load handling capacity of every enterprise Cloud increases without the need to maintain or administer any additional computing nodes, services, and storage devices; and (iii) the reliability of an enterprise Cloud is enhanced as a result of multiple redundant clouds that can efficiently tackle disaster conditions and ensure business continuity. Emerging enterprise Cloud applications and the underlying federated hardware infrastructure (Data Centers) are inherently large, with heterogeneous resource types that may exhibit temporal resource conditions. The unique challenges in efficiently managing a federated Cloud computing environment include:

• Large scale: composed of distributed components (services, nodes, applications, users, virtualized computers) that combine together to form a massive environment. These days enterprise Clouds consisting of hundreds of thousands of computing nodes are common (Amazon Elastic Compute Cloud, 2008; Google App Engine, 2008; Microsoft Live Mesh, 2008), and hence federating them together leads to a massive scale environment;
• Resource contention: driven by the resource demand pattern and a lack of cooperation among end-users' applications, particular sets of resources can get swamped with excessive workload, which significantly undermines the overall utility delivered by the system; and
• Dynamic: the components can leave and join the system at will.
The aforementioned characteristics of the infrastructure account for significant development, system integration, configuration, and resource management challenges. Further, end-users follow a variety of programming models to compose their applications. In other words, in order to efficiently harness the computing power of enterprise Cloud infrastructures (Chu, Nandiminti, Jin, Venugopal, & Buyya, 2007; Amazon Elastic Compute Cloud, 2008; Google App Engine, 2008; Microsoft Live Mesh, 2008), software services that can support a high level of scalability, robustness, self-organization, and application composition flexibility are required. This chapter has two objectives. The first is to investigate the challenges related to the design and development of a decentralized, scalable, self-organizing, and federated Cloud computing system. The second is to introduce the Aneka-Federation software system, which includes various software services, peer-to-peer resource discovery protocols and resource provisioning methods (Ranjan, 2007; Ranjan, Harwood, & Buyya, 2008) to deal with the challenges in designing a decentralized resource management system in a complex, dynamic and heterogeneous enterprise Cloud computing environment. The components of the Aneka-Federation, including computing nodes, services, providers and end-users, self-organize themselves based on a structured peer-to-peer routing methodology to create a scalable wide-area overlay of enterprise Clouds. In the rest of this chapter, the terms Aneka Cloud(s) and Aneka Enterprise Cloud(s) are used interchangeably.
The unique features of Aneka-Federation are: (i) a wide-area scalable overlay of distributed Aneka Enterprise Clouds (Chu et al., 2007); (ii) the realization of a peer-to-peer based decentralized resource discovery technique as a software service, which has the capability to handle complex resource queries; and (iii) the ability to enforce coordinated interaction among end-users through the implementation of a novel decentralized resource provisioning method. This provisioning method is engineered over a peer-to-peer routing and indexing system that has the ability to route, search and manage complex coordination objects in the system. The rest of this chapter is organized as follows. First, the challenges and requirements related to the design of decentralized enterprise Cloud overlays are presented. Next follows a brief introduction to the Aneka Enterprise Cloud system, including its basic architecture, key services and programming models. Then, finer details of the Aneka-Federation software system, which builds upon the decentralized Content-based services, are presented, followed by comprehensive details on the design and implementation of the decentralized Content-based services for message routing, search, and coordinated interaction. Next, an experimental case study and analysis based on the test run of two enterprise Cloud applications on the Aneka-Federation system is presented. Finally, this work is put in context with related works. The chapter ends with a brief conclusion.
DESIGNING DECENTRALIZED ENTERPRISE CLOUD OVERLAY

In a decentralized organization of Cloud computing systems, both control and decision making are decentralized by nature, and different system components interact together to adaptively maintain and achieve a desired system-wide behavior. A distributed Cloud system configuration is considered to be decentralized "if none of the components in the system are more important than the others; in case one of the components fails, it is neither more nor less harmful to the system than the failure of any other component in the system". A fundamental challenge in managing a decentralized Cloud computing system is to maintain consistent connectivity between the components (self-organization) (Parashar & Hariri, 2007). This challenge cannot be overcome by introducing a central network model to connect the components, since the information needed for managing the connectivity and making the decisions is completely decentralized and distributed. Further, a centralized network model (Zhang, Freschl, & Schopf, 2003) does not scale well, lacks fault-tolerance, and requires expensive server hardware infrastructure. System components can leave, join, and fail in a dynamic fashion; hence it is an impossible task to manage such a network centrally. Therefore, an efficient decentralized solution is mandatory that can gracefully adapt and scale to changing conditions. A possible way to efficiently interconnect the distributed system components is to use structured peer-to-peer overlays. In the literature, structured peer-to-peer overlays are more commonly referred to as Distributed Hash Tables (DHTs). DHTs provide hash table like functionality at the Internet scale. DHTs such as Chord (Stoica, Morris, Karger, Kaashoek, & Balakrishnan, 2001), CAN (Ratnasamy, Francis, Handley, Karp, & Schenker, 2001), Pastry (Rowstron & Druschel, 2001), and Tapestry (Zhao, Kubiatowicz, & Joseph, 2001) are inherently self-organizing, fault-tolerant, and scalable. DHTs provide services that are light-weight and hence do not require an expensive hardware platform for hosting, which is an important requirement for building and managing an enterprise Cloud system that consists of commodity machines. A DHT is a distributed data structure that associates a key with data.
Entries in a DHT are stored as (key, data) pairs. A data item can be looked up within a logarithmic number of overlay routing hops if the corresponding key is known. The effectiveness of a decentralized Cloud computing system depends on the level of coordination and cooperation among the components (users, providers, services) with respect to scheduling and resource allocation. Realizing cooperation among distributed Cloud components requires the design and development of self-organizing, robust, and scalable coordination protocols. The Aneka-Federation system implements one such coordination protocol using DHT-based routing, lookup and discovery services. The finer details of the coordination protocol are discussed later in the text.
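The (key, data) abstraction can be pictured with a toy consistent-hashing sketch (written in Java; this is not any particular DHT implementation): node identifiers and keys are hashed into a single circular identifier space, and an entry is stored on the first node whose identifier follows the key. A real DHT such as Chord or Pastry locates that node in a logarithmic number of overlay hops, rather than holding the whole ring in one local data structure as is done here for brevity.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class ToyDht {
    // node id -> local (key, data) store; ids live on a circular identifier ring.
    private final TreeMap<Integer, Map<Integer, String>> ring = new TreeMap<>();

    void addNode(String nodeName) {
        ring.put(hash(nodeName), new HashMap<>());
    }

    // The successor of the key on the ring is responsible for it
    // (assumes at least one node has joined).
    private Map<Integer, String> responsible(int key) {
        Map.Entry<Integer, Map<Integer, String>> e = ring.ceilingEntry(key);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    void put(String key, String data) { responsible(hash(key)).put(hash(key), data); }

    String get(String key) { return responsible(hash(key)).get(hash(key)); }

    private static int hash(String s) { return s.hashCode() & 0x7fffffff; }
}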
ANEKA ENTERPRISE CLOUD: AN OVERVIEW

Aneka (Chu et al., 2007) is a .NET-based service-oriented platform for constructing enterprise Clouds. It is designed to support multiple application models, persistence and security solutions, and communication protocols, such that the preferred selection can be changed at any time without affecting an existing Aneka ecosystem. To create an enterprise Cloud, the resource provider only needs to start an instance of the configurable Aneka container hosting the required services on each selected Cloud node. The purpose of the Aneka container is to initialize services and to act as a single point of interaction with the rest of the enterprise Cloud. Figure 1 shows the design of the Aneka container on a single Cloud node. To support scalability, the Aneka container is designed to be lightweight by providing the bare minimum functionality needed for an enterprise Cloud node. It provides the base infrastructure that consists of services for persistence, security (authorization, authentication and auditing), and communication (message handling and dispatching). Every communication within the Aneka services is treated as a message, handled and dispatched through the message handler/dispatcher that acts as a front controller. The Aneka container hosts a compulsory MembershipCatalogue service, which maintains the resource discovery indices (such as a .Net remoting address) of those services currently active in the system. The Aneka container can host any number of optional services that can be added to augment the capabilities of an enterprise Cloud node. Examples of optional services are indexing, scheduling, execution, and storage services. This provides a single, flexible and extensible framework for orchestrating different kinds of enterprise Cloud application models. To support reliability and flexibility, services are designed to be independent of each other in a container. A service can only interact with other services on the local node or on another Cloud node through known interfaces. This means that a malfunctioning service will not affect the other working services and/or the container. Therefore, the resource provider can seamlessly configure and manage existing services or introduce new ones into a container. Aneka thus provides the flexibility for the resource provider to implement any network architecture for an enterprise Cloud. The implemented network architecture depends on the interaction of services among enterprise Cloud nodes, since each Aneka container on a node can directly interact with other Aneka containers reachable on the network.
Figure 1. Design of Aneka container
ANEKA-FEDERATION

The Aneka-Federation system self-organizes the components (nodes, services, clouds) based on a DHT overlay. Each enterprise Cloud site in the Aneka-Federation (see Figure 2) instantiates a new software service, called the Aneka Coordinator. Based on the scalability requirements and system size, an enterprise Cloud can instantiate multiple Aneka Coordinator services. The Aneka Coordinator implements the resource management functionalities and the resource discovery protocol specifications. The software design of the Aneka-Federation system decouples the fundamental decentralized interaction of participants from the resource allocation policies and the details of managing a specific Aneka Cloud Service. The Aneka-Federation software system utilizes the decentralized Cloud services for efficient distributed resource discovery and coordinated scheduling.
DESIGN AND IMPLEMENTATION

The Aneka Coordinator software service is composed of the following components:
Figure 2. Aneka-Federation network with the coordinator services and Aneka enterprise Clouds
• Aneka services: These include the core services for peer-to-peer scheduling (Thread Scheduler, Task Scheduler, Dataflow Scheduler) and peer-to-peer execution (Thread Executor, Task Executor) provided by the Aneka framework. These services work independently in the container and have the ability to interact with other services, such as the P2PMembershipCatalogue, through the MessageDispatcher service deployed within each container.
• Aneka peer: This component of the Aneka Coordinator service loosely glues together the core Aneka services and the decentralized Cloud services. The Aneka peer seamlessly encapsulates the following: the Apache Tomcat container (hosting environment and web service front end to the Content-based services), Internet Information Server (IIS) (hosting environment for the ASP.Net service), the P2PMembershipCatalogue, and the Content-based services (see Figure 4). The basic functionalities of the Aneka peer (refer to Figure 3) include providing services for: (i) Content-based routing of lookup and update messages; and (ii) facilitating decentralized coordination for efficient resource sharing and load-balancing among the Internet-wide distributed Aneka Enterprise Clouds. The Aneka peer service operates at the Core services layer in the layered architecture shown in Figure 9.
Figure 4 shows a block diagram of the interaction between the various components of the Aneka Coordinator software stack. The Aneka Coordinator software stack encapsulates the P2PMembershipCatalogue
Figure 3. Aneka-Federation over decentralized Cloud services
and the Content-based decentralized lookup services. The design components for peer-to-peer scheduling, execution, and membership are derived from the basic Aneka framework components through object-oriented software inheritance (see Figure 5, Figure 6, and Figure 7). A UML (Unified Modeling Language) class diagram that displays the core entities within the Aneka Coordinator's Scheduling service is shown in Figure 5. The main class (refer to Figure 5) that undertakes activities related to application scheduling within the Aneka Coordinator is the P2PScheduling service, which is programmatically inherited from Aneka's IndependentScheduling service class. The P2PScheduling service implements the methods for: (i) accepting application submissions from client nodes (see Figure 8); (ii) sending search queries to the P2PMembershipCatalogue service; (iii) dispatching applications to Aneka nodes (P2PExecution service); and (iv) collecting the application output data. The core programming models in Aneka, including Task, Thread, and Dataflow, instantiate the P2PScheduling service as their main scheduler class. This runtime binding of the P2PScheduling service class to different programming models is taken care of by the Microsoft .NET platform and the Inversion of Control (IoC) (Fowler, 2008) implementation in the Spring .NET framework (Spring.Net, 2008). Similarly to the P2PScheduling service, the binding of the P2PExecution service to specific programming models (such as P2PTaskExecution, P2PThreadExecution) is done by the Microsoft .NET platform and the IoC implementation in the Spring .NET framework. The interaction between the services (such as the P2PTaskExecution and P2PTaskScheduling services) is facilitated by the MessageDispatcher service. The P2PExecution services update their node usage status with the P2PMembershipCatalogue through the P2PExecutorStatusUpdate component (see Figure 6). The core Aneka framework defines distinct message types to enable seamless interaction between services. The functionality of handling, compiling, and delivering the messages within the Aneka framework is implemented in the MessageDispatcher service. Recall that the MessageDispatcher service is automatically deployed in the Aneka container.
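The runtime binding of programming models to their scheduling services can be pictured with a small inversion-of-control style sketch. The Java interface and registry below are hypothetical stand-ins for the Spring .NET configuration that Aneka actually uses; the point is only that supporting a new programming model requires a new registration rather than changes to the code that requests a scheduler.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

interface SchedulingService {
    void submit(Object application);
}

class TaskP2PScheduling implements SchedulingService {
    public void submit(Object application) { /* encapsulate a claim, query the catalogue, dispatch */ }
}

class ThreadP2PScheduling implements SchedulingService {
    public void submit(Object application) { /* encapsulate a claim, query the catalogue, dispatch */ }
}

// The container resolves the scheduler for a model name at runtime.
public class ModelRegistry {
    private static final Map<String, Supplier<SchedulingService>> BINDINGS = new HashMap<>();
    static {
        BINDINGS.put("Task", TaskP2PScheduling::new);
        BINDINGS.put("Thread", ThreadP2PScheduling::new);
    }

    public static SchedulingService schedulerFor(String model) {
        return BINDINGS.get(model).get(); // assumes the model has been registered
    }
}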
Figure 4. A block diagram showing interaction between various components in the Aneka Coordinator software stack
The P2PMembershipCatalogue service is the core component that interacts with the Content-based decentralized Cloud services and aids in the organization and management of the Aneka-Federation overlay. The UML class design for this service within the Aneka Coordinator is shown in Figure 7. This service accepts resource claim and ticket objects from the P2PScheduling and P2PExecution services respectively (refer to Figure 8), which are then posted with the Content-based services hosted in the Apache Tomcat container. The P2PMembershipCatalogue interacts with the components hosted within the Apache Tomcat container (a Java implementation) using the SOAP-based web services Application Programming Interfaces (APIs) exposed by the DFPastryManager component (see Figure 7). The Content-based service communicates with the P2PMembershipCatalogue service through an ASP.NET web service hosted within the IIS container (see Figure 4 or 8). The mandatory services within an Aneka Coordinator that are required to instantiate a fully functional Aneka Enterprise Cloud site include the P2PMembershipCatalogue, P2PExecution, P2PScheduling, .Net web service, and Content-based services (see Figure 8). These services export an enterprise Cloud site to the federation and give it the capability to accept remote jobs based on its load condition (using its P2PExecution services) and to submit local jobs to the federation (through its P2PScheduling services). Figure 8 demonstrates a sample application execution flow in the Aneka-Federation system. Clients
Figure 5. Class design diagram of P2PScheduling service
directly connect and submit their application to a programming-model-specific scheduling service. For instance, a client having an application programmed using Aneka's Thread model would submit the application to the Thread P2PScheduling service (refer to step 1 in Figure 8). Clients discover the point of contact for local scheduling services by querying their domain-specific Aneka Coordinator service. On receipt of an application submission message, a P2PScheduling service encapsulates the resource requirements of that application in a resource claim object and sends a query message to the P2PMembershipCatalogue (see step 2 in Figure 8). Execution services (such as P2PThreadExecution and P2PTaskExecution), which are distributed over different enterprise Clouds and administered by enterprise-specific Aneka Coordinator services, update their status by sending a resource ticket object to the P2PMembershipCatalogue (see step 3 in Figure 8). A resource ticket object in the Aneka-Federation system abstracts the type of service being offered, the underlying hardware platform, and the level of QoS that can be supported. The finer details about the composition and the mapping of resource ticket and claim objects are discussed later in this chapter. The P2PMembershipCatalogue then posts the resource ticket and claim objects with the decentralized Content-based services (see steps 4 and 5 in Figure 8). When a resource ticket, issued by a P2PExecution service, matches a resource claim object posted by a P2PScheduling service, the Content-based service sends a match notification to the P2PScheduling service through the P2PMembershipCatalogue (see steps 6, 7 and 8 in Figure 8). After receiving the notification, the P2PScheduling service deploys its application on the P2PExecution service (see step 9 in Figure 8). On completion of a submitted applica-
Figure 6. Class design diagram of P2PExecution service
tion, the P2PExecution service directly returns the output to the P2PScheduling service (see step 10 in Figure 8). The Aneka Coordinator service supports the following two inter-connection models for the creation of an Aneka Enterprise Cloud site (see Figure 9 and Figure 10). First, a resource sharing domain or enterprise Cloud can instantiate a single Aneka Coordinator service and let the other nodes in the Cloud connect to the Coordinator service. In such a scenario, the other nodes need to instantiate only the P2PExecution and P2PScheduling services. These services depend on the domain-specific Aneka Coordinator service for load updates, resource lookup, and membership to the federation (see Figure 11). In the second configuration, each node in a resource domain can be installed with all the services of the Aneka Coordinator (see Figure 4). This kind of inter-connection leads to a true peer-to-peer Aneka-Federation Cloud network, where each node is an autonomous computing node that has the ability to implement its own resource management and scheduling decisions. Hence, in this case the Aneka Coordinator service can support a completely decentralized Cloud computing environment both within and between enterprise Clouds.
Figure 7. Class design diagram of P2PMembershipCatalogue service
CONTENT-BASED DECENTRALIZED CLOUD SERVICES

As mentioned earlier, a DHT-based overlay presents a compelling solution for creating a decentralized network of Internet-wide distributed Aneka Enterprise Clouds. However, DHTs are efficient at handling single-dimensional search queries such as "find all services that match a given attribute value". Since Cloud computing resources such as enterprise computers, supercomputers, clusters, storage devices, and databases are identified by more than one attribute, a resource search query for these resources is always multi-dimensional. These resource dimensions or attributes include service type, processor speed, architecture, installed operating system, available memory, and network bandwidth. Recent advances in the domain of decentralized resource discovery have been based on extending existing DHTs with the capability of multi-dimensional data organization and query routing (Ranjan, Harwood, & Buyya, 2008). Our decentralized Cloud management middleware supports peer-to-peer Content-based resource discovery and coordination services for the efficient management of distributed enterprise Clouds. The middleware is designed based on a 3-tier layered architecture: the Application layer, Core Services layer, and Connectivity layer (see Figure 9). Cloud services such as the Aneka Coordinator, resource brokers, and schedulers work at the Application layer and insert objects via the Core services layer. The core functionality, including the support for decentralized coordinated interaction and scalable resource dis-
Figure 8. Application execution sequence in Aneka-Federation
covery, is delivered by the Core Services layer. The Core services layer, which is managed by the Aneka peer software service, is composed of two sub-layers (see Figure 9): (i) the Coordination service (Ranjan et al., 2007); and (ii) the Resource discovery service. The Coordination service component of the Aneka peer accepts coordination objects such as resource claims and resource tickets. A resource claim object is a multi-dimensional range look-up query (Samet, 2008) (a spatial range object), which is initiated by Aneka Coordinators in the system in order to locate the available Aneka Enterprise Cloud nodes or services that can host their clients' applications. A resource claim object has the following semantics:
Aneka Service = "P2PThreadExecution" && CPU Type = "Intel" && OSType = "WinXP" && Processor Cores > "1" && Processors Speed > "1.5 GHz"

On the other hand, a resource ticket is a multi-dimensional point update query (a spatial point object), which is sent by an Aneka Enterprise Cloud to report the availability status of the local Cloud nodes and the deployed services. A resource ticket object has the following semantics:
Figure 9. Layered view of the content-based decentralized Cloud services
Aneka Service = "P2PThreadExecution" && CPU Type = "Intel" && OSType = "WinXP" && Processor Cores = "2" && Processors Speed = "3 GHz"

Further, both of these queries can specify different kinds of constraints on the attribute values. If a query specifies a fixed value for each attribute, it is referred to as a multi-dimensional point query. If, instead, the query specifies a range of values for some attributes, it is referred to as a multi-dimensional range query. The claim and ticket objects encapsulate coordination logic, which in this case is the resource provisioning logic. The calls between the Coordination service and the Resource Discovery service are made through the standard publish/subscribe technique. The Resource Discovery service is responsible for efficiently mapping these complex objects to the DHT overlay. It organizes the resource attributes by embedding a logical publish/subscribe index over a network of distributed Aneka peers. Specifically, the Aneka peers in the system create a DHT overlay that collectively maintains the logical index to facilitate a decentralized resource discovery process. The spatial publish/subscribe index builds a multi-dimensional attribute space based on the Aneka Enterprise Cloud nodes' resource attributes, where each attribute represents a single dimension. The multi-dimensional spatial index assigns regions of space to the Aneka peers.
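Putting the two query types together: a ticket (a point in the attribute space) satisfies a claim (a conjunction of fixed values and range constraints) when every constraint holds. The sketch below encodes the example claim and ticket semantics shown above; the Java field names are hypothetical simplifications of the actual object schemas.

public class ClaimTicketMatch {

    // A ticket is a point: concrete values published by an Aneka Cloud node.
    static class Ticket {
        String serviceType = "P2PThreadExecution";
        String cpuType = "Intel";
        String osType = "WinXP";
        int cores = 2;
        double speedGHz = 3.0;
    }

    // A claim is a range query: fixed values plus lower bounds on cores and speed.
    static class Claim {
        String serviceType = "P2PThreadExecution";
        String cpuType = "Intel";
        String osType = "WinXP";
        int minCores = 1;          // "Processor Cores > 1" in the example claim
        double minSpeedGHz = 1.5;  // "Processors Speed > 1.5 GHz"
    }

    static boolean matches(Claim c, Ticket t) {
        return c.serviceType.equals(t.serviceType)
            && c.cpuType.equals(t.cpuType)
            && c.osType.equals(t.osType)
            && t.cores > c.minCores
            && t.speedGHz > c.minSpeedGHz;
    }
}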
Figure 10. Resource claim and ticket object mapping and coordinated scheduling across Aneka Enterprise Cloud sites. Spatial resource claims {T1, T2, T3, T4}, index cell control points {A, B, C, D}, spatial point tickets {l, s} and some of the spatial hashings to the Pastry ring, i.e. the d-dimensional (spatial) coordinate values of a cell's control point are used as the Pastry key. For this figure, fmin = 2, dim = 2.
The calls between the Core services layer and the Connectivity layer are made through standard DHT primitives such as put(key, value) and get(key), which are defined by the peer-to-peer Common Application Programming Interface (API) specification (Dabek, Zhao, Druschel, Kubiatowicz, & Stoica, 2003). There are different kinds of spatial indices, such as Space Filling Curves (SFCs) (including the Hilbert curves and Z-curves), the k-d tree, the MX-CIF Quad tree and the R*-tree, that can be utilized by the resource discovery service at the Core services layer for managing, routing, and indexing objects. Spatial indices are well suited to handling the complexity of Cloud resource queries. Although some spatial indices can have routing load-balance issues in the case of a skewed attribute set, all the spatial indices are generally scalable in terms of the number of hops traversed and messages generated while searching and routing multi-dimensional/spatial claim and ticket objects. Resource claim and ticket object mapping: At the Core services layer, a spatial index that assigns regions of the multi-dimensional attribute space to Aneka peers has been implemented. The MX-CIF Quadtree spatial hashing technique (Egemen, Harwood, & Samet, 2007) is used to map the logical
Figure 11. Aneka-Federation test bed distributed over 3 departmental laboratories
multi-dimensional control point (point C in Figure 10 represents a 2-dimensional control point) onto a Pastry DHT overlay. If an Aneka peer is assigned a region in the multi-dimensional attribute space, then it is responsible for handling all the activities related to the lookups and updates that intersect with that region of space. Figure 10 depicts a 2-dimensional Aneka resource attribute space for mapping resource claim and ticket objects. The attribute space resembles a mesh-like structure due to its recursive division process. The index cells resulting from this process remain constant throughout the life of a d-dimensional attribute space and serve as the entry points for the subsequent mapping of claim and ticket objects. The number of index cells produced at the minimum division level fmin is always equal to (fmin)^dim, where dim is the dimensionality of the attribute space. These index cells are called base index cells, and they are initialized when the Aneka Peers bootstrap to the federation network. Finer details on the recursive subdivision technique can be found in (Egemen et al., 2007). Every Aneka Peer in the federation has the basic information about the attribute space coordinate values, dimensions and minimum division levels. Every cell at the fmin level is uniquely identified by its centroid, termed the control point. Figure 10 shows four control points A, B, C, and D. A DHT hashing method (using cryptographic functions such as SHA-1/2) is utilized to map the responsibility of managing control points to the Aneka Peers. In a 2-dimensional setting, for an index cell i = (x1, y1, x2, y2), the control point is computed as ((x1+x2)/2, (y1+y2)/2). The spatial hashing technique takes two input parameters, SpatialHash(control point coor-
dinates, object's coordinates), which in terms of the DHT common API primitives can be written as put(Key, Value), where the cryptographic hash of the control point acts as the Key for the DHT overlay, while the Value is the coordinate values of the resource claim or ticket object to be mapped. In Figure 10, the Aneka peer at Cloud s is assigned index cell i through the spatial hashing technique, which makes it responsible for managing all objects that map to cell i (Claims T2, T3, T4 and Ticket s). For claim objects, the process of mapping index cells to the Aneka Peers depends on whether the object is a spatial point object or a spatial range object. The mapping of a point object is simple, since every point is mapped to only one cell in the attribute space. For a spatial range object (such as Claims T2, T3 or T4), the mapping is not always singular, because a range object can cross more than one index cell (see Claim T5 in Figure 10). To avoid mapping a spatial range object to all the cells that it intersects, which can create many duplicates, a mapping strategy based on a diagonal hyperplane in the attribute space is implemented. This mapping involves feeding the spatial range object's coordinate values and the candidate index cells as inputs to a mapping function, Fmap(spatial object, candidate index cells). An Aneka Peer service uses the index cell(s) currently assigned to it and a set of known base index cells as candidate cells, which are obtained at the time of bootstrapping into the federation. Fmap returns the index cells, and their control points, to which the given spatial range object should be mapped. Next, these control points and the spatial object are given as inputs to the function SpatialHash(control point, object), which in connection with the Connectivity layer generates DHT Ids (Keys) and routes the claim/ticket objects to the Aneka Peers. Similarly, the mapping process of a ticket object also involves the identification of the intersecting index cells in the attribute space. A ticket is always associated with a region (Gupta, Sahin, Agarwal, & Abbadi, 2004), and all cells that fall fully or partially within the region are selected to receive the corresponding ticket. The calculation of the region is based upon the diagonal hyperplane of the attribute space. Coordinated load balancing: Both resource claim and ticket objects are spatially hashed to an index cell i in the multi-dimensional Aneka services' attribute space. In Figure 10, the resource claim object for task T1 is mapped to index cell A, while for T2, T3, and T4 the responsible cell is i, with control point C. Note that these resource claim objects are posted by the P2PScheduling services (Task or Thread) of Aneka Cloud nodes. In Figure 10, the scheduling service at Cloud p posts a resource claim object which is mapped to index cell i. The index cell i is spatially hashed to an Aneka peer at Cloud s. In this case, Cloud s is responsible for coordinating the resource sharing among all the resource claims that are currently mapped to cell i. Subsequently, Cloud u issues a resource ticket (see Figure 10) that falls in a region of the attribute space currently required by the tasks T3 and T4. Next, the coordination service of the Aneka peer at Cloud s has to decide which of the tasks (T3, T4, or both) is allowed to claim the ticket issued by Cloud u. The load-balancing decision is based on the principle that it should not lead to over-provisioning of resources at Cloud u.
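A simplified, 2-dimensional sketch of the control-point computation and the spatial hashing step follows. The helper names and the string-based stand-in for the cryptographic hash are assumptions made purely for illustration; the returned pair mirrors the put(Key, Value) call that is routed through the Connectivity layer.

public class SpatialHashSketch {

    // Control point (centroid) of a 2-dimensional index cell (x1, y1, x2, y2).
    static double[] controlPoint(double x1, double y1, double x2, double y2) {
        return new double[] { (x1 + x2) / 2.0, (y1 + y2) / 2.0 };
    }

    // Spatial hashing of a point object: locate the base index cell containing
    // the object, derive the DHT key from that cell's control point, and carry
    // the object's coordinates as the value. cellSize is the side length of a
    // base index cell at division level fmin.
    static String[] spatialHash(double px, double py, double cellSize) {
        double x1 = Math.floor(px / cellSize) * cellSize;
        double y1 = Math.floor(py / cellSize) * cellSize;
        double[] cp = controlPoint(x1, y1, x1 + cellSize, y1 + cellSize);
        String key = Integer.toHexString((cp[0] + "," + cp[1]).hashCode()); // stands in for SHA-1/2
        String value = px + "," + py;
        return new String[] { key, value };                                 // mirrors put(Key, Value)
    }
}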
The mechanism just described leads to coordinated load-balancing across Aneka Enterprise Clouds and aids in achieving the system-wide objective function, while at the same time preserving the autonomy of the participating Aneka Enterprise Clouds. Table 1 lists example resource claim objects that are stored with an Aneka peer's coordination service at time T = 700 secs. Essentially, the claims in the list arrived at a time <= 700 and are waiting for a suitable ticket object that can meet their application's requirements (software, hardware, service type). Table 2 depicts a ticket object that arrives at T = 700. Following the ticket arrival, the coordination service undertakes a procedure that allocates the ticket object among the list of matching claims. Based on the Cloud node's attribute specification, both Claim 1 and Claim 2 match the ticket-issuing Cloud
Table 1. Claims stored with an Aneka Peer service at time T

Time | Claim ID | Service Type       | Speed (GHz) | Processors | Type
300  | Claim 1  | P2PThreadExecution | > 2         | 1          | Intel
400  | Claim 2  | P2PTaskExecution   | > 2         | 1          | Intel
500  | Claim 3  | P2PThreadExecution | > 2.4       | 1          | Intel
node's configuration. As specified in the ticket object, there is currently one processor available within Cloud 2, which means that at this time only Claim 1 can be served. Following this, the coordination service notifies the Aneka Coordinator that posted Claim 1. Note that Claims 2 and 3 have to wait for the arrival of tickets that can match their requirements. The Connectivity layer is responsible for undertaking key-based routing in the DHT overlay, where it can implement routing methods based on DHTs such as Chord, CAN, and Pastry. The actual implementation protocol at this layer does not directly affect the operations of the Core services layer. In principle, any DHT implementation at this layer could perform the desired task. DHTs are inherently self-organizing, fault-tolerant, and scalable. At the Connectivity layer, our middleware utilizes the open source implementation of the Pastry DHT known as FreePastry (2008). FreePastry offers a generic, scalable and efficient peer-to-peer routing framework for the development of decentralized Cloud services. FreePastry is an open source implementation of the well-known Pastry routing substrate. It exposes a Key-based Routing (KBR) API and, given a key K, the Pastry routing algorithm can find the peer responsible for this key in log_b(n) messages, where b is the base and n is the number of Aneka Peers in the network. Nodes in a Pastry overlay form a decentralized, self-organising and fault-tolerant circular network within the Internet. Both data and peers in the Pastry overlay are assigned Ids from a 160-bit unique identifier space. These identifiers are generated by hashing the object's name, a peer's IP address or a public key using cryptographic hash functions such as SHA-1/2. FreePastry is currently available under a BSD-like license. The FreePastry framework supports the P2P Common API specification proposed in the paper (Dabek, Zhao, Druschel, Kubiatowicz, & Stoica, 2003).
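The allocation walk-through above (Tables 1 and 2) can be summarized in a short sketch: matching claims are served in arrival order until the ticket's available processors are exhausted, and the remaining claims keep waiting for future tickets. The Java types and field names below are hypothetical simplifications of the coordination service's behaviour, not its actual interface.

import java.util.ArrayList;
import java.util.List;

public class TicketAllocation {

    static class Claim {
        long arrivalTime;
        String serviceType;
        double minSpeedGHz;
        int processors;
    }

    static class Ticket {
        String serviceType;
        double speedGHz;
        int availableProcessors;
    }

    // Serve waiting claims in arrival order while the ticket still has capacity;
    // claims left in the list keep waiting (like Claims 2 and 3 in the example).
    static List<Claim> allocate(List<Claim> waitingClaims, Ticket ticket) {
        List<Claim> served = new ArrayList<>();
        waitingClaims.sort((a, b) -> Long.compare(a.arrivalTime, b.arrivalTime));
        for (Claim c : new ArrayList<>(waitingClaims)) {
            boolean matches = c.serviceType.equals(ticket.serviceType)
                    && ticket.speedGHz > c.minSpeedGHz
                    && ticket.availableProcessors >= c.processors;
            if (matches) {
                ticket.availableProcessors -= c.processors;
                waitingClaims.remove(c);
                served.add(c); // the coordination service would now notify the claim's Aneka Coordinator
            }
        }
        return served;
    }
}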
EXPERIMENTAL EVALUATION AND DISCUSSION

In this section, we evaluate the performance of the Aneka-Federation software system by creating a resource sharing network that consists of 5 Aneka Enterprise Clouds (refer to Figure 11). These Aneka Enterprise Clouds are installed and configured in three different Laboratories (Labs) within the Computer Science and Software Engineering Department, The University of Melbourne.
Table 2. Ticket published with an Aneka Peer service at time T

| Time | Cloud ID | Service Type       | Speed (GHz) | Processors    | Type  |
|------|----------|--------------------|-------------|---------------|-------|
| 700  | Cloud 2  | P2PThreadExecution | 2.7         | 1 (available) | Intel |
The nodes in these Labs are connected through a Local Area Network (LAN). The LAN connection has a data transfer bandwidth of 100 Mb/sec (megabits per second). Next, the various parameters and application characteristics related to this study are briefly described.

Aneka enterprise cloud configuration: Each Aneka Cloud in the experiments is configured to have 4 nodes, one of which instantiates the Aneka-Coordinator service. In addition to the Aneka-Coordinator service, this node also hosts the other optional services including the P2PScheduling (for Thread and Task models) and P2PExecution services (for Thread and Task models). The remaining 3 nodes are configured to run the P2PExecution services for the Task and Thread programming models. These nodes connect and communicate with the Aneka-Coordinator service through .Net remoting messaging APIs. The P2PExecution services periodically update their usage status with the Aneka-Coordinator service. The update delay is a configurable parameter with values in milliseconds or seconds. The nodes across different Aneka Enterprise Clouds update their status dynamically with the decentralized Content-based services. The node status update delays across the Aneka Enterprise Clouds are uniformly distributed over the interval [5, 40] seconds.

FreePastry network configuration: Both the Aneka Peers' nodeIds and the claim/ticket objectIds are randomly assigned from, and uniformly distributed in, the 160-bit Pastry identifier space. Every Content-based service is configured to buffer a maximum of 1000 messages at a given instant of time. The buffer size is chosen to be sufficiently large that FreePastry does not drop any messages. Other network parameters are configured to the default values given in the file freepastry.params, which is provided with the FreePastry distribution.

Spatial index configuration: The minimum division fmin of the logical d-dimensional spatial index that forms the basis for mapping, routing, and searching the claim and ticket objects is set to 3, while the maximum height of the spatial index tree, fmax, is constrained to 3. In other words, the division of the d-dimensional attribute space is not allowed beyond fmin. This is done for simplicity; understanding the load-balancing issues of spatial indices (Egemen et al., 2007) with increasing fmax is a different research problem and is beyond the scope of this chapter. The index space has provision for defining claim and ticket objects that specify the Aneka nodes'/services' characteristics in 4 dimensions: Aneka service type, number of processors, processor architecture, and processing speed. The aforementioned spatial index configuration results in 81 (3^4) index cells at the fmin level. On average, 16 index cells are hashed to an Aneka Peer in a network of 5 Aneka Coordinators.

Claim and ticket object's spatial extent: Ticket objects in the Aneka-Federation express equality constraints on an Aneka node's hardware/software attribute values (e.g. =). In other words, ticket objects are always d-dimensional (spatial) point queries for this study. On the other hand, the claim objects posted by P2PScheduling services have their spatial extent in d dimensions with both range and fixed constraints (e.g. >=, <=) for the attributes. The spatial extent of a claim object in a given attribute dimension is controlled by the characteristics of the node which is hosting the P2PScheduling service. Attributes including Aneka service type, processor architecture, and number of processors are fixed, i.e.
they are expressed as equality constraints. The value for processing speed is expressed using a >= constraint, i.e. the search is for Aneka services that can process the application at least as fast as the submission node. However, the P2PScheduling services can create claim objects with different kinds of constraints, which can result in different routing, searching, and matching complexity. Studying this behavior of the system is beyond the scope of this chapter.

Application models: Aneka supports the composition and execution of applications programmed using different models (Vecchiola & Chu, 2008) on the same enterprise Cloud infrastructure. The
experimental evaluation in this chapter considers simultaneous execution of applications programmed using the Task and Thread models. The Task model defines an application as a collection of one or more tasks, where each task represents an independent unit of execution. Similarly, the Thread model defines an application as a collection of one or more independent threads. Both models can be successfully utilized to compose and program embarrassingly parallel programs (parameter sweep applications). The Task model is more suitable for cloud-enabling legacy applications, while the Thread model fits better for implementing and architecting new applications and algorithms on clouds, since it gives a finer degree of control and flexibility with regard to runtime behavior. To demonstrate the effectiveness of the Aneka-Federation platform with regard to (i) ease and flexibility of heterogeneous application composition, (ii) support for different programming models, and (iii) feasibility of concurrent scheduling of heterogeneous applications on a shared Cloud computing infrastructure, the experiments are run with the following applications:

• Persistence of Vision Raytracer (2008): This application is cloud enabled using the Aneka Task programming model. POV-Ray is an image rendering application that can create very complex and realistic three-dimensional models. The Aneka POV-Ray application interface allows the selection of a model, the dimension of the rendered image, and the number of independent tasks into which the rendering activities have to be partitioned. The task partitioning is based on the values that a user specifies for the rows and columns parameters on the interface. In the experiments, the values for the rows and the columns are varied over the interval [5 x 5, 13 x 13] in steps of 2.

• Mandelbrot Set (2008): Mathematically, the Mandelbrot set is an ordered collection of points in the complex plane, the boundary of which forms a fractal. Aneka implements and cloud enables the Mandelbrot fractal calculation using the Thread programming model. The application submission interface allows the user to configure the number of horizontal and vertical partitions into which the fractal computation is divided. The number of independent thread units created is equal to horizontal x vertical partitions. For the evaluations, we vary the values of the horizontal and vertical parameters over the interval [5 x 5, 13 x 13] in steps of 2. This configuration results in 5 observation points.
Results and Discussion

To measure the performance of the Aneka-Federation system with regard to scheduling, we quantify the response time metric for the POV-Ray and Mandelbrot applications. The response time for an application is computed as the difference between the output arrival time of the last task/thread in the execution list and the time at which the application is submitted. The observations are made for different application granularities (sizes), as discussed in the last Section.

Figure 12 depicts the results for response time in seconds with increasing granularity for the POV-Ray application. The users at Aneka Clouds 1, 3, and 4 submit the applications to their respective Aneka Coordinator services (refer to Figure 11). The experimental results show that the POV-Ray application submitted at Aneka Cloud 1 experienced comparatively lower response times for its POV-Ray tasks than the ones submitted at Aneka Clouds 3 and 4. The fundamental reason behind this behavior of the system is the spatial extent and attribute constraints of the resource claim objects posted by the P2PTaskScheduling service at Aneka Cloud 1. As shown in Figure 11, every Aneka Cloud offers processors of type "Intel" with varying speed. As described in the previous Section, the processing speed
is expressed using >= constraints, which means that the applications submitted in Aneka Enterprise Clouds 1 and 2 (processing speed = 2.4 GHz) can be executed on any of the nodes in the enterprise Clouds 1, 2, 3, 4, and 5. However, the applications submitted at Aneka Clouds 3 and 4 can be executed only on Clouds 3, 4, and 5. Accordingly, the application submitted in Aneka Cloud 3 can only be processed locally, as the spatial dimension for processing speed in its resource claim objects specifies the constraint >= 3.5 GHz. Due to these spatial constraints on the processing speed attribute value, the applications in different Clouds get access to varying Aneka node pools, which results in different levels of response times.

Figure 12. POV-Ray application: Response time (secs) vs. problem size
Figure 13. Mandelbrot application: Response time (Secs) vs. problem size
Figure 14. P2PTaskExecution service: Time (secs) vs. number of jobs completed
For the aforementioned reasons, it can be seen in Figure 12 and Figure 13 (Mandelbrot applications) that applications at Aneka Clouds 1 and 2 have relatively better response times compared to the ones submitted at Aneka Clouds 3, 4, and 5.

Figure 14 and Figure 15 present the results for the total number of jobs processed in different Aneka Clouds by their P2PTaskExecution and P2PThreadExecution services. The results show that the P2PTaskExecution and P2PThreadExecution services hosted within Aneka Clouds 3, 4, and 5 process relatively more jobs than those hosted within Aneka Clouds 1 and 2. This happens due to the spatial constraint on the processing speed attribute value in the resource claim objects posted by the different P2PScheduling (Task/Thread) services across the Aneka Clouds. As Aneka Cloud 5 offers the fastest processing speed (within the spatial extent of all resource claim objects in the system), it processes more jobs than the other Aneka Clouds in the federation (see Figure 14 and Figure 15). Thus, in the proposed Aneka-Federation system, the spatial extent for resource attribute values specified by the P2PScheduling services directly controls the job distribution and application response times in the system.

Figure 15. P2PThreadExecution service: Time (secs) vs. number of jobs completed

Figure 16. Enterprise Cloud Id vs. job %

Figure 16 shows the aggregate percentage of task and thread jobs processed by the nodes of the different Aneka Clouds in the federation. As mentioned in our previous discussions, Aneka Clouds 3, 4, and 5 end up processing the larger share of jobs for both the Task and Thread application composition models. Together they process approximately 140% of the total 200% of jobs (100% task + 100% thread) in the federation.
RELATED WORK

Volunteer computing systems including SETI@home (Anderson, Cobb, Korpela, Lebofsky, & Werthimer, 2002) and BOINC (Anderson, 2004) are the first-generation implementations of public resource computing systems. These systems are engineered around the traditional master/worker model, wherein a centralized scheduler/coordinator is responsible for scheduling and dispatching tasks and collecting data from the participant nodes across the Internet. These systems do not provide any support for multiple applications and programming models, a capability which the Aneka-Federation platform inherits from Aneka. Unlike SETI@home and BOINC, Aneka-Federation creates a decentralized overlay of Aneka Enterprise Clouds. Further, Aneka-Federation allows submission, scheduling, and dispatching of applications from any Aneka-Coordinator service in the system, thus giving every enterprise Cloud site autonomy and flexibility with regard to decision making.

OurGrid (Andrade et al., 2003) is a peer-to-peer middleware infrastructure for creating an Internet-wide enterprise Grid computing platform. The message routing and communication between the OurGrid sites is done via a broadcast messaging primitive based on the JXTA (Gong, 2001) substrate. The ShareGrid
Project (2008) extends the OurGrid infrastructure with a fault-tolerant scheduling capability by replicating tasks across a set of available nodes. In contrast to OurGrid and ShareGrid, Aneka-Federation implements a coordinated scheduling protocol by embedding a d-dimensional index over a DHT overlay, which makes the system highly scalable and guarantees deterministic search behavior (unlike JXTA). Further, the OurGrid system supports only the parameter sweep application programming model, while the Aneka-Federation supports more general programming abstractions including Thread, Task, and Dataflow.

The peer-to-peer Condor flock system (Butt et al., 2003) aggregates Internet-wide distributed Condor work pools based on the Pastry overlay (Rowstron et al., 2001). The site managers in the Pastry overlay accomplish load management by announcing their available resources to all sites whose identifiers (IDs) appear in their routing tables. An optimized version of this protocol proposes recursively propagating the load information to the sites whose IDs are indexed by the contacted site's routing table. The scheduling coordination in the overlay is based on probing each site in the routing table for resource availability. The probe message propagates recursively in the network until a suitable node is located. In the worst case, the number of messages generated due to recursive propagation can amount to broadcast communication. In contrast, Aneka-Federation implements a more scalable, deterministic and flexible coordination protocol by embedding a logical d-dimensional index over a DHT overlay. The d-dimensional index gives the Aneka-Federation the ability to perform deterministic searches for Aneka services, which are defined based on complex node attributes (CPU type, speed, service type, utilization).

XtremWeb-CH (Abdennadher & Boesch, 2005) extends the XtremWeb project (Fedak, Germain, Neri, & Cappello, 2002) with functionalities such as peer-to-peer communication among the worker nodes. However, the core scheduling and management component in XtremWeb-CH, called the coordinator, is a centralized service with limited scalability. G2-P2P (Mason & Kelly, 2005) uses the Pastry framework to create a scalable cycle-stealing framework. The mapping of objects to nodes is done via the Pastry routing method. However, the G2-P2P system does not implement any specific scheduling or load-balancing algorithm that can take into account the current application load on the nodes and perform run-time load balancing accordingly. In contrast, the Aneka-Federation realizes a truly decentralized, cooperative and coordinated application scheduling service that can dynamically allocate applications to the Aneka services/nodes without over-provisioning them.
CONCLUSION AND FUTURE DIRECTIONS

The functionality exposed by the Aneka-Federation system is very powerful, and our experimental results on a real test-bed show that it is a viable technology for federating high-throughput Aneka Enterprise Cloud systems. One of our immediate goals is to support substantially larger Aneka-Federation setups than the ones used in the performance evaluations. We intend to provide support for composing more complex application models such as e-Research workflows that have both compute and data node requirements. The resulting Aneka-Federation infrastructure will enable a new generation of application composition environments in which application components, Enterprise Clouds, services, and data interact as peers. There are several important aspects of this system that require further implementation and future research efforts. One such aspect is developing fault-tolerant (self-healing) application scheduling algorithms that can ensure robust execution in the event of concurrent failures and rapid join/leave
operations of enterprise Clouds/Cloud nodes in the decentralized Aneka-Federation overlay. Another important design aspect that we would like to improve is ensuring a truly secure (self-protected) Aneka-Federation infrastructure based on peer-to-peer reputation and accountability models.
ACKNOWLEDGMENT

The authors would like to thank the Australian Research Council (ARC) and the Department of Innovation, Industry, Science, and Research (DIISR) for supporting this research through the Discovery Project and International Science Linkage grants, respectively. We would also like to thank Dr. Tejal Shah, Dr. Sungjin Choi, Dr. Christian Vecchiola, and Dr. Alexandre di Costanzo for proofreading the initial draft of this chapter. This chapter is partially derived from our previous publications (Ranjan, 2007).
REFERENCES Abdennadher, N., & Boesch, R. (2005). Towards a peer-to-peer platform for high performance computing. In HPCASIA’05 Proceedings of the Eighth International Conference in High-Performance Computing in Asia-Pacific Region, (pp. 354-361). Los Alamitos, CA: IEEE Computer Society. Retrieved from http:// doi.ieeecomputersociety.org/10.1109/HPCASIA.2005.98 Amazon Elastic Compute Cloud. (2008, November). Retrieved from http://www.amazon.com/ec2 Anderson, D. P. (2004). BOINC: A system for public-resource computing and storage. In Grid’04 Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing, (pp. 4-10). Los Alamitos, CA: IEEE Computer Society. Retrieved from http://dx.doi.org/10.1109/GRID.2004.14 Anderson, D. P., Cobb, J., Korpela, E., Lebofsky, M., & Werthimer, D. (2002). SETI@home: An experiment in public-resource computing. Communications of the ACM, 45(11), 56-61. New York: ACM Press. Retrieved from http://doi.acm.org/10.1145/581571.581573 Andrade, N., Cirne, W., Brasileiro, F., & Roisenberg, R. (2003, October). OurGrid: An approach to easily assemble grids with equitable resource sharing. In JSSPP’03 Proceedings of the 9th Workshop on Job Scheduling Strategies for Parallel Processing (LNCS). Berlin/Heidelberg, Germany: Springer. doi: 10.1007/10968987 Butt, A. R., Zhang, R., & Hu, Y. C. (2003). A self-organizing flock of condors. In SC ’03 Proceedings of the ACM/IEEE Conference on Supercomputing, (p. 42). Los Alamitos, CA: IEEE Computer Society. Retrieved from http://doi.ieeecomputersociety.org/10.1109/SC.2003.10031 Buyya, R., Yeo, C. S., & Venugopal, S. (2008, September). Market-oriented cloud computing: vision, hype, and reality for delivering it services as computing utilities. In HPCC’08 Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications. Los Alamitos, CA: IEEE CS Press.
Chu, X., Nadiminti, K., Jin, C., Venugopal, S., & Buyya, R. (2007, December). Aneka: Next-generation enterprise grid platform for e-science and e-business applications, e-Science’07: In Proceedings of the 3rd IEEE International Conference on e-Science and Grid Computing, Bangalore, India (pp. 151-159). Los Alamitos, CA: IEEE Computer Society Press. For more information, see http://doi.ieeecomputersociety.org/10.1109/E-SCIENCE.2007.12 Dabek, F., Zhao, B., Druschel, P., Kubiatowicz, J., & Stoica, I. (2003). Towards a common API for structured peer-to-peer overlays. In IPTPS03 Proceedings of the 2nd International Workshop on Peerto-Peer Systems, (pp. 33-44). Heidelberg, Germany: SpringerLink. doi: 10.1007/b11823 Fedak, G., Germain, C., Neri, V., & Cappello, F. (2002, May). XtremWeb: A generic global computing system. In CCGRID’01: Proceeding of the First IEEE Conference on Cluster and Grid Computing, workshop on Global Computing on Personal Devices, Brisbane, (pp. 582-587). Los Alamitos, CA: IEEE Computer Society. Retrieved from http://doi.ieeecomputersociety.org/10.1109/CCGRID.2001.923246 Fowler, M. (2008, November). Inversion of control containers and the dependency injection pattern. Retrieved from http://www.martinfowler.com/articles/injection.html FreePastry. (2008, November). Retrieved from http://freepastry.rice.edu/FreePastry Gong, L. (2001, June). JXTA: A network programming environment. IEEE Internet Computing, 5(3), 88-95. Los Alamitos, CA: IEEE Computer Society. Retrieved from http://doi.ieeecomputersociety. org/10.1109/4236.93518 Google App Engine. (2008, November). Retrieved from http://appengine.google.com Gupta, A., Sahin, O. D., Agarwal, D., & El Abbadi, A. (2004). Meghdoot: Content-based publish/ subscribe over peer-to-peer networks. In Middleware’04 Proceedings of the 5th ACM/IFIP/USENIX International Conference on Middleware, (pp. 254-273). Heidelberg, Germany: SpringerLink. doi: 10.1007/b101561. Luther, A., Buyya, R., Ranjan, R., & Venugopal, S. (2005, June). Alchemi: A. NET-based enterprise grid computing system, In ICOMP’05 Proceedings of the 6th International Conference on Internet Computing, Las Vegas, USA. Mandelbrot Set. (2008, November). Retrieved from http://mathworld.wolfram.com/MandelbrotSet. html. Mason, R., & Kelly, W. (2005). G2-p2p: A fully decentralized fault-tolerant cycle-stealing framework. In R. Buyya, P. Coddington, and A. Wendelborn, (Eds.), In AusGrid’05 Australasian Workshop on Grid Computing and e-Research, Newcastle, Australia, (Vol. 44 of CRPIT, pp. 33-39). Microsoft Live Mesh. (2008, November). Retrieved from http://www.mesh.com. Parashar, M., & Hariri, S. (Eds.). (2007). Autonomic computing: Concepts, infrastructures, and applications. Boca Raton, FL: CRC Press, Taylor and Francis Group. Persistence of Vision Raytracer. (2008, November). Retrieved from http://www.povray.org
Ranjan, R. (2007, July). Coordinated resource provisioning in federated grids. Doctoral thesis, The University of Melbourne, Australia. Ranjan, R., Harwood, A., & Buyya, R. (2008, July). Peer-to-peer resource discovery in global grids: A tutorial. IEEE Communication Surveys and Tutorials (COMST), 10(2), 6-33. New York: IEEE Communications Society Press. doi:doi:10.1109/COMST.2008.4564477 Ranjan, R., Harwood, A., & Buyya, R. (2008). Coordinated load management in peer-to-peer coupled federated grid systems. (Technical Report GRIDS-TR-2008-2). Grid Computing and Distributed Systems Laboratory, The University of Melbourne, Australia. doi: http://www.gridbus.org/reports/CoordinatedGrid2007.pdf Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Schenker, S. (2001). A scalable content-addressable network. In SIGCOMM’01 Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, (pp. 161-172). New York: ACM Press. Retrieved from http://doi.acm.org/10.1145/ 383059.383072 Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In Middleware’01 Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms, (pp. 329-350). Heidelberg, Germany: SpringerLink. doi: 10.1007/3-540-45518-3 Samet, H. (2008, November). The design and analysis of spatial data structures. New York: AddisonWesley Publishing Company. ShareGrid Project. (2008, November). Retrieved from http://dcs.di.unipmn.it/sharegrid. Spring.NET. (2008, November). Retrieved from http://www.springframework.net. Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A scalable peerto-peer lookup service for internet applications. In SIGCOMM’01 Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, (pp. 149 – 160). New York: ACM Press. Retrieved from http://doi.acm.org/10.1145/383059.383071 Tanin, E., Harwood, A., & Samet, H. (2007). Using a distributed quadtree index in peer-to-peer networks. [Heidelberg, Germany: SpringerLink.]. The VLDB Journal, 16(2), 165–178. doi:. doi:10.1007/ s00778-005-0001-y Vecchiola, C., & Chu, X. (2008). Aneka tutorial series on developing task model applications. (Technical Report). Grid Computing and Distributed Systems Laboratory, The University of Melbourne, Australia. Zhang, X., Freschl, J. L., & Schopf, J. M. (2003, June). A performance study of monitoring and information services for distributed systems. In HPDC’03: Proceedings of the Twelfth International Symposium on High Performance Distributed Computing, (pp. 270-281). Los Alamitos, CA: IEEE Computer Society Press. Zhao, B. Y., Kubiatowicz, J. D., & Joseph, A. D. (2001, April). Tapestry: An infrastructure for Faulttolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, UC Berkeley, USA.
KEY TERMS AND DEFINITIONS

Enterprise Cloud: An enterprise Cloud is a type of computing infrastructure that consists of a collection of inter-connected computing nodes, virtualized computers, and software services that are dynamically provisioned among the competing end-users' applications based on their availability, performance, capability, and Quality of Service (QoS) requirements.

Aneka-Federation: The Aneka-Federation integrates numerous small-scale Aneka Enterprise Cloud services and nodes that are distributed over multiple control and enterprise domains as part of a single coordinated resource leasing abstraction.

Overlay Networking: A logical inter-connection of services, nodes, devices, sensors, instruments, and data hosts at the application layer (under the TCP/IP model) over an infrastructure of physical network routing systems such as the Internet or a Local Area Network (LAN). In overlays, the routing and forwarding of messages between services is done on the basis of their relationship in the logical space, while the messages are actually transported through the physical links.

Decentralized Systems: A distributed Cloud system configuration is considered to be decentralized if none of the components in the system is more important than the others; if one of the components fails, the failure is neither more nor less harmful to the system than the failure of any other component in the system.

Distributed Hash Table (DHT): A DHT is a data structure that associates a unique index with a data item. Entries in a DHT are stored as (index, data) pairs. A data item can be looked up within a logarithmic bound on overlay routing hops and messages if the corresponding index is known. DHTs are self-managing in their behavior, as they can dynamically adapt to the leave, join and failure of nodes or services in the system. Recently, DHTs have been applied to build Internet-scale systems that involve hundreds of thousands of components (nodes, services, data, and files).

Resource Discovery: Resource discovery involves searching for the appropriate service, node, or data type that matches the requirements of applications such as file sharing, Grid applications, and Cloud applications. Resource discovery methods can be engineered based on various network models, including centralized, decentralized, and hierarchical, with varying degrees of scalability, fault tolerance, and network performance.

Multi-Dimensional Queries: Complex web services, Grid resource characteristics, and Cloud services are commonly represented by a number of attributes such as service type, hardware (processor type, speed), installed software (libraries, operating system), service name, and security (authentication and authorization control). Efficiently discovering such services with deterministic guarantees in a decentralized and scalable manner requires lookup queries to encapsulate search values for each attribute (search dimension). The search is resolved by satisfying the constraints on the values expressed in each dimension, hence resulting in multi-dimensional queries that search for values in a virtual space that has multiple dimensions (x, y, z, …).
ENDNOTE

1. 3rd generation enterprise Grids are exhibiting properties that are commonly envisaged in Cloud computing systems.
Section 3
Programming Models and Tools
Chapter 10
Reliability and Performance Models for Grid Computing

Yuan-Shun Dai, University of Electronics Science Technology of China, China & University of Tennessee, Knoxville, USA
Jack Dongarra, University of Tennessee, Knoxville, USA; Oak Ridge National Laboratory, USA; & University of Manchester, UK
ABSTRACT

Grid computing is a newly developed technology for complex systems with large-scale resource sharing, wide-area communication, and multi-institutional collaboration. It is hard to analyze and model Grid reliability because of its largeness, complexity and stiffness. Therefore, this chapter introduces the Grid computing technology, presents the different types of failures in grid systems, models the grid reliability with star and tree structures, and finally studies optimization problems for grid task partitioning and allocation. The chapter then presents models for the star topology considering data dependence and the tree structure considering failure correlation. Evaluation tools and algorithms are developed, evolved from the universal generating function and graph theory. Then, the failure correlation and data dependence are considered in the models. Numerical examples are illustrated to show the modeling and analysis.
INTRODUCTION

Grid computing (Foster & Kesselman, 2003) is a newly developed technology for complex systems with large-scale resource sharing, wide-area communication, and multi-institutional collaboration etc, see e.g. Kumar (2000), Das et al. (2001), Foster et al. (2001, 2002) and Berman et al. (2003). Many experts believe that the grid technologies will offer a second chance to fulfill the promises of the Internet. The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations (Foster et al., 2001). The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources. This is required by a range of collaborative problem-solving and resource-
brokering strategies emerging in industry, science, and engineering. This sharing is highly controlled by the resource management system (Livny & Raman, 1998), with resource providers and consumers defining what is shared, who is allowed to share, and the conditions under which the sharing occurs. Recently, the Open Grid Service Architecture (Foster et al., 2002) enables the integration of services and resources across distributed, heterogeneous, dynamic, virtual organizations. A grid service is desired to complete a set of programs under the circumstances of grid computing. The programs may require using remote resources that are distributed. However, the programs initially do not know the site information of those remote resources in such a large-scale computing environment, so the resource management system (the brain of the grid) plays an important role in managing the pool of shared resources, in matching the programs to their requested resources, and in controlling them to reach and use the resources through wide-area network. The structure and functions of the resource management system (RMS) in the grid have been introduced in details by Livny & Raman (1998), Cao et al. (2002), Krauter et al. (2002) and Nabrzyski et al. (2003). Briefly stated, the programs in a grid service send their requests for resources to the RMS. The RMS adds these requests into the request queue (Livny & Raman, 1998). Then, the requests are waiting in the queue for the matching service of the RMS for a period of time (called waiting time), see e.g. Abramson et al. (2002). In the matching service, the RMS matches the requests to the shared resources in the grid (Ding et al., 2002) and then builds the connection between the programs and their required resources. Thereafter, the programs can obtain access to the remote resources and exchange information with them through the channels. The grid security mechanism then operates to control the resource access through the Certification, Authorization and Authentication, which constitute various logical connections that causes dynamicity in the network topology. Although the developmental tools and infrastructures for the grid have been widely studied (Foster & Kesselman, 2003), grid reliability analysis and evaluation are not easy because of its complexity, largeness and stiffness. The gird computing contains different types of failures that can make a service unreliable, such as blocking failures, time-out failures, matching failures, network failures, program failures and resource failures. This chapter thoroughly analyzes these failures. Usually the grid performance measure is defined as the task execution time (service time). This index can be significantly improved by using the RMS that divides a task into a set of subtasks which can be executed in parallel by multiple online resources. Many complicated and time-consuming tasks that could not be implemented before are working well under the grid environment now. It is observed in many grid projects that the service time experienced by the users is a random variable. Finding the distribution of this variable is important for evaluating the grid performance and improving the RMS functioning. The service time is affected by many factors. First, various available resources usually have different task processing speeds online. Thus, the task execution time can vary depending on which resource is assigned to execute the task/subtasks. 
Second, some resources can fail when running the subtasks, so the execution time is also affected by the resource reliability. Similarly, the communication links in grid service can be disconnected during the data transmission. Thus, the communication reliability influences the service time as well as data transmission speed through the communication channels. Moreover, the service requested by a user may be delayed due to the queue of earlier requests submitted from others. Finally, the data dependence imposes constraints on the sequence of the subtasks’ execution, which has significant influence on the service time.
Figure 1. Grid computing system
This chapter first introduces the grid computing system and service, and analyzes various failures in grid system. Both reliability and performance are analyzed in accordance with the performability concept. Then the chapter presents models for star- and tree-topology grids respectively. The reliability and performance evaluation tools and algorithms are developed based on the universal generating function, graph theory, and Bayesian approach. Both failure correlation and data dependence are considered in the models.
GRID SERVICE RELIABILITY AND PERFORMANCE

Description of the Grid Computing

Today, Grid computing systems are large and complex, such as the IP-Grid (Indiana-Purdue Grid) that is a statewide grid (http://www.ip-grid.org/). IP-Grid is also a part of the TeraGrid that is a nationwide grid in the USA (http://www.teragrid.org/). The largeness and complexity of the grid challenge the existing models and tools to analyze, evaluate, predict and optimize the reliability and performance of grid systems. The global grid system is generally depicted in Figure 1. Various organizations (Foster et al., 2001) integrate/share their resources on the global grid. Any program running on the grid can use those resources if it can be successfully connected to them and is authorized to access them. The sites that contain the resources or run the programs are linked by the global network as shown in the left part of Figure 1.
The distribution of the service tasks/subtasks among the remote resources is controlled by the Resource Management System (RMS), which is the "brain" of grid computing, see e.g. Livny & Raman (1998). The RMS has five layers in general, as shown in Figure 1: program layer, request layer, management layer, network layer and resource layer.

1. Program layer: The program layer represents the programs of the customer's applications. The programs describe their required resources and constraint requirements (such as deadline, budget, function etc.). These resource descriptions are translated into resource requests and sent to the next request layer.
2. Request layer: The request layer provides the abstraction of "program requirements" as a queue of resource requests. The primary goals of this layer are to maintain this queue in a persistent and fault-tolerant manner and to interact with the next management layer by injecting resource requests for matching and claiming the matched resources of the requests.
3. Management layer: The management layer may be thought of as the global resource allocation layer. It has the function of automatically detecting new resources, monitoring the resource pool, removing failed/unavailable resources, and, most importantly, matching the resource requests of a service to the registered/detected resources. If resource requests are matched with the registered resources in the grid, this layer sends the matched tags to the next network layer.
4. Network layer: The network layer dynamically builds connections between the programs and resources when receiving the matched tags and controls them to exchange information through communication channels in a secure way.
5. Resource layer: The resource layer represents the shared resources from different resource providers, including the usage policies (such as service charge, reliability, serving time etc.).
Failure Analysis of Grid Service Even though all online nodes or resources are linked through the Internet with one another, not all resources or communication channels are actually used for a specific service. Therefore, according to this observation, we can make tractable models and analyses of grid computing via a virtual structure for a certain service. The grid service is defined as follows: Grid service is a service offered under the grid computing environment, which can be requested by different users through the RMS, which includes a set of subtasks that are allocated to specific resources via the RMS for execution, and which returns the result to the user after the RMS integrates the outputs from different subtasks. The above five layers coordinate together to achieve a grid service. At the “Program layer”, the subtasks (programs) composing the entire grid service task initially send their requests for remote resources to the RMS. The “Request layer” adds these requests in the request queue. Then, the “Management layer” tries to find the sites of the resources that match the requests. After all the requests of those programs in the grid service are matched, the “Network layer” builds the connections among those programs and the matched resources. It is possible to identify various types of failures on respective layers:
• Program layer: Software failures can occur during the subtask (program) execution; see e.g. Xie (1991) and Pham (2000).
• Request layer: When the programs' requests reach the request layer, two types of failures may occur: "blocking failure" and "time-out failure". Usually, the request queue has a limitation on the maximal number of waiting requests (Livny & Raman, 1998). If the queue is full when a new request arrives, a request blocking failure occurs. The grid service usually has its due time set by customers or service monitors. If the waiting time for the requests in the queue exceeds the due time, a time-out failure occurs, see e.g. Abramson et al. (2002).
• Management layer: At this layer, a "matching failure" may occur if the requests fail to match with the correct resources, see e.g. Xie et al. (2004, pp. 185-186). Errors such as incorrectly translating the requests, registering a wrong resource, ignoring resource disconnection, or misunderstanding the users' requirements can cause these matching failures.
• Network layer: When the subtasks (programs) are executed on remote resources, the communication channels may be disconnected either physically or logically, which causes a "network failure", especially for long transmissions of large datasets, see e.g. Dai et al. (2002).
• Resource layer: The resources shared on the grid can be of software, hardware or firmware type. The corresponding software, hardware or combined faults can cause resource unavailability.
Grid Service Reliability and Performance Most previous research on distributed computing studied performance and reliability separately. However, performance and reliability are closely related and affect each other, in particular under the grid computing environment. For example, while a task is fully parallelized into m subtasks executed by m resources, the performance is high but the reliability might be low because the failure of any resource prevents the entire task from completion. This causes the RMS to restart the task, which reversely increases its execution time (i.e. reduces performance). Therefore, it is worth to assign some subtasks to several resources to provide execution redundancy. However, excessive redundancy, even though improving the reliability, can decrease the performance by not fully parallelizing the task. Thus, the performance and reliability affect each other and should be considered together in the grid service modeling and analysis. In order to study performance and reliability interactions, one also has to take into account the effect of service performance (execution time) upon the reliability of the grid elements. The conventional models, e.g. Kumar et al. (1986), Chen & Huang (1992), Chen et al. (1997), and Lin et al., (2001), are based on the assumption that the operational probabilities of nodes or links are constant, which ignores the links’ bandwidth, communication time and resource processing time. Such models are not suitable for precisely modeling the grid service performance and reliability. Another important issue that has much influence the performance and reliability is data dependence, that exists when some subtasks use the results from some other subtasks. The service performance and reliability is affected by data dependence because the subtasks cannot be executed totally in parallel. For instance, the resources that are idle in waiting for the input to run the assigned subtasks are usually hotstandby because cold-start is time consuming. As a result, these resources can fail in waiting mode. The considerations presented above lead the following assumptions that lay in the base of grid service reliability and performance model.
Assumptions:

1. The service request reaches the RMS and is served immediately. The RMS divides the entire service task into a set of subtasks. Data dependence may exist among the subtasks; the order is determined by precedence constraints and is controlled by the RMS.
2. Different grid resources are registered or automatically detected by the RMS. In a grid service, the structure of the virtual network (consisting of the RMS and resources involved in performing the service) can form a star topology with the RMS in the center or a tree topology with the RMS in the root node.
3. The resources are specialized. Each resource can process one or multiple subtask(s) when it is available.
4. Each resource has a given constant processing speed when it is available and has a given constant failure rate. Each communication channel has a constant failure rate and a constant bandwidth (data transmission speed).
5. The failure rates of the communication channels or resources are the same when they are idle or loaded (hot standby model). The failures of different resources and communication links are independent.
6. If the failure of a resource or a communication channel occurs before the end of output data transmission from the resource to the RMS, the subtask fails.
7. Different resources start performing their tasks immediately after they get the input data from the RMS through communication channels. If the same subtask is processed by several resources (providing execution redundancy), it is completed when the first result is returned to the RMS. The entire task is completed when all of the subtasks are completed and their results are returned to the RMS from the resources.
8. The data transmission speed in any multi-channel link does not depend on the number of different packages (corresponding to different subtasks) sent in parallel. The data transmission time of each package depends on the amount of data in the package. If the data package is transmitted through several communication links, the link with the lowest bandwidth limits the data transmission speed.
9. The RMS is fully reliable, which can be justified by considering a relatively short interval of running a specific service. An imperfect RMS can also easily be included as a module connected in series to the whole grid service system.
Grid Service Time Distribution and Reliability/Performance Measures

The data dependence on task execution can be represented by an m×m matrix H such that hki = 1 if subtask i needs for its execution the output data of subtask k, and hki = 0 otherwise (the subtasks can always be numbered such that k < i).
The task execution time is defined as the time from the beginning of input data transmission from the RMS to a resource to the end of output data transmission from the resource to the RMS. The amount of data that should be transmitted between the RMS and resource j that executes subtask i is denoted by ai. If data transmission between the RMS and the resource j is accomplished through links belonging to a set γj, the data transmission speed is

s_j = \min_{L_x \in \gamma_j} (b_x)    (1)

where bx is the bandwidth of the link Lx. Therefore, the random time tij of subtask i execution by resource j can take two possible values:

t_{ij} = \hat{t}_{ij} = \tau_j + \frac{a_i}{s_j}    (2)

if the resource j and the communication path γj do not fail until the subtask completion, and tij = ∞ otherwise. Here, τj is the processing time of the j-th resource. Subtask i can be successfully completed by resource j if this resource and the communication path γj do not fail before the end of the subtask execution. Given constant failure rates of resource j and its links, one can obtain the conditional probability of subtask success as

p_j(\hat{t}_{ij}) = e^{-(\lambda_j + \pi_j)\hat{t}_{ij}}    (3)

where πj is the failure rate of the communication path between the RMS and the resource j, which can be calculated as \pi_j = \sum_{L_x \in \gamma_j} \lambda_x, and λx is the failure rate of the link Lx. The exponential distribution (3) is common in software or hardware components' reliability, which has been justified in both theory and practice, see e.g. Xie et al. (2004). These give the conditional distribution of the random subtask execution time tij: Pr(t_{ij} = \hat{t}_{ij}) = p_j(\hat{t}_{ij}) and Pr(t_{ij} = \infty) = 1 - p_j(\hat{t}_{ij}).

Assume that each subtask i is assigned by the RMS to the resources composing set ωi. The RMS can initiate execution of subtask i (send the data to all the resources from ωi) only after the completion of every subtask k ∈ φi, where φi denotes the set of subtasks whose output data subtask i needs (i.e. hki = 1). Therefore the random time of the start of subtask i execution, Ti, can be determined as

T_i = \max_{k \in \varphi_i} (\tilde{T}_k)    (4)

where \tilde{T}_k is the random completion time of subtask k. If φi = ∅, i.e. subtask i does not need data produced by any other subtask, the subtask execution starts without delay: Ti = 0. If φi ≠ ∅, Ti can have different realizations T̂il (1 ≤ l ≤ Ni). Having the time Ti when the execution of subtask i starts and the time tij of
subtask i executed by resource j, one obtains the completion time for subtask i on resource j as

\tilde{t}_{ij} = T_i + t_{ij}    (5)

In order to obtain the distribution of the random time \tilde{t}_{ij}, one has to take into account that the probability of any realization \tilde{t}_{ij} = T̂il + t̂ij is equal to the product of the probabilities of three events: execution of subtask i starts at time T̂il: qil = Pr(Ti = T̂il); resource j does not fail before the start of execution of subtask i: pj(T̂il); and resource j does not fail during the execution of subtask i: pj(t̂ij). Therefore, the conditional distribution of the random time \tilde{t}_{ij}, given that execution of subtask i starts at time T̂il (Ti = T̂il), takes the form

\Pr(\tilde{t}_{ij} = \hat{T}_{il} + \hat{t}_{ij}) = p_j(\hat{T}_{il}) \, p_j(\hat{t}_{ij}) = p_j(\hat{T}_{il} + \hat{t}_{ij}) = e^{-(\lambda_j + \pi_j)(\hat{T}_{il} + \hat{t}_{ij})},
\Pr(\tilde{t}_{ij} = \infty) = 1 - p_j(\hat{T}_{il} + \hat{t}_{ij}) = 1 - e^{-(\lambda_j + \pi_j)(\hat{T}_{il} + \hat{t}_{ij})}.    (6)

The random time of subtask i completion, \tilde{T}_i, is equal to the shortest time in which one of the resources from ωi completes the subtask execution:

\tilde{T}_i = \min_{j \in \omega_i} (\tilde{t}_{ij})    (7)

According to the definition of the last subtask m, the time of its beginning corresponds to the service completion time, because the time the task spends in the RMS is neglected. Thus, the random service time Θ is equal to Tm. Having the distribution (pmf) of the random value Θ ≡ Tm in the form qml = Pr(Tm = T̂ml) for 1 ≤ l ≤ Nm, one can evaluate the reliability and performance indices of the grid service.

In order to estimate both the service reliability and its performance, different measures can be used depending on the application. In applications where the execution time of each task (service time) is of critical importance, the system reliability R(Θ*) is defined (according to the performability concept in Meyer (1980), Grassi et al. (1988) and Tai et al. (1993)) as the probability that the correct output is produced in time less than Θ*. This index can be obtained as

R(\Theta^*) = \sum_{l=1}^{N_m} q_{ml} \cdot 1(\hat{T}_{ml} < \Theta^*)    (8)
When no limitations are imposed on the service time, the service reliability is defined as the probability that it produces correct outputs without respect to the service time, which can be referred to as R(∞). The conditional expected service time W is considered to be a measure of its performance, which determines the expected service time given that the service does not fail, i.e.

W = \sum_{l=1}^{N_m} \hat{T}_{ml} q_{ml} / R(\infty)    (9)

Figure 2. Grid system with star architecture
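To make Eqs. (8) and (9) concrete, the short Python sketch below evaluates both indices from a completion-time pmf represented as a dictionary; the pmf values are made-up illustrative numbers, not results from the chapter.

```python
import math

# Hypothetical pmf of the service completion time Theta = T_m:
# keys are realizations T_ml (math.inf marks failure), values are probabilities q_ml.
pmf = {100.0: 0.779, 180.0: 0.214, math.inf: 0.007}

def reliability(pmf, theta_star):
    """R(Theta*): probability that the correct output is produced before Theta* (Eq. 8)."""
    return sum(q for t, q in pmf.items() if t < theta_star)

def expected_service_time(pmf):
    """W: expected service time conditioned on the service not failing (Eq. 9)."""
    r_inf = sum(q for t, q in pmf.items() if t < math.inf)
    return sum(t * q for t, q in pmf.items() if t < math.inf) / r_inf

print(reliability(pmf, theta_star=150))   # 0.779
print(expected_service_time(pmf))         # ~117.2
```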
STAR TOPOLOGY GRID ARCHITECTURE

A grid service is desired to execute a certain task under the control of the RMS. When the RMS receives a service request from a user, the task can be divided into a set of subtasks that are executed in parallel. The RMS assigns those subtasks to available resources for execution. After the resources complete the assigned subtasks, they return the results back to the RMS and then the RMS integrates the received results into the entire task output which is requested by the user. The above grid service process can be approximated by a structure with star topology, as depicted by Figure 2, where the RMS is directly connected with any resource through respective communication channels. The star topology is feasible when the resources are totally separated so that their communication channels are independent. Under this assumption the grid service reliability and performance can be derived by using the universal generating function technique.
Universal Generating Function

The universal generating function (u-function) technique was introduced in (Ushakov, 1987) and proved to be very effective for the reliability evaluation of different types of multi-state systems. The u-function representing the pmf of a discrete random variable Y is defined as a polynomial

u(z) = \sum_{k=1}^{K} \alpha_k z^{y_k}    (10)

where the variable Y has K possible values and αk is the probability that Y is equal to yk.

To obtain the u-function representing the pmf of a function of two independent random variables φ(Yi, Yj), composition operators are introduced. These operators determine the u-function for φ(Yi, Yj) using simple algebraic operations on the individual u-functions of the variables. All of the composition operators take the form

U(z) = u_i(z) \otimes_{\varphi} u_j(z) = \sum_{k=1}^{K_i} \alpha_{ik} z^{y_{ik}} \otimes_{\varphi} \sum_{h=1}^{K_j} \alpha_{jh} z^{y_{jh}} = \sum_{k=1}^{K_i} \sum_{h=1}^{K_j} \alpha_{ik} \alpha_{jh} z^{\varphi(y_{ik}, y_{jh})}    (11)
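As an illustration of how the composition operator in Eq. (11) can be evaluated numerically, the following Python sketch (a minimal representation of my own, not code from the chapter) stores a u-function as a dictionary mapping realizations to probabilities and composes two u-functions under an arbitrary structure function φ, such as min or max.

```python
from collections import defaultdict

def compose(u1, u2, phi):
    """Composition operator of Eq. (11): combine two u-functions (dicts of
    realization -> probability) under the structure function phi."""
    result = defaultdict(float)
    for y1, p1 in u1.items():
        for y2, p2 in u2.items():
            result[phi(y1, y2)] += p1 * p2
    return dict(result)

# Two independent random variables represented by their u-functions
# (hypothetical probabilities chosen only for illustration).
u_a = {100: 0.9, float("inf"): 0.1}
u_b = {180: 0.8, float("inf"): 0.2}

print(compose(u_a, u_b, min))  # pmf of min(Y_a, Y_b)
print(compose(u_a, u_b, max))  # pmf of max(Y_a, Y_b)
```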
The u-function U(z) represents all of the possible mutually exclusive combinations of realizations of the variables by relating the probabilities of each combination to the value of the function φ(Yi, Yj) for this combination. In the case of the grid system, the u-function uij(z) can define the pmf of the execution time for subtask i assigned to resource j. This u-function takes the form

u_{ij}(z) = p_j(\hat{t}_{ij}) z^{\hat{t}_{ij}} + (1 - p_j(\hat{t}_{ij})) z^{\infty}    (12)

where t̂ij and pj(t̂ij) are determined according to Eqs. (2) and (3) respectively. The pmf of the random start time Ti for subtask i can be represented by the u-function Ui(z) taking the form

U_i(z) = \sum_{l=1}^{N_i} q_{il} z^{\hat{T}_{il}}    (13)

where qil = Pr(Ti = T̂il). For any realization T̂il of Ti, the conditional distribution of the completion time \tilde{t}_{ij} for subtask i executed by resource j, given Ti = T̂il, according to (6) can be represented by the u-function

u_{ij}(z, \hat{T}_{il}) = p_j(\hat{T}_{il} + \hat{t}_{ij}) z^{\hat{T}_{il} + \hat{t}_{ij}} + (1 - p_j(\hat{T}_{il} + \hat{t}_{ij})) z^{\infty}    (14)
Table 1. Parameters of grid system for analytical example

| No of subtask i | No of resource j | λj + πj (sec-1) | t̂ij (sec) | pj(t̂ij) |
|-----------------|------------------|-----------------|-----------|---------|
| 1               | 1                | 0.0025          | 100       | 0.779   |
| 1               | 2                | 0.00018         | 180       | 0.968   |
| 2               | 3                | 0.0003          | 250       | -       |
| 2               | 4                | 0.0008          | 300       | -       |
| 3               | 5                | 0.0005          | 300       | 0.861   |
| 3               | 6                | 0.0002          | 430       | 0.918   |
The total completion time of subtask i assigned to a pair of resources j and d is equal to the minimum of the completion times for these resources according to Eq. (7). To obtain the u-function representing the pmf of this time, given Ti = T̂il, the composition operator with φ(Yj, Yd) = min(Yj, Yd) should be used:

u_i(z, \hat{T}_{il}) = u_{ij}(z, \hat{T}_{il}) \otimes_{\min} u_{id}(z, \hat{T}_{il})
= [p_j(\hat{T}_{il} + \hat{t}_{ij}) z^{\hat{T}_{il} + \hat{t}_{ij}} + (1 - p_j(\hat{T}_{il} + \hat{t}_{ij})) z^{\infty}] \otimes_{\min} [p_d(\hat{T}_{il} + \hat{t}_{id}) z^{\hat{T}_{il} + \hat{t}_{id}} + (1 - p_d(\hat{T}_{il} + \hat{t}_{id})) z^{\infty}]
= p_j(\hat{T}_{il} + \hat{t}_{ij}) p_d(\hat{T}_{il} + \hat{t}_{id}) z^{\hat{T}_{il} + \min(\hat{t}_{ij}, \hat{t}_{id})} + p_d(\hat{T}_{il} + \hat{t}_{id})(1 - p_j(\hat{T}_{il} + \hat{t}_{ij})) z^{\hat{T}_{il} + \hat{t}_{id}} + p_j(\hat{T}_{il} + \hat{t}_{ij})(1 - p_d(\hat{T}_{il} + \hat{t}_{id})) z^{\hat{T}_{il} + \hat{t}_{ij}} + (1 - p_j(\hat{T}_{il} + \hat{t}_{ij}))(1 - p_d(\hat{T}_{il} + \hat{t}_{id})) z^{\infty}.    (15)
The u-function u_i(z, \hat{T}_{il}) representing the conditional pmf of the completion time \tilde{T}_i for subtask i assigned to all of the resources from the set ωi = {j1, …, ji} can be obtained as

u_i(z, \hat{T}_{il}) = u_{ij_1}(z, \hat{T}_{il}) \otimes_{\min} u_{ij_2}(z, \hat{T}_{il}) \otimes_{\min} \cdots \otimes_{\min} u_{ij_i}(z, \hat{T}_{il})    (16)

u_i(z, \hat{T}_{il}) can be obtained recursively:

u_i(z, \hat{T}_{il}) = u_{ij_1}(z, \hat{T}_{il}),
u_i(z, \hat{T}_{il}) = u_i(z, \hat{T}_{il}) \otimes_{\min} u_{ie}(z, \hat{T}_{il}) for e = j2, …, ji.    (17)

Having the probabilities of the mutually exclusive realizations of the start time Ti, qil = Pr(Ti = T̂il), and the u-functions u_i(z, \hat{T}_{il}) representing the corresponding conditional distributions of the subtask i completion time, we can now obtain the u-function representing the unconditional pmf of the completion time \tilde{T}_i as

\tilde{U}_i(z) = \sum_{l=1}^{N_i} q_{il} u_i(z, \hat{T}_{il})    (18)
Figure 3. Subtask execution precedence constraints for analytical example
Having the u-functions \tilde{U}_k(z) representing the pmf of the completion time \tilde{T}_k for every subtask k ∈ φi = {k1, …, ki}, one can obtain the u-function Ui(z) representing the pmf of the subtask i start time Ti according to (4) as

U_i(z) = \tilde{U}_{k_1}(z) \otimes_{\max} \tilde{U}_{k_2}(z) \otimes_{\max} \cdots \otimes_{\max} \tilde{U}_{k_i}(z) = \sum_{l=1}^{N_i} q_{il} z^{\hat{T}_{il}}    (19)

Ui(z) can be obtained recursively:

U_i(z) = z^0,
U_i(z) = U_i(z) \otimes_{\max} \tilde{U}_e(z) for e = k1, …, ki.    (20)

It can be seen that if φi = ∅ then Ui(z) = z^0. The final u-function Um(z) represents the pmf of the random task completion time Tm in the form

U_m(z) = \sum_{l=1}^{N_m} q_{ml} z^{\hat{T}_{ml}}    (21)
Using the operators defined above, one can obtain the service reliability and performance indices by implementing the following algorithm:

1. Determine t̂_ij for each subtask i and resource j ∈ ω_i using Eq. (2).
2. Define for each subtask i (1 ≤ i ≤ m): Ũ_i(z) = U_i(z) = z^0. For all i: if φ_i = ∅, or if Ũ_k(z) ≠ z^0 for all k ∈ φ_i (i.e., the u-functions representing the completion times of all of the predecessors of subtask i have been obtained):
   2.1. Obtain U_i(z) = Σ_{l=1..N_i} q_il z^{T̂_il} using the recursive procedure (20);
   2.2. For l = 1, …, N_i:
        2.2.1. For each j ∈ ω_i obtain u_ij(z, T̂_il) using Eq. (14);
        2.2.2. Obtain u_i(z, T̂_il) using the recursive procedure (17);
   2.3. Obtain Ũ_i(z) using Eq. (18).
3. If U_m(z) = z^0, return to step 2.
4. Obtain the reliability and performance indices R(Θ*) and W using Eqs. (8) and (9).

A compact programmatic sketch of these steps is given below.
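The following Python sketch of steps 2 to 4 reuses the `compose` helper from the earlier sketch. It assumes, as in the worked example that follows, that the success probability has the exponential form p_j(t) = exp(−rate·t); the function names are illustrative only, not part of the original algorithm.

```python
from math import exp, inf

def u_exec(rate, t_hat, start):
    """Conditional completion-time u-function (Eq. 14) for one resource,
    assuming p_j(t) = exp(-rate * t) as in the worked example."""
    if start == inf:
        return {inf: 1.0}
    p = exp(-rate * (start + t_hat))
    return {start + t_hat: p, inf: 1.0 - p}

def subtask_completion(start_u, resources):
    """Unconditional completion-time u-function of one subtask (Eqs. 14-18).
    start_u  : u-function of the subtask start time, e.g. {0: 1.0}
    resources: list of (rate, t_hat) pairs for the resources in omega_i."""
    total = {}
    for start, q in start_u.items():          # mutually exclusive start times
        cond = None
        for rate, t_hat in resources:         # recursive min-composition (17)
            u = u_exec(rate, t_hat, start)
            cond = u if cond is None else compose(cond, u, min)
        for t, p in cond.items():             # mix with the weights q_il, Eq. (18)
            total[t] = total.get(t, 0.0) + q * p
    return total

def start_time(pred_completions):
    """Start-time u-function from predecessor completion times (Eqs. 19-20)."""
    u = {0: 1.0}
    for uc in pred_completions:
        u = compose(u, uc, max)
    return u
```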
Illustrative Example

This example presents the analytical derivation of the indices R(Θ*) and W for a simple grid service that uses six resources. Assume that the RMS divides the service task into three subtasks. The first subtask is assigned to resources 1 and 2, the second subtask is assigned to resources 3 and 4, and the third subtask is assigned to resources 5 and 6: ω1 = {1,2}, ω2 = {3,4}, ω3 = {5,6}. The failure rates of the resources and communication channels and the subtask execution times are presented in Table 1. Subtasks 1 and 3 get the input data directly from the RMS, subtask 2 needs the output of subtask 1, and the service task is completed when the RMS gets the outputs of both subtasks 2 and 3: φ1 = φ3 = ∅, φ2 = {1}, φ4 = {2, 3}. These subtask precedence constraints can be represented by the directed graph in Figure 3. Since φ1 = φ3 = ∅, the only realization of the start times T1 and T3 is 0 and therefore U1(z) = U3(z) = z^0. According to step 2 of the algorithm we can obtain the u-functions representing the pmf of the completion times t11, t12, t35 and t36. In order to determine the subtask execution time distributions for the individual resources, define the u-functions uij(z) according to Table 1 and Eq. (12): u11(z, 0) = exp(−0.0025 × 100) z^100 + [1 − exp(−0.0025 × 100)] z^∞ = 0.779z^100 + 0.221z^∞. In a similar way we obtain u12(z, 0) = 0.968z^180 + 0.032z^∞; u35(z, 0) = 0.861z^300 + 0.139z^∞; u36(z, 0) = 0.918z^430 + 0.082z^∞.
Figure 4. A virtual tree structure of a grid service
The u-function representing the pmf of the completion time for subtask 1 executed by both resources 1 and 2 is

Ũ1(z) = u1(z, 0) = u11(z, 0) ⊗_min u12(z, 0) = (0.779z^100 + 0.221z^∞) ⊗_min (0.968z^180 + 0.032z^∞) = 0.779z^100 + 0.214z^180 + 0.007z^∞.

The u-function representing the pmf of the completion time for subtask 3 executed by both resources 5 and 6 is

Ũ3(z) = u3(z, 0) = u35(z, 0) ⊗_min u36(z, 0) = (0.861z^300 + 0.139z^∞) ⊗_min (0.918z^430 + 0.082z^∞) = 0.861z^300 + 0.128z^430 + 0.011z^∞.

Execution of subtask 2 begins immediately after the completion of subtask 1. Therefore, U2(z) = Ũ1(z) = 0.779z^100 + 0.214z^180 + 0.007z^∞ (T2 has three realizations: 100, 180 and ∞).
The u-functions representing the conditional pmf of the completion times for subtask 2 executed by the individual resources are obtained as follows:

u23(z, 100) = e^(−0.0003×(100+250)) z^(100+250) + [1 − e^(−0.0003×(100+250))] z^∞ = 0.9z^350 + 0.1z^∞;
u23(z, 180) = e^(−0.0003×(180+250)) z^(180+250) + [1 − e^(−0.0003×(180+250))] z^∞ = 0.879z^430 + 0.121z^∞;
u23(z, ∞) = z^∞;
u24(z, 100) = e^(−0.0008×(100+300)) z^(100+300) + [1 − e^(−0.0008×(100+300))] z^∞ = 0.726z^400 + 0.274z^∞;
u24(z, 180) = e^(−0.0008×(180+300)) z^(180+300) + [1 − e^(−0.0008×(180+300))] z^∞ = 0.681z^480 + 0.319z^∞;
u24(z, ∞) = z^∞.

The u-functions representing the conditional pmf of the subtask 2 completion time are:

u2(z, 100) = u23(z, 100) ⊗_min u24(z, 100) = (0.9z^350 + 0.1z^∞) ⊗_min (0.726z^400 + 0.274z^∞) = 0.9z^350 + 0.073z^400 + 0.027z^∞;
u2(z, 180) = u23(z, 180) ⊗_min u24(z, 180) = (0.879z^430 + 0.121z^∞) ⊗_min (0.681z^480 + 0.319z^∞) = 0.879z^430 + 0.082z^480 + 0.039z^∞;
u2(z, ∞) = u23(z, ∞) ⊗_min u24(z, ∞) = z^∞.

According to Eq. (18), the unconditional pmf of the subtask 2 completion time is represented by the following u-function:

Ũ2(z) = 0.779 u2(z, 100) + 0.214 u2(z, 180) + 0.007 z^∞
      = 0.779(0.9z^350 + 0.073z^400 + 0.027z^∞) + 0.214(0.879z^430 + 0.082z^480 + 0.039z^∞) + 0.007z^∞
      = 0.701z^350 + 0.056z^400 + 0.188z^430 + 0.018z^480 + 0.037z^∞.

The service task is completed when subtasks 2 and 3 return their outputs to the RMS (which corresponds to the beginning of subtask 4). Therefore, the u-function representing the pmf of the entire service time is obtained as

U4(z) = Ũ2(z) ⊗_max Ũ3(z)
      = (0.701z^350 + 0.056z^400 + 0.188z^430 + 0.018z^480 + 0.037z^∞) ⊗_max (0.861z^300 + 0.128z^430 + 0.011z^∞)
      = 0.603z^350 + 0.049z^400 + 0.283z^430 + 0.017z^480 + 0.048z^∞.

The pmf of the service time is: Pr(T4 = 350) = 0.603; Pr(T4 = 400) = 0.049; Pr(T4 = 430) = 0.283; Pr(T4 = 480) = 0.017; Pr(T4 = ∞) = 0.048. From the obtained pmf we can calculate the service reliability using Eq. (8): R(Θ*) = 0.603 for 350 < Θ* ≤ 400; R(Θ*) = 0.652 for 400 < Θ* ≤ 430; R(Θ*) = 0.935 for 430 < Θ* ≤ 480; R(∞) = 0.952, and the conditional expected service time according to Eq. (9): W = (0.603×350 + 0.049×400 + 0.283×430 + 0.017×480) / 0.952 = 378.69 sec.
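Under the same assumptions, the helpers sketched earlier (compose, subtask_completion) reproduce the numbers of this example; this is a usage illustration, not part of the original derivation.

```python
from math import inf

U1 = subtask_completion({0: 1.0}, [(0.0025, 100), (0.00018, 180)])   # subtask 1
U3 = subtask_completion({0: 1.0}, [(0.0005, 300), (0.0002, 430)])    # subtask 3
U2 = subtask_completion(U1, [(0.0003, 250), (0.0008, 300)])          # subtask 2 starts after 1
U4 = compose(U2, U3, max)   # service time: RMS needs the outputs of subtasks 2 and 3
# U4 is approximately {350: 0.603, 400: 0.049, 430: 0.283, 480: 0.017, inf: 0.048}
R_430 = sum(p for t, p in U4.items() if t <= 430)                    # ~0.935
W = sum(t * p for t, p in U4.items() if t != inf) / (1.0 - U4[inf])  # ~378.7 seconds
```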
TREE TOPOLOGY GRID ARCHITECTURE

In the star grid, the RMS is connected with each resource by one direct communication channel (link). However, such an approximation is not accurate enough even though it simplifies the analysis and computation. For example, several resources located in the same local area network (LAN) can use the same gateway to communicate outside the network. Therefore, all these resources are not connected with the RMS through independent links. The resources are connected to the gateway, which communicates with the RMS through one common communication channel. Another example is a server that contains several resources (has several processors that can run different applications simultaneously, or contains different databases). Such a server communicates with the RMS through the same links. These situations cannot be modeled using only the star topology grid architecture. In this section, we present a more reasonable virtual structure which has a tree topology. The root of the virtual tree structure is the RMS, and the leaves are resources, while the branches of the tree represent the communication channels linking the leaves and the root. Some channels are commonly used by multiple resources. An example of the tree topology is given in Figure 4, in which four resources (R1, R2, R3, R4) are available for a service. The tree structure models the common cause failures in shared communication channels. For example, in Figure 4, a failure in channel L6 makes resources R1, R2, and R3 unavailable. This type of common cause failure was ignored by the conventional parallel computing models and by the above star-topology models. For small-area communication, such as a LAN or a cluster, an assumption that ignores common cause failures on communications is acceptable because the communication time is negligible compared to the processing time. However, for wide-area communication, such as the grid system, failures on communication channels are more likely, and the communication time cannot be neglected. In many cases, the communication time may dominate the processing time due to the large amount of data transmitted. Therefore, the virtual tree structure is an adequate model representing the functioning of grid services.
Table 2. Parameters of the MTSTs' paths

Elements, subtasks             | R1, J1 | R2, J2 | R3, J2 | R4, J1
Data transmission speed (Kbps) |   5    |   6    |   4    |  10
Data transmission time (s)     |  30    |  15    |  22.5  |  15
Processing time (s)            |  48    |  25    |  35.5  |  38
Time to subtask completion (s) |  78    |  40    |  58    |  53
Algorithms for Determining the pmf of the Task Execution Time

With the tree structure, the simple u-function technique is not applicable because it does not consider the failure correlations; thus, new algorithms are required. This section presents a novel algorithm to evaluate the performance and reliability of the tree-structured grid service based on graph theory and a Bayesian approach.
Minimal Task Spanning Tree (MTST)

The set of all nodes and links involved in performing a given task forms a task spanning tree. This task spanning tree can be considered to be a combination of minimal task spanning trees (MTST), where each MTST represents a minimal possible combination of available elements (resources and links) that guarantees the successful completion of the entire task. The failure of any element in an MTST leads to the failure of the entire task. For solving the graph traversal problem, several classical algorithms have been suggested, such as depth-first search, breadth-first search, etc. These algorithms can find all MTST in an arbitrary graph (Dai et al., 2002). However, MTST in graphs with a tree topology can be found in a much simpler way because each resource has a single path to the RMS, and the tree structure is acyclic. After the subtasks have been assigned to the corresponding resources, it is easy to find all combinations of resources such that each combination contains exactly m resources executing the m different subtasks that compose the entire task. Each combination determines exactly one MTST consisting of the links that belong to the paths from the m resources to the RMS. The total number of MTST is equal to the total number of such combinations N, where

N = \prod_{j=1}^{m} |\omega_j|   (22)
(see Example 4.2.1). Along with the procedure of searching for all the MTST, one has to determine the corresponding running time and communication time for all the resources and links. For any subtask j, and any resource k assigned to execute this subtask, one has the amount of input and output data, the bandwidths of the links belonging to the corresponding path γk, and the resource processing time. With these data, one can obtain the time of subtask completion (see Example 4.2.2).
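Since the tree is acyclic, the enumeration behind Eq. (22) can be written directly with a Cartesian product. The sketch below uses illustrative names; the resource-to-RMS paths are those implied by the MTST lists in the example later in this section.

```python
from itertools import product

# omega[j]: resources assigned to subtask j; path[r]: links from resource r to the RMS
omega = {"J1": ["R1", "R4"], "J2": ["R2", "R3"]}
path = {"R1": ["L1", "L5", "L6"], "R2": ["L2", "L5", "L6"],
        "R3": ["L3", "L6"], "R4": ["L4"]}

mtsts = []
for combo in product(*omega.values()):    # N = prod |omega_j| combinations, Eq. (22)
    elements = set(combo)
    for r in combo:
        elements.update(path[r])          # add the links on each resource's path
    mtsts.append(sorted(elements))
print(len(mtsts), mtsts)
# 4 MTST: {R1,R2,L1,L2,L5,L6}, {R1,R3,L1,L3,L5,L6}, {R2,R4,L2,L4,L5,L6}, {R3,R4,L3,L4,L6}
```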
Some elements of the same MTST can belong to several paths if they are involved in data transmission to several resources. To track the element involvement in performing different subtasks, and to record the corresponding times at which an element failure causes the failure of a subtask, we create lists of two-field records for each subtask in each MTST. For any MTST Si (1 ≤ i ≤ N), and any subtask j (1 ≤ j ≤ m), this list contains the names of the elements involved in performing subtask j, and the corresponding time of subtask completion yij (see Example 4.2.3). Note that yij is the conditional time of subtask j completion given that only MTST i is available. Note also that an MTST completes the entire task if none of its elements fails before the maximal time needed to complete the subtasks in whose execution they are involved. Therefore, when calculating the element reliability in a given MTST, one has to use the corresponding record with the maximal time.
pmf of the Task Execution Time

Having the MTST, and the times of their elements' involvement in performing different subtasks, one can determine the pmf of the entire service time. First, we can obtain the conditional time of the entire task completion, given that only MTST S_i is available, as

Y_{\{i\}} = \max_{1 \le j \le m}(y_{ij}) \quad \text{for any } 1 \le i \le N.   (23)

For a set ψ of available MTST, the task completion time is equal to the minimal task completion time among these MTST:

Y_{\psi} = \min_{i \in \psi}(Y_{\{i\}}) = \min_{i \in \psi}\left[\max_{1 \le j \le m}(y_{ij})\right].   (24)
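In other words, Eqs. (23) and (24) are a max over subtasks followed by a min over the surviving MTST. A tiny sketch using the conditional completion times of the example below; the chosen set of available MTST is an arbitrary assumption for illustration.

```python
# Conditional subtask completion times y_ij for the four MTST of the later example
y = {"S1": {"J1": 78, "J2": 40}, "S2": {"J1": 78, "J2": 58},
     "S3": {"J1": 53, "J2": 40}, "S4": {"J1": 53, "J2": 58}}
Y = {i: max(t.values()) for i, t in y.items()}   # Eq. (23): S1->78, S2->78, S3->53, S4->58
available = {"S2", "S4"}                          # assumed set of surviving MTST
task_time = min(Y[i] for i in available)          # Eq. (24): 58
```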
Now, we can sort the MTST in increasing order of their conditional task completion times Y_{i}, and divide them into groups containing MTST with identical conditional completion time. Suppose there are K such groups, denoted G_1, G_2, …, G_K, where 1 ≤ K ≤ N, and any group G_i contains MTST with identical conditional task completion time Θ_i (0 ≤ Θ_1 < Θ_2 < … < Θ_K). Then, it can be seen that the probability Q_i = Pr(Θ = Θ_i) can be obtained as

Q_i = \Pr(E_i, \bar{E}_{i-1}, \bar{E}_{i-2}, \dots, \bar{E}_1),   (25)

where E_i is the event that at least one of the MTST from group G_i is available, and \bar{E}_i is the event that none of the MTST from group G_i is available. Suppose the MTST in a group G_i are arbitrarily ordered, and F_{ij} (j = 1, 2, …, N_i) represents the event that the j-th MTST in the group is available. Then the event E_i can be expressed as

E_i = \bigcup_{j=1}^{N_i} F_{ij},   (26)
and (25) takes the form

\Pr(E_i, \bar{E}_{i-1}, \bar{E}_{i-2}, \dots, \bar{E}_1) = \Pr\Bigl(\bigcup_{j=1}^{N_i} F_{ij}, \bar{E}_{i-1}, \bar{E}_{i-2}, \dots, \bar{E}_1\Bigr).   (27)

Using the Bayesian theorem on conditional probability, we obtain from (27) that

Q_i = \sum_{j=1}^{N_i} \Pr(F_{ij}) \cdot \Pr\bigl(\bar{F}_{i(j-1)}, \bar{F}_{i(j-2)}, \dots, \bar{F}_{i1}, \bar{E}_1, \bar{E}_2, \dots, \bar{E}_{i-1} \mid F_{ij}\bigr).   (28)
The probability Pr(F_ij) can be calculated as the product of the reliabilities of all the elements belonging to the j-th MTST from group G_i. The probability \Pr(\bar{F}_{i(j-1)}, \bar{F}_{i(j-2)}, \dots, \bar{F}_{i1}, \bar{E}_1, \bar{E}_2, \dots, \bar{E}_{i-1} \mid F_{ij}) can be computed by the following two-step algorithm (see Example 4.2.4).

Step 1: Identify the failures of all the critical elements in a period of time (defined by the start and end time) during which they lead to the failures of any MTST from the groups G_m for m = 1, 2, …, i−1 (events \bar{E}_m), and of any MTST S_k from group G_i for k = 1, 2, …, j−1 (events \bar{F}_{ik}), but do not affect the MTST S_j from group G_i.

Step 2: Generate all the possible combinations of the identified critical elements that lead to the event (\bar{F}_{i(j-1)}, \bar{F}_{i(j-2)}, \dots, \bar{F}_{i1}, \bar{E}_1, \bar{E}_2, \dots, \bar{E}_{i-1} \mid F_{ij}) using a binary search, and compute the probabilities of those combinations. The sum of the obtained probabilities is equal to \Pr(\bar{F}_{i(j-1)}, \bar{F}_{i(j-2)}, \dots, \bar{F}_{i1}, \bar{E}_1, \bar{E}_2, \dots, \bar{E}_{i-1} \mid F_{ij}).

When calculating the failure probabilities of the MTSTs' elements, the maximal time from the corresponding records in the list for the given MTST should be used. The algorithm for obtaining the probabilities \Pr(\bar{E}_1, \bar{E}_2, \dots, \bar{E}_{i-1} \mid E_i) can be found in Dai et al. (2002). Having the conditional task completion times Y_{i} for the different MTST, and the corresponding probabilities Q_i, one obtains the task completion time distribution (Θ_i, Q_i), 1 ≤ i ≤ K, and can easily calculate the indices (8) and (9) (see Example 4.2.5).
Illustrative Example

Consider the virtual grid presented in Figure 4, and assume that the service task is divided into two subtasks: J1, assigned to resources R1 and R4, and J2, assigned to resources R2 and R3. J1 and J2 require 50 Kbits and 30 Kbits of input data, respectively, to be sent from the RMS to the corresponding resource, and 100 Kbits and 60 Kbits of output data, respectively, to be sent from the resource back to the RMS. The subtask processing times for the resources, the bandwidths of the links, and the failure rates are presented in Figure 4 next to the corresponding elements.
Table 3. pmf of service time

Θi  | Qi     | Θi·Qi
53  | 0.3738 | 19.8114
58  | 0.1480 | 8.584
78  | 0.0945 | 7.371
∞   | 0.3837 | ∞
The Service MTST

The entire graph constitutes the task spanning tree. There exist four possible combinations of two resources executing both subtasks: {R1, R2}, {R1, R3}, {R4, R2}, {R4, R3}. The four MTST corresponding to these combinations are: S1: {R1, R2, L1, L2, L5, L6}; S2: {R1, R3, L1, L3, L5, L6}; S3: {R2, R4, L2, L5, L4, L6}; S4: {R3, R4, L3, L4, L6}.
Parameters of MTSTs' Paths

Having the MTST, one can obtain the data transmission speed for each path between a resource and the RMS (as the minimal bandwidth of the links belonging to the path), and calculate the data transmission times and the times of subtask completion. These parameters are presented in Table 2. For example, resource R1 (belonging to the two MTST S1 and S2) processes subtask J1 in 48 seconds. To complete the subtask, it should receive 50 Kbits and return 100 Kbits of data to the RMS. The speed of data transmission between the RMS and R1 is limited by the bandwidth of link L1, and is equal to 5 Kbps. Therefore, the data transmission time is 150/5 = 30 seconds, and the total time of task completion by R1 is 30 + 48 = 78 seconds.
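The entries of Table 2 follow from this arithmetic: the path speed is the minimal link bandwidth, the transmission time is the total transferred data divided by that speed, and the completion time adds the processing time. A small sketch reproducing the table, with data volumes and speeds as given in the example:

```python
# (input + output) data in Kbits per subtask; per row: resource, subtask, path speed (Kbps), processing time (s)
data = {"J1": 50 + 100, "J2": 30 + 60}
cases = [("R1", "J1", 5, 48), ("R2", "J2", 6, 25), ("R3", "J2", 4, 35.5), ("R4", "J1", 10, 38)]
for res, sub, speed, proc in cases:
    trans = data[sub] / speed                 # data transmission time in seconds
    print(res, sub, trans, trans + proc)      # transmission time, time to subtask completion
# R1 J1 30.0 78.0 / R2 J2 15.0 40.0 / R3 J2 22.5 58.0 / R4 J1 15.0 53.0
```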
List of MTST Elements

Now one can obtain the lists of two-field records for the components of the MTST.
S1: path for J1: (R1,78), (L1,78), (L5,78), (L6,78); path for J2: (R2,40), (L2,40), (L5,40), (L6,40).
S2: path for J1: (R1,78), (L1,78), (L5,78), (L6,78); path for J2: (R3,58), (L3,58), (L6,58).
S3: path for J1: (R4,53), (L4,53); path for J2: (R2,40), (L2,40), (L5,40), (L6,40).
S4: path for J1: (R4,53), (L4,53); path for J2: (R3,58), (L3,58), (L6,58).
pmf of Task Completion Time

The conditional times of the entire task completion by the different MTST are Y1 = 78; Y2 = 78; Y3 = 53; Y4 = 58.
Therefore, the MTST form three groups: G1 = {S3} with Θ1 = 53; G2 = {S4} with Θ2 = 58; and G3 = {S1, S2} with Θ3 = 78. According to (25), we have for group G1: Q1 = Pr(E1) = Pr(S3). The probability that the MTST S3 completes the entire task is equal to the product of the probabilities that R4 and L4 do not fail by 53 seconds, and that R2, L2, L5, and L6 do not fail by 40 seconds:

Pr(Θ = 53) = Q1 = exp(−0.004×53) exp(−0.004×53) exp(−0.008×40) exp(−0.003×40) exp(−0.001×40) exp(−0.002×40) = 0.3738.

Now we can calculate Q2 as
Q2 = Pr(E2, Ē1) = Pr(F21) Pr(Ē1 | F21) = Pr(F21) Pr(F̄11 | F21) = Pr(S4) Pr(S̄3 | S4),

because G2 and G1 contain only one MTST each. The probability Pr(S4) that the MTST S4 completes the entire task is equal to the product of the probabilities that R3, L3, and L6 do not fail by 58 seconds, and that R4 and L4 do not fail by 53 seconds:

Pr(S4) = exp(−0.004×53) exp(−0.003×58) exp(−0.004×53) exp(−0.004×58) exp(−0.002×58) = 0.3883.
To obtain Pr(S̄3 | S4), one first should identify the critical elements according to the algorithm presented in Dai et al. (2002). These elements are R2, L2, and L5. Any failure occurring in one of these elements by 40 seconds causes the failure of S3, but does not affect S4. The probability that at least one failure occurs in the set of critical elements is

Pr(S̄3 | S4) = 1 − exp(−0.008×40) exp(−0.003×40) exp(−0.001×40) = 0.3812.

Then,

Pr(Θ = 58) = Pr(E2, Ē1) = Pr(S4) Pr(S̄3 | S4) = 0.3883 × 0.3812 = 0.1480.
Now one can calculate Q3 for the last group G3 = {S1, S2}, corresponding to Θ3 = 78, as

Q3 = Pr(E3, Ē2, Ē1) = Pr(F31) Pr(Ē1, Ē2 | F31) + Pr(F32) Pr(F̄31, Ē1, Ē2 | F32) = Pr(S1) Pr(S̄3, S̄4 | S1) + Pr(S2) Pr(S̄1, S̄3, S̄4 | S2).

The probability that the MTST S1 completes the entire task is equal to the product of the probabilities that R1, L1, L5, and L6 do not fail by 78 seconds, and that R2 and L2 do not fail by 40 seconds:

Pr(S1) = exp(−0.007×78) exp(−0.008×40) exp(−0.005×78) exp(−0.003×40) × exp(−0.001×78) exp(−0.002×78) = 0.1999.

The probability that the MTST S2 completes the entire task is equal to the product of the probabilities that R1, L1, L5, and L6 do not fail by 78 seconds, and that R3 and L3 do not fail by 58 seconds:

Pr(S2) = exp(−0.007×78) exp(−0.003×58) exp(−0.005×78) exp(−0.004×58) × exp(−0.001×78) exp(−0.002×78) = 0.2068.
To obtain Pr(S̄3, S̄4 | S1), one first should identify the critical elements. Any failure of either R4 or L4 in the time interval from 0 to 53 seconds causes the failures of both S3 and S4, but does not affect S1. Therefore,

Pr(S̄3, S̄4 | S1) = 1 − exp(−0.004×53) exp(−0.004×53) = 0.3456.
The critical elements for calculating Pr(S̄1, S̄3, S̄4 | S2) are R2 and L2 in the interval from 0 to 40 seconds, and R4 and L4 in the interval from 0 to 53 seconds. The failure of both elements in any one of the following four combinations causes the failures of S3, S4, and S1, but does not affect S2:

1. R2 during the first 40 seconds, and R4 during the first 53 seconds;
2. R2 during the first 40 seconds, and L4 during the first 53 seconds;
3. L2 during the first 40 seconds, and R4 during the first 53 seconds; and
4. L2 during the first 40 seconds, and L4 during the first 53 seconds.

Therefore,

Pr(S̄1, S̄3, S̄4 | S2) = 1 − Π_{i=1}^{4} [1 − Π_{j=1}^{2} (1 − exp(−λij · tij))] = 0.1230,
where λij is the failure rate of the j-th critical element in the i-th combination (j = 1, 2; i = 1, 2, 3, 4), and tij is the duration of the time interval for the corresponding critical element. Having the values of Pr(S1), Pr(S2), Pr(S̄3, S̄4 | S1), and Pr(S̄1, S̄3, S̄4 | S2), one can calculate

Pr(Θ = 78) = Q3 = 0.1999 × 0.3456 + 0.2068 × 0.1230 = 0.0945.

After obtaining Q1, Q2, and Q3, one can evaluate the total task failure probability as

Pr(Θ = ∞) = 1 − Q1 − Q2 − Q3 = 1 − 0.3738 − 0.1480 − 0.0945 = 0.3837,
and obtain the pmf of service time presented in Table 3.
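The exact pmf above can also be checked by simulation. The following sketch samples an exponential failure time for every element (with the per-element failure rates implied by the exponents in the computations above, which is an assumption of this illustration), declares an MTST available if all of its elements survive their maximal record times, and takes the minimum conditional completion time over the available MTST; it converges to approximately the values in Table 3.

```python
import random
from math import inf

random.seed(0)
rate = {"R1": 0.007, "R2": 0.008, "R3": 0.003, "R4": 0.004, "L1": 0.005,
        "L2": 0.003, "L3": 0.004, "L4": 0.004, "L5": 0.001, "L6": 0.002}
# (conditional completion time Y_{i}, element -> maximal record time) for S3, S4, S1, S2
mtsts = [
    (53, {"R4": 53, "L4": 53, "R2": 40, "L2": 40, "L5": 40, "L6": 40}),
    (58, {"R4": 53, "L4": 53, "R3": 58, "L3": 58, "L6": 58}),
    (78, {"R1": 78, "L1": 78, "L5": 78, "L6": 78, "R2": 40, "L2": 40}),
    (78, {"R1": 78, "L1": 78, "L5": 78, "L6": 78, "R3": 58, "L3": 58}),
]
counts = {}
trials = 200_000
for _ in range(trials):
    fail = {e: random.expovariate(r) for e, r in rate.items()}   # failure time of each element
    ok = [y for y, need in mtsts if all(fail[e] > t for e, t in need.items())]
    theta = min(ok) if ok else inf
    counts[theta] = counts.get(theta, 0) + 1
print({t: c / trials for t, c in sorted(counts.items())})
# roughly {53: 0.374, 58: 0.148, 78: 0.095, inf: 0.384}, cf. Table 3
```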
4.2.5. Calculating the Reliability Indices

From Table 3, one obtains the probability that the service does not fail as R(∞) = Q1 + Q2 + Q3 = 0.6164, the probability that the service time is not greater than a pre-specified value θ* = 60 seconds as

R(θ*) = Σ_{i=1}^{3} Qi · 1(Θi < θ*) = 0.3738 + 0.1480 = 0.5218,

and the expected service execution time, given that the system does not fail, as

W = Σ_{i=1}^{3} Θi Qi / R(∞) = 35.7664 / 0.6164 = 58.025 seconds.
Parameterization and Monitoring

In order to obtain the reliability and performance indices of the grid service, one has to know such model parameters as the failure rates of the virtual links and the virtual nodes, and the bandwidths of the links. It is easy to estimate those parameters by implementing monitoring technology. A monitoring system (called Alertmon Network Monitor, http://www.abilene.iu.edu/noc.html) is being applied in the IP-grid (Indiana Purdue Grid) project (www.ip-grid.org) to detect component failures, to record service behavior, to monitor the network traffic, and to control the system configurations. With this monitoring system, one can easily obtain the parameters required by the grid service reliability model by adding the following functions to the monitoring system:
1. Monitoring the failures of the components (virtual links and nodes) in the grid service, and recording the total execution time of those components. The failure rate of a component can then be estimated simply as the number of failures divided by the total execution time (a minimal sketch of such an estimator is given after this list).
2. Monitoring the real-time network traffic of the involved channels (virtual links) in order to obtain the bandwidth of the links.
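A component-side sensor of the kind described in item 1 only needs two accumulators. The sketch below (class and method names are illustrative, not taken from any monitoring product) estimates the failure rate as the accumulated number of failures divided by the accumulated execution time:

```python
class FailureRateSensor:
    """Accumulates execution time and failure count for one component."""
    def __init__(self):
        self.total_time = 0.0   # accumulated execution time in seconds
        self.failures = 0       # accumulated number of observed failures

    def record(self, exec_time, failed=False):
        self.total_time += exec_time
        self.failures += int(failed)

    def failure_rate(self):
        # estimated lambda in 1/sec; undefined until some time has accumulated
        return self.failures / self.total_time if self.total_time > 0 else None

sensor = FailureRateSensor()
sensor.record(3600.0)                 # one hour without failure
sensor.record(1800.0, failed=True)    # half an hour ending in a failure
print(sensor.failure_rate())          # ~0.000185 failures per second
```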
To realize the above monitoring functions, network sensors are required. We propose a type of sensor attached to the components, acting like neurons attached to the skin. This means that the components themselves, or adjacent components, play the role of sensors while they are working. Only a small amount of computational resource in the components is used for accumulating failures and time and for the division operation, and only a little memory is required for saving the data (accumulated number of failures, accumulated time, and current bandwidth). The virtual nodes that have memory and computational capability can play the sensing role themselves; if some links have no CPU or memory, the adjacent processors or routers can perform these data collection operations. Using such a self-sensing technique avoids overloading the monitoring center even in a grid system containing numerous components. Moreover, it does not affect the service performance considerably since only a small part of the computation and storage resources is used for the monitoring. In addition, such a self-sensing technique can also be applied to monitoring other measures. When evaluating the grid service reliability, the RMS automatically loads the required parameters from the corresponding sensors and calculates the service reliability and performance according to the approaches presented in the previous sections. This strategy can also be used for implementing the Autonomic Computing concept.
CONCLUSION

Grid computing is a newly developed technology for complex systems with large-scale resource sharing, wide-area communication, and multi-institutional collaboration. Although the development tools and techniques for the grid have been widely studied, grid reliability analysis and modeling are not easy because of the complexity of combining various types of failures. This chapter introduced grid computing technology and analyzed grid service reliability and performance in the context of performability. The chapter then presented models for the star-topology grid with data dependence and for the tree-structured grid with failure correlation. Evaluation tools and algorithms were presented based on the universal generating function, graph theory, and a Bayesian approach. Numerical examples were presented to illustrate the grid modeling and the reliability/performance evaluation procedures and approaches. Future research can extend the models for grid computing to other large-scale distributed computing systems. After analyzing the details and specifics of the corresponding systems, the approaches and models can be adapted to real conditions. The models are also applicable to wireless networks, which are more failure-prone. Hierarchical models can also be analyzed, in which the output of lower-level models can be considered as the input of higher-level models. Each level can make use of the proposed models and evaluation tools.
ACKNOWLEDGMENT This work was supported in part by National Science Foundation (NSF) under grant number 0831609.
REFERENCES Abramson, D., Buyya, R., & Giddy, J. (2002). A computational economy for grid computing and its implementation in the Nimrod-G resource broker. Future Generation Computer Systems, 18(8), 1061–1074. doi:10.1016/S0167-739X(02)00085-7 Berman, F., Wolski, R., Casanova, H., Cirne, W., Dail, H., & Faerman, M. (2003). Adaptive computing on the Grid using AppLeS. IEEE Transactions on Parallel and Distributed Systems, 14(4), 369–382. doi:10.1109/TPDS.2003.1195409 Cao, J., Jarvis, S. A., Saini, S., Kerbyson, D. J., & Nudd, G. R. (2002). ARMS: An agent-based resource management system for grid computing. Science Progress, 10(2), 135–148. Chen, D. J., Chen, R. S., & Huang, T. H. (1997). A heuristic approach to generating file spanning trees for reliability analysis of distributed computing systems. Computers and Mathematics with Applications, 34(10), 115–131. doi:10.1016/S0898-1221(97)00210-1 Chen, D. J., & Huang, T. H. (1992). Reliability analysis of distributed systems based on a fast reliability algorithm. IEEE Transactions on Parallel and Distributed Systems, 3(2), 139–154. doi:10.1109/71.127256 Dai, Y. S., & Levitin, G. (2006). Reliability and performance of tree-structured grid services . IEEE Transactions on Reliability, 55(2), 337–349. doi:10.1109/TR.2006.874940 Dai, Y. S., Pan, Y., & Zou, X. K. (2006). A hierarchical modelling and analysis for grid service reliability. IEEE Transactions on Computers. Dai, Y. S., Xie, M., & Poh, K. L. (2002), Reliability analysis of grid computing systems. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC2002), (pp. 97-104). New York: IEEE Computer Press. Dai, Y. S., Xie, M., & Poh, K. L. (2005). Markov renewal models for correlated software failures of multiple types. IEEE Transactions on Reliability, 54(1), 100–106. doi:10.1109/TR.2004.841709 Dai, Y. S., Xie, M., & Poh, K. L. (2006).Availability modeling and cost optimization for the grid resource management system. IEEE Transactions on Systems, Man, and Cybernetics. Part A . Systems and Humans: a Publication of the IEEE Systems, Man, and Cybernetics Society., 38(1), 170. Dai, Y. S., Xie, M., Poh, K. L., & Liu, G. Q. (2003). A study of service reliability and availability for distributed systems. Reliability Engineering & System Safety, 79(1), 103–112. doi:10.1016/S09518320(02)00200-4
Dai, Y. S., Xie, M., Poh, K. L., & Ng, S. H. (2004). A model for correlated failures in N-version programming. IIE Transactions, 36(12), 1183–1192. doi:10.1080/07408170490507729 Das, S. K., Harvey, D. J., & Biswas, R. (2001). Parallel processing of adaptive meshes with load balancing. IEEE Transactions on Parallel and Distributed Systems, 12(12), 1269–1280. doi:10.1109/71.970562 Ding, Q., Chen, G. L., & Gu, J. (2002). A unified resource mapping strategy in computational grid environments. Journal of Software, 13(7), 1303–1308. Foster, I., & Kesselman, C. (2003). The Grid 2: Blueprint for a new computing infrastructure. San Francisco: Morgan-Kaufmann. Foster, I., Kesselman, C., Nick, J. M., & Tuecke, S. (2002). Grid services for distributed system integration. Computer, 35(6), 37–46. doi:10.1109/MC.2002.1009167 Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the grid: Enabling scalable virtual organizations. International Journal of High Performance Computing Applications, 15, 200–222. doi:10.1177/109434200101500302 Grassi, V., Donatiello, L., & Iazeolla, G. (1988). Performability evaluation of multicomponent fault tolerant systems. IEEE Transactions on Reliability, 37(2), 216–222. doi:10.1109/24.3744 Krauter, K., Buyya, R., & Maheswaran, M. (2002). A taxonomy and survey of grid resource management systems for distributed computing. Software, Practice & Experience, 32(2), 135–164. doi:10.1002/ spe.432 Kumar, A. (2000). An efficient SuperGrid protocol for high availability and load balancing. IEEE Transactions on Computers, 49(10), 1126–1133. doi:10.1109/12.888048 Kumar, V. K. P., Hariri, S., & Raghavendra, C. S. (1986). Distributed program reliability analysis. IEEE Transactions on Software Engineering, SE-12, 42–50. Levitin, G., Dai, Y. S., & Ben-Haim, H. (2006). Reliability and performance of star topology grid service with precedence constraints on subtask execution. IEEE Transactions on Reliability, 55(3), 507–515. doi:10.1109/TR.2006.879651 Levitin, G., Dai, Y. S., Xie, M., & Poh, K. L. (2003). Optimizing survivability of multi-state systems with multi-level protection by multi-processor genetic algorithm. Reliability Engineering & System Safety, 82, 93–104. doi:10.1016/S0951-8320(03)00136-4 Lin, M. S., Chang, M. S., Chen, D. J., & Ku, K. L. (2001). The distributed program reliability analysis on ring-type topologies. Computers & Operations Research, 28, 625–635. doi:10.1016/S03050548(99)00151-3 Liu, G. Q., Xie, M., Dai, Y. S., & Poh, K. L. (2004). On program and file assignment for distributed systems. Computer Systems Science and Engineering, 19(1), 39–48. Livny, M., & Raman, R. (1998). High-throughput resource management. In The Grid: Blueprint for a new computing infrastructure (pp. 311-338). San Francisco: Morgan-Kaufmann
Meyer, J. (1980). On evaluating the performability of degradable computing systems. IEEE Transactions on Computers, 29, 720–731. doi:10.1109/TC.1980.1675654 Nabrzyski, J., Schopf, J. M., & Weglarz, J. (2003). Grid Resource Management. Amsterdam: Kluwer Publishing. Pham, H. (2000). Software reliability. Singapore: Springer-Verlag. Tai, A., Meyer, J., & Avizienis, A. (1993). Performability enhancement of fault-tolerant software. IEEE Transactions on Reliability, 42(2), 227–237. doi:10.1109/24.229492 Xie, M. (1991). Software reliability modeling. Hackensack, NJ: World Scientific Publishing Company. Xie, M., Dai, Y. S., & Poh, K. L. (2004). Computing systems reliability: Models and analysis. New York: Kluwer Academic Publishers. Yang, B., & Xie, M. (2000). A study of operational and testing reliability in software reliability analysis. Reliability Engineering & System Safety, 70, 323–329. doi:10.1016/S0951-8320(00)00069-7
KEY TERMS AND DEFINITIONS

Bayesian Analysis: Using Bayes' method to obtain the posterior distribution from a prior distribution.
Graph Theory: Using graph algorithms to analyze a given network graph.
Grid Computing: A newly developed technology for complex systems with large-scale resource sharing, wide-area communication, and multi-institutional collaboration.
Modeling: A representation, generally in mathematical form, that shows the construction or behavior of a computing system.
Performance: The inverse of the execution time.
Reliability: The probability that the service is successfully completed within a given execution time.
Universal Generating Function: Also called the u-function; a technique to express and evaluate models in a polynomial format.
Chapter 11
Mixed Programming Models Using Parallel Tasks Jörg Dümmler Chemnitz University of Technology, Germany Thomas Rauber University of Bayreuth, Germany Gudula Rünger Chemnitz University of Technology, Germany
ABSTRACT Parallel programming models using parallel tasks have been shown to be successful for increasing scalability on medium-size homogeneous parallel systems. Several investigations have shown that these programming models can be extended to hierarchical and heterogeneous systems which will dominate in the future. In this chapter, the authors discuss parallel programming models with parallel tasks and describe these programming models in the context of other approaches for mixed task and data parallelism. They discuss compiler-based as well as library-based approaches for task programming and present extensions to the model which allow a flexible combination of parallel tasks and an optimization of the resulting communication structure.
INTRODUCTION Large modular parallel applications can be decomposed into a set of cooperating parallel tasks. This set of parallel tasks and their cooperation or coordination structure are a flexible representation of a parallel program for the specific application. The flexibility in scheduling and mapping the parallel tasks can be exploited to achieve efficiency and scalability on a specific distributed memory platform by choosing a suitable mapping and scheduling of the tasks. Each parallel task is responsible for the computation of a specific part or module of the application, and can be executed on an arbitrary number of processors. The terms multiprocessor tasks, malleable tasks and moldable tasks have been used to denote such parallel tasks. In the following, we use the term multiprocessor task (M-task). An M-task can be implemented DOI: 10.4018/978-1-60566-661-7.ch011
Mixed Programming Models Using Parallel Tasks
using an SPMD programming model (basic M-task) or can be hierarchically composed of other M-tasks and thereby support nested parallelism (composed M-task). The advantage of the M-task programming model is to exploit coarse-grained parallelism between M-tasks and fine-grained parallelism within basic M-tasks in the same program and thus the potential parallelism and scalability can be increased. Each M-task provides an interface consisting of a set of input and output parameters. These parameters are parallel data structures that are distributed among the processors executing the M-task according to a predefined distribution scheme, e.g. a block-wise distribution of an array. A data dependence between M-tasks M1 and M2 arises if M1 produces output data required as an input for M2. Such data dependencies might lead to data re-distribution operations if M1 and M2 are executed on different sets of processors or if M1 produces its output in a different data distribution than expected by M2. Control dependencies are introduced by coordination operators, e.g. loop constructs for the repeated execution of an M-task or constructs for the conditional execution of an M-task. The data and control dependencies between M-tasks can be captured by a graph representation. Examples are macro dataflow graphs(Ramaswamy, Sapatnekar, & Banerjee, 1997) or series-parallel (SP) graphs(Rauber & Rünger, 2000). The actual execution of an M-task program is based on a schedule of the M-tasks that has to take the data and control dependencies into account. M-tasks that are connected by a data or control dependence have to be executed subsequently. For independent M-tasks both, a concurrent execution on disjoint processor groups or an execution one after another are possible. The optimal schedule depends on the structure of the application and on the communication and computing performance of the parallel target platform. For the same application a pure data parallel schedule that executes all M-tasks consecutively on all available processors might lead to the best results on one platform but a mixed task and data parallel schedule may result in lower execution times on another platform. Thus, the parallel programming with M-tasks offers a very flexible programming style exploiting different levels of granularity and making parallel programs easily adoptable to a specific parallel platform. Examples for M-task applications come from multiple areas. Large multi-disciplinary simulation programs consist of a collection of algorithms from different fields, e.g. aircraft design (Chapman, Haines, Mehrota, Zima, & van Rosendale, 1997; Bal & Haines, 1998) that uses models from aerodynamics, propulsion, and structural analysis or environmental simulations (Chapman et al., 1997) that combine atmospheric, surface water, and ground water models. Examples from numerical analysis include solution methods for ordinary differential equations (ODEs) like extrapolation methods (Rauber & Rünger, 2000), iterated Runge-Kutta methods (Rauber & Rünger, 1999a), implicitly iterated Runge-Kutta methods (Rauber & Rünger, 2000), or Parallel Adams methods (Rauber & Rünger, 2007). These time-stepping methods compute a fixed number of independent stage vectors within each time step and combine these vectors into the new approximation vector for the next time step. Partial differential equations (PDEs) can be defined over geometrically complex domains that are decomposed into sets of partially overlapping discretization meshes. 
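The dependence structure described above can be captured by a small data structure. The following sketch uses illustrative names only (it is not the interface of the TwoL or TLib approaches discussed later); it records the parameter distributions of M-tasks and derives where a data re-distribution operation between two dependent M-tasks would be needed:

```python
from dataclasses import dataclass, field

@dataclass
class MTask:
    name: str
    inputs: dict = field(default_factory=dict)    # parameter -> required distribution
    outputs: dict = field(default_factory=dict)   # parameter -> produced distribution
    group: tuple = ()                             # processors assigned by the schedule

def redistributions(producer, consumer):
    """Data dependencies that need a re-distribution: same parameter, but a
    different data distribution or a different processor group."""
    ops = []
    for param, dist in producer.outputs.items():
        if param in consumer.inputs:
            if dist != consumer.inputs[param] or producer.group != consumer.group:
                ops.append((param, dist, consumer.inputs[param]))
    return ops

m1 = MTask("M1", outputs={"A": "block"}, group=(0, 1, 2, 3))
m2 = MTask("M2", inputs={"A": "cyclic"}, group=(4, 5, 6, 7))
print(redistributions(m1, m2))   # [('A', 'block', 'cyclic')]
```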
Solution methods for PDEs can exploit coarse-grained parallelism between these meshes and fine-grained parallelism within the meshes (Merlin, Baden, Fink, & Chapman, 1999; Diaz, Rubio, Soler, & Troya, 2003). Hierarchical algorithms and divide-and-conquer algorithms compute partial solutions for independent subsets of the input and derive the final solution from these partial results. Examples are multi-level matrix multiplication algorithms (Hunold, Rauber, & Rünger, 2008). Stream-based applications process input streams by several pipeline stages and can exploit task and data parallelism by replicating non-scaling stages and executing the replicas concurrently. Examples come from image processing (Subhlok & Yang, 1997) and sensor-based programs that periodically process data produced by sensors (Subhlok & Yang, 1997; Bal & Haines, 1998; Orlando, Palmerini, & Perego, 2000).
247
Mixed Programming Models Using Parallel Tasks
There is a large variety of specific parallel programming models which support the programming with parallel tasks, multiprocessor tasks or related concepts. In the next section we start with an overview of programming approaches for mixed parallelism with different ways of programming support. A more detailed description is given for the TwoL(two level) approach with its compiler support and the TLib approach with a library interface. Moreover, we present extensions to the parallel programming with M-tasks that have been proposed recently. The scheduling and mapping of M-tasks with dependencies is an important method to get efficient versions of an M-task program. Thus, we present mapping techniques for M-task programs and finish the chapter with measurements of numerical codes on up-to-date multi-core clusters.
TASK-BASED PROGRAMMING APPROACHES Several approaches have been proposed for the use of M-tasks for programming large parallel systems including language extensions as well as skeleton-based, library-based and coordination-based approaches. We give an overview of these approaches in the following subsections.
Language Extensions Language extensions enrich existing programming languages with additional annotations or language constructs to support a mixed task and data parallel execution. The host languages are often data parallel languages that are extended to support task parallelism or task parallel languages with support for data parallelism. A special compiler is required to translate the language extensions. Most approaches use a source-to-source compiler which creates a program in the host language that utilizes a runtime library to realize the extensions. Fortran M (Foster & Chandy, 1995) is a task parallel language based on Fortran 77. Language constructs are provided for creating processes and communication channels that enable one-to-one communication between processes based on a message passing paradigm. The process model is dynamic, i.e., new processes and communication channels can be created at runtime. Fortran D(Fox, Hiranandani, Kennedy, Koelbel, Kremer, & Tseng et al., 1990) and High Performance Fortran (HPF)(High Performance Fortran Forum, 1993) are data parallel languages based on Fortran 90 that include primitives to distribute arrays among processors and data parallel operations such as array expressions and parallel loops. The integration of Fortran M with either Fortran D or HPF is described in (Chandy, Foster, Kennedy, Koelbel, & Tseng, 1994). Fortran M is responsible for resource management, e.g. starting the data parallel tasks that are executed on processor groups specified by the user, and Fortran D or HPF takes care of the distribution of computations and data structures on these groups. Two concurrently executed data parallel tasks can communicate with each other using send and receive operations on a channel that has to be provided by the parent task. Opus (Chapman, Mehrotra, van Rosendale, & Zima, 1994; Chapman et al., 1997) defines a set of extensions to the data parallel HPF language to support the coordination of multiple independent data parallel modules. Target applications of Opus are coarse-grained multi-disciplinary simulations consisting of independent program parts that periodically exchange information, e.g. for the simultaneous optimization of the aerodynamic and structural design of an aircraft configuration. Task parallelism in Opus is realized by special subroutines that may be invoked onto a specific set of processor resources
248
Mixed Programming Models Using Parallel Tasks
that has to be provided by the user. The heart of the Opus extensions are ShareD Abstractions (SDAs) that are objects encapsulating data and methods. An SDA may be shared by multiple tasks thus supporting communication and coordination between these tasks. The framework OpusJava(Laure, Mehrotra, & Zima, 1999; Laure, 2001) has been proposed to integrate Opus components into larger distributed Java based environments and thus providing support for loosely coupled heterogeneous platforms. Braid (West & Grimshaw, 1995) adds data parallelism to the object-oriented Mentat task parallel language. Mentat is based on C++ and provides high-level abstractions to define task parallel objects. The Mentat system handles the dynamic creation, communication, synchronization, and scheduling of these objects. Braid additionally supports data parallel objects that include overlay methods to initialize data, aggregate methods to apply operations on all or a subset of the data elements, and reduction methods to distill information from the values of the data set. The user can provide annotations to inform the compiler and runtime system about the communication behavior of the objects. This includes local communication within data parallel methods, e.g. nearest neighbor communication pattern and the interaction between objects and which operations are dominant. The runtime system realizes the distribution of the data based on the user’s annotations and platform specific characteristics. Fx (Subhlok & Yang, 1997) is a Fortran-based language that integrates directives to partition and layout data similar to HPF and directives to control task parallelism. Task parallelism can be exploited within specific areas of the program called task regions. Within a task region subroutine calls can be executed in a data parallel way by a subset of the available processors. The size and layout of the processor subsets can be computed at runtime. Each subroutine may contain additional task regions, thus providing support for nested parallelism. The Fx framework includes a mapping tool to compute an optimized task placement based on a dynamic programming approach (Subhlok & Vondran, 1995). The required cost information are obtained by executing the program with different mappings. High Performance Fortran 2.0 (HPF 2.0) (High Performance Fortran Forum, 1997) is a language extension based on Fortran 95 including approved extensions for a mixed task and data parallel execution. The utilized task model is similar in spirit to the Fx approach. The task region construct provides support for creating independent coarse-grained tasks, each of which can itself execute data parallel or nested task parallel computations. The on directive allows the programmer to control the distribution of computations among the processors of a parallel machine. The distribution of the data on processor groups and subgroups is supported by the distribute and align directives. The shape of the utilized processor groups can be computed at runtime. Orca (Ben Hassen, Bal, & Jacobs, 1998) defines a specification language that is translated into C code utilizing a special runtime library. Data parallelism is available in form of partitioned objects that may be distributed over multiple processors. Data parallel computations are performed using the owner-computes rule and communication operations to access remote data are inserted by the compiler. Task parallelism is expressed by using processes that can be started dynamically. 
The data distribution for partitioned objects and the processors for the execution of a task have to be explicitly coded by the programmer. The communication between processes is supported by shared objects that are implemented as instances of abstract data types. Each process can read and modify data within the shared objects by using atomic operations, thus enabling data exchanges between concurrently executing processes. Spar/Java (van Reeuwijk, Kuijlman, & Sips, 2003; Sips & van Reeuwijk, 2004) defines language extensions for Java that are translated to C++ code by the Timber compiler using the Vnus language as an intermediate step. The compiler includes special optimizations for multi-dimensional arrays. The language extensions provide annotations to explicitly distribute data and computations. The syntax of
249
Mixed Programming Models Using Parallel Tasks
these annotations is similar to functional languages. The foreach construct defines data parallel computations, e.g. operations on array elements. The each construct defines data independence for a set of statements and therefore enables a task parallel execution. The executing processors can be specified for each statement using the on annotation. Fortress (Allen, Chase, Hallett, Luchangco, Maessen, & Ryo et al., 2008), Chapel (Chamberlain, Callahan, & Zima, 2007) and X10 (Charles, Grothoff, Saraswat, Donawa, Kielstra, & Ebcioglu et al., 2005) are new parallel programming languages that are currently under development. The underlying programming models provide a higher level of abstraction than the previously mentioned approaches and are targeted to increase the productivity of the programmers. The memory is assumed to be globally shared by all program parts; necessary communication operations have to be automatically determined by the compiler. Fortress is an object-oriented language that expresses parallel computations with implicit and explicit threads. Explicit threads are created by the programmer; implicit threads are created by parallel language constructs, e.g. also-do blocks to define independent computations for task parallelism or for loops which are parallel by default and can be executed in a data parallel way. The parallel target platform is modeled by regions that can be hierarchically nested; threads can be assigned to specific regions by the user to increase the performance. Currently, Fortress is only available for shared memory platforms but an extension for distributed memory systems is planned. Parallel platforms in the Chapel language can be described by a set of locales, e.g. a locale per cluster node. Data and computations can be mapped on locales using the on clause. Data parallelism in Chapel is expressed by domains that define the size and shape of arrays. Domains can be distributed among locales. Data parallel operations on array elements can be expressed using the forall loop, or the reduce and scan functions. Task parallelism is supported by the cobegin directive that expresses independent computations. X10 is based on a partitioned global address space (PGAS) memory model that is represented by a set of places. Multiple activities may be executed concurrently by different places. The async statement supports the creation of asynchronous activities on specific places. These activities can be synchronized using by the finish statement. X10 supports multi-dimensional arrays that may be distributed among a set of places using pre-defined or user-defined distribution types. Data parallel operations on arrays can be performed using the ateach construct.
Skeleton-Based Approaches Skeleton-based approaches include a predefined set of coordination patterns to combine sequential code or small parallel program fragments into complex parallel applications. Parallel skeletons can provide support for data parallel computations, e.g. mapping the same code onto different parts of the input data, or for task parallel computations, e.g. arrange different tasks in form of a pipeline. Multi-level parallelism is supported by nesting different skeletons within each other. P3L (Pelagatti, 2003) is a skeleton coordination language using C as a host language that is used to express the sequential portions of the application. The supported skeletons include data parallel, task parallel and control skeletons that can be nested within each other. Data parallel skeletons are map to distribute data and to apply a specific skeleton to each data element, reduce to combine distributed data into a single value, scan to compute the parallel prefix of an array, and comp for functional composition. The task parallel skeletons operate on streams of input data; pipe applies a sequence of skeletons to the
250
Mixed Programming Models Using Parallel Tasks
input data one after another forming a pipeline and farm applies the same skeleton to different items of the input data stream. Control skeletons are seq for wrapping sequential code and loop for the repeated execution of another skeleton. P3L includes a compiler for the generation of C+MPI programs that utilize a library which provides optimized implementation templates for each skeleton. A cost expression for each skeleton is available and, thus, the costs for the entire application can be determined by combining these cost expressions according to the hierarchical structure of the application. The costs for the sequential fraction of the code are obtained by profiling. taskHPF (Ciarpaglini, Folchi, Orlando, Pelagatti, & Perego, 2000) uses a two-tier model for combining task and data parallelism. The task parallel coordination structure is described by a high-level language that includes the definition of data parallel tasks with input and output parameters and the interaction between tasks based on predefined skeletons. Available skeletons are the pipeline pattern and the replicate directive to create multiple incarnations of non-scalable stages. The specification language includes the on processors directive to define the number of executing processors for each data parallel task. HPF is used to implement the data parallel tasks and to describe data distributions within these tasks. Necessary re-distribution operations between data parallel tasks are identified by the compiler and are realized by the COLTHPF (Orlando & Perego, 1999; Orlando et al., 2000) coordination layer. LLC (Dorta, González, Rodriguez, & de Sande, 2003; Dorta, López, & de Sande, 2006) is a highlevel parallel language with support for algorithmic skeletons. The host language is C augmented with OpenMP-like directives to define skeletons and to provide additional information to the compiler. The compiler llCoMP translates this code into a parallel C+MPI program. Basic data parallel skeletons in LLC are forall to define parallel loops and taskq to define task farms. Task parallelism is provided by the sections skeleton to define independent computations and the pipeline skeleton to describe pipelined computations. The implementation of the basic skeletons partitions the available processors into a number of subgroups equal to the number of tasks, e.g. number of pipeline stages or loop iterations. The mapping of tasks onto processor groups can be controlled by assigning weights. ASSIST (Vanneschi, 2002) is a framework for the skeleton-based composition of sequential and parallel modules into complex applications. Sequential modules are provided in a host language of ASSIST (C, C++, and Fortran) and operate on streams of input data. Parallel modules are expressed by the parmod construct that defines the input and output streams, a set of virtual processors and a virtual processor topology. Additionally, modules can access external objects that may be declared as shared, thus supporting data exchanges between concurrently executing modules. The interaction between the modules in form of a directed graph is described in the ASSIST-CL coordination language. The nodes of the graph correspond to components and the edges are data streams that are communicated between components. For program execution, the virtual processors of the parallel modules have to be mapped onto physical processors. This mapping can also be reconfigured at runtime (Vanneschi & Veraldi, 2007). 
Lithium (Aldinucci, Danelutto, & Teti, 2003) is a Java-based programming environment for the development of structured parallel applications based on a set of predefined skeletons provided in form of a library. A variety of skeletons is supported, e.g. the data parallel map and divide-&-conquer, the task parallel farm and pipe and control skeletons to model loops and conditionals. The execution of a Lithium application is based on a master-slave approach: the master contains a task pool with all executable tasks that are distributed to the slave nodes. DIP (Diaz et al., 2003) is a pattern-based coordination language with focus on domain decomposition and multi-block applications, e.g. solution methods for PDEs. The implementation of DIP is based on the border-based coordination language BCL (Diaz, Rubio, Soler, & Troya, 2002), i.e., the DIP compiler
251
Mixed Programming Models Using Parallel Tasks
translates a DIP specification program into a BCL program. BCL supports the solution of numerical problems with multiple domains by automatically creating necessary border-exchange operations between domains. Basic data parallel tasks are implemented in HPF. DIP provides the multiblock pattern to describe a fixed number of k-dimensional domains with fixed boundary coordinates that require a periodic exchange of border values. Additionally, the pipe pattern to describe a chain of pipeline stages and the replicate directive to create multiple independent incarnations of a pipeline stage each operating on different data from the input stream are available. DIP supports multiple implementation templates for each pattern. The programmer is responsible for selecting an appropriate template and for specifying the number of processors to execute each task. SBASCO (Diaz, Rubio, Soler, & Troya, 2004) is an enhancement of the DIP approach that similarly supports the multiblock, pipe and farm skeletons. Additionally, SBASCO includes a cost model for the estimation of the execution time of each skeleton depending on hardware parameters. SBASCO distinguishes two different views on the specification, the application view and the configuration view. The application view describes the structure of the application using the available skeletons and the basic data parallel components with their input and output parameters. The configuration view extends the application view with information on data distributions, processor layout and the internal structure of the components. The application view is provided by the programmer; the configuration view is used by a configuration tool to obtain an efficient allocation of the different application components on parallel platforms based on the cost model enhanced by a run-time analysis.
Library-Based Approaches
Library-based approaches provide library routines to support task and data parallel executions. This includes the support of coordination and synchronization of multiple data parallel tasks, the provision of data re-distribution routines, the creation of processor groups and the execution of tasks on the correct processors.

HPF/MPI (Foster, Kohr, Krishnaiyer, & Choudhary, 1996) is a library that provides an HPF binding to the MPI message passing library and thus enables HPF programs to issue MPI communication operations. Therefore, the coordination and synchronization of different concurrently executing programs are supported. Arbitrary variables defined in the HPF program can be used as parameters for the communication operations provided. These variables may be distributed among the processors executing the data parallel module and therefore the implementation of the library has to deal with arbitrarily distributed data structures. For example, a point-to-point communication operation between two modules has to handle the case of different source and target distribution types. This is realized using a descriptor exchange to exchange distribution information between communicating modules.

HPF_TASK_LIBRARY (Brandes, 1999) enables the interaction of data parallel HPF tasks during their execution time by providing point-to-point and collective communication operations. The library is designed for the HPF 2.0 task model that supports the creation of data parallel tasks on disjoint processor groups but does not allow communication between concurrently executed tasks. The library supports the exchange of data structures that are distributed among multiple processors. Therefore, the distribution information has to be exchanged prior to the data transmission to determine the resulting communication pattern. Nested parallel executions are supported, but only tasks on the same nesting level may communicate with each other.
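The descriptor exchange used by such libraries can be pictured with a small MPI fragment: before the actual data transfer, each side sends its local distribution descriptor (here reduced to the block offset and length of a 1-D block distribution) to a partner process in the other module. This is a hedged sketch of the general idea only; the real HPF/MPI and HPF_TASK_LIBRARY descriptors and protocols are considerably richer.

#include <mpi.h>

/* Hypothetical, simplified distribution descriptor: the part of a
 * 1-D block-distributed array owned by this process (two ints). */
typedef struct { int offset; int length; } BlockDescr;

/* Exchange descriptors with a partner process in the other module
 * before the actual data transfer, so both sides know which index
 * ranges overlap and how much data to expect. */
void exchange_descriptors(BlockDescr mine, BlockDescr *theirs,
                          int partner, MPI_Comm comm)
{
    MPI_Sendrecv(&mine,  2, MPI_INT, partner, 0,
                 theirs, 2, MPI_INT, partner, 0,
                 comm, MPI_STATUS_IGNORE);
}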
Figure 1. Decomposition of the set of processes V={1,2,..,9} into a two-dimensional grid and executing a group-SPMD phase using vertical processor groups (left) and horizontal processor groups (right).
KeLP-HPF (Merlin et al., 1999) uses the C++ class library KeLP (Fink, 1998) to coordinate multiple data parallel HPF tasks. KeLP provides high-level abstractions to simplify the development of block-structured algorithms on SMP clusters. KeLP builds on MPI and includes mechanisms to manage data layout, data motion and the parallel control flow. For the data layout, general block decompositions are supported and the communication schedules are determined at runtime. In the KeLP-HPF programming model, KeLP can dynamically create processor groups and start new HPF tasks. Thus, the programming model is especially suited for applications that execute regular data parallel operations on an irregular or dynamic domain, e.g. multi-block codes or adaptive refinement methods. The arguments for the data parallel tasks are provided by KeLP in a distributed format along with a mapping descriptor that informs the HPF code of the distribution type.

The library ORT (Rauber, Reilein-Ruß, & Rünger, 2004a) supports the programming of applications with a two- or higher-dimensional task grid and task dependencies mainly aligned in the dimensions of the task grid. Examples are algorithms from linear algebra based on two- or higher-dimensional arrays, like the LU decomposition. The programming model of the ORT library is based on a group-SPMD model in which the set of processors is subdivided into a set of disjoint groups of processors and each processor group executes a parallel task in parallel to the other groups. In the programming model of the ORT library there exist several partitions of the entire set of processors into disjoint groups with the specific property that the groups are orthogonal to each other in a two- or higher-dimensional grid; Figure 1 shows the two-dimensional case. A typical ORT program consists of computation phases and communication phases. Each phase is executed on exactly one of the processor decompositions and performs either a group-SPMD computation on the decomposition or a communication within the groups. During the execution of the ORT program the active processor decomposition changes from phase to phase such that different tasks cooperate in the group-SPMD way. The ORT library calls support the building of processor decompositions based on MPI and the mapping of tasks to the processor groups. The orthogonal way of communication can speed up the communication phases of an application, and it is useful to integrate it into a hierarchical model, as will be described for the extended programming model in a subsequent section of this article.
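The orthogonal processor decompositions of Figure 1 can be expressed directly with MPI communicators. The following minimal sketch (not the ORT API) splits nine processes, interpreted as a 3x3 grid, into row and column groups; a group-SPMD phase can then use one of the two decompositions for its collective operations.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, cols = 3;                 /* assumes mpirun with 9 processes */
    MPI_Comm row_comm, col_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Interpret the ranks 0..8 as a 3x3 grid; processes with the same
     * row index form one horizontal group, processes with the same
     * column index form one vertical (orthogonal) group. */
    MPI_Comm_split(MPI_COMM_WORLD, rank / cols, rank, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, rank % cols, rank, &col_comm);

    /* A group-SPMD phase can now use row_comm, the next one col_comm,
     * e.g. for broadcasts restricted to one row or one column. */

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}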
Coordination-Based Approaches
Coordination-based approaches are based on a static task structure that might be provided in form of an explicit specification of the available parallelism. A compiler or a transformation-based toolset can be used to translate the specification into executable code. In contrast to language extensions, the complete structure of the application is visible to the compiler and optimizations like scheduling can be applied.

Paradigm (Joisha & Banerjee, 1999) is a parallel research compiler framework for HPF programs that additionally supports task parallel extensions proposed in (Ramaswamy et al., 1997). The extensions include annotations in the program source that enable the automatic extraction of the task parallel structure of the application in form of a macro dataflow graph (MDG). The MDG has a hierarchical structure with simple nodes representing computation, loop nodes (for or while), conditional nodes (if) and user-defined nodes; edges symbolize data dependencies. The MDG is annotated with cost parameters for simple nodes and possible data re-distribution operations resulting from data dependencies. The node costs are determined by profiling and fitting the obtained results to a curve according to Amdahl’s law. The costs for data re-distribution operations depend on the size of the transmitted data, the overhead for sending and receiving data, and the transmission time of the network. The Paradigm framework includes scheduling support to map MDGs on a specific target platform. Two scheduling algorithms, TSAS and SAS, are available (Ramaswamy, 1996). The final stage is the generation of an optimized MPMD code that utilizes a data re-distribution library for multi-dimensional arrays. The communication pattern and the communication schedule of the re-distribution operations are calculated at runtime using the FALLS algorithm (Ramaswamy, Simons, & Banerjee, 1996).

Network of Tasks (Pelagatti & Skillicorn, 2001) is a programming model that defines a coordination language for coarse-grained tasks with an emphasis on runtime prediction. An application is modeled as a directed acyclic graph with nodes being arbitrary parallel programs that may be heterogeneous. The nodes are adaptive, i.e., different implementations may be available and the number of executing processors can be modified. The directed edges of the graph indicate one-way communication with parallel data structures or streams. The scheduling of the task graph on a target platform is performed by a work-based allocation technique (Skillicorn, 1999). Pipelining and farming are used to increase application performance in case there are enough processors available. Pipelining allows the simultaneous execution of all nodes of a subgraph, e.g. iterations of a loop, and farming increases the effective parallelism by replicating non-scaling nodes. The costs for the entire application are composed of the costs for the nodes that are provided by the user and the costs for the communication that are derived using the BSP model.

S-Net (Grelck, Scholz, & Shafarenko, 2007) is a stream-based coordination language to combine data parallel modules implemented in SAC. SAC is a side-effect free functional language that supports data parallel operations on n-dimensional stateless arrays. S-Net treats data parallel SAC programs as stateless boxes operating on input streams and producing an output stream.
On arrival of an item on the input stream the box is expected to apply its operation and produce one or more output items on a single output stream. In S-NET, the functionality of these boxes is defined using a box signature that maps the input type to output types. Four constructors are available to hierarchically compose boxes into complex networks. The static serial composition connects two networks A and B by connecting the output of A with the input of B. The static parallel composition of two networks A and B sends input items depending on their types either to A or B and merges the output streams of A and B. The serial and parallel replicators support the
repeated execution of a network A where the iterations are connected via serial composition or parallel composition, respectively. A compiler for S-NET programs for shared memory platforms is currently under development. The performance-aware composition framework (Kessler & Löwe, 2007) supports the combination of parallel and sequential components into parallel applications with an emphasis on performance prediction. Each component is required to provide a functional interface specifying the parameters and a performance interface that contains information on the execution time depending on the number of executing processors. For each component there may exist multiple parallel or sequential implementation variants that share the same functional interface but define separate performance interfaces. The implementation of the parallel variants may be based on an SPMD programming model. The structure of the application is defined using a host language extended by annotations that are evaluated by a composition tool. Parallel components may include compose_parallel operators that mark independent invocations of components, i.e., any sequential or parallel execution order is valid. Calls to components outside this operator are assumed to be executed sequentially in the specified order. The execution of the target application is based on a static variant dispatch table that is created by the composition tool. This table contains the optimal implementation variant for each combination of component and processor number. Additionally, this table contains a schedule for each compose_parallel operator for each number of executing processors. The schedule is determined using scheduling techniques for independent M-tasks and specifies the execution order and sizes of processor groups. At runtime, the optimal schedule or implementation variant is selected depending on the actual problem size and number of available processors. A prototype compiler using the C-based parallel language Fork for the implementation of the components has been realized.
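The static variant dispatch table created by the composition tool can be pictured as a simple lookup structure. The following C sketch is a hypothetical layout chosen for illustration; the type and function names are assumptions and do not reflect the framework's actual data structures.

/* Hypothetical dispatch table: for each (component, processor count)
 * the composition tool records the fastest implementation variant. */
#define MAX_COMPONENTS 16
#define MAX_PROCS      64

typedef struct {
    int    variant;         /* index of the selected implementation     */
    double predicted_time;  /* estimate from the performance interface  */
} DispatchEntry;

static DispatchEntry dispatch[MAX_COMPONENTS][MAX_PROCS + 1];

/* At runtime the optimal variant is a constant-time lookup. */
int select_variant(int component, int procs)
{
    return dispatch[component][procs].variant;
}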
HIERARCHICAL M-TASK PROGRAMMING
In this section, we present hierarchical programming approaches for mixed programming with parallel tasks. In particular, we describe the library TLib (task library), as well as the coordination model TwoL (two level), which is based on a specification language for the hierarchical composition of tasks.
TwoL Model
Support for the programming with parallel tasks (called modules) can also be provided in form of a coordination approach. The TwoL (two level) model (Rauber & Rünger, 1996; Rauber & Rünger, 2000) is a top-down method for the development of applications that distinguishes two well-separated layers of parallelism. The lower (data parallel) level defines the interfaces of modules that are provided by the application developer. These basic modules are treated as black-box SPMD codes that can be executed on an arbitrary number of processors. For each basic module there may be multiple implementation variants differing in the data distribution of the parameters or the employed algorithm. The upper (task parallel) level defines composed modules that are hierarchically composed of other modules. The structure of the composed modules is defined in the platform-independent TwoL-specification language. The specification of a composed module is based on a module expression consisting of invocations of modules (as basic elements) and a set of coordination operators that specify data dependence or data independence between subexpressions. The ||-operator combines independent computations for
which both a concurrent and a consecutive execution are possible. The °-operator demands the subsequent execution of computations due to data dependencies that may lead to data re-distribution operations. Sequential loops can be defined using the for and while operators and parallel loops are specified using the parfor operator. A data dependence between the iterations of sequential loops is assumed whereas the iterations of parallel loops are independent of each other and can be computed concurrently. The conditional execution of subexpressions is supported by the if operator.

The initial TwoL-specification of an algorithm defines the maximum degree of available task parallelism. For an execution on a specific target platform, the actual degree of task parallelism that should be exploited and the data distributions of the modules need to be fixed. These decisions are made by several incremental transformation steps resulting in a non-executable parallel frame program. The parallel frame program does not include any platform-dependent information, but different platforms may require different frame programs to achieve a good performance. The final transformation step of the TwoL framework translates the parallel frame program into an executable message passing program. The generated program is responsible for the creation of the correct communication context for the execution of the basic modules, e.g. by using communicators provided by MPI, and a correct dataflow between modules by inserting data re-distribution operations at the appropriate positions.

The design steps in the derivation of an efficient parallel frame program are based on a cost model that has to provide accurate predictions for the execution times of modules depending on the number of executing processors and for the data re-distribution operations between modules depending on the source and target processor groups and distribution types. For the basic modules, a cost model based on parameterized runtime formulas is employed. These closed-form symbolic formulas consist of a term describing the execution times of the arithmetic operations and functions that describe the runtime of the internal communication operations such as single-transfer and broadcast operations. The platform-independent structure of the runtime formulas can be derived by inspecting the program text. The compiler tool SCAPP (Kühnemann, Rauber, & Rünger, 2004) has been developed to automate this task. The platform-specific parameters of the formulas can be determined through profiling techniques. Data re-distribution costs are modeled using a platform-dependent startup time and byte-transfer time of the interconnection network. For composed modules, the runtime functions are composed according to the hierarchical structure of the module (see Figure 2).

The scheduling step of the TwoL framework determines an execution order for independent modules, assigns processors to modules and load balances processor groups. For this step, the specification program is transformed into a global module dependence graph (MDG) that captures the data dependencies between modules. The MDG is a directed acyclic graph that exhibits a series-parallel (SP) structure; Figure 3 (left) shows an example. For the scheduling and load balancing, the TwoL-Level (Rauber & Rünger, 1998) and the TwoL-Tree (Rauber & Rünger, 1999b) algorithms have been developed and implemented in a scheduling toolkit (Dümmler, Kunis, & Rünger, 2007a).
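A parameterized runtime formula for a basic module might be coded as follows. The concrete formula is a hypothetical example for a module that performs n/p local operations per processor followed by a broadcast modeled as a logarithmic tree; the platform parameters correspond to the profiled quantities mentioned above.

#include <math.h>

/* Platform parameters determined by profiling/benchmarking. */
typedef struct {
    double t_op;      /* time per arithmetic operation [s]        */
    double t_startup; /* startup time of a communication [s]      */
    double t_byte;    /* transfer time per byte [s]                */
} Platform;

/* Hypothetical closed-form runtime of a basic module on p processors:
 * n/p operations per processor plus a broadcast of `bytes` bytes,
 * modeled with a logarithmic tree. */
double module_runtime(long n, long bytes, int p, const Platform *pf)
{
    double comp = (double)n / p * pf->t_op;
    double comm = ceil(log2((double)p)) *
                  (pf->t_startup + bytes * pf->t_byte);
    return comp + comm;
}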
The runtime of a TwoL program may also be influenced by the choice of the data distribution for the input and output parameters of the modules. Therefore, the TwoL framework includes support for the automatic derivation of suitable data distributions. The derivation process uses a dynamic programming approach that determines the optimal distribution types by exploiting the hierarchical structure of the application (Rauber, Rünger, & Wilhelm, 1995). The core concepts of the TwoL model have been implemented in form of a compiler tool (Rauber, Reilein-Ruß, & Rünger, 2004b; Reilein-Ruß, 2005).
Figure 2. Illustration of the hierarchical structure of a TLib program. M-task M1 consecutively executes M2, M3 and M13; M3 concurrently executes M4 and M9; M4 executes M5 and M6 one after another, where M6 further subdivides the available processors to execute M7 and M8 in parallel; M9 consists of the sequential execution of M10, M11, and M12.
Figure 3. Illustration of a task graph representing an M-task application in the TwoL model (left) and a possible CM-task graph (right)
Programming Interface TLib
The runtime library TLib has been developed to support the programming with hierarchically structured M-tasks. TLib library functions are designed to be called in an SPMD manner which results in multi-level group-SPMD programs. The entire management of groups and M-tasks at execution time is done by the library. Thus, the TLib API provides support for:

a. The creation and administration of a dynamic hierarchy of processor groups;
b. The coordination and mapping of nested M-tasks to processor groups;
c. The handling and termination of recursive calls and group splittings;
d. The organization of communication between M-tasks.
Internally, the library uses distributed information stored in distributed descriptors which cannot be accessed directly by the user, thus hiding the complexity of the group management and the multi-level group-SPMD organization. This relieves the application programmer from implementing the technical details of hierarchical M-tasks and the corresponding group management and allows the programmer to concentrate on how to exploit the potential M-task structure of the given application. The current version of the library is based on C and is built on top of MPI. A TLib program consists of:

a. A set of basic functions expressing M-tasks that are executed in an SPMD style and that comprise the computations to be performed;
b. A set of coordination functions to control the execution of the basic functions.
The processors executing a basic M-task can exchange information with arbitrary MPI operations. The coordination functions allow a concurrent execution of basic M-tasks by the activation of suitable library functions. The coordination functions can be nested arbitrarily. Thus, a coordination function can assign other coordination functions to subgroups of processors for execution, which can then again split the corresponding subgroup and assign other coordination functions. A basic M-task function F is expressed as a function of the form

void *F(void *arg, MPI_Comm comm, T_Descr *pdescr)
where the parameter arg comprises the arguments used by F; the parameter comm specifies an MPI communicator which can be used for internal communication within the M-task F; pdescr is a reference to a TLib group descriptor containing information about the processor group onto which F is mapped. This descriptor can be used to dynamically split this processor group further in the body of F, if F exhibits an internal task parallel structure. F may also generate a recursive call of itself on a smaller subgroup, thus enabling the implementation of divide-and-conquer algorithms. The TLib library provides functions for initialization, splitting of groups into two or more subgroups, assignment of tasks to processor groups, and getting information on the subgroup structure. An example of a library function for splitting a processor group into two processor groups is,
int T_SplitGrp(T_Descr *pdescr, T_Descr *pdescr1, float per1, float per2)
where pdescr is a reference to the group descriptor of the original group and pdescr1 is a reference to a new group descriptor that is generated by the library function. The parameters per1 and per2 specify fractional values with per1 + per2 ≤ 1. The effect of the operation is a splitting into two disjoint processor groups with a percentage of per1 or per2 of the processors of the original group, respectively, as specified by the parameter pdescr. More splitting operations are provided, allowing, e.g., the splitting into an arbitrary number of processor groups.

After a splitting operation generating a number of subgroups, M-tasks can be assigned to the newly generated subgroups by corresponding mapping functions. An example of a mapping operation onto two processor groups is

int T_Par(void *(*F1)(void *, MPI_Comm, T_Descr *), void *parg1, void *pres1,
          void *(*F2)(void *, MPI_Comm, T_Descr *), void *parg2, void *pres2,
          T_Descr *pdescr)

where F1 and F2 describe the M-tasks to be mapped to the subgroups and to be executed concurrently by the subgroups; parg1 and parg2 are the parameters for F1 and F2, respectively; pres1 and pres2 are the results produced by F1 and F2, respectively; the subgroups are described by the parameter pdescr. More mapping operations are provided to assign M-tasks to an arbitrary number of subgroups.

Figure 2 shows an example of the emerging hierarchical structure of TLib programs. A detailed description of the TLib library is given in (Rauber & Rünger, 2005) along with example applications demonstrating the use of the library. The use of TLib for specifying efficient parallel implementations for matrix multiplication based on the Strassen algorithm has been described in (Hunold, Rauber, & Rünger, 2004). A data re-distribution library DRDLib to support re-distributions between cooperating M-tasks using TLib has been described in (Rauber & Rünger, 2006).
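Based on the signatures quoted above, a small TLib-style coordination fragment might look as follows. The header name tlib.h, the handling of the descriptors and the argument passing (NULL pointers) are simplifying assumptions; the sketch only illustrates how T_SplitGrp and T_Par could combine to run two M-tasks concurrently on disjoint subgroups.

#include <mpi.h>
#include "tlib.h"   /* assumed header providing T_Descr, T_SplitGrp, T_Par */

/* Two basic M-tasks following the TLib function signature; their bodies
 * (only indicated here) may use arbitrary MPI operations on comm. */
void *solve_left(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
    /* ... SPMD computation of the first subproblem on comm ... */
    return NULL;
}

void *solve_right(void *arg, MPI_Comm comm, T_Descr *pdescr)
{
    /* ... SPMD computation of the second subproblem on comm ... */
    return NULL;
}

/* Coordination code executed by all processes of the group described
 * by pdescr: split the group 50:50 and run both M-tasks concurrently. */
void run_both(T_Descr *pdescr)
{
    T_Descr subgroups;
    T_SplitGrp(pdescr, &subgroups, 0.5f, 0.5f);
    T_Par(solve_left,  NULL, NULL,
          solve_right, NULL, NULL,
          &subgroups);
}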
EXTENDED PROGRAMMING MODEL
In the TwoL programming model two modules M1 and M2 can only communicate in between their executions, i.e. the module supplying the data has to finish its execution before the module consuming the data is started. For applications that require periodic data exchanges between program parts, it might also be beneficial to allow further data exchanges during the execution of concurrently executed modules. Examples for such applications are time-stepping methods that perform a data exchange at the end of each time step. An implementation using the TwoL programming model restricts the modules to the execution of a single time step and, thus, limits the possible granularity of the modules. A more natural way to structure such applications is to combine multiple time steps within a module and to
support communication between running modules to perform the required data exchanges. In the following, we present an extended programming model that follows this idea and discuss programming support for the development of efficient parallel implementations in this model.
CM-Task Programming Model
The programming model of communicating multiprocessor-tasks (CM-tasks) (Dümmler, Rauber, & Rünger, 2007) extends the TwoL model by providing support for communication between concurrently executed modules. CM-tasks are parallel modules which have a set of input and output parameters and support the execution on an arbitrary number of processors. The interactions between CM-tasks are expressed by P-relations and C-relations:

• Precedence relations (P-relations) capture the input/output dependencies between CM-tasks. A P-relation between CM-tasks A and B denotes that A produces output data required as an input for B and might lead to a data re-distribution operation between A and B when A and B are executed on different subsets of the processors or if A provides its output data in a different distribution than expected by B. These are the dependencies captured in the original TwoL model.
• A communication relation (C-relation) between CM-tasks A and B denotes that A and B have to communicate with each other during their execution. This is an extension of the TwoL model since modules can now communicate during their execution, if they are connected with a C-relation.
The structure of a CM-task program can be represented by a CM-task graph G = (V, E) where the set of nodes V corresponds to the set of CM-tasks. The set of edges E = Ep ∪ Ec consists of the set of directed edges Ep representing P-relations and the set of bidirectional edges Ec symbolizing C-relations. Figure 3 (right) shows an illustration of a CM-task graph. The possible execution orders of the CM-tasks are limited by the P-relations and C-relations. A P-relation connecting CM-tasks A and B requires that the execution of A must have been finished and all required data re-distributions between A and B must have been carried out before B can be started. CM-tasks connected by a C-relation must be executed concurrently by disjoint subsets of the processors to perform the specified data exchanges. Therefore, there cannot be both a P-relation and a C-relation between CM-tasks A and B and hence Ep ∩ Ec = ∅.

Examples for CM-task programs are iterated Runge-Kutta (IRK) methods (van der Houwen & Sommeijer, 1991; Rauber & Rünger, 1999a) and Parallel Adams methods (van der Houwen & Messina, 1999) that are time-stepping methods for the solution of initial value problems of non-stiff ODEs. Due to data dependencies between successive time steps, each time step of these applications has to be computed by a separate set of modules in the TwoL model. Using the CM-task model, successive time steps can be combined within a single set of CM-tasks and data dependencies between time steps are modeled by C-relations. This enables the CM-task version to exploit optimized communication patterns. Examples are the orthogonal arrangement of the processes (cf. Figure 1) and the use of concurrent multi-broadcast operations to realize the data exchanges between the CM-tasks. Benchmark results comparing a pure M-task based implementation with a CM-task version are presented at the end of this chapter.
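A CM-task graph could be represented in memory as two edge lists plus a consistency check that no pair of tasks is connected by both a P-relation and a C-relation. The following C sketch is an illustrative assumption about one possible representation, not the data structure used by the framework described below.

#include <stdbool.h>

/* Hypothetical in-memory representation of a CM-task graph. */
typedef struct {
    int src, dst;             /* directed P-relation: src precedes dst    */
} PEdge;

typedef struct {
    int a, b;                 /* bidirectional C-relation between a and b */
} CEdge;

typedef struct {
    int    num_tasks;
    PEdge *p_edges; int num_p;
    CEdge *c_edges; int num_c;
} CMTaskGraph;

/* Consistency check: a pair of CM-tasks must not be connected by both a
 * P-relation and a C-relation (Ep and Ec are disjoint). */
bool graph_is_consistent(const CMTaskGraph *g)
{
    for (int i = 0; i < g->num_p; i++)
        for (int j = 0; j < g->num_c; j++) {
            int s = g->p_edges[i].src, d = g->p_edges[i].dst;
            int a = g->c_edges[j].a,   b = g->c_edges[j].b;
            if ((s == a && d == b) || (s == b && d == a))
                return false;
        }
    return true;
}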
Figure 4. Overview of the incremental transformation steps used to create an executable CM-task coordination program
Development of CM-Task Programs
Support for the development of CM-task programs has been proposed in form of a compiler framework (Dümmler, Rauber, & Rünger, 2008a). The framework consists of several consecutive transformation steps and supports the incremental creation of an executable coordination program from an initial specification of the structure of the CM-task application using the non-executable, platform-independent specification language. Each transformation step of the framework adds additional information resulting in an augmented specification. The specification language supports the definition of basic CM-tasks whose implementation has to be provided by the application developer and composed CM-tasks whose structure is visible to the framework. Composed CM-tasks consist of CM-task activations and control constructs guiding the control flow, i.e. conditional execution (if-statement), the repeated execution with data dependencies (while-loop, for-loop) and without data dependencies between loop iterations (parfor-loop).

The transformation process is depicted in Figure 4 and consists of four consecutive steps. The first step, the Dataflow Analyzer, takes the initial specification of a parallel algorithm as an input. The data dependencies in this specification program are defined implicitly using input/output parameter lists and variable names. The Dataflow Analyzer is responsible for uncovering these dependencies and inserting the appropriate P-relations and C-relations.

The successive transformation step, the Scheduler, requires additional information about the target platform that is provided in form of a Machine Description. This input file specifies the number of available processors and contains approximations of the computational power, i.e. the average time required to execute an arithmetic operation, and the communication performance, i.e. the startup and byte-transfer
time of the interconnection network. The output of the Scheduler is a platform-dependent specification program with annotations that define the execution order and the executing processor groups for each CM-task invocation. The Data Manager inspects all P-relations and uses the computed schedule for the CM-task program to decide which data re-distribution operations are required for a correct execution. The Code Generator creates the final message passing program in the target language. The created coordination program consists of:

• The execution of CM-tasks on the processor groups computed by the scheduler; basic CM-tasks are provided by the user in form of a library and the coordination code for composed CM-tasks is created by the framework;
• The execution of data re-distribution operations between CM-tasks; the communication pattern is statically computed by the framework and included in the coordination program;
• Coordination constructs (loops, conditions) according to the input specification;
• Processor group management code that creates the correct MPI communicators for communication between concurrently executed CM-tasks as specified by the C-relations and for the execution of CM-tasks and data re-distribution operations.
A prototype realization of the transformation framework as a compiler tool for the C target language is available.
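The Machine Description consumed by the Scheduler step can be imagined as a handful of platform parameters. The following C struct and the example values are hypothetical and only indicate the kind of information involved.

/* Hypothetical machine description as used by the Scheduler step:
 * the number of processors plus coarse performance parameters of the
 * computation and the interconnection network. */
typedef struct {
    int    num_processors;
    double time_per_op;      /* avg. time of an arithmetic operation [s] */
    double startup_time;     /* network startup time [s]                 */
    double byte_transfer;    /* transfer time per byte [s]               */
} MachineDescription;

/* Example values for a hypothetical cluster. */
static const MachineDescription example_machine = {
    .num_processors = 256,
    .time_per_op    = 1.0e-9,
    .startup_time   = 5.0e-6,
    .byte_transfer  = 1.0e-9
};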
SCHEDULING AND MAPPING
A suitable schedule is crucial to obtain the maximum performance of an M-task application. For homogeneous platforms, the schedule defines the execution order of the M-tasks and the number of executing processors. Unfortunately, the M-task scheduling problem of determining the optimal schedule that leads to the lowest execution time is NP-complete. Therefore, several research groups have proposed scheduling heuristics and approximation algorithms to automatically determine a good schedule. Examples are TwoL-Level (Rauber & Rünger, 1998), TwoL-Tree (Rauber & Rünger, 1999b), CPR (Radulescu, Nicolescu, van Gemund, & Jonker, 2001), CPA (Radulescu & van Gemund, 2001), iCASLB (Vydyanathan, Krishnamoorthy, Sabin, Çatalyürek, Kurç, & Sadayappan et al., 2006a) and Loc-MPS (Vydyanathan, Krishnamoorthy, Sabin, Çatalyürek, Kurç, & Sadayappan et al., 2006b); see (Dümmler, Kunis, & Rünger, 2007b) for a comparison.

For heterogeneous platforms, the schedule additionally has to define the mapping of M-tasks onto specific processors. Scheduling heuristics for large heterogeneous cluster-of-cluster platforms restrict the execution of an M-task to a single homogeneous subcluster, but each subcluster is allowed to execute multiple M-tasks concurrently. Examples are M-HEFT (Suter, Desprez, & Casanova, 2004) and HCPA (N’takpé & Suter, 2006); see (N’takpé, Suter, & Casanova, 2007) for a comparison.

Multi-core and SMP clusters are heterogeneous platforms that are built up of a set of homogeneous processing cores interconnected by a heterogeneous network. For these systems, the scheduling can be performed by two consecutive steps (Dümmler, Rauber, & Rünger, 2008b):
1. Scheduling the M-task graph describing the application on a set of homogeneous symbolic cores whose computing performance is equal to that of the physical cores of the target platform but for which a homogeneous interconnection network is assumed, and
2. Mapping the symbolic cores used for the scheduling decisions onto the physical cores.
Figure 5. Illustration of a tree structure representing the architecture of a multi-core SMP cluster (left) and the use of the Dewey notation to label the computing elements (right).
The scheduling step is similar to homogeneous scheduling algorithms. In the following, we concentrate on the mapping step that has to define a mapping for each point in time, i.e. there has to be an assignment of the symbolic cores of the currently executing M-tasks to the physical cores of the architecture. For each M-task, the mapping is fixed, i.e. during its lifespan an M-task is executed by the same set of physical cores.

The multi-core target architecture is represented in a tree structure with cores C as leaves, processors P as intermediate nodes that combine cores, computing nodes N as intermediate nodes that combine processors, and the entire architecture A as a root node. The levels of the tree correspond to different interconnection networks. For a unique identification of the physical cores within the tree structure we use the Dewey notation (Knuth, 1975). Each node n gets a label l(n) that describes the path from the root node to the specific node. The root node r gets label l(r) = 0. The label l(n) of a node n consists of the label of the parent node m concatenated with the digit i, if n is child i of m, i.e. l(n) = l(m).i. Figure 5 illustrates the tree structure of the architecture and the use of the Dewey notation to describe multi-core clusters.

For the definitions of the mappings, we consider the situation that g independent M-tasks should be executed concurrently and the scheduling step has assigned the group of symbolic cores Gi to M-task Mi, i = 1,…,g, with Gi ∩ Gj = ∅ for i ≠ j. The number of symbolic cores in group Gi is denoted as gi and has been determined in the scheduling step. The mapping is a function F : {G1,…,Gg} → 2^C where C denotes the set of physical cores. F maps the groups of symbolic cores to disjoint physical cores, i.e. F(Gi) ∩ F(Gj) = ∅ for i ≠ j, and each symbolic group is mapped on a physical group of the same size, i.e. |F(Gi)| = |Gi|. For each proposed mapping we define a sequence of physical cores s1, s2,…,sm with m = p*c*n
assuming an architecture with c cores per processor, p processors per node and n total nodes. The mapping function F assigns the symbolic cores of a group Gi, i = 1,…,g to consecutive physical cores in
Figure 6. Illustration of a consecutive mapping (left) and a scattered mapping (right) for M-tasks {M1, M2, M3, M4} each requiring 4 symbolic cores on a platform with 4 nodes consisting of 2 dual-core processors.
this sequence, i.e. F(Gi) = {sj, sj+1, …, sj+gi−1} with j = 1 + g1 + g2 + … + gi−1.

The Node-Oriented Consecutive Mapping tries to map the symbolic cores of an M-task onto the same cluster node. If an M-task does not fit on a single node of the architecture, multiple nodes are used. Figure 6 (left) shows an illustration of the consecutive mapping. The advantage of this mapping strategy is to enable shared memory optimizations for the implementation of the M-tasks, e.g. to speed up the internal communication by using optimized MPI libraries or to use a shared memory or hybrid programming model. In this mapping, the physical cores are ordered such that cores of the same node are adjacent, i.e. the sequence of physical cores is
1.1.1, 1.1.2, …, 1.1.c, 1.2.1, …, 1.p.c, 2.1.1, …, 2.1.c, …, n.p.c.

The Scattered Core-Level Mapping assigns corresponding symbolic cores of different M-tasks onto the same cluster node. If the number of cores of the architecture exceeds the number of independent M-tasks, multiple symbolic cores of each M-task are mapped on the same node. The scattered mapping is illustrated in Figure 6 (right). This mapping strategy ensures an equal participation of all nodes in the internal communication of the M-tasks and can speed up data exchanges between M-tasks, especially in the case that only corresponding symbolic cores communicate with each other. In the sequence of physical cores the corresponding cores of neighboring nodes are adjacent, i.e. the sequence is given by
1.1.1, 2.1.1, …, n.1.1, 1.1.2, 2.1.2, …, n.1.c, 1.2.1, …, n.p.c.

The Mixed Core-Level Mapping is a generalization of the consecutive and the scattered mappings. A parameter d, 1 ≤ d ≤ p*c, describes the number of consecutive symbolic cores of an M-task that are mapped to the same cluster node. For d = 1 the scattered mapping results and setting d = p*c results in
the consecutive mapping. This mapping can be used to adapt to the ratio of communication within M-tasks and data exchanges between M-tasks. The sequence of physical cores is given by
1.1.1, …, 1.(1 + ⌊(d−1)/c⌋).(1 + ((d−1) mod c)), 2.1.1, …, n.(1 + ⌊(d−1)/c⌋).(1 + ((d−1) mod c)), …, 1.(1 + ⌊(2d−1)/c⌋).(1 + ((2d−1) mod c)), …, n.p.c.
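For clarity, the orderings can also be written down as small index calculations. The following C sketch computes the Dewey label of the k-th core (counting from 0) in the consecutive and the scattered sequence for an architecture with n nodes, p processors per node and c cores per processor; it is an illustration of the definitions above, not code from the cited framework.

#include <stdio.h>

/* Dewey label node.proc.core of a physical core, 1-based. */
typedef struct { int node, proc, core; } Dewey;

/* Position k (0-based) in the consecutive ordering
 * 1.1.1, 1.1.2, ..., 1.p.c, 2.1.1, ..., n.p.c */
Dewey consecutive(int k, int n, int p, int c)
{
    Dewey d;
    int w  = k % (p * c);          /* index within the node */
    (void)n;                       /* n is implied by the range of k */
    d.node = k / (p * c) + 1;
    d.proc = w / c + 1;
    d.core = w % c + 1;
    return d;
}

/* Position k (0-based) in the scattered ordering
 * 1.1.1, 2.1.1, ..., n.1.1, 1.1.2, ..., n.p.c */
Dewey scattered(int k, int n, int p, int c)
{
    Dewey d;
    int w  = k / n;                /* how many cores per node are already used */
    d.node = k % n + 1;
    d.proc = w / c + 1;
    d.core = w % c + 1;
    return d;
}

int main(void)
{
    /* 4 nodes with 2 dual-core processors, as in Figure 6 */
    for (int k = 0; k < 8; k++) {
        Dewey a = consecutive(k, 4, 2, 2), b = scattered(k, 4, 2, 2);
        printf("k=%d  consecutive %d.%d.%d  scattered %d.%d.%d\n",
               k, a.node, a.proc, a.core, b.node, b.proc, b.core);
    }
    return 0;
}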
A suitable compiler tool can be used to integrate the mapping strategies in the code generation process. A realization using the MPI library can adapt the order of the processes within the appropriate communicators.
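One way an MPI-based realization might impose such an ordering is via the key argument of MPI_Comm_split: each process computes its position in the chosen sequence of physical cores and compares it with the block of that sequence assigned to its M-task by the scheduler. The sketch below is an assumption about one possible realization, not the code generated by the framework.

#include <mpi.h>

/* Build the communicator for M-task `task` under a given mapping.
 * `position` is this process's index in the chosen sequence of physical
 * cores (consecutive, scattered or mixed); `first` and `size` delimit
 * the block of the sequence assigned to the task by the scheduler. */
MPI_Comm build_task_comm(int task, int position, int first, int size,
                         MPI_Comm comm)
{
    MPI_Comm task_comm;
    int color = (position >= first && position < first + size)
                ? task : MPI_UNDEFINED;
    /* Using the position as key orders the ranks inside the new
     * communicator according to the mapping sequence. */
    MPI_Comm_split(comm, color, position, &task_comm);
    return task_comm;   /* MPI_COMM_NULL on processes outside the task */
}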
EXAMPLE AND EXPERIMENTS
As example applications for an M-task based execution model we consider numerical codes. The first examples are solution methods for non-stiff ordinary differential equations (ODEs). We consider iterated Runge-Kutta (IRK) methods (van der Houwen & Sommeijer, 1991; Rauber & Rünger, 1999a) and Parallel Adams methods (van der Houwen & Messina, 1999) that have been developed for a parallel execution. The IRK method computes s stage vectors using m fixed point iteration steps with an implicit Runge-Kutta corrector. In the M-task programming model, each fixed point iteration step can be represented by s independent M-tasks but the M-tasks of successive steps cannot be combined due to data dependencies. The extended CM-task programming model provides another possibility to structure IRK methods. The computation of each stage vector is accomplished by a single CM-task that computes all fixed point iteration steps and performs the required data exchanges during its execution.

The Parallel Adams methods include the explicit Parallel Adams-Bashforth (PAB) and the implicit Parallel Adams-Moulton (PAM) methods. The combination of the PAB and PAM methods in a predictor-corrector scheme results in an implicit ODE solver (PABM). Each time step of the PABM method involves the computation of K independent stage vectors each requiring group-based communication. At the end of a time step, a global data exchange is required to compute the next approximation vector. The M-task version of PABM uses K independent M-tasks each computing one of the stage vectors, but each time step requires a separate set of M-tasks. The extended CM-task model enables the adoption of a single set of CM-tasks that keeps running over all time steps.

The second example comes from the area of solution methods for partial differential equations (PDEs) that operate on a set of meshes (also called zones). Within each time step, the computation of the solution is performed independently for each zone. At the end of a time step, a border exchange between overlapping zones is required. The NAS parallel benchmark multi-zone version (NPB-MZ) provides solvers for discretized versions of the unsteady, compressible Navier-Stokes equations that operate on multiple zones (van der Wijngaart & Jin, 2003). The fine grain parallelism within the zones is exploited using shared memory OpenMP programming; the coarse grain parallelism between the zones is realized using message passing with MPI. For the purpose of this article we consider a modified version of the Lower-Upper Symmetric Gauss-Seidel multi-zone (LU-MZ) benchmark which uses MPI for both levels of parallelism. This has the advantage that multiple nodes of a distributed memory platform can
Figure 7. Comparison of the execution times of a single time step of the IRK method using different execution schemes on the Xeon cluster (left) and on the CHiC cluster (right)
operate on the same zone.

For the benchmark tests, a variety of multi-core platforms is used. The benchmarks on the SGI Altix are executed within a partition consisting of 128 nodes each equipped with two dual-core Intel Itanium Montecito processors running at 1.6 GHz. The interconnection network is a high-speed NUMAlink 4 that offers a bidirectional bandwidth of 6.4 GByte/s per link. The Intel quad-core Xeon cluster consists of two nodes each equipped with two Intel Xeon E5345 “Clovertown” quad-core processors clocked at 2.33 GHz. An InfiniBand network with 10 GBit/s connects the nodes. The CHiC cluster includes 530 nodes consisting of two AMD Opteron 2218 dual-core processors with a clock rate of 2.6 GHz connected by a 10 GBit/s InfiniBand network.

Figure 8. Speedups of different execution schemes for the IRK method on the SGI Altix (left) and for the PABM method on the CHiC cluster (right)
Figure 9. Comparison of the execution times of a single time step of the PABM method using different mapping strategies on SGI Altix (left) and on the quad-core Xeon cluster (right)
First, we compare a standard data parallel version with task parallel versions based on the M-task and CM-task models. The CM-task version of the IRK and PABM methods utilizes an optimized communication scheme based on an orthogonal arrangement of the processes (cf. Figure 1). The execution times of a single time step of the IRK method using the RadauIIA7 method with s = 4 stage vectors are compared for the sparse Brusselator system on 16 processor cores of the Xeon cluster in Figure 7 (left). The M-task version is not competitive due to its large communication overhead.
Figure 10. Performance of the LU-MZ benchmark for problem classes ‘C’ and ‘D’ on CHiC (left) and SGI Altix (right)
This overhead can be reduced significantly by the CM-task version, leading to much lower runtimes. Figure 7 (right) shows the execution times of the IRK method using 960 processor cores of the CHiC cluster and the dense Schrödinger equation. Communication is less important for dense systems and therefore the differences between the program versions are smaller. Again, the lowest execution times are achieved by the CM-task program version. Figure 8 compares the achieved speedups for the IRK method on the SGI Altix (left) and for the PABM method with K = 8 stage vectors on the CHiC cluster (right). In both cases the CM-task version achieves a superior performance compared to M-task based task parallelism and pure data parallelism.

Figure 9 shows the execution times of a single time step of the PABM method on the SGI Altix using 256 processor cores (left) and on the quad-core Xeon cluster using 16 processor cores (right). On both systems, task parallelism leads to better execution times because the communication within M-tasks can be restricted to subgroups of cores. The best mapping strategy is the scattered mapping because the data exchanges between M-tasks at the end of each time step can be executed within a cluster node.

The performance of the LU-MZ benchmark is depicted in Figure 10 for the CHiC cluster (left) and for the SGI Altix (right). Problem classes ‘C’ with a global mesh size of 480 × 320 × 28 and ‘D’ with a global mesh size of 1632 × 1216 × 34 are used. The data parallel version of class ‘C’ can only be executed on up to 448 cores because a minimum amount of data is required for each process. For a low number of cores, pure data parallelism leads to better results because a data exchange between zones is not required. But on a high number of cores the communication within the zones becomes more important because the amount of data and, thus, the amount of computation assigned to each process becomes smaller. The node-oriented consecutive mapping leads to the best performance on both platforms. For class ‘D’ the computation to communication ratio is much higher, leading to smaller differences between the program versions.

The IRK, PABM and LU-MZ benchmarks show that a mixed task and data parallel execution scheme can outperform pure data parallelism on a variety of platforms. Additional optimizations of the communication pattern, as they are possible with the extended CM-task model, lead to a further increase of the performance. Additionally, it has been shown that different mapping strategies can lead to significant differences in the performance on multi-core clusters. The best mapping strategy mainly depends on the communication requirements of the application, but also the communication performance of the interconnection networks of the architecture needs to be taken into account.
CONCLUSION
Mixed task and data parallel execution schemes are a flexible method to exploit the computing power of up-to-date distributed memory platforms. Program development in these mixed programming models is more complex and error-prone compared to pure task or data parallel models because the organization of the processor groups and the execution of data re-distribution operations additionally have to be taken into account. Moreover, the optimal assignment of the tasks of an application to processors may depend on the target platform and therefore a complex restructuring of the application might be required when porting to another platform. Therefore, a variety of programming support to assist the application developer is available. In this article, we have discussed several of these approaches. In particular, we have considered the runtime library TLib and the coordination model TwoL. TLib supports the structuring of application programs using hierarchically organized multiprocessor tasks.
The library provides an easy-to-use interface and relieves the programmer from the processor group management. The TwoL model includes a specification language for the definition of hierarchical multiprocessor task programs. Several transformation steps are available to transform a specification of a parallel algorithm into an executable message passing program. The transformation is guided by an underlying cost model. Additionally, we have discussed the model of communicating multiprocessor tasks, which is a natural extension of existing models and supports communication between running tasks. The advantage of this model is that it enables special communication patterns like orthogonal communication. The benefits of this programming model were demonstrated using solution methods for ordinary differential equations. Programming support for communicating multiprocessor tasks has been presented in form of a transformation-based compiler framework. Finally, we have presented several mapping strategies to adapt multiprocessor task applications to the hierarchical structure of recent multi-core SMP clusters. The proposed mapping strategies have been applied to example codes from numerical analysis. It was shown that the optimal mapping strategy depends on the ratio of communication within multiprocessor tasks and between tasks.
REFERENCES Aldinucci, M., Danelutto, M., & Teti, P. (2003). An advanced environment supporting structured parallel programming in Java. Future Generation Computer Systems, 19(5), 611–626. doi:10.1016/S0167739X(02)00172-3 Allen, E., Chase, D., Hallett, J., Luchangco, V., Maessen, J.-W., Ryo, S., et al. (2008). The Fortress language specification, Version 1.0. Santa Clara, CA: Sun Microsystems, Inc. Bal, H. E., & Haines, M. (1998). Approaches for integrating task and data parallelism. IEEE Concurrency, 6(3), 74–84. doi:10.1109/4434.708258 Ben Hassen, S., Bal, H. E., & Jacobs, C. J. H. (1998). A task- and data-parallel programming language based on shared objects. [TOPLAS]. ACM Transactions on Programming Languages and Systems, 20(6), 1131–1170. doi:10.1145/295656.295658 Brandes, T. (1999). Exploiting advanced task parallelism in high performance Fortran via a task library. In Euro-Par ‘99: Proceedings of the 5th International Euro-Par Conference on Parallel Processing (pp. 833–844). London: Springer-Verlag. Chamberlain, B. L., Callahan, D., & Zima, H. P. (2007). Parallel programmability and the chapel language. International Journal of High Performance Computing Applications, 21(3), 291–312. doi:10.1177/1094342007078442 Chandy, M., Foster, I., Kennedy, K., Koelbel, C., & Tseng, C.-W. (1994). Integrated support for task and data parallelism. The International Journal of Supercomputer Applications, 8(2), 80–98. Chapman, B., Haines, M., Mehrota, P., Zima, H., & van Rosendale, J. (1997). Opus: A coordination language for multidisciplinary applications. Science Progress, 6(4), 345–362.
Chapman, B. M., Mehrotra, P., van Rosendale, J., & Zima, H. P. (1994). A software architecture for multidisciplinary applications: integrating task and data parallelism. In CONPAR 94 - VAPP VI: Proceedings of the Third Joint International Conference on Vector and Parallel Processing (pp. 664–676). London: Springer-Verlag. Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., et al. (2005). X10: An object-oriented approach to non-uniform cluster computing. In OOPSLA ’05 Proceedings of the 20th annual ACM SIGPLAN Conference on Object Oriented Programming, Systems, Languages, and Applications (pp. 519–538). New York: ACM. Ciarpaglini, S., Folchi, L., Orlando, S., Pelagatti, S., & Perego, R. (2000). Integrating task and data parallelism with taskHPF. In H. R. Arabnia (Ed.). Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 2000. Las Vegas, NV: CSREA Press. Diaz, M., Rubio, B., Soler, E., & Troya, J. M. (2002). A border-based coordination language for integrating task and data parallelism. Journal of Parallel and Distributed Computing, 62(4), 715–740. doi:10.1006/jpdc.2001.1814 Diaz, M., Rubio, B., Soler, E., & Troya, J. M. (2003). Domain interaction patterns to coordinate HPF tasks. Parallel Computing, 29(7), 925–951. doi:10.1016/S0167-8191(03)00064-4 Diaz, M., Rubio, B., Soler, E., & Troya, J. M. (2004). SBASCO: Skeleton-based scientific components. In Proceedings of the 12th Euromicro Workshop on Parallel, Distributed and Network-Based Processing (PDP 2004) (pp. 318–325). Washington, DC: IEEE Computer Society. Dorta, A. J., González, J. A., Rodriguez, C., & de Sande, F. (2003). LLC: A parallel skeletal language. Parallel Processing Letters, 13(3), 437–448. doi:10.1142/S0129626403001409 Dorta, A. J., López, P., & de Sande, F. (2006). Basic skeletons in LLC. Parallel Computing, 32(7-8), 491–506. doi:10.1016/j.parco.2006.07.001 Dümmler, J., Kunis, R., & Rünger, G. (2007a). A scheduling toolkit for multiprocessor-task programming with dependencies. In Proceedings of the 13th International Euro-Par Conference (pp. 23–32). Berlin: Springer. Dümmler, J., Kunis, R., & Rünger, G. (2007b). A comparison of scheduling algorithms for multiprocessortasks with precedence constraints. In Proceedings of the 2007 High Performance Computing & Simulation (HPCS’07) Conference (pp. 663–669). ECMS. Dümmler, J., Rauber, T., & Rünger, G. (2007). Communicating multiprocessor-tasks. In Proceedings of the 20th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2007). Berlin: Springer. Dümmler, J., Rauber, T., & Rünger, G. (2008a). A transformation framework for communicating multiprocessor-tasks. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2008) (pp. 64–71). New York: IEEE Computer Society.
Dümmler, J., Rauber, T., & Rünger, G. (2008b). Mapping algorithms for multiprocessor tasks on multicore clusters. In Proceedings of the 37th International Conference on Parallel Processing (ICPP08). New York: IEEE Computer Society. Fink, S. J. (1998). A programming model for block-structured scientific calculations on smp clusters. Doctoral thesis, University of California, San Diego, CA. Foster, I., Kohr, D. R., Krishnaiyer, R., & Choudhary, A. (1996). Double standards: Bringing task parallelism to HPF via the message passing interface. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (pp. 36-36). New York: IEEE Computer Society. Foster, I. T., & Chandy, K. M. (1995). Fortran M: A language for modular parallel programming. Journal of Parallel and Distributed Computing, 26(1), 24–35. doi:10.1006/jpdc.1995.1044 Fox, G., Hiranandani, S., Kennedy, K., Koelbel, C., Kremer, U., Tseng, C.-W., et al. (1990). Fortran D Language Specification (No. CRPC-TR90079), Houston, TX. Grelck, C., Scholz, S.-B., & Shafarenko, A. V. (2007). Coordinating data parallel SAC programs with S-Net. In Proceedings of the 21th International Parallel and Distributed Processing Symposium (IPDPS 2007) (pp. 1–8). New York: IEEE. High Performance Fortran Forum. (1993). High performance Fortran language specification, version 1.0 (No. CRPC-TR92225). Center for Research on Parallel Computation, Rice University, Houston, TX. High Performance Fortran Forum. (1997). High performance Fortran language specification 2.0. Center for Research on Parallel Computation, Rice University, Houston, TX. Hunold, S., Rauber, T., & Rünger, G. (2004). Multilevel hierarchical matrix-matrix multiplication on clusters. In Proceedings of the 18th International Conference of Supercomputing (ICS’04) (pp. 136–145). New York: ACM. Hunold, S., Rauber, T., & Rünger, G. (2008). Combining building blocks for parallel multi-level matrix multiplication. Parallel Computing, 34(6-8), 411–426. doi:10.1016/j.parco.2008.03.003 Joisha, P. G., & Banerjee, P. (1999). PARADIGM (version 2.0): A new HPF compilation system. In IPPS ’99/SPDP ’99: Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing (pp. 609–615). Washington, DC: IEEE Computer Society. Kessler, C. W., & Löwe, W. (2007). A framework for performance-aware composition of explicitly parallel components. In [Jülich/Aachen, Germany: IOS Press.]. Proceedings of the International Conference ParCo, 2007, 227–234. Knuth, D. E. (1975). The art of computer programming. Volume 1: Fundamental Algorithms. Reading, MA: Addison Wesley. Kühnemann, M., Rauber, T., & Rünger, G. (2004). A source code analyzer for performance prediction. In Proceedings of IPDPS’04 Workshop on Massively Parallel Processing (WMPP’04. New York: IEEE.
Laure, E. (2001). OpusJava: A Java framework for distributed high performance computing. Future Generation Computer Systems, 18(2), 235–251. doi:10.1016/S0167-739X(00)00094-7 Laure, E., Mehrotra, P., & Zima, H. P. (1999). Opus: Heterogeneous computing with data parallel tasks. Parallel Processing Letters, 9(2). doi:10.1142/S0129626499000256 Merlin, J. H., Baden, S. B., Fink, S., & Chapman, B. M. (1999). Multiple data parallelism with HPF and KeLP. Future Generation Computer Systems, 15(3), 393–405. doi:10.1016/S0167-739X(98)00083-1 N’takpé. T., & Suter, F. (2006). Critical path and area based scheduling of parallel task graphs on heterogeneous platforms. In Proceedings of the Twelfth International Conference on Parallel and Distributed Systems (ICPADS) (pp. 3–10), Minneapolis, MN. N’takpé. T., Suter, F., & Casanova, H. (2007). A comparison of scheduling approaches for mixed-parallel applications on heterogeneous platforms. In 6th International Symposium on Parallel and Distributed Computing (pp. 35–42). Hagenberg, Austria: IEEE Computer Press. Orlando, S., Palmerini, P., & Perego, R. (2000). Coordinating HPF programs to mix task and data parallelism. In Proceedings of the 2000 ACM Symposium on Applied Computing (SAC’00) (pp. 240–247). New York: ACM Press. Orlando, S., & Perego, R. (1999). COLTHPF, A run-time support for the high-level co-ordination of HPF tasks. Concurrency (Chichester, England), 11(8), 407–434. doi:10.1002/(SICI)10969128(199907)11:8<407::AID-CPE435>3.0.CO;2-0 Pelagatti, S. (2003). Task and Data Parallelism in P3L. In F. A. Rabhi & S. Gorlatch (Eds.), Patterns and skeletons for parallel and distributed computing (pp.155–186). London: Springer-Verlag. Pelagatti, S., & Skillicorn, D. B. (2001). Coordinating programs in the network of tasks model. Journal of Systems Integration, 10(2), 107–126. doi:10.1023/A:1011228808844 Radulescu, A., Nicolescu, C., van Gemund, A. J. C., & Jonker, P. (2001). CPR: Mixed task and data parallel scheduling for distributed systems. In Proceedings of the 15th International Parallel and Distributed Processing Symposium (IPDPS’01) (pp. 39-46). New York: IEEE Computer Society. Radulescu, A., & van Gemund, A. J. C. (2001). A low-cost approach towards mixed task and data parallel scheduling. In Proceedings of the International Conference on Parallel Processing (ICPP’01)(pp. 69–76). New York: IEEE Computer Society. Ramaswamy, S. (1996). Simultaneous exploitation of task and data parallelism in regular scientific computations. Doctoral thesis, University of Illinois at Urbana-Champaign. Ramaswamy, S., Sapatnekar, S., & Banerjee, P. (1997). A framework for exploiting task and data parallelism on distributed memory multicomputers. IEEE Transactions on Parallel and Distributed Systems, 8(11), 1098–1116. doi:10.1109/71.642945 Ramaswamy, S., Simons, B., & Banerjee, P. (1996). Optimizations for efficient array redistribution on distributed memory multicomputers. Journal of Parallel and Distributed Computing, 38(2), 217–228. doi:10.1006/jpdc.1996.0142
KEY TERMS AND DEFINITIONS CM-Task: A CM-task is an extension of an M-task that additionally supports data exchanges with other CM-tasks during its execution.
Data Parallelism: Data parallel computations apply the same operation in parallel on different elements of the same set of data. M-Task: An M-task is a parallel program fragment that operates on a set of input parameters and produces a set of output parameters. The implementation of an M-task supports an execution on an arbitrary number of processors. Mapping: Mapping assigns specific physical processing units, e.g., specific cores of a multi-core SMP cluster, to tasks of an application. Mixed Parallelism: Mixed parallelism is a combination of task and data parallelism that supports the concurrent execution of independent data parallel tasks each operating on a different set of data. Scheduling: Scheduling defines an execution order of the tasks of an application and an assignment of tasks to processing units. Scheduling for mixed parallel applications additionally has to fix the number of executing processors for each data parallel task. Task Parallelism: Task parallel computations consist of a set of different tasks that operate independently on different sets of data.
Chapter 12
Programmability and Scalability on Multi-Core Architectures
Jaeyoung Yi, Yonsei University, Seoul, Korea
Yong J. Jang, Yonsei University, Seoul, Korea
Doohwan Oh, Yonsei University, Seoul, Korea
Won W. Ro, Yonsei University, Seoul, Korea
ABSTRACT
In this chapter, we describe current technological trends in building multi-core microprocessors and the programmability and scalability issues they raise. Since multi-core processors were first commercialized, many different designs have appeared. However, the question of how to exploit the physical parallelism of the cores during software execution has not yet been adequately addressed. Compared with placing multiple identical cores on a single chip, splitting an originally sequential program into multiple concurrently running threads has proved to be the more challenging task. We introduce several software applications that can be ported successfully to future multi-core processors and describe how they can benefit from multi-core systems. Towards the end, future trends in multi-core systems are surveyed.
INTRODUCTION
Intel shipped its first dual-core processor as early as 2005, and many major processor vendors have developed dual-core or quad-core processors since then. We are now entering the era of multi-core processors, and practically every field of computer science and computer engineering will be affected by this movement. Although computing power has improved dramatically with higher clock frequencies and techniques such as superscalar execution, superpipelining, and VLIW (Very Long Instruction Word),
it seems that this progress will slow down significantly, and we will have to find other solutions to maintain the rate of improvement we enjoy today. There are three main reasons for the slowdown in single-core performance improvement. First, we can no longer keep raising the clock frequency because of power dissipation and thermal problems. As millions of transistors are integrated onto one chip and the clock speed goes up, the heat becomes too much to handle with affordable cooling solutions. Second, the latency of processor-memory requests becomes a limiting factor, caused by the growing speed gap between the processor and the memory; this is a major bottleneck for overall computing performance. Lastly, the ILP (instruction-level parallelism) that can be extracted from a single thread is close to its limit with current microprocessor architectures and compiler techniques. Therefore, the next path to take is the multi-core approach: instead of trying to improve the performance of single-thread execution, we should partition applications into multiple threads that can run in parallel on prevailing multi-core systems. In this chapter, we look at recent research on multi-core architectures and how to fully utilize multi-core systems. In the next section, we examine hardware designs and characteristics of multi-core architectures. In Section 3, we explain programming techniques to exploit parallelism in two specific applications, network coding and Intrusion Detection Systems (IDS). In Section 4, we touch on the programmability and scalability issues of using multi-core systems for graphics applications, and in Section 5, we conclude the chapter with a forecast of future multi-core systems.
BACKGROUND STUDY: HARDWARE DESIGNS OF MULTI-CORE ARCHITECTURE
In this section, the general hardware architecture of current multi-core processors is surveyed as background. We first describe the basic approaches to building multi-core processors and then discuss memory hierarchy design issues for multi-core systems.
Multi-Core Processor Architecture: Homogeneous or Heterogeneous
In the simplest view, designing a multi-core processor means arranging multiple processing units (so-called cores) on a single chip. In theory, this is a good way to boost performance by providing parallelism through more processors. In reality, however, multi-cores do not always deliver a performance improvement for software execution (Hill & Marty, 2008); they can even produce the opposite result because of communication restrictions and memory sharing problems. Therefore, hardware designers of multi-core processors have to explore various multi-core structures and, at the same time, find effective data sharing schemes to improve performance and processing efficiency. Depending on the architect and the design goals, the internal architecture of each core can vary: the cores can be either homogeneous or heterogeneous. Figure 1 shows the two ways to build a multi-core processor. The two diagrams on the left show multi-core processors from Intel and AMD, which can be classified as "homogeneous multi-core processors". As the name implies, these models integrate identical cores onto a single chip. Yorkfield from Intel is a quad-core CPU that combines two dual-core CPUs in a single package. Phenom from AMD consists of four cores with identical architecture and a shared L3 cache on a single chip.
Figure 1. Two ways to build a multi-core processor
On the other hand, the diagram on the right side, the Cell Broadband Engine, is a heterogeneous processor. It was originally developed as a joint project of Sony, IBM, and Toshiba, and it contains one general-purpose core and eight synergistic processor elements that perform data-intensive processing (Gschwind, Hofstee, Flachs, Hopkins, Watanabe & Yamazaki, 2006). Today, most manufacturers have followed the homogeneous approach for a simple reason: they prefer to reuse the designs previously developed for single-core processors. This reuse can provide a balance between high throughput and good single-thread performance. In addition, simple re-use of a micro-architecture provides good portability and scalability for existing applications and legacy code, and it is an easy way to build a multi-core architecture, since we only need to place multiple copies of an existing core on a single chip. In spite of these advantages of the homogeneous structure, many processor architects expect that heterogeneous multi-core processors such as the Cell processor will perform more powerfully and effectively than homogeneous ones (Pericas, Cristal, Cazorla, Gonzalez, Jimenez & Valero, 2007) (Kumar, Tullsen & Jouppi, 2006). One reason is that a heterogeneous design can match each application to the core best suited to it, which is a better approach to reaching high performance; it also allows a more power-efficient architecture in a smaller chip die area (Kowaliski, n.d.). Another reason is that, with the development of digital technology, the amount of data to be processed is growing much faster than the number of instructions, and for processing such large amounts of data a heterogeneous design is clearly more suitable than a homogeneous one. One example of a heterogeneous multi-core architecture is built with Graphics Processing Units. The expectation of large-scale data processing has led the GPU (Graphics Processing Unit) to be developed into a general-purpose processing unit, the GP-GPU (general purpose GPU). Most companies that make processing units such as GPUs, DSPs, and CPUs also expect the GP-GPU to make a strong transition into the CPU market (Kowaliski, n.d.).
Figure 2. Single-core and various multi-core processors
Examples of this trend can be seen at the major processor companies: AMD has acquired ATI, a well-known graphics processing unit maker; Intel has shown its interest by developing the GPU architecture named Larrabee; and NVIDIA, one of the best-known GPU makers, is developing a general-purpose processing unit based on its traditional graphics processing units.
Memory Hierarchy Design
Main memory latency is a major source of delay in modern computer systems, whether single-core or multi-core. The architectural design of the memory system is therefore very important for the performance of multi-core processors. In particular, the cache takes on one more important duty in multi-core systems: the coherency protocol. A cache shared in a multi-core processor has to observe changes to data and notify each core about them. Figure 2 shows various multi-core processor models with different cache designs. There is no cache sharing in the single-core processor or in the simple dual-core processor. In contrast, the shared-L2-cache dual-core processors integrate a shared L2 cache; the cache coherence protocol must be designed carefully so that the shared L2 cache is used properly. One of the most important aspects of designing multi-core processors is the memory hierarchy and the data sharing techniques among the cores. Much like in traditional shared-memory architectures, data is passed between threads running on different cores through the shared memory. Here the cache design plays an important role and becomes a major issue in designing multi-core processors.
As a consequence, the performance of today's multi-core systems depends strongly on the cache size, the cache hierarchy, and the shared cache architecture (Schirrmeister, 2007). On a single-core architecture, performance generally improves as the cache grows larger. In multi-core processors, however, the advantage of a large cache is weakened by the data sharing operations and the coherence protocol (Kowaliski, n.d.); the structure of the cache hierarchy and the coherence protocols are considered more important than the cache size. For that reason, cache structures for multi-core processors are being studied from many angles, and many different structures have been proposed and developed to improve performance. The design of a multi-core processor is more complicated than the design of a single core, because of data sharing between cores and the grouping of cores. Simply combining single cores without considering how they are grouped does not perform well; the main reason is that an original single core was not designed with any parallelism in mind. To obtain a good multi-core design, we must fully consider the parallel execution of software and the communication between cores. Therefore, special designs and techniques must be added on top of a simple arrangement of cores; they may concern the basic core architecture or the structure in which the cores are combined. There are many ways to extract the potential parallelism by providing a better hardware platform, and to find the best one, the hardware platform must be designed around the way software parallelism is executed.
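To make the cost of inter-core data sharing concrete, the following small C program (our own illustration, not taken from the chapter) lets two threads increment two counters that sit next to each other in memory. On many multi-core machines, uncommenting the padding so that the counters fall on separate cache lines speeds the program up noticeably, because it removes the false sharing and the coherence traffic it causes.

```c
/* Minimal false-sharing demonstration (illustrative sketch only).
 * Build with: gcc -O2 -pthread false_sharing.c                    */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

struct counters {
    volatile unsigned long a;   /* written only by thread 0                */
    /* char pad[64]; */         /* uncomment to push b onto another line   */
    volatile unsigned long b;   /* written only by thread 1                */
};

static struct counters c;

static void *worker(void *arg)
{
    volatile unsigned long *p = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        (*p)++;                 /* each write may invalidate the other
                                   core's cached copy of the same line     */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)&c.a);
    pthread_create(&t1, NULL, worker, (void *)&c.b);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("a=%lu b=%lu\n", c.a, c.b);
    return 0;
}
```

The size of the effect depends on the cache-line size and the coherence protocol of the particular processor, which is exactly why the cache structure matters more than the raw cache size in multi-core designs.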
EXPLOITING SOFTWARE PARALLELISM ON A MULTI-CORE SYSTEM
In this section, we look into two approaches to exploiting software parallelism on a multi-core system. The first application described is network coding. After that, we present the development of an intrusion detection algorithm as a multi-core application.
Network Coding on Multi-Core Systems
Network coding is a method that increases the network transmission rate while also improving reliability and security. It does this by performing coding operations on packets not only at the source nodes but also at intermediate nodes throughout the network topology between the source and the receivers. The idea was first proposed by Ahlswede et al. (Ahlswede, Cai, Li & Yeung, 2000), who showed the usefulness of network coding in multicast networks. It was further researched by others, who showed that simple linear codes are sufficient for the implementation of network coding (Li, Yeung & Cai, 2003), and, going further, that random combinations of the linear codes can be used (Ho et al., 2006). Figure 3 shows a communication network, a directed graph whose edges represent pathways for information (Li, Yeung & Cai, 2003). At the source S, information is generated and then multicast to other nodes in the network. Every node can pass on whatever information it has received. Now, suppose we generate data bits a and b at source S and want to send the data to both node D and node E. By the max-flow min-cut theorem (Wikipedia, Max-flow min-cut theorem), we can calculate the maximum flow, that is, the maximum amount of information we can transmit through this network. We cannot achieve this maximum rate by routing alone, and that is where network coding comes in.
Figure 3. A communication network for network coding
We first send data a along the edges SA, AC, and AD, and data b along SB, BC, and BE. With a routing scheme, we can only send a copy of either a or b, but not both, from C down the edge CZ. Suppose we send data a through CZ. Then node D would receive data a twice, once from A and once from Z, and would not get data b. Sending data b instead would raise the same problem for node E. Routing is therefore insufficient, as it cannot deliver both data a and data b to both destinations D and E simultaneously. Using network coding, on the other hand, we can encode the data a and b received at node C and send the encoded version down CZ. Say we use bitwise xor for encoding. Then data a and b are encoded into 'a xor b'. The encoded data is sent along the edges CZ, ZD, and ZE. Node D receives data a and 'a xor b', so it can decode and recover data b. The same holds for node E, which receives data b and 'a xor b' and extracts data a by decoding. This example makes it clear that network coding has a huge advantage over simple routing: it enables us to multicast two bits per unit time from the source to the destinations, which cannot be achieved by routing.
With this higher transmission capacity, another factor in performance is the encoding/decoding speed. It has to be fast enough not to become a bottleneck, and in today's multi-core environment fast encoding/decoding can be achieved by exploiting parallelism. Here we pick the method of linear encoding and random linear decoding mentioned earlier and see how to parallelize it. First we take a look at the big picture of encoding and decoding, then at the specific algorithm and its parallelization. Let us assume that an application generates a stream of equal-sized frames. We organize these frames into blocks, each containing a number of consecutive frames. If the frames are numbered, then b(blockID, blockSize) denotes the block that holds frame(blockID) to frame(blockID+blockSize-1). A coded packet c(blockID, blockSize) is a linear combination of the frames within b(blockID, blockSize), that is, c(blockID, blockSize) = Σ_{k=1}^{blockSize} e_k · p(blockID + k − 1), where p_k denotes an application frame and the coefficient e_k is an element of a chosen finite field F. Every arithmetic operation is performed over the field F (Figure 4). A block of blockSize application frames is needed to make a coded packet, so a source node waits for enough frames to accumulate before starting encoding.
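As a minimal illustration of the butterfly example above (our own sketch; the function names are hypothetical), the only coding operation node C needs is a bitwise XOR, and the destinations decode with the same operation:

```c
#include <stdio.h>
#include <stdint.h>

/* Node C combines the two incoming packets into one coded packet. */
static uint8_t encode_xor(uint8_t a, uint8_t b) { return a ^ b; }

/* Node D receives a (via A) and a^b (via Z) and recovers b;
 * node E does the symmetric operation to recover a.               */
static uint8_t decode_xor(uint8_t known, uint8_t coded) { return known ^ coded; }

int main(void)
{
    uint8_t a = 0x5A, b = 0xC3;
    uint8_t coded = encode_xor(a, b);          /* sent along C -> Z        */
    printf("node D recovers b = 0x%02X\n", decode_xor(a, coded));
    printf("node E recovers a = 0x%02X\n", decode_xor(b, coded));
    return 0;
}
```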
Figure 4. Blocks and coded packets
The encoded packet is broadcast towards the destination nodes together with the coefficient vector, which is stored in the packet header. Nodes on the path to the destination nodes re-encode the coded packets and send them along. When a coded packet reaches a destination node, it is stored in memory. For the destination node to decode the packets into the original data block of blockSize frames, it needs blockSize coded packets with independent coefficient vectors. If we denote E = [e_1^T … e_blockSize^T]^T, C = [c_1^T … c_blockSize^T]^T, and P = [p_blockID^T … p_{blockID+blockSize−1}^T]^T, where the superscript T stands for the transpose operation, then the coded packets are calculated as C = EP, and we can decode C into the original block P at the destination nodes with the formula P = E^{−1}C. Note that the matrix E must be invertible, so all coefficient vectors e_k must be independent of each other. The pseudocode for the encoding algorithm is given in Figure 5 and is represented more intuitively in Figure 4; it is basically a matrix operation. We can parallelize this operation over multiple threads, and depending on how we divide it, the speedup can range from essentially none to severalfold. For instance, suppose we split the work inside each row-by-column multiplication of Figure 5. Such an operation computes a_1·b_1 + a_2·b_2 + … + a_8·b_8, so we could create a few threads that each perform one multiplication a_n·b_n and add up the results of all the threads at the end. This seems fine at the algorithm level, but the problem is that the partial results cannot be stored in memory simultaneously. Race conditions could occur, so locks have to be added around the critical section to avoid collisions. However, these locks mean that the threads cannot actually run in parallel; each thread has to wait for its turn to take the lock. Therefore, even if the multiplications are done on separate cores simultaneously, the memory store operation becomes a bottleneck that serializes the whole process, hindering the speedup that could be achieved in a multi-core environment (Figure 6).
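To make the linear combination over the finite field F concrete, here is a small, self-contained C sketch of the encoding of one coded packet. It is our own illustration: the field GF(2^8) with the reduction polynomial x^8+x^4+x^3+x+1 is an assumption chosen for simplicity, and a production implementation would normally use table-driven multiplication and a faster inner loop.

```c
#include <stdint.h>
#include <stddef.h>

/* Multiplication in GF(2^8), bit by bit (choice of field/polynomial is
 * an assumption for illustration).                                      */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t prod = 0;
    while (b) {
        if (b & 1)
            prod ^= a;              /* addition in GF(2^8) is XOR        */
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry)
            a ^= 0x1B;              /* reduce modulo the field polynomial */
        b >>= 1;
    }
    return prod;
}

/* Compute one coded packet c = e_1*p_1 + ... + e_blockSize*p_blockSize,
 * applied byte-wise to frames of frameLen bytes.                        */
void encode_packet(uint8_t *coded, const uint8_t *coeff,
                   uint8_t **frames, size_t blockSize, size_t frameLen)
{
    for (size_t j = 0; j < frameLen; j++) {
        uint8_t acc = 0;
        for (size_t k = 0; k < blockSize; k++)
            acc ^= gf_mul(coeff[k], frames[k][j]);
        coded[j] = acc;
    }
}
```

Since addition in GF(2^8) is XOR, the accumulation never overflows, and the per-packet work is independent of every other packet, which is exactly what the coarse-grained parallelization described next exploits.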
Figure 5. Encoding algorithm
Figure 6. Encoding process
On the other hand, suppose you divide the work by assigning each whole row * column multiplication on different threads instead of breaking down each operation on threads, as in Figure 7. This way the process is divided into chunks so that memory storage is not a problem. Inside each chunk the operations store the temporary results in the same memory cell, but since it is sequential inside each thread it is OK. If you look at each different chunk, they store all results in different memory cells, avoiding collision and thus suitable to run in parallel. The specific description of the algorithm is as follows. As in Figure 7, we can parallelize the encoding algorithm by using threads to divide the workload. In Figure 6, a single thread does vector multiplication e1 ∙ b(1,8), e2 ∙ b(1,8), …, e8 ∙ b(1,8) to get the coded packets c1, c2, …,c8 sequentially, one at a time. In the parallelized version, we split the work into 4
Figure 7. Parallelized encoding algorithm
Figure 8. Parallelized encoding process
independent parts, each running on a different thread. Thus, if the processor has at least 4 cores, the coded packets c1 and c5 are computed on the first core, c2 and c6 on the second, and so on. This is shown in Figure 8; note that coded packets drawn in the same color are calculated by the same thread. This parallelization gives roughly a fourfold throughput in the encoding process, excluding the time it takes to manage the threads.
After the blocks are encoded into coded packets, they are sent towards the destination node. At every node on the way to the destination, the packets go through a re-encoding process (Figure 9). Re-encoding is basically the same as encoding: the coded packet is encoded once more with a randomly selected re-encoding vector. The newly coded packet c′(blockID, blockSize) = Σ_{k=1}^{blockSize} e′_k · c_k is sent along with the combined coefficient vector e′ = Σ_{k=1}^{blockSize} e′_k · e_k.
Figure 9. Re-encoding process
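A hedged sketch of the coarse-grained split described above: each thread computes a disjoint subset of the coded packets (whole rows of C = EP), so no two threads ever write the same memory and no locks are required. The thread layout and names are ours; encode_packet is the hypothetical helper from the earlier sketch.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_THREADS 4

/* Hypothetical helper from the previous sketch. */
void encode_packet(uint8_t *coded, const uint8_t *coeff,
                   uint8_t **frames, size_t blockSize, size_t frameLen);

struct encode_job {
    uint8_t **coded;        /* blockSize output packets                  */
    uint8_t **coeff;        /* blockSize coefficient vectors (rows of E) */
    uint8_t **frames;       /* the block of application frames (P)       */
    size_t blockSize, frameLen;
    size_t tid;             /* which thread this job belongs to          */
};

/* Thread t computes rows t, t+NUM_THREADS, t+2*NUM_THREADS, ...         */
static void *encode_rows(void *arg)
{
    struct encode_job *job = arg;
    for (size_t row = job->tid; row < job->blockSize; row += NUM_THREADS)
        encode_packet(job->coded[row], job->coeff[row],
                      job->frames, job->blockSize, job->frameLen);
    return NULL;
}

void parallel_encode(uint8_t **coded, uint8_t **coeff, uint8_t **frames,
                     size_t blockSize, size_t frameLen)
{
    pthread_t tids[NUM_THREADS];
    struct encode_job jobs[NUM_THREADS];
    for (size_t t = 0; t < NUM_THREADS; t++) {
        jobs[t] = (struct encode_job){ coded, coeff, frames,
                                       blockSize, frameLen, t };
        pthread_create(&tids[t], NULL, encode_rows, &jobs[t]);
    }
    for (size_t t = 0; t < NUM_THREADS; t++)
        pthread_join(tids[t], NULL);
}
```

Because every output row has its own buffer, the threads never contend for a lock; the only serialization left is thread creation and joining.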
Programmability and Scalability on Multi-Core Architectures
Figure 10. Decoding algorithm in a single-core system
This re-encoding process can therefore be parallelized in the same way, with multiple threads performing the matrix multiplication. After being re-encoded at every node it passes, a coded packet, together with its coefficient vector, finally reaches the destination(s). The destination node waits until it has received enough coded packets to decode, which is blockSize packets, supposing that all coefficient vectors are independent. The process that follows is decoding: reconstructing the original data from the coded packets, and this is another place where a multi-core system can improve performance. If we arrange the coefficient vectors of the received packets as a matrix, we can compute the inverse matrix and then the original block using the Gauss-Jordan algorithm (Wikipedia, Gauss-Jordan elimination). In a single-core machine without threads, this process is purely sequential, as in Figure 10. In a multi-core environment, however, we can speed it up by dividing the work among several threads that run on separate cores in parallel.
The aim of the decoding process is to transform the coefficient matrix into its reduced row echelon form using elementary row operations. For the i'th vector of the coefficient matrix, the first step is to divide the whole vector by the value of its basis coordinate, say val, to make the basis coordinate 1. In the second step, for the rows above, subtract (i'th vector) * val to make the value in column i equal to 0; for the rows below, divide the row by val and then subtract the current vector. Each such pass sets column i to 0 in every row except the i'th one. After processing every vector in this way, we obtain the identity matrix. Applying the same sequence of operations to an identity matrix reveals the inverse matrix, which we can then multiply with the coded packets to recover the original blocks.
The parallelization here is simple. In the second step of the above algorithm, the row operations are independent of one another, with no race conditions. We can therefore partition the rows into small groups that run concurrently on separate threads. This corresponds to lines 5-14 in the pseudocode of Figure 10, and simply creating multiple threads to execute this part does the job (see Figure 11). We have looked at the encoding/decoding process used in network coding and have parallelized it into multiple threads.
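The per-thread portion of the parallel elimination step might look like the following C sketch (our own, simplified to GF(2) so that subtracting a multiple of the pivot row becomes a conditional XOR): each thread clears the pivot column in its own contiguous range of rows, so the row operations of one pivot step run concurrently without races. Synchronization between pivot steps, for example with a barrier, is omitted here.

```c
#include <stddef.h>
#include <stdint.h>

/* One thread's share of the elimination step for pivot row `pivot`:
 * clear column `pivot` in rows [first, last) of the augmented matrix m
 * (cols columns per row). Each thread gets a disjoint [first, last),
 * so no two threads touch the same row.                                 */
void eliminate_rows(uint8_t **m, size_t cols,
                    size_t pivot, size_t first, size_t last)
{
    for (size_t r = first; r < last; r++) {
        if (r == pivot || m[r][pivot] == 0)
            continue;                       /* nothing to clear here      */
        for (size_t c = 0; c < cols; c++)
            m[r][c] ^= m[pivot][c];         /* row_r := row_r - row_pivot */
    }
}
```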
Figure 11. Parallelized decoding algorithm – I’th basis operation
As you can see, there are several issues to think about before letting multiple threads divide up the job, such as race conditions, critical regions, and locks. The hard part is that present compilers do not catch these errors; the programmer is on his or her own. Race conditions are especially hard to manage: once such an error gets in, it can be very hard to detect, and a problem may surface only after years of apparently safe use of the program. Hidden in medical equipment, spacecraft, or similar systems, it could lead to a critical failure. Thus, as with all software, check and double-check the algorithm and perform sufficient testing before declaring it safe to use.
Implementing an Intrusion Detection System on a Multi-Core Architecture
Ever since Internet service was introduced, it has been a major way for people to communicate with each other and to collect useful information from around the world. Although it brings a lot of convenience to everyday life, it also carries serious threats, in that it may expose personal privacy. Hence the need for Internet security has been emphasized in order to protect personal information. Moreover, ubiquitous computing will be actively developed and widely adopted in the near future; this trend will require more advanced Internet services as the backbone platform of a successful ubiquitous computing environment. In this section, we discuss parallel intrusion detection algorithms designed for multi-core systems.
Overview of IDS (Intrusion Detection System)
Among network security products, the Intrusion Detection System (IDS) is a leading solution in the market. IDS systems can be divided into two groups: host-based IDS and network-based IDS. Since we are interested in the parallel implementation of IDS applications, the network-based IDS is the main target of our discussion. One of the main advantages of a network-based IDS is that it can support a large-scale network. In addition, it can detect an attack before the host server itself is compromised. However, there are two major problems in a network-based IDS: the packet filtering problem and the classification problem based on string matching (Akenine-Moller, 2002). The former has been improved by several previous studies, but the latter still needs further study.
Figure 12. Multi-thread test
The Boyer-Moore algorithm is the best-known string matching algorithm. It is a general-purpose algorithm that scans the input and compares it against the pattern starting from the rightmost character of the pattern (Boyer & Moore, 1977). The weakness of string matching is that all the data must be scanned; the scanning process consumes considerable power and slows down performance (Chen & Lee, 1999), so this weakness needs to be addressed. Pattern matching is the most important part of a network-based IDS. However, it always requires a large amount of computation and, even worse, the number of attack patterns is increasing day by day. Pattern matching must be able to handle patterns of various lengths, upper- and lower-case letters, and several ordered patterns efficiently at the same time. It is therefore effective to use multi-pattern matching in a network-based IDS to handle packets arriving at high speed (Ni, Lin, Chen & Ungsunan, 2007). Figure 12 shows data comparing Intel's Q6600 (quad-core) and E2140 (dual-core) processors. It is possible to improve efficiency by up to 40% by using multiple threads for virus checking, which means that large parts of network security processing can be improved by parallelization. However, present network security research focuses on mathematical approaches, data communication, and data processing. Thus, network security methods could be implemented more efficiently by using multi-core based systems.
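The right-to-left scan of the Boyer-Moore family can be illustrated with the simplified Horspool variant below; this is our own sketch for illustration, not the exact matcher used in any particular IDS.

```c
#include <stddef.h>

/* Boyer-Moore-Horspool search: returns the index of the first occurrence
 * of pat (length m) in text (length n), or -1 if it does not occur.      */
long bmh_search(const unsigned char *text, size_t n,
                const unsigned char *pat, size_t m)
{
    size_t skip[256];
    if (m == 0 || m > n)
        return -1;
    for (size_t i = 0; i < 256; i++)
        skip[i] = m;                         /* default shift              */
    for (size_t i = 0; i + 1 < m; i++)
        skip[pat[i]] = m - 1 - i;            /* bad-character shifts       */

    for (size_t pos = 0; pos + m <= n; pos += skip[text[pos + m - 1]]) {
        size_t j = m;
        while (j > 0 && text[pos + j - 1] == pat[j - 1])
            j--;                             /* compare right to left      */
        if (j == 0)
            return (long)pos;                /* full match                 */
    }
    return -1;
}
```

The skip table lets the search jump ahead by up to the pattern length after a mismatch, which is why matchers of this family usually inspect only a fraction of the input bytes.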
Parallelization of an Intrusion Detection System
Future computer systems will be designed around multi-core processors, and this trend will also apply to IDS. However, most IDS-related research is still carried out on single-core systems and in software. As mentioned before, several parts of an IDS can be parallelized. In fact, the pattern matching algorithm is a sequential process and therefore takes a long time (Ni, et al., 2007); this can be addressed by exploiting parallelism within the pattern matching algorithm. Figure 13 and Table 1 show the structure of the pattern matching and its order of execution, respectively (Ni, et al., 2007). As shown in Figure 13, the IDS scans the patterns through pattern matching, and the CPU decides for each scanned pattern whether or not it is an attack pattern. This process takes a long time because all patterns need to be scanned, and so far it has been performed sequentially. However, the patterns can be divided into several blocks because they are independent of one another.
Figure 13. Structure of parallel pattern matching
Because of this independence, the blocks can be allocated to several cores in a multi-core environment (Kowaliski, n.d.). Each block is divided among several threads, and those threads run on their allocated cores. In this way, the pattern matching process achieves high processing speed on a multi-core processor, as sketched below.
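The block-wise parallelization could be arranged as in the following sketch (our own, using POSIX threads and the hypothetical bmh_search from the previous sketch): the signature set is split into disjoint slices, and each thread scans the same packet payload against its own slice, writing only to its private result slot, so no locking is needed.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical matcher from the previous sketch. */
long bmh_search(const unsigned char *text, size_t n,
                const unsigned char *pat, size_t m);

#define NUM_CORES 4

struct match_job {
    const unsigned char *payload;    size_t payload_len;
    const unsigned char **patterns;  const size_t *pattern_lens;
    size_t first, last;              /* this thread's slice of patterns     */
    int hit;                         /* private result: 1 if any pattern hit */
};

static void *scan_slice(void *arg)
{
    struct match_job *job = arg;
    job->hit = 0;
    for (size_t i = job->first; i < job->last && !job->hit; i++)
        if (bmh_search(job->payload, job->payload_len,
                       job->patterns[i], job->pattern_lens[i]) >= 0)
            job->hit = 1;
    return NULL;
}

/* Returns nonzero if any of the num_patterns signatures occurs in payload. */
int parallel_match(const unsigned char *payload, size_t payload_len,
                   const unsigned char **patterns, const size_t *pattern_lens,
                   size_t num_patterns)
{
    pthread_t tid[NUM_CORES];
    struct match_job job[NUM_CORES];
    size_t chunk = (num_patterns + NUM_CORES - 1) / NUM_CORES;

    for (size_t t = 0; t < NUM_CORES; t++) {
        job[t].payload = payload;     job[t].payload_len = payload_len;
        job[t].patterns = patterns;   job[t].pattern_lens = pattern_lens;
        job[t].first = t * chunk;
        job[t].last  = (t + 1) * chunk < num_patterns ? (t + 1) * chunk
                                                      : num_patterns;
        pthread_create(&tid[t], NULL, scan_slice, &job[t]);
    }
    int hit = 0;
    for (size_t t = 0; t < NUM_CORES; t++) {
        pthread_join(tid[t], NULL);
        hit |= job[t].hit;
    }
    return hit;
}
```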
GPU DEVELOPMENT ON MULTI-CORE SYSTEMS
Parallel computing and data-parallel programming environments provide high performance in a computer system. This is especially crucial in graphics processors, because nearly 10,000 data elements need to be handled at a given time (Boyd, 2008). As a result, graphics processing has become one of the target applications for utilizing multi-core systems. Large-scale 3D graphics applications have driven the creation of many-core GPUs alongside multi-core CPUs, and such many-core processors need a new software paradigm that can easily exploit software parallelism. Reflecting this trend, NVIDIA announced CUDA, a parallel programming environment, in 2007. CUDA targets multi-core systems with a shared-memory parallel processing architecture (Nickolls, Buck & Garland, 2007).
Table 1. Order of pattern matching
1. Fetch the patterns in the network interface
2. Scan the pattern and send to L2 cache
3. Pattern saved in the shared L2 cache
4. Several blocked-patterns input to each CPU
5. Patterns saved in L1 cache in CPU
6. Check the patterns whether it is attacking pattern or not
Figure 14. 3D graphics system architecture
In the graphics market, many-core GPUs are being developed rapidly because graphics processing involves an enormous number of calculations. The growth of this market is driven by 3D graphics technology, which demands very high processing power; to keep power consumption under control, 3D graphics processing has moved to multi-core designs. Moore's law provides more and more transistors on one die, so more cores can be integrated. Today's GPUs have evolved into GPGPUs (General-Purpose computation on GPUs) (Nickolls, et al., 2007), which has led to strong improvements in speed.
Background of 3D Graphics Processing and the GPU
The 3D graphics process is generally divided into three stages: the application stage, the geometry stage, and the rasterization stage (Chen & Lee, 1999), as shown in Figure 14. The first stage is the application stage. It is processed on the CPU and handles user input, calculates the physical 3D object data, and provides the vertex data, such as points, lines, triangles or polygons. The second stage is the geometry stage. Here the vertex data received from the application stage is transformed geometrically (which involves vector multiplications) and processed by operations such as clipping, lighting and coordinate transformation. The last stage is the rasterization stage; it maps the textures prepared in the geometry stage, applies effects, and stores the final result in the frame buffer. Most 3D graphics chips (GPUs) focus on accelerating the rasterization stage and leave the geometry processing to the host CPU, because geometry processing demands a large number of floating-point operations that cannot easily be handled (Chen & Lee, 1999). To achieve real-time 3D graphics performance, the rasterization stage therefore receives the most attention. The GPU is the most heavily used processor in the graphics field. It has shown rapid improvement in
Figure 15. A modern GPU
the last ten years, and this has supported the growth of overall computing capability. In particular, the GPU provides the solution for rendering and for data-parallel work. Figure 15 shows a simplified view of a modern GPU (Boyd, 2008). Running the GPU in parallel with the CPU relieved the data-processing bottleneck, but it still could not cope with the huge amount of data in graphics processing; dedicated hardware (the GPU) was therefore designed to solve this problem. The main core of the GPU is the shader. The shader performs the calculations on vertex or pixel data, which constitute the bulk of the graphics computation (Boyd, 2008). Furthermore, a unified parallel shader has been designed to solve the I/O problems and to increase the shader's capacity.
Parallelization of 3D Graphics Processing
Originally, the whole 3D graphics pipeline was executed on the CPU. The problem is that the CPU load grows heavily because of the large amount of computation involved (Fernando, et al., 2004). As a result, the GPU was introduced to offload the geometry and rasterization stages onto a dedicated core. The shader used in the GPU is a specialized processor that accelerates the graphics API (Application Program Interface). The shader programs operate on the vertex and pixel data of the 3D graphics computation; these calculations are either independent of one another or very similar in structure, so the GPU parallelizes the vertex and pixel shaders, and inside each shader there is also a parallel structure to improve throughput. The current trend is to integrate the shaders in the GPU to improve shader utilization. Unified shaders no longer distinguish between vertex shaders and pixel shaders: many small shaders with the same structure together form one large component. This large shader operates efficiently because the inner small shaders run in parallel, and each of them can act as a vertex shader or a pixel shader depending on the case. Because the 3D graphics API is programmed as a pipeline, it is possible to calculate
Figure 16. GPU using parallelized shader
independently even when the vertex and pixel shaders are integrated. A parallelized unified shader can avoid the problem of computation converging on one shader type and thus save hardware resources. Moreover, it uses a single instruction set, which makes shader programming easier. Figure 16 shows the inner structure of a GPU that uses the parallelized unified shader. A parallelized shader is a homogeneous multiprocessor consisting of several small shaders with the same structure, each designed to accelerate the vertex or pixel calculations of 3D programs. Although the GPU is a processor core specialized for data-parallel work, and using it for general-purpose computation is not straightforward, it is also wasteful to use it only for graphics processing. As a result, the GPGPU has been developed for general-purpose numerical computation. Furthermore, the two leading companies of the graphics market, NVIDIA and AMD, have developed general-purpose parallel programming tools: CUDA (Compute Unified Device Architecture) from NVIDIA and CTM (Close to the Metal) from AMD. For example, one can install an NVIDIA GPGPU in a computer along with the CUDA software and thereby speed up programs with many floating-point calculations, such as graphics. CUDA uses a parallel data cache between the ALUs and memory to perform thread-level parallel processing on several ALUs. Roughly speaking, a few clusters of desktop PCs can then be as effective as a supercomputer.
CONCLUSION
Today, most processor manufacturers are interested in multi-core architectures and are releasing various multi-core products. We expect this trend to last for the next several years for the following three reasons. First of all, clock speed is no longer the main lever of processor design: as the clock speed increases, the leakage current on a chip also grows, so the
power consumption rises dramatically and the processor temperature climbs. A higher clock frequency is therefore no longer a useful lever, and new implementation techniques such as multi-core architectures are required to raise processor performance. The second reason is that manufacturers can integrate more and more transistors onto a single chip. Since the number of transistors per chip has increased continually, multi-core processors that integrate two or more processing cores onto a single chip have become a realistic direction for computer technology (Hayes, 2007). Thirdly, market demand also drives the trend. Most computer users want to perform multiple tasks on their desktop machines concurrently, such as listening to music, playing games, watching television, and surfing the Internet. The needs of computer users therefore encourage processor makers to obtain performance improvements through the parallelism of multiple processing cores. In this chapter, we have presented several software applications that can make efficient use of multi-core processors.
REFERENCES
Akenine-Moller, T., & Haines, E. (2002, July). Real-time rendering (2nd ed.). Wellesley, MA: A. K. Peters Publishing Company.
Aldwairi, M., Conte, T., & Franzon, P. (2005). Configurable string matching hardware for speeding up intrusion detection. ACM SIGARCH Computer Architecture News, 33(1).
Ahlswede, R., Cai, N., Li, S.-Y. R., & Yeung, R. W. (2000). Network information flow. IEEE Transactions on Information Theory, 46(4), 1204–1216.
Boyd, C. (2008, March/April). Data-parallel computing. ACM Queue: Tomorrow's Computing Today, 6(2). doi:10.1145/1365490.1365499
Boyer, R., & Moore, J. (1977). A fast string searching algorithm. Communications of the ACM, 20(10), 762–777. doi:10.1145/359842.359859
Chen, C.-H., & Lee, C.-Y. (1999). A cost effective lighting processor for 3D graphics application. Proceedings of the International Conference on Image Processing, 2, 792–796.
Dharmapurikar, S., & Lockwood, J. (2006, October). Fast and scalable pattern matching for network intrusion detection systems. IEEE Journal on Selected Areas in Communications, 24(10).
Fernando, R., Harris, M., Wloka, M., & Zeller, C. (2004). Programming graphics hardware. In Eurographics Tutorial. NVIDIA Corporation.
Gschwind, M., Hofstee, H. P., Flachs, B., Hopkins, M., Watanabe, Y., & Yamazaki, T. (2006). Synergistic processing in Cell's multicore architecture. IEEE Micro, 26(2), 10–24.
Hammond, L., Nayfeh, B. A., & Olukotun, K. (1997, September). A single-chip multiprocessor. IEEE Computer, 30(9), 79–85.
Hayes, B. (2007). Computing in a parallel universe. American Scientist, 95(6), 476–480.
Hennessy, J. L., & Patterson, D. A. (n.d.). Computer architecture – A quantitative approach (4th ed.).
Hill, M. D., & Marty, M. R. (2008). Amdahl's law in the multicore era. HPCA 2008, IEEE 14th International Symposium (p. 187).
Ho, T., Medard, M., Koetter, R., Karger, D. R., Effros, M., Shi, J., & Leong, B. (2006, October). A random linear network coding approach to multicast. IEEE Transactions on Information Theory, 52(10). doi:10.1109/TIT.2006.881746
Koetter, R., & Medard, M. (2003, October). An algebraic approach to network coding. IEEE/ACM Transactions on Networking, 11(5), 782–795.
Kowaliski, C. (2008, April 11). NVIDIA CEO talks down CPU-GPU hybrids, Larrabee. The Tech Report. Retrieved from http://techreport.com/discussions.x/14538
Kumar, R., Tullsen, D. M., & Jouppi, N. P. (2006). Core architecture optimization for heterogeneous chip multiprocessors. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT 2006) (pp. 23–32).
Kumar, R., Tullsen, D. M., Ranganathan, P., Jouppi, N. P., & Farkas, K. I. (2004, June). Single-ISA heterogeneous multi-core architectures for multithreaded workload performance. In Proceedings of the 31st International Symposium on Computer Architecture (ISCA'04).
Kwok, T. T.-O., & Kwok, Y.-K. (2007). Design and evaluation of parallel string matching algorithms for network intrusion detection systems. In Network and Parallel Computing (NPC 2007) (LNCS 4672, pp. 344–353). Berlin: Springer.
Li, S.-Y. R., Yeung, R. W., & Cai, N. (2003, February). Linear network coding. IEEE Transactions on Information Theory, 49(2), 371–381. doi:10.1109/TIT.2002.807285
Ni, J., Lin, C., Chen, Z., & Ungsunan, P. (2007, September). A fast multi-pattern matching algorithm for deep packet inspection on a network processor. In Proceedings of the International Conference on Parallel Processing (ICPP 2007) (p. 16).
Nickolls, J., Buck, I., & Garland, M. (2008, March/April). Scalable parallel programming with CUDA. ACM Queue, 6(2), 40–53.
Olukotun, K., & Hammond, L. (2005, September). The future of microprocessors. ACM Queue, 3(7), 26–29.
Patterson, D. A., & Hennessy, J. L. Computer organization and design (3rd ed.).
Paxson, V., & Sommer, R. (2007). An architecture exploiting multi-core processors to parallelize network intrusion prevention. In Proceedings of the IEEE Sarnoff Symposium.
Pericas, M., Cristal, A., Cazorla, F. J., Gonzalez, R., Jimenez, D. A., & Valero, M. (2007). A flexible heterogeneous multi-core architecture. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (pp. 13–24).
Schirrmeister, F. (2007). Multi-core processors: Fundamentals, trends, and challenges. Embedded Systems Conference (pp. 6–15).
Shen, J. P., & Lipasti, M. (2004). Modern processor design: Fundamentals of superscalar processors (1st ed.).
Wikipedia, Gauss-Jordan elimination. Retrieved from http://en.wikipedia.org/wiki/Gauss-Jordan_elimination
Wikipedia, Max-flow min-cut theorem. Retrieved from http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem
KEY TERMS AND DEFINITIONS
Cache Coherency: Cache coherence is a mechanism for managing conflicts and maintaining consistency between caches and memory.
Compiler: A compiler is a program (or set of programs) that translates text written in one computer language (the source language) into another computer language (the target language).
Instruction-Level Parallelism (ILP): ILP is a measure of how many of the instructions in a computer program can be executed simultaneously.
Multi-Core: A multi-core processor is an architecture that integrates two or more processing cores on a single chip.
Multiprocessor: A multiprocessor is a single computer system that has two or more processors.
Parallelism: Parallelism is a method of computation in which many calculations are carried out simultaneously.
Thread-Level Parallelism (TLP): TLP is a form of parallelism in which threads execute concurrently across different processing units.
Chapter 13
Assembling of Parallel Programs for Large Scale Numerical Modeling
V. E. Malyshkin, Russian Academy of Sciences, Russia
ABSTRACT
The main ideas of the Assembly Technology (AT), as applied to the parallel implementation of large scale realistic numerical models on a rectangular mesh, are considered and demonstrated by the parallelization (fragmentation) of a Particle-In-Cell (PIC) application that solves the problem of energy exchange in a plasma cloud. Implementing a numerical model with the Assembly Technology is based on the construction of a fragmented parallel program. Assembling a numerical simulation program under AT automatically provides the target program with several useful dynamic properties, including dynamic load balancing based on the migration of fragments from overloaded to underloaded processor elements of a multicomputer. The parallel program assembly approach can also be seen as a combination and adaptation, for parallel programming, of the well-known modular programming and domain decomposition techniques, supported by system software for assembling fragmented programs.
INTRODUCTION
Parallel implementation of realistic numerical models, which use direct numerical modeling of a physical phenomenon based on a description of the phenomenon's behaviour in a local area, usually requires high performance computation. However, even when the algorithms of these models are based on regular data structures (such as a rectangular mesh), they are notable for irregularity, and even dynamically changing irregularity, of the data structure (adaptive mesh, variable time step, particles, etc.). In the PIC method, for example, the test particles lie at the root of such irregularity. Hence, these models are very
difficult to parallelize effectively and to implement with high performance using conventional programming languages and systems. The Assembly Technology (AT) (Kraeva & Malyshkin, 1997), (Kraeva & Malyshkin, 1999), (Valkovskii & Malyshkin, 1988) was created specifically to support the development of fragmented parallel programs for multicomputers. Fragmentation and dynamic load balancing are the key features of programming and program execution under AT. The application of AT to the implementation of large scale numerical models is demonstrated on the example of a parallel implementation of the PIC method (Berezin & Vshivkov, 1980), (Hockney & Eastwood, 1981), (Kraeva & Malyshkin, 2001) applied to the problem of energy exchange in a plasma cloud. In essence, AT integrates well-known programming techniques such as modular programming and domain decomposition in order to provide a suitable technology for developing parallel programs that implement large scale numerical models. AT supports precisely the process of assembling the whole program out of atomic fragments of computation.
The process of extracting new knowledge used to consist of two major components. First, a new fact is found in real physical (chemical, …) experiments. After that, a theory is constructed that should explain the new fact and predict unknown ones. The theory serves science until some new, unexplainable fact is found. This is a long and resource-consuming process: real experiments are often very expensive, special equipment for such experiments takes a long time to prepare, and so on. Now a third component has been added to the scientific process. Numerical simulation of natural phenomena on supercomputers is used to test a developed theory in numerical experiments rather than in real physical experiments, and such numerical experiments often help to design the next real experiment if one is necessary. Sometimes the parameters of a physical system cannot be measured at all, for example, the processes in plasma or inside the sun; in these cases only numerical simulation can provide arguments to support or reject the theory. Compared with real experiments, numerical experiments consume far fewer resources and can be organized very quickly. Therefore, investigations of a phenomenon can be carried out faster, and the phenomenon can be studied more thoroughly in numerous experiments. It is no wonder that modern supercomputers are mostly loaded with large scale numerical simulations (Kedrinskii, Vshivkov, Dudnikov, Shokin & Lazareva, 2004), (Kuksheva, Malyshkin, Nikitin, Snytnikov, Snytnikov, & Vshivkov, 2005).
Unfortunately, the development of parallel programs is a very difficult problem. Earlier, sequential programming languages and systems gave numerical mathematicians the possibility to program their numerical models reasonably well without any assistance from professional programmers; it was their "private technology" (Malyshkin, 2006) of programming. The situation is now different. Development of parallel programs is far more difficult and labor-consuming work. In addition, parallel programs are very sensitive to any errors, to non-optimal design decisions, and to inefficiencies in programming. As a result, numerical mathematicians are now unable to develop parallel programs implementing their numerical models without assistance from professional programmers.
The technology of assembling numerical modeling parallel programs out of ready-made atomic fragments is suggested as a "private technology" of programming for numerical mathematicians, who often work with a restricted set of numerical methods and algorithms. AT is demonstrated on a PIC implementation (parallelization/fragmentation of the algorithms and program construction). Methods of assembling a whole program out of atomic fragments of computation have long been in use in different forms (scalable computing, granularity, etc.). Actually, AT integrates such well-known
programming techniques as modular programming and domain decomposition in order to provide a suitable technology for developing parallel programs that implement large scale numerical models. AT supports precisely the process of assembling the whole program out of atomic fragments of computation, and the peculiarities of the parallel implementation of numerical algorithms are also taken into account. The approach closest to AT in the development of parallel application programs is demonstrated by the IBM programming system ALF for the Cell microprocessor (ALF for Cell BE programmer's guide and API reference), (ALF for hybrid-x86 programmer's guide and API reference).
THE PIC METHOD AND THE PROBLEMS OF ITS PARALLEL IMPLEMENTATION
Particle simulation is a powerful tool for modeling the behaviour of complex non-linear phenomena in plasmas and fluids. In the PIC method, the trajectories of a huge number of test particles are calculated as these particles move under the influence of the electromagnetic fields computed self-consistently on a discrete mesh. These trajectories represent the desired solution of the system of differential equations describing the physical phenomenon under study (Berezin & Vshivkov, 1980; Hockney & Eastwood, 1981). The real physical space is represented by a model of the simulation domain called the space of modeling (SM). The electric field E and magnetic field B are defined as vectors and discretised upon a rectangular mesh (or several shifted meshes, as shown in Figures 1 and 2). Thus, as distinct from other numerical methods on a rectangular mesh, in the PIC method there are two different data structures – particles and meshes. No particle affects another particle directly. At any moment of modeling a particle belongs to a certain cell of each mesh. Each charged particle is characterized by its mass, co-ordinates and velocities. Instead of solving the equations in the 6D space of co-ordinates and velocities, the dynamics of the system is determined by integrating the equations of motion of every particle in a series of discrete time steps. At each time step t_{k+1} = t_k + Δt the following is done:
1. For each particle, the Lorentz force is calculated from the values of the electromagnetic fields at the nearest mesh points (gathering phase);
2. For each particle, the new co-ordinates and velocity are calculated; a particle can move from one cell to another (moving phase);
3. For each particle, the charge carried by the particle to the new cell vertices is calculated to obtain the current charge and density, which are also discretised upon the rectangular mesh (scattering phase);
4. Maxwell's equations are solved to update the electromagnetic field (mesh phase).
The sizes of the time step and of a cell are chosen in such a manner that a particle cannot fly farther than into an adjacent cell within one time step of modeling. The number of time steps depends on the physical experiment. A more detailed description of the PIC method can be found in (Berezin & Vshivkov, 1980; Hockney & Eastwood, 1981). The PIC algorithm offers great potential for parallelisation because all the particles are moved independently. The volume of computation in the first three phases of each time step is proportional to the number of particles; about 90% of the multicomputer resources are spent on particle processing.
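To make the four phases concrete, the following C sketch shows how one time step of the method might be organized; the Particle and Mesh types and the gather/scatter/field-solver helpers are hypothetical placeholders introduced only for illustration, not part of the authors' code.

/* Hypothetical skeleton of one PIC time step (2D case, illustration only). */
typedef struct { double x, y, vx, vy; } Particle;
typedef struct { int nx, ny; double *E, *H, *rho; } Mesh;

/* Helpers assumed to be provided elsewhere (placeholders). */
void gather_and_push(Particle *p, const Mesh *m, double dt); /* phase 1     */
void scatter_charge(const Particle *p, Mesh *m);             /* phase 3     */
void clear_charge(Mesh *m);
void solve_maxwell(Mesh *m, double dt);                      /* phase 4     */

void pic_time_step(Particle *p, int np, Mesh *mesh, double dt)
{
    /* Phase 1: gather fields at the nearest mesh points and compute the
       Lorentz force; Phase 2: move every particle (it may change cell). */
    for (int i = 0; i < np; i++) {
        gather_and_push(&p[i], mesh, dt);
        p[i].x += p[i].vx * dt;
        p[i].y += p[i].vy * dt;
    }

    /* Phase 3: scatter the charge carried by every particle to the
       vertices of its (possibly new) cell. */
    clear_charge(mesh);
    for (int i = 0; i < np; i++)
        scatter_charge(&p[i], mesh);

    /* Phase 4: solve Maxwell's equations on the mesh to update the
       electromagnetic field. */
    solve_maxwell(mesh, dt);
}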
Figure 1. A cell of the SM with the electric E and magnetic B fields, discretised upon shifted meshes
Thus, in order to implement the PIC code on a multicomputer with high performance, an equal number of particles should be assigned for processing to each processor element (PE). However, on MIMD distributed memory multicomputers, the performance characteristics of the PIC code depend crucially on how the mesh and the particles are distributed among the PEs. In order to decrease the communication overheads in the first and the third phases of a time step, it is required that a PE contains both the cells (values of the electromagnetic fields at the mesh points) and the particles located inside them. Unfortunately, in the course of modeling, some particles might fly from one cell to another. To satisfy this requirement, two basic decompositions can be used (Kraeva & Malyshkin, 1997). In the so-called Lagrangian decomposition, an equal number of particles is assigned to each PE with no regard for their position in the SM. In this case, the values of the electromagnetic fields, the current charge and the density at all the mesh points should be copied into every PE; otherwise, the communication overheads in the first and the third phases will decrease the effectiveness of parallelisation. Disadvantages of the Lagrangian decomposition are the following:
• Strict memory requirements;
• Communication overheads at the second phase (to update the current charge and the current density in each PE).
Figure 2. The whole space of modeling (SM) assembled out of cells
In the Eulerian decomposition, each PE contains a fixed rectangular sub-domain, including the electromagnetic fields at the corresponding mesh points and the particles in the corresponding cells. If a particle leaves its sub-domain and flies into another sub-domain in the course of modeling, then this particle should be transferred to the PE containing the latter sub-domain. Thus, even with an equal initial workload of the PEs, after several steps of simulation some PEs might contain more particles than others. This results in load imbalance. The character of the particles' motion depends not only on the equations, but also on the initial particle distribution and the initial value of the electromagnetic field. Many researchers have studied parallel implementation of the PIC method on different multicomputers, and several methods of PIC parallelization have been developed. An extensive list of references to articles devoted to PIC parallelization can be found in (Kraeva & Malyshkin, 1997). In order to reach high performance, these methods take the particle distribution into account. Let us consider some examples of particle distributions, which correspond to different real physical experiments.
• Uniform distribution of the particles in the entire space of modeling.
• The case of a plate. The space of modeling has size n1 × n2 × n3. The particles are uniformly distributed in a k × n2 × n3 sub-space (k << n1).
• Flow. The set of particles is divided into two subsets: the particles with zero initial velocity and the active particles with initially nonzero velocity. The active particles are organized as a flow crossing the space along a certain direction.
• Explosion. There are two subsets of particles. The background particles with zero initial velocities are uniformly distributed in the entire space of modeling. All the active particles form a symmetric cloud (r << h, where r is the radius of the cloud and h is the mesh step). The velocities of the active particles are directed along the radius of the cloud.
The main problem of programming is that the data distribution among the PEs depends not only on the volume of data, but also on the data properties (particle velocities, configuration of the electromagnetic field, etc.). With the same volume of data but different particle distributions inside the space, the data processing is organized in different ways. As the particle distribution is not stable in the course of modeling, the program control and the data distribution among the PEs should change dynamically. It is clear that the parallel implementation of the PIC method on a distributed memory multicomputer strongly demands dynamic load balancing.
BASIC CONCEPTS OF THE TECHNOLOGY OF FRAGMENTED PROGRAMMING
Numerical algorithms in general, and the PIC method in particular, are very suitable for the application of AT. Considering different approaches to the parallel implementation of numerical models, it is always necessary to bear in mind that the constructed parallel programs should possess dynamic properties such as:
1. Non-determinism of execution. The order of process execution is not fully fixed; it is chosen in the course of execution for better use of the multicomputer resources.
2. Dynamic tunability of the program to all the available resources.
3. Dynamic resources assignment.
4. Dynamic load balancing.
5. Program portability.
6. Dynamic behavior of the program.
The program should follow the behavior of the simulated phenomenon. The dynamic properties of a program can be provided through a fragmented representation of the algorithm and of the program; this affects different stages of application program development. It is worth remarking here that only technological solutions, i.e., solutions that can be used in a universal technology of program construction, are selected for inclusion into AT. AT provides a high quality of implementation of any suitable numerical model. However, if a certain numerical model must be implemented with the maximum possible quality, then, with the use of specific algorithms and specific programming techniques, an implementing program of higher performance than that obtained under AT can be developed "manually".
Algorithm and Program Fragmentation
A. An application problem description should be divided into a system of reasonably small atomic fragments, representing the realization entities of a model. Fragments might be represented in programming languages by variables, procedures, subroutines, macros, nets, notions, functions, etc.
B. An atomic fragment (P_fragment) contains both data and code. In other words, a program realizing an application problem is assembled out of such small P_fragments of computation, which are connected through variables for data transfer. Under AT the size of the atomic fragments can be changed from one program execution to another.
C. The fragmented structure of an application parallel program is kept in the executable code and provides the possibility of organizing flexible and high-performance execution of the fragmented program. The general idea is the following. A fragmented program is composed as a set of executable P_fragments. Into every PE a part of the P_fragments is loaded; these constitute the program for the PE. This program is executed inside every PE, looping over all the P_fragments loaded into the PE. If these fragments are small enough, then initially an equal workload can be assembled out of these P_fragments for each PE of the multicomputer. The workload of the PEs can change in the course of computation, and if at least one PE becomes overloaded, then a part of the P_fragments (with the data being processed), which were assigned for execution on the overloaded PE, should migrate to underloaded neighbouring PEs, equalizing the workload of the multicomputer PEs. Provision of dynamic load balancing of a multicomputer, scalability and many other dynamic properties of an application program is based on such fragmentation.
This is of course a general idea only.
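As a rough illustration only, the following C sketch expresses this general idea; the Fragment type and the overload/migration primitives are assumptions made here, not the actual AT runtime interface.

/* Illustration of the general idea: one PE loops over the P_fragments
   loaded into it and ships fragments away when it becomes overloaded. */
typedef struct Fragment {
    void (*run)(struct Fragment *self);  /* encapsulated fine-grain code   */
    double weight;                       /* e.g. number of particles       */
} Fragment;

/* Hypothetical runtime primitives assumed to exist. */
int  pe_is_overloaded(Fragment *const *frags, int n);
void migrate_to_underloaded_neighbour(Fragment *f);

void pe_step(Fragment **frags, int *n)
{
    /* Execute every fragment currently loaded into this PE. */
    for (int i = 0; i < *n; i++)
        frags[i]->run(frags[i]);

    /* If the PE is overloaded, move trailing fragments (with their data)
       to an underloaded neighbour, like liquid levelling out in
       communicating vessels. */
    while (*n > 1 && pe_is_overloaded(frags, *n)) {
        migrate_to_underloaded_neighbour(frags[*n - 1]);
        (*n)--;
    }
}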
Assembling vs. Partitioning
Our basic keyword is assembly. In contrast to partitioning, AT supports explicit assembling of a whole program out of ready-made fragments of computation, rather than dividing a problem, defined as a whole, into suitable fragments to be executed on the different PEs of a multicomputer. These fragments are the elementary blocks, the "bricks", from which a whole program is constructed. The algorithm of the problem assembling is kept and used later for dynamic program/problem parallelization. Assembling defines the "seams" of the whole computation, the way the fragments are connected. Therefore, these seams are the most suitable places to cut the entire computation for parallel execution. For this reason, program parallelization can always be done if an appropriate size of the atomic fragments is chosen.
Separation of the Fine Grain and the Coarse Grain Computations
The fine grain computations are encapsulated inside a module that realizes the computations bound up with an atomic fragment (P_fragment). Such a module can be implemented effectively on a processor element. The whole parallel program is assembled out of these ready-made P_fragments. The set of P_fragments of a program defines a set of interacting processes (the coarse grain computations). Encapsulating the fine grain computations, with all their complexity, inside an atomic fragment makes it possible to formalize the construction of a parallel program and to use an explicitly two-level representation of an algorithm: the programming level inside an atomic fragment and the scheme level at which an application program is assembled.
Explicitly Two-Level Programming
First, suitable atomic fragments of computation are designed, programmed and debugged separately. Then the whole computation (problem solution) is assembled out of these fragments. As the code of an atomic fragment, a sequential library subroutine can be used, for example.
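A minimal sketch of the two levels, using assumed names: the atomic fragment is an ordinary sequential C routine (programming level), and the whole computation is assembled out of calls to it connected through shared variables (scheme level).

/* Programming level: the code of an atomic fragment is an ordinary
   sequential routine, e.g. a library subroutine (illustration only). */
static void cell_update(double *e, double h_left, double h_right)
{
    *e += 0.5 * (h_right - h_left);      /* some fine-grain computation */
}

/* Scheme level: the whole computation is assembled out of fragments
   connected through variables; here a 1D chain of cells. */
void assembled_step(double e[], const double h[], int n)
{
    for (int i = 1; i < n - 1; i++)      /* the "seams" between fragments */
        cell_update(&e[i], h[i - 1], h[i]);
}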
Automatic Providing of Dynamic Properties of a Program
When fragmenting an algorithm, we should try to satisfy two conditions that cannot always be satisfied in general, but that can usually be met for numerical algorithms:
• All the processes should consume an approximately equal volume of resources.
• On the set of all the processes there should exist a partial ordering relation < such that the processes interact only with their neighbours, i.e., a process pi can interact with a process pj iff pi < pj and there is no process p with pi < p < pj.
Explicit numerical algorithms on a rectangular mesh practically always permit such a fragmentation. Execution of the set of P_fragments can be organized so that their behaviour imitates the behaviour of a liquid in a system of communicating vessels. This is the technological basis for solving, in a uniform way, the problem of automatically providing the dynamic properties of a target program.
Figure 3. Decomposition of SM for implementation of the PIC on the line of PEs
Computation and Communication in Parallel
A sufficiently large number of P_fragments is loaded into each PE. As a result, if one of the P_fragments starts a communication, the others can continue computing. Therefore, in the course of program execution most communications can be done in parallel with computations; this is, in effect, multiprogramming on the set of P_fragments. Many other properties of a program, such as the accumulation of subroutine libraries for any platform, program scalability, memory use optimization and so on, can also be provided by fragmentation.
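One way such overlap can be expressed with MPI non-blocking calls is sketched below; the fragment layout, compute_fragment(), and the neighbour ranks are illustrative assumptions, not the AT implementation.

#include <mpi.h>

void compute_fragment(double *frag_data);   /* assumed, provided elsewhere */

/* While the boundary fragment's data is in flight, the remaining fragments
   of this PE keep computing; the communication is thus overlapped. */
void step_with_overlap(double *send_buf, int scount, double *recv_buf,
                       int rcount, double **interior, int n_interior,
                       int left, int right)
{
    MPI_Request req[2];

    /* one fragment starts its communication ... */
    MPI_Isend(send_buf, scount, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recv_buf, rcount, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

    /* ... and the other fragments continue their computation meanwhile. */
    for (int i = 0; i < n_interior; i++)
        compute_fragment(interior[i]);

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    /* now the boundary fragment can be processed with the received data */
}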
Separation of Semantics and Scheme of Computation
The fine grain computations define a substantial part of the semantics (the functions) of the computation. They are realized within an atomic fragment. Therefore, at the coarse grain level, only a scheme (non-interpreted or semi-interpreted) of the computation is assembled. This means that formal methods of automatic synthesis of parallel programs (Valkovskii & Malyshkin, 1988) can be successfully used and that standard schemes of parallel computations can be accumulated in libraries.
PARALLELIZATION OF NUMERICAL METHODS WITH AT
Let us consider the assembly approach to the parallelisation of numerical algorithms on the example of the parallel implementation of the PIC method. The line and the 2D grid structures of interprocessor communication of a multicomputer are sufficient for the effective parallel implementation of numerical methods on rectangular meshes. In (Valkovskii & Malyshkin, 1988) an algorithm for mapping a 2D grid into a hypercube while keeping the neighbourhood of PEs is given. A cell is a natural atomic fragment of computation for the implementation of a numerical method. It contains both data (the particles inside the cells of a fragment and the values of the electromagnetic fields and the current density at their mesh points) and the procedures which operate on these data.
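At the algorithm-description level, such a cell could be pictured roughly as the following C structure; the field names and the per-cell particle bound are purely illustrative assumptions (in the actual implementation, described below, the particles and mesh values of a minimal fragment are merged into larger arrays).

/* Illustrative data part of a cell used as an atomic fragment (P_fragment):
   mesh values at its points plus the particles currently inside it. */
#define MAX_PARTICLES_PER_CELL 256   /* assumed bound, illustration only */

typedef struct {
    double ex, ey, ez;          /* electric field components                */
    double hx, hy, hz;          /* magnetic field components (shifted mesh) */
    double charge, density;     /* current charge and density               */
    int    n_particles;
    double px[MAX_PARTICLES_PER_CELL][3];  /* particle co-ordinates         */
    double pv[MAX_PARTICLES_PER_CELL][3];  /* particle velocities           */
} Cell;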
Figure 4. Decomposition of SM for implementation of PIC on the 2D grid of PEs
For the PIC method, when a particle moves from one cell to another, it should be removed from the former cell and added to the latter. Thus, we can say that under AT the Eulerian decomposition is implemented.
PIC Parallelization on the Line of PEs
Let us first consider how the PIC method is parallelized for multicomputers with a line structure of interprocessor communication. The three-dimensional simulation domain is initially partitioned into N blocks (where N is the number of PEs). Each block_i consists of several adjacent layers of cells and contains approximately the same number of particles (Figure 3). When the load imbalance becomes critical, some layers of the block located in an overloaded PE are transferred to a less loaded PE. In the course of modeling, adjacent blocks are located in linked PEs; therefore, adjacent layers are located in the same or in linked PEs. This is important for the second phase, in which some particles can fly from one cell into another, and for the fourth phase, in which the values in adjacent cells are also used for recalculating the values of the electromagnetic fields in a certain cell.
Figure 5. Virtual layers for implementation of PIC on the 2D grid of PEs
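The following C fragment sketches how an overloaded PE might hand whole boundary layers over to its less loaded neighbour; the layer-weight array, the transfer condition and the send primitive are assumptions made for illustration only.

void send_layer_to_neighbour(int layer_index);   /* assumed primitive */

/* Move boundary layers (with their particles) to the neighbour while this
   PE still holds noticeably more particles than the neighbour does. */
void rebalance_line(const long layer_weight[], int *my_nlayers,
                    long *my_np, long *nbr_np)
{
    while (*my_nlayers > 1) {
        long w = layer_weight[*my_nlayers - 1];  /* weight of boundary layer */
        if (*my_np <= *nbr_np + w)               /* transfer would not help  */
            break;
        send_layer_to_neighbour(*my_nlayers - 1);
        *my_np  -= w;
        *nbr_np += w;
        (*my_nlayers)--;
    }
}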
Figure 6. Direction of data transfer for implementation of the PIC on the 2D grid
PIC Parallelization on the 2D Grid of PEs
Let us now consider the parallelization of the PIC method for a 2D grid of PEs. Let the number of PEs be equal to l × m. The SM is then divided into l blocks orthogonal to a certain axis. Each block consists of several adjacent layers and contains about NP/l particles (in the same way as was done for the line of PEs). Block_i is assigned for processing to the i-th row of the 2D grid (Figure 4). Blocks are formed so as to provide an equal total workload for every row of the processor grid. Then every block_i is divided into m sub-blocks block_i_j, which are distributed for processing among the m PEs of the row. These sub-blocks are composed so as to provide an equal workload for every PE of the row. If an overload of at least one PE occurs in the course of modeling, this PE is able to recognize it at the moment when its number of particles substantially exceeds NP/(l × m). This PE then initiates the re-balancing procedure. If the number of layers k is not much greater than N (or than l in the case of the grid of PEs), it is difficult or even impossible to divide the SM into blocks with an equal number of particles. Also, if the particles are concentrated inside a single cell, it is definitely impossible to divide the SM into equal sub-domains. In order to attain a better load balance, the following modified domain decomposition is used. A layer containing more than the average number of particles is copied into 2 or more neighbouring PEs (Figure 5) – these are virtual layers. The set of particles located inside such a layer is distributed among all these PEs. In the course of load balancing, the particles inside the virtual layers are the first to be redistributed among the PEs, and only if necessary are the layers themselves also redistributed. For the computations inside a PE, there is no difference between virtual and non-virtual layers. Any layer can become virtual in the course of modeling, and a virtual layer can cease to be virtual. We can see that in both cases there is no necessity to support the migration of an individual cell. A cell is a very small fragment, and too many resources would be spent to support its migration. Thus, it is necessary to use bigger indivisible fragments at the step of execution (not at the step of problem/program assembling!). In the case of the line of PEs, a layer of the SM should be chosen as the indivisible fragment of a concrete implementation of PIC. Such a fragment is called a minimal fragment. For the PIC implementation on the grid of PEs, a column is taken as the minimal indivisible fragment. The procedure realizing a minimal fragment is composed statically out of P_fragments, before the whole program is assembled. This essentially improves the performance of the executable code. In this way, a cell is used as the atomic fragment at the step of the numerical algorithm description.
Figure 7. Unification of the Hx mesh variables for two minimal fragments (layers)
At the step of execution of the numerical algorithm, different minimal fragments, assembled out of atomic fragments, are chosen depending on the architecture of the multicomputer.
General PIC Method Fragmentation
General PIC method fragmentation is based on dividing the SM into parallelepipeds. The size of a parallelepiped is chosen in such a way that several fragments can be loaded into every PE. All the other notions (equal workload, virtual fragments and so on) are defined in the same way. This type of fragmentation is suitable for any current multicomputer.
IMPLEMENTATION OF THE PIC METHOD ON MULTICOMPUTERS
Using AT, the PIC method has been implemented on different multicomputer systems. In order to provide good portability, the C language was chosen for the parallel PIC code implementation. For the dynamic load balancing of PIC, several algorithms were developed (Kraeva & Malyshkin, 1999; Kraeva & Malyshkin, 2001). In the cases of the grid communication structure (Figure 6) and of virtual fragments (where particles might fly not only to the neighbouring PEs), special tracing functions are used. According to AT, the array of particles is divided into N parts, where N is the number of minimal fragments (layers of the SM in the case of the line of PEs, and columns or parallelepipeds in the case of the 2D grid). When elements of the mesh variables (electromagnetic fields, current charge and density) of different minimal fragments hit upon the same point of the SM, they are unified (Figure 7). The elements of the mesh variables of one minimal fragment are not added to the data structure of this fragment, but are stored in 3D arrays together with the elements of the mesh variables of the other minimal fragments in the PE. This is possible due to the rectangular shape of the blocks block_i (block_i_j in the case of the 2D grid). Such a decision allows us to decrease the memory requirements and to speed up computations during the fourth phase of the PIC algorithm. In the case of dynamic load balancing, when some minimal fragments are transferred from one PE to another, the size of the 3D arrays of mesh variable elements changes dynamically. This demands a special implementation of such dynamic arrays. In the case of mesh fragmentation into parallelepipeds, dynamic load balancing is achieved by fragment migration only.
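A possible C realization of such a dynamically resizable 3D array for one mesh variable is sketched below; the layout (a flat array indexed by layer, row and column) is our assumption, not the authors' data structure.

#include <stdlib.h>
#include <string.h>

/* One mesh variable of a PE, covering all minimal fragments (layers) it
   currently owns; the array is re-sized when layers arrive or leave. */
typedef struct {
    int nlayers, ny, nz;       /* current local extent                      */
    double *v;                 /* element (l, j, k) is v[(l*ny + j)*nz + k] */
} MeshVar;

int meshvar_resize(MeshVar *m, int new_nlayers)
{
    size_t plane = (size_t)m->ny * m->nz;
    double *nv = realloc(m->v, new_nlayers * plane * sizeof *nv);
    if (nv == NULL && new_nlayers > 0)
        return -1;                                  /* out of memory */
    if (new_nlayers > m->nlayers)                   /* zero the new layers */
        memset(nv + (size_t)m->nlayers * plane, 0,
               (new_nlayers - m->nlayers) * plane * sizeof *nv);
    m->v = nv;
    m->nlayers = new_nlayers;
    return 0;
}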
DYNAMIC LOAD BALANCING
To attain high performance of the parallel PIC implementation, a number of centralized and decentralized load balancing algorithms were specially developed.
Initial Load Balancing
A layer of cells is chosen as the minimal fragment for the PIC implementation on the line of PEs. Each minimal fragment has its own weight, equal to the number of particles in this fragment. The sum of the weights of all the minimal fragments in a PE determines the workload of that PE. The 3D simulation domain is initially partitioned by a certain algorithm into N blocks (where N is the number of PEs). Each block consists of several adjacent minimal fragments and contains approximately the average number of particles. For the initial load balancing, two heuristic centralized algorithms were designed (Kraeva & Malyshkin, 1999; Kraeva & Malyshkin, 2001). These algorithms employ information about the weights of all the minimal fragments. Each PE has this information and constructs the workload card by the same algorithm. The workload card contains the list of the minimal fragments that should be loaded onto each PE. If the number of minimal fragments is much greater than the number of PEs, it is usually possible to distribute the minimal fragments among the PEs in such a way that every block contains approximately the same number of particles. If a considerable portion of the particles is concentrated inside a single cell, it is impossible to divide the SM into blocks with an approximately equal workload; to solve this problem, the notion of a virtual layer is introduced, and the centralized algorithm was modified for the case of virtual fragments. If an overload of at least one PE occurs in the course of modeling, this PE is able to recognize it at the moment when its number of particles substantially exceeds NP/N. This PE then initiates the rebalancing procedure.
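The two heuristic algorithms themselves are described in the cited papers; the sketch below shows only one simple greedy variant, with assumed names, to illustrate that every PE can compute the same workload card deterministically from the same weight information.

/* Deterministic greedy construction of a workload card: contiguous groups
   of minimal fragments, each holding roughly NP/N particles (sketch only). */
void build_workload_card(const long weight[], int n_frag,
                         int n_pe, int first_frag_of_pe[])
{
    long total = 0;
    for (int f = 0; f < n_frag; f++)
        total += weight[f];

    long acc = 0;
    int  pe  = 0;
    first_frag_of_pe[0] = 0;
    for (int f = 0; f < n_frag && pe < n_pe - 1; f++) {
        acc += weight[f];
        /* close the current block once it has reached its share of NP */
        if (acc * n_pe >= (long)(pe + 1) * total)
            first_frag_of_pe[++pe] = f + 1;
    }
    while (pe < n_pe - 1)                /* degenerate case: too few fragments */
        first_frag_of_pe[++pe] = n_frag;
}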
Dynamic Load Balancing
If a load imbalance occurs, the procedure BALANCE is called. In this procedure, the decision about the direction of data transfer and the volume of data to be transferred is taken. Each load balancing algorithm has its own realization of the procedure BALANCE. The procedure TRANSFER is used for the data transfer itself. There are two implementations of this procedure: for the line of PEs and for the grid of PEs. The procedure TRANSFER is the same for any load-balancing algorithm on the line of PEs; its parameters are the number of particles to be exchanged and the direction of data transfer. Let us consider the algorithms of dynamic load balancing of the PIC. All the PEs are numbered. In the case of the line of PEs, each PE has a number i, where 0 ≤ i < number_of_PEs. In the case of an (l × m) grid of PEs, the number of a PE is the pair (i, j), where 0 ≤ i < l and 0 ≤ j < m. The layers and columns of the SM are numbered in the same way.
Centralized Dynamic Load Balancing Algorithm
For the dynamic load balancing, the initial load balancing algorithm can be used. One of the PEs collects the information about the weights of the minimal fragments and broadcasts this information to all the other PEs. All the PEs build the new workload card.
Figure 8. Review window for Hx mesh variable of an atomic fragment of computation
After that, neighbouring PEs exchange minimal fragments according to the information in the new workload card.
Imbalance threshold. If centralized algorithms are used for the dynamic load balancing, the PEs exchange load balancing information, so every PE knows the number of particles in all the PEs. In every PE, the difference mnp − NP/N (where mnp is the maximum number of particles in a PE, NP is the total number of particles, and N is the number of PEs) is calculated. If the difference is greater than a threshold Th, the procedure BALANCE is called. The threshold can be a constant chosen in advance, or an adaptive value. In the latter case, initially Th = 0. In the course of modeling, the time t_part required to carry out steps (1–3) of the PIC algorithm for one particle is measured. After every BALANCE call, the time of balancing t_bal is measured, and Th is assigned the value t_bal/t_part (how many particles could be processed in the same time that one balancing requires). After each subsequent step of the PIC algorithm, Th is decreased by mnp − NP/N. When the value of Th becomes negative, BALANCE is called. If the threshold is always equal to zero, the procedure BALANCE is called after each time step of modeling.
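The adaptive threshold can be expressed in a few lines of C; balance() and the measured per-particle time are placeholders assumed to be provided elsewhere.

double balance(void);     /* BALANCE procedure, assumed to return its own duration */

static double Th = 0.0;   /* adaptive imbalance threshold */

/* Called once after every PIC time step.  mnp is the maximum number of
   particles held by any PE, NP the total number of particles, N the number
   of PEs; t_part is the measured per-particle time of steps (1-3). */
void check_imbalance(long mnp, long NP, int N, double t_part)
{
    Th -= (double)(mnp - NP / N);      /* pay for the imbalance of this step */
    if (Th < 0.0) {
        double t_bal = balance();      /* rebalance and measure its cost     */
        /* number of particle updates that cost as much as one balancing */
        Th = t_bal / t_part;
    }
}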
Decentralized Dynamic Load Balancing Algorithm for a Constant Number of Particles
The use of centralized algorithms is good enough for multicomputers containing a few PEs. However, if the number of PEs is large, the communication overheads can neutralize the advantages of dynamic load balancing. In this case it is preferable to use decentralized algorithms. Such algorithms use information about the load balance only in a local domain of PEs. If the number of test particles does not change in the course of modeling, a simple decentralized algorithm can be suggested. Each PE keeps track of how many particles were moved to/from its neighbouring PEs. To equalize the load, it is sufficient simply to receive/send the same number of particles from/to the neighbouring PEs. It should be noted that this algorithm works only in the case of virtual fragments.
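A sketch of this rule in C, with assumed communication helpers: each PE only needs the net number of particles that crossed each of its borders during the step.

enum Side { LEFT, RIGHT };
void send_particles(enum Side to, int count);      /* assumed helpers */
void receive_particles(enum Side from, int count);

/* net_in_left/right: net number of particles this PE gained over each
   border during the time step (negative means it lost particles). */
void decentralized_balance(int net_in_left, int net_in_right)
{
    if (net_in_left  > 0) send_particles(LEFT,  net_in_left);
    if (net_in_left  < 0) receive_particles(LEFT,  -net_in_left);
    if (net_in_right > 0) send_particles(RIGHT, net_in_right);
    if (net_in_right < 0) receive_particles(RIGHT, -net_in_right);
}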
Specialized Decentralized Algorithm
To equalize the load for the PIC implementation, the following specialized algorithm was designed. In the course of the simulation, every PE calculates the main direction of particle motion from the values of the particle velocities. In the load balancing, particles are transferred in the direction opposite to the main direction. As in the previous algorithm, it is assumed that the number of particles does not change in the course of the simulation. The number of particles to be transferred from an overloaded PE to its neighbours in the direction opposite to the main direction is calculated from the average number of particles and the number of particles in the PE. Some particles are transferred in advance, in order to reduce the number of calls of the dynamic load balancing procedure. This is a case of dynamic behavior of the program, when the program "feels" the behavior of the model.
Diffusive Load Balancing Algorithms
The basic diffusive load-balancing algorithm was implemented and tested for the parallel PIC implementation (Kraeva & Malyshkin, 1999; Kraeva & Malyshkin, 2001; Kuksheva, Malyshkin, Nikitin, Snytnikov, Snytnikov, & Vshivkov, 2001). The size of the local domain is equal to two. Any diffusive algorithm is characterized by its number of steps; the number of steps defines how far data from one PE can be transferred in the course of load balancing. For every step the procedure TRANSFER is called. The greater the number of steps of the diffusive algorithm, the better the load balance that can be attained, but also the more time is required for load balancing. The tests have shown that the total program time does not decrease with the growth of the number of steps.
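A single sweep of a basic diffusion scheme on the line of PEs might look as follows; the one-half diffusion coefficient and the transfer helper are illustrative assumptions.

enum Dir { TO_LEFT, TO_RIGHT };
void transfer_particles(enum Dir dir, long count);  /* calls TRANSFER, assumed */

/* One diffusive step: every PE pushes half of the weight difference towards
   each lighter neighbour.  Repeating the step lets load travel further along
   the line, at the price of more balancing time. */
void diffusive_step(long my_weight, long left_weight, long right_weight)
{
    long d_left  = (my_weight - left_weight)  / 2;
    long d_right = (my_weight - right_weight) / 2;

    if (d_left  > 0) transfer_particles(TO_LEFT,  d_left);
    if (d_right > 0) transfer_particles(TO_RIGHT, d_right);
}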
AUTOMATIC GENERATION OF PARALLEL CODE
The PIC method is applied to the simulation of many natural phenomena. In order to facilitate the parallel PIC implementation, a special system for the automatic generation of parallel programs was designed. This system consists of the VISual system of parallel program Assembling (VisA) and a parallel code generator in the C language. The process of generating a parallel program for the PIC method (and likewise for other numerical algorithms on rectangular meshes) consists of three steps. At the first step, a user defines the atomic fragment of computation – a cell of the SM. This cell contains elements of the mesh variables at several mesh points, an array of particles (for the PIC method) and procedures in the C language that describe all the computations inside the cell – {procedure1, …, procedurek} (Figure 8). At the second step, the assembling of the minimal fragments out of atomic fragments is described in the visual system VisA, after which the whole computation is assembled in the way a wall is assembled out of bricks. The generator then constructs a program implementing the defined minimal fragment (a layer, a column or a parallelepiped). The particle arrays of the atomic fragments are merged into a single particle array for the minimal fragment. The elements of the mesh variables which hit upon the same point of the SM are unified; in this way, for every mesh variable, only one 3D array of its elements is formed. At the third step, the decision on the PEs' workload is made.
The generator creates a parallel program implementing the whole computation for a target multicomputer. This program includes data initialization and a time loop. At each iteration of the time loop, k loops (where k is the number of procedures in the description of an atomic fragment) over all the minimal fragments of the PE are run. After each of the k loops, those elements of the mesh variables that are copied in several PEs are updated (if necessary). All the particles are stored in m arrays (where m is the number of minimal fragments in the given PE). However, similarly to the case of minimal fragment assembling, the elements of a mesh variable in all the minimal fragments of a PE form one 3D array. The user develops the procedures (the computations inside a cell) in the C language, also using several additional statements for defining the computations over the mesh variables.
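The overall shape of the generated code might therefore be sketched as follows; the Fragment type and the helper routines are hypothetical names used only to mirror the structure just described.

typedef struct Fragment Fragment;          /* minimal fragment, opaque here */

void init_data(Fragment **frags, int m);                       /* assumed */
void run_procedure(int p, Fragment *f);     /* cell procedure p on fragment */
void update_replicated_mesh_values(int p);  /* refresh copies held by other PEs */

/* Time loop generated for one PE: at every time step each of the k cell
   procedures is applied, in turn, to all m minimal fragments of the PE. */
void generated_main(Fragment **frags, int m, int k, int max_timesteps)
{
    init_data(frags, m);
    for (int t = 0; t < max_timesteps; t++) {
        for (int p = 0; p < k; p++) {
            for (int f = 0; f < m; f++)
                run_procedure(p, frags[f]);
            update_replicated_mesh_values(p);  /* only if this phase needs it */
        }
    }
}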
CONCLUSION
AT provides high performance of the assembled program's execution, high flexibility in reconstruction of the code, and dynamic tunability to the available resources of a multicomputer. The high performance of program execution enables the modeling of large scale problems such as the study of a plasma cloud explosion in a magnetized background, the modeling of the interaction of a laser impulse with plasma, the solution of astrophysical problems, etc. We have applied AT to the implementation of different numerical methods and hope to create a general tool to support the implementation of mathematical approximating models. Finally, the question can be posed: "How many numerical algorithms can be fragmented?" The answer to this question can be found in (Malyshkin, Sorokin & Chauk, 2008): any mass numerical algorithm can be fragmented, but with different results. In order to reach a good result, much effort must be made. Very often a deep modification of the initial algorithms must be done, similar to the modification of algorithms for their parallelization. But this is another topic for consideration.
REFERENCES ALF for Cell BE Programmer’s Guide and API Reference. Retrieved from http://www01.ibm.com/chips/ techlib/techlib.nsf/techdocs/41838EDB5A15CCCD002573530063D465 ALF for Hybrid-x86 Programmer’s Guide and API Reference. Retrieved from http://www01.ibm.com/ chips/techlib/techlib.nsf/techdocs/389BBE99638335B80025735300624044 Berezin, Y. A., & Vshivkov, V. A. (1980). The method of particles in rarefied plasma dynamic. Novosibirsk, Russia: Nauka (Science). Corradi, A., Leonardi, L., & Zambonelli, F. (1997). Performance comparison of load balancing policies based on a diffusion scheme. In Proc. of the Euro-Par’97 (LNCS Vol. 1300). Springer: Germany. Hockney, R., & Eastwood, J. (1981). Computer simulation using particles. London: McGraw-Hill, Inc.
Kedrinskii, V. K., Vshivkov, V. A., Dudnikova, G. I., Shokin, Yu. I., & Lazareva, G. G. (2004). Focusing of an oscillating shock wave emitted by a toroidal bubble cloud. Journal of Experimental and Theoretical Physics, 98(6), 1138–1145. doi:10.1134/1.1777626 Kraeva, M. A., & Malyshkin, V. E. (1997). Implementation of PIC method on MIMD multicomputers with assembly technology. In Proc. of the High Performance Computing and Networking Europe 1997 Int. Conference. (LNCS, Vol.1255), (pp. 541-549). Berlin: Springer Verlag. Kraeva, M. A., & Malyshkin, V. E. (1999). Algorithms of parallel realization of PIC method with assembly technology. In Proceedings of 7th High Performance Computing and Networking Europe, (LNCS Vol. 1593), (pp. 329-338). Berlin: Springer Verlag. Kraeva, M. A., & Malyshkin, V. E. (2001). Assembly technology for parallel realization of numerical models on MIMD-multicomputers. International Journal on Future Generation Computer Systems, Elsevier Science, 17(6), 755–765. doi:10.1016/S0167-739X(00)00058-3 Kuksheva, E. A., Malyshkin, V. E., Nikitin, S. A., Snytnikov, A. V., Snytnikov, V. N., & Vshivkov, V. A. (2005). Supercomputer simulation of self-gravitating media. International Journal on Future Generation Computer Systems, 21(5), 749–758. doi:10.1016/j.future.2004.05.019 Malyshkin, V. (2006). How to create the magic wand? Currently implementable formulation of the problem. In New Trends in Software Methodologies, Tools and Techniques, Proceedings of the Fifth SoMeT_06, 147, 127-132. Malyshkin, V. E. (1995). Functionality in ASSY system and language of functional programming. In Proceedings of the First Aizu International Symposium on Parallel Algorithms/Architecture Synthesis. (pp. 92-97). Aizu-Wakamatsu, Japan: IEEE Comp. Soc. Press. Malyshkin V.E., Sorokin S.B., & K.G.Chauk (2008, May). Fragmented numerical algorithms for the library parallel standard subroutines. Accepted to publication in Siberian Journal of Numerical Mathematics, Novosibirsk, Russia. Snytnikov, V. N., Vshivkov, V. A., Kuksheva, E. A., Neupokoev, E. V., Nikitin, S. A., & Snytnikov, A. V. (2004). Three-dimensional numerical simulation of a nonstationary gravitating n-body system with gas. Astronomy Letters, 30(2), 124–138. doi:10.1134/1.1646697 Valkovskii, V. A., & Malyshkin, V. E. (1988). Synthesis of parallel programs and systems on the basis of computational models. Novosibirsk, Russia: Nauka. Vshivkov, V. A., Nikitin, S. A., & Snytnikov, V. N. (2003). Studying instability of collisionless systems on stochastic trajectories. JETP Letters, 78(6), 358–362. doi:10.1134/1.1630127 Walker, D. W. (1990). Characterising the parallel performance of a large-scale, particle-in-cell plasma simulation code. International Journal on Concurrency: Practice and Experience., 2(4), 257–288. doi:10.1002/cpe.4330020402
KEY TERMS AND DEFINITIONS
Assembly Technology: A technology for the development of parallel programs for large scale numerical simulation, based on assembling the whole computation out of atomic fragments of computation. The technology integrates the well known techniques of modular programming and domain decomposition and is supported by system software.
Cluster: A multicomputer with a tree structure of the communication net.
Dynamic Load Balancing: Equalizing the workload of the multicomputer's processor elements in the course of a program's execution in order to reach better multicomputer performance.
Dynamic Tunability of a Program to All the Available Resources: A program should be able to use all the available resources of a multicomputer.
Multicomputer: A set of computers connected by a communication net and able, with the use of special system software, to solve the same application problem jointly. Well known examples of a multicomputer's communication net are the rectangular mesh, tree, torus and hypercube.
Parallel Programming: The development of programs able to be executed on multicomputers.
Particle-In-Cell Method: A widely used numerical method for the direct simulation of natural phenomena in which the material is represented by a huge number of test particles. Instead of solving the system of partial differential equations in the 6D space of co-ordinates and velocities, the dynamics of the simulated phenomenon is determined by integrating the equations of motion of every particle in a series of discrete time steps. The method became applicable only with the use of supercomputers.
Chapter 14
Cell Processing for two Scientific Computing Kernels
Meilian Xu, University of Manitoba, Canada
Parimala Thulasiraman, University of Manitoba, Canada
Ruppa K. Thulasiram, University of Manitoba, Canada
ABSTRACT
This chapter uses two scientific computing kernels to illustrate the challenges of designing parallel algorithms for one heterogeneous multi-core processor, the Cell Broadband Engine processor (Cell/B.E.). It describes the limitations of current parallel systems that use single-core processors as building blocks. These limitations deteriorate the performance of applications with data-intensive and computation-intensive kernels such as Finite Difference Time Domain (FDTD) and Fast Fourier Transform (FFT). FDTD is a regular problem with a nearest neighbour communication pattern under a synchronization constraint. FFT based on the indirect swap network (ISN) modifies the data mapping of the traditional Cooley-Tukey butterfly network to improve data locality, hence reducing the communication and synchronization overhead. The authors hope to unleash the Cell/B.E. and design parallel FDTD and parallel FFT based on ISN by taking into account unique features of the Cell/B.E. such as its eight SIMD processing units on a single chip and its high-speed on-chip bus.
DOI: 10.4018/978-1-60566-661-7.ch014
INTRODUCTION
High performance computing (HPC) clusters provide increased performance by splitting the computational tasks among the nodes in the cluster and have been commonly used to study scientific computing applications. These clusters are cost effective, scalable and run standard software libraries such as MPI which are specifically designed to develop scientific application programs on HPC. They are also comparable in performance and availability to supercomputers. A typical example is the Beowulf cluster
which uses commercial off-the-shelf computers to produce a cost-effective alternative to a traditional supercomputer. In the list of the top 500 fastest computers at top500.org, many entries are pure clusters. One of the crucial issues in clusters is the communication bandwidth. High speed interconnection networks such as InfiniBand have paved the way for increased performance gains in clusters. However, the development trend in clusters has been greatly influenced by hardware constraints leading to three walls, collectively called the brick wall (Asanovic et al., 2006). According to Moore's law, the number of transistors on a chip doubles every 18 to 24 months. However, the speed of processor clocks has not kept up with the increase in transistor counts. This is due to the physical constraints imposed on clock speed increases. For example, too much heat dissipation leads to complicated cooling techniques to prevent the hardware from deteriorating, and too much power consumption deters customers from adopting new hardware, increasing the cost of commodity applications. Power consumption doubles with the doubling of the operating frequency, leading to the first of the three walls, the power wall. On the other hand, even with the increased processor frequency achieved so far, system performance has not improved significantly in comparison to the increased clock speeds. In many applications, the data size operated on by each processor changes dynamically, which in turn affects the computational requirements of the problem, leading to communication/synchronization latencies and load imbalance. Multithreading is one way of tolerating latencies. However, previous research (Thulasiram & Thulasiraman, 2003; Thulasiraman, Khokhar, Heber, & Gao, 2004) has indicated that though multithreading solves the latency problem to some extent by keeping all processors busy exploiting parallelism in an application, it has not been enough. Accessing data in such applications greatly affects memory access efficiency due to non-uniform memory access patterns that are unknown until runtime. In addition, the gap between processor speed and memory speed is widening, as processor speed increases more rapidly than memory speed, leading to the second wall, the memory wall. To solve this problem, many memory levels are incorporated, which requires exotic management strategies. However, the time and effort required to extract the full benefits of these features detracts from the effort exerted on real coding and optimization. Furthermore, it has become a very difficult task for algorithm designers to fully exploit instruction level parallelism (ILP) to utilize processor resources effectively and keep the processors busy. Solutions to this problem have relied on deep pipelines with out-of-order execution. However, this approach impacts the performance of the algorithm due to the high penalty paid on wrong branch predictions. This leads to the third wall, the ILP wall. These three walls force architecture designers to develop solutions that can sustain the requirements imposed by applications and address some of the problems imposed by hardware in traditional multiprocessors. A multi-core architecture is one of the solutions to tackle the three walls. These architectures are driven by the need for decreased power consumption, increased operations/watt and Moore's Gap. A multi-core architecture consists of a multi-core processor, which is also called a chip-level multiprocessor (CMP).
A multi-core processor combines two or more independent cores into a single die. It is a new architecture and cannot be regarded as a new SMP (Symmetric MultiProcessor) architecture since all cores in this architecture share on-chip resources while separate processors in the conventional SMP do not. For example, each core of AMD Opteron dual-core processor has its own L2 cache, but the two cores still share other interconnect to the rest of system such as the memory controller. These dual-core processors belong to homogeneous multi-core processors because the resources and execution units (or cores) are mere replications of each other. The number of cores on a single die is still growing. Quad-Core Intel Xeon processor and Quad-Core AMD Opteron processor are already available. Cyclops64 has as many as 64 homogeneous cores on a single chip, which is usually known as a many-core architecture. On the
other hand, the IBM Cell Broadband Engine (Cell/B.E.) processor is a heterogeneous multi-core processor (Chen, Raghavan, Dale, & Iwata, 2007), which has one conventional microprocessor, the Power Processor Element (PPE), and eight SIMD co-processing elements called Synergistic Processor Elements (SPEs). The PPE and SPEs use different Instruction Set Architectures (ISAs). These devices communicate with one another over an ultra-high-speed broadband connection called the Element Interconnect Bus (EIB). The PPE, a superscalar RISC processor, acts as the central controller for the SPEs and provides multithreaded support to better utilize the resources of modern processor architectures. Just as the neuron cells in the brain work together, the Cell incorporates many electronic devices that work together as a complete system. The Cell is, therefore, a System-on-Chip or heterogeneous multi-core architecture. The concept of multi-core architectures and its implementations have paved the way to building tera- and peta-scale supercomputer systems. Los Alamos Roadrunner, an Opteron-Cell hybrid supercomputer, aims at providing a sustained petaflop supercomputer based on AMD Opteron multi-core processors and Cell/B.E. processors. The concept of multiprocessors is not new. It has existed in other hardware designs such as GPUs (Graphics Processing Units), FPGAs (Field Programmable Gate Arrays), and network processors. In this chapter, we design and develop parallel algorithms for two scientific computing kernels, FDTD (Finite-Difference Time-Domain) and FFT (Fast Fourier Transform), on a multicore architecture, in particular the Cell/B.E. FDTD is a regular scientific computing problem which has its applications in electromagnetic theory and medical imaging (Xu, Sabouni, Thulasiraman, Noghanian, & Pistorius, 2007). It follows a nearest neighbour communication pattern and is synchronous in nature. The FDTD is computationally data intensive and is usually a kernel in the applications. Therefore, improving the FDTD algorithm is very important to the overall performance of the application. The FFT is a semi-irregular problem and a kernel in many applications such as computed tomography and option pricing in finance (Barua, Thulasiram, & Thulasiraman, 2005). The partners of the butterfly computation change at each iteration, thereby changing the communication pattern at each iteration. In the FFT algorithm the processors can be only one iteration ahead of their neighbouring processors. In this chapter we explain the Indirect Swap Network (ISN) technique, an idea proposed in VLSI circuits that can be efficiently used to compute the butterfly computations in FFT. Data mapping in the swap network topology reduces the communication overhead by half at each iteration compared to the traditional Cooley-Tukey algorithm. The rest of the chapter is organized as follows. Section 0 introduces in detail the Cell/B.E. which brings new challenges to parallel algorithm design. FDTD is described in Section 0. The parallel FDTD algorithms for distributed memory machines and homogeneous multicore processors are provided in Section 0 and Section 0 respectively. Section 0 explains the parallel algorithm design on the Cell/B.E. The experimental results of these three parallel algorithms are presented in Section 0. FFT is described in Section 3. An introduction to FFT is provided in Section 3. The indirect swap network is explained in Section 3. The algorithm based on ISN is parallelized on the Cell/B.E.
and is explained in Section 4 followed by experimental results in Section 5. The experience of exploiting multicore processors for these two scientific computing kernels is summarized in Section 8 which concludes this chapter.
CELL BROADBAND ENGINE PROCESSOR
Applications that require streamlining of data and instructions are better suited to vector processors such as the Cray X-MP supercomputers that existed from the 1980s to the 1990s. In recent years, there have been several other vector computer architectures such as the NEC SX series, Cray X1, Fujitsu vector systems,
and Hitachi SR8000, all emulating vector architectures. Furthermore, the SSE instructions in regular Intel processors introduce vector instructions (even if for very short vector lengths) to regular processor chips. The Cell/B.E. is also an architecture that supports vector operations (Chen et al., 2007). One of the Cell/B.E.'s unique features, Single Instruction Multiple Data (SIMD) computing, allows data level parallelism and moves towards vector processing. The Cell/B.E. processor is the first implementation of the Cell Broadband Engine Architecture (CBEA) (Chen et al., 2007). CBEA was designed to address some of the issues related to the three walls existing in conventional uni-processor systems. The Cell/B.E. processor is a heterogeneous multi-core processor. It consists of one conventional 64-bit Power Processor Element (PPE), eight Synergistic Processor Elements (SPEs), a memory controller, an I/O controller, and an on-chip coherent bus, the EIB (Element Interconnect Bus), which connects all elements on the single chip. The eight SPEs are purposely designed for intensive computing via a large number of wide uniform registers (128-entry, 128-bit register files) and a 256 KB local store for each SPE. The Memory Flow Controller (MFC) on each SPE and the high bandwidth EIB (with a peak bandwidth of 204.8 GBytes/s) enable the SPEs to interact efficiently with the PPE, with other SPEs, and with main memory. These novel features make the Cell/B.E. processor attractive and well suited for scientific computing applications (Williams et al., 2006). The Cell/B.E. processor exhibits several levels of parallelism. Coarse-grained parallelism exists between the PPE and the SPEs, and between different SPEs. The PPE and SPEs can work on different tasks concurrently. Each SPE can also perform different tasks simultaneously. Fine-grained parallelism can be implemented both on the PPE and on the SPEs. Both the PPE and the SPEs have their own SIMD instruction sets, each capable of executing two instructions per clock cycle. The PPE has two-way multi-threaded hardware support and is a dual-issue in-order processor. The SPE does not support multithreading at the hardware level; however, it is also a dual-issue in-order processor because of its two pipelines. Also, the MFC of each SPE can move data around without interrupting the ongoing tasks on the PPE and SPEs. The parallelism of the Cell/B.E. processor is expected to produce significant performance improvement if fully explored and utilized (Chen et al., 2007). All these features make the Cell/B.E. processor an attractive new architecture for compute intensive applications. Liu et al. (Liu et al., 2007) develop a digital media indexing application on the Cell/B.E. Williams et al. (Williams et al., 2006) investigate the performance of several key scientific computing kernels on the Cell/B.E. processor. They conclude that the Cell/B.E. processor's three-level software-controlled memory architecture (the 128 registers, the LS, and the main memory) outperforms conventional cache-based architectures, especially for applications with predictable memory access patterns, by effectively overlapping computation and communication.
FINITE DIFFERENCE TIME DOMAIN ALGORITHM
This section explains the Finite-Difference Time-Domain (FDTD) method. FDTD is a popular method in many applications such as electromagnetic theory (Yu, Mittra, Su, Liu, & Yang, 2006) and medical imaging (Xu et al., 2007). FDTD is inherently data-intensive and compute-intensive, exhibiting nearest neighbour communication patterns. Since it is usually a kernel in many applications, the performance of the FDTD algorithm is crucially important to the overall performance of the entire application. In this section, we develop a parallel FDTD algorithm for the Cell/B.E. and compare the results with two different architectures, distributed memory clusters and the homogeneous multicore AMD Opteron.
We discuss the experimental results and the challenges posed in developing the algorithms, taking into consideration the architectural features of these architectures.
Finite Difference Time Domain Algorithm
FDTD is a numerical technique proposed by Yee in 1966 to solve Maxwell's equations in electromagnetics (Yee, 1966). Yee's algorithm discretizes the 3D region of interest into a mesh of cubic cells, or a 2D region into a grid of rectangular cells. These cells are called Yee cells. Each Yee cell has electrical fields (E) and magnetic fields (H) when the region is pinged with microwaves. The electrical fields and magnetic fields interleave with each other spatially: the edges of the cells in the electrical mesh lie at the centers of the cells in the magnetic mesh. Electrical fields and magnetic fields are updated at alternate half time steps in a leapfrog scheme in time. An application of FDTD for breast cancer detection uses the following equations to model the electrical and magnetic field updates. We refer readers to (Xu et al., 2007) for details of the application.
E_{zx}\big|_{i,j}^{\,n+1} = a\,E_{zx}\big|_{i,j}^{\,n} + \frac{b}{\Delta x}\left[ H_{y}\big|_{i,j}^{\,n+1/2} - H_{y}\big|_{i-1,j}^{\,n+1/2} \right] \qquad (1)

E_{zy}\big|_{i,j}^{\,n+1} = a\,E_{zy}\big|_{i,j}^{\,n} - \frac{b}{\Delta y}\left[ H_{x}\big|_{i,j}^{\,n+1/2} - H_{x}\big|_{i,j-1}^{\,n+1/2} \right] \qquad (2)

H_{x}\big|_{i,j}^{\,n+1/2} = H_{x}\big|_{i,j}^{\,n-1/2} - \gamma\left[ E_{zx}\big|_{i,j+1}^{\,n} + E_{zy}\big|_{i,j+1}^{\,n} - E_{zx}\big|_{i,j}^{\,n} - E_{zy}\big|_{i,j}^{\,n} \right] \qquad (3)

H_{y}\big|_{i,j}^{\,n+1/2} = H_{y}\big|_{i,j}^{\,n-1/2} + \gamma\left[ E_{zx}\big|_{i+1,j}^{\,n} + E_{zy}\big|_{i+1,j}^{\,n} - E_{zx}\big|_{i,j}^{\,n} - E_{zy}\big|_{i,j}^{\,n} \right] \qquad (4)

a = \frac{1 - \dfrac{\sigma \Delta t}{2 \varepsilon_{0} \varepsilon_{r}}}{1 + \dfrac{\sigma \Delta t}{2 \varepsilon_{0} \varepsilon_{r}}} \qquad (5)

b = \frac{\Delta t}{\varepsilon_{0} \varepsilon_{r} \left( 1 + \dfrac{\sigma \Delta t}{2 \varepsilon_{0} \varepsilon_{r}} \right)} \qquad (6)

\gamma = \frac{\Delta t}{\mu\,\Delta y} \qquad (7)
In the equations, E_{zx}|_{i,j}^{n+1} is the electrical field at position (i, j) at time step (n + 1), and H_{x}|_{i,j}^{n+1/2} is the magnetic field at position (i, j) at time step (n + 1/2). σ is the conductivity of the material. ε0 and εr represent the permittivity of free space and of the material, respectively. μ denotes the permeability of the material. Δt is the time step. Δx × Δy is the size of the Yee cell. A sequential FDTD on a conventional computer is shown in Algorithm 1. N is the number of Yee cells in each direction, assuming that each direction is equally divided. MAX_TIMESTEPS is the maximum number of time steps (iterations) for field updates. FDTD is the kernel of many applications in the electromagnetic field (Taflove & Hagness, 2000; Xu et al., 2007). As an iterative algorithm, its performance is critical to its widespread application. However, it is computationally intensive and therefore parallel processing is required (Xu et al., 2007). The complexity of a 2D FDTD algorithm is O(N^3). The sequential FDTD algorithm takes about 200 seconds for a 600 × 600 computational domain over 4000 time steps on an AMD Athlon 64 X2 dual-core processor at 2 GHz. In medical imaging, finer granularity is a necessity to produce more accurate results. However, increased granularity implies increased computation time along with larger memory requirements. These reasons have led us to design parallel FDTD algorithms for different architectures. A body of parallel FDTD research has been reported using different parallel schemes on different platforms for different applications. Guiffaut et al. (Guiffaut & Mahdjoubi, 2001) implement a parallel FDTD on a computational domain of 150 × 150 × 50 cells on PCs and the Cray T3E. They use the Message Passing Interface (MPI) and adopt a vector communication scheme and a matrix communication scheme, obtaining higher efficiency with the latter scheme. Su et al. (Su, EI-kady, Bader, & Lin, 2004) combine OpenMP and MPI to parallelize FDTD: OpenMP is used for the one-time initialization and the per-time-step updating of the E-fields and H-fields; MPI is used for the communication between neighboring processors. Yu et al. (Yu et al., 2006) introduce three communication schemes for parallel FDTD. The three schemes differ in which components of the E-fields and H-fields should be exchanged and which process should update the E-fields on the interface.
Algorithm 1 Sequential FDTD on a conventional computer
Initialize electric fields and magnetic fields;
Calculate coefficients for all Yee cells;
for n = 1 to MAX_TIMESTEPS do
  for i = 1 to N do
    for j = 1 to N do
      Update Ezx[i][j] using equation 1;
      Update Ezy[i][j] using equation 2;
      Update Hx[i][j] using equation 3;
      Update Hy[i][j] using equation 4;
    end for
  end for
end for
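A possible C realization of the field updates of Algorithm 1, written directly from equations (1)–(4), is given below; the static array sizes, the split into an E-sweep and an H-sweep, and the handling of the domain borders are simplifications made here for illustration (the coefficient arrays a, b, g and the cell sizes dx, dy are assumed to have been initialized from equations (5)–(7)).

#define N 600                      /* Yee cells per direction (example size) */

static double Ezx[N][N], Ezy[N][N], Hx[N][N], Hy[N][N];
static double a[N][N], b[N][N], g[N][N];   /* per-cell coefficients, eq. 5-7 */
static double dx, dy;                      /* Yee cell size                  */

void fdtd_time_step(void)
{
    /* E-field update, equations (1) and (2) */
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++) {
            Ezx[i][j] = a[i][j] * Ezx[i][j]
                      + b[i][j] / dx * (Hy[i][j] - Hy[i - 1][j]);
            Ezy[i][j] = a[i][j] * Ezy[i][j]
                      - b[i][j] / dy * (Hx[i][j] - Hx[i][j - 1]);
        }

    /* H-field update, equations (3) and (4) */
    for (int i = 0; i < N - 1; i++)
        for (int j = 0; j < N - 1; j++) {
            Hx[i][j] -= g[i][j] * (Ezx[i][j + 1] + Ezy[i][j + 1]
                                 - Ezx[i][j]     - Ezy[i][j]);
            Hy[i][j] += g[i][j] * (Ezx[i + 1][j] + Ezy[i + 1][j]
                                 - Ezx[i][j]     - Ezy[i][j]);
        }
}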
FDTD on Distributed-Memory Machines
FDTD is data-parallel in nature and exhibits an apparent nearest-neighbor communication pattern (Yu et al., 2006). Therefore, FDTD is a suitable algorithm for parallelization on distributed memory machines using the Message Passing Interface (MPI). The factors impacting the performance of parallel FDTD on distributed memory machines are the communication and synchronization overheads. As shown in the previous section, the field updates of each Yee cell require information from its neighbors. There is no communication overhead if the neighbors of a Yee cell reside on the same processor. However, communication becomes an issue at the border of the decomposition, where some or all of a cell's neighbors are on the neighboring processors. The computational domain has to be large to provide accurate results, which implies that the communication overhead is high while transferring large amounts of data. Therefore, overlapping computation with communication is critical to gaining performance. Another property of FDTD is that the field updates cannot proceed to the next time step until all Yee cells have been updated for the current time step. This incurs synchronization overhead at each time step. Therefore, in designing the parallel FDTD algorithm, proper data distribution and mapping onto the available processors is critical to avoiding communication bottlenecks. Yu et al. (Yu et al., 2006) introduce three communication schemes for parallel FDTD. The three schemes differ in which components of E and H should be exchanged and which processor should update E on the interface. The division of the computational domain is performed on E along a Cartesian axis. In this chapter, the computational domain is divided along the x axis of E. Suppose the computational domain is divided into n × n cells and p processors are used for the FDTD computation. Then each processor receives a matrix of m × n cells, where m = n/p. Each processor i (where i is not equal to 1 or p, the last processor) shares the first and m-th rows of its computational domain with processors i − 1 and i + 1, respectively. Therefore, E on the interface of adjacent processors is calculated on both processors. The purpose of the scheme is to eliminate the communication of E and to communicate only H, trying to improve the computation/communication efficiency. The parallel FDTD algorithm, referred to as the MPI-version parallel FDTD algorithm, is given in Algorithm 2.
Algorithm 2 Parallel FDTD on distributed-memory machines (MPI-version parallel FDTD)
Initialize electric fields and magnetic fields;
if processor is master processor then
  Calculate coefficients for all Yee cells;
  Decide the Yee cells for each processor and send the coefficients of those Yee cells to the corresponding processors;
else
  Receive the coefficients of the Yee cells residing on the local processor;
end if
for n = 1 to MAX_TIMESTEPS do
  for i = 1 to N/P do
    for j = 1 to N do
      Update Ezx[i][j] using equation 1;
      Update Ezy[i][j] using equation 2;
      Update Hx[i][j] using equation 3;
      Update Hy[i][j] using equation 4;
    end for
  end for
  Exchange magnetic fields with the neighboring processors;
  Synchronize among all processors;
end for
if processor is not master processor then
  Send the final results to the master processor;
else
  Receive results from all other processors;
  Output the results at the observation points;
end if
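A hedged sketch of the per-time-step exchange of magnetic fields in Algorithm 2 is shown below, using only standard MPI calls. It assumes each rank stores its m = N/p rows plus two halo rows; the buffer layout and the choice of MPI_Sendrecv are illustrative, not the authors' implementation.

#include <mpi.h>

/* Exchange boundary rows of Hx/Hy with the neighboring ranks once per
 * time step (sketch). Each rank owns rows 1..m of N cells each, with
 * halo rows stored at row 0 and row m+1. */
static void exchange_h_fields(float *Hx, float *Hy, int m, int N,
                              int rank, int nprocs, MPI_Comm comm)
{
    int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send first owned row up, receive the halo row from below (and vice versa). */
    MPI_Sendrecv(&Hx[1 * N],       N, MPI_FLOAT, up,   0,
                 &Hx[(m + 1) * N], N, MPI_FLOAT, down, 0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&Hx[m * N],       N, MPI_FLOAT, down, 1,
                 &Hx[0],           N, MPI_FLOAT, up,   1, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&Hy[1 * N],       N, MPI_FLOAT, up,   2,
                 &Hy[(m + 1) * N], N, MPI_FLOAT, down, 2, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&Hy[m * N],       N, MPI_FLOAT, down, 3,
                 &Hy[0],           N, MPI_FLOAT, up,   3, comm, MPI_STATUS_IGNORE);

    MPI_Barrier(comm);   /* per-time-step synchronization, as in Algorithm 2 */
}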
FDTD on Homogeneous Multicore Architecture

The multicore machine available for this work is a Sun Fire X4600 server. It is configured with eight sockets, each holding an AMD Opteron dual-core processor. The processors are connected with AMD HyperTransport Technology links. As a whole system, it is a ccNUMA (cache-coherent Non-Uniform Memory Access) SMP system. Each processor has its own dedicated memory attached to its two cores. It can access the memory of other processors via AMD's Direct Connect Architecture (DCA). AMD Opteron dual-core processors have interesting design features that tackle some aspects of the three walls. Each core has separate L1 and L2 caches. Separate L2 caches prevent a potential synchronization bottleneck caused by multiple threads on multiple cores competing over the same data cache. Hence, separate cores can process separate data sets, avoiding cache contention and coherency problems. Furthermore, AMD has a unique implementation of ccNUMA based on DCA. Inside each of the eight dual-core processors there is a crossbar switch. One side of the switch attaches the two cores; the other side attaches the DCA with a shared memory controller and HyperTransport Technology links. The shared memory controller connects the two cores to the dedicated memory. The HyperTransport Technology links allow the two cores on one processor to access another processor's dedicated memory. Therefore, for each core, some memory is directly attached, yielding lower latency, while some is not directly attached and has higher latency. The combination of ccNUMA and DCA can improve performance by locating data close to the thread that needs it, which is called "memory affinity". In addition, the hypervisor, which virtualizes the underlying multi-core processors and multi-processor system, provides facilities to specify "thread affinity", assigning dedicated cores to threads. This facility contributes to further performance improvement. Although FDTD is computationally intensive, it shows apparent data parallelism and a high degree of data locality. Each Yee cell update, both for electric fields and for magnetic fields, only needs information from its near neighbors, as shown in equations 1 through 4. Locality is one of the key factors that impact performance on cache-based computers. The inherent locality of FDTD may bring significant performance gains on the homogeneous multi-core system via the shared-memory parallel programming paradigm, especially with the hardware support of a separate L2 cache for each core. Therefore, we designed a shared-memory version of FDTD, shown in Algorithm 3, which we refer to as the OpenMP-version parallel FDTD.
Algorithm 3 Parallel FDTD on shared memory machines (OpenMP-version parallel FDTD)
Initialize electric fields and magnetic fields;
Calculate coefficients for all Yee cells;
for n = 1 to MAX_TIMESTEPS do
  #pragma omp parallel
  {
    #pragma omp for private(i, j)
    for i = 1 to N do
      for j = 1 to N do
        Update Ezx[i][j] using equation 1;
        Update Ezy[i][j] using equation 2;
        Update Hx[i][j] using equation 3;
        Update Hy[i][j] using equation 4;
      end for
    end for
  }
end for
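In C, Algorithm 3 maps almost directly onto OpenMP directives, as in the sketch below; the update_* helper functions stand for the field-update statements of equations 1 to 4 and are assumed names, not the authors' code.

#include <omp.h>

extern void update_Ezx(int i, int j), update_Ezy(int i, int j);
extern void update_Hx(int i, int j),  update_Hy(int i, int j);

/* One possible C realization of Algorithm 3 (sketch). */
void fdtd_openmp(int N, int max_timesteps)
{
    int i, j;
    for (int n = 1; n <= max_timesteps; n++) {
        #pragma omp parallel private(i, j)
        {
            #pragma omp for
            for (i = 1; i < N - 1; i++) {
                for (j = 1; j < N - 1; j++) {
                    update_Ezx(i, j);   /* equation 1 */
                    update_Ezy(i, j);   /* equation 2 */
                    update_Hx(i, j);    /* equation 3 */
                    update_Hy(i, j);    /* equation 4 */
                }
            }
        }   /* implicit barrier: all cells done before the next time step */
    }
}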
FDTD on Cell/B.E. Processor

This section lists the key issues in fully utilizing the parallelism of the Cell/B.E. for FDTD. One issue is the limited size of the local store (LS) on each SPE. The 256KB LS holds both instructions and data. Based on equations (1) to (4), for a computational domain of 600 × 600 Yee cells, about 16MB of memory is needed to hold the coefficients and the field values at run time, without considering the code and other variables. Therefore, one of the main issues is deciding how to make the data fit in the limited memory available at run time. A solution is to let each SPE work on one part of the computational domain at a time. At each time step, the SPE fetches the coefficients and the field values of the Yee cells within that part of the computational domain and updates the fields using the equations. The updated field values are stored back to the corresponding memory locations to make space for the next part of the computational domain, and the SPE then proceeds with that next part. The process continues until all Yee cells of the computational domain have been updated for the current time step, and the whole process is repeated for each of the MAX_TIMESTEPS time steps. By fetching and storing data between main memory and the LS, each SPE can keep its instructions and data under the 256KB limit at run time. Another issue is deciding on the size and the frequency of the data exchanges between memory and the LS such that the Cell/B.E. processor is fully utilized. The SPEs can only operate on instructions and data residing in the LS. Unlike the PPE, SPEs cannot access main memory directly; they have to fetch instructions and data from memory to the LS using asynchronous coherent DMA commands. Therefore, the communication cost must be considered during algorithm design. A suitable size and frequency for the transfers has to be determined to ensure there is no data starvation and only minimal overhead. Several points are critical in reducing the communication cost and achieving efficient SPE data access: data alignment, access pattern, DMA initiator, and location. The MFC of the SPE supports
transfers of 1, 2, 4, 8, and n × 16 bytes (up to 16KB). Transfers smaller than 16 bytes must be naturally aligned and have the same quad-word offset for the source and the destination addresses. All transfers go through the EIB, so the cost on the EIB must be minimized. Minimal EIB overhead is achieved if transfers are at least 128 bytes, and if transfers greater than or equal to 128 bytes are cache-line aligned, i.e., aligned to 128 bytes. Furthermore, whenever possible we let the SPEs initiate the DMAs and pull the data from main memory rather than from the PPE's L2 cache: MFC transfers from system memory have high bandwidth and moderate latency, whereas transfers from the L2 cache have moderate bandwidth and low latency. The third issue is the synchronization problem when more than one SPE is used to exploit the parallelism among SPEs. Algorithm 1 shows that the field update for all Yee cells must be completed for the current time step before any Yee cell can be processed for the next time step. Therefore, when more than one SPE is used to update different parts of the computational domain, synchronization among all participating SPEs is mandatory for correct results. The Cell/B.E. processor supports different synchronization mechanisms (Chen et al., 2007): MFC atomic update commands, mailboxes, SPE signal notification registers, events and interrupts, or simply polling of the shared memory. We consider mailboxes and SPE signal notification registers in this chapter. In the first method, the SPEs use mailboxes while the PPE acts as the arbitrator. When each SPE finishes its tasks for the current time step, it uses its mailbox to notify the PPE that it is ready for the next time step. When the PPE has received messages from all participating SPEs, it sends a message via the mailboxes to those SPEs, letting them start the task for the next time step. The PPE is not involved in the SPE signal notification register method. One SPE acts as the master SPE, and the other SPEs are slaves. The slave SPEs send signals to the master SPE when their tasks for the current time step are completed, and wait for the master's signal to start the task for the next time step. The master SPE sends this signal only when it has received signals from all slave SPEs. The last issue is at the implementation level: the exploitation of SIMD on the SPE. SPEs are SIMD-only coprocessors. Scalar code, especially code for arithmetic operations, may deteriorate performance, since the SPE has to reorganize the data to execute scalar operations on its vector units. Code written in a high-level language must rely on compiler technology to be auto-vectorized in order to exploit the SIMD capability of the SPE. However, the flexibility of high-level languages makes it difficult to achieve optimal results for different applications. Therefore, explicit control of the instructions by the programmer is often necessary for optimal performance. For this purpose, the SPE provides intrinsics, which are essentially inline assembly code with C function call syntax. These intrinsics provide functions such as register coloring, instruction scheduling, data loads and stores, looping and branching, and literal vector construction. This chapter considers literal vector construction, since most of the tasks in FDTD are arithmetic operations; the goal is to manually apply SIMD to the two inner FOR loops shown in Algorithm 1.
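As an illustration of this last point, the fragment below sketches how one row of the Ezx update could be vectorized with the SPE intrinsics from spu_intrinsics.h (four single-precision values per 128-bit register). The array layout and names are assumptions carried over from the earlier sketches, not the authors' implementation.

#include <spu_intrinsics.h>

/* SIMD update of one row of Ezx (sketch): ezx = ca*ezx + cb*(hy - hy_left).
 * All pointers must be 16-byte aligned; n_vec = row_length / 4. */
static void update_ezx_row(vector float *ezx, const vector float *ca,
                           const vector float *cb, const vector float *hy,
                           const vector float *hy_left, int n_vec)
{
    for (int v = 0; v < n_vec; v++) {
        vector float diff = spu_sub(hy[v], hy_left[v]);   /* Hy - Hy(i-1)      */
        ezx[v] = spu_madd(ca[v], ezx[v],                  /* ca*Ezx + cb*diff  */
                          spu_mul(cb[v], diff));
    }
}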
Based on these issues and solutions, we designed a parallel FDTD algorithm for the SPE side, shown in Algorithm 5 and referred to as the CellBE-version parallel FDTD. The PPE is used to manage all SPE threads and to calculate the initialization values. The purpose is to fully exploit the natural parallelism provided by the processor in order to achieve a significant performance improvement.

Algorithm 5 FDTD on the SPE (CellBE-version parallel FDTD)
Send ready signal to the PPE;
Receive information for synchronization;
DMA in the control block to get information about the assigned task and the run settings;
for n = 1 to MAX_TIMESTEPS do
  while Ezx of Yee cells not updated do
    Fetch chunks of data, including the coefficients and the field values of the last time step;
    Update Ezx using the SIMD version of equation 1;
    Store the updated Ezx back to the corresponding memory location;
  end while
  while Ezy of Yee cells not updated do
    Fetch chunks of data, including the coefficients and the field values of the last time step;
    Update Ezy using the SIMD version of equation 2;
    Store the updated Ezy back to the corresponding memory location;
  end while
  while Hx of Yee cells not updated do
    Fetch chunks of data, including the coefficients and the field values of the last time step;
    Update Hx using the SIMD version of equation 3;
    Store the updated Hx back to the corresponding memory location;
  end while
  while Hy of Yee cells not updated do
    Fetch chunks of data, including the coefficients and the field values of the last time step;
    Update Hy using the SIMD version of equation 4;
    Store the updated Hy back to the corresponding memory location;
  end while
  Synchronize with other SPEs;
end for
Send finish signal to the PPE;
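The data movement and synchronization in Algorithm 5 can be sketched in C as below. The chunk size, tag number and buffer names are illustrative assumptions; mfc_get/mfc_put and the mailbox channels are the standard MFC facilities described above, and double buffering (overlapping the next fetch with the current update) is omitted for brevity.

#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK_BYTES 16384            /* one DMA of up to 16KB, 128-byte aligned */
#define TAG 3

static volatile float buf[CHUNK_BYTES / sizeof(float)] __attribute__((aligned(128)));

/* Process one chunk of Yee-cell values (sketch): fetch, update, store back. */
static void process_chunk(uint64_t ea /* effective address in main memory */)
{
    mfc_get((void *)buf, ea, CHUNK_BYTES, TAG, 0, 0);    /* fetch chunk        */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();                           /* wait for the DMA   */

    /* ... update buf[] with the SIMD version of the field equation ... */

    mfc_put((void *)buf, ea, CHUNK_BYTES, TAG, 0, 0);    /* store results back */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();
}

/* End-of-time-step synchronization via mailboxes (sketch):
 * notify the PPE, then block until the PPE releases the next time step. */
static void end_of_timestep_sync(void)
{
    spu_write_out_mbox(1);            /* "done with this time step"           */
    (void)spu_read_in_mbox();         /* blocks until the PPE writes back     */
}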
Experimental Results and Comparisons

The three parallel algorithms were designed for three architectures: distributed-memory machines, homogeneous multicore machines, and the Cell/B.E. processor. They were run on four configurations. We use the number of processing units as the x axis to avoid confusion among processor, thread, core, and SPE counts. Although the PPE is used for part of the computation, its contribution to the final performance is negligible compared to the computation on the SPEs; therefore, the processing unit number indicates the number of SPEs for the Cell/B.E. processor. The four configurations on which the parallel algorithms were designed and implemented are summarized below.

• AMD Athlon cluster: a cluster of 24 nodes. Each node has an AMD Athlon dual-core processor at 2GHz with 512KB cache, and a 100Mb/s Ethernet switch is the interconnection; GNU C compiler.
• AMD Opteron single-core cluster: a cluster of 16 nodes. Each node has dual AMD Opteron single-core processors at 2.4GHz and 2GB of physical memory, with a Voltaire Infiniband switched-fabric interconnection; C compiler from the Portland Group.
• AMD Opteron dual-core shared memory machine: 8 AMD dual-core Opteron processors at 1GHz, 1MB cache per core, 4GB memory per processor and 32GB of distributed shared memory in the system; Sun C compiler and Omni compiler.
• IBM Cell/B.E. processor: Georgia Tech Cell/B.E. cluster containing 14 IBM Blade QS20 dual-Cell blades, each running at 3.2GHz; GNU C compiler.
Figure 1 depicts the computation time for these four configurations, allowing pairwise comparisons. First, it illustrates the performance of the MPI-version parallel FDTD algorithm (Algorithm 2) on the two clusters. The AMD Opteron single-core cluster outperforms the AMD Athlon cluster when the same number of processing units is used. One of the main reasons for this difference is that the two clusters use different interconnection networks between processors: the Voltaire Infiniband switched-fabric interconnection of the AMD Opteron single-core cluster provides faster communication and lower communication and synchronization latencies. Figure 1 also shows the performance of the MPI version on the AMD Opteron single-core cluster and of the OpenMP version on the AMD Opteron dual-core shared memory machine. We notice that the AMD Opteron dual-core processor outperforms the Opteron single-core processor both at the core level (1 processing unit) and at the processor level (2 processing units for dual-core versus 1 processing unit for single-core). However, for 4 and 8 processing units, the homogeneous multicore architecture with Opteron dual-core processors has a longer computation time than the AMD Opteron single-core cluster. The reason is the multi-threading overhead and the longer memory latency of the dual-core system. Although the single-core Opteron processors reside on different computers, these computers are connected by the Voltaire Infiniband switched-fabric interconnection, which minimizes the communication latencies. The performance comparison between the AMD Opteron single-core cluster and the Cell/B.E. processor is also depicted in Figure 1. The computation time on the Cell/B.E. decreases consistently as more SPEs are involved, and the Cell/B.E. processor maintains an almost constant speedup ratio of 1.45 over the AMD Opteron single-core processors, no matter how many processing units are involved.

Figure 1. Computation time for different processors
At the processor level, a Cell/B.E. processor using 8 SPEs is 7.05 times faster than an Opteron single-core processor. The final comparison is between the Cell/B.E. processor and the Opteron dual-core processors of the homogeneous multicore architecture. We can see from the figure that when more processing units are involved, the speedup of the Opteron dual-core processors in the shared memory machine is lower than the speedup of the Cell/B.E. processor. This is due to the thread overhead when using more cores. At the processor level, a Cell/B.E. processor using 8 SPEs is 3.37 times faster than an Opteron dual-core processor. As discussed in the previous section, the DMA size may be a factor for performance improvement, since a large number of transfers incurs more communication overhead. For this purpose, we designed a scenario where different numbers of rows (each row has 600 floats, i.e., 2,400 bytes) of the computational domain are transferred in each DMA command. The result is depicted in Figure 2(a). The almost flat curves (slightly downward for 6 rows per DMA command) indicate that the DMA size is not a determining factor for FDTD, since the minimal transfer size (1 row) is already 2,400 bytes; for bigger sizes, the next DMA command has to wait for the previous, larger transfer (e.g., 14,400 bytes for 6 rows) to complete. The figure reveals another component of the communication overhead, the synchronization time, which can be seen from the different spacing between the curves. The space between the top two curves (1 SPE and 2 SPEs) is the widest, while the space between the bottom two curves (4 SPEs and 8 SPEs) is the narrowest. This observation confirms that more overhead occurs when more SPEs are involved. In fact, the speedup is 1.95 for 2 SPEs, 3.69 for 4 SPEs, and 4.92 for 8 SPEs. Another scenario was designed to verify the performance difference between the signal and mailbox synchronization mechanisms. The results shown in Figure 2(b) indicate that the two mechanisms give comparable performance. Based on these comparisons, we can conclude that:
• The Cell/B.E. processor provides significant performance improvement over conventional processors and parallel architectures.
• All parts of the whole parallel system are important to the final performance. These include the processor, the interconnection network, and the compiler.
FFT

This section explains the Fast Fourier Transform (FFT). Communication and synchronization are the two main latency issues in computing the FFT on parallel architectures (Loan, 1992). Both latencies have to be either hidden or tolerated to achieve high performance. One approach is multithreading. Another approach to tolerate latency is to map data efficiently onto the processors' local memories, exploiting data locality. Indirect swap networks (ISN), an idea originally proposed for VLSI circuits, can be used to perform the butterfly computations of the FFT efficiently (Yeh, Parhami, Varvarigos, & Lee, 2002). Data mapping in the swap network topology reduces the communication overhead by half at each iteration. This section explains the traditional Cooley-Tukey FFT algorithm, followed by the FFT algorithm based on the ISN method. The parallel FFT algorithm based on ISN is designed for the Cell/B.E. and compared against a cluster.
Figure 2. Performance of FDTD on Cell/B.E.
Fast Fourier Transform

The discrete Fourier transform (DFT) is used in many applications, such as in digital signal processing to analyze a signal's frequency spectrum, to solve partial differential equations, or to perform convolutions. The 1D DFT computation can be expressed as a matrix-vector multiplication. A straightforward solution for N input elements is of complexity O(N^2). The Fast Fourier Transform (FFT) proposed by Cooley and Tukey is a fast algorithm for computing the DFT that reduces the complexity to O(N log N). The FFT has been studied extensively as a frequency analysis tool in diverse application areas such as audio, signal and image processing, computed tomography, and computational finance (Barua et al.,
2005). There are many variants of the FFT algorithm. Mathematically, all variations differ in the use of permutations and transformations of the data points (Loan, 1992). For a sequence x(r) with N data points, the decimation-in-time (DIT) FFT divides the sequence into its even- and odd-indexed points x1(r) and x2(r) at every iteration, while the decimation-in-frequency (DIF) FFT divides the sequence into two halves at every iteration. This difference in the division method leads to a different structure of the butterfly computation. Depending on the number of groups into which the input elements are divided, radix-2, radix-4, mixed-radix and split-radix FFTs exist in the literature. In this chapter, we consider the basic radix-2 DIT FFT on N input complex elements, where N is a power of 2. Parallelizing the FFT on multiprocessor computers concerns the mapping of data onto processors. On shared-memory machines, the whole data set is placed in one global memory, allowing all processors to access the data. The computation is subdivided among the processors in such a way that the load is balanced and memory conflicts are low. The recursive FFT algorithm can be easily programmed on such machines. On distributed architectures, each processor has its own local memory and data exchanges occur via message passing. In this setting, the recursive FFT algorithm is not appropriate, because combining even and odd parts of the elements at each iteration while the data is distributed over different processors requires a relatively high level of programming sophistication. Therefore, an iterative FFT scheme is more suitable for distributed machines. There are mainly two latency issues in computing FFTs on parallel architectures: communication and synchronization. During the butterfly computation, the partners change at each iteration, so an efficient data mapping is difficult. Data needs to be communicated between processors at every iteration, which implies synchronization between processors. In order to achieve high performance, both of these latencies have to be either hidden or tolerated. One such approach is multithreading (Thulasiraman, Theobald, Khokhar, & Gao, 2000). Another approach to tolerate latency is to map data efficiently onto the processors' local memory, that is, to exploit data locality. Yeh et al. (Yeh et al., 2002) proposed an efficient parallel architecture for FFT in VLSI circuits using indirect swap networks (ISN). Data mapping in the swap network topology reduces the communication overhead by half at each iteration. The idea of the swap network has been applied to option pricing in computational finance (Barua et al., 2005) and has been shown to produce better performance than the traditional parallel DIT FFT. However, synchronization latency is still an issue for large data sizes.
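For reference, the iterative radix-2 DIT scheme discussed above can be written as in the sketch below (textbook form, not the authors' implementation); it assumes bit-reversed input and precomputed twiddle factors W[j] = e^(-2*pi*i*j/N).

#include <complex.h>

/* One radix-2 DIT butterfly: (a, b) -> (a + w*b, a - w*b). */
static inline void butterfly(float complex *a, float complex *b, float complex w)
{
    float complex t = w * (*b);
    *b = *a - t;
    *a = *a + t;
}

/* Iterative in-place radix-2 DIT FFT over N bit-reversed points (sketch).
 * W[] holds the N/2 twiddle factors W[j] = exp(-2*pi*i*j/N). */
static void fft_dit(float complex *x, int N, const float complex *W)
{
    for (int len = 2; len <= N; len <<= 1)              /* butterfly span     */
        for (int base = 0; base < N; base += len)
            for (int k = 0; k < len / 2; k++)
                butterfly(&x[base + k], &x[base + k + len / 2],
                          W[k * (N / len)]);            /* stride into W[N/2] */
}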
Cooley-Tukey Butterfly Network and ISN

At each iteration of the FFT computation, two data points perform a butterfly computation. The butterfly computation can be conceptually described as follows: a and b are points (complex numbers). The upper part of the butterfly operation computes the sum of a and b weighted by a twiddle factor ω, while the lower part computes the difference. In each iteration, there are N/2 summations and N/2 differences (Grama, Gupta, Kumar, & Karypis, 2003).

In general, a parallel FFT algorithm with a blocked data distribution of N/P elements on P processors involves communication for log P iterations and terminates after log N iterations. If we assume shuffled input data at the beginning (Grama et al., 2003), the first (log N – log P) iterations require no communication. Therefore, during the first (log N – log P) iterations (local stage), a sequential FFT algorithm can be used inside each processor. At the end of the (log N – log P)th iteration, the latest
computed values for the N/P data points exist in each processor. The last log P iterations require remote communications (called the remote stage). Note that the other half of the pairs for the N/P elements on one processor reside on the same remote processor. The identity of the processors for remote communication can be determined very easily: at the kth stage of the remote stages (k = 0, ..., log P − 1), if processor Pi needs to communicate with processor Pj, then j = i XOR 2^k, where XOR is the exclusive-OR binary operation (Chu & George, 2000; Grama et al., 2003). Note that in the Cooley-Tukey FFT algorithm, N/P data elements are exchanged between two paired processors without inter-processor permutation at the remote stage, leaving each paired processor with the same copy of 2N/P elements for the butterfly computations. Since the same butterfly computations are performed on both processors, there are redundant computations; if only one processor performs the butterfly computations, then some of the processors may be idle. Furthermore, this communication incurs a message overhead of N/P elements at each remote stage, and the distance each message travels increases as the iterations move forward, depending on the interconnection network. The consequence is that more communication and synchronization overhead leads to traffic congestion in the butterfly network. One solution to reduce data communication at each remote stage is inter-processor permutation using the Indirect Swap Network (ISN) (Yeh et al., 2002). For local stages, each processor permutes N/2P elements locally and performs the butterfly calculations; for remote stages, each processor permutes N/2P elements and exchanges data with its paired processor. Note that the permutation exploits data locality, thereby reducing the message overhead between two paired processors by N/2P. This is a significant decrease in communication for very large networks. An indirect swap network is depicted in Figure 3 for 16 elements on 4 processors. In this example, at remote stage 0, processors 0 and 1 exchange data points 2, 3 and 4, 5, respectively. The other data points are kept intact in their respective processors (data points 0 and 1 in processor 0, data points 6 and 7 in processor 1). In general, for given N and P, N/2P data points are swapped between two processors. The result is a reduction of the communication of the traditional butterfly network by half.
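A small illustration of the pairing rule: with P = 4, at remote stage 0 the pairs are (0,1) and (2,3), and at stage 1 they are (0,2) and (1,3). The fragment below (illustrative only) prints, for each remote stage, the partner of a given rank and the number of elements that would be swapped under ISN; the odd/even split mirrors the scheme used later in Algorithm 6.

#include <stdio.h>

/* Remote-stage pairing and ISN swap sizes (sketch).
 * n_local = N / P elements reside on each processor. */
static void remote_stage_plan(int rank, int P, int n_local)
{
    for (int k = 0; (1 << k) < P; k++) {
        int partner  = rank ^ (1 << k);        /* j = i XOR 2^k                   */
        int n_swap   = n_local / 2;            /* ISN: only N/2P elements move    */
        int send_odd = (rank & (1 << k)) == 0; /* which half this rank contributes */
        printf("stage %d: rank %d <-> rank %d, swap %d elements (%s indices)\n",
               k, rank, partner, n_swap, send_odd ? "odd" : "even");
    }
}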
Parallel FFT Based on ISN on Cell/B.E.

As a new architecture for high performance computing, the Cell/B.E. has been investigated for different FFT algorithms. Chow et al. (Chow, Gossum, & Brokenshire, 2005) investigate the performance of the Cell/B.E. for a modified stride-by-1 algorithm, proposed by Bailey, based on the Stockham self-sorting FFT. They fix the input size to 16 million (2^24) single-precision complex elements and achieve 46.8 Gflop/s on a 3.2GHz Cell/B.E. Williams et al. (Williams et al., 2006) investigate 1D/2D FFT on the Cell/B.E. using one SPE. Bader et al. (Bader & Agarwal, 2007) investigate an iterative out-of-place DIF FFT with 1K to 16K complex input samples and obtain a single-precision performance of 18.6 Gflop/s. Their approach incurs frequent synchronization overhead, both at the end of the butterfly update and at the end of the
Figure 3. Indirect swap network with bit-reversed input and scrambled output
permutations that follow. FFTW provides various FFT benchmarks on the IBM Cell Blade and the PlayStation 3 for different combinations of single precision, double precision, real and complex inputs, and 1D, 2D and 3D transforms. In the implementation of the FFT algorithm on the Cell/B.E., assume N is the size of the data and P is the number of SPEs. The PPE bit-reverses the input data, which is naturally ordered. The PPE prepares and conveys information such as the memory address of the bit-reversed data, the memory address of the swap area, the number of SPEs, and the problem size when creating the SPE threads. After the SPEs receive this information, each of them gets its corresponding N/P portion of the data from main memory according to its id. At the same time, each SPE can overlap the communication between main memory and the LS with the task of computing the twiddle factors. Each SPE then performs (log N – log P) iterations of sequential computation. After these (log N – log P) iterations, each SPE starts the iterations of the remote stage. At every iteration of the remote stage, each SPE stores intermediate results back to the swap area and synchronizes to ensure that every SPE has stored its portion of the intermediate results. At the end of the synchronization, each SPE gets its paired partner's data from the swap area to perform the butterfly computation. Note that in the swap network, N/2P data elements are stored back by each SPE at each iteration. In the Cell/B.E. implementation, the SPEs do not exchange data directly with one another, which differs from the distributed algorithm implementation. In a cluster, the data is initially distributed to the processors by the master processor, and the processors communicate with one another to obtain their paired partners at each iteration.
This requires N/2P communications per iteration, which is an overhead in the distributed implementation. On the Cell/B.E., data exchange is between the SPE and main memory via asynchronous DMA transfers issued by the SPE. This is a significant advantage of the Cell/B.E. over distributed-memory machines. The EIB is fast and allows fast communication between main memory and the SPEs. On distributed-memory machines, the interconnection network plays a crucial role in the exchange of data between processors. On the Cell/B.E., since it is a system-on-chip architecture, DMA access is fast, and every element works together to accomplish the task. Another issue is synchronization. On distributed-memory machines, the processors synchronize at each iteration. In the FFT implementation on the Cell/B.E., the SPEs also need to synchronize, but some unique features of the Cell/B.E. bring great benefits to the FFT computation. As in the FDTD case, two synchronization mechanisms, mailboxes and SPE signal notification registers, are used for the FFT. At the end of all iterations, the SPEs write the final results back to main memory. This new FFT algorithm based on ISN for the Cell/B.E. is presented as pseudo-code in Algorithm 6. It only shows the workload on the SPE. The PPE is responsible for bit-reversing the naturally-ordered input at the beginning and for shuffling the final results of the SPEs such that the overall output is naturally ordered, as in the butterfly network.

Algorithm 6 Parallel FFT based on ISN for Cell on the SPE
Input: N/P bit-reversed single-precision complex numbers in array A[N/P], P SPEs, N = 2^i, P = 2^j, N >> P, array B[N/P] to store transferred data temporarily
Output: scrambled N/P transformed complex numbers in array A
DMA in N/P complex numbers to array A;
Compute twiddle factors and store them in array W[N/2];
for i = 0 to (log N - log P - 1) do
  NG = 2^i; {number of groups}
  Shuffle twiddle factors W[N/2];
  for j = 0 to N/P - 1 step 2 do
    if ((j & NG) = 0) then
      pID = j xor NG; {butterfly partner id}
      Copy A[j] and A[pID] to B[j] and B[j+1];
    else
      pID = (j + 1) xor NG;
      Copy A[pID] and A[j+1] to B[j] and B[j+1];
    end if
  end for
  while UTE > 8 do {UTE: number of un-transformed elements}
    SIMDize butterfly computation between neighboring 8 elements in array B;
    UTE = UTE - 8;
  end while
  Compute any un-transformed elements if N/P is not a multiple of 8;
  Swap results in array B to array A;
end for
for i = 0 to (log P - 1) do
  if ((SPEid & NG) = 0) then
    DMA out all N/2P elements with odd-numbered indices to main memory;
  else
    DMA out all N/2P elements with even-numbered indices to main memory;
  end if
  Synchronize with all other SPEs;
  if ((SPEid & NG) = 0) then
    DMA in N/2P elements from main memory and put them into the odd-numbered indexed positions;
  else
    DMA in N/2P elements from main memory and put them into the even-numbered indexed positions;
  end if
  NG = 2^i;
  Shuffle twiddle factors W[N/2];
  while UTE > 8 do
    SIMDize butterfly computation between neighboring 8 elements in array B;
    UTE = UTE - 8;
  end while
  Compute any un-transformed elements if N/P is not a multiple of 8;
  Swap results in array B to array A;
end for
DMA out the final N/P transformed results in array A to the PPE;
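The twiddle factors referenced by Algorithm 6 (array W[N/2]) can be computed on each SPE while the first DMA transfer is in flight, which is the overlap mentioned above. A scalar sketch is given below, assuming the convention W[j] = e^(-2*pi*i*j/N); a real implementation would SIMDize this loop as well.

#include <math.h>
#include <complex.h>

/* W[j] = exp(-2*pi*i*j/N) for j = 0 .. N/2-1 (scalar sketch). */
static void compute_twiddles(float complex *W, int N)
{
    for (int j = 0; j < N / 2; j++)
        W[j] = cosf(2.0f * (float)M_PI * j / N)
             - sinf(2.0f * (float)M_PI * j / N) * I;
}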
Experimental Results

The new FFT algorithm based on the swap network was implemented using SDK 2.1 on an IBM Blade QS20 dual-Cell blade running at 3.2GHz, available at the Georgia Institute of Technology. The compiler is the xlc compiler. Figure 4 shows the performance of the algorithm for different problem sizes on different numbers of SPEs. The figure shows that the execution time decreases as the number of SPEs increases, for all input sizes. Furthermore, the time for the 4K input decreases faster than the time for the 1K input as the number of SPEs increases. This is because each asynchronous DMA transfer between main memory and the local store can move up to 16KB; therefore, for a larger problem size, the communication overhead is very close to that of a smaller problem size, and the difference between the run times is mainly the computation time on each SPE. In order to investigate the features of the Cell/B.E., we compare the execution time of the algorithm on the Cell/B.E. with its execution time on a cluster (Barua et al., 2005). The cluster is a 20-node SunFire 6800 running MPI. The SunFire system consists of UltraSPARC III CPUs with a 1050MHz clock rate and 40 gigabytes of cumulative shared memory, running the Solaris 8 operating system. The comparison is depicted in Figure 5(a) for 16K single-precision complex numbers. As shown in the figure, the Cell/B.E. performs much better than the cluster: for 8 SPEs on the Cell/B.E. and 8 processors of the cluster, the Cell/B.E. is 6.4 times faster for the 16K input data size. The reason is the large communication overhead in the cluster, that is, (N/2P) × log P communications per processor over the log P iterations. On the
Figure 4. Computation time for different problem size on different number of SPEs
contrary, the high-speed EIB of the Cell/B.E., which supports a peak bandwidth of 204.8 GBytes/s for intra-chip transfers, provides good performance, especially as the problem size increases. This is further validated by Figure 5(b): the FFT algorithm for the Cell/B.E. outperforms the FFT algorithm on the traditional cluster significantly for larger problem sizes. Note that the communication between main memory and the SPEs does not degrade the performance of the algorithm. This is in part due to the system-on-chip architecture of the Cell/B.E.: the interconnection network, which is a hindrance on distributed-memory machines, is not a concern on the Cell/B.E. We have used the high-speed EIB, asynchronous DMA transfers overlapped with computation, and the large number of wide registers available for SIMD operations on the Cell/B.E. to our advantage in the FFT implementation. This is our initial work on FFT. We hope to compare our algorithm and results on the Cell/B.E. with other existing FFT algorithms and their results on the Cell/B.E., such as FFTW and FFTC.
CONCLUSION

Although many scientific computing applications have achieved great performance improvements via different parallel paradigms, further gains are limited by the three walls of conventional processors. In this chapter, we have focused on multicore processors, especially the Cell/B.E., which aims to tackle the three walls and provide significant performance improvement via novel features such as the ultra-high-speed on-chip bus (EIB), eight SIMD coprocessors (SPEs), a software-managed memory hierarchy, and hardware support for asynchronous communication between hierarchical memories. However, this novelty brings challenges for parallel algorithm design. Therefore, we have investigated two scientific computing kernels, FDTD and FFT, as case studies to illustrate the challenges and
Figure 5. Comparison between Cell/B.E. and cluster
solutions when designing parallel algorithms on the Cell/B.E. For 2D FDTD, we achieved an overall speedup of 14.14 over an AMD Athlon running at 2GHz and 7.05 over an AMD Opteron running at 2.4GHz at the processor level, for a computational domain of 600 × 600 Yee cells. As for 1D FFT, with 8 SPEs of an IBM Blade QS20 dual-Cell blade running at 3.2GHz versus 8 processors of the SunFire 6800 cluster with a 1050MHz clock rate, the Cell/B.E. is 3.7 times faster than the cluster for the 4K input data size and 6.4 times faster for the 16K input data size. The results obtained from these problems make it promising to further consider the Cell/B.E. as a high performance computing architecture for many more
applications. It is not difficult to see that the Cell/B.E., especially its eight independent SPEs, brings great performance improvement via manual SIMD operations, explicit data movement management through asynchronous DMA transfers, and explicit scheduling and synchronization (such as multiple buffering) among all nine cores. However, all these performance improvement techniques put more burden on developers compared to conventional processors, which in turn impacts productivity and code portability. Therefore, developers have to balance productivity and performance; this balance, called relative productivity, can be measured by the ratio of speedup over SLOC (source lines of code) (Alam, Meredith, & Vetter, 2007). With the popularity of multi-core architectures, industry and academia have been improving the productivity of multi-core architectures by enhancing the software stacks, such as optimized compilers, optimized libraries, and different programming models and development platforms. These enhancements will help developers on multi-core architectures, including the Cell/B.E., fully unleash the power of multi-core without introducing too much programming complexity, thus achieving high relative productivity.
ACKNOWLEDGMENT The authors are thankful to the University of Manitoba Research Grants Program for their support in this research. The authors would also like to acknowledge the partial financial support from Natural Sciences and Engineering Research Council (NSERC) of Canada. The authors acknowledge Georgia Institute of Technology, its Sony-Toshiba-IBM Center of Competence, and the National Science Foundation, for the use of Cell Broadband Engine resources that have contributed to this research.
REFERENCES

Alam, S. R., Meredith, J. S., & Vetter, J. S. (2007, Sept.). Balancing productivity and performance on the Cell Broadband Engine. IEEE Annual International Conference on Cluster Computing.

Asanovic, K., Bodik, R., Catanzaro, B. C., Gebis, J. J., Husbands, R., Keutzer, K., et al. (2006, Dec.). The Landscape of Parallel Computing Research: A View from Berkeley (Tech. Rep. No. UCB/EECS-2006-183). EECS Department, University of California, Berkeley.

Bader, D. A., & Agarwal, V. (2007, Dec.). FFTC: Fastest Fourier transform on the IBM Cell Broadband Engine. In 14th IEEE International Conference on High Performance Computing (HiPC 2007), Goa, India (pp. 18–21).

Barua, S., Thulasiram, R. K., & Thulasiraman, P. (2005, Aug.). High performance computing for a financial application using fast Fourier transform. In Euro-Par Parallel Processing (pp. 1246–1253). Lisbon, Portugal.

Chen, T., Raghavan, R., Dale, J. N., & Iwata, E. (2007, Sept.). Cell Broadband Engine Architecture and its first implementation—A performance view. IBM Journal of Research and Development, 51(5), 559–572.
Chow, A. C., Gossum, G. C., & Brokenshire, D. A. (2005). A programming example: Large FFT on the Cell Broadband Engine. In GSPx Tech. Conf., Proc. of the Global Signal Processing Expo.

Chu, E., & George, A. (2000). Inside the FFT black box: Serial and parallel fast Fourier transform algorithms. Boca Raton, FL: CRC Press LLC.

Grama, A., Gupta, A., Kumar, V., & Karypis, G. (2003). Introduction to parallel computing. Upper Saddle River, NJ: Pearson Education Limited.

Guiffaut, C., & Mahdjoubi, K. (2001, April). A Parallel FDTD Algorithm Using the MPI Library. IEEE Antennas and Propagation Magazine, 43(2), 94–103.

Liu, L.-L., Liu, Q., Natsev, A., Ross, K. A., Smith, J. R., & Varbanescu, A. L. (2007, July). Digital media indexing on the Cell processor. In 16th International Conference on Parallel Architecture and Compilation Techniques, Beijing, China (pp. 425–425).

Loan, C. V. (1992). Computational frameworks for the fast Fourier transform. Philadelphia, PA: Society for Industrial and Applied Mathematics.

Su, M., El-kady, I., Bader, D. A., & Lin, S. (2004, August). A Novel FDTD Application Featuring OpenMP-MPI Hybrid Parallelization. In 33rd International Conference on Parallel Processing (ICPP), Montreal, Canada (pp. 373–379).

Taflove, A., & Hagness, S. (2000). Computational Electrodynamics: The Finite-Difference Time-Domain Method (2nd ed.). Boston: Artech House.

Thulasiram, R. K., & Thulasiraman, P. (2003, August). Performance evaluation of a multithreaded fast Fourier transform algorithm for derivative pricing. The Journal of Supercomputing, 26(1), 43–58. doi:10.1023/A:1024464001273

Thulasiraman, P., Khokhar, A., Heber, G., & Gao, G. (2004, Jan.). A fine-grain load adaptive algorithm of the 2D discrete wavelet transform for multithreaded architectures. Journal of Parallel and Distributed Computing, 64(1), 68–78. doi:10.1016/j.jpdc.2003.06.003

Thulasiraman, P., Theobald, K. B., Khokhar, A. A., & Gao, G. R. (2000, July). Multithreaded algorithms for the fast Fourier transform. In ACM Symposium on Parallel Algorithms and Architectures, Winnipeg, Canada (pp. 176–185).

Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P., & Yelick, K. (2006, May). The Potential of the Cell Processor for Scientific Computing. In Computing Frontiers (CF'06), Ischia, Italy (pp. 9–20).

Xu, M., Sabouni, A., Thulasiraman, P., Noghanian, S., & Pistorius, S. (2007, Sept.). Image Reconstruction using Microwave Tomography for Breast Cancer Detection on Distributed Memory Machine. In International Conference on Parallel Processing (ICPP), Xi'an, China (pp. 1–8).

Yee, K. (1966, May). Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media. IEEE Transactions on Antennas and Propagation, AP-14(8), 302–307.
Yeh, C.-H., Parhami, B., Varvarigos, E. A., & Lee, H. (2002, July). VLSI layout and packaging of butterfly networks. In ACM Symposium on Parallel Algorithms and Architectures, Winnipeg, Canada (pp. 196–205).

Yu, W., Mittra, R., Su, T., Liu, Y., & Yang, X. (2006). Parallel Finite-Difference Time-Domain Method. Boston: Artech House.
KEY TERMS AND DEFINITIONS

Cell Broadband Engine Architecture: The Cell Broadband Engine Architecture, also abbreviated CBEA or Cell/B.E., or simply Cell, is a microprocessor architecture jointly developed by Sony, Toshiba, and IBM. It is a heterogeneous multi-core architecture combining one general-purpose Power architecture core, the PPE (Power Processor Element), with eight streamlined coprocessing elements, the SPEs (Synergistic Processor Elements). The PPE supports the operating system and is mainly used for control tasks. The SPEs support SIMD (Single Instruction Multiple Data) processing. Each SPE has 128 128-bit registers and 256KB of local memory (called the local store) for both instructions and data. The PPE, the SPEs and the memory subsystem are connected by the on-chip Element Interconnect Bus (EIB). Cell is designed as a general-purpose high-performance processor to bridge the gap between conventional desktop processors and more specialized high-performance processors. It has been installed in the Sony PlayStation 3.

Cooley-Tukey FFT Algorithm: The Cooley-Tukey algorithm, named after J. W. Cooley and John Tukey, is the most common FFT algorithm. It re-expresses the DFT of an arbitrary composite size N = N1N2 in terms of smaller DFTs of sizes N1 and N2, recursively, in order to reduce the computation time to O(N log N).

Discrete Fourier Transform: The discrete Fourier transform (DFT) is one of the specific forms of Fourier analysis; it transforms a function in the time domain into a function in the frequency domain. A DFT decomposes a sequence of values into components of different frequencies. A direct computation of a DFT of N points takes O(N^2) arithmetic operations. The inverse DFT (IDFT) transforms a function in the frequency domain back into a function in the time domain.

Fast Fourier Transform: The Fast Fourier Transform (FFT) is an efficient algorithm to compute the DFT and its inverse. Instead of the O(N^2) arithmetic operations of a direct DFT computation of N points, the FFT computes the same result in only O(N log N) operations.

Finite Difference Time Domain: Finite Difference Time Domain (FDTD) is a numerical technique proposed by Yee in 1966 to solve Maxwell's equations in electromagnetic fields. It discretizes a 3D field into a mesh of cubic cells, or a 2D field into a grid of rectangular cells, using central-difference approximations. It is a time-stepping algorithm: each cell has electric field vector components and magnetic field vector components, which are updated at alternate half time steps in a leapfrog scheme in time.

Indirect Swap Network: The Indirect Swap Network (ISN) is an improvement to the Cooley-Tukey butterfly network. It aims to reduce data communication by half compared to the traditional Cooley-Tukey network by introducing inter-processor permutation when implemented on parallel systems using the message passing model.

Multi-Core Processor: A multi-core processor combines two or more independent cores (normally CPUs) into a single package composed of a single die, or of several dies packaged together. It is also called a
chip-level multiprocessor (CMP) and implements multiprocessing in a single physical package. If the cores are identical, the processor is called a homogeneous multi-core processor, such as the AMD Opteron dual-core processor. If the cores are not the same, the processor is called a heterogeneous multi-core processor, such as the Cell/B.E. processor.

Single Instruction Multiple Data: Single Instruction Multiple Data (SIMD) is a technique used to achieve data-level parallelism: the same instruction is applied to multiple data elements. It is one category of processor architecture in Flynn's taxonomy. SIMD was popular in large-scale supercomputers with vector processors. Now, smaller-scale SIMD operations have become widespread in general-purpose computers, such as the SSE instructions in regular Intel processors and the SIMD instruction set of the Cell/B.E. processor.
Section 4
Scheduling and Communication Techniques
Chapter 15
On Application Behavior Extraction and Prediction to Support and Improve Process Scheduling Decisions Evgueni Dodonov University of São Paulo – ICMC, Brazil Rodrigo Fernandes de Mello University of São Paulo – ICMC, Brazil
ABSTRACT The knowledge of application behavior allows predicting their expected workload and future operations. Such knowledge can be used to support, improve and optimize scheduling decisions by distributing data accesses and minimizing communication overheads. Different techniques can be used to obtain such knowledge, varying from simple source code analysis, sequential access pattern extraction, history-based approaches and on-line behavior extraction methods. The extracted behavior can be later classified into different groups, representing process execution states, and then used to predict future process events. This chapter describes different approaches, strategies and methods for application behavior extraction and classification, and also how this information can be used to predict new events, focusing on distributed process scheduling.
INTRODUCTION The knowledge of the application behavior allows predicting the application workload during its execution and forecasting distributed data accesses. In order to obtain such data, different strategies can be employed, varying from simple source code analysis, sequential access pattern extraction (Kotz & Ellis, 1993), history-based approaches (Gibbons, 1997; Smith, Foster & Taylor, 1998) and on-line behavior extraction methods (Senger, Mello, Santana, & Santana, 2005; Dodonov, Mello, & Yang, 2006). It is possible to define two different strategies for application behavior extraction. The first approach DOI: 10.4018/978-1-60566-661-7.ch015
consists of a static source code evaluation, where the behavior is evaluated in an empirical way, without actually executing the application. The second, also known as dynamic evaluation, consists of evaluating the application behavior during its execution. The static evaluation approach is long established, having been conceived by Church and Turing, and is well described by Fischer (1965). Originally, the technique was applied to finite state automata and Turing machines, aiming at detecting possible problems which could lead to deadlock states or improper application terminations. With the evolution of computing systems, new static evaluation techniques were introduced. Among these are the model verification method, which aims at reducing the application behavior to a formal representation (Schuster, 2003), and the abstract interpretation method (Loiseaux, Graf, Sifakis, Bouajjani, & Bensalem, 1995), which represents the application behavior using a series of finite state machines, characterizing different application behaviors using independent automaton states. The dynamic behavior evaluation technique, in turn, investigates the behavior during process execution, usually with the aid of debugging or monitoring utilities. It can be further divided into continuous monitoring and event-based approaches (Jain, 1991). Continuous monitoring techniques consist of the periodic extraction of application characteristics, determining the current behavior by evaluating differences between execution states. This technique can be easily employed, as no application modification is required. However, as the monitoring occurs at pre-determined intervals, it introduces a constant overhead. Moreover, a disadvantage of this approach lies in its imprecision: as the monitoring occurs at fixed intervals, it is not possible to determine the precise behavior at specific execution states. The event-based technique evaluates the behavior by determining critical execution states and extracting application characteristics when such states are reached, resulting in a more precise behavior determination. The adoption of this technique is more difficult, as it requires prior knowledge of the application's functionality in order to correctly determine the critical execution states that are later used to extract its behavior. Besides, this approach usually requires source code instrumentation or the interception and interpretation of function calls. Among the advantages of the event-based approach is the lower execution overhead, as the behavior is only extracted at specific execution points (for example, on data transfers or thread synchronization operations). The area of behavior prediction has received a lot of attention over the last years, resulting in a series of comparisons among different approaches and motivating competitions such as K. U. Leuven (Suykens & Vandewalle, 2000), Eunite (Chen, Chang, & Lin, 2004) and Santa Fe (Weigend & Gershenfeld, 1994). However, most of this research is focused on generic chaotic time-series prediction or on posterior behavior reconstruction. The study of the applicability of behavior prediction techniques to support and improve the performance of distributed systems, combined with our previous research in this field, has motivated us to write this chapter, aiming at outlining different approaches and strategies employed in process behavior extraction, classification and prediction, and their usage in distributed scheduling, load balancing and data access anticipation.
This chapter is organized as follows: Section 2 overviews the evolution of application behavior extraction strategies. Section 3 describes different approaches and techniques for behavior classification and prediction, and Section 4 outlines practical applications of the prediction results in distributed environments. Finally, Section 5 summarizes this chapter.
BACKGROUND: BEHAVIOR EXTRACTION APPROACHES

One of the most common tasks for static application behavior evaluation is source code analysis. The most common approaches rely on widely available tools such as grep, find and sed (Wilding & Behman, 2005). It is also possible to use specific source code indexing applications, such as LXR (http://lxr.sourceforge.net) and similar tools, or code documentation-oriented approaches such as Javadoc or Doxygen (http://www.doxygen.org). Application behavior can also be evaluated by specific tools, such as Lint, Blast and Valgrind (Nethercote & Fitzhardinge, 2004), that allow extracting and profiling the application behavior, aiming at identifying possible faults during execution (such as incorrect parameter casting, bad memory usage and unreachable code). Those approaches, however, rely on the availability of the application source code. In cases where the code is not available, other techniques can be employed. One such technique is known as tracing, which consists of the extraction and subsequent analysis of the system calls performed by the application. Generally, the system kernel provides support to trace such operations, usually via the ptrace system call, which notifies the kernel that the application is being monitored. In this case, it is possible to trace application operations in a completely transparent way. As examples of applications that rely on system call tracing, we may mention the strace (http://strace.sourceforge.net) and GridBox (Dodonov, Souza, & Guardia, 2004) projects. The ptrace approach allows modifying and retrieving additional information for each system call, being widely used for system and application monitoring and debugging. However, as the call tracing occurs at the kernel level, it is difficult to intercept high-level functions. When such functionality is required, dynamic-linker-related techniques can be used, such as specific environment variables interpreted by the Linux linker (ld.so), or a tracing application like ltrace (Wilding & Behman, 2005). This approach allows intercepting any function call and tracing or modifying it when required, as demonstrated by the GridBox (Dodonov et al., 2004) and MidHPC (Andrade Filho et al., 2008) projects, and as illustrated in the sketch below. Continuous monitoring-based approaches, in turn, are more widely used, as they do not require knowledge of the application's functionality. Such methods usually rely on the periodic collection and analysis of system statistics, either at the user level, as in StatMonitor (Keskar & Leibowitz, 2005; Dodonov et al., 2006), or at the kernel level (Senger et al., 2005). Furthermore, application-specific techniques can be employed when using a pre-defined execution environment, such as MPI, as demonstrated next.
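As an example of the linker-based interception mentioned above, a small shared object such as the one sketched below can be preloaded through the LD_PRELOAD environment variable (interpreted by ld.so) to record every open() call performed by an unmodified application. The log format and the choice of intercepted function are illustrative; this is not the GridBox or MidHPC code.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdarg.h>
#include <sys/types.h>
#include <fcntl.h>

typedef int (*open_fn)(const char *, int, ...);

/* Interpose open(): log the call, then forward to the real implementation. */
int open(const char *path, int flags, ...)
{
    static open_fn real_open = NULL;
    if (!real_open)
        real_open = (open_fn)dlsym(RTLD_NEXT, "open");

    int mode = 0;
    if (flags & O_CREAT) {                 /* mode argument only with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, int);
        va_end(ap);
    }
    fprintf(stderr, "[trace] open(%s, %d)\n", path, flags);
    return real_open(path, flags, mode);
}

Compiled with gcc -shared -fPIC -o trace.so trace.c -ldl, this library would be activated by running LD_PRELOAD=./trace.so ./application, without any change to the application itself.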
MPI Application Behavior Extraction

One of the most widely used parallel programming techniques is the MPI (Message Passing Interface) standard (Gropp, Lusk, Doss, & Skjellum, 1996; Squyres & Lumsdaine, 2003). This approach provides a mechanism for transparent message passing among distributed nodes, allowing synchronous and asynchronous data transfers in a distributed environment. The most widely used MPI implementations are MPICH (Gropp et al., 1996) and LAM-MPI (Squyres & Lumsdaine, 2003). Other implementations include the Intel MPI Library (http://intel.com/go/mpi), originally known as Vampir (Nagel, Arnold, Weber, Hoppe, & Solchenbach, 1996), and the more recently introduced Open MPI (Gabriel et al., 2004). Initially, MPI was intended to be used in cluster environments composed of homogeneous nodes
and without fault tolerance mechanisms. However, with the evolution of distributed environments, such assumptions became a clear disadvantage, being inadequate for heterogeneous and large-scale systems. This led to the introduction of new MPI versions focused on large heterogeneous distributed environments such as computational grids, with approaches such as the GRID-MPI project (Ishikawa, Matsuda, Kudoh, Tezuka, & Sekiguchi, 2003). Since the MPI standard is implemented in the form of a library, application debugging and monitoring require specific libraries or tools to obtain the behavior information, such as the Xmpi utility included in most MPI distributions. Specific MPI profiling and monitoring utilities are also widely used, such as the Intel Trace Tools (ITT), included in the Intel MPI library, which allow both continuous and event-based evaluation of distributed MPI applications. The ITT provides support for controlled execution, application tracing, flow control and communication pattern visualization among nodes, being divided into two main applications: the Intel Trace Collector, used for on-line application behavior collection, and the Intel Trace Analyzer, which evaluates the obtained data. The monitoring and behavior extraction are carried out using a shared library dynamically linked to the MPI application. The data is stored in a consistent way and is later evaluated by the Intel Trace Analyzer. It is also possible to instrument the source code of an application using specific functions to define, in a precise manner, which data should be captured. A similar approach is used in the MPE (Multi Processing Environment) system, composed of a series of libraries, applications and graphical utilities for the performance evaluation of MPI applications (Chan, Gropp, & Lusk, 2003). The system employs an event-based approach, providing libraries for MPI event tracing, similar to the Intel Trace Tools, and a graphical analysis tool. MPE, unlike the Intel Trace Tools, is freely available. A similar strategy is used by the Instrumentation Library for MPI (MPICL), which provides an API to profile and instrument MPI applications (Huband & McDonald, 2001); however, it requires manual source code modifications. Besides such tools, the mpiP utility may also be mentioned, as it provides a lightweight and scalable performance analysis library for MPI applications (mpiP: Lightweight, Scalable MPI Profiling, http://mpip.sourceforge.net). Finally, a dynamic monitoring approach is introduced by Dodonov et al. (2006), which aims at extracting and predicting MPI operations by transparently intercepting function calls during the application execution.
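Tools such as mpiP and the tracing libraries above typically rely on the MPI profiling interface, in which any MPI_Xxx routine can be redefined in a wrapper library that forwards to the corresponding PMPI_Xxx entry point. The minimal sketch below times MPI_Send; the timing and the output format are illustrative only, and the MPI-2-style signature (non-const buffer) is assumed.

#include <mpi.h>
#include <stdio.h>

/* Profiling-interface wrapper: intercept MPI_Send, forward to PMPI_Send. */
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    fprintf(stderr, "[mpi-trace] MPI_Send to %d, %d elems, %.6f s\n",
            dest, count, MPI_Wtime() - t0);
    return rc;
}

Linking this wrapper library before the MPI library makes every MPI_Send call in the application pass through it, without modifying or recompiling the application code.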
Distributed Application Behavior Extraction

The extracted process behavior can be classified and analyzed using stochastic techniques (Devarakonda & Iyer, 1989; Feitelson, Rudolph, Schwiegelshohn, Sevcik & Wong, 1997), processing of historical traces (Gibbons, 1997; Smith et al., 1998), on-line evaluation algorithms (Arpaci-Dusseau, Culler & Mainwaring, 1998; Silva & Scherson, 2000), and mixed approaches (Senger et al., 2005; Dodonov et al., 2006). Devarakonda and Iyer (1989) proposed a statistical approach to predict CPU usage, input/output operations and application memory utilization, using a clustering algorithm based on k-means combined with Markov chains. This strategy makes it possible to evaluate the behavior and identify resource-intensive applications.
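As a small, self-contained illustration of this history-based clustering idea, the sketch below groups past executions by their resource-usage vectors using plain k-means; the data and helper function are illustrative and are not taken from the cited study.

```python
# Sketch of the clustering step used in history-based prediction: group past
# executions by their resource usage (CPU seconds, MB of memory, I/O ops)
# with plain k-means. Illustrative data; not the original authors' code.
import random

def kmeans(points, k, iterations=50):
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each observation to its nearest centroid
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids

# Each tuple: (CPU seconds, memory in MB, I/O operations) for one past run.
history = [(12.0, 300, 150), (11.5, 310, 140), (95.0, 2048, 9000), (90.2, 1900, 8700)]
print(kmeans(history, k=2))
```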
A similar approach is proposed by Feitelson et al. (1997), where the same application is repeatedly executed and its behavior variations are evaluated. The authors observed small differences between executions, which makes it possible to employ previously observed application parameters to determine future behavior without explicit user cooperation or application modification. Approaches based on application execution histories are studied by several authors (Gibbons, 1997; Smith et al., 1998), who evaluate the average CPU load, memory usage and input/output operations. The derived results are then used to predict application behavior. Distributed memory accesses can also be evaluated using series of execution traces, as demonstrated by Fang et al. (2004). Similar approaches are also employed in distributed memory systems for non-sequential data access pattern prediction, as discussed by Bianchini, Pinto and Amorim (1998). Complex data access patterns can be determined using semantic structures and analytical approaches (Lei & Duchamp, 1997), data access classification (Mehrotra & Harrison, 1999), and stochastic approaches such as the hidden Markov model (Madhyastha & Reed, 1997). On-line process behavior prediction approaches are studied by Silva and Scherson (1997), using a Bayesian model and fuzzy logic to model application behavior. Similar approaches are discussed by Arpaci-Dusseau et al. (1998) and Corbalan, Martorell and Labarta (2001), who use the collected information to control a dynamic load balancing mechanism. Finally, approaches based on artificial intelligence techniques for on-line behavior classification and prediction are studied by Senger, Mello and Dodonov (Senger, 2004; Senger et al., 2005; Mello, Senger & Yang, 2005; Dodonov et al., 2006; Dodonov & Mello, 2007). Those works employ neural networks to classify and predict process behavior. The prediction results are further used by dynamic load balancing and scheduling algorithms (Mello & Senger, 2004) and for data prefetching (Dodonov et al., 2006). The approaches introduced in this section allow the behavior of distributed applications to be extracted and represented. However, the resulting data must be processed in order to be useful for future behavior prediction, as shown in the next section.
APPROACHES FOR APPLICATION BEHAVIOR CLASSIFICATION AND PREDICTION

Application behavior classification aims at reducing repetitive or similar events to a series of representative patterns, which can then be used to predict or forecast future events. It is possible to use the raw data extracted from application behavior directly; however, this results in a huge amount of repetitive and similar data, which could compromise the prediction process. In this context, classification techniques can be used to reduce the dimensionality of the data, grouping similar behaviors and performing predictions over the most relevant points. Classification is also particularly useful in cases where similarities among observed behaviors cannot be trivially detected. The classification can be performed with the aid of stochastic techniques, such as iterative clustering, linear classification or auto-regressive approaches, or with the aid of artificial intelligence techniques such as neural networks and evolutionary computing. The iterative clustering technique is represented by the k-means, fuzzy c-means and quality threshold (QT) clustering models. The k-means technique aims to separate the input space into different clusters by determining centroid values that minimize the quadratic classification error (Bradley, Fayyad, & Reina, 1998). The fuzzy c-means approach modifies the k-means model by calculating the similarity
degree among all clusters (Liao, Celmins, & Hammell, 2003), and QT clustering improves both the k-means and c-means strategies by automatically calculating the ideal number of clusters (Jiang, Tang, & Zhang, 2004). Another classification strategy is used by linear classification techniques, also known as maximum margin classifiers, such as Support Vector Machines (SVM) (Schölkopf & Smola, 2001). This technique maps the input patterns, represented by multi-dimensional vectors, into a higher-dimensional space and constructs hyperplanes that maximize the separation of patterns from each other. The larger the margin between the hyperplanes and the patterns, the higher the degree of separation and, consequently, the better the generalization. This approach can also be used to predict future patterns, as demonstrated by Hirose et al. (Hirose, Shimizu, Kanai, Kuroda, & Noguchi, 2007), where it is employed for the prediction of long disordered sequences. Model-based behavior prediction is also used in different stochastic auto-regressive approaches, such as SVCA, which employs non-linear auto-regression; the NARX and ARMAX models, intended to predict time series with independent variables; and the ARMA and ARIMA models, used for general time series prediction (Jain, 1991). Self-Organizing Maps, or SOM networks, originally introduced by Kohonen (Kaski & Oja, 1999), are frequently used to generalize input data according to the similarities among the inputs. SOM uses an unsupervised learning model, in which the best matching neuron represents similar patterns, determined by distributing all the input values over a map and evaluating distances among them. In this process, the neuron whose weight is closest to the input value is declared the winner and has its weight adjusted towards the input value. The weights of the other neurons in its neighborhood are also adjusted, according to their distance to the winning neuron. Although self-organizing maps provide unsupervised pattern classification and feature extraction, their efficiency is limited by the need for a prior definition of the network topology. Aiming at more flexible pattern classification, self-expansible neural networks were introduced (Kunze & Steffens, 1995; Fritzke, 1995; Thacker, Abraham & Courtney, 1997), such as the Cascade-Correlation Learning Architecture (CCLA), Growing Cell Structure (GCS), Probabilistic GCS, Growing Neural Gas (GNG), Growing Self-Organizing Maps, Restricted Coulomb Energy (RCE) and Contextual Layered Associative Memory (CLAM). The basic training and classification process of self-expansible networks is similar to that of self-organizing maps. However, self-expansible networks create new elements on demand, according to the variations among the input patterns, aiming at reducing the global residual training error by supporting the neuron with the highest accumulated error rate. The creation of new elements is periodic and requires several previous training phases, therefore limiting the networks' capabilities for on-line classification. The introduction of novelty-detection networks, such as GWR (Marsland et al., 2002) and SONDE (Albertini & Mello, 2007), further extended the family of adaptive data classification approaches. Such networks provide additional advantages, creating new neurons at any time and forgetting irrelevant information, which may result in more precise and efficient pattern classification for temporal applications.
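A minimal sketch of the SOM training rule just described is given below: a one-dimensional map in which the winning neuron and its neighbours are pulled towards each input vector. The parameters are illustrative and no particular SOM library is assumed.

```python
# Minimal sketch of the SOM update rule described above: the winning neuron
# and its neighbours are pulled towards each input vector. Illustrative only.
import math
import random

def train_som(inputs, n_neurons=10, epochs=20, lr=0.5, radius=2.0):
    dim = len(inputs[0])
    weights = [[random.random() for _ in range(dim)] for _ in range(n_neurons)]
    for _ in range(epochs):
        for x in inputs:
            # winner: neuron whose weight vector is closest to the input
            winner = min(range(n_neurons),
                         key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))
            for i in range(n_neurons):
                # neighbourhood factor decays with distance on the 1-D map
                h = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                weights[i] = [w + lr * h * (v - w) for w, v in zip(weights[i], x)]
        lr *= 0.95  # slowly reduce the learning rate
    return weights
```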
A different approach is employed by the Adaptive Resonance Theory (ART) family of neural networks, proposed by Grossberg (Carpenter & Grossberg, 1987). The basic ART system is an unsupervised learning model and typically consists of neuron layers for comparison and recognition, a vigilance parameter, and a reset module. The comparison and recognition layers are responsible for determining the similarity among input values and for their classification. The vigilance parameter has a considerable influence on the
classification: the higher the vigilance parameter, the more accurate the classification, although such accuracy implies a loss of generalization. Finally, the reset module is used to control the algorithm, constantly verifying the classification precision according to the vigilance parameter. In order to determine an adequate number of clusters, the resulting classification can be evaluated in terms of the inter-cluster and intra-cluster distances, as proposed by Mello et al. (2005). The average distance among clusters is known as the inter-cluster distance, and the distance among elements of the same cluster as the intra-cluster distance. Those parameters determine the level of independence among the classified data, allowing the desired generalization to be tuned. The original version of the ART network is known as ART-1 and was limited to the classification of binary data. In order to allow the classification of continuous input values, an extension of the network was proposed, called ART-2. Further performance optimizations of the ART-2 network resulted in the ART-2A network (He, Tan & Tan, 2004). Among other modifications of the ART family are Fuzzy ART, which employs fuzzy logic to reduce the number of clusters created during classification; the ART-3 network, which allows partial neuron inhibition using a neuro-transmitter mechanism; and ARTMAP, or Predictive ART, which combines ART-1 and ART-2 networks into a supervised learning structure, among others (Marinai, Gori, & Soda, 2005). A different approach is used in the Independent Component Analysis (ICA) family of networks, which extracts a series of independent signals from a composite one by detecting the correlations among them. Applied to process behavior classification, such networks can be used to determine and separate the different behaviors of a process. Among the different ICA network implementations are Infomax, FastICA and MF-ICA (Rosca, Erdogmus, Príncipe & Haykin, 2006). A different classification and prediction model is used by Radial Basis Function (RBF) networks (Powell, 1987), which calculate a regressive function to represent the input patterns. This is performed by combining a series of Gaussian functions, whose combination results in a regression equation that describes the observed behavior. Notable extensions of the RBF network are the Recurrent RBF, described by Zemouri et al. (2003), where it is used to predict chaotic time series, and the Time-Delay RBF (Berthold, 1994), employed for temporal behavior recognition. The concept of interconnections among neurons with constant feedback and back-propagation techniques is used in the Multi-Layer Perceptron (MLP) neural network (Hornik, Stinchcombe & White, 1989). This allows the network to learn and adapt itself to the input patterns. Another approach is employed in Time-Delay Neural Networks (TDNN), introduced by Waibel et al. (1989) and used for position-independent recognition of features within a larger pattern. While in a traditional neural network the basic unit computes the weighted sum of its inputs and then forwards it through a nonlinear function to other units, the basic unit of the TDNN is modified by introducing n delays to the input, so an input layer composed of y inputs generates z = y * (n+1) inputs to the network, corresponding to the current and past inputs. By employing delays, the neural network can relate and compare the current input to previously observed data; in this way it effectively implements a short-term memory mechanism.
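As a concrete illustration of this delay expansion, the sketch below builds the z = y * (n + 1) delayed inputs from a sequence of observation vectors; it shows only the input transformation, not a complete TDNN.

```python
# Sketch of the TDNN input expansion described above: with y raw inputs and
# n delays, each training example carries z = y * (n + 1) values, i.e. the
# current observation plus the n previous ones.
def expand_with_delays(series, n_delays):
    """series: list of observation vectors; returns delayed training examples."""
    examples = []
    for t in range(n_delays, len(series)):
        window = []
        for d in range(n_delays, -1, -1):   # past observations, oldest first
            window.extend(series[t - d])
        examples.append(window)
    return examples

# Two raw inputs per time step (y = 2) and two delays (n = 2) -> 6 values each.
obs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
print(expand_with_delays(obs, n_delays=2))
```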
The output values are compared to the expected ones, and the resulting error is back-propagated through the network, updating the weights and decreasing the prediction error. This procedure is repeated until the results converge to the expected outputs. Among the extensions of the time-delay family of networks is the ATNN, which extends the TDNN by including the ability to adjust the delays dynamically, therefore allowing better adaptation to different access patterns, as demonstrated by Day and Davenport (1993). A statistical approach for prediction is introduced by Bayesian networks (Morawski, 1989), which
consider the correlations and conditional dependences among different patterns in order to predict future accesses. A different prediction strategy relies on Markov chains (Bolch, Greiner, Meer & Trivedi, 1998), representing the execution as a sequence of application state changes with a given probability for each transition. Thus, the complete application execution can be represented as a sequence of state changes, and by knowing the probability of each such change it is possible to predict future application behavior (Dodonov et al., 2006). This approach is further extended by the Hidden Markov Model (HMM), which considers series of state changes to make more complex predictions (Madhyastha & Reed, 1997), and by the Kalman filter, which aims at forecasting behavioral trends (Kohler, 1997). It is also possible to employ similarity and clustering techniques, such as the ones used by the SOM neural network, to forecast future application behavior. Among such techniques are Temporal Kohonen Maps, or TKM (Chappell & Taylor, 1993), which aim at classifying the application behavior according to its temporal states. This technique introduces the concept of short-term memory, which is used to store the historical neighborhood changes for each application state. A different strategy is used by the auto-regressive SOM (AR-SOM) model (Lampinen & Oja, 1989), which represents each application state as an auto-regressive vector composed of a time series of past application behaviors; such an approach therefore allows the traditional SOM model to be used to predict possible changes in application behavior. Another approach, also based on the SOM network, is represented by the Vector-Quantized Temporal Associative Memory (VQTAM) network, proposed by Barreto and Araujo (2004). This method combines application behavior classification with an associative memory to predict future application events. The short-term memory concept for data classification and prediction is also employed by recurrent neural networks, characterized by the use of connections to store recent events, such as the Elman network (Kremer, 1995), which employs a special context layer to maintain historical events, or the Hopfield network (Hopfield, 1988), which acts as an associative memory between input and output patterns. Such networks provide fast access to recent events and are highly efficient for short-term predictions. However, while such approaches can be effective for short-term prediction, they do not provide adequate results for long-term sequences, requiring a different approach. Therefore, the Long Short-Term Memory (LSTM) network was introduced in research by Hochreiter, Schmidhuber and Gers (Hochreiter & Schmidhuber, 1997; Gers & Schmidhuber, 2000), combining conventional short-term memory with long-term prediction, noise detection and noise reduction techniques. The network aims at maintaining a constant error flow, preventing false positives and resulting in superior prediction precision when compared to other approaches, as shown in several works (Pérez-Ortiz, Gers & Schmidhuber, 2003). Long-term prediction is differentiated from short-term prediction by introducing memory gates which control whether the memory content should be modified. This approach results in effective prediction for both short- and long-range sequences, even over noisy observations, at a comparatively low computational cost. However, the network topology must be carefully planned for efficient prediction.
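Several of the strategies above (Markov chains, TKM, VQTAM) ultimately rely on estimating how likely one observed state is to follow another. The sketch below shows the simplest version of this idea: transition probabilities are estimated from an observed sequence of coarse application states and used to predict the most likely next state. The state labels are illustrative and the code is not the cited authors' implementation.

```python
# Sketch of the Markov-chain idea above: estimate transition probabilities
# from an observed sequence of application states and predict the most
# likely next state. States here are illustrative labels.
from collections import defaultdict

def transition_probabilities(states):
    counts = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    return {
        s: {t: c / sum(targets.values()) for t, c in targets.items()}
        for s, targets in counts.items()
    }

def predict_next(probabilities, current_state):
    targets = probabilities.get(current_state)
    if not targets:
        return None
    return max(targets, key=targets.get)

observed = ["cpu", "cpu", "io", "cpu", "io", "net", "cpu", "io"]
probs = transition_probabilities(observed)
print(predict_next(probs, "cpu"))   # most probable state following "cpu"
```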
A novel approach for behavior prediction was simultaneously introduced by several authors, represented by techniques such as Echo State Networks, Liquid State Machines and Backpropagation-Decorrelation, all known under the name of Reservoir Computing (Jaeger, 2007). Their functionalities are similar to recurrent networks, storing temporal information in internal network nodes. The proposed approaches aim at providing efficient and low-cost prediction of series with a priori unknown behavior, being based on a large network with randomly generated units, known as reservoir. The input signal is submitted to
the network, being mapped into a higher dimension by the reservoir dynamics. A supervising mechanism, controlled by the output weight vector, is trained to read the state of the reservoir and map it to the desired output. Such a mapping is known as the echo property of the reservoir. As only the output weights of the network are modified, training has an essentially constant execution time. The results obtained from such techniques are promising, as they are able to outperform previously employed techniques by several orders of magnitude, according to Jaeger (2007). Among the limitations of reservoir computing are learning instability, caused by unconstrained training, and the size of the reservoir network, which is often composed of thousands of neurons. Aiming at overcoming such limitations, a new approach was recently introduced by Gao et al. (2007), known as the Spiral Recurrent Neural Network (SRNN). The proposed network introduces a fading memory feature by combining a trainable hidden recurrent layer with the echo property of reservoir computing, resulting in highly effective prediction for both short-term and long-term horizons, according to its authors.
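A minimal echo state network sketch in the spirit of this description is shown below. It assumes numpy is available, uses a small random reservoir with a ridge-regression readout, and is intended only to illustrate the mechanism, not to reproduce the cited results.

```python
# Minimal echo state network sketch (assumes numpy): a fixed random reservoir
# maps the input into a high-dimensional state; only the linear readout is
# trained, here with ridge regression. Illustrative, not a tuned predictor.
import numpy as np

rng = np.random.default_rng(0)
n_reservoir, spectral_radius, ridge = 200, 0.9, 1e-6

W_in = rng.uniform(-0.5, 0.5, (n_reservoir, 1))
W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
W *= spectral_radius / max(abs(np.linalg.eigvals(W)))   # enforce echo property

def run_reservoir(inputs):
    x = np.zeros(n_reservoir)
    states = []
    for u in inputs:
        x = np.tanh(W_in[:, 0] * u + W @ x)
        states.append(x.copy())
    return np.array(states)

# One-step-ahead prediction of a noisy sine wave.
series = np.sin(np.arange(500) * 0.1) + 0.01 * rng.standard_normal(500)
X, y = run_reservoir(series[:-1]), series[1:]
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_reservoir), X.T @ y)
print("next value estimate:", X[-1] @ W_out)
```

Because only the readout vector W_out is fitted, retraining for a new target is cheap, which reflects the constant-execution-time property noted above.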
APPLICATIONS

The use of process behavior classification and prediction for performance improvement in distributed systems has received considerable attention over the last decades, and is currently applied in different areas of high-performance computing. Devarakonda and Iyer (1989) use the k-means technique to evaluate application history, determining the critical execution points and the overall resource utilization in different execution states. Linear classification, as performed by SVM, can be used for pattern prediction over long disordered sequences, as demonstrated by Hirose et al. (2007), and for time series prediction, as shown by Muller et al. (1997) and Chen et al. (2004). Markov chains and time-delay neural networks are employed for access pattern detection and prediction in distributed systems by Sakr et al. (1996) and Dodonov and Mello (2006, 2007). These works demonstrated the efficiency of such techniques for correct access pattern representation in distributed systems, in support of data prefetching and load balancing. Artificial intelligence techniques are used for distributed scheduling in the MidHPC project (Mello et al., 2007), intended for transparent execution of concurrent applications over large distributed environments. Application behavior knowledge is employed by the Route scheduling algorithm (Mello, Senger, & Yang, 2006), which deploys jobs based on the observed behavior. The Route algorithm is further extended by the RouteGA algorithm (Mello, Andrade, Senger, & Yang, 2007), which takes scheduling decisions considering the network analysis performed by a genetic algorithm and migrates processes according to the network environment configuration (Dodonov, Mello, & Yang, 2005). The nature of data accesses in distributed systems is studied by Kroeger and Long (1999), demonstrating that the analysis of execution history can improve prediction results by up to 80%. Similar results were obtained in research by Byna et al. (2004), evaluating MPI accesses in distributed applications. The prediction of process behavior is also used by Martin et al. (2003) to improve the trade-off between latency and bandwidth in shared memory multiprocessors. Neural network-based approaches are also used for job scheduling in grid environments by Ishii, Mello and Yang (2007), as well as for user behavior extraction and classification (Santos, Mello, & Yang, 2007).
Finally, a series of innovative process scheduling approaches based on application behavior analysis is studied in research by Mello and Yang (2008), who evaluate chaos theory approaches, and by Nery et al. (2006), who distribute processes over the network using ant colony optimization.
CONCLUSION

In this chapter, we presented different techniques for application behavior extraction, classification and prediction, ranging from source code evaluation approaches and process execution tracing to platform-specific techniques, such as MPI-based approaches and distributed application execution monitoring. The correct identification and classification of the different process execution states determine the effectiveness of behavior prediction. Therefore, several statistical and artificial intelligence-based approaches were studied for access pattern extraction, classification and prediction. Finally, applications based on different behavior prediction strategies were discussed to illustrate the effectiveness of such techniques.
REFERENCES Albertini, M. K., & Mello, R. F. (2007). A self-organizing neural network for detecting novelties. In Sac ’07: Proceedings of the 2007 ACM Symposium on Applied Computing (pp. 462–466). New York: ACM Press. Andrade Filho, J. A., Mello, R. F., Dodonov, E., Senger, L. J., Yang, L. T., & Li, K. C. (2008). Toward an Efficient Middleware for Multithreaded Applications in Computational Grid. In IEEE International Conference on Computational Science and Engineering, (p. 147-154). Arpaci-Dusseau, A. C., Culler, D. E., & Mainwaring, M. (1998). Scheduling with Implicit Information in Distributed Systems. In Proceedings of ACM SIGMETRICS’98 (pp. 233–248). Barreto, G. A., & Araujo, A. F. R. (2004). Identification and control of dynamical systems using the self-organizing map. In Special Issue of IEEE Transactions on Temporal Coding (pp. 1244-1259). Berthold, M. (1994). A time delay radial basis function network for phoneme recognition. IEEE World Congress on Computational Intelligence., 1994 IEEE International Conference on Neural Networks, 7. Bianchini, R., Pinto, R., & Amorim, C. L. (1998). Data prefetching for software DSMs. In International Conference on Supercomputing (pp. 385-392). Bolch, G., Greiner, S., de Meer, H., & Trivedi, K. S. (1998). Queueing networks and markov chains: modeling and performance evaluation with computer science applications. New York: Wiley-Interscience. Bradley, P. S., Fayyad, U. M., & Reina, C. (1998). Scaling clustering algorithms to large databases. In Knowledge Discovery and Data Mining (pp. 9-15).
Byna, S., Sun, X.-H., Gropp, W., & Thakur, R. (2004). Predicting memory-access cost based on dataaccess patterns. In Cluster ’04: Proceedings of the 2004 IEEE International Conference on Cluster Computing (pp. 327–336). Washington, DC: IEEE Computer Society. Carpenter, G. A., & Grossberg, S. (1987). Art 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26, 4919–4930. doi:10.1364/AO.26.004919 Chan, A., Gropp, W., & Lusk, E. (2003). User’s guide for mpe extensions for mpi programs. Technical Report ANL-98/xx, Argonne National Laboratory, 1998. Retrieved from ftp://ftp.mcs.anl.gov/pub/mpi/ mpeman.ps. Chappell, G. J., & Taylor, J. G. (1993). The temporal kohonen map. Neural Networks Journal, 6(3), 441–445. doi:10.1016/0893-6080(93)90011-K Chen, B. J., Chang, M. W., & Lin, C. J. (2004). Load forecasting using support vector machines: A study on eunite competition 2001. IEEE Transactions on Power Systems, 19(4), 1821–1830. doi:10.1109/ TPWRS.2004.835679 Corbalan, J., Martorell, X., & Labarta, J. (2001). Improving Gang Scheduling through Job Performance Analysis and Malleability. In International Conference on Supercomputing (pp. 303–311). Sorrento, Italy. Day, S. P., & Davenport, M. R. (1993, March). Continuous-time temporal back-propagation with adaptable time delays. IEEE Transactions on Neural Networks, 4(2), 348–354. doi:10.1109/72.207622 de Mello, R. F., Filho, J. A. A., Senger, L. J., & Yang, L. T. (2007). RouteGA: A grid load balancing algorithm with genetic support. In AINA (pp. 885-892). New York: IEEE Computer Society. de Mello, R. F., Senger, L. J., & Yang, L. T. (2006). Performance evaluation of route: A load balancing algorithm for grid computing. Research Initiative, Treatment Action, 13(1), 87–108. Devarakonda, M. V., & Iyer, R. K. (1989). Predictability of process resource usage: A measurement-based study on UNIX. IEEE Transactions on Software Engineering, 15(12), 1579–1586. doi:10.1109/32.58769 Dodonov, E., Mello, R., & Yang, L. T. (2006). Adaptive technique for automatic communication access pattern discovery applied to data prefetching in distributed applications using neural networks and stochastic models. In Proceedings of the ISPA’06. Dodonov, E., & Mello, R. F. (2007). A model for automatic on-line process behavior extraction, classification and prediction in heterogeneous distributed systems. In CCGRID’07: Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (pp. 899–904). Washington, DC: IEEE Computer Society. Dodonov, E., Mello, R. F., & Yang, L. T. (2005). A network evaluation for lan, man and wan grid environments. In L. T. Yang, M. Amamiya, Z. Liu, M. Guo, & F. J. Rammig (Eds.), EUC, 3824 (pp.11331146). Berlin: Springer.
Dodonov, E., Sousa, J. Q., & Guardia, H. C. (2004). Gridbox: securing hosts from malicious and greedy applications. In Proceedings of the 2nd Workshop on Middleware for Grid Computing (pp. 17–22). New York: ACM Press. dos Santos, M. L., de Mello, R. F., & Yang, L. T. (2007). Extraction and classification of user behavior. In T.-W. Kuo, E. H.-M. Sha, M. Guo, L. T. Yang, & Z. Shao (Eds.), EUC (LNCS Vol. 4808, p. 493-506). Berlin: Springer. Fang, W., Wang, C. L., Zhu, W., & Lau, F. C. M. (2004). Pat: a postmortem object access pattern analysis and visualization tool. In CCGRID 2004, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, (pp. 379-386). Chicago: IEEE Computer Society. Feitelson, D. G., Rudolph, L., Schwiegelshohn, U., Sevcik, K. C., & Wong, P. (1997). Theory and practice in parallel job scheduling. In Job Scheduling Strategies for Parallel Processing (LNCS Vol. 1291, pp. 1–34). Berlin: Springer Verlag. Fischer, P. C. (1965). On formalisms for turing machines. Journal of the ACM, 12(4), 570–580. doi:10.1145/321296.321308 Fritzke, B. (1995). A growing neural gas network learns topologies. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7 (pp. 625–632). Cambridge MA: MIT Press. Gabriel, E., Fagg, E. G., Bosilca, G., Angskun, T., Dongarra, J. J., Squyres, J. M., et al. (2004). Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings of 11th European PVM/MPI Users’ Group Meeting, (pp. 97-104). Gao, H., Sollacher, R., & Kriegel, H. P. (2007). Spiral recurrent neural network for online learning. In Proceedings of 15th European Symposium on Artificial Neural Networks (pp. 483–488). Gers, A. F., & Schmidhuber, J. (2000, 25). Long short-term memory learns context free and context sensitive languages. In Proceedings of IDSIA, 3. Gibbons, R. (1997). A historical application profiler for use by parallel schedulers. In Job Scheduling Strategies for Parallel Processing (pp. 58–77). Berlin: Springer Verlag. Gropp, W., Lusk, E., Doss, N., & Skjellum, A. (1996, September). A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6), 789–828. doi:10.1016/0167-8191(96)00024-5 He, J., Tan, A.-H., & Tan, C.-L. (2004). Modified art 2a growing network capable of generating a fixed number of nodes. IEEE Transactions on Neural Networks, 15(3), 728–737. doi:10.1109/ TNN.2004.826220 Hirose, S., Shimizu, K., Kanai, S., Kuroda, Y., & Noguchi, T. (2007). Poodle-l: A two-level SVM prediction system for reliably predicting long disordered regions. Bioinformatics (Oxford, England), 23(16), 2046–2053. doi:10.1093/bioinformatics/btm302 Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. doi:10.1162/neco.1997.9.8.1735
Hopfield, J. J. (1988). Neural networks and physical systems with emergent collective computational abilities. Neurocomputing: Foundations of Research, 457–464. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366. doi:10.1016/0893-6080(89)90020-8 Huband, S., & McDonald, C. (2001). A preliminary topological debugger for MPI programs. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid, (pp. 422-429). Ishii, R. P., de Mello, R. F., & Yang, L. T. (2007). A complex network-based approach for job scheduling in grid environments. In R. H. Perrott, B. M. Chapman, J. Subhlok, R. F. de Mello, & L. T. Yang (Eds.), HPCC (LNCS Vol. 4782, p. 204-215). Berlin: Springer. Ishikawa, Y., Matsuda, M., Kudoh, T., Tezuka, H., & Sekiguchi, S. (2003). The design of a latency-aware mpi communication library. In Swopp03. Jaeger, H. (2007). Echo state network. Scholarpedia, 2(9), 2330. Available at http://www.scholarpedia. org/article/Echo_state_network Jain, R. (1991). The art of computer systems performance analysis: Techniques for experimental design, measurement, simulation, and modeling. New York: John Wiley and Sons. Jiang, D., Tang, C., & Zhang, A. (2004). Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering, 16(11), 1370–1386. doi:10.1109/TKDE.2004.68 Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D), 35–45. Kaski, S., & Oja, E. (1999). Kohonen maps. New YorkUSA: Elsevier Science Inc. Keskar, D., & Leibowitz, M. (2005). Speeding up openoffice: profiling, tools, approaches. In First Openoffice.org Conference. Kotz, D., & Ellis, C. S. (1993). Practical prefetching techniques for multiprocessor file systems. Journal of Distributed and Parallel Databases, 1(1), 33–51. doi:10.1007/BF01277519 Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(4), 1000–1004. doi:10.1109/72.392262 Kroeger, T. M., & Long, D. D. E. (1999). The case for efficient file access pattern modeling. In Workshop on Hot Topics in Operating Systems (p. 14-19). Kunze, M., & Steffens, J. (1995). Growing cell structure and neural gas; incremental neural networks. In Proceedings of the 4th AIHEP Workshop, Pisa, Italy. Lampinen, J., & Oja, E. (1989). Self-organizing maps for spatial and temporal AR models. In M. Pietikainen & J. Roning (Eds.), Proceedings of 6th SCIA, Scandinavian Conference on Image Analysis (pp.120–127). Helsinki, Finland: Suomen Hahmontunnistustutkimuksen seura r. y.
Lei, H., & Duchamp, D. (1997). An analytical approach to file prefetching. In 1997 USENIX Annual Technical Conference. Anaheim, USA. Liao, T. W., Celmins, A. K., Robert, J., & Hammell, I. (2003). A fuzzy c-means variant for the generation of fuzzy term sets. Fuzzy Sets and Systems, 135(2), 241–257. doi:10.1016/S0165-0114(02)00136-7 Loiseaux, C., Graf, S., Sifakis, J., Bouajjani, A., & Bensalem, S. (1995). Property preserving abstractions for the verification of concurrent systems. Formal Methods in System Design, 6(1), 11–44. doi:10.1007/ BF01384313 Madhyastha, T. M., & Reed, D. A. (1997). Input/output access pattern classification using hidden Markov models. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems (pp. 57–67). San Jose, CA: ACM Press. Marinai, S., Gori, M., & Soda, G. (2005). Artificial neural networks for document analysis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(1), 23–35. doi:10.1109/ TPAMI.2005.4 Marsland, S., Shapiro, J., & Nehmzow, U. (2002). A self-organizing network that grows when required. Neural Networks, 15(8-9), 1041–1058. doi:10.1016/S0893-6080(02)00078-3 Martin, M., Harper, P., Sorin, D., Hill, M., & Wood, D. (2003, June). Using destination-set prediction to improve the latency /bandwidth tradeoff in shared memory multiprocessors. In Proceedings of the 30th Annual International Symposium on Computer Architecture. Mehrotra, S., & Harrison, L. (1996). Examination of a memory access classification scheme for pointerintensive and numeric programs. In ICS’96 (pp. 133-140). Mello, R. F., Andrade, J. A., Dodonov, E., Ishii, R. P., & Yang, L. T. (2007). Optimizing distributed data access in grid environments by using artificial intelligence techniques. In I. Stojmenovic, R. K. Thulasiram, L. T. Yang, W. Jia, M. Guo, & R. F. de Mello (Eds.), ISPA’07 (LNCS Vol. 4742, pp. 125136). Berlin: Springer. Mello, R. F., & Senger, L. J. (2004). A new migration model based on the evaluation of processes load and lifetime on heterogeneous computing environments. In 16th Symposium on Computer Architecture and High Performance Computing (SBAC’2004) Foz do Iguacu, PR, Brazil, (pp. 222–227). Mello, R. F., Senger, L. J., & Yang, L. T. (2005). Automatic text classification using an artificial neural network. High Performance Computational Science and Engineering, 1, 1–21. Mello, R. F., & Yang, L. T. (2008). Prediction of Dynamical, Non-Linear and Unstable Process Behavior. Journal of Supercomputing. Dordrecht, the Netherlands: Springer Netherlands. Morawski, P. (1989). Understanding bayesian belief networks. AI Expert, 4(5), 44–48. Muller, K., Smola, A., Ratsch, G., Scholkopf, B., Kohlmorgen, J., & Vapnik, V. (1997). Predicting time series with support vector machines. Proceedings of the International Conference on Artificial Neural Networks, (pp. 999–1004).
Nagel, W. E., Arnold, A., Weber, M., Hoppe, H. C., & Solchenbach, K. (1996). VAMPIR: Visualization and analysis of MPI resources. Supercomputer, 12(1), 69–80. Nery, B. R., de Mello, R. F., & Carvalho, A. C. P. Leon Ferreira, & Yang, L. T. (2006). Process scheduling using ant colony optimization techniques. In M. Guo, L. T. Yang, B. D. Martino, H. P. Zima, J. Dongarra, & F. Tang (Eds.), ISPA (LNCS Vol. 4330, p. 304-316). Berlin: Springer. Nethercote, N., & Fitzhardinge, J. (2004, January). Bounds-checking entire programs without recompiling. In Informal Proceedings of the Second Workshop on Semantics, Program Analysis, and Computing Environments for Memory Management (SPACE 2004), Venice, Italy. Pérez-Ortiz, J. A., & Gers, F. A. E., D., & Schmidhuber, J. (2003). Kalman filters improve LSTM e network performance in problems unsolvable by traditional recurrent nets. In Neural Networks, 16(2). Powell, M. (1987). Radial basis functions for multivariable interpolation: a review. Clarendon Press Institute Of Mathematics And Its Applications Conference Series, (pp. 143–167). Rosca, J. P., Erdogmus, D., Principe, J. C., & Haykin, S. (Eds.). (2006). Independent component analysis and blind signal separation, in Proceedings of 6th International Conference, ICA 2006, Charleston, SC. New York: Springer. Sakr, M., Giles, C., Levitan, S., Horne, B., Maggini, M., & Chiarulli, D. (1996). On-line prediction of multiprocessor memory access patterns. In Proceedings of the IEEE International Conference on Neural Networks (pp. 1564-1569). Schölkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond (adaptive computation and machine learning). Cambridge, MA: The MIT Press. Schuster, A. (2003). Scalable distributed model checking: Experiences, lessons, and expectations. Electronic Notes on Theory of Computer Science, 89(1). Senger, L. J., Mello, R. F., Santana, M. J., & Santana, R. C. (2005). An on-line approach for classifying and extracting application behavior on Linux. In L. T. Yang & M. Guo (Eds.), High Performance Computing: Paradigm and Infrastructure (chap. 20). New York: John Wiley and Sons. Silva, F. A. B. D., & Scherson, I. D. (2000). Improving parallel job scheduling using runtime measurements. In D. G. Feitelson & L. Rudolph (Eds.), Job Scheduling Strategies for Parallel Processing (LNCS, Vol. 1911, pp. 18–38). Berlin: Springer Verlag. Smith, W., Foster, I. T., & Taylor, V. E. (1998). Predicting application run times using historical information. In JSSPP (p. 122-142). Squyres, J. M., & Lumsdaine, A. (2003). A Component Architecture for LAM/MPI. In Proceedings of 10th European PVM/MPI Users’ Group Meeting (pp. 379–387). Venice, Italy: Springer-Verlag. Suykens, J. A., & Vandewalle, J. (2000). The K. U. Leuven competition data: a challenge for advanced neural network techniques. In Esann (pp. 299-304)
Thacker, N. A., Abraham, I., & Courtney, P. (1997). Supervised learning extensions to the clam network. Neural Networks, 10(2), 315–326. doi:10.1016/S0893-6080(96)00074-3 Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1989). Phoneme recognition using time delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, 328–339. doi:10.1109/29.21701 Weigend, A. S., & Gershenfeld, N. A. (1994). Time series prediction: Forecasting the future and understanding the past. In A. S. Weigend & N. A. Gershenfeld (Eds.), Santa Fe Institute Studies on the Sciences of Complexity, Proceedings of the NATO Advanced Research Workshop on Comparative Time Series Analysis, Santa Fe, New Mexico, May 14-17, 1992. New York: Addison-Wesley Wilding, M., & Behman, D. (2005). Self-service Linux: mastering the art of problem determination (1st edition). Upper Saddle River, NJ: Prentice Hall. Zemouri, R., Racoceanu, D., & Zerhouni, N. (2003). Recurrent radial basis function network for timeseries prediction. Engineering Applications of Artificial Intelligence, 16(5-6), 453–463. doi:10.1016/ S0952-1976(03)00063-0
KEY TERMS AND DEFINITIONS

Application Behavior Extraction: The process of extracting and transcribing the events observed during application execution.
Application Behavior Classification: Evaluation of the extracted application behavior, aiming at determining the most representative execution patterns and reducing the data dimensionality.
Application Behavior Prediction: Forecasting of future application actions based on previously observed behavior.
Application Knowledge: The transcription of the resources used by applications during the course of execution.
Data Prefetching: Anticipated reading of data elements according to the forecast execution patterns, aiming at reducing the access latency.
Process Execution States: The set of information which defines the process behavior at a given time instant.
Process Scheduling: Allocation of applications across the environment, aiming at reducing system idleness and minimizing the total execution time.
Chapter 16
A Structured Tabu Search Approach for Scheduling in Parallel Computing Systems

Tore Ferm, Sydney University, Australia
Albert Y. Zomaya, Sydney University, Australia
ABSTRACT

Task allocation and scheduling are essential for achieving the high performance expected of parallel computing systems. However, there are serious issues pertaining to the efficient utilization of computational resources in such systems that need to be resolved, such as achieving a balance between system throughput and execution time. Moreover, many scheduling techniques involve massive task graphs with complex precedence relations, processing costs, and inter-task communication costs. In general, there are two main issues that should be highlighted: problem representation and finding an efficient solution in a timely fashion. In the work proposed here, the authors have attempted to overcome the first problem by using a structured model which offers a systematic method for the representation of the scheduling problem. The model used can encode almost all of the parameters involved in a scheduling problem in a very systematic manner. To address the second problem, a Tabu Search algorithm is used to allocate tasks to processors in a reasonable amount of time. The use of Tabu Search has the advantage of obtaining solutions to more general instances of the scheduling problem in reasonable time spans. The efficiency of the proposed framework is demonstrated by using several case studies. A number of evaluation criteria are used to optimize the schedules. Communication- and computation-intensive task graphs are analyzed, as are a number of different task graph shapes and sizes.
INTRODUCTION

The impressive proliferation of the use of parallel processor systems these days in a great variety of applications is the result of many breakthroughs over the last two decades. These breakthroughs span
a wide range of specialities, such as device technology, computer architectures, theory, and software tools. However, there remain many problems that need to be addressed which will keep the research community busy for years to come (Zomaya, 1996). The scheduling problem involves the allocation of a set of tasks or jobs to resources, such that the optimum performance is obtained. If these tasks are not inter-dependent the problem is known as task allocation. In a parallel computing system one would expect a linear speedup in performance when more processors (or computers) are employed. However, in practice, this is generally not the case, due to such factors as communication overhead, control overhead, and precedence constraints between tasks (Lee et al., 2008). Thus, the development of efficient scheduling techniques would improve the operation of parallel processor systems. The efficiency of a parallel processor system is commonly measured by completion time, speedup, or throughput, which in turn reflect the quality of the scheduler. Many heuristic algorithms have already been developed which provide effective solutions. Most of these methods, however, can solve only limited instances of the scheduling problem (El-Rewini, 1996, Macey and Zomaya, 1997, Nabhan and Zomaya, 1997). The scheduling problem is known to be NP-complete for the general case and even for many restricted instances (Salleh and Zomaya, 1999). For this reason, scheduling is usually handled by heuristic methods which provide reasonable solutions for restricted instances of the problem (El-Rewini, 1996). Most research on scheduling has dealt with the problem when the tasks, inter-processor communication costs, and precedence constraints are fully known. When the task information is known a priori, the problem is known as static scheduling. On the other hand, when there is no a priori knowledge about the tasks the problem is known as dynamic scheduling. For dynamic scheduling problems with precedence constraints optimal scheduling algorithms are not known to exist (Lee and Zomaya, 2008). In non-preemptive scheduling, once a task has begun on a processor, it must run to completion before another task can start execution on the same processor. In preemptive scheduling, it is possible for a task to be interrupted during its execution, and resumed from that position on the same or any other processor, at a later time. Although preemptive scheduling requires additional overhead, due to the increased complexity, it may perform more effectively than non-preemptive methods (El-Rewini, 1996). Furthermore, a non-adaptive scheduler does not change its behaviour in response to feedback from the system. This means that it is unable to adapt to changes in system activity. In contrast, an adaptive scheduler changes its scheduling according to the recent history and/or current behaviour of the system (Zomaya and Teh, 2001, Seredynski and Zomaya, 2002). In this way, adaptive schedulers may be able to adapt to changes in system use and activity. Adaptive schedulers are usually known as dynamic, since they make decisions based on information collected from the system.
TASK SCHEDULING AND PROBLEM FORMULATION

An application can be represented by a Directed Acyclic Graph (DAG), G = (V,E), where V is the set of v nodes and E is the set of e edges. A node ni in a DAG represents a task, and its weight wi represents the computational cost required to complete that task. An edge (i,j), connecting nodes ni and nj (in the direction i → j), represents a task precedence constraint, in which ni must be completed before nj can begin; its weight cij represents the communication cost of sending the required data from task i to task j. This communication cost is only incurred if nodes ni
and nj are scheduled onto different processors. A node with no incoming edges is known as an entry task, and one without any outgoing edges is known as an exit task. In the case of multiple entry or exit tasks, a pseudo-entry/exit node is created that has zero-cost edges connecting it to all entry/exit nodes. This simplifies the DAG and does not affect the schedule. A task is considered a ready task if all its precedence constraints have been met, i.e. all parent nodes have completed. The goal of scheduling a DAG is to reduce the fitness criteria specified, by mapping the tasks onto processors, properly ordering the tasks on these processors, and by ensuring that all precedence constraints are met. The most common fitness criterion is a simple measure of the length of the schedule (makespan) produced. However, other methods exist, such as: minimizing the amount of communication across the interconnection network; load balancing the computation as equally as possible among all processors; minimizing idle time on the processors; or any combination of these. Some heuristics also aim at the parallel system architecture and attempt to minimize the setup costs of the parallel processors (or computers) (Bruno et al., 1974, Dogan and Özgüner, 2002). There are a number of different broad techniques that have been developed for solving the task scheduling problem. List-based techniques are the most common, and are popular because they produce competitive solutions with relatively low time complexity when compared to the other techniques. The two steps that comprise most list-based techniques are task prioritisation and processor selection, where tasks are prioritised based upon a prioritising function and subsequently mapped onto a processor. (This second step is trivial for homogeneous systems, where the processor speed does not matter.) The algorithms maintain a list of all the tasks ordered by their priority. In clustering techniques, initial clusters contain a single task. An iteration of the heuristic improves the clustering by combining some of the clusters, should the resulting combined cluster reduce the finish time. This technique requires an additional step, when compared to list-based techniques, which involves mapping an arbitrary number of clusters onto a bounded number of processors (or further merging the clusters so that the number of clusters matches the number of processors available). Duplication-based techniques are another design which revolves around reducing the amount of communication overhead in a parallel execution. By (redundantly) duplicating certain tasks and running them on more than one processor, the precedence constraints are maintained, but the communication from a duplicated task to a child task is eliminated. Finally, the most computationally expensive of the scheduling techniques is the guided random search technique. This category includes some very popular algorithms, such as Simulated Annealing, Genetic Algorithms, Tabu Search and Neural Networks. These algorithms generally require a set of parameters especially tailored to the problem they are attempting to solve. A variety of these different approaches have been compared in terms of efficiency, and Tabu Search performed roughly in the middle of all the approaches on all the tests performed (Siegel and Ali, 2000).
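To make this model concrete, the sketch below evaluates the makespan of a given task-to-processor assignment on a small DAG, charging an edge's communication cost only when parent and child run on different processors. The graph and assignment are illustrative, and the code is not part of the implementation described later in this chapter.

```python
# Sketch of the scheduling model described above: given a DAG with task
# costs and edge communication costs, compute the makespan of a particular
# task-to-processor assignment (tasks inserted in topological order).
def makespan(tasks, edges, assignment):
    """tasks: {task: cost}; edges: {(parent, child): comm_cost};
    assignment: {task: processor}. Tasks must be listed in topological order."""
    finish = {}                      # finish time of each scheduled task
    proc_ready = {}                  # time at which each processor becomes free
    for t, cost in tasks.items():
        ready = 0.0
        for (p, c), comm in edges.items():
            if c == t:
                # communication cost applies only across different processors
                delay = comm if assignment[p] != assignment[t] else 0.0
                ready = max(ready, finish[p] + delay)
        start = max(ready, proc_ready.get(assignment[t], 0.0))
        finish[t] = start + cost
        proc_ready[assignment[t]] = finish[t]
    return max(finish.values())

tasks = {"a": 2, "b": 3, "c": 4, "d": 1}            # computation costs
edges = {("a", "b"): 5, ("a", "c"): 1, ("b", "d"): 2, ("c", "d"): 2}
print(makespan(tasks, edges, {"a": 0, "b": 0, "c": 1, "d": 0}))
```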
In general, the efficient management of both the processors and communication links of a parallel and distributed system is essential in order to obtain high performance (Kwok and Ahmad, 1999). It is unfortunate that the communication links are often the bottleneck in a distributed system, and processors often end up wasting cycles idling while waiting for data from another processor in order to proceed. One different approach is a heuristic that attempts to increase the idle time on a given processor for extended periods of time so that power consumption is reduced for that processing resource (Zomaya and Chan, 2005).
TABU SEARCH Tabu Search (TS) is best thought of as an intelligent, iterative Hill Descent algorithm, which avoids becoming stuck in local minima by using short- and long-term memory. It has gained a number of key influences from a variety of sources; these include early surrogate constraint methods and cutting plane approaches (Glover and Laguna, 2002). TS incorporates adaptive memory and responsive exploration as prime aspects of the search – in this way it can be considered intelligent problem solving. TS has been proven to be effective when compared to a number of heuristics and algorithms previously proposed (Porto and Ribeiro, 1995), and has been used to solve a wide variety of combinatorial optimisation problems. A very useful introduction to TS can also be found in (Hertz et al., 1995), which demonstrates the main attributes and applications of this powerful search technique.
Short Term Memory Tabu Search’s short term memory focuses on restricting the search space to avoid revisiting solutions and cycling. A list of previously visited solutions is recorded during the search that, along with a value that represents the lifetime that the move is to remain Tabu (or restricted), represents the history of the search. The size of this tabu list can also affect the search considerably. If the size is too small then the primary goal of the short term memory, preventing cycling, might not be achieved. Conversely if it is too large, then too many restrictions are created, which limit the search space covered. Unfortunately, there is no exact method for determining the proper value to prevent cycling in an optimisation problem such as task scheduling. The existence of the tabu list does have the side effect of denying the search from exploring certain areas of the search space in some cases. Unlike Genetic Algorithms or Simulated Annealing, TS attempts to avoid randomization whenever possible, employing it only when other implementation approaches are cumbersome.
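A minimal sketch of such a recency-based tabu list is given below: each recorded move remains forbidden for a fixed number of iterations (its tabu life). The move encoding is illustrative.

```python
# Sketch of the short-term memory described above: a recency-based tabu list
# in which each recorded move stays forbidden for `tenure` iterations.
class TabuList:
    def __init__(self, tenure=7):
        self.tenure = tenure
        self._expires = {}          # move -> iteration at which it stops being tabu

    def add(self, move, iteration):
        self._expires[move] = iteration + self.tenure

    def is_tabu(self, move, iteration):
        return self._expires.get(move, -1) > iteration

# Moves are hashable descriptions, e.g. (task id, source proc, destination proc).
tabu = TabuList(tenure=5)
tabu.add(("task3", 0, 2), iteration=10)
print(tabu.is_tabu(("task3", 0, 2), iteration=12))   # True: still restricted
print(tabu.is_tabu(("task3", 0, 2), iteration=16))   # False: tenure expired
```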
Longer Term Memory Long-term memory is generally frequency-based, and keeps track of the number of times a certain attribute has occurred and the value of the solution while it was present. Long-term memory is used in TS in order to apply graduated tabu states, which can be used to define penalty and incentive values to modify the evaluation of moves. This allows certain aspects of a solution to increase (or decrease) the overall fitness of a move. In this way a move that might provide a lower makespan might be chosen because an attribute it contains makes the resulting solution beneficial, or helps lead the search into a promising search space.
Intensification Intensification strategies are based on modifying choice rules in order to guide the search towards solutions that have particular promising attributes, which have been discovered to be historically good. Similarly, they may also change the search space entirely to a promising region, in order to more thoroughly analyze the area.
Diversification Diversification is performed in TS in order to visit areas of the search space that may remain ignored due to the nature of the search. The easiest way to reach these areas is to perform a number of random restarts from differing points on the search plane. Diversification strategies are employed within TS for a variety of reasons. The chief among these reasons is to avoid cycling or visiting the same set of solutions repeatedly. Other reasons for diversification include adding robustness to the search and escaping from local optima. Genetic Algorithms use randomisation and population based techniques in order to diversify their search domain, while Simulated Annealing also uses randomisation in the form of the temperature function. Diversification is particularly useful when better solutions can only be reached by crossing barriers in the solution space.
Implementation An analysis of previous scheduling heuristics and algorithms has proven that they do not account for the amount of communication present in the schedules produced. Many of the previous designs have either ignored communication altogether, assumed communication is constant, or have used communication, but considered it to have no bearing on the outcome of the final schedule. The TS implementation has been developed with a variety of schedule evaluation criteria, in order to determine whether the schedules produced are effectively using the computational resources available, while limiting the use of potentially limiting factors such as the interconnection network. Each of these evaluation criteria also aims to minimize the makespan (or increase speedup) of the schedules produced, in order for TS to remain competitive (Porto and Ribeiro, 1995).
Design Overview

As mentioned previously, TS has been proven to be an effective scheduling technique; thus it provides a good basis to develop a scheduler for heterogeneous parallel systems. This TS implementation consists of a number of classes, each providing a necessary role in the TS; it has been very loosely based on a TS skeleton that has been previously developed (Blesa et al., 2001):

• Solution: Represents a solution to the task scheduling problem. Contains the current schedule and fitness rating, also providing methods to alter the solution legally (i.e. performing moves).
• Movement: Represents a move from one solution to a neighbouring one. This can either be moving a task from one processor to another, or swapping the processor assignments of two tasks.
• Solver: Runs the actual TS and provides the means to keep track of the current solution and the best solutions.
• TabuStorage: A list of all the moves that have taken place; moves currently considered tabu have a TabuLife greater than 0 iterations.

There are also two auxiliary classes:
• Problem: Represents the task graph itself. Contains all the precedence constraints, task computation values (per processor), and inter-task communication values.
• Setup: A simple class containing a number of variables that can be provided to the TS in order to customise it further, including the Aspiration Plus parameters, TabuLife, etc.
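The skeleton below illustrates, under the class names just listed, how these pieces could be organized in code; the method bodies are placeholders, and the sketch is not the authors' implementation.

```python
# Skeleton of how the classes described above could fit together. The class
# names mirror the text; method bodies are placeholders, not the authors' code.
class Problem:
    """Task graph: precedence constraints, per-processor costs, comm costs."""
    def __init__(self, tasks, edges, comp_costs, comm_costs):
        self.tasks, self.edges = tasks, edges
        self.comp_costs, self.comm_costs = comp_costs, comm_costs

class Solution:
    """A task-to-processor schedule plus its fitness (e.g. makespan)."""
    def __init__(self, problem, assignment):
        self.problem, self.assignment = problem, assignment
        self.fitness = self.evaluate()
    def evaluate(self):
        raise NotImplementedError     # e.g. makespan of the current schedule
    def apply(self, move):
        ...                           # perform a move or swap legally

class Movement:
    """Move one task to another processor, or swap two task assignments."""
    def __init__(self, task_ids, src_proc, dst_proc):
        self.task_ids, self.src_proc, self.dst_proc = task_ids, src_proc, dst_proc

class TabuStorage:
    """Moves with a remaining tabu life greater than zero are restricted."""
    def __init__(self):
        self.life = {}

class Solver:
    """Runs the search, tracking the current and the best solution found."""
    def __init__(self, problem, setup):
        self.problem, self.setup = problem, setup
        self.current = self.best = None
```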
The general diversification steps that exist within most TS implementations, in order to increase the variety of the solutions explored, have been altered to include a system that automatically increases the number of processors available on-the-fly. As the solution becomes stagnant at a particular number of processors, another processor is added and the algorithm begins again from this new point. In this way, the search need only be run once, with a maximum number of processors specified: the search will proceed until it reaches a maximum set number of iterations, or reaches the processor limit. It can be shown that the optimum number of processors may not always be the largest provided, depending on the structure of the task graph itself. The generalized graphical structure of the TS implementation algorithm is shown in Figure 1, with a brief algorithm description in pseudo code.

Figure 1. A description of the Tabu Search algorithm

Figure 2. The algorithm for generating an initial solution
Initialization

The random task graphs used in this chapter have been generated using Task Graphs For Free (TGFF) version 3.0 (Dick et al., 1998). There are three important aspects contained within the '.tgff' file: the properties of the task graph itself, the processors (computation), and the network (communication). The task graph properties are a listing of the nodes and edges of the task graph, together with their types; the edges also represent the precedence constraints within the DAG. There is a single list relating edge types to communication costs, as it is assumed that the network is constant throughout the parallel system. The processor properties contain a list of task types with the corresponding computational cost (these map to the task types provided in the task graph properties section of the file); there is a separate list for each processor. In the initialisation phase, before the TS begins, the properties of the task graph are read into a number of data structures, which are used to represent the tasks themselves, the precedence constraints, and the computational costs for each processor. The initial solutions for the current_solution and best_solution are generated here using a greedy algorithm, so as to provide a starting point for the search.
Initial Solution

The initial solution is generated by a greedy heuristic algorithm that assigns each task so as to obtain its earliest finish time (EFT). The finish time is used instead of the start time because, on a heterogeneous processing system, the computational cost is not fixed. At each iteration, a task is selected and assigned to the processor that will provide it with the EFT. This algorithm maintains precedence constraints and benefits from the heterogeneity by allowing it to make a decision based upon both the communication time (if the processor is different to that of a parent node) and the computation cost of each particular processor (Figure 2).
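Figure 2 is not reproduced here; the following is a sketch of the greedy EFT construction just described, in hypothetical Python. The inputs `tasks` (listed in precedence order), `parents`, `comp_cost`, `comm_cost` and `processors` stand in for data read from the '.tgff' file and are assumptions of this sketch.

```python
def greedy_initial_solution(tasks, parents, comp_cost, comm_cost, processors):
    """Assign each task to the processor that gives it the earliest finish time (EFT)."""
    finish = {}                                # task -> finish time under its assignment
    proc_free = {p: 0 for p in processors}     # time at which each processor becomes free
    assignment = {}
    for task in tasks:                         # tasks assumed listed in precedence order
        best_p, best_eft = None, float("inf")
        for p in processors:
            # Data from a parent on another processor arrives only after the
            # communication delay; on the same processor it costs nothing.
            ready = max((finish[u] + (comm_cost[u, task] if assignment[u] != p else 0)
                         for u in parents[task]), default=0)
            eft = max(ready, proc_free[p]) + comp_cost[task, p]
            if eft < best_eft:
                best_p, best_eft = p, eft
        assignment[task], finish[task] = best_p, best_eft
        proc_free[best_p] = best_eft
    return assignment, finish
```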
Neighborhood

A neighbourhood of solutions is obtained by removing a single task from the task list of one processor and moving it to the task list of another. The entire neighbourhood is obtained by going through every task and moving it to every other processor in the system. To expand the neighbourhood and to help widen the search area, we have added another move category – the swap move. In this move, two tasks are selected and their processor assignments are swapped. In this way the search can proceed further within fewer moves (if the move is worth making), at the expense of some computational efficiency – since more moves are present in each neighbourhood, more must be examined at each iteration of the search. It also allows the search to escape a poor area of the solution space, by allowing the search to 'look ahead' two moves, as opposed to a single step. Thus a neighbouring solution is a solution that differs by a single task assignment, or by the swapping of two task assignments. Each move consists of relabelling the processor a task belongs to; a move consists of a source and destination processor, and either one or two task ids. A solution's neighbourhood is generated at every iteration of the search in the form of a move list, which is the list of moves that transform the current solution into all possible neighbouring solutions. These moves do not include moves considered tabu, which are located in the tabu list.
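The sketch below illustrates this neighbourhood generation in hypothetical Python; the `tabu` object and its `is_tabu` method are assumptions consistent with the tabu list described later.

```python
def neighbourhood(assignment, tasks, processors, tabu):
    """Enumerate shift and swap moves from the current solution, skipping tabu moves."""
    moves = []
    # Shift moves: reassign a single task to every other processor.
    for t in tasks:
        for p in processors:
            if p != assignment[t] and not tabu.is_tabu(("shift", t, p)):
                moves.append(("shift", t, p))
    # Swap moves: exchange the processor assignments of two tasks.
    for i, t1 in enumerate(tasks):
        for t2 in tasks[i + 1:]:
            if assignment[t1] != assignment[t2] and not tabu.is_tabu(("swap", t1, t2)):
                moves.append(("swap", t1, t2))
    return moves
```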
Candidate List

We have used the Aspiration Plus candidate selection criteria in order to reduce the computational cost of the TS implementation. In Aspiration Plus, moves are analyzed until a move under the given threshold is found, called first. The search continues for plus further moves, whereupon the best move found is made (Rangaswamy, 1998). This strategy is further strengthened by the use of a min and max value: the search will always analyze at least min moves, but never more than max. The user specifies three variables (within the Setup class) for the Aspiration Plus candidate selection:

• max – the maximum number of moves to analyze,
• min – the minimum number of moves to analyze, and
• plus – the number of moves to search after first is found.
Unfortunately, while the Aspiration Plus candidate selection criterion reduces the computational cost of the TS by several orders of magnitude, there remains the possibility that good moves are not evaluated in a given iteration, because only a small subset of moves is considered.
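A minimal sketch of this selection rule, assuming lower evaluation values are better and that `evaluate` scores a candidate move; the exact interaction of first, plus, min and max follows the description above and is otherwise our own interpretation.

```python
def aspiration_plus_select(moves, evaluate, first_threshold, min_moves, max_moves, plus):
    """Scan moves until one beats the threshold ('first'), then continue for 'plus'
    further moves; always examine at least min_moves and never more than max_moves."""
    best_move, best_value = None, float("inf")
    limit, first_found = max_moves, False
    for examined, move in enumerate(moves, start=1):
        value = evaluate(move)
        if value < best_value:
            best_move, best_value = move, value
        if not first_found and value < first_threshold:
            first_found = True
            limit = min(max_moves, max(min_moves, examined + plus))
        if examined >= limit:
            break
    return best_move
```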
Tabu List

The main mechanism for using memory within TS is the tabu list: a list of all the moves that have been made so far, along with a time limit specifying the number of iterations for which they are to remain tabu. Each move made within the search is stored inversely within the tabu list, with its associated tabulife.
Table 1. Evaluation criteria

Minimize Length (makespan): The goal of most scheduling research, to reduce the overall length of the solution schedule.

Minimize Communication: Minimizes the total amount of communication present in the final solution as well as minimizing the length of the schedule. A good compromise between finding a good schedule and an efficient use of network resources.

Load Balance: Spreads the load of computation as evenly as possible among the available processors, while still attempting to keep the overall length of the schedule to a minimum.

Combination: This function attempts to minimize communication, balance the load as much as possible among the component processors, and minimize the length of the final solution.
This short-term memory function is used to prevent the search from cycling or returning to previously visited solutions easily. The tabu list must be checked before a move is to be applied, in order to ensure that a tabu move is not made. Merely being located within the tabu list is not enough; the move must also have a tabulife value greater than zero to be considered tabu. If a move located within the tabu list has a tabulife value of zero, it means that the inverse move had been made previously, has since become non-tabu, and thus is available to be performed again.
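The sketch below shows a minimal tabu list with the behaviour just described (inverse moves stored with a tabulife counter, and a move considered tabu only while its counter is greater than zero); it is our illustration in Python, not the authors' TabuStorage class.

```python
class TabuList:
    """Short-term memory: each move made is stored inversely with a tabu life counter."""
    def __init__(self, tabu_life):
        self.tabu_life = tabu_life
        self.life = {}                          # move key -> remaining tabu iterations

    def record(self, inverse_move):
        # The inverse of the move just made becomes tabu for tabu_life iterations.
        self.life[inverse_move] = self.tabu_life

    def is_tabu(self, move):
        # Being in the list is not enough: the remaining life must be greater than zero.
        return self.life.get(move, 0) > 0

    def tick(self):
        # Age every entry by one iteration; entries at zero may be performed again.
        for key in self.life:
            if self.life[key] > 0:
                self.life[key] -= 1
```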
Evaluation Functions

A number of different evaluation functions, specified by the user, can be invoked on the solutions. These evaluation functions allow the user to determine the governing factors to be used in evaluating the fitness of a solution found by the algorithm. While the overall goal of many scheduling algorithms has been to reduce the overall makespan (or length) of the schedules produced, this may not always be the only factor necessary for an efficient solution. The amount of communication present in a solution determines the load that will be placed upon the interconnection network during the execution of the parallelised application. On a limited-bandwidth interconnection network, placing high loads upon the network comes at a premium; it is therefore wise to reduce the overall inter-task communication as much as possible. Modern media-rich applications have increased the amount of bandwidth traversing networks, making the network more and more important to many different applications. In order to produce competitive solutions in a congested network environment, the amount of communication must be limited. Minimizing the communication used is especially important in communication-intensive applications, where each inter-task communication is likely to be highly expensive. Some parallel systems seek to distribute the computation as evenly as possible over their component processors, in other words, load balancing. This is another measure by which the user can analyze the schedules produced by the TS implementation. This evaluation scheme is useful because, in the solutions generated, the work assigned to a processor is proportional to the speed at which it executes tasks: a processor twice as fast as another will have twice as many tasks scheduled to it, but will maintain the same processing time. Load balancing is also important in time-shared systems, where a user may only have a specific amount of time on each processor in order to execute their parallelized application.
Figure 3. (a) Task graph generated by TGFF v3.0, 50 tasks, even in/out degree; and (b) Scheduling of task graph in (a) with the TS implementation – using the ‘minimize length’ evaluation criteria
Figure 4. Execution trace of the task graph in Figure 3(a)
Similarly, a combination of all these methods might be appropriate: the user may want to analyze the schedules for an evenly distributed system that reduces the total amount of communication, with the fastest finish time possible. In all of the above cases, the requirements of different systems are catered for by the use of differing evaluation functions. Each evaluation function allows the user more control over the properties of the schedules produced. Table 1 lists the four evaluation criteria that have been used to determine the effect of communication on the schedules produced by the TS implementation.
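The chapter does not give the exact form of these functions, so the following Python sketch is only one plausible reading of Table 1; the `schedule` attributes (`tasks`, `edges`, `comm_cost`, `processor`, `busy_time_per_processor`) and the combination weights w1–w3 are assumptions.

```python
def makespan(schedule):
    return max(task.finish for task in schedule.tasks)

def total_communication(schedule):
    # Sum of communication on edges whose endpoints sit on different processors.
    return sum(schedule.comm_cost[u, v]
               for (u, v) in schedule.edges
               if schedule.processor[u] != schedule.processor[v])

def load_imbalance(schedule):
    busy = schedule.busy_time_per_processor()      # assumed helper
    return max(busy) - min(busy)

def evaluate(schedule, criterion, w1=1.0, w2=1.0, w3=1.0):
    """The four criteria of Table 1 (lower is better); the weighting is our assumption."""
    if criterion == "minimize_length":
        return makespan(schedule)
    if criterion == "minimize_communication":
        return makespan(schedule) + w2 * total_communication(schedule)
    if criterion == "load_balance":
        return makespan(schedule) + w3 * load_imbalance(schedule)
    # "combination": communication, load balance and length together
    return (w1 * makespan(schedule) + w2 * total_communication(schedule)
            + w3 * load_imbalance(schedule))
```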
Execution Trace

The major components of the TS implementation have been described in detail in the previous six sections. Figure 3(a) illustrates what a task graph looks like before it is input into the TS implementation. For this brief trace, the 'minimize length' evaluation criterion is used. The search begins with a sequential execution (on a single processor). At each iteration, the search attempts to improve upon the makespan of the schedule by moving a task from one processor to another, or swapping two tasks' processor assignments. If the schedule cannot be improved upon in 100 moves, then another processor is added and the algorithm continues. In this case the best solution found for minimizing the makespan of the schedule was with 15 processors. The resulting schedule can be seen in Figure 3(b); the general 'slow-start' of the parallelization can easily be seen, with the top end of the schedule being very under-utilized when compared to the lower end. Figure 4 contains a trace of the TS implementation, showing the value of the schedule length at each iteration. The trace begins with a sequential schedule on a single processor, with a value of 460; when the second processor is added the schedule immediately improves to around 200 time units. Similarly, when the third processor is added the schedule length drops dramatically again. Each additional processor after this, however, only decreases the schedule length minimally. The light line represents the theoretical minimum, which is defined as:
Theoretical Minimum = Sequential Execution Time / Number of Processors

The best performance, closest to the theoretical minimum, is obtained at four processors. Beyond this point, the improvement in schedule length fails to keep up with the theoretical minimum, which results in a poorer utilization of the parallel heterogeneous system. The schedule length is able to perform better than the theoretical minimum (and the critical path) because of the heterogeneity of the parallel system, where some processors will perform faster than the average (which is used to calculate these values).
RESULTS: COMPUTATION-INTENSIVE CASES

A comparative analysis of the different evaluation functions for the schedules produced is presented in this chapter. More specifically, the results presented in this section will focus on computation-intensive task graphs.
Table 2. Task graph degree shapes

High In/Low Out Degree: very tall, thin task graphs, less scope for parallelising
Even In/Out Degree: generally in-between the two extremes
Low In/High Out Degree: very wide task graphs, excellent for parallelising
Examples of computation-intensive applications include large-scale simulations, such as the SETI@home class of applications (Anderson et al., 2000), which require little communication. Three measures have been used to analyze the quality of the schedules produced and evaluated by the TS implementation. The first is the speedup, an important factor when gauging any scheduling algorithm, which has been used to analyze the effectiveness of each of the evaluation functions; it is defined below.

Figure 5. (a) A high in/low out degree task graph, (b) An even in/out degree task graph, and (c) A low in/high out degree task graph
Table 3. The tests performed for the computation-intensive task graphs

Degree                          # of Tasks               Evaluation Criteria
Even In/Out Degree (5:5)        20, 50, 100, 150, 200    Minimize Communication; Minimize Length; Load Balance; Combination
Low In/High Out Degree (3:7)    …                        …
High In/Low Out Degree (6:2)    …                        …
Speedup = Sequential Execution Time / Parallel Execution Time

The sequential execution time is the schedule length of a task graph on a single processor. The parallel execution time is the schedule length of a task graph on multiple processors. The second measure used to evaluate the efficiency of a given schedule is the speedup per processor. The speedup per processor can be used to determine the amount of idle time present in the schedule, and can also be referred to as the average utilization of each processor in the parallel system. As the value reaches 1, the processors are reaching full utilization. The upper bound for the speedup per processor is 1.0, and it is defined as:
Speedup per Processor = Speedup / Number of Processors

The final factor used to judge the effectiveness of the solutions presented is the communication usage, defined as the amount of communication time units used in a solution divided by the sequential execution time:

CCR′ = Communication / Sequential Execution Time

Figure 6. Speedup for computation-intensive: (a) high in/low out; (b) Even in/out; and (c) Low in/high out degree task graph sets

To test the TS implementation for computationally intensive task graphs, three sets of randomly generated task graphs have been used. The first set contains task graphs with an even in/out degree, the second set contains task graphs with a high in degree and low out degree, and the third a low in degree and high out degree. Table 2 describes the various attributes that apply to each shape of task graph, and Figure 5 illustrates these features graphically.
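The three measures translate directly into code; the small Python helpers below simply restate the formulas above (the variable names are ours).

```python
def speedup(sequential_time, parallel_time):
    return sequential_time / parallel_time

def speedup_per_processor(sequential_time, parallel_time, num_processors):
    # Average utilization of each processor; the upper bound is 1.0
    # (exceeded only through the heterogeneity effect discussed later).
    return speedup(sequential_time, parallel_time) / num_processors

def communication_usage(total_communication, sequential_time):
    # CCR' as defined above.
    return total_communication / sequential_time
```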
Test Parameters

The three test sets each contained five task graphs generated by TGFF v3.0 (Dick et al., 1998). The number of tasks varied between the task graphs of a set, ranging from 20 up to 200 tasks, giving five task graphs per set. The number of tasks was not increased beyond 200 because of the high time complexity of the TS implementation; such tests would take an unacceptable amount of time to complete. Each of these task graphs was then run four times with the program for 3000 iterations, using all four evaluation criteria, which results in a total of over 60 tests being performed. Since these task graphs were computation-intensive, the computation to communication ratio (CCR) was set to 5:1 – this allows for a wider disparity between the computation- and communication-intensive task graphs. Table 3 summarizes the tests performed.

Figure 7. Speedup per processor for computation-intensive: (a) High in/low out; (b) Even in/out; and (c) Low in/high out degree task graph sets
Performance Results

The test results are presented in four different sections. The first section is shown in Figure 6, where comparisons between the speedup obtained by the differing evaluation criteria are conducted on a range of differing graph sizes and shapes. The second section consists of an analysis of the number of processors used in the best solution found in each test and the overall usage of these processors, as shown in Figure 7. Communication usage is presented and discussed in the third section, and is shown in Figure 8. The final section presents only a single example, in order to illustrate that the differences between physical features of the task graphs can affect the final solution considerably. This is represented in Figures 9 and 10.
Comparisons with Differing Evaluation Criteria

It is shown in Figure 6 that, regardless of the evaluation criteria being used, the speedup obtained from the TS implementation is competitive. Each schedule produced generates a speedup value that increases with task graph size and width. This is important because the trade-off between speedup and the specialized properties of the schedules produced is very small; the average penalty to the speedup for minimizing communication is less than 30%. Therefore, for heterogeneous parallel systems where the network may be congested (or the bottleneck), schedules can be produced that perform on par with the most efficient schedules produced, while also maintaining the minimum load on the interconnection network. Similarly, with time-sharing systems, where each processor may only be available for a certain amount of time, the processing can be spread evenly across all processors while still obtaining a competitive speedup.

Figure 8. Communication usage for computation-intensive: (a) High in/low out; (b) Even in/out; and (c) Low in/high out degree task graph sets

Figure 9. Schedule length & communication usage for a computation-intensive task graph with 100 tasks, with respect to different task graph shapes/types

Figure 10. Speedup for a computation-intensive task graph with 100 tasks, with respect to different task graph shapes/types

There are a few anomalies present within the results. The limit of 3000 iterations (or moves) for the search, imposed to restrict the overall running time of the algorithm, forced the larger task graphs (150-200 tasks) to terminate before they had reached the allotted maximum of 15 processors. This potentially limited the speedup attainable; therefore the speedup obtained for the 'minimize length' evaluation criterion is not reflective of general results, where it should obtain the fastest speedup. This is a result of the variable time until a new processor is added to the algorithm; as such, the 'minimize length' criterion did not add as many processors as some of the other criteria.
Comparisons with Number of Processors

Figure 7 shows the results obtained when the average speedup in Figure 6 is divided by the number of processors used in each solution. This shows the overall effectiveness of the speedup: the closer a solution gets to a speedup per processor value of 1.0, the more efficient the schedule is in terms of processor usage. Also noticeable is that the task graphs that are more parallelisable are able to obtain a higher speedup per processor as the number of tasks increases. This is because the schedules produced for these task graphs are able to utilize additional processors more efficiently than those for the high in/low out degree task graphs, since the task graphs widen faster. The poor speedup per processor of the 'minimize length' criterion occurs because many of the solutions produced for this evaluation criterion are very sparse. The slow-start effect of many of the task graphs also reduces the total speedup possible – as the number of processors increases, the amount of idle time at the beginning of the schedule increases, until the task graph begins to widen sufficiently to take advantage of the additional processors.
Minimizing Communication

The test results in Figure 8 demonstrate the efficiency of the evaluation criteria in reducing the burden placed upon the network by the parallelized application. There is a significant drop, on the order of over 50%, in the amount of communication placed on the network when either the 'Combination' or 'Minimize Communication' evaluation criteria are used. As mentioned previously, the penalty in speedup for using these alternative evaluation criteria is minimal when compared to the reduction in communication. The results are clearly consistent throughout the numerous task graphs, and highlight the efficient use of network resources with these evaluation criteria. The computation-intensive task graphs tend to have a very minimal impact on network load due to the very small values of the communication edges. Therefore computation-intensive task graphs are ideal for parallelizing without evaluating the amount of communication being used in a solution. The reason that the 'Load Balance' and 'Minimize Length' evaluation criteria perform worse than the 'Minimize Communication' and 'Combination' evaluation criteria is that they give no regard to the amount of communication being used. They may produce schedules with a higher speedup value, but unnecessarily burden the network with additional communication in order to obtain somewhat minimal gains in speedup.

Figure 11. Speedup for communication-intensive: (a) High in/low out; (b) Even in/out; and (c) Low in/high out degree task graph sets
Comparisons with Various Graph Features

It is clear from Figure 9 that the task graphs that widen quickly (Low In/High Out and Even In/Out) both achieve better schedule lengths. This is because the parallel processors can begin to take immediate advantage of the parallelizability of these task graphs. The communication usage differs among the task graph types, and is based upon the number and nature of the precedence constraints present in each graph. Not surprisingly, the trends continue in favor of the highly parallelizable task graphs for speedup and utilization, as shown in Figure 10. The more parallelizable the task graph, the higher the speedup (as also shown in the lower schedule lengths in Figure 8) and, in general, the higher the utilization. The higher utilization is reached because, on average, the wider task graphs begin to use the additional processors much sooner than the taller task graphs. The test results presented in Figures 9 and 10 can easily be replicated among the other equivalent tests, but only a single result is presented here to demonstrate that the shape of a task graph significantly limits the speedup achievable in the final schedules.

Figure 12. Speedup per processor for communication-intensive: (a) High in/low out; (b) Even in/out; and (c) Low in/high out degree task graph sets
Results: Communication-Intensive Cases

This section presents the results for the second set of tests conducted, aimed at communication-intensive task graphs. These differ quite markedly from computation-intensive task graphs, where tasks can be shifted to other processors with little to no penalty; in communication-intensive task graphs, the cost of changing processors tends to be expensive and often infeasible. Real-world examples of communication-intensive applications include any application running on time-shared hardware, where one or more components may have to wait for their share to proceed (thus increasing communication times), such as acoustic beam forming (Lee and Sullivan, 1993). The same performance measurements used in the previous section (speedup, utilization and CCR′) will again be used to compare the various evaluation criteria and schedule outputs.

Figure 13. Communication usage for communication-intensive: (a) High in/low out; (b) Even in/out; and (c) Low in/high out degree task graphs
Test Parameters

Table 4 shows the tests that were performed: an identical set of tests to those conducted on the computation-intensive task graphs was performed on the communication-intensive task graphs, and TGFF v3.0 was again used to generate the task graphs. The task graphs themselves differed slightly from the computation-intensive ones because of the random nature of the task graph generation. The computation to communication ratio was set to 1:5, so that the effects of communication on parallelizing applications can be readily seen when compared to the results of the previous section.
Performance Results

The performance results for the communication-intensive task graphs have been split into three sections. The first section contains a comparison between the speedup produced for a variety of task graph shapes and numbers of tasks. The second section analyzes the computational efficiency of the solutions produced by the TS implementation. The final section is the most important for communication-intensive task graphs and displays the amount of communication present in each schedule produced; it is here that we can truly see the differences between the schedules produced for the computation-intensive and communication-intensive task graphs.
Comparisons with Differing Evaluation Criteria

Figure 11 clearly demonstrates that it is not feasible to parallelise communication-intensive task graphs which are not very wide. A speedup value of 1 indicates that there was no improvement on the sequential solution. Similar to the results in the previous section, the wider the task graph, the higher the speedup obtainable. These wide task graphs halved the makespan of their schedules while minimizing the amount of communication present, on all but the smallest task graphs. The other evaluation criteria (which disregard communication) obtained increasing speedups as the number of tasks increased, irrespective of task graph shape.
Comparisons with Number of Processors

The speedup per processor is a measure of how utilized, on average, every processor is in the parallel system. To reach a value of 1 a processor must contain no idle time, and this is only achievable in rare cases with more than one processor. The speedup per processor values of 1 contained in Figure 12 occur because those task graphs were found to be unsuitable for parallelization, and the produced schedule contained a single processor; a sequential result on a single processor will always return a utilization value of 1, which means 100% processor usage. When compared to the results in the previous section, the utilization of the processors changed rapidly depending on the number of processors in the final solution. The computation-intensive task graphs, however, increased the utilization of the processors as the number of tasks increased. This is due to the smaller communication costs, which allow child nodes to be located on different processors without delaying the overall schedule. In a communication-intensive task graph, there is a significant delay before a child node located on a different processor can begin processing – this increases the amount of idle time on the processors, reducing the overall utilization of the parallel system. The theoretical upper bound of 1.0 is exceeded a few times in the results. This occurs because of the heterogeneous nature of the parallel system: the sequential time is calculated from the average over all available processors, and if the tasks are run on processors that are faster than this average, then the overall utilisation can appear to be more than 100%. Essentially, a higher proportion of the processing is performed in a lower than average amount of time.
Minimizing Communication

Communication-intensive task graphs are truly where communication needs to be taken into account, because any communication across processors will be significantly large. Figure 13 clearly shows that if the evaluation criteria include the need to minimize communication, then in most cases it is not feasible to parallelize a communication-intensive application (this is shown by a value of 0 communication usage for many of the 'minimize communication' and 'combination' evaluation results). The Low In/High Out Degree and the larger Even In/Out Degree task graphs, which are the widest, can be parallelized with reasonable efficiency and a reduction in communication usage. It should also be noted that where the speedup is increased several times over (see Figure 11), the communication cost for these schedules is alarmingly high. Again, these schedules are only efficient if the interconnection network has a lot of bandwidth. Should the network be shared by other systems or be subject to bandwidth restrictions, burdening the network this much is inefficient. For comparative purposes, the communication usage for the 'minimize length' and 'load balance' evaluation criteria on 150 tasks is roughly 3-4 times the processing time units required to compute sequentially. This is clearly inefficient, and demonstrates the need to account for communication usage in a communication-intensive task graph schedule if it is to be viable in realistic networks.
CONCLUSION

A major goal when designing a scheduling algorithm is to reduce the makespan in order to reduce the running time of the application. Unfortunately this does not always lead to the best use of resources, whether they are computational resources, networking resources, or time itself. Designing proper evaluation criteria in order to efficiently utilize these resources is essential if larger, distributed heterogeneous systems (such as Grids) are to become effective. Despite the excellent quality of the schedules that previous work has produced, even when accounting for communication costs, previously proposed algorithms do not consider how much they utilize the network as part of the criteria for determining the efficiency of a schedule. In this chapter, a variety of evaluation criteria were presented which demonstrate, in conjunction with a robust Tabu Search implementation for parallel computing systems, that good quality schedules can be produced that are tailored to the specific requirements of the computing system. The most balanced schedules were produced when all three criteria were used collectively in the 'combination' evaluation criterion. This resulted in schedules that limited the load on the network, utilized the available processors evenly, and generally had a competitive speedup. The 'minimize length' criterion generally produced excellent quality solutions in regards to speedup, but this resulted in an extremely poor use of the interconnection network – especially for the communication-intensive task graphs. Conversely, the 'minimize communication' criterion produced schedules of reasonable makespan while significantly limiting the load on the interconnection network. Further extensions of this work consist of reducing the number of assumptions placed upon the system and taking into account variability in network conditions. Expanding the Tabu Search implementation to take advantage of some of its more advanced features would also increase the robustness of the search.
REFERENCES

Anderson, D., et al. (2000). Internet computing for SETI. In G. Lemarchand and K. Meech (Eds.), The Proceedings of Bioastronomy 99: A New Era in Bioastronomy, ASP Conference Series No. 213, (p. 511). San Francisco: Astronomical Society of the Pacific.
Blesa, M. J., Hernandez, L., & Xhafa, F. (2001). Parallel skeletons for Tabu search method. In The Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS).
Bruno, J., Coffman, E. G., & Sethi, R. (1974). Scheduling independent tasks to reduce mean finishing time. Communications of the ACM, 17(7), 382–387. doi:10.1145/361011.361064
Dick, R. P., Rhodes, D. L., & Wolf, W. (1998). TGFF: Task graphs for free. In The Proceedings of the 6th International Workshop on Hardware/Software Codesign, (pp. 97–101).
Dogan, A., & Özgüner, F. (2002). Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3), 308–323. doi:10.1109/71.993209
El-Rewini, H. (1996). Partitioning and scheduling. In A. Y. Zomaya (Ed.), Parallel and Distributed Computing Handbook, (pp. 239–273). New York: McGraw-Hill.
Glover, F., & Laguna, M. (2002). Tabu search. New York: Kluwer Academic Publishers.
Hertz, A., Taillard, E., & de Werra, D. (1995). A tutorial on Tabu search. Proceedings of Giornate di Lavoro AIRO, (pp. 13–24).
Kwok, Y.-K., & Ahmad, I. (1999). Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 31(4), 406–471. doi:10.1145/344588.344618
Lee, C. E., & Sullivan, D. (1993). Design of a heterogeneous parallel processing system for beam forming. In The Proceedings of the Workshop on Heterogeneous Processing, (pp. 113–118).
Lee, Y.-C., Subrata, R., & Zomaya, A. Y. (2008). Efficient exploitation of grids for large-scale parallel applications. In A. S. Becker (Ed.), Concurrent and Parallel Computing: Theory, Implementation and Applications, (pp. 8.165–8.184). Hauppauge, NY: Nova Science Publishers.
Lee, Y.-C., & Zomaya, A. Y. (2008). Scheduling in grid environments. In S. Rajasekaran & J. Reif (Eds.), Handbook of Parallel Computing: Models, Algorithms and Applications, (pp. 21.1–21.19). Boca Raton, FL: Chapman & Hall/CRC Press.
Macey, B. S., & Zomaya, A. Y. (1997). A comparison of list scheduling heuristics for communication intensive task graphs. International Journal of Cybernetics and Systems, 28, 535–546. doi:10.1080/019697297125921
Nabhan, T. M., & Zomaya, A. Y. (1997). A parallel computing engine for a class of time critical processes [Part B]. IEEE Transactions on Systems, Man, and Cybernetics, 27(5), 774–786. doi:10.1109/3477.623231
Porto, S. C. S., & Ribeiro, C. C. (1995). A Tabu search approach to task scheduling on heterogeneous processors under precedence constraints. International Journal of High Speed Computing, 7(1). doi:10.1142/S012905339500004X
Rangaswamy, B. (1998). Tabu search candidate list strategies in scheduling. In The Proceedings of the 6th INFORMS Advances in Computational and Stochastic Optimization, Logic Programming and Heuristic Search: Interfaces in Computer Science and Operations Research Conference.
Salleh, S., & Zomaya, A. Y. (1999). Scheduling in parallel computing systems: Fuzzy and annealing techniques. New York: Kluwer Academic Publishers.
Seredynski, F., & Zomaya, A. Y. (2002). Sequential and parallel cellular automata-based scheduling algorithms. IEEE Transactions on Parallel and Distributed Systems, 13(10), 1009–1023. doi:10.1109/TPDS.2002.1041877
Siegel, H. J., & Ali, S. (2000). Techniques for mapping tasks to machines in heterogeneous computing systems. Journal of Systems Architecture, 46(8), 627–639. doi:10.1016/S1383-7621(99)00033-8
Zomaya, A. Y. (Ed.). (1996). Parallel and distributed computing handbook. New York: McGraw-Hill.
Zomaya, A. Y., & Chan, F. (2005). Efficient clustering for parallel task execution in distributed systems. Journal of Foundations of Computer Science, 16(2), 281–299. doi:10.1142/S0129054105002991
Zomaya, A. Y., & Teh, Y.-W. (2001). Observations on using genetic algorithms for dynamic load-balancing. IEEE Transactions on Parallel and Distributed Systems, 12(9), 899–911. doi:10.1109/71.954620
KEY TERMS AND DEFINITIONS

Adaptive Scheduler: This type of scheduler changes its scheduling scheme according to the recent history and/or current behaviour of the system. In this way, adaptive schedulers may be able to adapt to changes in system use and activity. Adaptive schedulers are usually known as dynamic, since they make decisions based on information collected from the system.

Non-Adaptive Scheduler: This type of scheduler does not change its behaviour in response to feedback from the system. This means that it is unable to adapt to changes in system activity.

Non-Preemptive Scheduling: In this class of scheduling, once a task has begun on a processor, it must run to completion before another task can start execution on the same processor.

Preemptive Scheduling: In this class of scheduling it is possible for a task to be interrupted during its execution, and resumed from that position on the same or any other processor at a later time. Although preemptive scheduling incurs additional overhead, due to the increased complexity, it may perform more effectively than non-preemptive methods.

Scheduling: The allocation of a set of tasks or jobs to resources, such that optimum performance is obtained. If these tasks are not inter-dependent the problem is known as task allocation. When the task information is known a priori, the problem is known as static scheduling. On the other hand, when there is no a priori knowledge about the tasks the problem is known as dynamic scheduling.

Tabu Search: An intelligent, iterative hill-descent algorithm, which avoids local minima by using short- and long-term memory. It has gained a number of key influences from a variety of sources, including early surrogate constraint methods and cutting plane approaches. Tabu search incorporates adaptive memory and responsive exploration as prime aspects of the search.
Chapter 17
Communication Issues in Scalable Parallel Computing

C. E. R. Alves, Universidade Sao Judas Tadeu, Brazil
E. N. Cáceres, Universidade Federal de Mato Grosso do Sul, Brazil
F. Dehne, Carleton University, Canada
S. W. Song, Universidade de Sao Paulo, Brazil
ABSTRACT

In this book chapter, the authors discuss some important communication issues in obtaining a highly scalable computing system. They consider the CGM (Coarse-Grained Multicomputer) model, a realistic computing model for obtaining scalable parallel algorithms. The communication cost is modeled by the number of communication rounds, and the objective is to design algorithms that require the minimum number of communication rounds. They discuss some important issues and offer considerations of practical importance, based on their previous experience in the design and implementation of parallel algorithms. The first issue is the amount of data transmitted in a communication round. For a practical implementation to be successful, this amount should be minimized, even when it is already within the limit allowed by the CGM model. The second issue concerns the trade-off between the number of communication rounds, which the CGM attempts to minimize, and the overall communication time taken in the communication rounds. Sometimes a larger number of communication rounds may actually reduce the total amount of data transmitted in the communication rounds. These two issues have guided the authors in presenting efficient parallel algorithms for the string similarity problem, used as an illustration.
DOI: 10.4018/978-1-60566-661-7.ch017
INTRODUCTION

In this book chapter, we discuss some important communication issues to obtain a highly scalable computing system. Scalability is a desirable property of a system, a network, or a process, which indicates its ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged. We consider the CGM (Coarse-Grained Multicomputer) model, a realistic computing model to obtain scalable parallel algorithms. A CGM algorithm that solves a problem of size n with p processors, each with O(n/p) memory, consists of an alternating sequence of computation rounds and communication rounds. In one communication round, we allow the exchange of O(n/p) data among the processors. The communication cost is modeled by the number of communication rounds and the objective is to design algorithms that require the minimum number of communication rounds.

We discuss some important issues and make considerations of practical importance, based on our previous experience in the design and implementation of several parallel algorithms. The first issue is the amount of data transmitted in a communication round. For a practical implementation to be successful we should attempt to minimize this amount, even when it is already within the maximum allowed by the CGM model, which is O(n/p). The second issue concerns the trade-off between the number of communication rounds, which the CGM attempts to minimize, and the overall communication time taken in the communication rounds. Under the CGM model we want to minimize the number of communication rounds so that we do not have to care about the particular interconnection network. In a practical implementation, we do have more information concerning the hardware utilized and the communication times in a particular interconnection network. Sometimes a larger number of communication rounds may actually reduce the total amount of data transmitted in the communication rounds. Although the goal of the CGM model is to minimize the number of communication rounds, ultimately the main objective is to minimize the overall running time, which includes the computation and the communication times.

These two issues have guided us to present efficient parallel algorithms for the string similarity problem, used as an illustration. By using the wavefront-based algorithms we present in this book chapter to illustrate these two issues, we also address a third issue: the desirability of avoiding costly global communication such as broadcast and all-to-all primitives. This is obtained by using wavefront or systolic parallel algorithms where each processor communicates with only a few other processors.

The string similarity problem is presented here as an illustration. This problem is interesting in its own right. Together with many other important string processing problems (Alves et al., 2006), string similarity is a fundamental problem in Computational Biology that appears in more complex problems (Setubal & Meidanis, 1997), such as the search for similarities between bio-sequences (Needleman & Wunsch, 1970; Sellers, 1980; Smith & Waterman, 1981). We show two wavefront parallel algorithms to solve the string similarity problem. We implement both the basic algorithm (Alves et al., 2002) and the improved algorithm (Alves et al., 2003), taking into consideration the communication issues discussed in this book chapter, and obtain very efficient and scalable solutions.
PARALLEL COMPUTATION MODEL

Valiant (1990) introduced a simple coarse-grained parallel computing model, called the Bulk Synchronous Parallel model (BSP). It gives reasonable predictions on the performance of algorithms when implemented on existing, mainly distributed memory, parallel machines. It is also one of the earliest models to consider communication costs and to abstract the characteristics of parallel machines with a few parameters. The main objective of BSP is to serve as a bridging model between the hardware and software requirements; this is one of the fundamental characteristics behind the success of the von Neumann model. In the BSP model, parallel computation is modeled by a series of super-steps. In this model, p processors with local memory communicate through some interconnection network managed by a router with global synchronization. A BSP algorithm consists of a sequence of super-steps separated by synchronization barriers. In a super-step, each processor executes a set of independent operations using local data available in each processor at the start of the super-step, as well as communication consisting of the sending and receiving of messages. An h-relation in a super-step corresponds to sending or receiving at most h messages in each processor. The response to a message sent in one super-step can only be used in the next super-step.

In this chapter we use a similar model called the Coarse Grained Multicomputer (denoted by BSP/CGM), proposed by Dehne et al. (1993). A BSP/CGM consists of a set of p processors P1, P2, …, Pp with O(n/p) local memory per processor, where the processors are connected through some interconnection network. The term coarse granularity comes from the fact that the problem size in each processor, n/p, is considerably larger than the number of processors, that is, n/p >> p. A BSP/CGM algorithm consists of alternating local computation and global communication rounds separated by a barrier synchronization. The BSP/CGM model uses only two parameters: the input size n and the number of processors p. In a computing round, each processor runs a sequential algorithm to process its data locally. A communication round consists of sending and receiving messages, in such a way that each processor sends at most O(n/p) data and receives at most O(n/p) data. We require that all information sent from a given processor to another processor in one communication round is packed into one long message, thereby minimizing the message overhead.

In the BSP/CGM model, the communication cost is modeled by the number of communication rounds, which we wish to minimize. In a good BSP/CGM algorithm the number of communication rounds does not depend on the input size n. The ideal algorithm requires a constant number of communication rounds. If this is not possible, we attempt to get an algorithm for which this number is independent of n but depends on p. This is the case in the present chapter. The BSP/CGM model has the advantage of producing results that are close to the actual performance of commercially available parallel machines. Some algorithms for computational geometry and graph problems require a constant number or O(log p) of communication rounds (e.g. see Dehne et al. (1993)). The BSP/CGM model is particularly suitable for current parallel machines in which the global computing speed is considerably greater than the global communication speed. One way to explore the use of parallel computation is through the use of clusters of workstations or Fast/Gigabit Ethernet connected Linux-based Beowulf machines, with the Parallel Virtual Machine (PVM) or Message Passing Interface (MPI) libraries.
The latency in such clusters or Beowulf machines with 1 Gb/s interconnects is currently less than 10 μs, and programming using these resources is today a major trend in parallel and distributed computing. Though much effort has been expended to deal with the problems of interconnecting clusters or Beowulfs and with the programming environment, there are still few works on methodologies to design and analyze algorithms for scalable parallel computing systems.
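For concreteness, the skeleton below sketches the alternating round structure of a BSP/CGM algorithm using mpi4py; it is our illustration, not code from the chapter. The helpers `local_compute` and `expected_sources`, and the `merge` method on the local data, are assumptions, and a real implementation would have to order or overlap the blocking send/recv calls to avoid deadlock.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
p, i = comm.Get_size(), comm.Get_rank()

def cgm_algorithm(local_data, rounds):
    for r in range(rounds):
        # Computation round: a sequential algorithm over the O(n/p) local data.
        outgoing = local_compute(local_data, r)          # assumed helper
        # Communication round: one packed message per destination processor,
        # each of size at most O(n/p).
        for dest, msg in outgoing.items():
            comm.send(msg, dest=dest, tag=r)
        for src in expected_sources(i, r):               # assumed helper
            local_data.merge(comm.recv(source=src, tag=r))
        comm.Barrier()                                   # barrier synchronization
    return local_data
```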
Figure 1. String alignment examples
THE STRING SIMILARITY PROBLEM

In Molecular Biology, the search for tools that identify, store, compare and analyze very long bio-sequences is becoming a major research area in Computational Biology. In particular, sequence comparison is a fundamental problem that appears in more complex problems (Setubal & Meidanis, 1997), such as the search for similarities between bio-sequences (Needleman & Wunsch, 1970; Sellers, 1980; Smith & Waterman, 1981), as well as in the solution of several other problems such as approximate string matching, file comparison, and text searching with errors (Hall & Dowling, 1980; Hunt & Szymansky, 1977; Wu & Manber, 1992). One main motivation for biological sequence comparison, in particular of proteins, comes from the fact that proteins that have similar three-dimensional forms usually have the same functionality. The three-dimensional form is given by the sequence of symbols that constitute the protein. In this way, we can guess the functionality of a new protein by searching for a known protein that is similar to it.

In this section we present the string similarity problem. One way to identify similarities between sequences is to align them, with the insertion of spaces in the two sequences, in such a way that the two sequences become equal in length. We expect that the alignment of two sequences that are similar will show the parts where they match, and the different parts where spaces are inserted. We are interested in the best alignment between two strings, and the score of such an alignment gives a measure of how similar the strings are. The similarity problem is defined as follows. Let A = a1a2…am and C = c1c2…cn be two strings over some alphabet. To align the two strings, we insert spaces in the two sequences in such a way that they become equal in length. See Figure 1, where each column consists of a symbol of A (or a space) and a symbol of C (or a space). An alignment between A and C is a matching of the symbols a ∈ A and c ∈ C in such a way that if we draw lines between the corresponding matched symbols, these lines cannot cross each other. The alignment shows the similarities between the two strings. Figure 1 shows two simple alignment examples where we assign a score of 1 when the aligned symbols in a column match and 0 otherwise. The alignment on the right has a higher score (5) than that on the left (3).

A more general score assignment for a given alignment between strings is done as follows. Each column of the alignment receives a certain value depending on its contents, and the total score for the alignment is the sum of the values assigned to its columns. Consider a column consisting of symbols r and s. If r = s (i.e. a match), it receives a value p(r, s) > 0. If r ≠ s (a mismatch), the column receives a value p(r, s) < 0. Finally, a column with a space in it receives a value −k, where k ∈ N. We look for the alignment (the optimal alignment) that gives the maximum score. This maximum score is called the similarity measure between the two strings, denoted by sim(A, C) for strings A and C. There may be more than one alignment with the maximum score (Setubal & Meidanis, 1997).

Figure 2. Grid DAG G for A = baabcbca and B = baabcabcab

Dynamic programming is a technique used in the solution of many optimization and decision problems. It decomposes the problem into a sequence of optimization or decision steps that are interconnected and are solved one after another. The optimal solution of the problem is obtained by decomposing the problem into sub-problems and computing the optimal solution for each sub-problem; by combining these solutions we obtain the optimal solution of the global problem. Differently from other optimization methods, such as linear programming and branch and bound, dynamic programming is not a general technique. Optimization problems should be translated into a more specific form before dynamic programming can be used. This translation can be very difficult, and constitutes a further difficulty in addition to the need of formulating the problem so that it can be solved efficiently by the dynamic programming approach.

Consider two strings A and C, where |A| = m and |C| = n. We can solve the string similarity problem by computing all the similarities between arbitrary prefixes of the two strings, starting with the shorter prefixes and using previously computed results to solve the problem for larger prefixes. There are m + 1 possible prefixes of A and n + 1 prefixes of C. Thus, we can arrange our calculations in an (m + 1) × (n + 1) matrix S where each S(r, s) represents the similarity between A[1..r] and C[1..s], which denote the prefixes a1a2…ar and c1c2…cs, respectively. Observe that we can compute the value of S(r, s) by using the three previous values S(r – 1, s), S(r – 1, s – 1) and S(r, s – 1), because there are only three ways to compute an alignment between A[1..r] and C[1..s]. We can align A[1..r] with C[1..s – 1] and match a space with C[s], or align A[1..r – 1] with C[1..s – 1] and match A[r] with C[s], or align A[1..r – 1] with C[1..s] and match a space with A[r]. (Figure 2)
Figure 3. The recursive definition of the similarity score
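Figure 3 is not reproduced here; based on the three cases and the scoring just described (p(r, s) for a match or mismatch, −k for a space), the recurrence it depicts is presumably of the standard form:

```latex
S(r,0) = -rk, \qquad S(0,s) = -sk,
\qquad
S(r,s) = \max
\begin{cases}
  S(r,\,s-1) - k,\\
  S(r-1,\,s-1) + p(a_r, c_s),\\
  S(r-1,\,s) - k.
\end{cases}
```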
The similarity score S of the alignment between strings A and C can be computed as in Figure 3. An l1 × l2 grid DAG (Figure 2) is a directed acyclic graph whose vertices are the l1l2 points of an l1 × l2 grid, with edges from grid point G(i, j) to the grid points G(i, j + 1), G(i + 1, j) and G(i + 1, j + 1). Let A and C be two strings with |A| = m and |C| = n symbols, respectively. We associate an (m + 1) × (n + 1) grid DAG G with the similarity problem in the natural way: the (m + 1)(n + 1) vertices of G are in one-to-one correspondence with the (m + 1)(n + 1) entries of the S-matrix, and the cost of an edge from vertex (t, l) to vertex (i, j) is equal to k if t = i and l = j – 1, or if t = i – 1 and l = j; and to p(i, j) if t = i – 1 and l = j – 1. It is easy to see that the string similarity problem can be viewed as computing the minimum source-sink path in a grid DAG. In Figure 2 the problem is to find the minimum path from (0,0) to (8,10). A sequential algorithm to compute the similarity between two strings of lengths m and n uses dynamic programming; the complexity of this algorithm is O(mn). The construction of the optimal alignment can then be done in sequential time O(m + n) (Setubal & Meidanis, 1997). PRAM (Parallel Random Access Machine) algorithms for the dynamic programming problem have been obtained by Galil and Park (1991). PRAM algorithms for the string editing problem have been proposed by Apostolico et al. (1990). A more general study of parallel algorithms for dynamic programming can be seen in (Gengler, 1996).

We present two algorithms that use the realistic BSP/CGM model. A characteristic and advantage of the wavefront or systolic algorithm is its modest communication requirement, with each processor communicating with few other processors. This makes it very suitable as a potential application for grid computing, where we wish to avoid costly global communication operations such as broadcast and all-to-all operations.
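The O(mn) sequential dynamic program just mentioned, and used locally inside each processor in the parallel algorithms below, is sketched here in Python for illustration; `score(a, c)` plays the role of p(a, c) and `k` is the space penalty, both problem-specific.

```python
def similarity(A, C, score, k):
    """Compute sim(A, C) with the O(mn) dynamic program."""
    m, n = len(A), len(C)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for r in range(1, m + 1):
        S[r][0] = -r * k                      # a prefix of A aligned against spaces only
    for s in range(1, n + 1):
        S[0][s] = -s * k
    for r in range(1, m + 1):
        for s in range(1, n + 1):
            S[r][s] = max(S[r][s - 1] - k,                              # space in A
                          S[r - 1][s] - k,                              # space in C
                          S[r - 1][s - 1] + score(A[r - 1], C[s - 1]))  # (mis)match
    return S[m][n]
```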
THE BASIC SIMILARITY ALGORITHM

The basic similarity algorithm is due to Alves et al. (2002). It is a BSP/CGM algorithm and attempts to minimize the number of communication rounds. Consider two given strings A = a1a2…am and C = c1c2…cn. The basic similarity algorithm computes the similarity between A and C on a CGM/BSP with p processors and mn/p local memory in each processor. We divide C into p pieces of size n/p, and each processor P_i, 1 ≤ i ≤ p, receives the string A and the i-th piece of C (c_{(i-1)n/p+1}, …, c_{in/p}). Each processor P_i computes the elements S_i(r, s) of the submatrix S_i, where 1 ≤ r ≤ m and (i - 1)n/p + 1 ≤ s ≤ in/p, using the three previous elements S_i(r - 1, s), S_i(r - 1, s - 1) and S_i(r, s - 1), because, as mentioned before, there are only three ways of computing an alignment between A[1..r] and C[1..s]. We can align A[1..r] with C[1..(s - 1)] and match a space with C[s], or align A[1..(r - 1)] with C[1..(s - 1)] and match A[r] with C[s], or align A[1..(r - 1)] with C[1..s] and match a space with A[r]. To compute the submatrix S_i, each processor P_i uses the best sequential algorithm locally. It is easy to see that processor P_i, i > 1, can only start computing the elements S_i(r, s) after processor P_{i-1} has computed part of the submatrix S_{i-1}(r, s). Denote by R_i^k, 1 ≤ i, k ≤ p, all the elements of the right boundary (rightmost column) of the k-th part of the submatrix S_i. More precisely, R_i^k = {S_i(r, in/p), (k - 1)m/p + 1 ≤ r ≤ km/p}.
Figure 4. An O(p) communication rounds scheduling used in the basic algorithm
The idea of the algorithm is the following: After computing the k-th part of the submatrix Si, processor Pi sends to processor Pi+ 1 the elements of Rik . Using Rik, processor Pi+ 1 can compute the k-th part of the submatrix Si+ 1. After p – 1 rounds, processor Pp receives Rp-11 and computes the first part of the submatrix Sp. At round 2p – 2, processor Pp receives Rp-1p and computes the p-th part of the submatrix Sp and finishes the computation. Using this schedule (Figure 4), we can see that in the first round, only processor P1 works. In the second round, processors P1 and P2 work. It is easy to see that in round k, all processors Pi work, where 1 ≤ i ≤ k. We now present the basic string similarity algorithm.Basic Similarity Algorithm (see Figure 5). Theorem 1. The basic similarity algorithm uses 2p – 2 communication rounds with O(mn/p) sequential computing time in each processor. Proof. Processor P1 sends R1k to processor P2 after computing the k-th block of m/p rows of the mn/p subFigure 5. The basic similarity algorithm
384
Communication Issues in Scalable Parallel Computing
Figure 6. Table of running times of the basic algorithm for various string lengths
After p – 1 communication rounds, processor P1 finishes its work. Similarly, processor P2 finishes its work after p communication rounds. Then, after p – 2 + i communication rounds, processor Pi finishes its work. Since we have p processors, after 2p – 2 communication rounds all the p processors have finished their work. Each processor uses a sequential algorithm to compute the similarity submatrix Si; thus the algorithm takes O(mn/p) computing time.

Theorem 2. At the end of the basic similarity algorithm, S(m, n) will store the score of the similarity between the strings A and C.

Proof. By Theorem 1, after 2p – 2 communication rounds, processor Pp finishes its work. Since we are essentially computing the similarity sequentially in each processor and sending the boundaries to the right processor, the correctness of the algorithm comes naturally from the correctness of the sequential algorithm. Then, after 2p – 2 communication rounds, S(m, n) will store the similarity between the strings A and C.
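To make the wavefront schedule concrete, the following is a runnable sketch (not the authors' implementation) of the basic algorithm on p MPI ranks using mpi4py. Rank i owns an n/p-column slice of C, computes its submatrix in p blocks of m/p rows, and forwards the rightmost column of each finished block to rank i+1. The test strings and scoring values are illustrative assumptions chosen so that m/p and n/p are integers.

```python
# Sketch of the basic wavefront (systolic) schedule with mpi4py.
# Run e.g. with:  mpiexec -n 4 python basic_similarity.py

from mpi4py import MPI

MATCH, MISMATCH, SPACE = 1, -1, -1

def score(x, y):
    return MATCH if x == y else MISMATCH

comm = MPI.COMM_WORLD
p, i = comm.Get_size(), comm.Get_rank()

A = "TTACGGGATCACTTAG" * p          # string A, known to every rank (toy data)
C = "TACGGTACCATTGACG" * p          # string C, split into p slices of n/p columns
m, n = len(A), len(C)
bm, bn = m // p, n // p             # block height m/p and slice width n/p
c_slice = C[i * bn:(i + 1) * bn]

# Local slice of S (m+1 rows, n/p+1 columns); row 0 is the global boundary row.
S = [[0] * (bn + 1) for _ in range(m + 1)]
S[0] = [(i * bn + s) * SPACE for s in range(bn + 1)]

for k in range(p):                  # the p row blocks, processed in wavefront order
    if i == 0:
        # Global column 0 plays the role of the received boundary for rank 0.
        left = [r * SPACE for r in range(k * bm, (k + 1) * bm + 1)]
    else:
        left = comm.recv(source=i - 1, tag=k)   # R_{i-1}^k, corner value included
    for idx, r in enumerate(range(k * bm + 1, (k + 1) * bm + 1), start=1):
        S[r][0] = left[idx]
        for s in range(1, bn + 1):
            S[r][s] = max(S[r - 1][s - 1] + score(A[r - 1], c_slice[s - 1]),
                          S[r - 1][s] + SPACE,
                          S[r][s - 1] + SPACE)
    if i < p - 1:                   # forward the rightmost column R_i^k
        comm.send([S[r][bn] for r in range(k * bm, (k + 1) * bm + 1)],
                  dest=i + 1, tag=k)

if i == p - 1:
    print("similarity S(m, n) =", S[m][bn])
```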
Experimental Results of the Basic Algorithm

In this section we present the experimental results of the basic similarity algorithm; the following figures give the running time curves. We have implemented the O(p)-round basic similarity algorithm on a Beowulf cluster with 64 nodes. Each node has 256 MB of RAM and a further 256 MB of swap space. The nodes are connected through a 100 Mb/s interconnection network.
Figure 7. Curves of the observed times for various string lengths
Figure 8. Curves of the observed times for various string lengths
Figure 9. An O(p) communication rounds scheduling with α = 1
The obtained times (Figures 6, 7 and 8) show that for small sequences the communication time is significant when compared to the computation time with more than 8 and 16 processors, respectively (512 × 512 and 512 × 1024). When we apply the algorithm to sequences larger than 8192 using one or two processors, the main memory is not enough to solve the problem; the use of swap gives meaningless resulting times, which would not occur if the nodes had more main memory. We have therefore suppressed these times. In general, the implementation of the CGM/BSP algorithm shows that the theoretical results are confirmed in practice. The basic similarity algorithm requires O(p) communication rounds to compute the score of the similarity between two strings. We have worked with a fixed block size of m/p × n/p. Another good alternative is to work with an adaptive choice of the optimal block size to further decrease the running time of the algorithm. The alignment between the two strings can be obtained with O(p) communication rounds by backtracking from the lower right corner of the grid graph in O(m + n) time (Setubal & Meidanis, 1997). For this, S(r, s) for all points of the grid graph must be stored during the computation (requiring O(mn) space).
THE IMPROVED SIMILARITY ALGORITHM

Alves et al. (2003) extend and improve the basic similarity algorithm (Alves et al., 2002) for computing an alignment between two strings A and C, with |A| = m and |C| = n. On a distributed memory parallel computer of p processors, each with O((m + n)/p) memory, the improved algorithm also requires O(p) communication rounds, more precisely (1 + 1/α)p – 2 communication rounds, where α is a parameter to be presented shortly, and O(mn/p) local computing time. As in the basic algorithm, the processors communicate in a wavefront or systolic manner, such that each processor communicates with few other processors; actually, each processor sends data to only two other processors. The novelty of the improved similarity algorithm is a compromise between the workload of each processor and the number of communication rounds required, expressed by a parameter called α. The proposed algorithm is expressed in terms of this parameter, which can be tuned to obtain the best overall parallel time in a given implementation. In addition to showing the theoretical complexity, we confirm the efficiency of the proposed algorithm through implementation. As will be seen shortly, very promising experimental results are obtained on a 64-node Beowulf machine. We present a parameterized O(p) communication rounds parallel algorithm for computing the similarity between two strings A and C, over some alphabet, with |A| = m and |C| = n.
Figure 10. An O(p) communication rounds scheduling with α = 1/2
We use the CGM/BSP model with p processors, where each processor has O(mn/p) local memory. As will be seen later, this can be reduced to O((m + n)/p). Let us first give the main idea to compute the similarity matrix S with p processors. The string A is broadcast to all processors, and the string C is divided into p pieces of size n/p; each processor Pi, 1 ≤ i ≤ p, receives the i-th piece of C (c(i-1)n/p+1 ... cin/p). The scheduling scheme is illustrated in Figure 9. The notation Pi^k denotes the work of processor Pi at round k. Thus initially P1 starts computing at round 0. Then P1 and P2 can work at round 1, P1, P2 and P3 at round 2, and so on. In other words, after computing the k-th part of the sub-matrix Si (denoted Si^k), processor Pi sends to processor Pi+1 the elements of the right boundary (rightmost column) of Si^k. These elements are denoted by Ri^k. Using Ri^k, processor Pi+1 can compute the k-th part of the sub-matrix Si+1. After p – 1 rounds, processor Pp receives Rp-1^1 and computes the first part of the sub-matrix Sp. In round 2p – 2, processor Pp receives Rp-1^p, computes the p-th part of the sub-matrix Sp and finishes the computation.
Figure 11. The improved similarity algorithm
It is easy to see that with this scheduling, processor Pp only initiates its work when processor P1 is finishing its computation, at round p – 1. Therefore, we have a very poor load balance. In the following we attempt to assign work to the processors as soon as possible. This can be done by decreasing the size of the messages that processor Pi sends to processor Pi+1: instead of message size m/p we consider sizes αm/p and explore several values of α. In our work, we make the assumption that the message size αm/p divides m. Therefore, Si^k (the similarity sub-matrix computed by processor Pi at round k) represents rows kαm/p + 1 to (k + 1)αm/p of Si, which are computed at the k-th round. We now present the improved similarity algorithm. The improved algorithm works as follows: after computing Si^k, processor Pi sends Ri^k to processor Pi+1. Processor Pi+1 receives Ri^k from Pi and computes Si+1^(k+1). After p – 2 rounds, processor Pp receives Rp-1^(p-2) and computes Sp^(p-1). If we use α < 1, all the processors will work simultaneously after the (p – 2)-th round. We explore several values of α, trying to find a balance between the workload of the processors and the number of rounds of the algorithm. Figure 10 shows how the algorithm works when α = 1/2. In this case, processor Pp receives Rp-1^(3p-3), computes Sp^(3p-2) and finishes the computation. Improved Similarity Algorithm (see Figure 11). Using the schedule of Figure 10, we can see that in the first round only processor P1 works. In the second round, processors P1 and P2 work. It is easy to see that at the k-th round, all processors Pi with 1 ≤ i ≤ k work. Although the total number of rounds increases with smaller values of α, the processors start working earlier.

Theorem 3. The improved algorithm uses (1 + 1/α)p – 2 communication rounds with O(mn/p) sequential computing time in each processor.

Proof. Processor P1 sends R1^k to processor P2 after computing the k-th block of αm/p rows of the mn/p sub-matrix S1. After p/α – 1 communication rounds, processor P1 finishes its work. Similarly, processor P2 finishes its work after p/α communication rounds. Then, after p/α – 2 + i communication rounds,
processor Pi finishes its work. Since we have p processors, after (1 + 1/α)p – 2 communication rounds, all the p processors have finished their work. Each processor uses a sequential algorithm to compute the similarity sub-matrix Si; thus the algorithm takes O(mn/p) computing time.

Theorem 4. At the end of the improved algorithm, S(m, n) will store the score of the similarity between the strings A and C.
Figure 12. Table showing running times for various values of α with m=8K and n=16K
Proof: Theorem 3 proves that after (1 + 1 / α)p – 2 communication rounds, processor Pp finishes its work. Since we are essentially computing the similarity sequentially in each processor and sending the boundaries to the right processor, the correctness of the algorithm comes naturally from the correctness of the sequential algorithm. Then, after (1 + 1 / α)p – 2 communication rounds, S(m, n) will store the similarity between the strings A and C.
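The round count of Theorem 3 can be checked with a small, purely illustrative simulation of the dependency structure: processor Pi can compute its j-th block of αm/p rows only after Pi-1 has forwarded the corresponding boundary and after its own previous block, so, counting rounds from 0 as in the text, the round in which Pp finishes its last block matches (1 + 1/α)p – 2. This sketch is not the authors' code.

```python
# Simulate the block dependencies of the improved algorithm and compare the
# finishing round of processor P_p with the formula (1 + 1/alpha)p - 2.

from fractions import Fraction

def last_round(p: int, alpha: Fraction) -> int:
    blocks = int(p / alpha)                  # each processor computes p/alpha blocks
    round_of = [[0] * blocks for _ in range(p)]
    for i in range(p):
        for k in range(blocks):
            after_left = -1 if i == 0 else round_of[i - 1][k]   # needs R_{i-1}^k
            after_prev = -1 if k == 0 else round_of[i][k - 1]   # needs own block k-1
            round_of[i][k] = max(after_left, after_prev) + 1    # rounds are 0-indexed
    return round_of[p - 1][blocks - 1]

p = 8
for alpha in (Fraction(1), Fraction(1, 2), Fraction(1, 4), Fraction(1, 8)):
    simulated = last_round(p, alpha)
    predicted = (1 + 1 / alpha) * p - 2
    print(f"alpha={alpha}: simulated={simulated}, (1+1/alpha)p-2={predicted}")
```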
Figure 13. Time curves vs. number of processors with m=8K and n=16K
Figure 14. Time curves vs. values of α with m=8K and n=16K
Figure 15. Table showing running times for various values of α with m=4K and n=8K
Figure 16. Time curves versus number of processors with m=4K and n=8K
Figure 17. Curves of the observed times - quadratic space
Experimental Results of the Improved Similarity Algorithm

In this section we present the experimental results of the improved similarity algorithm. We have implemented the improved similarity algorithm on a Beowulf cluster with 64 nodes. Each node has 256 MB of RAM in addition to 256 MB of swap space. The nodes are connected through a 100 Mb/s interconnection network. Figures 12, 13 and 14 show the running times of the improved similarity algorithm for different values of α for string lengths m = 8K and n = 16K, where K = 1024. For a given experiment and hardware platform, a parameter tuning phase is required to obtain the best value for α. It can be seen that, for very small α, the communication time is significant when compared to the computation time. We have analyzed the behavior of α to estimate the optimal block size. The observed times show that when αm/p (the number of rows of the sub-matrix Si^k) decreases from 16 to 8, the total time increases. The best times are obtained for α between 1/4 and 1/8. Figures 15 and 16 show the running times of the improved similarity algorithm for different values of α for string lengths m = 4K and n = 8K.
Figure 18. Curves of the observed times - linear space
Again, for a given experiment and hardware platform, a parameter tuning phase is required to obtain the best value for α.
Quadratic vs. Linear Space Implementation

We can further improve our results by exploring a linear space implementation, storing a vector instead of the entire matrix. In the usual quadratic space implementation, each processor uses O(mn/p) space, while in the linear space implementation each processor requires only O((m + n)/p) space. The results are impressive, as shown in Figures 17 and 18: with less demand on swap disk space, we get an almost 50% improvement. We have used α = 1.
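The linear-space idea can be illustrated with a short sequential sketch that keeps only the previous row of S; in the parallel version each processor would keep only its n/p-wide slice of that row plus the exchanged boundaries, which is the source of the O((m + n)/p) bound mentioned above. Scoring values are placeholder assumptions.

```python
# Linear-space computation of the similarity score: only the previous row of S
# is retained, so memory is O(n) instead of O(mn).  Scores are illustrative.

def similarity_linear_space(a: str, c: str, match=1, mismatch=-1, space=-1):
    m, n = len(a), len(c)
    prev = [s * space for s in range(n + 1)]        # row r-1 of S
    for r in range(1, m + 1):
        curr = [r * space] + [0] * n                # row r of S
        for s in range(1, n + 1):
            p = match if a[r - 1] == c[s - 1] else mismatch
            curr[s] = max(prev[s - 1] + p, prev[s] + space, curr[s - 1] + space)
        prev = curr
    return prev[n]

print(similarity_linear_space("GACGGATTAG", "GATCGGAATAG"))
```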
CONCLUSION

We have presented a basic and an improved parameterized BSP/CGM parallel algorithm to compute the score of the similarity between two strings. On a distributed memory parallel computer of p processors, each with O((m + n)/p) memory, the proposed algorithm requires O(p) communication rounds and O(mn/p) local computing time. The novelty of the improved similarity algorithm is a compromise between the workload of each processor and the number of communication rounds required, expressed by a new parameter called α. We have worked with a variable block size of αm/p × n/p and studied the behavior of the block size. We show how this parameter can be tuned to obtain the best overall parallel time in a given implementation. Very promising experimental results are shown. Though we dedicated considerable space to presenting the two string similarity algorithms, these algorithms serve the purpose of illustrating two main issues. The first issue is the amount of data transmitted in a communication round. For a practical implementation to be successful we should attempt to minimize this amount, even when it is already within the limit allowed by the CGM model. The second issue concerns the trade-off between the number of communication rounds, which the CGM model attempts to minimize, and the overall communication time taken in the communication rounds. Sometimes a larger number of communication rounds may actually reduce the total amount of data transmitted in those rounds. To this end the parameter α is introduced in the improved similarity algorithm. By adjusting the value of α, we can use more communication rounds while diminishing the total amount of data transmitted in the communication rounds, thus obtaining a more efficient solution. As a final observation, notice that a characteristic of the wavefront communication pattern is that each processor communicates with few other processors. This makes it very suitable as a potential application for grid computing.
REFERENCES

Alves, C. E. R., Caceres, E. N., Dehne, F., & Song, S. W. (2002). A CGM/BSP Parallel Similarity Algorithm. In Proceedings I Brazilian Workshop on Bioinformatics (pp. 1-8). Porto Alegre: SBC Computer Society.
Alves, C. E. R., Caceres, E. N., Dehne, F., & Song, S. W. (2003). A Parallel Wavefront Algorithm for Efficient Biological Sequence Comparison. In Kumar, M. L. Gavrilva, C. J. K. Tan, & P. L'Ecuyer (Eds.), The 2003 International Conference on Computational Science and its Applications (LNCS Vol. 2668, pp. 249-258). Berlin: Springer Verlag.

Alves, C. E. R., Caceres, E. N., & Song, S. W. (2006). A coarse-grained parallel algorithm for the all-substrings longest common subsequence problem. Algorithmica, 45(3), 301–335. doi:10.1007/s00453-006-1216-z

Apostolico, A., Atallah, M. J., Larmore, L. L., & Macfaddin, S. (1990). Efficient parallel algorithms for string editing and related problems. SIAM Journal on Computing, 19(5), 968–988. doi:10.1137/0219066

Dehne, F. (1999). Coarse grained parallel algorithms. Algorithmica, 24(3/4), 173–176.

Dehne, F., Fabri, A., & Rau-Chaplin, A. (1993). Scalable parallel geometric algorithms for coarse grained multicomputers. In Proceedings ACM 9th Annual Computational Geometry (pp. 298-307).

Galil, Z., & Park, K. (1991). Parallel dynamic programming (Tech. Rep. CUCS-040-91). New York: Columbia University, Computer Science Department.

Gengler, M. (1996). An introduction to parallel dynamic programming. In Solving Combinatorial Optimization Problems in Parallel (LNCS Vol. 1054, pp. 87-114). Berlin: Springer Verlag.

Hall, P. A., & Dowling, G. R. (1980). Approximate string matching. Comput. Surveys, 12(4), 381–402. doi:10.1145/356827.356830

Hunt, J. W., & Szymansky, T. (1977). An algorithm for differential file comparison. Communications of the ACM, 20(5), 350–353. doi:10.1145/359581.359603

Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443–453. doi:10.1016/0022-2836(70)90057-4

Sellers, P. H. (1980). The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms, (4): 359–373. doi:10.1016/0196-6774(80)90016-4

Setubal, J., & Meidanis, J. (1997). Introduction to computational molecular biology. Boston: PWS Publishing Company.

Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Bio. (147), 195-197.

Valiant, L. (1990). A bridging model for parallel computation. Communications of the ACM, 33(8), 103–111. doi:10.1145/79173.79181

Wu, S., & Manber, U. (1992). Fast text searching allowing errors. Communications of the ACM, 35(10), 83–91. doi:10.1145/135239.135244
KEY TERMS AND DEFINITIONS

Coarse-Grained Multicomputer: A simple and realistic parallel computing model, characterized by two parameters (input size n and number of processors p), in which local computation rounds alternate with global communication rounds, with the goal of minimizing the number of communication rounds.

Granularity: A measure of the size of the components, or descriptions of components, that make up a system. In parallel computing, granularity refers to the amount of computation that can be performed by the processors before requiring a communication step to exchange data.

Scalability: A desirable property of a system, a network, or a process, which indicates its ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged.

String Similarity Metrics: Text-based metrics resulting in a similarity or dissimilarity (distance) score between two text strings, for approximate matching or comparison.

Systolic Algorithm: An algorithm that has the characteristics of a systolic array.

Systolic Array: A pipelined network of processing elements called cells, used in parallel computing, where cells compute data, store it independently of each other, and pass the computed data to neighboring cells.

Wavefront Algorithm: An algorithm that has the characteristics of a systolic array, also known as a systolic algorithm.
ENDNOTE 1
Partially supported by FAPESP Proc. No. 2004/08928-3, CNPq Proc. No. 55.0094/05-9, 55.0895/07-8, 30.5362/06-2, 30.2942/04-1, 62.0123/04-4, 48.5460/06-8, FUNDECT 41/100.115/2006, and the Natural Sciences and Engineering Research Council of Canada.
Chapter 18
Scientific Workflow Scheduling with Time-Related QoS Evaluation
Wanchun Dou, Nanjing University, P. R. China
Jinjun Chen, Swinburne University of Technologies, Australia
ABSTRACT

This chapter introduces a scheduling approach for cross-domain scientific workflow execution with time-related QoS evaluation. Generally, scientific workflow execution often spans self-managing administrative domains to achieve a global collaboration advantage. In practice, it is infeasible for a domain-specific application to disclose its process details for privacy or security reasons. Consequently, it is a challenging endeavor to coordinate scientific workflows and their distributed domain-specific applications from a service invocation perspective. Therefore, in this chapter, the authors aim to propose a collaborative scheduling approach, with time-related QoS evaluation, for navigating cross-domain collaboration. Under this collaborative scheduling approach, a private workflow fragment can maintain temporal consistency with a global scientific workflow in resource sharing and task enactments. Furthermore, an evaluation is presented to demonstrate the scheduling approach.
DOI: 10.4018/978-1-60566-661-7.ch018

INTRODUCTION

In the past few years, computing infrastructures such as the grid have emerged to accommodate the powerful computing and resource-sharing capabilities required by cross-organizational workflow applications (Wieczorek, 2005; Fox, 2006). Scientific workflow is a new, special type of workflow that often underlies many large-scale complex e-science applications such as climate modeling, structural biology and chemistry, medical surgery or disaster recovery simulation (Ludäscher, 2005; Bowers, 2008; Zhao, 2006). This new type of scientific workflow application is gaining more and more momentum due to its key role in e-Science and cyber-infrastructure applications. As scientific workflows are typically
data-centric and dataflow-oriented "analysis pipelines" (Ludäscher, 2005; McPhillips, 2005), scientists often need to "glue" together various cross-domain services such as cross-organizational data management, analysis, simulation, and visualization services (Yan, 2007; Rygg, 2008). Compared with business workflows, scientific workflows have special features such as computation, data or transaction intensity, fewer human interactions, and a larger number of activities (Wieczorek, 2005). Accordingly, scientific workflow applications frequently require collaborative patterns marked by multiple domain-specific applications from different organizations. An engaged domain-specific application often contributes a definite local computing goal to the global scientific workflow execution. Typically, in this loosely coupled application environment, goal-specific scientists are rather individualistic and more likely to create their own "knowledge discovery workflows" by taking advantage of available services (Ludäscher, 2005). This promotes scientific collaboration in the form of service invocation for achieving certain computing goals. To facilitate scientific workflow development and execution, cross-domain workflow modeling and scheduling are key topics that currently attract more and more attention (Wieczorek, 2005; Yan, 2007; Yu, J., & Buyya, R., 2005; Yu, J., Buyya, R., & Tham, C. K., 2005). For example, Yu and Buyya (Yu, J., & Buyya, R., 2005) provided a general taxonomy of scientific workflow, in which workflow design, workflow scheduling, fault tolerance, and data movement are four key features associated with the development and execution of a scientific workflow management system in a Grid environment. Furthermore, they believed that a scientific workflow paradigm could greatly enhance scientific collaboration by spanning multiple administrative domains to obtain specific processing capabilities. Here, scientific collaborations are often navigated by data-dependency and temporal-dependency relations among goal-specific domain applications, in which a domain-specific application is often implemented as a local workflow fragment deployed inside a self-managing organization for providing the demanded services in time. In a grid computing infrastructure, a service for scientific collaboration is often called a grid service, which addresses resource discovery, security, resource allocation, and other concerns (Foster, 2001). For cross-organizational collaboration, existing (global) analysis techniques often mandate every domain-specific service to unveil all individual behaviors for scientific collaboration (Chiu, 2004). Unfortunately, such an analysis is infeasible when a domain-specific service refuses to disclose its process details for privacy or security reasons (Dumitrescu, 2005; Liu, 2006). Therefore, it is always a challenging endeavor to coordinate a scientific workflow and its distributed domain-specific applications (local workflow fragments producing domain-specific services), especially when a local workflow fragment is engaged in different scientific workflow executions in a concurrent environment. Generally, a local workflow fragment for producing a domain-specific service is often deployed inside a self-governing organization, and can be treated as a private workflow fragment of that organization.
In this situation, to effectively coordinate internal service performance and external service invocation, as well as their quality, collaborative scheduling between a scientific workflow and the engaged self-managing organizations may be greatly helpful for promoting the interactions of independent local applications with the higher-level global application. It aims at coordinating executions for computation- and data-rich scientific collaboration in a readily available way. For example, resource management in a Grid environment is typically subject to the individual access, accounting, priority, and security policies of the resource owner. Resource sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs in the form of services. The usage policy imposed on these resources is often enforced by a self-managing organization (Foster, 2001; Batista, 2008).
At runtime, if a self-managing organization refuses to disclose its process details for privacy or security reasons, the resource service process is often promoted by a resource broker (Abramsona, 2002; Elmroth, 2008). Besides, if a resource cannot be shared by different resource users at the same time, the executions of different scientific workflows around these resources should coordinate their resource sharing in a compromising way; otherwise, conflicts would occur during the execution. Therefore, cross-organizational scientific workflow execution, resource allocation, and compromising usage policies should be scheduled in an incorporated way in a concurrent environment (Yan, 2007; Li, 2006). For instance, a computing center is a typical self-managing organization that often bears heavy computing loads from numerous goal-specific applications. The scheduling of a computing center for satisfying its multiple external service requirements is a typical coordinative process between a scientific workflow and a self-managing organization. A compromising resource usage policy is often adopted for coordinating the use of its computational resources engaged in different scientific collaborations in a concurrent environment. Additionally, for a performance-driven scientific workflow execution, collaborative scheduling is a more complex situation, as a collaborative scheduling process covers not only cross-organizational resource sharing but also task enactments deployed inside self-managing organizations (Yan, 2007; Batista, 2008), which are often initiated by domain-specific service specifications and their application context specifications. In view of these observations, a collaborative scheduling approach is investigated, in this chapter, for achieving coordinated executions of a scientific workflow with time-related QoS evaluation. It is specifically deployed in a Grid environment. Taking advantage of the collaborative scheduling strategy, a private workflow fragment can maintain its temporal consistency with a scientific workflow in resource sharing and task enactments. Please note that our method subscribes to relative time rather than absolute time in collaborative scheduling applications. The rest of this chapter is organized as follows. In Section 2, some preliminary knowledge of QoS is presented to pilot our further discussion. In Section 3, a temporal model of service-driven scientific workflow execution is investigated. In Section 4, application context analyses of scientific workflow execution are discussed. In Section 5, taking advantage of the temporal model presented in Section 3 and the context analysis presented in Section 4, a temporal reasoning rule is put forward for the collaborative scheduling application of a scientific workflow. In Section 6, an evaluation is proposed to demonstrate our approach. In Section 7, related works and a comparison analysis are presented to evaluate the feasibility of our proposal. Finally, the conclusions and our future work are presented in Section 8.
PRELIMINARY KNOWLEDGE OF QoS

With recent advances in pervasive devices and communication technologies, there are increasing demands in scientific and engineering applications for ubiquitous access to networked services. These services extend support from Web browsers on personal computers to handheld devices and sensor networks. Generally, a service is a function that is well-defined, self-contained, and does not depend on the context or state of other services. Service-Oriented Architecture (SOA) is essentially a collection of services. These services communicate with each other, and the communication can involve either simple data passing or two or more services coordinating some activity (http://www.servicearchitecture.com/). Figure 1 illustrates a general style of the service-oriented scenario. In Fig.1, a service consumer sends a service request to a service provider, and the service provider returns a response to the service consumer.
Figure 1. Service-oriented scenario between service consumer and service provider
The request and subsequent response connections are defined in some way that is understandable to both the service consumer and the service provider. Here, the service can be reified into a unit of work done by a service provider to achieve a desired goal for a service consumer, and both the service provider and the service consumer can be roles played by software agents on behalf of their owners. Service-oriented applications are mostly launched by the Web service invocation style illustrated in Fig.1. Generally, Web services are self-contained business applications which can be published, located and invoked by other applications over the Internet. Different vendors, research firms and standards organizations may define Web services differently; however, the common theme in all these definitions is that Web services are loosely coupled, dynamically bound, accessed over the Web and standards based. Web services often use XML schema to create a robust connection. They are based on strict standard specifications to work together and with other similar kinds of applications. More specifically, Web services are based on three key standards in their current manifestation, i.e., SOAP (XML-based message format), WSDL (XML-based Web Services Description Language), and UDDI (XML-based Universal Description, Discovery, and Integration of Web Services). Any use of these basic standards constitutes a Web service. Universal, platform-independent connectivity (via XML-based SOAP messages) and self-describing interfaces (through WSDL) characterize Web services, and UDDI is the foundation for a dynamic repository which provides the means to locate appropriate Web services. The typical Web service invocation, taking advantage of those standards, is demonstrated by Fig.2. Web services allow for the development of loosely coupled solutions. The independent resources expose an interface which can be accessed over the network. For example, a firm may expose a particular application as a service, which would allow the firm's partners to access that service.
Figure 2. A typical Web service invocation paradigm in technology
This is made possible by standards which define how Web services are described, discovered, and invoked. This adherence to strict standards enables applications in one business to inter-operate easily with those of other businesses. In addition, it allows application interactions across disparate platforms and those running on legacy systems, and thereby offers a company the capability of conducting business electronically with potential business partners in a multitude of ways at reasonable cost. It has to be acknowledged that Web services technology is only one of several technologies that enable component-based distributed computing and support information system integration efforts, largely due to its universal nature and the broad support by major IT technologies. Other standards, such as WSFL (Web Services Flow Language) or BPEL4WS (Business Process Execution Language for Web Services), also play an important role, but are not necessarily required to consume or provide Web services, and if the location of the Web service is known, even UDDI is not required. The basic concepts and scenarios mentioned above can also be referred to at http://www.w3.org/2002/ws/. The emergence of Web services has created unprecedented opportunities for organizations to establish more agile and versatile collaborations with other organizations. Widely available and standardized Web services make it possible to realize cross-organizational collaboration. A typical SOA paradigm based on the Web service rationale is illustrated by Fig.3, in which there are three fundamental roles, Service Provider, Service Requestor and Service Registry, and three fundamental operations: Publish, Find and Bind. The service provider is responsible for creating a service description, publishing it to one or more service registries, and receiving Web service invocation messages from one or more service requestors. A service requestor is responsible for finding a service description published to one or more service registries and for using service descriptions to bind to or invoke Web services hosted by service providers. The service registry is responsible for advertising Web service descriptions published to it by service providers and for allowing service requestors to search the collection of service descriptions contained within the registry. Once the service registry makes the match, the rest of the interaction is directly between the service requestor and the service provider for the Web service invocation (Graham, 2001). Please note that although the grid service is often adopted in grid applications and the scientific workflow research domain, it is essentially a special Web service that provides a set of well-defined interfaces and follows specific conventions, so we do not distinguish between a grid service and a Web service in this chapter. In Fig.3, since they are intended to be discovered and used by other applications across the Web, Web services need to be described and understood both in terms of functional capabilities and QoS properties. Therefore, a service is always specified by its function attributes (i.e., a service's function specification including inputs, outputs, preconditions and effects) and its non-function attributes (e.g., time, price, availability, etc., for evaluating a service's execution). Generally, the service profile primarily describes a service's function attributes.
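The publish/find/bind interplay among the three roles can be caricatured in a few lines of code. Everything here (the registry class, the WeatherService name, the get_forecast function) is invented purely for illustration; a real deployment would rely on UDDI, WSDL and SOAP rather than an in-process dictionary.

```python
# Toy sketch of the three SOA roles: a provider publishes, a requestor finds
# and binds.  All names are hypothetical and exist only for this illustration.

class ServiceRegistry:
    def __init__(self):
        self._descriptions = {}

    def publish(self, name, description, endpoint):
        self._descriptions[name] = (description, endpoint)   # provider publishes

    def find(self, name):
        return self._descriptions.get(name)                  # requestor finds

registry = ServiceRegistry()

# Service provider: hosts a service and publishes its description.
def get_forecast(city):
    return f"forecast for {city}: sunny"

registry.publish("WeatherService", "returns a weather forecast", get_forecast)

# Service requestor: finds the description, then binds to (invokes) the service.
description, endpoint = registry.find("WeatherService")
print(description, "->", endpoint("Nanjing"))
```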
In cross-domain grid service invocations, the quality of a grid service (mainly specified by its non-function attributes) is often evaluated in terms of common security semantics, distributed workflow and resource management, coordinated fail-over, problem determination services, and other metrics across a collection of resources with heterogeneous and often dynamic characteristics. In (Zeng, 2004), five generic quality criteria for elementary services are presented as follows: (1) Execution price. Given an operation op of a service s, the execution price qpr(s, op) is the fee that a service requester has to pay for invoking the operation op. (2) Execution duration. Given an operation op of a service s, the execution duration qdu(s, op) measures the expected delay in seconds between the moment when a request is sent and the moment when the results are received.
Figure 3. A typical SOA paradigm based on Web service
(3) Reputation. The reputation qrep(s) of a service s is a measure of its trustworthiness. It mainly depends on end users' experiences of using the service s; generally, different end users may have different opinions on the same service. (4) Successful execution rate. The successful execution rate qrat(s) of a service s is the probability that a request is correctly responded to (i.e., the operation is completed and a message indicating that the execution has been successfully completed is received by the service requestor) within the maximum expected time frame indicated in the Web service description. The successful execution rate (or success rate for short) is a measure related to the hardware and/or software configuration of Web services and the network connections between the service requesters and providers. (5) Availability. The availability qav(s) of a service s is the probability that the service is accessible.
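One possible way to record these five criteria and collapse them into a single comparable score is sketched below. The weights, the normalization of price and duration, and the sample values are illustrative assumptions and are not part of the cited QoS model.

```python
# Illustrative recording and weighted aggregation of the five quality criteria.

from dataclasses import dataclass

@dataclass
class ServiceQoS:
    price: float          # q_pr(s, op), execution price
    duration: float       # q_du(s, op), expected delay in seconds
    reputation: float     # q_rep(s), e.g. an averaged user ranking in [0, 1]
    success_rate: float   # q_rat(s), probability of a correct, timely response
    availability: float   # q_av(s), probability that the service is accessible

def weighted_score(q: ServiceQoS, w=(0.2, 0.2, 0.2, 0.2, 0.2)):
    # Price and duration are "negative" criteria, so invert them; the other
    # three already lie in [0, 1] and count positively.
    return (w[0] * (1.0 / (1.0 + q.price)) +
            w[1] * (1.0 / (1.0 + q.duration)) +
            w[2] * q.reputation +
            w[3] * q.success_rate +
            w[4] * q.availability)

candidates = {
    "s1": ServiceQoS(2.0, 1.5, 0.8, 0.95, 0.99),
    "s2": ServiceQoS(0.5, 4.0, 0.6, 0.90, 0.97),
}
print(max(candidates, key=lambda name: weighted_score(candidates[name])))
```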
A Temporal Model of Service-Driven Scientific Workflow Execution

In this chapter, we mainly focus on scientific workflow scheduling with time-related QoS evaluation in a grid environment. More specifically, a temporal model for service-driven scientific workflow execution is presented in this section, and its further applications are investigated in later sections. In (Cardoso, 2004), four distinct advantages are highlighted for organizations that characterize their workflow developments and executions based on QoS:
1. QoS-based design: it allows organizations to translate their vision into their business processes more efficiently, since workflows can be designed according to QoS metrics. For e-commerce processes it is important to know the QoS an application will exhibit before making the service available to its customers.
2. QoS-based selection and execution: it allows for the selection and execution of workflows based on their QoS, to better fulfill customer expectations. As workflow systems carry out more complex and mission-critical applications, QoS analysis serves to ensure that each application meets user requirements.
3. QoS monitoring: it makes possible the monitoring of workflows based on QoS. Workflows must be rigorously and constantly monitored throughout their life cycles to assure compliance both with initial QoS requirements and targeted objectives. QoS monitoring allows adaptation strategies to be triggered when undesired metrics are identified or when threshold values are reached.
4. QoS-based adaptation: it allows for the evaluation of alternative strategies when workflow adaptation becomes necessary. In order to complete a workflow according to initial QoS requirements, it is necessary to expect to adapt, re-plan, and reschedule a workflow in response to unexpected progress, delays, or technical conditions. When adaptation is necessary, a set of potential alternatives is generated, with the objective of changing a workflow so that its QoS continues to meet initial requirements. For each alternative, prior to actually carrying out the adaptation in a running workflow, it is necessary to estimate its impact on the workflow QoS.
In a service-driven workflow system, time is one of the key parameters engaged in its QoS specification (Cardoso, 2004). Timing constraints are often associated with organizational rules, laws, commitments, technical demands, and so on. In (Zeng, 2004), two kinds of timing constraints related to activities are put forward: internal timing constraints and external timing constraints. The internal timing constraint is specified as the execution duration or executable time span, and the external timing constraints are specified as the temporal dependency relations between different activities. On the assumption that, given a workflow model, designers could assign an execution duration and an executable time span (during which an activity could be executed) to every individual activity based on their experience and expectations from past executions, Li et al. (Zeng, 2004) defined the duration time exactly in their timing constraint model. In practice, we believe that it may be more reasonable to specify the duration time as a time span. For example, at the stage of system modeling it may be more acceptable to specify the execution duration as 3 to 5 days (abbreviated (3, 5)) than to specify it as exactly 4 days. By extending the timing constraint definitions presented in (Zeng, 2004) with this idea, we put forward a general timing constraint model for service invocation engaged in cross-domain workflow execution. To facilitate temporal-dependency analysis, we observe that the service invocation cost often consists of a service producing cost and a service delivering cost. The service producing process aims at producing concrete service content and underlies the later service delivering. As the service producing process is deployed to reify the required service item, its time-related QoS evaluation is often calculated based on the internal temporal cost inside an organization. In contrast, the time-related QoS evaluation of the service delivering cost is often calculated based on the external temporal cost associated with the service distributing process between service providers and service consumers, as well as the administrative cost consumed in cross-organizational collaboration. Accordingly, the cost evaluation of a service invocation is calculated from these two costs. For example, when a car part vendor or car part enterprise receives an order for some parts, the service process spans the time from receiving the order to delivering the products. It often contains two stages: the first stage focuses on manufacturing the required parts and is associated with the enterprise's internal time elapsing; the second stage focuses on timely delivery of the required parts and is associated with the enterprise's external time elapsing. The QoS is often related to the time of both the service producing process and the service delivering process. Here, the internal time is determined by the service producing process and the external time is determined by the service delivering process. The cost of the service can be evaluated based on the time cost associated with these two stages, which are two sides of the time analysis of the same service invocation. Please note that in some situations there may be only a service delivering process without a service producing process. For example, the service provided by an Urban Emergency Monitoring and Support Centre (EMSC) is often related to service delivering without a service producing process in a concrete service invocation. Here, the QoS is only related to the time of service delivering.
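The interval view of durations suggested above can be expressed very compactly. The sketch below treats the producing and delivering stages as (min, max) spans and composes them by interval addition; the concrete numbers are illustrative only.

```python
# Durations kept as (min, max) spans rather than single values; the producing
# and delivering stages compose by interval addition.  Numbers are illustrative.

def add_spans(a, b):
    return (a[0] + b[0], a[1] + b[1])

producing = (3, 5)     # e.g. manufacturing the ordered parts takes 3 to 5 days
delivering = (1, 2)    # shipping them takes 1 to 2 days
print("total service invocation span:", add_spans(producing, delivering))  # (4, 7)
```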
Associated with the service producing process and the service delivering process, some typical service invocation modes are discussed here to specify their coordination.
Figure 4. A temporal logic-based time model for steering service invocation
Fig.4 illustrates a typical coordination relation between a service provider and a service consumer. The temporal parameters illustrated in Fig.4 are specified in Table 1. According to the temporal-dependency relations among the parameters listed in Table 1, some typical service invocation styles are specified as follows.
1. If SP-End = SD-Start, we believe that the service delivering process is a strong service delivering style.
2. If SP-End < SD-Start, we believe that the service delivering process is a weak service delivering style.
3. If SC-Start = SD-End, we believe that the service consuming process is a strong service invocation style.
4. If SC-Start < SD-End, we believe that the service consuming process is a weak service invocation style.
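A small sketch that applies these four rules to concrete temporal parameters might look as follows; the sample values are illustrative and any consistent relative time scale can be used.

```python
# Classify the delivering and invocation styles from the temporal parameters
# of Table 1.  The model assumes SP-End <= SD-Start and SC-Start <= SD-End.

def delivering_style(sp_end, sd_start):
    return "strong service delivering" if sp_end == sd_start else "weak service delivering"

def invocation_style(sc_start, sd_end):
    return "strong service invocation" if sc_start == sd_end else "weak service invocation"

# Example: producing ends at t=5, delivering runs over [6, 8], consuming starts at t=8.
print(delivering_style(sp_end=5, sd_start=6))    # weak service delivering
print(invocation_style(sc_start=8, sd_end=8))    # strong service invocation
```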
In practice, SC-Start is often determined first, which initiates the service invocation in a bottom-up way. More specifically, a service provider often schedules its service producing process and its service delivering process according to the deadline required by a service consumer.
Table 1. Specifications of temporal parameters indicated in Figure 4

Temporal Parameter    Specification
SD-Start              The start time of the service delivering process
SD-End                The end time of the service delivering process
SP-Start              The start time of the service producing process
SP-End                The end time of the service producing process
SC-Start              The start time of the service consuming process
SC-End                The end time of the service consuming process
[T-SR]                The time point for firing the service producing process
What a service requestor cares about is the expected time to achieve its required item, while the service provider takes care of the time of service producing and service delivering, respectively, based on a certain service-providing deadline. Here, if the time pair [SP-Start, SP-End] degenerates into (0, 0), we have a special service paradigm without a service producing process. For example, if the car part vendor or car part enterprise has the required parts in stock, the time of part producing is omitted and the service process depends only on the process of part delivering. If the service provider and the service consumer illustrated in Figure 4 respectively stand for two workflow fragments, the figure indicates a typical service-driven cross-domain workflow execution with time-related QoS evaluation. It is particularly suitable for specifying a service-driven scientific workflow execution in a grid environment. Furthermore, in Fig.4, the service definitions mainly consist of such foundational prescriptions as the definition of the service function, the QoS prescription, the cost evaluation, the serving relation definition, and other service item definitions. They prescribe the policies of how to organize the required Web services into a service-based workflow system, and they also satisfy the requirements of a scientific workflow execution on security issues in a grid environment. For example, in a grid environment, resource access and resource sharing are often invoked by granting a license to a certain resource user and then opening the resource access to valid consumers (Yan, 2007; Foster, 2001). According to the rationale of temporal logic and the service invocation scenario demonstrated in Fig.4, certain security policies could easily be embedded into grid service invocation for a grid computing paradigm. In practice, certain security policies are often integrated into a grid service's QoS evaluation, which provides different profiles of control flow and data flow specification. In the following section, we will take advantage of this temporal model to explore the scheduling application of cross-domain scientific workflow execution in a grid environment.
APPLICATION CONTEXT ANALYSES OF SCIENTIFIC WORKFLOW EXECUTION

As mentioned in (Yu, J., & Buyya, R., 2005), cross-organizational collaboration engaged in a scientific workflow often aims at obtaining specific processing capabilities by spanning multiple administrative domains. Here, a domain-specific application engaged in scientific collaboration is often uniquely associated with a local workflow fragment deployed in a self-managing organization. In this chapter, when a workflow fragment refuses to disclose some of its process details for privacy or security reasons, it is treated as a private workflow fragment of the self-managing organization. For a private workflow fragment, the actions and resources hidden from the scientific workflow specification and execution are treated as silent actions and silent resources. In contrast with its silent actions and silent resources, a self-managing organization only exposes its publicly accessible ports for scientific collaboration. Therefore, a private goal-specific workflow fragment consists of a set of silent actions, silent resources and some publicly accessible ports. It is essentially a gray box embedded in a scientific workflow. In scientific workflow execution, it is wholly a functional unit for scientific collaboration, and it is triggered through its publicly accessible port for certain computing goals. In this chapter, a publicly accessible port is treated as an interaction interface between a scientific workflow and a self-managing organization. Fig.5 demonstrates a scientific workflow and its application context associated with three self-managing organizations. In Fig.5, a scientific workflow consists of three tasks (i.e., Ti, Tj, and Tk) and three resources (i.e., Ri, Rj, and Rk). Ti, Tj, and Tk are respectively associated with three private workflow fragments (i.e., Pri-WF1, Pri-WF2, and Pri-WF3) for achieving certain local computing goals.
Figure 5. Global application context of a cross-organizational scientific workflow execution
Pri-WF1, Pri-WF2, and Pri-WF3 are respectively deployed inside three self-managing organizations (i.e., SM-Org-1, SM-Org-2, and SM-Org-3). Obviously, this scientific workflow is typically deployed in a cross-organizational way. The scientific workflow specification cannot cover the silent actions and silent resources contained in the private fragments, as they exclusively belong to the self-managing organizations for certain privacy or security reasons. Fig.6 illustrates Pri-WF1, Pri-WF2, and Pri-WF3, and a global scientific workflow view obtained by masking the silent actions and silent resources engaged in Pri-WF1, Pri-WF2, and Pri-WF3. From Fig.6, we can see that Pri-WF1, Pri-WF2, and Pri-WF3 are respectively enacted by different self-managing organizations in isolated environments. The scientific workflow only covers their publicly accessible ports. As a private workflow fragment often masks part of its internal workflow specification and its scheduling specification from the scientific workflow, it is a challenging endeavor to coordinate the executions of a scientific workflow and a private workflow fragment at runtime for their scientific collaboration; it greatly depends on the collaborative execution specification. In view of this challenge, we now discuss the temporal context of a scientific workflow for its runtime scheduling. As mentioned in (Rajpathak, 2006), "scheduling deals with the assignment of jobs and activities to resources and time ranges in accordance with relevant constraints and requirements." For a scientific workflow, its scheduling application is always promoted in a top-down way. For example, scheduling tools such as Petri nets, WF-nets, or DAGs (Zeng, 2004; Li, 2003; van der Aalst, 1998; Guan, 2006) are typically associated with a direction from source behaviors to sink behaviors. They represent a typical downstream scheduling style. In this scheduling style, the start times of the source behaviors are determined in advance, and then the succeeding activities are scheduled according to certain workflow patterns (e.g., the And-Split, Or-Split, And-Join, and Or-Join workflow patterns (van der Aalst, 2003), to name a few) and certain temporal dependencies (e.g., Before, Meet, Overlap (Allen, 1983), etc.). Its scheduling application is unfolded in the same direction as its practical execution. Here, we use a time axis t1 to indicate the scheduling application of a scientific workflow.
Figure 6. Three private workflow fragments and a scientific workflow view by masking its silent context
For a private workflow fragment associated with a scientific workflow, its scheduling application is different from that of the scientific workflow. As a private workflow fragment is always triggered through its publicly accessible port, although there are certain source behaviors and sink behaviors in its model and its later concrete execution, the concrete temporal parameters of its behaviors cannot be scheduled independently. Its scheduling application is unfolded in two steps. At the first stage, according to the expected computing goal specified by a scientific collaboration, a private workflow fragment schedules its workflow model and execution in an isolated application environment. At this scheduling stage, we take no consideration of the temporal constraints of the publicly accessible ports specified by the scientific workflow scheduling specification. Here, we use a time axis t2 to indicate the scheduling application environment of a private workflow fragment. At the second stage, taking advantage of the temporal constraints of the publicly accessible port, the temporal distributions of the private workflow fragment indicated by time axis t2 are wholly mapped onto time axis t1 to keep the temporal consistency with its publicly accessible port. Through this time mapping, we can guarantee the temporal consistency between the executions of a private workflow fragment and a scientific workflow for their scientific collaboration. The first scheduling stage aims at specifying a private workflow fragment's internal temporal dependencies among its silent behaviors and its publicly accessible port, without external temporal constraints. Its scheduling application is initiated from a certain source point and is unfolded in the same direction as the workflow fragment's execution in practice. The second scheduling stage aims at keeping the external temporal consistency with a scientific workflow execution for certain scientific collaboration, through temporal transfer from time axis t2 to time axis t1. Its temporal calculating process is initiated by a publicly accessible port, and it may be unfolded in a reversed direction compared with the workflow fragment's execution in practice. It is a typical hierarchical scheduling process.
Figure 7. Typical temporal parameters and their distributions for scheduling a scientific workflow SWF
For example, in Fig.5, the publicly accessible ports of Ri and Ti inside SM-Org-1 stand for a sink resource and a sink task of the private workflow fragment Pri-WF1. In this situation, the scheduling application of Pri-WF1 depends on the expected start time of the scientific workflow. As the scheduling application of Pri-WF1 is initiated by the scheduling result of the scientific workflow, their executions should be scheduled in an incorporated way. Please note that the temporal disciplines (e.g., weak service delivering style, strong service delivering style, weak service invocation style, and strong service invocation style) presented in Section 3 can be incorporated into the application context of a cross-organizational scientific workflow execution. A concrete example of a service delivering style is demonstrated in the evaluation analysis presented in Section 6.2. In the next section, we will focus on exploring a temporal reasoning rule which can coordinate the cross-organizational executions of a scientific workflow.
A Temporal Reasoning Rule for Collaborative Scheduling Application of Scientific Workflow

According to the temporal model presented in Section 3, a temporal reasoning rule is investigated in this section for the cross-domain scientific workflow scheduling application. Firstly, suppose that there is just one publicly accessible port contained in a private workflow fragment in a self-managing organization; more complex situations are investigated at the end of this section.

Definition 1. For a scientific workflow SWF, its expected executable duration can be specified by a time period [SWF-Estart, SWF-Eend], in which SWF-Estart and SWF-Eend respectively stand for SWF's expected start time and expected end time.

Definition 2. Suppose that there is a private workflow fragment Pri-WFi associated with a scientific workflow SWF, and that it has a publicly accessible port Pi engaged in SWF's execution. For Pi, its expected executable duration can be indicated by a time period [Pri-WFi-Ep-start, Pri-WFi-Ep-end], in which Pri-WFi-Ep-start and Pri-WFi-Ep-end respectively stand for Pi's expected start time and its expected end time.

Here, SWF-Estart, SWF-Eend, Pri-WFi-Ep-start, and Pri-WFi-Ep-end are specified by SWF's scheduling specification. Fig.7 indicates this unique scheduling process of SWF with a time axis t1. Please note that the temporal parameters indicated by time axis t1 are relative time rather than absolute time. In practice, Pri-WFi is uniquely associated with SWF's execution through Pi, in which SWF plays as a service consumer and Pri-WFi plays as a service provider in their cross-domain scientific collaboration.
Figure 8. Typical temporal parameters and their distributions for scheduling a private workflow fragment Pri-WFi
In their service-driven scientific collaboration, as a service consumer, SWF should first specify its service requirement in terms of what and when; then, as a service provider, Pri-WFi is scheduled for providing the demanded service in time, in terms of how and when. Concretely, Pri-WFi's execution aims at providing the demanded service, based on SWF's specification, in time. Once a service item is determined in terms of what, the required silent resources and silent task enactments can be deployed by a self-managing organization for achieving the expected computing goal; this is associated with the first scheduling stage mentioned above. To provide the demanded service in time, Pri-WFi's implementation should be scheduled in terms of when; this is associated with the second scheduling stage mentioned above. Generally, for a goal-driven workflow execution, if there is no external temporal dependency with other workflow executions, it can be scheduled based on its capacity and past experience in a self-managing way with a special execution goal (Li, 2003).

Definition 3. For a private workflow fragment Pri-WFi that has no external temporal dependency with other workflow executions, its expected executable duration can be specified by a time period [Pri-WFi-Estart, Pri-WFi-Eend], in which Pri-WFi-Estart and Pri-WFi-Eend respectively stand for Pri-WFi's expected start time and its expected end time.

Fig.8 demonstrates this unique scheduling process of Pri-WFi specified by time axis t2. Similarly, the temporal parameters indicated by time axis t2 are also relative time rather than absolute time. As Pri-WFi is uniquely associated with SWF through Pi, there are certain temporal dependencies between Pri-WFi and SWF. To provide the required computing service in time, Pri-WFi should be active in a required duration based on these temporal dependencies. Pri-WFi's start time should be deduced based on the temporal constraints of its publicly accessible port specified by SWF's specification, rather than determined independently. Here, suppose that, associated with time axis t2, Pi's expected start time and expected end time are respectively indicated by Pri-WFi-E'p-start and Pri-WFi-E'p-end. Pri-WFi-E'p-start and Pri-WFi-E'p-end should be respectively equal to Pri-WFi-Ep-start and Pri-WFi-Ep-end in terms of absolute time, or in execution. To keep the temporal consistency between Pri-WFi and SWF, the time parameters Pri-WFi-Estart, Pri-WFi-Eend, Pri-WFi-E'p-start and Pri-WFi-E'p-end indicated by t2 should be mapped onto SWF's time axis t1. In view of this observation, a temporal transferring rule is investigated in this section for keeping temporal consistency in cross-organizational scientific collaboration.
Figure 9. Temporal parameters and their distributions of a scientific workflow example SWF associated with time axis t1
1. For a scientific workflow SWF, as the time period [SWF-Estart, SWF-Eend] covers the time period [Pri-WFi-Ep-start, Pri-WFi-Ep-end], i.e., [Pri-WFi-Ep-start, Pri-WFi-Ep-end] ⊆ [SWF-Estart, SWF-Eend], SWF-Estart should be determined first, and then Pri-WFi-Ep-start and Pri-WFi-Ep-end are determined according to the value of SWF-Estart and SWF's internal temporal distributions. This temporal scheduling is formalized by a scheduling logic of SWF-Estart → [Pri-WFi-Ep-start, Pri-WFi-Ep-end]. It indicates a top-down, or global-to-local, temporal reasoning path for scheduling a scientific workflow.
2. For a private workflow fragment Pri-WFi, although the time period [Pri-WFi-Estart, Pri-WFi-Eend] covers the time period [Pri-WFi-E'p-start, Pri-WFi-E'p-end], i.e., [Pri-WFi-E'p-start, Pri-WFi-E'p-end] ⊆ [Pri-WFi-Estart, Pri-WFi-Eend], a different temporal scheduling logic holds. More specifically, only after the values of Pri-WFi-E'p-start and Pri-WFi-E'p-end are obtained based on the scheduling logic of SWF-Estart → [Pri-WFi-Ep-start, Pri-WFi-Ep-end] can Pri-WFi-Estart and Pri-WFi-Eend be deduced from the concrete values of Pri-WFi-E'p-start, Pri-WFi-E'p-end and Pri-WFi's internal temporal distributions. Here, Fig.8 illustrates Pri-WFi's internal temporal distributions in a qualitative way. This temporal scheduling process can be formalized by a scheduling logic of [Pri-WFi-E'p-start, Pri-WFi-E'p-end] → [Pri-WFi-Estart, Pri-WFi-Eend]. It indicates a bottom-up, or local-to-global, temporal reasoning path for scheduling a private workflow fragment in an incorporated scheduling environment, which is different from the global-to-local temporal reasoning path.
Pri-WFi-E'p-start and Pri-WFi-E'p-end should be respectively equal to Pri-WFi-Ep-start and Pri-WFi-Ep-end in terms of absolute time. Accordingly, the temporal association relation between SWF and Pri-WFi is specified by Definition 4.

Definition 4. The temporal dependency around a publicly accessible port between SWF and Pri-WFi can be formalized by [SWF-Estart, SWF-Eend] → [Pri-WFi-Ep-start, Pri-WFi-Ep-end] → [Pri-WFi-Estart, Pri-WFi-Eend] for keeping temporal consistency in cross-organizational scientific collaboration.

Here, an example is presented to demonstrate the application of the temporal transferring process according to Definition 4. Fig.9 and Fig.10 respectively illustrate the scheduled temporal distributions of an SWF and a Pri-WFi. These two scheduled temporal distributions are respectively associated with time axes t1 and t2. More specifically, in Fig.9, SWF-Estart=0, SWF-Eend=10, Pri-WFi-Ep-start=3, and Pri-WFi-Ep-end=5; in Fig.10, Pri-WFi-Estart=0, Pri-WFi-Eend=5, Pri-WFi-E'p-start=2, and Pri-WFi-E'p-end=4. Here, a temporal association relation is taken into consideration between these two time axes. The incorporated temporal scheduling environment should be specified by a united time axis. For brevity and without loss of generality, time axis t1 is selected as the united time axis t. Some duration parameters associated with Pri-WFi are specified, as below, for later temporal reasoning.
Figure 10. Temporal parameters and their distributions of a private workflow fragment example Pri-WFi associated with time axis t2
1. The duration between Pri-WFi-Estart and Pri-WFi-E'p-start is 2 time units, i.e., Pri-WFi-d1 = 2 time units;
2. The duration between Pri-WFi-E'p-start and Pri-WFi-E'p-end is 2 time units, i.e., Pri-WFi-d2 = 2 time units; and
3. The duration between Pri-WFi-E'p-end and Pri-WFi-Eend is 1 time unit, i.e., Pri-WFi-d3 = 1 time unit.
According to these parameters, the time parameters of Pri-WFi could be re-specified, as below, in the united time axis t. Fig.11 illustrates the re-specified temporal parameters and their distributions in the united time axis t.

1. Pri-WFi-E'p-start = Pri-WFi-Ep-start = 3.
2. Pri-WFi-E'p-end = Pri-WFi-Ep-end = 5.
3. Pri-WFi-Estart = Pri-WFi-Ep-start − Pri-WFi-d1 = 3 − 2 = 1.
4. Pri-WFi-Eend = Pri-WFi-Ep-end + Pri-WFi-d3 = 5 + 1 = 6.
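To make the temporal transferring computation concrete, the following minimal Python sketch restates the mapping under the assumption that the port times fixed by SWF anchor the fragment on the united time axis; the function and parameter names are ours for illustration and are not defined by the chapter.

```python
# Minimal sketch of the temporal transferring rule on the united time axis t.
# Names (map_to_united_axis, d1, d3) are illustrative, not the chapter's API.

def map_to_united_axis(ep_start, ep_end, local_start, local_end,
                       local_p_start, local_p_end):
    """Map Pri-WFi's locally scheduled times (axis t2) onto the united axis t.

    ep_start/ep_end: port times fixed by SWF's specification on axis t1 (= t).
    local_*: Pri-WFi's own schedule on axis t2.
    """
    d1 = local_p_start - local_start      # duration before the port opens
    d3 = local_end - local_p_end          # duration after the port closes
    return {
        "Pri-WFi-E'p-start": ep_start,    # equals Pri-WFi-Ep-start on axis t
        "Pri-WFi-E'p-end": ep_end,        # equals Pri-WFi-Ep-end on axis t
        "Pri-WFi-Estart": ep_start - d1,  # deduced fragment start time
        "Pri-WFi-Eend": ep_end + d3,      # deduced fragment end time
    }

# Worked example above: SWF fixes the port at [3, 5]; Pri-WFi's local schedule
# on t2 is [0, 5] with the port at [2, 4].
print(map_to_united_axis(3, 5, 0, 5, 2, 4))
# -> Estart = 1 and Eend = 6, matching the values derived in the text.
```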
Figure 11. Re-specified temporal parameters and their distributions of the example SWF and Pri-WFi associated with a united time axis t based on their temporal dependency relation

In Fig.11, Pri-WFi's start time (i.e., Pri-WFi-Estart = 1) as specified by time axis t is an ideal start time for producing the required service item for SWF's execution. Otherwise, Pri-WFi will incur some additional time cost or cannot provide the demanded service item in time. For example, if Pri-WFi starts at the zero time point on time axis t, i.e., Pri-WFi-Estart = 0, then since Pi's expected start time is fixed in SWF's specification, i.e., Pri-WFi-Ep-start = 3, and Pri-WFi-Ep-end = 5 cannot be changed, the duration between Pri-WFi-Estart and Pri-WFi-Ep-start becomes 3 time units, i.e., Pri-WFi-d1 = 3 time units. Obviously, this wastes 1 time unit compared with Pri-WFi-d1's original value (i.e., Pri-WFi-d1 = 2) calculated previously. On the other hand, if Pri-WFi starts at the 2nd time point on time axis t, i.e., Pri-WFi-Estart = 2, then as the duration from Pri-WFi-Estart to Pri-WFi-Ep-start is a fixed value, according to their relative time distributions, Pri-WFi-Ep-start should be at the 4th time point, i.e., Pri-WFi-Ep-start = 4, to meet Pri-WFi's workflow specification. Obviously, it
delays the service invocation for satisfying Pri-WFi's execution. This example demonstrates a real-time application for scientific collaboration. In practice, the execution of a scientific workflow system may be a mixture of hard real-time applications and soft real-time applications. Generally, a system is said to be real-time if the total correctness of an operation depends not only upon its logical correctness, but also upon the time in which it is performed (Liu, 2002). Moreover, in a hard or immediate real-time system, the completion of an operation after its deadline is considered useless. A soft real-time system, on the other hand, tolerates such lateness and takes the overhead of context switching into consideration. Soft real-time systems are typically useful when concurrent access must keep a number of connected systems up to date with changing situations. To incorporate the soft real-time property into workflow scheduling, a publicly accessible port should have the typical attributes Has-Earliest-Start-Time, Has-Latest-Start-Time, Has-Earliest-End-Time, and Has-Latest-End-Time, as specified in (Rajpathak, 2006). Moreover, a temporal-dependent service initiated by a publicly accessible port subscribes to Allen's (Allen, 1983) representation of time and the relations between a private workflow fragment and a scientific workflow. These temporal attributes are key temporal constraints for task enactment and resource allocation in scientific collaboration.

Here, some more complex situations are investigated. Suppose that there is more than one publicly accessible port contained in a self-managing organization. With this scenario in mind, the following situations are distinguished.

1. If the ports belong to the same private workflow fragment Pri-WFi, and Pri-WFi is engaged in only one scientific collaboration with a scientific workflow, its local temporal scheduling among its silent actions, silent resources and publicly accessible ports aims only at providing the demanded service item in time. In this situation, there is no conflict in resource sharing or task enactment, and it is easy to schedule Pri-WFi in a self-managing way.
2. If the ports belong to the same private workflow fragment Pri-WFi, and Pri-WFi is engaged in more than one scientific workflow in a concurrent environment, its local temporal scheduling among its silent actions, silent resources and publicly accessible ports should be coordinated to satisfy the different service items demanded by the different scientific workflows. Pri-WFi is required to compromise among the service-producing processes if there is a conflict in resource sharing or task enactment.
3. If the ports belong to different private workflow fragments, and there is no shared silent action or silent resource for producing the service items among the private workflow fragments, the fragments can be scheduled independently of each other, according to the temporal transferring rule proposed in this paper (a simple independence test is sketched after this list).
4. If the ports belong to different private workflow fragments but there are some shared silent actions or silent resources among the fragments, the scheduling of the fragments should be promoted in an incorporated way. In the following section, an evaluation is presented to demonstrate this complex situation based on the method presented in this section.
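The distinction between situations 3 and 4 amounts to a simple independence test over shared silent actions and silent resources. The following sketch is only illustrative; the data structures are hypothetical and not an interface defined in this chapter.

```python
# Illustrative independence test: fragments that share no silent action or
# silent resource can be scheduled independently; otherwise they must be
# scheduled in an incorporated way.

def find_scheduling_conflicts(fragment_resources):
    """fragment_resources maps a fragment id to the set of silent actions and
    silent resources it uses; returns pairs that must be scheduled jointly."""
    conflicts = []
    ids = sorted(fragment_resources)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = fragment_resources[a] & fragment_resources[b]
            if shared:
                conflicts.append((a, b, shared))
    return conflicts

fragments = {
    "Pri-WF1": {"cluster-A", "dataset-X"},
    "Pri-WF2": {"cluster-B"},
    "Pri-WF3": {"cluster-A"},          # shares cluster-A with Pri-WF1
}
print(find_scheduling_conflicts(fragments))
# -> [('Pri-WF1', 'Pri-WF3', {'cluster-A'})], so these two fragments need
#    incorporated scheduling, while Pri-WF2 can be scheduled independently.
```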
Figure 12. Scientific workflow execution process based on context switching and role binding
EVALUATION: A SCIENTIFIC WORKFLOW ENGINEERING DEVELOPMENT WITH TIME-RELATED QOS EVALUATION

Fig.12 demonstrates a scientific workflow execution process based on a context-awareness technique in a simulated way. It is a typical scheduler-based paradigm, and provides the foundation for developing a context- and role-driven workflow engine through timely role binding. It is navigated through effective context switching at runtime. More specifically, a Member judges its application situation, apperceives application context information for its behavior goals, selects a suitable role according to certain application logics, and then enters into a concrete application context with the role specification. The taxonomy of context illustrated in Fig.12 navigates effective context-awareness and suitable role binding for promoting a scientific workflow execution. Please note that the application logic demonstrated in Fig.12 spreads in two directions with certain state transitions: one is initiated by a vertical navigating logic, and the other by a horizontal navigating logic. The vertical navigating logic is initiated by three stages: task performing, resource access, and collaborating with others. The horizontal navigating logic is composed of context-awareness, runtime context switching, and concrete task execution. With this navigating discipline, as demonstrated in Fig.12, a cross-domain scientific workflow system can be effectively promoted in a collaborative way. Accordingly, a service-driven scientific workflow system can be characterized as sequences of service invocations, based on a certain temporal logic, among distributed and heterogeneous stand-alone systems that provide autonomous services. The autonomous services can be treated as task-oriented processing. According to the temporal disciplines presented in this chapter, a certificate-driven scientific workflow application paradigm will be explored with time-related QoS evaluation. In practice, a scientific workflow is often deployed in a cross-organizational way, which is often enabled in the form of a virtual organization. In (Foster, 2001), virtual organizations were characterized by flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and organizations. In (Yuan, 2000), three generic types of accounts of virtual organizations were
summarized. The first one concerns organizations that extend some of their organizational activities externally, thus forming virtual alliances to achieve organizational objectives; integrating several companies' core competences and resources may form virtual organizations in a collaborative way. The second description of the virtual organization is related to a perceptual organization that is "abstract, unseeing and existing within the minds of those who form a particular organization", which is the antithesis of the physical organization with which we are familiar. The third type of description is of organizations that are established with information technology, such as corporations with an intensive use of telecommuting. Service execution times, level of control and flexible change control can be abstracted as key aspects of service-driven workflow specifications and enactment (Grefen, 1999), especially for processes crossing organizational boundaries (Grefen, 2001). For scientific workflow execution, servers supporting workflow applications are decentralized (duplicated) throughout the virtual organization and the distributed servers are controlled by a centralized authority (headquarters). This style of execution is characterized by some basic features as follows.
1. The lifetime of the cooperation is limited;
2. Organization-across collaboration;
3. Access to a wide range of specialized resources during collaboration;
4. Task- or goal-driven autonomous processes;
5. Role-based communication, etc.
Here, time constraints are often imposed on resource access or task enactment. At the stage of modeling an organization-across workflow, the concept of control flow is exploited to prescribe the service relation among organizations with a temporal dependency. During workflow execution, control flow is instantiated into logical switching according to a scheduled temporal logic among activities to satisfy a certain collaboration. The temporal logic of a cross-organizational collaboration often specifies the organization-across workflow execution for that collaboration. If these temporal specifications are initiated by a global workflow engine, private workflow fragments at the organization level can be automatically navigated with the specified temporal logic. This guarantees that these workflow fragments can be fired in time to satisfy the collaboration in the form of service invocation. Each workflow fragment at the organization level can concentrate on its internal execution in a self-governing way, taking little care of its temporal context switching. In a Grid application environment, this temporal logic is often realized in the form of a certificate mechanism. A service is often opened by granting certificates to creditable candidates for certain resource accesses. This certificate mechanism provides authorization in terms of what, when, who, and how, in which temporal discipline is a key factor for certificate granting and use. In view of these observations, a certificate-driven workflow execution scenario is explored for cross-organizational workflow collaboration and resource access. A certificate has a limited period of validity for service invocation in workflow execution. This certificate-driven workflow execution scenario can be specified as follows:

1. Server-level or proxy-level private workflow fragments delegate their certificate granting to the workflow engine;
2. Invocations of services and functions among private workflow fragments are awakened through certificates granted by the workflow engine;
3. The period of a certificate's validity reflects the lifetime of the cooperation, and guarantees the QoS in time;
4. Private workflow fragments are task- or goal-driven autonomous processes, in which the workflow engine plays as the nerve center of the workflow system.

Figure 13. A grid-oriented and certificate-driven workflow system in self-governing fashion
Essentially, this certificate-driven scientific workflow execution is initiated by a server-based workflow engine that controls global workflow execution according to the cross-organizational collaboration among private workflow fragments and their resource accesses. Fig.13 demonstrates the enactment of a prototype conformable to the service computing paradigm demonstrated by Fig.1, Fig.2, and Fig.3. The collaboration disciplines engaged in Figure 13 are depicted as follows.

Step 1: Proxy-based private workflow fragments hand over their routines of certificate release to a global workflow engine that acts as the certificate authority in later scientific workflow execution.

Step 2: According to the pre-defined global scientific workflow application logic, the global workflow engine initiates cross-organizational collaboration by granting a certificate to a candidate for a certain service invocation (resource access). Service invocation is enabled via certificate identification and authentication, and the validity of a certificate specifies the collaborative duration.

Step 3: After granting a certificate, a duplicated copy of the certificate is sent to the resource or service host for identifying and authenticating future logins or visits.

Step 4: According to the certificate and its security level, the certificate holder can get access to the needed resource or invoke some service across the borders of different security domains in order to achieve its local computing goals.

Step 5: If a task is not finished within the period of validity, the resource access is forbidden. In this situation,
the actor must apply for additional time and then repeat Step 2; otherwise, the task is finished in the scheduled time. Please note that this step is indispensable if there is an unexpected requirement during workflow execution across the borders of different security domains.

The invocation processes among these steps are certificate-driven, which is essentially initiated by QoS evaluation. The granting is driven by the temporal discipline discussed in the previous sections. The execution logic and the process logic are illustrated, respectively, by Fig.13.a and Fig.13.b. The period of validity is based on the time constraint model exploited in Section 3 (a minimal sketch of this certificate lifecycle is given after the list below). To achieve this objective, the global scientific workflow engine demonstrated in Fig.13 should contain some basic items related to service definitions, as below:
1. A resource pool indexing the available resources supporting workflow execution;
2. A directory-based resource location mechanism and a workflow-peer location mechanism;
3. A certificate authority (CA) for certificate granting;
4. Trigger mechanisms initiated by service invocation or ECA rules;
5. A delegation capability supporting dynamic process data transportation, agent application, and other proxy-based issues in access control.
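As a rough illustration of Steps 1 to 5 and of the certificate authority item above, the following Python sketch models a certificate whose validity period reflects the lifetime of the cooperation; all class and function names are hypothetical, and a real Grid deployment would rely on its actual CA and proxy infrastructure.

```python
# Hedged sketch of the certificate-driven collaboration; not the chapter's API.
import time
from dataclasses import dataclass

@dataclass
class Certificate:
    holder: str
    resource: str
    not_before: float
    not_after: float            # the validity period reflects the lifetime of cooperation

    def valid(self, now=None):
        now = time.time() if now is None else now
        return self.not_before <= now <= self.not_after

class WorkflowEngineCA:
    """Global workflow engine acting as the certificate authority (Steps 1-2)."""
    def grant(self, holder, resource, lifetime):
        now = time.time()
        # Step 3: a duplicate of the certificate would also be sent to the resource host.
        return Certificate(holder, resource, now, now + lifetime)

def access_resource(cert, resource):
    """Step 4: the holder uses the certificate; Step 5: expired certificates are refused."""
    if cert.resource != resource or not cert.valid():
        raise PermissionError("certificate invalid or expired; re-apply to the engine (Step 2)")
    return f"{cert.holder} accessing {resource}"

engine = WorkflowEngineCA()
cert = engine.grant("Pri-WF1", "storage-service", lifetime=3600)
print(access_resource(cert, "storage-service"))
```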
The scientific workflow engine is characterized by the key mechanism of service definition mentioned in Section 3. The temporal-dependent relation engaged in the cross-domain collaboration is navigated by the temporal-dependent rationale presented in Section 4. Please note that the global scientific workflow engine plays the part of a centrally managed security mechanism by taking over the security issue of certificate granting, while the issues and routines about authentication are carried out among private workflow fragments directly. Private workflow fragments keep their security control inside themselves, according to their service ability, storage space, security level, networking speed, and service demand, in a hierarchical way.
RELATED WORKS AND COMPARISON ANALYSIS

The scheduling issue is very important for enhancing the scalability, autonomy, quality and performance of scientific workflows (Ludäscher, 2005; Yan, 2007; Li, 2006; Rajpathak, 2006). In (Yu, J., & Buyya, R., 2005), three major categories of scientific workflow scheduling architecture are presented, i.e., centralized, hierarchical and decentralized scheduling schemes. In the centralized workflow enactment environment, one central scheduler makes scheduling decisions for all tasks engaged in future workflow execution. For hierarchical scheduling, there is a central manager and multiple lower-level sub-workflow schedulers; the central manager is responsible for controlling workflow execution and assigning sub-workflows to the lower-level schedulers. In contrast with the centralized and hierarchical schemes, in decentralized scheduling there are multiple schedulers without any central controller. A scheduler can communicate with others and dispatch a sub-workflow to other schedulers with lower load. The authors (Yu, J., & Buyya, R., 2005) believed that the centralized scheme can produce efficient schedules because it has all necessary information about all tasks engaged in workflow execution. However, it is not scalable with respect to the number of tasks and Grid resources, which are generally autonomous. The major advantage of using the hierarchical architecture is that different scheduling policies can be deployed in the central
manager and the lower-level schedulers. However, a failure of the central manager will result in failure of the entire system. Decentralized scheduling is more scalable but faces more challenges in generating optimal solutions for overall workflow performance. The method presented in this paper falls into the third scheme, i.e., the decentralized scheduling scheme. Compared with the related works, the main contributions of this paper are twofold. First, as a typical application environment, the Grid is an efficient infrastructure for scientific workflow development and execution. In a general Grid environment, the scheduling of resource allocation is an important issue for cross-organizational Grid service invocation based on certain privacy and security usage policies (Dumitrescu, 2005; Batista, 2008; Abramson, 2002; Elmroth, 2008; Li, 2006). Generally, less consideration is given to the task scheduling (i.e., private workflow fragment scheduling) enacted inside a self-managing organization for achieving a Grid service. In this paper, we incorporated both resources and tasks into private workflow fragment scheduling for satisfying a demanded service item with a certain QoS evaluation. This enhances the QoS of a cross-organizational service invocation for a scientific collaboration by keeping the temporal consistency between a scientific workflow and the private workflow fragments. Second, for a cross-organizational scientific collaboration, privacy and security issues are key factors that should be incorporated into the concrete scheduling application. In terms of technique, brokering strategies (Abramson, 2002; Elmroth, 2008) and view techniques (Chiu, 2004) have proved to be efficient approaches for dealing with this problem. In this paper, the collaboration scheduling is essentially promoted based on the workflow view technique, in which publicly accessible ports serve as interaction views opened for scientific workflow execution. Concretely, a scientific workflow imposes certain temporal constraints on the publicly accessible ports. The silent resources and silent tasks engaged in a private workflow fragment are scheduled based on these temporal constraints of the publicly accessible ports. This guarantees that the scheduling of a private workflow fragment is closely navigated by the scientific workflow scheduling. To the best of our knowledge, the workflow view technique has mainly been employed in cross-organizational business workflows for execution supervision. In this paper, we use this technique for the collaboration scheduling of a scientific workflow, which is a novel application of the workflow view technique.
CONCLUSION

In this paper, a collaborative scheduling approach with time-related QoS evaluation is presented based on a temporal model. The proposed approach aims at keeping the temporal consistency of a scientific collaboration among distributed private workflow fragments in resource sharing and task enactment. Through an evaluation, we also demonstrated the capability of our approach for promoting multiple scientific workflow executions in a concurrent environment. This collaborative scheduling approach could also be helpful for QoS-aware middleware development for cross-organizational scientific collaborations, which will be studied as a future research topic.
ACKNOWLEDGMENT This paper is partly supported by the National Science Foundation of China under Grant No.60673017, and part of this chapter is cited from our previous research work.
REFERENCES Abramson, D., Buyya, R., & Giddy, J. (2002). A computational economy for grid computing and its implementation in the nimrod-g resource broker. Future Generation Computer Systems, 18(8), 1061–1074. doi:10.1016/S0167-739X(02)00085-7 Allen, J. F. (1983). Maintaining knowledge about temporal internals. Communications of the ACM, 26(11), 832–834. doi:10.1145/182.358434 Batista, D. M., da Fonseca, N. L. S., Miyazawa, F. K., & Granelli, F. (2008). Self-adjustment of resource allocation for grid applications. Computer Networks: The International Journal of Computer and Telecommunications Networking, 52(8), 1762–1781. Bowers, S., McPhillips, T. M., & Ludäscher, B. (2008). Provenance in collection-oriented scientific workflows. Concurrency and Computation, 20(5), 519–529. doi:10.1002/cpe.1226 Cardoso, J., Miller, J., Sheth, A., & Arnold, J. (2004). Quality of service for workflow and web service processes. Web Semantics: Science . Services and Agents on the World Wide Web, 1(3), 281–308. doi:10.1016/j.websem.2004.03.001 Chiu, D. K. W., Cheung, S. C., Till, S., Karlapalem, K., Li, Q., & Kafeza, E. (2004). Workflow viewdriven cross-organizational interoperability in a web service environment. Information Technology and Management, 5(3/4), 221–250. doi:10.1023/B:ITEM.0000031580.57966.d4 Dumitrescu, C. L., Wilde, M., & Foster, I. (2005). A model for usage policy-based resource allocation in grids. In R. Dienstbier (Ed.), Proc. 6th IEEE Int’l Workshop Policies for Distributed Systems and Networks (pp. 191-200). Stockholm, Sweden: IEEE Computer Society Press. Elmroth, E., & Tordsson, J. (2008). Grid resource brokering algorithms enabling advance reservations and resource selection based on performance predictions. Future Generation Computer Systems, 24(6), 585–593. doi:10.1016/j.future.2007.06.001 Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the grid: Enabling scalable virtual organizations. The International Journal of Supercomputer Applications, 15(3), 200–222. Fox, G. C., & Gannon, D. (2006). Special issue: Workflow in grid systems. Concurrency and Computation, 18(10), 1009–1019. doi:10.1002/cpe.1019 Graham, S., Davis, D., Simeonov, S., Daniels, G., Brittenham, P., Nakamura, Y., et al. (Eds.). (2001). Building web services with Java™: Making sense of XML, SOAP, WSDL, and UDDI. New York: Sams Publishing.
Grefen, P., Aberer, K., Ludwig, H., & Hoffner, Y. (2001). CrossFlow: Cross-organizational workflow management for service outsourcing in dynamic virtual enterprises. A Quarterly Bulletin of the Computer Society of the IEEE Technical Committee on Data Engineering, 24(1), 52–57. Grefen, P., & Hoffner, Y. (1999). CrossFlow: Cross-organizational workflow support for virtual organizations. Research Issues on Data Engineering: Information Technology for Virtual Enterprises(RIDE-VE ‘99.)(pp. 90-91). Sydney, Australia: IEEE Computer Society Press. Guan, Z., Hernandez, F., Bangalore, P., Gray, J., Skjellum, A., Velusamy, V., & Liu, Y. (2006). GridFlow: A grid-enabled scientific workflow system with a Petri-Net-based interface. Concurrency and Computation, 18(10), 1115–1140. doi:10.1002/cpe.988 Li, C., & Li, L. (2006). A sistributed multiple simensional QoS constrained resource scheduling optimization policy in computational grid. Journal of Computer and System Sciences, 72(4), 706–726. doi:10.1016/j.jcss.2006.01.003 Li, J. Q., Fan, Y. S., & Zhou, M. C. (2003). Timing constraint workflow nets for workflow analysis. [Part A]. IEEE Transactions on Systems, Man, and Cybernetics, 33(2), 179–193. doi:10.1109/ TSMCA.2003.811771 Liu, D. T., Franklin, M. J., Abdulla, G. M., Garlick, J., & Miller, M. (2006). Data-preservation in scientific workflow middleware. In R. Dienstbier (Ed.), Proc. 18th Int’l Conf. Scientific and Statistical Database Management (SSDBM’06)(pp. 49-58). Vienna, Austria: IEEE Computer Society Press. Liu, J. W. S. (Ed.). (2002). Real-times Systems. New York: Pearson Education Press. Ludäscher, B., & Goble, C. (2005). Guest Editors’ Introduction to the Special Section on Scientific Workflows. SIGMOD Record, 34(3), 3–4. doi:10.1145/1084805.1084807 McPhillips, T. M., & Bowers, S. (2005). An Approach for Pipelining Nested Collections in Scientific Workflows. SIGMOD Record, 34(3), 12–17. doi:10.1145/1084805.1084809 Rajpathak, D., Motta, E., Zdrahal, Z., & Roy, R. (2006). A Generic Library of Problem Solving Methods for Scheduling Applications. IEEE Transactions on Knowledge and Data Engineering, 18(6), 815–828. doi:10.1109/TKDE.2006.85 Rygg, A., Roe, P., Wong, O., & Sumitomo, J. (2008). GPFlow: An Intuitive Environment for Web-Based Scientific Workflow. Concurrency and Computation, 20(4), 393–408. doi:10.1002/cpe.1216 van der Aalst, W. M. P. (1998). The Application of Petri Nets to Workflow Management. J. Circuits . Syst. Comput., 8(1), 21–66. van der Aalst, W. M. P., Hofstede, A.H.M.ter, & Barros, A. P. (2003). Workflow Patterns. Distributed and Parallel Databases, 14(1), 5–51. doi:10.1023/A:1022883727209 Wieczorek, M., Prodan, R., & Fahringer, T. (2005). Scheduling of Scientific Workflows in the ASKALON Grid Environment. SIGMOD Record, 34(3), 56–62. doi:10.1145/1084805.1084816
Yan, Y., & Chapman, B. (2007). Scientific Workflow Scheduling in Computational Grids – Planning, Reservation, and Data/Network-Awareness. In R. Dienstbier (Ed.), Proc. 8th IEEE/ACM Int’l Conf. Grid Computing (pp. 18-25). Austin, Texas: IEEE Computer Society Press. Yu, J., & Buyya, R. (2005). A Taxonomy of Scientific Workflow Systems for Grid Computing. SIGMOD Record, 34(3), 44–49. doi:10.1145/1084805.1084814 Yu, J., Buyya, R., & Tham, C. K. (2005). Cost-based Scheduling of Scientific Workflow Applications on Utility Grids. In R. Dienstbier (Ed.), Proc. 1st Int’l Conf. e-Science and Grid Computing (e-Science’05) (pp. 1-8). Melbourne, Australia: IEEE Computer Society Press. Yuan, P. S., Matthew, K. O., & Shao, Y. L. (2000). Virtual Organizations: The Key Dimensions. In R. Dienstbier (Ed.), Proceeding of Academia/Industry Working Conference on Research Challenges (AIWORC’00) (pp. 3-8). Buffalo, New York: IEEE Computer Society Press. Zeng, L., Benatallah, B., Ngu, A. H. H., Dumas, M., Kalagnanam, J., & Chang, H. (2004). QoS-Aware Middleware for Web Services Composition. IEEE Transactions on Software Engineering, 30(5), 311–327. doi:10.1109/TSE.2004.11 Zhao, Z., Booms, S., Belloum, A., de Laat, C., & Hertzberger, B. (2006). VLE-WFBus: A Scientific Workflow Bus for Multi e-Science Domains. In R. Dienstbier (Ed.), Proc. 2th IEEE Int. Conf. e-Science and Grid Computing(e-Science’06)(p. 11). Amsterdam, Netherlands: IEEE Computer Society Press.
KEY TERMS AND DEFINITIONS

Certificate Mechanism: A security policy adopted by Grid applications, in which cross-domain resource sharing is enabled by a certificate verification process among collaborators.
Grid: The next-generation infrastructure of the Internet and its web-based applications.
QoS: A set of evaluation parameters for evaluating the quality of a service.
Scheduling: Scheduling deals with the assignment of jobs and activities to resources and time ranges in accordance with relevant constraints and requirements.
Scientific Workflow: A novel workflow application style for e-Science activities.
Temporal Model: A model for specifying the temporal-dependent relation among collaborative activities.
Workflow Fragment: A local workflow execution situation.
Section 5
Service Computing
Chapter 19
Grid Transaction Management and Highly Reliable Grid Platform Feilong Tang Shanghai Jiao Tong University, China Minyi Guo Shanghai Jiao Tong University, China
ABSTRACT

As Grid technology expands from scientific computing to business applications, open grid platforms increasingly need the support of transaction services. This chapter proposes a grid transaction service (GridTS) and a GridTS-based transaction processing model, and defines two kinds of grid transactions: atomic grid transactions for short-lived reliable applications and long-lived grid transactions for business processes. The chapter also presents solutions for managing these two kinds of transactions to meet different consistency requirements. Moreover, this chapter investigates a mechanism for the automatic generation of compensating transactions in the execution of long-lived transactions through GridTS. Finally, it discusses future trends in reliable grid platform research.
INTRODUCTION

Grid computing is a natural evolution of distributed computing and Internet applications for large-scale science and engineering problems, aiming at effective resource sharing and task collaboration in distributed and self-governing environments. The main goal of grid computing is to share large-scale resources and accomplish collaborative tasks by enabling people to utilize computing and storage resources transparently. By providing service-oriented computing and data infrastructures, grid technology is becoming the preferred basis for large-scale distributed computing, and is expanding from scientific computing to business applications (Berman, 2003; Foster, 2002; Wang, 2004). Many key grid applications, especially business applications, require reliability guarantees from a highly reliable grid computing platform (Jiang, 2006). As an effective and widely-used means, transaction
technology can help people to make this vision a reality, providing application developers with multiple transparencies regarding location, replicas, concurrency and failure (Wang, 2008). As a result, transaction management is one of the most important core services of a reliable grid platform for mission-critical commercial grid applications (Yang, 2008). In grids, a transaction is a set of operations that execute on geographically distributed grid services. The transaction management service is responsible for ensuring the reliable execution of these distributed grid applications to keep the system consistent and free the applications from various failures. Ideally, it also shields users from the complex recovery process. Traditional distributed transactions, where application systems are tightly coupled, have the ACID properties, i.e., Atomicity, Consistency, Isolation and Durability. However, traditional distributed transaction models and Web service transaction specifications do not work in open grid environments because:

• Grid systems are loosely coupled and autonomous. For security and efficiency, autonomous grid services typically do not allow themselves to be locked by outside applications, while traditional atomic transaction models generally adopt a locking mechanism to guarantee atomicity.
• It is difficult, even impracticable, for application programmers to develop compensating transactions. Existing transaction models require application programmers to provide all compensating transactions. However, grid services that execute a business application are dynamically discovered, and autonomous service providers may set up special compensating rules based on their own business models. For example, some service providers allow users to cancel a ticket order without further action, while others may require users to pay some compensating fee.
• Grid systems are dynamic, i.e., grid services may exit the system dynamically during the execution of a business process. A grid transaction service has to hide this dynamism from users.
As a result, proposing and implementing a transaction service for grid computing is very important and significant work. Generally, a grid transaction service has to address the following issues:

• Coordination of short-lived activities to form an atomic transaction, such as transferring funds from one bank account to another.
• Coordination of long-lived transactional activities to fulfill a common agreement, for example, a journey arrangement that involves booking tickets, booking hotel rooms and hiring cars.
This chapter presents a grid transaction service (GridTS) and coordination algorithms to manage atomic and long-lived grid transactions, providing commercial applications with reliability support. Moreover, we propose a solution for the automatic generation of compensating transactions, which is a significant advantage over existing long-lived transaction models. The objective is to set up a reliable, transaction-service-based grid platform for grid applications with reliability requirements, enabling application programmers to use GridTS to implement transactional applications easily. The proposal has the following advantages over existing related research. Firstly, GridTS can automatically generate compensating transactions. Secondly, GridTS can hide the dynamicity of grids from users. Thirdly, GridTS reserves resources for the atomic transaction commit to adapt to the autonomous grid environment. Finally, it is extensible because it is built on top of a series of open standards, technologies and infrastructures.
BACKGROUND The demands for reliable grid platform and transaction management service result from practical grid applications. Since the beginning of this century, both academia and computer industry have been regarding the development of grids as another chance to improve the current paradigm of Internet computing. ShanghaiGrid is a good example for such grid projects. As an internationalization city with 18 million people, Shanghai presents emergent needs for an information infrastructure to enable sharing of heterogeneous resources to improve government efficiency and quick response to emergent events. For sharing heterogeneous resources of computing, storage and data, Shanghai municipality launched the ShanghaiGrid project in 2003. ShanghaiGrid is a grid infrastructure that aggregates several heterogeneous supercomputers, data centers and other applications scattered in different organizations in Shanghai for city government as well as enterprises and communities. The primary goal is to establish a metropolitan-area service grid for widespread upper-layer applications from both research communities and government departments, tailored for the characteristics of Shanghai. The ShanghaiGrid project is built upon four major computational aggregations and networks in Shanghai, i.e., CHINANET (public Internet backbone built by China Telecom), SHERNET (Shanghai Education and Research Network), STNC (Shanghai Science and Technology Network Communication), and campus networks. From the perspective of hardware infrastructure, ShanghaiGrid aggregates various distributed and heterogeneous resources, including computers, networks, storage devices and so on. From the perspective of software infrastructure, one of the research focuses is to develop the ShanghaiGrid system software (SHGSS). The hardware facilities interconnected in ShanghaiGrid include supercomputers, data storages, devices (e.g., sensors and traffic surveillants) and other resources. These facilities are distributed in the intra-grids of the participating organizations of this project. As Figure 1 illustrates, the ShanghaiGrid hardware infrastructure comprises more than five connected intra-grids, including Shanghai Supercomputing Center (SSC) intra-grid, Shanghai Jiao Tong University (SJTU) intra-grid, Shanghai University (SHU) intra-grid, Tongji University (TJU) intra-grid and Shanghai Urban Traffic Information Center (SUTIC) intra-grid. The ShanghaiGrid system software SHGSS was designed with three levels: (1) the low-level access and management software for encapsulating heterogeneous resources as grid services, (2) the grid middleware that provide top-level traffic applications and others with transparent access to grid services, no matter where grid services are located, whether failures occur during execution of tasks, and how these services are composed, etc(Fox,2001), and the top-level grid portals that enable users to access ShanghaiGrid in a way similar to access Web. At the first stage, SHGSS supported resource encapsulation and management, service scheduling and accounting, data aggregation and adaptive transmission as well as an intelligent traffic management. After this stage, transaction management service was required for more key grid applications. From then on, grid transaction service was investigated and implemented. The following is researches and reports related to our grid transaction service. 
Transaction management for distributed environments have widely been researched (Ammann,1997; Ancilotti, 1990; Thomasian,1997). Traditional transaction models. Distributed Transaction Processing (DTP) and Object Transaction Service (OTS) are widely used in the traditional distributed environment. DTP defines three kinds of roles, Application Program, Transaction Manager and Resource Manager, and two types of interfaces, TX and XA interfaces. These two models do not release the locked resources until the end of a global
Figure 1. ShanghaiGrid hardware infrastructure
transaction, and are thus not able to coordinate long-lived grid transactions. They are therefore generally not applicable to applications comprising loosely coupled, Web-based business services (Dalal, 2003). Web Services transaction specifications. WS-Coordination (WS-C) and WS-Transaction (WS-T) provide a set of transaction specifications for Web Services (Cabrera, 2002). WS-C describes a transaction framework comprising an Activation Service, a Registration Service and a Protocol Service, and can accommodate multiple coordination protocols. WS-T classifies transactions in the Web Services environment into atomic transactions and business activities, and defines the corresponding coordination protocols. The Business Transaction Protocol (BTP) is another important service-oriented transaction specification; it defines a conceptual model and a set of complex messages to be exchanged between a coordinator and participants, and specifies how Web services interact (Dalal, 2003). Compensating transactions. Existing long-lived transaction models were generally built on the concept of transaction compensation, first proposed by Gray (Gray, 1981). The typical implementation of compensating transactions is the Sagas model, which is widely adopted in many extended transaction models (Chrysanthis, 1992; Garcia-Molina, 1987; Liang, 1996). Sagas (Garcia-Molina, 1987) is a classical transaction model for handling long-lived transactions, based on transaction compensation. In Sagas, a transaction is called a "Saga", which consists of a set of sub-transactions with atomicity, consistency, isolation and durability properties such that T={T1, T2, …, Tn}, and a set of associated compensating transactions such that C={C1, C2, …, Cn}, where each sub-transaction Ti is associated with a compensating transaction Ci that can semantically undo the effect caused by the commit of Ti. Sub-transactions in Sagas independently commit and immediately release
resources accessed in the execution of the sub-transactions in order to reduce the duration of resource locking and improve system efficiency. In Sagas, all committed sub-transactions must be undone if a subsequent sub-transaction fails, which wastes a lot of valuable work already finished. ACTA (Chrysanthis, 1992) is a comprehensive transaction framework that permits a transaction modeler to specify the effects of extended transactions on each other and on objects in the database. ACTA allows one to specify interactions between transactions in terms of relationships and transactions' effects on objects' state and concurrency status. ACTA provides a reasoning ability more powerful and flexible than Sagas through a series of variations of the original Sagas. ConTracts (Wachter, 1992) is a mechanism for grouping transactions into a multi-transaction activity. It consists of a set of predefined actions called steps, and an explicitly specified execution plan called a script. In case of a failure, the ConTract state must be restored and its execution may continue. The above transaction models do not work well in grid environments because:

1. Traditional atomic transaction models in general lock resources to achieve consistent commit. Grid services, however, are not ready to be locked by outside grid applications.
2. Existing long-lived transaction models require application programmers to provide compensating transactions for all the sub-transactions. The Business Transaction Protocol (BTP) and Web Services Transaction (WS-Transaction) also mention using compensation for the coordination of long-running activities, but they do not solve how to provide compensating transactions. Owing to the autonomy of grids, service providers may set up special compensating rules according to their own business models; for example, different providers require users to pay different charges for the cancellation of ticket orders, while the grid services that actually execute sub-transactions are dynamically discovered just before the beginning of a grid transaction. Application programmers do not know the special compensating policies of services discovered dynamically, and therefore are not able to provide corresponding compensating transactions in advance.
3. Grid services may dynamically join and leave the grid system before a global grid transaction completes.

The following will present how to extend related work for grid environments.
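For reference, the compensation behaviour of the classical Sagas model discussed above can be sketched as follows. The function names are illustrative only, and this is not the GridTS mechanism, which, as described later, generates compensating transactions automatically.

```python
# Minimal sketch of the Sagas pattern: sub-transactions commit independently,
# and previously committed work is undone by compensating transactions when a
# later step fails.

def run_saga(steps):
    """steps is a list of (do, compensate) callables; on failure, compensate
    the already-committed steps in reverse order."""
    done = []
    try:
        for do, compensate in steps:
            do()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        raise

def book_ticket():   print("ticket booked")
def cancel_ticket(): print("ticket cancelled (compensation)")
def book_hotel():    raise RuntimeError("no rooms available")
def cancel_hotel():  print("hotel cancelled (compensation)")

try:
    run_saga([(book_ticket, cancel_ticket), (book_hotel, cancel_hotel)])
except RuntimeError as e:
    print("saga aborted:", e)
```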
MAIN FOCUS OF THE CHAPTER

Grid Transaction Service and Highly Reliable Grid Platform

A grid transaction is a set of operations that execute on different grid services (Tang, 2004). The transaction service GridTS is a special service responsible for the coordination of these services to keep the system consistent, and it shields users from the complex recovery process.
Layered Architecture

The architecture of the highly reliable grid platform consists of three layers (see Figure 2). The middle layer, GridTS, consists of the following main components.
Figure 2. Architecture of reliable grid platform
Coordinator and Participant. They cooperatively coordinate a transaction for an application and grid services respectively. The Coordinator and Participant themselves do not execute actual application operations. Scheduler. This component takes charge of (1) creating a Coordinator and a coordination context (CC) on the application side and Participants on the service side, and (2) scheduling the Service Discovery module. Compensating Transaction Generator (CTG). If a predefined event occurs, the component first queries the corresponding compensating rule(s), and then dynamically generates a compensating operation. Also, it encapsulates the generated compensating operations into a compensating transaction when the sub-transaction commits. Log Service. This component records the coordination operations and the state information for recovery of transactions from failures. Service Discovery. This component dynamically discovers the qualified grid services according to users’ requirements, such as cost, quality and availability, to complete specified sub-transactions. Interfaces. GridTS provides grid applications with two types of APIs: the extended TX interfaces for transaction management, and the service-specific interfaces for management of the GridTS service instances and discovery of grid services to execute application operations in sub-transactions.
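As an illustration of how the Compensating Transaction Generator might turn provider-specific compensating rules into a compensating transaction, consider the following hedged Python sketch; the rule table and names are hypothetical rather than part of the GridTS specification.

```python
# Hedged sketch of a compensating transaction generator; illustrative only.

class CompensatingTransactionGenerator:
    def __init__(self, rules):
        # rules: maps a committed operation name to a provider-specific rule,
        # e.g. a cancellation fee published by the autonomous service provider.
        self.rules = rules

    def generate(self, committed_ops):
        """Called when a sub-transaction commits: build the compensating
        transaction (a list of undo operations) from the provider's rules."""
        compensating = []
        for op in reversed(committed_ops):          # undo in reverse order
            rule = self.rules.get(op["name"], {"action": "cancel", "fee": 0})
            compensating.append({"undo": op["name"], **rule})
        return compensating

ctg = CompensatingTransactionGenerator(
    {"book_ticket": {"action": "cancel_order", "fee": 20}}
)
print(ctg.generate([{"name": "book_ticket"}]))
# -> [{'undo': 'book_ticket', 'action': 'cancel_order', 'fee': 20}]
```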
HOW TO USE GRIDTS

GridTS is a special grid service, so it possesses all the properties of a grid service. The interfaces of GridTS are encapsulated in the TX portType of grid services by defining each interface and its corresponding input and output parameters as an operation with input and output messages. The interface definition is exemplified in Figure 3.
Figure 3. The definition of interface begin of GridTS
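Figure 3 gives the WSDL-style definition of the begin interface. As a language-neutral, purely illustrative restatement, the extended TX interface could be sketched as the abstract class below; the method set is inferred from the coordination messages used later in the chapter and is not the literal GridTS API.

```python
# Hypothetical, hedged sketch of an extended TX-style interface; not GridTS's
# actual portType definition.
from abc import ABC, abstractmethod

class GridTSTxInterface(ABC):
    @abstractmethod
    def begin(self, tx_kind: str, expiration: float) -> str:
        """Create a transaction of kind 'AT' or 'LGT'; returns a transaction id."""

    @abstractmethod
    def commit(self, tx_id: str) -> None: ...

    @abstractmethod
    def rollback(self, tx_id: str) -> None: ...

    @abstractmethod
    def enroll(self, tx_id: str, participant_ref: str) -> None:
        """Register a discovered grid service as a participant."""
```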
GridTS ensures reliability for grid applications in the following ways.

1. Public transaction service. GridTS is published in the public registration center. Transactional applications discover and invoke GridTS. The advantage of this way is that it is flexible and convenient, which means that users may share reliability support without installing GridTS.
2. Private transaction service. GridTS is located on the application-side and service-side nodes. The strength of this method is efficiency; the weakness is less flexibility.
Grid Transaction Processing Framework and Flow

The framework of our grid transaction processing, shown in Figure 4, is built on GridTS, with the following three actors:

1. Initiator. A transactional application is the initiator of a global grid transaction. It initiates the transaction through GridTS, then requests the remote grid services involved in the transaction to execute application operations.
2. GridTS. This is the grid transaction manager. The Coordinator and Participants in GridTS perform the coordination algorithm according to the transaction kind. They interchange coordination messages to guarantee the consistency of the global transaction. The Coordinator is the controller of the global transaction.
3. (Remote) Grid services. These grid services actually perform the application operations in the transaction.
Based on the above framework, a typical transaction processing flow proceeds as follows. First, GridTS initiates a global transaction on behalf of an application, and discovers and selects the required grid services to serve as participants. Then, its Scheduler broadcasts the CC messages to all selected participants. Finally, the created Coordinator and Participants interact to control the transaction execution, including correct completion and failure recovery. The following sections will discuss the details of transaction coordination.
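The following minimal Python sketch walks through this flow in a simplified form: beginning the transaction, discovering participants, broadcasting the coordination context (CC), and exchanging prepare/commit messages. All names are hypothetical stand-ins for the GridTS components.

```python
# Purely illustrative walk-through of the GridTS-based processing flow.
import uuid

class Participant:
    def __init__(self, service): self.service = service
    def receive_cc(self, cc):    self.cc = cc            # joins the transaction
    def prepare(self):           return "Prepared"       # reserve resources
    def commit(self):            return "Committed"      # allocate and execute

class Coordinator:
    def __init__(self, participants): self.participants = participants
    def run(self, cc):
        for p in self.participants:                       # broadcast CC
            p.receive_cc(cc)
        if all(p.prepare() == "Prepared" for p in self.participants):
            return [p.commit() for p in self.participants]
        return "Abort"

def discover_services(requirement):
    # Stand-in for the Service Discovery component.
    return [Participant("bank-A"), Participant("bank-B")]

cc = {"tx_id": str(uuid.uuid4()), "kind": "AT", "coordinator": "client-node"}
coordinator = Coordinator(discover_services("fund transfer"))
print(coordinator.run(cc))    # -> ['Committed', 'Committed']
```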
Figure 4. GridTS-based transaction processing framework
TRANSACTION COORDINATION

GridTS coordinates short-lived and long-lived transactions by executing the corresponding coordination algorithms based on the incoming transaction kind.
Transaction Coordination for Atomic Grid Transactions

An atomic transaction lasts a short time, typically a few seconds. The following is the formal definition.
Figure 5. An example of the atomic grid transaction
Figure 6. An example of a transaction process: the dashed line refers to an alternative candidate, which occurs only in the long-lived transactions discussed below
Definition 1. An atomic transaction (AT) is a 4-tuple {T, D, S, R}, where T={T1, T2, . . ., Tn} is the set of sub-transactions, in which each Ti (1 ≤ i ≤ n) can itself be a set of lower-level atomic transactions Tij (1 ≤ j ≤ m); D is the set of data carried by the transaction; S is the set of states; and R is the set of dependency relationships between (sub)transactions. In an atomic transaction, all participants have to commit or abort synchronously, and the intermediate results of an atomic transaction are invisible (for read or write) to other concurrent transactions. As shown in Figure 5, an atomic transaction that transfers 1000$ from an account in bank service A to another in bank service B is such that T={T1, T2}. The sub-transactions T1 and T2 have to be executed atomically.
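Definition 1 and the bank-transfer example can be encoded directly; the following dataclass is only an illustrative representation of the 4-tuple, with field names chosen for readability rather than mandated by the chapter.

```python
# Illustrative encoding of the AT 4-tuple for the bank-transfer example.
from dataclasses import dataclass, field

@dataclass
class AtomicTransaction:
    T: list                                   # sub-transactions (each may nest further ATs)
    D: dict = field(default_factory=dict)     # data carried by the transaction
    S: str = "initial"                        # current state
    R: list = field(default_factory=list)     # dependencies between (sub)transactions

transfer = AtomicTransaction(
    T=["T1: debit 1000$ at bank service A", "T2: credit 1000$ at bank service B"],
    D={"amount": 1000},
    R=[("T1", "T2", "commit together or not at all")],
)
print(transfer)
```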
1. Coordination mechanism: Coordination of atomic grid transactions includes the following phases.
   1. Initiation of an atomic transaction. For remote grid services to join an atomic transaction, the client-side GridTS sends a CoordinationContext (CC) message to them and creates a Coordinator, which lives until the end of the transaction. The CC message includes the information necessary to create a transaction, including the transaction type AT, the transaction identifier, the coordinator address and the expiration. Participants return Response messages to the Coordinator, indicating that they agree to join the transaction.
   2. Preparation for the transaction commit. The Coordinator sends Prepare messages to the participants Pi (i=1, 2, . . ., N), where N is the number of sub-transactions. Each Pi reserves the necessary resources and returns a Prepared or NotPrepared message, depending on whether the reservation is successful or not.
   3. Transaction commit. Within T1, if the Coordinator receives N Prepared messages, it sends Commit to all Pi and records the commit in its log. Otherwise, it sends Abort to them, making them cancel the previous reservation. On receiving the Commit message, each Participant: (a) requests allocation of the reserved resources, (b) records the transaction in its log in order to recover later from possible failures, and (c) monitors the execution of the corresponding task and reports the result to the Coordinator.
Figure 7. Coordination algorithms of atomic transactions (CAAT)
   Within T2, if the Coordinator receives N Committed messages, it judges that the transaction is correctly completed. Otherwise, it reports the failure information to the user and sends Rollback messages to all Pi, making them recover to their previous states. In the execution of a transaction, if any Pi itself contains sub-transactions, it applies the above mechanism recursively. In this case, the nested transactions form a tree structure, and Pi is not only a participant but also serves as the sub-coordinator for its children Pij (see Figure 6).
2. Coordination Algorithm: The coordination algorithm of atomic transactions includes two parts, ActionOfParent and ActionOfChild, executed by the Coordinator and the Participant respectively, as illustrated in Figure 7, where t is the waiting time of the Coordinator and the Participants, CC is the transaction context, and Tc means the timeout that a Participant waits for the Commit message.
3. Nested Atomic Grid Transactions: In the above coordination algorithm CAAT, if a participant nests lower-level sub-transactions, it uses the above mechanism recursively so as to form a transaction tree, where an internal node acts not only as a participant of its parent but also as a sub-coordinator of its children; that is, these nodes interact with their parents in the participant algorithm and with their children in the coordinator algorithm. The root node represents a global transaction and always executes the coordinator algorithm, while leaf nodes actually perform application operations and always execute the participant algorithm. Formation of nested sub-transactions. Let P(i,j) be a Participant associated with the ith level and jth sub-transaction T(i,j) (j=1, 2, . . ., n(i)), where n(i) is the number of sub-transactions in the ith level. We demonstrate how T(i,j) creates its child transactions:
Grid Transaction Management and Highly Reliable Grid Platform
Figure 8. A long-lived transaction

1. T(i,j) calls the interface Begin to initiate its child transactions and creates a SubCoordinator(i,j).
2. T(i,j) creates the sub-transaction context CC(i+1,j), whose PortReference is the address of SubCoordinator(i,j), taking the current context CC(i,j) as the input parameter.
3. T(i,j) propagates CC(i+1,j) to all its child transactions T(i+1,j') (j'=1, 2, . . ., n(i+1)).
4. Each child transaction T(i+1,j') creates a Sub-Participant(i+1,j') and registers with SubCoordinator(i,j) in a Response message. From then on, if T(i+1,j') still nests child transactions, set i=i+1 and j=j', and the process repeats from step 1 to step 4 until no child transactions are nested any longer.
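A hedged sketch of steps 2 to 4 above shows how a participant derives the lower-level coordination context CC(i+1,j) from CC(i,j) when it becomes a sub-coordinator; the dictionary layout is illustrative only.

```python
# Illustrative derivation of a nested coordination context; not GridTS's
# actual message format.

def derive_child_context(cc, sub_coordinator_address):
    """Step 2: build CC(i+1, j) from CC(i, j), pointing at the new sub-coordinator."""
    return {
        "tx_id": cc["tx_id"],
        "level": cc["level"] + 1,
        "port_reference": sub_coordinator_address,
        "parent": cc["port_reference"],
    }

root_cc = {"tx_id": "tx-42", "level": 0,
           "port_reference": "coordinator@client", "parent": None}
child_cc = derive_child_context(root_cc, "sub-coordinator@P(1,1)")
print(child_cc)
# Each child registers with the sub-coordinator (step 4); if it nests further
# sub-transactions, the same derivation is applied recursively.
```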
Long-Lived Grid Transaction

A long-lived transaction is one that lasts for a long period, such as a few hours or even a few days, because of user interaction and processing delays. Long-lived transactions are generally associated with business processes, so they should have the following features.

• It is convenient for a user to select committed sub-transactions in order to adapt to the business model. In view of cost or other factors, a user often requests multiple services for the same subtask in a business process, then selects the "best" result; for example, a traveler who hopes both to buy the cheapest airline ticket and to keep his traveling plan uninterrupted requests multiple booking services for only one airline ticket.
• Resources accessed by sub-transactions are released as early as possible to improve system efficiency.
• The transaction service must provide the ability to shield the dynamicity of grids and make the global transaction proceed even if services involved in the transaction dynamically leave the grid system before the global transaction completes.
Definition 2. A long-lived grid transaction (LGT) is a 5-tuple (T, D, S, R, OP), where T = {Ti | 1 ≤ i ≤ n, n is the number of sub-transactions involved in T}, D is the set of data operated on by T, S is a set of states, R is a set of dependency relationships between (sub)transactions, and OP = {AP-OP, TM-OP} is a set of operations. TM-OP ⊆ {Begin, Enroll, Confirm, Cancel} is a set of coordination messages and AP-OP is a set of application operations.
For long-lived transactions, transaction compensation is an appropriate method to release the grid resources held by sub-transactions as early as possible. For example, in the traveling arrangement shown in Figure 8, a traveler orders airplane tickets from ticket services A and B, and reserves a room from service C. If the hotel reservation fails, the two ticket orders have to be cancelled. On the other hand, after the tickets are successfully booked, the traveler may select the "best" (e.g., the cheapest) one and cancel the other.
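The travel scenario can be sketched in a few lines to make the confirm/cancel decisions concrete. Everything below (the candidate list, the pick-cheapest rule, the return codes) is an illustrative assumption, not part of the GridTS interfaces.

```python
def settle_bookings(ticket_results, hotel_ok):
    """ticket_results: list of (service, price) for committed ticket sub-transactions."""
    decisions = {}
    if not hotel_ok:
        # Hotel reservation failed: compensate (cancel) every ticket order.
        for service, _ in ticket_results:
            decisions[service] = "Cancel"
        return decisions
    # Hotel succeeded: confirm the cheapest ticket, cancel the rest.
    best = min(ticket_results, key=lambda r: r[1])[0]
    for service, _ in ticket_results:
        decisions[service] = "Confirm" if service == best else "Cancel"
    return decisions

# Example: tickets from services A and B, hotel C reserved successfully.
print(settle_bookings([("A", 320.0), ("B", 290.0)], hotel_ok=True))
# {'A': 'Cancel', 'B': 'Confirm'}
```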
1. Coordination Mechanism: As the name suggests, a long-lived grid transaction (LGT) takes a relatively long time to finish, even without interference from other concurrent transactions, so the LGT has to relax the atomicity and isolation properties. More specifically, a LGT allows some candidates to abort while others commit. Application operations on grid services exhibit a loose unit of work in which results are shared prior to completion of the global transaction; i.e., sub-transactions in a LGT commit independently and then immediately release the held resources before the global transaction finishes. Compared with AT, a LGT has the following new characteristics:
◦ Grid services that participate in a transaction independently commit sub-transactions after receiving the pre-commit message, then immediately release the held resources.
◦ If some participants fail to participate in or commit a sub-transaction, the global transaction can proceed by initiating new requests to locate substitutes.
◦ Users can confirm or cancel committed sub-transactions according to their interests using compensating transactions. Note that a coordinator confirms or cancels each participant only once.
◦ Grid services that participate in a LGT may leave the grid before the global transaction completes. In that case, these services notify the coordinator before leaving the grid.
Long-lived grid transaction processing consists of the following three phases:
◦ An application initiates a LGT transaction. This is similar to the corresponding phase in AT except that the transaction type is LGT.
◦ Candidates commit independently. The Coordinator sends Enroll messages to all candidates. The latter reserve and allocate the necessary resources, record their operations in a log, and then directly commit the transaction. If successful, each candidate generates the corresponding compensating transaction and returns a Committed message, which contains the execution results, to the Coordinator. Otherwise, it automatically rolls back the operations taken previously and returns an Aborted message. From then on, it is removed from the transaction.
◦ The user confirms successful candidates. According to the returned results, the user may take one of the following actions: (1) for candidates that committed successfully, he confirms some and cancels the others by sending Confirm and Cancel messages to them respectively,
within Tvalid; and (2) for failed candidates, he need not reply to them and may send new CoordinationContext messages to locate new candidates. As a result, if a candidate receives a Confirm message within Tvalid, it responds with a Confirmed message; otherwise, it executes a compensating transaction to recover the system to the previous state.
Figure 9. Coordination algorithms of long-lived grid transactions (CALGT)
2. Coordination Algorithm: The coordination algorithm of LGT transactions (CALGT) also consists of two parts, the coordinator algorithm ActionOfCoordinator and the participant algorithm ActionOfParticipant, as shown in Figure 9, where t is the system time, CC is a coordination context, and Tvalid is the valid time within which a coordinator must send a confirmation or cancellation decision and participants must report their commit states. If a coordinator does not confirm or cancel a sub-transaction within Tvalid, the corresponding participant automatically undoes the committed sub-transaction by executing its compensating transaction. On the other hand, a coordinator presumes that a participant has failed if the participant does not return the commit result before Tvalid. The algorithm allows users to confirm or cancel committed sub-transactions according to their own requirements. In the above long-lived coordination algorithm, if a sub-transaction Ti comprises lower-level sub-transactions Tij, it calls the coordinator algorithm in the nested way described in the above section.
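The role of Tvalid on the participant side can be illustrated with a small sketch. The helper callables (poll_decision, run_compensating_transaction) are placeholders assumed for the example; the message names follow the text above.

```python
import time

def participant_wait_for_decision(committed_at, t_valid, poll_decision,
                                  run_compensating_transaction):
    """After committing independently, wait for Confirm/Cancel until Tvalid expires."""
    while time.time() - committed_at < t_valid:
        decision = poll_decision()            # returns "Confirm", "Cancel", or None
        if decision == "Confirm":
            return "Confirmed"                # reply with a Confirmed message
        if decision == "Cancel":
            run_compensating_transaction()
            return "Cancelled"
        time.sleep(1.0)
    # No decision arrived within Tvalid: undo the committed sub-transaction.
    run_compensating_transaction()
    return "Compensated"
```

Symmetrically, the coordinator would presume a participant failed if no commit result arrives before Tvalid, and issue a new request to a substitute service.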
Figure 10. A LGT transaction and its compensating transactions
3. Shielding the Dynamicity of Grid Services: Services that participate in a LGT may exit from the grid before the global transaction completes. GridTS shields the dynamicity of the grid in the following ways.
1. Keep on handling the global transaction. If a grid service leaves the grid before the global transaction completes, it notifies the coordinator of this decision by means of the grid notification mechanism. To support subscription of notification messages, the Coordinator implements the NotificationSink interface to receive notification messages, and the Participant implements the NotificationSubscription and NotificationSource interfaces to manage subscriptions and send notification messages. Moreover, the CC message provides enough information to instruct remote services how to notify the coordinator of their decision before they leave the grid system. The subscription request within the CC consists of:
▪ the kind and content of notification messages,
▪ the address to call NotificationSink, i.e., the network address of the coordinator, and
▪ the initial lifetime of the subscription, which is equal to Tvalid.
When a service leaves the grid, the associated GridTS notifies the coordinator through the NotificationSink. The coordinator then resends the CC message to a new grid service to perform that sub-transaction, where the valid expiration of the new service is still Tvalid.
2. Undo effects on grid services. Assume that a service does not leave the grid while a sub-transaction is committing. A service may then leave the grid in the following two situations, and GridTS takes the corresponding actions.
1. A grid service leaves the grid before the commit of a sub-transaction. In this case, the service simply leaves the grid without performing compensating actions.
2. A grid service leaves the grid after the commit of a sub-transaction, when its coordinator has received the Enrolled message. In this case, the coordinator sends Cancel messages to notify participants to execute compensating transactions.
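A rough sketch of the subscription data carried in the CC message and of the coordinator's reaction when a service announces that it is leaving. The dictionary layout and the locate_substitute/send helpers are assumptions made for illustration.

```python
def build_subscription(coordinator_address, t_valid):
    # Subscription request embedded in the CC message.
    return {
        "message_kind": "ServiceLeaving",          # kind and content of notifications
        "notification_sink": coordinator_address,  # where to call NotificationSink
        "lifetime": t_valid,                       # initial lifetime of the subscription
    }

def on_service_leaving(sub_transaction, cc_message, locate_substitute, send):
    # Invoked through NotificationSink when a participant leaves the grid.
    new_service = locate_substitute(sub_transaction)   # find a healthy replacement
    send(new_service, cc_message)                      # resend CC; Tvalid stays unchanged
```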
AUTOMATIC GENERATION OF COMPENSATING TRANSACTIONS
Existing long-lived transaction models are generally built on compensating transactions, both in traditional distributed systems and in Web Services environments. The compensating transaction was first implemented in Sagas, which requires application programmers to provide compensating transactions before a transaction executes. This section focuses on how to automate the generation of compensating transactions for grid environments, freeing application programmers from complex compensation details. We define common compensating rules for data modification operations and transaction coordination messages, while allowing service providers to add and modify their own rules.
Figure 11. Generation of compensating transactions
In the execution of a LGT, GridTS dynamically generates and stores a compensating transaction for each sub-transaction based on the compensating rules, as shown in Figure 10. On receiving a Confirm message, GridTS deletes the generated compensating transaction from the database because the result(s) of the sub-transaction will no longer change.
Key Technologies for Automatic Generation of Compensating Transactions
Compensating actions are closely related to system states. The states describe the current properties and possible further actions of a transaction system. For example, we can describe the state of an airline booking system as S = {reservation, available}, where reservation is the number of available tickets and available indicates whether the system can accept new reservations. If reservation is greater than 0, available is true; otherwise, available is false.
States are changed by operations. However, not all operations affect system states. For example, the update(d1,d2) operation changes the data value from d1 to d2 and the Enroll message transfers the Participant state from Active to Committing, but reading a data value does not affect the system state.
Definition 3. A compensating transaction (CT) is a transaction that rolls back the operations taken by a committed transaction T and semantically undoes the effects of the commit of T, the original transaction of the compensating transaction.
A compensating transaction mainly involves two aspects: one is to undo the effects of the original operations, and the other is to recover system consistency. The key technologies for automatically generating compensating transactions include:
• definition of compensating rules,
• generation of compensating operations during the execution of a long-lived transaction, and
• generation of a compensating transaction at the commit of a sub-transaction.
Set Compensating Rules
Generation of a compensating transaction is event-driven. Compensating rules indicate how to undo the effects of events that change system states. We divide these events into three types: data modification events, transaction coordination events and service self-definition events (see Figure 11). Compensating rules for the first two types of events are provided by GridTS, while rules for service self-definition events are set by service providers through the following interfaces:
• setCompensatingRule(): sets compensating rules for grid services.
• getCompensatingRule(): gets compensating rules of grid services.
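The two interfaces can be pictured as a small rule registry. The sketch below is only an illustration of that idea; the dictionary keys and the built-in default rules are assumptions, not the GridTS rule set.

```python
class CompensatingRuleRegistry:
    def __init__(self):
        # Rules for data-modification and coordination events are provided by GridTS;
        # service providers add rules for their own self-definition events.
        self._rules = {"update": "reverse_update", "insert": "delete",
                       "delete": "insert", "Enroll": "wrap_in_transaction",
                       "Confirm": "drop_compensation", "Cancel": "invoke_compensation"}

    def setCompensatingRule(self, event_type, rule):
        self._rules[event_type] = rule

    def getCompensatingRule(self, event_type):
        return self._rules.get(event_type)

registry = CompensatingRuleRegistry()
registry.setCompensatingRule("order_shipped", "send_apology_email")   # provider-defined rule
print(registry.getCompensatingRule("delete"))   # -> insert
```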
1. Data modification event: Currently, most companies store their information in relational databases. Data modification operations in a LGT mainly consist of the insertion, deletion and replacement of records in databases.
Definition 4. A data modification event refers to an insertion, deletion or modification in a database. Let eTi[p(d)] be a data modification event in which transaction Ti modifies data d using operation p, where p ∈ OT belongs to one of the operation types. Furthermore, DETi is the set of data modification events caused by Ti and eTi[p(d)] ∈ DETi. For a relational database, OT = {update, insert, delete}.
We mainly analyze how to compensate these three data modification operations. Let Si and Si+1 be the states before and after Ti commits respectively, CTi a compensating transaction of Ti, and Tj (j≠i) a dependent transaction that executes concurrently between Ti and CTi. If the data accessed by Ti is not modified by Tj, CTi simply executes a reversed action for each operation in Ti. Otherwise, CTi undoes the committed transaction Ti but may not change the results of the dependent transaction Tj. For example, the cancellation of Alice's airline ticket reservation cannot affect Bob's reservation. Compensating rules for the update, insert and delete operations are set as follows.
Update: Let opi = update(d1,d2) be an operation in Ti that replaces d1 with d2. How to compensate opi depends on the data modification operation opj in Tj. An insert operation opj = insert(d) in Tj does not affect the result of opi; the compensating operation for opi is copi = update(d2,d1). A delete operation opj = delete(d2) in Tj will delete the result of opi; as a result, it is not necessary to compensate opi. An update operation opj = update(d2,d3) in Tj will change the result of opi, and the compensating operation of opi then depends on the type of the replacement operation.
Relevant replacement: Si+1 = f(Si, Ti), meaning that the state Si+1 is relevant to the state Si. copi has to remove the effect of opi. For example, if opi = update(d1,d2) and d2 = d1+n, the corresponding compensating operation is copi = update(d3,d4), where d4 = d3−n.
Irrelevant replacement: Si+1 = f(Ti), where Si+1 is irrelevant to Si, e.g., opi = update("Monday", "Tuesday") and opj = update("Tuesday", "Wednesday"). Such a replacement need not be compensated.
Insert: Let operation opi = insert(d1) in Ti insert a record with value d1. copi again depends on the data modification operation opj in Tj. opj = insert(d2) does not affect the result of opi, so the compensating rule for opi is copi = delete(d1). opj = delete(d1) will delete the result of opi; however, opi need not be compensated in order to keep the result of opj. opj = update(d1,d2) is compensated as follows:
    If (d2 ≠ d1)
        If (relevant replacement) {
            temp = change caused by opj = update(d1, d2);
            insert(temp);
            delete(d1);
        }
        else do nothing;
Delete: A delete operation opi = delete(d) in Ti deletes a record with the value d. No operation in Tj can affect the result of opi, so the compensating operation for opi is simply the reversed operation copi = insert(d).
2. Transaction Coordination Event: Definition 5. A transaction coordination event denotes that a sub-transaction receives messages from a coordinator. The set of transaction coordination events involved in a sub-transaction Ti is TETi ⊂ {CC, Enroll, Confirm, Cancel}.
Each transaction coordination event changes the state of the transaction system. GridTS sets the compensating rules for transaction coordination events in the following way.
◦ For a CC message, it records the original transaction identifier and input parameters.
◦ For an Enroll message, it encapsulates the compensating operations in the delimiters Begin and Commit, and stores the compensating transaction CTi in a database.
◦ For a Cancel message, it invokes CTi stored in the database.
◦ For a Confirm message, it deletes CTi from the database, because CTi is useless after the sub-transaction is confirmed.
3. Service self-definition event: Definition 6. A service self-definition event refers to the actions that a service provider takes according to the states of a business process; it depends on the particular business model of the service provider. The compensating rules for service self-definition events are defined by service providers. They typically focus on:
◦ Subsequent activities after undoing the operations of an original transaction, e.g., sending an email to notify the user of newly available services.
◦ Economic compensation. For example, if a user cancels a committed sub-transaction which has already completed a transportation order, the transportation company typically demands compensation from the user.
Generate Compensating Operations
In the execution of an LGT, the Compensating Transaction Generator (CTG) of GridTS monitors events, such as a delete operation or an Enroll message. Once a predefined event occurs, the CTG examines whether the conditions for a rule are satisfied. If so, it extracts the type and parameters of the operation, queries the corresponding compensating rule for the operation, generates a compensating operation, and records the input parameters. For example, when a sub-transaction deletes a record from the database, the Delete event will generate a compensating operation that re-inserts the record.
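A minimal sketch of how the CTG might turn an observed data-modification event into a compensating operation, following the update/insert/delete rules above. The tuple-based operation format is an assumption for the example, and the dependent-transaction refinements (relevant vs. irrelevant replacement) are deliberately omitted.

```python
def compensating_operation(op):
    """op is ('update', d1, d2), ('insert', d), or ('delete', d)."""
    kind = op[0]
    if kind == "update":
        _, d1, d2 = op
        return ("update", d2, d1)    # replace d2 back with d1
    if kind == "insert":
        return ("delete", op[1])     # remove the inserted record
    if kind == "delete":
        return ("insert", op[1])     # re-insert the deleted record
    return None                      # no rule found: the operation is noncompensable

# Example: the delete event for record "r42" yields an insert of the same record.
assert compensating_operation(("delete", "r42")) == ("insert", "r42")
```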
Generate and Call Compensating Transactions
The Enroll message enables the CTG to generate the delimiters Begin and Commit and to combine the compensating operations into a transaction. If the sub-transaction fails, all compensating operations generated previously are abandoned. A compensating transaction is stored in a database and deleted from the database when GridTS receives a Confirm message from the Coordinator. In a LGT, both the Cancel message and a timeout signal, which is generated after the transaction expiration Tvalid, can start the corresponding compensating transaction.
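This lifecycle can be read as a small state holder: Enroll wraps the collected operations into a stored compensating transaction, Confirm discards it, and Cancel or a Tvalid timeout executes it. The in-memory dictionary and the method names below are assumptions for the sketch; GridTS stores compensating transactions in a database.

```python
class CompensationStore:
    def __init__(self):
        self._pending = {}   # sub-transaction id -> compensating transaction

    def on_enroll(self, tx_id, compensating_ops):
        # Wrap the operations between Begin/Commit delimiters and persist them.
        self._pending[tx_id] = ["Begin", *compensating_ops, "Commit"]

    def on_confirm(self, tx_id):
        # The result is now final; the compensating transaction is useless.
        self._pending.pop(tx_id, None)

    def on_cancel_or_timeout(self, tx_id, execute):
        ops = self._pending.pop(tx_id, None)
        if ops is not None:
            execute(ops)     # run the stored compensating transaction
```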
Handle Noncompensable Transactions
A transaction is compensable if the effects of its commit can be semantically undone by another transaction, i.e., the corresponding compensating transaction. Otherwise, the transaction is noncompensable. A compensating transaction consists of a set of compensating operations, and a transaction T is compensable if and only if each operation OPi ∈ T has a corresponding compensating operation COPi. Some transactional grid applications comprise noncompensable operations, so these transactions are noncompensable. Generally, noncompensable operations can be divided into two types:
1. Difficult-to-compensate operations, such as the sale of stocks bought previously, where executing the compensating operation may cause unexpected results.
2. Impossible-to-compensate operations, which refer to operations that cannot be compensated at all. For example, it is impossible to compensate a launched missile.
Noncompensable operations often have effects on outside activities, so in general their effects must not be visible outside these applications. Thus, GridTS does not allow such a sub-transaction to commit in the pre-commit phase if it cannot find compensating rule(s) for an operation. Instead, we handle noncompensable transactions with the following policies:
• GridTS imposes a commit dependence between the sub-transaction and the global transaction, which means that the sub-transaction actually commits only if the global transaction commits.
• GridTS rolls back the operations taken previously but returns the Committed message to the coordinator. After receiving the Confirm message, it redoes and commits the sub-transaction.
• GridTS rolls back the executed operations and reports a commit exception to the user, who decides how to handle the exception.
FUTURE TRENDS
We have proposed a transaction service, GridTS, and coordination algorithms for short-lived and long-lived transaction management in grid environments. It is an effort towards a reliable grid platform. With the increasing reliability requirements of business applications, the reliable grid platform will be an important research direction, and a transaction service is an indispensable component of the emerging reliable grid infrastructure. Therefore, GridTS and its coordination algorithms provide powerful support for research on the reliability of grid platforms. Future research along this direction includes the following aspects, intended to make GridTS more practical and effective in commercial grid environments. The first is security guarantees during transaction processing. The Grid Security Infrastructure (GSI) may be used because it provides authentication, authorization and communication protection based on the public-key mechanism, and it is the de facto standard authentication method with the "single sign-on" property. Another issue is to investigate mechanisms for solving the possible deadlock problem of competing transactions, as well as approaches for combining transaction management with resource scheduling and management to enhance system efficiency.
CONCLUSION
We have proposed a grid transaction service, GridTS, and coordination algorithms for the management of short-lived and long-lived reliable activities in grids. GridTS can coordinate different grid transactions by executing the corresponding coordination algorithms. The design that separates GridTS from the algorithms makes GridTS more flexible and scalable, because new algorithms can be added for future reliable applications. Our proposal has four advantages. Firstly, for transactional grid applications, users only need to submit the corresponding parameters (e.g., the transaction type and timeout); GridTS can intelligently invoke different coordination algorithms and handle the entire transaction process on behalf of the users, hiding the complex process from them. Secondly, GridTS is able to dynamically generate compensating transactions during the execution of long-lived transactions, and at the same time provides interfaces for setting up service-specific compensating rules to satisfy different application requirements. Thirdly, the long-lived coordination algorithm allows users to select committed results, which is applicable to practical business applications. Finally, GridTS is extensible because it is built on top of a series of open standards, technologies and infrastructures.
ACKNOWLEDGMENT
Feilong Tang would like to thank the Japan Society for the Promotion of Science (JSPS) and The University of Aizu (UoA), Japan, for providing an excellent research environment during his JSPS Postdoctoral Fellowship at UoA. Thanks are also given to Dr. Chao-Li Wang at The University of Hong Kong, China, and Professor Zixue Cheng at UoA, Japan, for their valuable help. This work is supported by the National High Technology Research and Development Program (863 Program) of China (Grant Nos. 2006AA01Z172, 2006AA01Z199 and 2008AA01Z106), the National Natural Science Foundation of China (NSFC) (Grant Nos. 60773089, 60533040, and 60725208), and the Shanghai Pujiang Program (Grant No. 07pj14049).
REFERENCES
Ammann, P., Jajodia, S., & Ray, I. (1997). Applying formal methods to semantic-based decomposition of transactions. ACM Transactions on Database Systems, 22(2), 215–254. doi:10.1145/249978.249981
Ancilotti, P., Lazzerini, B., & Prete, C. A. (1990). A distributed commit protocol for a multicomputer system. IEEE Transactions on Computers, 39(5), 718–724. doi:10.1109/12.53589
Berman, F., Fox, G., & Hey, T. (Eds.). (2003). Grid computing: Making the global infrastructure a reality. New York: Wiley Series in Communication Networking & Distributed Systems.
Cabrera, F., Copeland, G., Cox, B., et al. (2002). Web Services Transaction (WS-Transaction). Retrieved from http://www.ibm.com/developerworks/library/ws-transpec.
Chrysanthis, P., & Ramamriham, K. (Eds.). (1992). ACTA: The SAGA continues. In Transaction Models for Advanced Database Applications. San Francisco: Morgan Kaufmann.
Chrysanthis, P. K., & Ramamriham, K. (1994). Synthesis of extended transaction models using ACTA. ACM Transactions on Database Systems, 19(3), 450–491. doi:10.1145/185827.185843
Dalal, S., Temel, S., & Little, M. (2003). Coordinating business transactions on the Web. IEEE Internet Computing, 7(1), 30–39. doi:10.1109/MIC.2003.1167337
Foster, I., Kesselman, C., & Nick, J. (2002). Grid services for distributed system integration. IEEE Computer, 35(6), 37–46.
Fox, F., & Gannon, D. (2001). Computational grids. Computing in Science & Engineering, 3(4), 74–77. doi:10.1109/5992.931906
Garcia-Molina, H., & Salem, K. (1987). SAGAS. In Proceedings of the ACM SIGMOD '87 International Conference on Management of Data, 16(3), 249–259.
Gray, J. (1981). The transaction concept: Virtues and limitations. In Proceedings of the 7th International Conference on VLDB (pp. 144–154).
Jiang, J. L., Yang, G. W., & Shi, M. L. (2006). Transaction model for service grid environment and implementation considerations. In Proceedings of the IEEE International Conference on Web Services (pp. 949–950).
Liang, D., & Tripathi, S. (1996). Performance analysis of long-lived transaction processing systems with rollbacks and aborts. IEEE Transactions on Knowledge and Data Engineering, 8(5), 802–815. doi:10.1109/69.542031
Tang, F. L., Li, M. L., & Huang, Z. X. (2004). Real-time transaction processing for autonomic Grid applications. Engineering Applications of Artificial Intelligence, 17(7), 799–807. doi:10.1016/S0952-1976(04)00122-8
Thomasian, A. (1997). A performance comparison of locking methods with limited wait depth. IEEE Transactions on Knowledge and Data Engineering, 9(3), 421–434. doi:10.1109/69.599931
Wachter, H., & Reuter, A. (Eds.). (1992). Contracts: A means for extending control beyond transaction boundaries. In Advanced Transaction Models for New Applications. San Francisco: Morgan Kaufmann.
Wang, T., Vonk, J., Kratz, B., & Grefen, P. (2008). A survey on the history of transaction management: From flat to grid transactions. Distributed and Parallel Databases, 23(3), 235–270. doi:10.1007/s10619-008-7028-1
Yang, Y. G., Jin, H., & Li, M. L. (2004). Grid computing in China. Journal of Grid Computing, 2(2), 193–206. doi:10.1007/s10723-004-4201-2
Yang, H. T., Wang, Z. H., & Deng, Q. H. (2008). Scheduling optimization in coupling independent services as a Grid transaction. Journal of Parallel and Distributed Computing, 68(6), 840–854. doi:10.1016/j.jpdc.2008.01.004
KEY TERMS AND DEFINITIONS
Atomic Transaction: A short-lived transaction with the "all or nothing" property, i.e., the sub-transactions in an atomic transaction either all commit or all abort.
Compensating Transaction: A transaction for undoing submitted transactions, which means canceling submitted operations and recovering system consistency.
Grid Computing: A distributed computing paradigm for large-scale and effective resource sharing and task collaboration, enabling people to utilize computing and storage resources transparently.
Grid Transaction: A set of operations that are executed on geographically distributed grid services.
Long-Lived Transaction: A transaction with a long lifetime. Generally, a long-lived transaction relaxes the atomicity and isolation properties.
Reliability: In transaction processing, the ability of a system or component to maintain system consistency by performing its required functions under stated conditions for a specified period of time.
Transaction Processing: A technology responsible for ensuring the reliable execution of distributed grid applications, keeping the system consistent and free from various failures. Ideally, it also shields users from the complex recovery process.
Chapter 20
Error Recovery for SLA-Based Workflows Within the Business Grid
Dang Minh Quan, International University in Germany, Germany
Jörn Altmann, Seoul National University, South Korea
Laurence T. Yang, St. Francis Xavier University, Canada
ABSTRACT
This chapter describes the error recovery mechanisms in a system handling Grid-based workflows within the Service Level Agreement (SLA) context. It classifies errors into two main categories. The first is large-scale errors, in which one or several Grid sites are detached from the Grid system at a time. The second is small-scale errors, which may happen inside an RMS. For each type of error, the chapter introduces a recovery mechanism, with the SLA context defining the goal of the mechanism. The authors believe that it is very useful to have an error recovery framework to avoid or eliminate the negative effects of the errors.
INTRODUCTION
In the Grid Computing environment, many users need the results of their calculations within a specific period of time. Examples of such users are meteorologists running weather forecasting workflows and automobile producers running dynamic fluid simulation workflows (Lovas et al., 2004). Those users are willing to pay for getting their work completed on time. However, this requirement must be agreed on by both the users and the Grid provider before the application is executed. This agreement is kept in the Service Level Agreement (SLA) (Sahai et al., 2003). In general, SLAs are defined as an explicit statement of expectations and obligations in a business relationship between service providers and customers.
DOI: 10.4018/978-1-60566-661-7.ch020
SLAs specify the a-priori negotiated resource requirements, the quality of service (QoS), and costs. The application of such an SLA represents a legally binding contract. This is a mandatory prerequisite for the Next Generation Grids. The basic concepts of a system handling the Grid-based workflow within an SLA context are described in the following sections.
Grid-Based Workflow Model
Workflows have received enormous attention in the databases and information systems research and development community (Georgakopoulos et al., 1995). According to the definition of the Workflow Management Coalition (WfMC) (Fischer, 2004), a workflow is "The automation of a business process, in whole or parts, where documents, information or tasks are passed from one participant to another to be processed, according to a set of procedural rules." Although business workflows have a great influence on research, another class of workflows has emerged in sophisticated scientific problem-solving environments, called Grid-based workflows. A Grid-based workflow differs slightly from the WfMC definition in that it concentrates on intensive computation and data analysis rather than on the business process. A Grid-based workflow is characterized by the following features (Singh et al., 1997):
• A Grid-based workflow usually includes many sub-jobs (i.e., applications) which perform data analysis tasks. However, those sub-jobs are not executed freely but in a strict sequence.
• A sub-job in a Grid-based workflow depends tightly on the output data from previous sub-jobs. With incorrect input data, a sub-job will produce wrong results and damage the result of the whole workflow.
• Sub-jobs in a Grid-based workflow are usually computationally intensive. They can be sequential or parallel programs and require a long runtime.
• Grid-based workflows usually require powerful computing facilities (e.g., super-computers or clusters) to run on.
Most existing Grid-based workflows (Ludtke et al., 1999; Berriman et al., 2003; Lovas et al., 2004) can be represented in Directed Acyclic Graph (DAG) form, so only DAG workflows are considered in this chapter. The user specifies the required resources needed to run each sub-job, the data transfers between sub-jobs, the estimated runtime of each sub-job, and the expected runtime of the whole workflow. In this chapter, we assume that time is split into slots, each of which equals a specific period of real time, from 3 to 5 minutes. We use the time slot concept in order to limit the number of possible start-times and end-times of sub-jobs. Moreover, a delay of 3 minutes has little impact on the customer. It is noted that the data to be transferred between sub-jobs can be very large.
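To make the workflow model concrete, the fragment below encodes a toy DAG in the spirit of Figure 1. The sub-job identifiers, runtimes (in time slots), CPU counts and transfer sizes are invented for illustration; they are not taken from the chapter's experiments.

```python
# Each sub-job: (estimated runtime in time slots, required CPUs).
sub_jobs = {0: (4, 16), 1: (3, 8), 2: (5, 32), 3: (2, 8), 4: (3, 8)}

# Directed edges (producer, consumer, amount of data to transfer).
transfers = [(0, 1, 2.0), (0, 2, 1.5), (0, 3, 0.5), (0, 4, 3.0)]

def ready_sub_jobs(finished):
    """Sub-jobs whose input data is available, i.e. all predecessors finished."""
    ready = []
    for sj in sub_jobs:
        preds = [src for src, dst, _ in transfers if dst == sj]
        if sj not in finished and all(p in finished for p in preds):
            ready.append(sj)
    return ready

print(ready_sub_jobs(finished={0}))   # -> [1, 2, 3, 4]
```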
Grid Service Model
The computational Grid includes many High Performance Computing Centers (HPCCs). The resources of each HPCC are managed by software called a local Resource Management System (RMS). Each RMS has its own unique resource configuration, comprising the number of CPUs, the amount of memory, the storage capacity, the software, the number of experts, and the service price. To ensure that a sub-job can be executed within a dedicated time period, the RMS must support advance resource reservation, such as CCS (Hovestadt, 2003). In our model, we reserve three main types
of resources: CPU, storage, and expert. The addition of further resources is straightforward. If two output-input-dependent sub-jobs are executed on the same RMS, it is assumed that the time required for the data transfer equals zero. This can be assumed because all compute nodes in a cluster usually use a shared storage system such as NFS or DFS. In all other cases, it is assumed that a specific amount of data will be transferred within a specific period of time, requiring the reservation of bandwidth. The link capacity between two local RMSs is determined as the average available capacity between those two sites in the network. The available capacity is assumed to be different for each RMS pair. Whenever a data transfer task is required on a link, the possible time period on the link is determined; during that specific time period, the task can use the whole capacity, and all other tasks have to wait. A more realistic model for bandwidth estimation (than the average capacity) can be found in (Wolski, 2003). Note that the choice of bandwidth estimation model has no impact on the working of the overall mechanism.
Business Model
In the case of Grid-based workflows, letting users work directly with resource providers has two main disadvantages:
• The user has to have sophisticated resource discovery and mapping tools in order to find the appropriate resource providers.
• The user has to manage the workflow, ranging from monitoring the running process to handling error events.
To free users from this kind of work, it is necessary to introduce a broker that handles the workflow execution for the user. We proposed a business model (Quan and Altmann, 2007a) for the system with three main entities: the end-user, the SLA workflow broker, and the service providers.
The end-user wants to run a workflow within a specific period of time. The user asks the broker to execute the workflow for him and pays the broker for the workflow execution service. The user does not need to know in detail how much he has to pay to each service provider; he only needs to know the total amount. This amount depends on the urgency of the workflow and the budget of the user. If there is an SLA violation, for example if the runtime deadline has not been met, the user will ask the broker for compensation. This compensation is clearly defined in the Service Level Objectives (SLOs) of the SLA.
The SLA workflow broker represents the user as specified in the SLA with the user. It controls the workflow execution, which includes mapping sub-jobs to resources, signing SLAs with the service providers, monitoring, and error recovery. When the workflow execution has finished, it settles the accounts: it pays the service providers and charges the end-user, and the profit of the broker is the difference. The value-add that the broker provides is the handling of all these tasks for the end-user.
The service providers execute the sub-jobs of the workflow. In our business model, we assume that each service provider fixes the price for its resources at the time of the SLA negotiation. As the resources of an HPCC usually have the same configuration and quality, each service provider has a fixed policy for compensation if its resources fail. For example, such a policy could be that n% of the cost will be compensated if the sub-job is delayed by one time slot. Figure 1 depicts a sample scenario of running a workflow in the Grid environment.
Figure 1. A sample running Grid-based workflow scenario
Problem Statement
In a large and complex system like the Grid, errors can happen at any time and in any part of the system, with high frequency. The sources of errors vary: broken network cables, faulty software, hardware errors, and so on. We classify the errors into two main categories.
The Large-Scale Error
A large-scale error happens when one or several Grid sites are detached from the Grid system at any given time. This error may be caused by a broken network link, a system power-down, or similar breakdowns. When one RMS is detached from the Grid system, all running or waiting sub-jobs from several workflows in that RMS are considered failed, since the system can no longer control their status or collect their results. The checkpoint images of the sub-jobs in the failed RMS cannot be used to restart them in other healthy RMSs. Moreover, output data from the finished sub-jobs in the failed RMS is not available, so several waiting sub-jobs in the other healthy RMSs cannot run because of the unavailability of input data. If the workflow is cancelled because of the error, the system will be seriously fined as stated in the SLA. Thus, the system has no choice but to try to finish executing the workflow by re-running all failed sub-jobs. However, this task faces two main problems.
• Mapping and re-executing only the failed sub-jobs in the other healthy RMSs is not enough. A workflow requires a strict execution order to ensure its integrity, and considering only the failed sub-jobs while dismissing the others risks breaking this integrity. Thus, determining all sub-jobs that are needed to continue the workflow execution is a mandatory requirement.
• When sub-jobs of the workflow must be re-executed, the probability of finishing the workflow on time as stated in the original SLA is very low, and the probability of being fined for not fulfilling the SLA is very high. Within the SLA context, which relates to business, the fine is usually
very costly and increases with the lateness of the workflow’s finished time. Thus, those sub-jobs must be mapped to the healthy RMSs in a way which minimizes the workflow’s finished time.
The Small-Scale Error
An error inside an RMS may happen at any time during the sub-job running period. The error could be caused by an operating system error, a hardware error, or an internal network cable error. In this case, the RMS will restart the sub-job from the checkpoint image. We also assume that the time to detect the error and the time to re-run the sub-job from the checkpoint image will cause the end-time of the sub-job to be later than the pre-determined deadline. According to our business model, because the provider is responsible for the error, the late sub-job will not be cancelled but will be allowed to run a few additional time slots. However, when one sub-job is delayed, the output data transfers to the subsequent sub-jobs are delayed as well, causing the start-times of those sub-jobs to be delayed. If those sub-jobs do not have sufficient computational resources allocated to compensate for the shorter time available for completing their calculations, the original error of the RMS might cause them to fail their calculations as well, producing a cascading effect of failing sub-jobs. Therefore, the whole workflow can fail because of one single error. A concrete example of such an error scenario for the workflow in Figure 1: if sub-job 0 is delayed by one time slot, the data transfer tasks 0-1, 0-2, 0-3 and 0-4 cannot be executed, so sub-jobs 1, 2, 3 and 4 do not get the input data to start their calculations at the specified start-times. The consequence is that the start of those sub-jobs will also be delayed by one time slot, and those sub-jobs might not have enough computational resources available to finish their calculations on time. To avoid the delay of the whole workflow, the resource allocation of the sub-jobs of the workflow must be re-scheduled so that they compensate for the delay. However, this re-scheduling may bring some negative side-effects:
• The finish time of the workflow may exceed the pre-determined time period. The broker will be fined by the user according to the length of the delay.
• If, in the re-scheduling, the remaining sub-jobs must be moved to other RMSs, the broker has to cancel the old reservation contracts. If this is not covered in the SLA, the broker will be fined by the service providers.
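The cascading delay sketched for Figure 1 can be reproduced with a few lines: delaying one sub-job pushes back the planned start of every successor that has no slack to absorb it. The start-time table, edges and slack values below are illustrative assumptions, not the chapter's data.

```python
planned_start = {0: 0, 1: 4, 2: 4, 3: 4, 4: 4, 5: 8}
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (2, 5)]
slack = {sj: 0 for sj in planned_start}     # spare slots before each sub-job

def propagate_delay(delayed, amount):
    """Return new start times after `delayed` finishes `amount` slots late."""
    start = dict(planned_start)
    frontier = [(delayed, amount)]
    while frontier:
        node, d = frontier.pop()
        for src, dst in edges:
            if src == node:
                push = max(0, d - slack[dst])    # slack may absorb part of the delay
                if push > 0:
                    start[dst] += push
                    frontier.append((dst, push))
    return start

print(propagate_delay(0, 1))   # sub-jobs 1-4 and then 5 all slip by one slot
```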
Thus, the system must have error recovery mechanisms in order to avoid or eliminate the negative effects caused by both large-scale and small-scale errors. In particular, it is desirable to have a mechanism to re-schedule the sub-jobs of the workflow in such a way that the workflow can be executed to produce the final result, while keeping the fines as low as possible. This chapter presents an error recovery framework for SLA-based workflows that addresses these problems. The chapter is organized as follows. The second section describes the related work. The third section presents the error recovery mechanisms. The fourth section describes the re-mapping algorithms used to re-map sub-jobs of the affected workflow to healthy Grid resources. Experiments on the performance of the recovery mechanisms are discussed in the fifth section. The sixth section presents future research directions, and the last section concludes the chapter with a short summary.
Figure 2. The error recovery framework
RELATED WORKS
Little work exists on the issue of error recovery for workflows, although the importance of fault tolerance in Grid computing has already been acknowledged with the establishment of the Grid Checkpoint Recovery Working Group, whose purpose is to define user-level mechanisms and Grid services for achieving fault tolerance. Stone (2004) described some initial results of the group's effort. The well-known Condor system has also implemented a mechanism to handle errors (Condor, 2006): when the mechanism detects an error, it continues to execute the other sub-jobs of the workflow as long as possible. This mechanism is reasonable if no SLAs have to be considered; since it does not pay attention to meeting the deadline of a workflow, the cost incurred through fines and the need for extra resources can become very high. The literature records a considerable amount of work in related areas, especially in recovery methods for single Grid jobs. Garbacki et al. (2005) present transparent fault tolerance for Grid applications based on Java RMI; they use globally consistent checkpoints to avoid having to restart long-running computations from scratch after a system crash. Hwang and Kesselman (2003) present a framework for handling errors on the Grid; central to the framework is flexibility in handling errors, which is achieved by using the workflow structure as a high-level recovery policy specification. Heine et al. (2005a) describe an SLA-aware job migration mechanism in Grid environments: checkpoints of the running job can be migrated to the same or other clusters running HPC4U software (see Heine et al., 2005b), and an architecture called VRM (Virtual Resource Management) continuously manages the status of the process.
ERROR RECOVERY FRAMEWORK
The error recovery framework is presented in Figure 2. Error detection is done with a monitoring module which collects information about the RMS status, the RMS resources, the RMS reservations, the sub-job states, and so on from all RMSs. The information is analyzed and stored in the central database to ensure that the broker module has an overall picture of the system. When an error is detected, the broker activates the error recovery module with an appropriate recovery strategy.
Recovering from the Large-Scale Error
In the SLA context, every sub-job of the workflow is planned to run on reserved resources within a specific time period to ensure the QoS while preserving the integrity of the workflow. During the running of the workflow, one or several RMSs can be detached from the system at any time. If this happens, we first determine all affected workflows. For each affected workflow, we check whether only independent sub-jobs are affected. If so, we try to re-map those sub-jobs to the healthy RMSs in a way that does not affect the other sub-jobs of the workflow. If there are dependently affected sub-jobs, or if the re-mapping of independently affected sub-jobs fails, the affected workflow is added to a list. After that, we determine the re-mapping priority for those workflows. For each workflow, in priority order, we determine the sub-jobs that need to be re-mapped; those sub-jobs form a new workflow derived from the old one. We use the w-Tabu algorithm to map this workflow to the healthy RMSs and optimize the finish time. The following parts describe each step in detail.
Checking Workflows Having Only Independent Sub-Jobs Affected
An independently affected sub-job is one that the error directly affects, but whose previous or subsequent dependent sub-jobs are not affected. For a clear view, consider the running scenario in Figure 1. If RMS 3 fails while sub-job 3 is running, the error directly affects only sub-job 3; sub-job 0 and sub-job 6 are not directly affected. Thus, sub-job 3 is an independently affected sub-job. In this case, we try to re-map sub-job 3 to another healthy RMS in a way that does not affect the start time of sub-job 6. This problem is similar to the problem of recovering the directly affected sub-jobs described in the small-scale error recovery section. This step is worth doing because it can be performed in a relatively short period of time, and if it succeeds, the negative effect of the error is greatly reduced. If RMS 2 fails while sub-job 2 is running, sub-jobs 1, 2, 6 and 7 are affected: sub-job 6 depends on sub-job 2, and sub-job 7 depends on sub-jobs 1 and 6. Thus, sub-jobs 1, 2, 6 and 7 are dependently affected. Re-mapping those sub-jobs seriously affects the integrity structure of the old workflow mapping solution, and we consider the workflow seriously affected. To recover from this error, we use the following procedures.
Determining the Re-Mapping Priority
When an error happens, many workflows can be affected simultaneously, and we have to re-plan many new workflows formed from the sets of determined affected sub-jobs. One problem is the order in which workflows are re-mapped. It is important because it affects the lateness of the workflows. Here, we use the Earliest Deadline First (EDF) policy, which is used broadly in real-time systems. A workflow with an earlier deadline is given higher priority, as it occupies resources for a shorter time and the other work-
flows need less time to wait for available resources. Thus, the total lateness is reduced and the fine amount is also reduced. To clarify the problem, suppose we have two workflows that need to be re-mapped and the Grid system can execute only one workflow at a time. Workflow 1 was planned to finish at t1 and workflow 2 at t2, with t2 > t1. Suppose that the penalty for each hour of lateness is P. If workflow 2 is mapped first, workflow 1 has to wait until workflow 2 is finished, so the minimal fine will be P*(t2 − fail_slot). If workflow 1 is mapped first, the minimal fine will be P*(t1 − fail_slot). Therefore, mapping workflow 1 first is better than mapping workflow 2 first. In a real, complex situation, mapping workflow 1 first gives more chance to finish workflow 1 earlier and to release resources earlier, and thus gives workflow 2 more chance to be mapped with smaller lateness.
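The EDF ordering and the fine estimate used in this two-workflow argument translate directly into code. The workflow records below are illustrative assumptions; the fine formula simply restates P*(t − fail_slot) from the text.

```python
def remapping_order(workflows):
    """Earliest Deadline First: workflows with earlier deadlines are re-mapped first."""
    return sorted(workflows, key=lambda wf: wf["deadline"])

def minimal_fine(planned_finish, fail_slot, penalty_per_hour):
    # Lower bound on the fine, as in the two-workflow argument above.
    return penalty_per_hour * max(0, planned_finish - fail_slot)

queue = remapping_order([{"id": 2, "deadline": 40}, {"id": 1, "deadline": 25}])
print([wf["id"] for wf in queue])   # -> [1, 2]
```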
Determining Sub-Jobs which Need to be Re-Planned
Determining all sub-jobs to be re-mapped in a workflow is done with the following procedure.
• Step 1: Clear the re-mapped set.
• Step 2: Put all sub-jobs that are running in the failed RMS into the re-mapped set.
• Step 3: Re-mapping those sub-jobs requires re-mapping all of their consequent (downstream) sub-jobs to ensure the integrity of the workflow; thus, all those consequent sub-jobs are put into the re-mapped set.
• Step 4: Affected sub-jobs in the failed RMSs will not have input data to run if their directly preceding, finished sub-jobs are also in the failed RMSs; thus, those finished sub-jobs must be put into the re-mapped set.
• Step 5: Affected sub-jobs in the healthy RMSs will not have input data to run if their directly preceding, finished sub-jobs are in the failed RMSs and the related data transfer task has not finished; those finished sub-jobs must also be put into the re-mapped set.
• Step 6: All other sub-jobs of the workflow that did not receive data from the determined sub-jobs must be re-mapped to ensure the integrity of the workflow.
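The six steps can be read as a closure computation over the workflow graph. The sketch below follows that reading; the data structures (edge list, sets of failed-RMS and finished sub-jobs, a descendants helper) are assumptions made for the example, and the finer data-transfer conditions of Steps 4 and 5 are simplified.

```python
def determine_remap_set(edges, in_failed_rms, finished, descendants):
    """edges: (producer, consumer) pairs; in_failed_rms: sub-jobs placed on failed RMSs."""
    remap = set()                                                  # Step 1
    remap |= {sj for sj in in_failed_rms if sj not in finished}    # Step 2
    # Step 3: every downstream sub-job of a re-mapped one must follow.
    for sj in list(remap):
        remap |= descendants(sj)
    # Steps 4-5 (simplified): finished predecessors whose output is no longer reachable.
    for producer, consumer in edges:
        if consumer in remap and producer in in_failed_rms and producer in finished:
            remap.add(producer)
    # Step 6: close the set again so the workflow's integrity is preserved.
    for sj in list(remap):
        remap |= descendants(sj)
    return remap
```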
Based on the determined priority, each workflow is mapped in sequence to the healthy RMSs. To do the mapping, we rebuild the new workflow in Directed Acyclic Graph (DAG) format and then use the mapping module to map this new DAG workflow to RMSs. When forming the DAG for a workflow, it is necessary to consider the dependencies of affected sub-jobs on running sub-jobs in healthy RMSs to ensure the integrity of the workflow. To represent these dependencies, for each running sub-job in the healthy RMSs we create a corresponding pseudo sub-job in the new workflow with:
• Runtime = Deadline − fail_slot − time_overhead
• Number of required CPUs = 0
• Number of required storage = 0
• Number of required experts = 0
Here the time_overhead value is the period needed to perform the recovery process. Moreover, we also need a new pseudo source sub-job for the workflow, with runtime and resource requirements equal to 0. Because even already finished sub-jobs have to be re-run, the probability of having many solu-
tions that meet the original deadline is very low. Thus, we bypass the attempt to optimize the cost while ensuring the deadline, and use the w-Tabu algorithm to minimize the finish time of the workflow. The w-Tabu algorithm is presented in the re-mapping algorithms section.
Recovering from the Small-Scale Error
When a small-scale error happens, we try to re-map the remaining sub-jobs in such a way that the workflow can complete with little delay and little extra cost. The entire strategy includes three phases, as described in Figure 2. Each phase represents a certain approach to finding a re-mapping solution, and the phases are sorted according to the simplicity and the cost that they incur.
Phase 1: Re-Mapping the Directly Affected Sub-Jobs
In the first phase, we try to re-map the directly affected sub-jobs in a way that does not affect the start times of the other remaining sub-jobs in the workflow. When we re-map the directly affected sub-jobs, we also have to re-map their related data transfers. For the example in Figure 1, if sub-job 0 is delayed, the affected sub-jobs are sub-jobs 1, 2, 3 and 4 and their related data transfers. This task can be feasible for several reasons:
• The delay of the late sub-job could be very small.
• The Grid may have other solutions in which the data transfers are shorter because the links have broader bandwidth.
• The Grid may have RMSs with higher CPU power which can execute the sub-jobs in a shorter time.
In the first place, we try to adjust the execution times of the input data transfers, the affected sub-jobs and the output data transfers within the same RMSs as pre-determined. Sub-jobs which cannot be adjusted are re-mapped to other RMSs. If this phase is successful, the broker only has to pay the following costs:
• The fee for canceling the reserved resources of the directly affected sub-jobs.
• The extra resource cost if the new mapping solution is more expensive than the old one.
As the cost for this phase is the lowest of the three phases, it should be tried first. The algorithm to re-map the directly affected sub-jobs, called G-Map, is described in more detail in the re-mapping algorithms section.
Phase 2: Re-Mapping the Workflow to Meet the Pre-Determined Deadline
This phase is executed if the first phase was not successful. In this phase, we try to re-map the remaining workflow in such a way that the deadline of the workflow is met and the cost is minimized. The remaining workflow is formed in a way similar to that in the large-scale error recovery section. If this phase is successful, the broker has to pay the following costs:
• The fee for canceling the reserved resources of all remaining sub-jobs.
• The extra resource cost if the new mapping solution is more expensive than the old one.
To perform the mapping, we use the H-Map algorithm to find the solution. A detailed description of the H-Map algorithm can be found in the re-mapping algorithms section.
Phase 3: Re-Mapping the Workflow to Have Minimal Runtime
This phase is the final attempt to recover from the error. It is initiated if the two previous phases were not successful. In this phase, we try to re-map the remaining workflow in a way that minimizes the delay of the entire workflow. If the solution has an acceptable lateness, the broker has to pay the following costs:
• The fee for canceling the reserved resources of all remaining sub-jobs.
• The extra resource cost if the new mapping solution has a higher cost than the old one.
• The fine for finishing the entire workflow late. This cost increases proportionally with the length of the delay.
If the algorithm only finds a solution with a delay higher than accepted by the user, the whole workflow is cancelled and the broker has to pay the following costs:
• The fee for canceling the reserved resources of all remaining sub-jobs.
• The fine for not finishing the entire workflow.
The goal of this phase is equivalent to minimizing the total runtime of the workflow. To do the re-mapping, we use the w-Tabu algorithm, which is described in the re-mapping algorithms section.
Recovery Procedure
When the error recovery module is activated, it performs the following actions in a strict sequence:
• Access the database to retrieve information about the failed RMSs and determine the affected workflows as well as the sub-jobs of each workflow that must be re-mapped.
• Based on the determined information about affected workflows and sub-jobs, activate the negotiation module to cancel all sub-job SLAs with the local RMSs related to those sub-jobs. All negotiation activities are done with the SLA text as the means of communication.
• Activate the monitoring module to update the newest information about the RMSs, especially information about resource reservations.
• Call the mapping modules to determine where and when the sub-jobs of the affected workflow will be run.
• Based on the mapping information, activate the negotiation module to sign a new SLA for each sub-job with the specific local RMS.
• Update the workflow control information and sub-job information in the central database.
Figure 3. w-Tabu algorithm overview
RE-MAPPING ALGORITHMS
This section presents all the algorithms used in the error recovery process. They include the w-Tabu algorithm, which optimizes the finish time of a workflow; the H-Map algorithm, which optimizes the cost of running a workflow while ensuring the deadline; and the G-Map algorithm, which maps a group of sub-jobs so as to satisfy the deadline while optimizing the cost.
Formal Mapping Problem Statement
The formal specification of the described problem includes the following elements:
• Let R be the set of Grid RMSs. This set includes a finite number of RMSs, which provide static information about controlled resources and the current reservations/assignments.
• Let S be the set of sub-jobs in a given workflow, including all sub-jobs with the current resource and deadline requirements.
• Let E be the set of edges in the workflow, which express the dependencies between the sub-jobs and the necessity for data transfers between the sub-jobs.
• Let Ki be the set of resource candidates of sub-job si. This set includes all RMSs which can run sub-job si, Ki ⊂ R.
Based on the given input, a feasible and possibly optimal solution is sought, allowing the most efficient mapping of the workflow in a Grid environment with respect to the given global deadline. The required solution is a set defined in Formula 1.
M = {(si, rj, start_slot) | si ∈ S, rj ∈ Ki}    (1)
If the solution does not have a start_slot for each si, it becomes a configuration as defined in Formula 2.
a = {(si, rj) | si ∈ S, rj ∈ Ki}    (2)
A feasible solution must satisfy the following conditions:
• Criterion 1: The finish time of the workflow must be smaller than or equal to the expected deadline of the user.
• Criterion 2: All Ki ≠ ∅; there is at least one RMS in the candidate set of each sub-job.
• Criterion 3: The dependencies of the sub-jobs are resolved and the execution order remains unchanged.
• Criterion 4: The capacity of an RMS must be equal to or greater than the requirement at any time slot. Each RMS provides a profile of currently available resources and can run many sub-jobs of a single workflow both sequentially and in parallel. The sub-jobs which run on the same RMS form a profile of resource requirements. For each RMS rj running sub-jobs of the Grid workflow, and for each time slot in the profile of available resources and the profile of resource requirements, the number of available resources must be larger than the resource requirement.
• Criterion 5: The data transmission task eki from sub-job sk to sub-job si, eki ∈ E, must take place in dedicated time slots on the link between the RMS running sub-job sk and the RMS running sub-job si.
In the next phase, the feasible solution with the lowest cost is sought. The cost C of running a Grid workflow is defined in Formula 3. It is the sum of four factors: the cost of using the CPU, the cost of using the storage, the cost of using the experts' knowledge, and finally the expense for transferring data between the resources involved.

C = Σ(i=1..n) si.rt*(si.nc*rj.pc + si.ns*rj.ps + si.ne*rj.pe) + Σ(eki∈E) eki.nd*rj.pd   (3)
with si.rt, si.nc, si.ns, si.ne being the runtime, the number of CPUs, the amount of storage, and the number of experts of sub-job si respectively. rj.pc, rj.ps, rj.pe, rj.pd are the prices of using the CPU, the storage, the experts, and the data transmission of RMS rj respectively. eki.nd is the amount of data to be transferred from sub-job sk to sub-job si. If two dependent sub-jobs run on the same RMS, the cost of transferring data from the previous sub-job to the later sub-job is neglected. For the problem of optimizing the finished time of the workflow, it is not necessary to meet Criterion 1. For the problem of mapping a group of sub-jobs to resources, Criterion 1 is expressed as follows: the start time of each input data transfer must be later than the sub-job it depends on, and the stop time of each output data transfer must be earlier than the next sub-job which depends on it. Suppose the Grid system has m RMSs which can satisfy the requirements of the n sub-jobs in a workflow. As an RMS can run several sub-jobs at a time, finding the optimal solution by exhaustive search requires examining m^n combinations. It can easily be shown that the optimal mapping of the workflow to the Grid RMSs as described above is an NP-hard problem.
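As an illustration of Formula 3, the cost of a configuration can be computed as in the following sketch. The dictionary-based data layout and the choice of charging a transfer at the destination RMS's transfer price are assumptions of this example, not details fixed by the chapter.

# Illustrative computation of Formula 3 for one configuration (assumed data layout).
def workflow_cost(sub_jobs, edges, assignment, rms_prices):
    # sub_jobs:   {sj: {"rt": runtime, "nc": cpus, "ns": storage, "ne": experts}}
    # edges:      {(src, dst): amount_of_data}
    # assignment: {sj: rms}
    # rms_prices: {rms: {"pc": cpu, "ps": storage, "pe": expert, "pd": transfer}}
    cost = 0.0
    for sj, spec in sub_jobs.items():
        p = rms_prices[assignment[sj]]
        cost += spec["rt"] * (spec["nc"] * p["pc"] + spec["ns"] * p["ps"] + spec["ne"] * p["pe"])
    for (src, dst), data in edges.items():
        if assignment[src] != assignment[dst]:   # transfers within the same RMS are neglected
            cost += data * rms_prices[assignment[dst]]["pd"]
    return cost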
w-Tabu Algorithm

The main purpose of the w-Tabu algorithm is to find a solution with the minimal finished time. Although the problem has the same objective as most existing algorithms for mapping a DAG to resources (Deelman et al., 2004), the defined context is different from the contexts appearing in the literature. In particular, our context is characterized by resource reservation, each sub-job is a parallel application, and each RMS can run several sub-jobs simultaneously. Thus, a dedicated algorithm is necessary. We propose a mapping strategy as depicted in Figure 3. Firstly, a set of referent configurations is created. Then we use a specific module to improve the quality of each configuration as far as possible. The best configuration is finally selected. This strategy resembles, in the abstract, long-term local searches such as Tabu search, GRASP, or Simulated Annealing; however, the detailed description below distinguishes our algorithm from them.
Generating Referent Solution Set

Each configuration from the referent configuration set can be thought of as the starting point of a local search, so the set should be spread as widely as possible over the search space. To satisfy this spreading requirement, the number of identical sub-job:RMS assignments shared by any two configurations must be as small as possible. The number of members in the referent set depends on the number of available RMSs and the number of sub-jobs. During the process of generating the referent solution set, each candidate RMS of a sub-job has an associated assign_number counting how many times that RMS has been assigned to the sub-job. During the process of building a referent configuration, we use a similarity set to store all previously defined configurations having at least one sub-job:RMS assignment in common with the configuration being built. The algorithm is defined in Algorithm 1.

Algorithm 1. Generating reference set algorithm
assign_number of each candidate RMS = 0
While m_size < max_size {
    Clear similar set
    For each sub-job in the workflow {
        For each RMS in the candidate list {
            For each solution in similar set {
                If solution contains sub-job:RMS
                    num_sim++
            }
            Store tuple (sub-job, RMS, num_sim) in a list
        }
        Sort the list
        Pick the best result
        assign_number++
        If assign_number > 1
            Find defined solutions having the same sub-job:RMS and put them into similar set
    }
}
While building a configuration, for each sub-job in the workflow we select the RMS in the set of candidate RMSs which creates the minimal number of sub-job:RMS assignments in common with the configurations in the similar set. After that, we increase the assign_number of the selected RMS. If this value is larger than 1, which means that the RMS has been assigned to the sub-job more than once, there must exist configurations that contain the same sub-job:RMS assignment and thus satisfy the similarity condition. We search the reference set for such configurations that are not yet in the similar set and add them to the similar set. When finished, the configuration is put into the referent set. After all reference configurations are defined, we use a specific procedure to refine each of the configurations as far as possible.
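A compact rendering of this generation procedure is sketched below. It follows Algorithm 1 in spirit but simplifies the bookkeeping: the similarity count is recomputed directly against the configurations built so far, so it should be read as an illustration rather than the exact implementation.

# Sketch: build a referent set of configurations spread over the search space.
def build_referent_set(sub_jobs, candidates, max_size):
    # sub_jobs: ordered list of sub-job ids; candidates: {sj: [rms, ...]}
    referent = []
    assign_count = {(sj, rms): 0 for sj in sub_jobs for rms in candidates[sj]}
    while len(referent) < max_size:
        config = {}
        for sj in sub_jobs:
            def score(rms):
                # Prefer RMSs creating few identical sj:RMS pairs with existing
                # configurations, breaking ties by how rarely they were assigned.
                similar = sum(1 for c in referent if c.get(sj) == rms)
                return (similar, assign_count[(sj, rms)])
            best = min(candidates[sj], key=score)
            config[sj] = best
            assign_count[(sj, best)] += 1
        referent.append(config)
    return referent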
Solution Improvement Algorithm

To improve the quality of a configuration, we use a specific procedure based on short-term Tabu Search. We use Tabu Search because it can play the role of a local search but with a wider search area. Besides the standard components of Tabu Search, there are some components specific to the workflow problem.

The Neighborhood Set Structure

One of the most important concepts of Tabu Search, as of any local search, is the neighborhood set structure. A configuration can be presented as a vector: the index of the vector represents the sub-job, and the value of the element represents the RMS. From a configuration a = a1a2...an, with all ai ∈ Ki, we generate n*(m-1) configurations a'. We change the value of ai to each and every value in the candidate list which is different from the present value, and each change results in a new configuration. After that we have a set A with |A| = n*(m-1); A is the neighborhood set of the configuration. A small sketch of this enumeration follows below.

The Assigning Sequence of the Workflow

When the RMS to execute each sub-job and the bandwidth among sub-jobs have been determined, the next task is to determine a time slot to run each sub-job in the specified RMS. At this point, the assigning sequence of the workflow becomes important. The sequence in which the runtimes of the sub-jobs are determined in an RMS can affect the final finished time of the workflow, especially when many sub-jobs share the same RMS. In general, to ensure the integrity of the workflow, sub-jobs are assigned based on the sequence of the data processing. However, that principle does not cover the case of a set of sub-jobs which have the same priority in the data sequence and do not depend on each other. To solve the problem, we determine the earliest and the latest start time of each sub-job of the workflow under ideal conditions. The time needed for a data transfer between sub-jobs is computed by dividing the amount of data by a fixed bandwidth. The earliest and latest start and stop times of each sub-job and data transfer depend only on the workflow topology and the runtimes of the sub-jobs, not on the resource context. These parameters can be determined using conventional graph algorithms. Mapping the sub-job with the smaller latest start time first makes the lateness smaller. Thus, the latest start time determined as above is used to define the assigning sequence: the sub-job having the smaller latest start time is assigned earlier. This procedure will satisfy Criterion 3.
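The neighborhood enumeration just described can be written in a few lines; the list-of-RMS representation of a configuration is the one used above, while the function name is an assumption of this sketch.

# Sketch: enumerate the n*(m-1) neighbors of a configuration a = [rms_of_sj1, ..., rms_of_sjn].
def neighborhood(config, candidates):
    # candidates[i] is the candidate RMS list K_i of sub-job i
    neighbors = []
    for i, current in enumerate(config):
        for rms in candidates[i]:
            if rms != current:            # change exactly one sub-job's RMS
                neighbor = list(config)
                neighbor[i] = rms
                neighbors.append(neighbor)
    return neighbors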
Computing the Timetable Procedure

To determine the finished time of a solution, we have to determine the timetable for executing the sub-jobs and their related data transfers. In the error recovery phase, finding a solution that meets or nearly meets Criterion 1 is very important. Therefore, we do not simply use the provided runtime of each sub-job but modify it according to the performance of each RMS. Let pki and pkj be the performance of a CPU in RMS ri and rj respectively, with pkj > pki. Suppose that a sub-job has the provided runtime rti on RMS ri. The runtime rtj of the sub-job on rj is then determined as in Formula 4.

rtj = rti * pki / (pki + (pkj - pki) * k)   (4)
Parameter k represents the effect of the sub-job's communication character and the RMS's communication infrastructure. For example, if pkj equals 2*pki and rti is 10 hours, rtj will be 5 hours if k equals 1. However, k = 1 only when there is no communication among the parallel tasks of the sub-job; otherwise, k will be less than 1. A practical Grid workflow usually has a fixed input data pattern. For example, the weather forecasting workflow is executed day by day and finishes within a constant period of time once all data has been collected (Lovas et al., 2004). This characteristic is the basis for estimating the Grid workload's runtime (Spooner et al., 2003). In our chapter, parameter ka is an average value which is determined by the user through many experiments and is provided as input to the algorithm. In the real environment, k may fluctuate around this average value depending on the network infrastructure of the system. For example, suppose that ka equals 0.8. If the cluster has good network communication, the real value of k may increase to 0.9; if the cluster has poor network communication, the real value of k may decrease to 0.7. Nowadays, with very good network technology in High Performance Computing Centers, the fluctuation of k is not large. To overcome the fluctuation problem, we use the pessimistic value kp instead of k in Formula 4 to determine the new runtime of the sub-job, as follows (see the sketch after this list):

• If ka > 0.8, for example with a rare-communication sub-job, kp = 0.5.
• If 0.8 > ka > 0.5, for example with a normal-communication sub-job, kp = 0.25.
• If ka < 0.5, for example with a heavy-communication sub-job, kp = 0.
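The sketch below shows how Formula 4 and the pessimistic choice of kp might be combined when estimating a sub-job's runtime on a different RMS. The thresholds are exactly those listed above; the function and variable names are assumptions of this example.

# Sketch: adjust a sub-job's runtime for a different RMS using Formula 4 with a
# pessimistic communication factor kp derived from the average factor ka.
def pessimistic_k(ka):
    if ka > 0.8:        # rare communication among parallel tasks
        return 0.5
    if ka > 0.5:        # normal communication
        return 0.25
    return 0.0          # heavy communication

def adjusted_runtime(rt_i, pk_i, pk_j, ka):
    # rt_i: provided runtime on RMS r_i; pk_i, pk_j: CPU performance of r_i and r_j
    kp = pessimistic_k(ka)
    return rt_i * pk_i / (pk_i + (pk_j - pk_i) * kp)

# With pk_j = 2*pk_i and k = 1, Formula 4 halves a 10-hour runtime; the
# pessimistic kp = 0.5 for ka = 0.9 gives a more conservative estimate.
print(adjusted_runtime(10.0, 1.0, 2.0, 0.9))   # about 6.67 hours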
The pessimistic policy ensures that the sub-job can be finished within the newly determined runtime period. With this assumption, the algorithm to compute the timetable is presented in Algorithm 2. As the w-Tabu algorithm applies both to light and heavy workflows, the parameters cannot be determined in the same way for both cases. With a light workflow, the end time of a data transfer equals the time slot after the end of the corresponding source sub-job. With a heavy workflow, the end time of a data transfer is determined by searching the bandwidth reservation profile. This procedure will satisfy Criteria 4 and 5.
Figure 4. H-Map algorithm overview
Algorithm 2. Determining timetable algorithm for workflow in w-Tabu

With each sub-job k following the assign sequence {
    Determine the set Q of assigned sub-jobs having output data transfer to sub-job k
    With each sub-job i in Q {
        min_st_tran = end_time of sub-job i + 1
        If heavy weight workflow {
            Search in reservation profile of link between RMS running sub-job k and
            RMS running sub-job i to determine start and end time of the data transfer
            task with start time > min_st_tran
        } else {
            end time of data transfer = min_st_tran
        }
    }
    min_st_sj = max end time of all above data transfers + 1
    Search in reservation profile of RMS running sub-job k to determine its start and
    end time with start time > min_st_sj
}
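A condensed Python rendering of this procedure is sketched below. The reservation-profile searches are reduced to a hypothetical earliest_free_slot helper passed in by the caller, so the sketch only mirrors the control flow of Algorithm 2.

# Sketch of Algorithm 2's control flow; earliest_free_slot(profile, earliest, length)
# is a hypothetical helper returning the first (start, end) window that fits.
def compute_timetable(order, predecessors, heavy, link_profiles, rms_profiles,
                      runtime, transfer_time, earliest_free_slot):
    start, end = {}, {}
    for k in order:                                  # follow the assigning sequence
        transfer_ends = [0]
        for i in predecessors[k]:                    # data transfers feeding sub-job k
            min_st_tran = end[i] + 1
            if heavy:                                # heavy workflow: reserve the link
                _, t_end = earliest_free_slot(link_profiles[(i, k)], min_st_tran,
                                              transfer_time[(i, k)])
            else:                                    # light workflow: one slot after the source
                t_end = min_st_tran
            transfer_ends.append(t_end)
        min_st_sj = max(transfer_ends) + 1
        start[k], end[k] = earliest_free_slot(rms_profiles[k], min_st_sj, runtime[k])
    return start, end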
The Modified Tabu Search Procedure

In a normal Tabu search, in each move iteration we would try assigning each sub-job si ∈ S to each RMS rj in its candidate set Ki, use the procedure in Algorithm 2 to compute the runtime, check for overall improvement, and pick the best move. This method is not efficient, as it requires a lot of time for computing the runtime of the workflow, which is not a simple procedure. We improve the method by proposing a new neighborhood based on two comments.

Comment 1: The runtime of the workflow depends mainly on the execution time of the critical path. In one iteration, we can move only one sub-job to one RMS. If the sub-job does not belong to the critical path, then after the movement the old critical path has a very low probability of being shortened and the finished time of the workflow has a low probability of improving. Thus, we concentrate only on sub-jobs in the critical path. Given a defined solution and runtime table, the critical path of a workflow is determined with the algorithm in Algorithm 3.

Algorithm 3. Determining critical path algorithm
Let C be the set of sub-jobs in the critical path
Put the last sub-job into C
next_subjob = last sub-job
do {
    prev_subjob = the sub-job having the latest finished output data transfer to next_subjob
    Put prev_subjob into C
    next_subjob = prev_subjob
} until prev_subjob = first sub-job

We start with the last sub-job. The next sub-job of the critical path is the one having the latest finished data transfer to the previously determined sub-job. The process continues until the next sub-job equals the first sub-job.

Comment 2: In one move iteration, with only one change of one sub-job to one RMS, if the finish time of the data transfer from this sub-job to the next sub-job in the critical path is not decreased, the critical path cannot be shortened. For this reason, we only consider changes which shorten the finish time of the consequent data transfer. Checking whether the data transfer time can be improved is much faster than computing the runtime table for the whole workflow. With these two comments, and with the other procedures similar to the standard Tabu search, we build the overall improvement procedure presented in Algorithm 4.

Algorithm 4. Configuration improvement algorithm in w-Tabu
while (num_loop < max_loop) {
    Determine critical path
    For each sub-job in the critical path {
        For each RMS in the candidate set {
            If it can improve the finished time of the consequent data transfer {
                Compute timetable for the new solution
                Store tuple (sub-job, RMS, makespan) in candidate list
            }
        }
    }
    Pick the solution having a smaller makespan or not affected by the tabu rule
    Assign tabu_number to the selected RMS
    If smaller makespan then store the solution
    num_loop++
}
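Putting Comments 1 and 2 together, the improvement loop can be sketched as follows. critical_path, improves_transfer, and makespan are placeholders for the procedures described above (Algorithms 2 and 3), and the tabu tenure default is an arbitrary choice for the sketch.

# Sketch of the modified Tabu improvement loop (Algorithm 4); the helper callables
# stand in for the procedures described in the text.
def improve_configuration(config, candidates, max_loop,
                          critical_path, improves_transfer, makespan, tabu_len=7):
    best, best_make = dict(config), makespan(config)
    tabu = {}                                          # rms -> loops remaining as tabu
    for _ in range(max_loop):
        moves = []
        for sj in critical_path(config):               # Comment 1: critical path only
            for rms in candidates[sj]:
                if rms == config[sj]:
                    continue
                if improves_transfer(config, sj, rms):  # Comment 2: cheap pre-check
                    trial = dict(config)
                    trial[sj] = rms
                    moves.append((makespan(trial), sj, rms, trial))
        if not moves:
            break
        moves.sort(key=lambda m: m[0])
        for make, sj, rms, trial in moves:              # best non-tabu or aspiration move
            if tabu.get(rms, 0) == 0 or make < best_make:
                config = trial
                tabu[rms] = tabu_len
                if make < best_make:
                    best, best_make = trial, make
                break
        tabu = {r: t - 1 for r, t in tabu.items() if t > 1}
    return best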
Figure 5. G-Map algorithm overview
Performance of w-Tabu Algorithm

To study the performance of the w-Tabu algorithm, we adapted to our problem the main ideas from the recent literature on mapping workflows to Grid resources with the same objective of minimizing the finished time. Those algorithms include w-DCP, GRASP, min-min, max-min, and sufferage (Quan, 2008). Extensive experiments with simulation data show that all algorithms need only a few seconds to find their solutions and that the w-Tabu algorithm outperforms all the others. In particular, the quality of the solution found by the w-Tabu algorithm is from 15% to 20% higher than that found by the other algorithms. More details about the experiment and results can be found in (Quan, 2008).
H-Map Algorithm

The goal of the H-Map algorithm is to find a solution which ensures Criteria 1-5 and is as inexpensive as possible. The overall H-Map algorithm is presented in Figure 4. Firstly, a set of initial configurations C0 is created. The configurations in C0 should be distributed widely over the search space and must satisfy Criterion 1. If C0 = ∅, we can deduce that there are few free resources on the Grid, and the w-Tabu algorithm is invoked. If w-Tabu also cannot find a feasible solution, the algorithm stops. If C0 ≠ ∅, the set is gradually refined to obtain better quality solutions. The refining process stops when the solutions in the set cannot be improved any further, and we have the final set C*. The best solution in C* is output as the result of the algorithm. The following sections describe each procedure of the algorithm in detail.
Constructing the Set of Initial Configurations

The purpose of this procedure is to create a set of initial configurations which is distributed widely over the search space.

Step 0: For each sub-job si, we sort the RMSs in the candidate set Ki according to the cost they need to run si. The cost is computed according to Formula 3. The sorted configuration space includes many layers. A configuration in an outer layer has a greater cost than one in an inner layer, and the cost of a configuration lying between two layers is greater than the cost of the inner layer and smaller than the cost of the outer layer.

Step 1: We pick the first configuration as the first layer of the configuration space. The determined configuration can be presented as a vector: the index of the vector represents the sub-job, and the value of the element represents the RMS. Although the first configuration has minimal cost according to Formula 3, we cannot be sure that it is the optimal solution, since the real cost of a configuration must consider the neglected cost of data transmission when two sequential sub-jobs are on the same RMS.

Step 2: We construct the other configurations as follows. The second solution is the second layer of the configuration space. Then we create a solution with a cost located between layer 1 and layer 2 by combining the first and the second configuration. To do this, we take the first p elements from the first vector configuration, then the next p elements from the second vector configuration, and repeat until we have n elements forming the third configuration. Thus, we get (n/2) elements from the first vector configuration and (n/2) elements from the second one. Combining in this way ensures that the target configuration differs in cost, according to Formula 3, from the source configurations. The process continues until the final layer is reached. Thus, we have in total 2*m-1 configurations. With this method, we can ensure that the set of initial configurations is distributed over the search space according to the cost criterion. A sketch of this construction is given after Step 3.

Step 3: We check Criteria 4 and 5 for all 2*m-1 configurations. To verify Criteria 4 and 5, we have to determine the timetable for all sub-jobs of the workflow; the procedure to determine the timetable is similar to the one described in Algorithm 2. If some configurations do not satisfy Criteria 4 and 5, we construct more until we again have 2*m-1 configurations. To do the construction, we vary the parameter p in the range from 1 to (n/2) in Step 2 to create new configurations. After this phase we have the set C0, including at most 2*m-1 valid configurations.
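The layer-combination idea of Steps 1-2 can be illustrated as follows. The sketch assumes candidate lists already sorted by cost (Step 0) and simply interleaves blocks of size p from consecutive layers; function and variable names are choices made for this example.

# Sketch: build up to 2*m-1 initial configurations from cost-sorted candidate lists.
# sorted_candidates[sj] is K_sj sorted by cost, cheapest first; m is the number of layers.
def initial_configurations(sub_jobs, sorted_candidates, m, p=1):
    def layer(k):
        # k-th cheapest RMS for every sub-job (clamped to the candidate list length)
        return [sorted_candidates[sj][min(k, len(sorted_candidates[sj]) - 1)]
                for sj in sub_jobs]

    def combine(a, b, block):
        # positions [0, block) from a, [block, 2*block) from b, and so on
        out, use_a = [], True
        for start in range(0, len(a), block):
            src = a if use_a else b
            out.extend(src[start:start + block])
            use_a = not use_a
        return out

    configs = [layer(0)]
    for k in range(1, m):
        configs.append(combine(layer(k - 1), layer(k), p))   # between-layer configuration
        configs.append(layer(k))                             # the k-th layer itself
    return configs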
Improving Solution Quality Algorithm

To improve the quality of the solutions, we use the neighborhood structure described in the w-Tabu algorithm section. Let A be the neighborhood set of a configuration; the procedure to find the highest quality solution includes the following steps.
Step 1: For all a ∈ A, calculate cost(a) and timetable(a), pick the a* with the smallest cost(a*) that satisfies Criterion 1, and put a* into set C1. The detailed technique of this step is described in Algorithm 5.

Algorithm 5. Algorithm to improve the solution quality

For each sub-job in the workflow {
    For each RMS in the candidate list {
        If cheaper then put (sub-job id, RMS id, improve_value) in a list
    }
}
Sort the list according to improve_value
From the beginning of the list {
    Compute the timetable to get the finished time
    If finished time < limit break
}
Store the result

We consider only configurations having a smaller cost than the present configuration. Therefore, instead of computing the cost and the timetable of all configurations in the neighborhood set, we compute only their costs. All the cheaper configurations are stored in a sorted list, and then we compute the timetables of the cheaper configurations along the list to find the first feasible configuration. This technique greatly decreases the algorithm's runtime.

Step 2: Repeat Step 1 with all a ∈ C0 to form C1.

Step 3: Repeat Steps 1 to 2 until Ct = Ct-1.

Step 4: Ct ≡ C*. Pick the best configuration of C*.
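The cheaper-first search of Step 1 can be expressed as in the following sketch; cost_of and finish_time stand for the cost function of Formula 3 and the timetable computation, and are assumptions of the example.

# Sketch of the H-Map improvement step: consider only cheaper single-RMS changes,
# ordered by cost saving, and accept the first one that still meets the deadline.
def improve_once(config, candidates, deadline, cost_of, finish_time):
    base_cost = cost_of(config)
    cheaper = []
    for sj, current in config.items():
        for rms in candidates[sj]:
            if rms == current:
                continue
            trial = dict(config)
            trial[sj] = rms
            saving = base_cost - cost_of(trial)
            if saving > 0:                      # keep only cheaper configurations
                cheaper.append((saving, sj, rms))
    cheaper.sort(key=lambda c: c[0], reverse=True)   # largest saving first
    for saving, sj, rms in cheaper:
        trial = dict(config)
        trial[sj] = rms
        if finish_time(trial) <= deadline:      # compute the timetable only now
            return trial
    return config                               # no feasible cheaper neighbor found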
Performance of H-Map Algorithm

To study the performance of the H-Map algorithm, we applied standard metaheuristics such as Tabu Search, Simulated Annealing, Iterated Local Search, Guided Local Search, the Genetic Algorithm, and the Estimation of Distribution Algorithm to our problem. The experimental results show that the H-Map algorithm finds solutions of equal or higher quality within a much shorter runtime than the other algorithms in most cases. On small-scale problems, some metaheuristics using local search such as ILS, GLS, and EDA find results equal to H-Map and better than SA or GA. On large-scale problems, however, they have exponential runtimes with unsatisfactory results. The runtime of the H-Map algorithm is just a few seconds. More details about the experiment and results can be found in (Quan, 2008).
G-Map Algorithm

The G-Map algorithm maps a group of sub-jobs onto the Grid resources, where G stands for Group. In the G-Map algorithm, we first try to compress the solution space so that the probability of finding feasible solutions is higher. After that, a set of initial configurations is constructed. This set is improved by a local search until it cannot be improved any more. Finally, we pick the best solution from the final set. The architecture of the algorithm is presented in Figure 5.
Refining the Solution Space

The set of candidate RMSs for each sub-job can be refined based on the following observation: an RMS is valid for a sub-job only if assigning the sub-job to that RMS still satisfies the start times of the next sequential sub-jobs. The algorithm to refine the solution space is presented in Algorithm 6.

Algorithm 6. Refining the solution space procedure
for each sub-job k in the set {
    for each RMS r in the candidate list of k {
        for each link to k in assigned sequence {
            min_st_tran = end_time of source sub-job
            search reservation profile of link for start_tran > min_st_tran
            end_tran = start_tran + num_data/bandwidth
            update reservation profile
        }
        min_st_sj = max(end_tran)
        search in reservation profile of r for start_job > min_st_sj
        end_job = start_job + runtime
        for each link from k in assigned sequence {
            min_st_tran = end_job
            search reservation profile of link for start_tran > min_st_tran
            end_tran = start_tran + num_data/bandwidth
            update reservation profile
            if end_tran >= end_time of destination sub-job
                remove r out of the candidate list
        }
    }
}
For each sub-job taken separately, we determine the schedule of its input data transfers, of the sub-job itself, and of its output data transfers. As Algorithm 6 shows, the resource reservation profile is not updated with these tentative assignments; we call this the ideal assignment. If the stop time of an output data transfer is not earlier than the start time of the next sequential sub-job, we remove the RMS from the candidate set.
Constructing the Set of Initial Configurations

The goal of the algorithm is to find a feasible solution which satisfies all required criteria and is as inexpensive as possible. Therefore, the set of initial configurations should satisfy two criteria:

• The configurations in the set must differ from each other as much as possible. This criterion ensures that the set of initial configurations is distributed widely over the search space.
• The RMSs running the sub-jobs in each configuration should differ from each other. This criterion ensures that each sub-job is assigned under the ideal condition; thus the chance of obtaining a feasible solution is increased.

The procedure to create the set of initial configurations is as follows.
Step 1: Sorting the candidate set according to the cost factor. For each sub-job, we compute the cost of running the sub-job on each RMS in the candidate set and then sort the RMSs according to this cost.

Step 2: Forming the first configuration. The procedure to form the first configuration of the set is presented in Algorithm 7 (a sketch follows the listing). We form the first solution with as small a cost as possible. For each unassigned sub-job, we compute m_delta = the cost of running it on the first feasible RMS minus the cost of running it on the second feasible RMS in the sorted candidate list. The sub-job having the smallest m_delta is assigned to its first feasible RMS. The purpose of this rule is to ensure that the sub-job with the greatest potential to increase the cost is assigned first. After that, we update the reservation profile and check whether the assigned RMS is still available for the other sub-jobs; if not, we mark it as unavailable. This process is repeated until all sub-jobs are assigned. This selection of which sub-job to assign next matters most when many sub-jobs share the same RMS as their first feasible choice.

Algorithm 7. The algorithm to form the first configuration

While the set of unassigned sub-jobs is not empty {
    For each sub-job s in the set of unassigned sub-jobs {
        m_delta = cost in first feasible RMS - cost in second feasible RMS
        put (s, RMS, m_delta) in a list
    }
    Sort the list to get the minimum m_delta
    Assign s to the RMS
    Drop s out of the set of unassigned sub-jobs
    Update the reservation profile of the RMS
    Check if the RMS is still feasible for the other unassigned sub-jobs
        if not, mark the RMS as infeasible
}
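A hedged rendering of this greedy choice is given below; feasibility checking and reservation updates are reduced to assumed helper callables, so only the m_delta-driven ordering is shown.

# Sketch of the m_delta greedy used to form the first G-Map configuration.
# feasible_rms(sj) returns sj's still-feasible candidate RMSs sorted by cost;
# cost(sj, rms) and reserve(sj, rms) are assumed helpers.
def first_configuration(sub_jobs, feasible_rms, cost, reserve):
    unassigned = set(sub_jobs)
    assignment = {}
    while unassigned:
        best = None
        for sj in unassigned:
            rms_list = feasible_rms(sj)
            first = cost(sj, rms_list[0])
            second = cost(sj, rms_list[1]) if len(rms_list) > 1 else float("inf")
            m_delta = first - second      # most negative = largest loss if the cheapest RMS is missed
            if best is None or m_delta < best[0]:
                best = (m_delta, sj, rms_list[0])
        _, sj, rms = best
        assignment[sj] = rms              # assign the most cost-sensitive sub-job first
        reserve(sj, rms)                  # update the reservation profile
        unassigned.remove(sj)
    return assignment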
Step 3: Forming the other configurations. The procedure to form the other initial configurations is described in Algorithm 8. To satisfy the two criteria described above, we use assign_number to keep track of how many times an RMS has been assigned to a sub-job, and l_ass to keep track of the appearance frequency of each RMS within a configuration. The RMS having the smaller assign_number and the smaller appearance frequency in l_ass is selected.
Algorithm 8. Procedure to create the initial configuration set
assign_number of each candidate RMS = 0
While number of configurations < max_sol {
    clear the list of assigned RMSs l_ass
    for each sub-job in the set {
        find in the candidate list the RMS r having the smallest number of appearances
        in l_ass and the smallest assign_number
        Put r into l_ass
        assign_number++
    }
}
Determining the Assigning Order

When the RMS executing each sub-job and the bandwidths among the sub-jobs have been determined, the next task is to determine the time slot in which to run each sub-job in the specified RMS. At this point, the order of determining the scheduled times for the sub-jobs becomes important. The sequence of determining the runtimes of sub-jobs in an RMS can also affect Criterion 1, especially when many sub-jobs are in the same RMS. In this algorithm, we use the following policy: the input data transfer having the smaller earliest start time is scheduled earlier; the output data transfer having the smaller latest stop time is scheduled earlier; and the sub-job having the earlier deadline is scheduled earlier.
Checking the Feasibility of a Solution

To check the feasibility of a solution, we have to determine its timetable with the procedure presented in Algorithm 9.

Algorithm 9. Procedure to determine the timetable
for each sub-job k in the set {
    for each link to k in assigned sequence {
        min_st_tran = end_time of source sub-job
        search reservation profile of link for start_tran > min_st_tran
        end_tran = start_tran + num_data/bandwidth
        update link reservation profile
    }
    min_st_sj = max(end_tran)
    search in reservation profile of RMS running k for start_job > min_st_sj
    end_job = start_job + runtime
    update resource reservation profile
    for each link from k in assigned sequence {
        min_st_tran = end_job
        search reservation profile of link for start_tran > min_st_tran
        end_tran = start_tran + num_data/bandwidth
        update link reservation profile
    }
}
After determining the timetable, the stop time of each output data transfer is compared with the start time of the next sequential sub-job. If there is a violation, the solution is deemed infeasible.
Improving Solution Quality Algorithm

To improve the quality of the solutions, we use a procedure similar to the one used in the H-Map algorithm. If the initial configuration set C0 ≠ ∅, the set is gradually refined to obtain better quality solutions. The refining process stops when the solutions in the set cannot be improved any more, and we have the final set C*. The best solution in C* is output as the result of the algorithm.
Performance of the G-Map Algorithm

To study the performance of the G-Map algorithm, we applied the Deadline Budget Constraint (DBC), H-Map, and Search All Cases (SAC) algorithms to this problem. The experimental results show that only the SAC algorithm has an exponential runtime when the problem size is large; the other algorithms have very small runtimes of just a few seconds. The H-Map algorithm has a limited chance of finding a feasible solution. The reason is that H-Map is designed for mapping a whole workflow, and the step of refining the solution space is not performed; therefore, there are many infeasible solutions in its initial configuration set. The G-Map and DBC algorithms have the same ability to find a feasible solution. Thus, we only compare the quality of the solutions between the G-Map and DBC algorithms. On average, G-Map finds solutions 5% better than the DBC algorithm. More details about the experiment and results can be found in (Quan and Altmann, 2007b).
PERFORMANCE EXPERIMENT

The experiments are done with simulation to study the performance of the error recovery mechanisms. We use simulation data because we want to cover a wide range of workload characteristics, which is impossible with a real workload. The hardware and software used in the experiments are rather standard and simple (Pentium D 2.8 GHz, 1 GB RAM, Fedora Core 5, MySQL).
Large-Scale Error Recovery Experiment

The goal of this experiment is to measure the total reaction time of the error recovery mechanism, in absolute value, when an error happens. Determining the total reaction time is important because it helps define the earliest start time of the re-mapped workflow, which is a necessary parameter for the mapping algorithms. For the experiment, we use 20 RMSs with different resource configurations and fill all the RMSs with randomly selected workflows having a start time slot equal to 20. We generated 20 different workflows which:

• Have different topologies.
• Have a different number of sub-jobs, from 7 to 32.
• Have different sub-job specifications. Without loss of generality, we assume that each sub-job has the same CPU performance requirement.
• Have different amounts of data transfer.
The number of failed RMSs increases from 1 to 3, and the failed RMSs are selected randomly. For each number of failed RMSs, the failure slot is moved along the reservation axis. The reason for this is that an error can happen at any random time slot along the reservation axis; thus, the broader the range of experiment times, the more accurately the reaction time is determined. Each time, we used the described recovery mechanism to re-map all affected workflows as well as all affected sub-jobs and measured the runtime in seconds. When 1 RMS fails, the experimental data show that the total reaction time of the mechanism increases with the total number of affected sub-jobs. When the number of failed RMSs increases, the total number of affected sub-jobs increases but the number of healthy RMSs decreases. For that reason, the total reaction time with more failed RMSs does not differ much from the case of a single failed RMS. Furthermore, the probability of more than 2 RMSs failing simultaneously is very low. For those reasons, the simulation data can be considered dependable. As the total reaction time is less than 2 minutes, compared to workflows that run for hours, the performance of the algorithm is well acceptable in real situations. In the mapping algorithm, time is computed in slots, which can have a resolution of 3 to 5 minutes. The reaction time of the mechanism occupies 1 time slot, and the negotiation takes about 1 more time slot; thus, the start time slot of the re-mapped workflow can be assigned the value of the present time slot plus 2. From the experimental data, we also see that the module recovering a group of independent affected sub-jobs is rarely invoked. One main reason for this result is that the consecutive sub-jobs of a workflow are mapped to the same RMS to save data transfer cost; thus, when that RMS fails, a series of dependent sub-jobs of the workflow is affected.
Small-Scale Error Recovery Performance

The goal of this experiment is to study the effectiveness of the multi-phase error recovery and the effect of the late period on the recovery process. For the experiment, we generated 8 different workflows which:

• Have different topologies.
• Have a different maximum number of potentially directly affected sub-jobs, ranging from 1 to 10. The number of potentially directly affected sub-jobs stops at 10 because, as far as we know, with the workload model described in Part 1 this number in real workflows is just between 1 and 7.
• Have different sub-job specifications. Without loss of generality, we assume that each sub-job has the same CPU performance requirement.
• Have different amounts of data transfer.
As differences in the static factors of an RMS, such as OS, CPU speed, and so on, can easily be filtered by an SQL query, we use 20 RMSs with resource configurations equal to or better than the requirements of the sub-jobs. These RMSs already have some initial workload in their resource reservation profiles and bandwidth reservation profiles. The 8 workflows are mapped to the 20 RMSs. We select the late sub-job in each workflow such that the number of directly affected sub-jobs equals the maximum number of potentially directly affected sub-jobs of that workflow. The late period is 1 time slot. For each group of affected sub-jobs, we change the power configuration of the RMSs and the k value of the affected sub-jobs. The RMS configuration ranges widely, from many RMSs with more powerful CPUs to many RMSs with CPUs just equal to the requirement. The workload configuration also varies widely, from many sub-jobs with large k to many sub-jobs with small k. We have chosen this experiment schema because we want to study the behavior of the algorithm in all possible cases.
The Effectiveness of the Error Recovery Mechanism

In this section, we study the effectiveness of the mechanisms applied in the three phases of the error recovery strategy for small-scale errors. The performance of an error recovery mechanism is defined as the cost that the broker has to pay for the negative effect of the error, as described in the error recovery section; the smaller the cost, the better the performance of the mechanism, and vice versa. For the experiment, we set the lateness period to 1. Each reserved resource cancellation costs 10% of the resource hiring value. For each affected sub-job group, for each power resource configuration scenario, and for each workload configuration scenario, we execute both the recovery strategy including all three phases and the recovery strategy including only phase 3. We record the cost and the phase in which the three-phase recovery strategy is successful. For each phase, we compute the average relative cost of successful solutions found by both strategies. The experiment shows that if phase 1 or phase 2 is successful, the performance of the two strategies differs: if the error recovery mechanism for one late sub-job succeeds at phase 1 or 2, the broker pays less than with the phase-3 mechanism. The probability of recovering successfully at phase 1 or 2 is large when the delay is small.
The Effect of the Late Period on the Recovery Process

To evaluate the effect of the late period on the recovery process, we vary the lateness period from 1 time slot to 5 time slots. For each affected sub-job group, for each power resource configuration scenario, for each workload configuration scenario, and for each late period, we perform the whole recovery process with the G-Map, H-Map, and w-Tabu algorithms. If the G-Map algorithm in phase 1 is not successful, the H-Map algorithm in phase 2 is invoked; if H-Map is not successful, the w-Tabu algorithm in phase 3 is invoked. Thus, for each late period, we have a total of 8*12*12 = 1152 recovery instances. For each late period, we record the number of feasible solutions for each algorithm and for each phase of the recovery process. From the experimental data, the error is effectively recovered when the late period is between 1 and 3 time slots. If the late period is less than or equal to 3 time slots, the ability to recover successfully with a low cost in the first phase is very high: 830 times out of 1152. When the late period is greater than 3, the chance of phase 1 failing increases sharply and we have to invoke the second or third phase, which incur higher cost.
FUTURE RESEARCH DIRECTION

The reaction time of the error recovery depends mainly on the re-mapping time and the negotiation time. From the experimental results, we can see that the reaction time of the error recovery procedure takes about 2 time slots. We want to reduce this value further to lessen the negative effect of the error. One potential way to realize this idea is to reduce the re-mapping time. In particular, we will focus on improving the speed of the re-mapping algorithms while not degrading the mapping quality.
CONCLUSION

This chapter has presented the error recovery framework for a system handling SLA-based workflows in the Grid environment. The framework deals with both small-scale and large-scale errors. When a large-scale error happens, many workflows can be affected simultaneously. After attempting to see whether the directly affected sub-jobs of each affected workflow can be recovered, the system focuses on re-mapping those workflows in a way that minimizes the lateness. When a small-scale error happens, only one workflow is affected and the system tries several recovery steps. In the first step, we try to re-map the directly affected sub-jobs in a way that does not affect the start times of the remaining sub-jobs in the workflow. If the first step is not successful, we try to re-map the remaining workflow in a way that meets the deadline of the workflow as inexpensively as possible. If the second step is not successful, we try to re-map the remaining workflow in a way that minimizes the lateness of the workflow. The experiments study many aspects of the error recovery mechanism, and the results show the effectiveness of applying separate error recovery mechanisms. The total reaction time of the system is 2 time slots in the bad case when a large-scale error happens. In the case of a small-scale error, the error is effectively recovered when the late period is between 1 and 3 time slots. Thus, the error recovery framework can be employed as an important part of a system supporting Service Level Agreements for Grid-based workflows.
REFERENCES

Berriman, G. B., Good, J. C., & Laity, A. C. (2003). Montage: A grid enabled image mosaic service for the national virtual observatory. In F. Ochsenbein (Ed.), Astronomical Data Analysis Software and Systems XIII, (pp. 145-167). Livermore, CA: ASP Press.

Condor Team. (2006). Condor Version 6.4.7 Manual. Retrieved October 18, 2006, from www.cs.wisc.edu/condor/manual/v6.4
Deelman, E., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Patil, S., et al. (2004). Pegasus: Mapping scientific workflows onto the grid. In M. Dikaiakos (Ed.), AxGrids 2004, (LNCS 3165, pp. 11-20). Berlin: Springer Verlag.

Fischer, L. (Ed.). (2004). Workflow Handbook 2004. Lighthouse Point, FL: Future Strategies Inc.

Garbacki, P., Biskupski, B., & Bal, H. (2005). Transparent fault tolerance for grid applications. In P. M. Sloot (Ed.), Advances in Grid Computing - EGC 2005, (pp. 671-680). Berlin: Springer Verlag.

Georgakopoulos, D., Hornick, M., & Sheth, A. (1995). An overview of workflow management: From process modeling to workflow automation infrastructure. Distributed and Parallel Databases, 3(2), 119-153. doi:10.1007/BF01277643

Heine, F., Hovestadt, M., Kao, O., & Keller, A. (2005). Provision of fault tolerance with grid-enabled and SLA-aware resource management systems. In G. R. Joubert (Ed.), Parallel Computing: Current and Future Issues of High End Computing, (pp. 105-112). NIC-Directors.

Heine, F., Hovestadt, M., Kao, O., & Keller, A. (2005). SLA-aware job migration in grid environments. In L. Grandinetti (Ed.), Grid Computing: New Frontiers of High Performance Computing (pp. 345-367). Amsterdam, The Netherlands: Elsevier Press.

Hovestadt, M. (2003). Scheduling in HPC resource management systems: Queuing vs. planning. In D. Feitelson (Ed.), Job Scheduling Strategies for Parallel Processing, (pp. 1-20). Berlin: Springer Verlag.

Hwang, S., & Kesselman, C. (2003). GridWorkflow: A flexible failure handling framework for the Grid. In B. Lowekamp (Ed.), 12th IEEE International Symposium on High Performance Distributed Computing, (pp. 126-131). New York: IEEE Press.

Lovas, R., Dózsa, G., Kacsuk, P., Podhorszki, N., & Drótos, D. (2004). Workflow support for complex Grid applications: Integrated and portal solutions. In M. Dikaiakos (Ed.), AxGrids 2004, (LNCS 3165, pp. 129-138). Berlin: Springer Verlag.

Ludtke, S., Baldwin, P., & Chiu, W. (1999). EMAN: Semiautomated software for high-resolution single-particle reconstruction. Journal of Structural Biology, 128, 146-157. doi:10.1006/jsbi.1999.4174

Quan, D. M. (Ed.). (2008). A Framework for SLA-aware execution of Grid-based workflows. Saarbrücken, Germany: VDM Verlag.

Quan, D. M., & Altmann, J. (2007). Business model and the policy of mapping light communication grid-based workflow within the SLA context. In Proceedings of the International Conference of High Performance Computing and Communication (HPCC07), (pp. 285-295). Berlin: Springer Verlag.

Quan, D. M., & Altmann, J. (2007). Mapping a group of jobs in the error recovery of the Grid-based workflow within SLA context. In L. T. Yang (Ed.), Proceedings of the 21st International Conference on Advanced Information Networking and Applications (AINA 2007), (pp. 986-993). New York: IEEE Press.

Sahai, A., Graupner, S., Machiraju, V., & Moorsel, A. (2003). Specifying and monitoring guarantees in commercial grids through SLA. In F. Tisworth (Ed.), Proceedings of the 3rd IEEE/ACM CCGrid2003, (pp. 292-300). New York: IEEE Press.
Singh, M. P., & Vouk, M. A. (1997). Scientific workflows: Scientific computing meets transactional workflows. Retrieved January 13, 2006, from http://www.csc.ncsu.edu/faculty/mpsingh/papers/databases/workflows/sciworkflows.html

Spooner, D. P., Jarvis, S. A., Cao, J., Saini, S., & Nudd, G. R. (2003). Local grid scheduling techniques using performance prediction. In S. Govan (Ed.), IEEE Proceedings - Computers and Digital Techniques, Vol. 150, (pp. 87-96). New York: IEEE Press.

Stone, N. (2004). GWD-I: An architecture for grid checkpoint recovery services and a GridCPR API. Retrieved October 15, 2006, from http://gridcpr.psc.edu/GGF/docs/draft-ggf-gridcpr-Architecture-2.0.pdf

Wolski, R. (2003). Experiences with predicting resource performance on-line in computational grid settings. ACM SIGMETRICS Performance Evaluation Review, 30(4), 41-49. doi:10.1145/773056.773064
KEY TERMS AND DEFINITIONS

Business Grid: The business Grid is a Grid of resource providers that sell their computing resources.

Error Recovery: Error recovery is a process of acting against an error in order to reduce its negative effect.

Grid-Based Workflow: A Grid-based workflow usually includes many dependent sub-jobs. Sub-jobs in a Grid-based workflow are usually computationally intensive and require powerful computing facilities to run on.

Grid Computing: Grid computing (or the use of a computational grid) is the application of the combined computing resources of many organizations to a problem at the same time.

Service Level Agreement: SLAs are defined as an explicit statement of expectations and obligations in a business relationship between service providers and customers.

Workflow Mapping: Workflow mapping is a process that determines where and, optionally, when each sub-job of the workflow will run.

Workflow Broker: The workflow broker coordinates the work of many service providers to successfully execute a workflow.
ENDNOTE
1. In this chapter, RMS is used to represent the cluster/supercomputer as well as the Grid service provided by the HPCC.
Chapter 21
A Fuzzy Real Option Model to Price Grid Compute Resources

David Allenotor, University of Manitoba, Canada
Ruppa K. Thulasiram, University of Manitoba, Canada
Kenneth Chiu, University at Binghamton, State University of NY, USA
Sameer Tilak, University of California, San Diego, USA
ABSTRACT

A computational grid is a geographically dispersed heterogeneous computing facility owned by dissimilar organizations with diverse usage policies. As a result, guaranteeing the availability of grid resources, as well as pricing them, raises a number of challenging issues ranging from security to the management of the grid resources. In this chapter we design and develop a grid resources pricing model using a fuzzy real option approach and show that finance models can be effectively used to price grid resources.
INTRODUCTION

Ian Foster and Carl Kesselman (I. Foster & Kesselman, 1999) describe the grid as an infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities that enable the sharing, exchange, selection, and aggregation of geographically distributed resources. A computational grid is analogous to an electrical power grid. In the electric power grid, electrical energy is generated from various sources such as coal, solar, hydro, or nuclear. The user of electrical energy has no knowledge about the source of the energy and is only concerned about the availability and ubiquity of the energy. Likewise, the computational grid is characterized by heterogeneous resources (grid resources) which are owned by multiple organizations and individuals. The grid's distributed resources include, but are not limited to, CPU cycles, memory, network bandwidth, throughput, computing
power, disks, processors, software, various measurement and instrumentation tools, catalogue data and databases, special devices and instruments, and people/collaborators. We describe the grid compute resources as grid compute commodities (gccs) that need to be priced. This chapter focuses on the design and development of a grid resource pricing model, with the objective of providing optimal gain (profitability-wise) for the grid operators and a satisfaction guarantee, measured as Quality of Service1 (QoS) requirements, for grid resource users and resource owners through regulated Service Level Agreement2 (SLA)-based resource pricing. We design our pricing model using a discrete-time numerical approach to model grid resource spot prices. We then model the resource pricing problem as a real option pricing problem. We monitor and maintain the grid service quality by addressing uncertainty constraints using fuzzy logic. In recent times, research efforts in computational grids have focused on developing standards for grid middleware in order to provide solutions to grid security issues and infrastructure-based issues (I. T. Foster, Kesselman, Tsudik & Tuecke, 1998), and on the grid market economy (Schiffmann, Sulistio, & Buyya, 2007). Since grid resources have been available for free, there has been only little effort made to price them. However, a trend is developing: there is large interest in grids for public computing, and because several business operatives do not want to invest in computing infrastructures due to the dynamic nature of information technology, a huge demand for grid computing infrastructures and resources is expected. A sudden explosion of grid usage is therefore expected in the future. In anticipation of this sudden increase in grid and grid resource usage, Amazon has introduced the Simple Storage Service (S3) (Palankar, Onibokun, Iamnitchi, & Ripeanu, 2007) for grid consumers. S3 offers pay-as-you-go online storage and, as such, provides an alternative to in-house mass storage. A major drawback of S3 is data access performance. Although the S3 project is successful, its current architecture lacks requirements for supporting scientific collaborations due to its reliance on a set of assumptions based on built-in trust.
BACKGROUND

A financial option is defined (see, for example, (Hull, 2006)) as the right to buy or to sell an underlying asset that is traded in an exchange for an agreed-upon sum. The right to buy or sell may expire if it is not exercised on or before a specific date, in which case the option buyer forfeits the premium paid at the beginning of the contract. The exercise price (strike price) specified in an option contract is the stated price at which the asset can be bought or sold at a future date. A call option grants the holder the right to purchase the underlying asset at the specified strike price. A put option, on the other hand, grants the holder the right to sell the underlying asset at the specified strike price. An American option can be exercised at any time during the life of the option contract; a European option can only be exercised at expiry. Options are derivative securities because their value is derived from the price of some underlying asset upon which the option is written. They are also risky securities because the price of their underlying asset at any future time may not be predicted with certainty. This means the option holder has no assurance that the option will be in-the-money (i.e., yield a non-negative reward) before expiry. A real option provides a choice from a set of alternatives. In the context of this study, these alternatives include the flexibilities of exercising, deferring, finding other alternatives, waiting, or abandoning an option. We capture these alternatives using fuzzy logic (Bojadziew & Bojadziew, 1997) and express the choices as a fuzzy number. A fuzzy number is expressed as a membership function that lies between
0 and 1; i.e., a membership function maps all elements in the universal set X to the interval [0, 1]. We map all possible flexibilities using membership functions. The majority of current research efforts ((Buyya, Abramson, & Venugopal, 2005) and references thereof) in grid computing focus on the grid market economy. The current literature on real option approaches to valuing projects presents the real option framework in eight categories (Gray, Arabshahi, Lamassoure, Okino, & Andringa, 2004): option to defer, time-to-build option, option to alter, option to expand, option to abandon, option to switch, growth options, and multiple options. Efforts have also been made towards improving the selection and decision methods used in predicting the capital that an investment may consume. Carlsson and Fullér (Carlsson & Fullér, 2003) apply a hybrid approach to valuing real options; their method incorporates real options, fuzzy logic, and probability to account for the uncertainty involved in the valuation of future cash flow estimates. The results of the research given in (Gray et al., 2004) and (Carlsson & Fullér, 2003) have no formal reference to the QoS that characterizes a decision system. Carlsson and Fullér (Carlsson & Fullér, 2003) apply fuzzy methods to measure the level of decision uncertainty but do not price grid resources. We propose a finance concept for pricing grid resources. In our model, we design and develop a pricing function similar in concept to that of Mutz et al. (Mutz, Wolski, & Brevik, 2007), who model resource allocation in a batched queue of jobs ji, i = 1, 2, ..., n, waiting to be granted resources; job ji receives service before ji+1. The resources granted are based on the owner's parameters. Their basis for modeling the payment function depends on the users' behavior, which imposes some undesirable externality constraints (resource usage policies across multiple organizations) on the jobs in the queue. With specific reference to the job value vi (currency based) and the delay in total turnaround time d, expressed as a tolerance factor, Mutz et al. obtained a job priority model using the efficient mechanism design in (Krishna & Perry, 2007). They also proposed a compensation function based on the propensity with which a job scheduled for time tn-1 wishes to be done at an earlier time. The compensation, which is determined by d, is paid by the owner of the job that is to be done earlier and is disbursed in the form of incentives (say, more gccs) to the jobs (or their owners) scheduled before it. Our pricing model will incorporate a price variant factor (pvf), a penalty function. The pvf is a fuzzy number, and based on the fuzziness (or uncertainty in availability or changes in technology), the pvf trend influences the price of a grid resource. In this chapter we draw our inferences by comparing simulated results to results obtained from a research grid (SHARCNET (SHARCNET, 2008)). This choice is made to reflect a real-life situation. We evaluate our proposed grid resources pricing model and provide a justification by comparing real grid behavior to simulation results obtained using some base spot prices for the gccs. In particular, we emphasize the provision of service guarantees, measured as Quality of Service (QoS), and profitability, from the perspectives of the users and grid operators respectively.
We strive to maintain a balance between the service users require from the grid, profitability from resource utilization, and satisfaction in using grid resources.
RESEARCH METHODOLOGY

Black and Scholes (Black & Scholes, 1973) developed one of the most important models for pricing financial options, which was enhanced by Merton (Merton, 1973). Cox, Ross, and Rubinstein (Cox, Ross, & Rubinstein, 1979) developed a discretized version of this model. The Black-Scholes and other models form the fundamental concepts of real options. In an increasingly uncertain and dynamic global marketplace (such
as the grid market), managerial flexibility has become a major concern. A real options framework captures the set of assumptions, concepts, and methodologies for assessing decision flexibility in a known future. Flexibilities, which are characterized by uncertainties in investment decisions, are critical because not all of them have value in the future. This challenge in the real options concept has propelled several research efforts in recent times. Real option theory becomes most functional when the business in question can be expressed as a process that includes (1) an option, (2) an irreversible investment, and (3) a measure of uncertainty about the value of the investment and the possibility of losses. The uncertainty referred to here is the observed price volatility of the underlying asset, σ. The value of this volatility is in direct proportion to the time value of the option; that is, if the volatility is small, the time value of the option becomes negligible, and hence the real option approach does not add value to the valuation. Several schemes exist in the literature to price financial options: (1) application of the Black-Scholes model (Black & Scholes, 1973), which requires the solution of a partial differential equation capturing the price movements continuously; and (2) application of a discrete time and state binomial model of the underlying asset price that captures the price movement discretely (Cox, Ross, & Rubinstein, 1979). In our simulation, we use the trinomial model (see, for example, Hull, 2006) to solve the real option pricing problem. This is a discrete-time approach that calculates discounted expectations in a trinomial-tree structure. A good description of the binomial lattice model can be found in (Thulasiram, Litov, Nojumi, Downing, & Gao, 2001). We start with grid utilization trace gathering and analysis to determine the extent and effect that a particular grid resource's usage has on the overall behavior of the grid.
Model Assumptions and Formulation We formulate the grid resources pricing model based on the following set of assumptions. First, we assume that it is more cost effective to use the resources from a grid than other resources elsewhere. We also assume some base prices for gccs that are as close as possible to current real sale prices, but discounted. For instance, if 1GB of Random Access Memory (RAM) costs a given amount, we can set a proportional weekly price per MB of memory. The option holder has the sole right to exercise the option any time before the expiration (American style option). Secondly, since the resources exist in non-storable (non-stable) states, we can value them as real assets. This assumption qualifies them to fit into the general investment valuation model in the real option valuation approach. This assumption also justifies resources availability. Since the gccs are non-stable, availability could be affected by a high volatility (σ). This implies that the grid resources utilization times are in effect shorter relative to the life of an option in financial valuation methods. Hence a holder of the option to use the grid resources has an obligation-free chance of exercising the right. The obligation-free status enables us to apply existing finance option valuation theory to model our pricing scheme. As an example, consider an asset whose price is initially S0 and an option on the asset whose current price is f. Suppose the option lasts for a time T and that during the life of the option the asset price can either move up from S0 to a new level S0u with a payoff value of fu or move down from S0 to a new level S0d with a payoff value of fd, where u > 1 and d < 1. This leads to a one-step binomial model. We define a grid-job as a service request that utilizes one or more of the gccs between its start and finish.
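A minimal sketch of the one-step binomial valuation just described, using the standard risk-neutral probability of (Cox, Ross, & Rubinstein, 1979) and taking a call payoff for concreteness; the up and down factors u and d below are assumed values chosen only to illustrate the mechanics.

import math

def one_step_binomial_call(S0, K, u, d, r, T):
    """Value a call on a one-step binomial tree: S0 moves to S0*u or S0*d."""
    fu = max(S0 * u - K, 0.0)              # payoff after an up move
    fd = max(S0 * d - K, 0.0)              # payoff after a down move
    p = (math.exp(r * T) - d) / (u - d)    # risk-neutral probability of the up move
    return math.exp(-r * T) * (p * fu + (1.0 - p) * fd)

# Illustrative values only (u > 1 and d < 1, as the model requires).
print(one_step_binomial_call(S0=0.80, K=0.70, u=1.1, d=0.9, r=0.06, T=0.5))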
Price Variant Factor Our model objective is to keep the grid busy (i.e., without idle compute cycles). To achieve this objective, we set up a control function defined as the price variant factor (pf). The pf is a fuzzy number and a multiplier based on the fuzziness (or uncertainty) in changes in technology; it is a real number in the range 0 ≤ pf ≤ 1. Its value depends on changes in technological trends. These changes (new and faster algorithms, faster and cheaper processors, or changes in usage rights and policies) are non-determinable prior to exercising any of the options to hold the use of a grid resource, and they cannot be predicted exactly. Therefore, we treat pf as a fuzzy number and apply fuzzy techniques to capture uncertainties in pf.
Real Option Discretization of Trinomial Process The trinomial-tree model was introduced in (Boyle, 1986) to price primarily American-style and European-style options on a single underlying asset. Options pricing under the Black-Scholes model (Black & Scholes, 1973) requires the solution of the partial differential equation satisfied by the option price. Option prices can also be obtained by building a discrete time and state binomial model of the asset price and then applying discounted expectations. A generalization of such a binomial valuation model (Hull, 2006) to a trinomial model and option valuations on the trinomial model are useful since solving the partial differential equation of the option price by the explicit finite difference method is equivalent to performing discounted expectations in a trinomial-tree (Hull, 2006). The asset price in a trinomial-tree moves in three directions compared with only two for a binomial tree, so the time horizon (number of steps) can be reduced in a trinomial-tree to attain the same accuracy obtained by a binomial-tree. Consider an asset whose current price is S and let r be the riskless, continuously compounded interest rate. The stochastic differential equation for the risk-neutral geometric Brownian motion (GBM) model of an asset price paying a continuous dividend yield of δ per annum (Hull, 2006) is given by the expression: dS = (r − δ)S dt + σS dz
(1)
For convenience, in terms of x = ln S, we take the derivative of x, i.e., dx = v dt + σ dz
(2)
where v = r − δ − σ²/2. Consider a trinomial model of the asset price: in a small time interval δt, the asset price increases by δx, remains the same, or decreases by δx, with probability of an up movement pu, probability of a steady move (no change) pm, and probability of a downward movement pd. Figure 1 shows a one-step trinomial lattice expressed in terms of δx and δt. The drift (due to known factors) and volatility (σ, due to unknown factors) parameters of the asset price can be captured in the simplified discrete process using δx, pu, pm, and pd. The space step can be computed (as one common choice) using δx = σ√(3δt). A relationship between the parameters of the continuous time process and the trinomial process (a discretization of the geometric Brownian motion (GBM)) is obtained by equating
Figure 1. One-step trinomial lattice
the mean and variance over the time interval δt and imposing the unitary sum of probabilities, i.e.,
E[δx] = pu(δx) + pm(0) + pd(−δx) = vδt
(3)
Where E[δx] is the expectation. From Equation (3),
E[δx²] = pu(δx²) + pm(0) + pd(δx²) = σ²δt + v²δt²
(4)
where the unitary sum of probabilities can be presented as pu + pm + pd = 1
(5)
pu, pm, and pd are the probabilities of the price going up, remaining the same, or going down, respectively. Solving Equations (3), (4), and (5) yields the transitional probabilities:
pu = 0.5[(σ²δt + v²δt²)/δx² + vδt/δx]
(6)
pm = 1 − (σ²δt + v²δt²)/δx²
(7)
pd = 0.5[(σ²δt + v²δt²)/δx² − vδt/δx]
(8)
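The sketch below computes the transitional probabilities of Equations (6)-(8), with the space step chosen as δx = σ√(3δt); the interest rate, dividend yield, volatility, and step size used at the bottom are assumed values for illustration only.

import math

def trinomial_probabilities(sigma, r, delta, dt):
    """Transitional probabilities pu, pm, pd for the trinomial discretization."""
    v = r - delta - 0.5 * sigma ** 2        # drift of x = ln S
    dx = sigma * math.sqrt(3.0 * dt)        # common choice of space step
    a = (sigma ** 2 * dt + v ** 2 * dt ** 2) / dx ** 2
    b = v * dt / dx
    pu = 0.5 * (a + b)
    pm = 1.0 - a
    pd = 0.5 * (a - b)
    return pu, pm, pd, dx

# Assumed parameters: r = 6%, no dividend yield, sigma = 20%, 4 steps over half a year.
pu, pm, pd, dx = trinomial_probabilities(sigma=0.2, r=0.06, delta=0.0, dt=0.5 / 4)
print(pu, pm, pd, pu + pm + pd)   # the three probabilities sum to 1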
Figure 2. SHARCNET: CPU time vs. number of jobs
The trinomial process of Figure 1 can be repeated a number of times to form an n-step trinomial tree. Figure 2 shows a four-step trinomial tree. For a number of time steps (horizontal levels) n = 4, the number of leaves (height) in such a tree is given by 2n + 1. We index a node by referencing a pair (i, j), where i points at the level (row index) and j indicates the distance from the top (column index). Time t is referenced from the level index by t = iΔt. From Figure 2(b), node (i, j) is thus connected to node (i + 1, j) (upward move), to node (i + 1, j + 1) (steady move), and to node (i + 1, j + 2) (downward move). The option price and the asset price at node (i, j) are given by C[i, j] = Ci,j and S[i, j] = Si,j respectively. The asset price can be computed from the number of up and down moves required to reach (i, j) from the root node and is given by S[i, j] = S[0,0](u^i d^j). (9) The options at maturity (i.e., when T = nΔt for European style options; T ≤ nΔt for American style options) are determined by the payoff. So for a call option (the intent to buy an asset at a previously determined strike price), the payoff is Cn,j = max(0, Sn,j − K), and for a put option (the intent to sell) it is Cn,j = max(0, K − Sn,j). The value K represents the strike price at maturity T = nΔt for a European-style option, and the strike price at any time before or on maturity for an American-style option. To compute option prices, we apply the discounted expectations under the risk neutral assumption. For an American put option (for example), for i < n:
Ci,j = max(e^(−rΔt)(pu Ci+1,j + pm Ci+1,j+1 + pd Ci+1,j+2), K − Si,j)
(10)
For a European call option (exercised on maturity only), for i < n,
Ci,j = e^(−rΔt)(pu Ci+1,j + pm Ci+1,j+1 + pd Ci+1,j+2)
(11)
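The backward induction of Equations (9)-(11) can be sketched as follows. This is a simplified illustration, not the chapter's simulation code: levels have 2i + 1 nodes, j is counted (zero-based) from the top of each level, and the example parameters at the bottom are assumptions echoing values quoted later in the chapter.

import math

def trinomial_option(S0, K, T, r, sigma, n, delta=0.0, kind="american_put"):
    """Price an option on an n-step trinomial lattice (Equations (9)-(11))."""
    dt = T / n
    v = r - delta - 0.5 * sigma ** 2
    dx = sigma * math.sqrt(3.0 * dt)
    a = (sigma ** 2 * dt + v ** 2 * dt ** 2) / dx ** 2
    b = v * dt / dx
    pu, pm, pd = 0.5 * (a + b), 1.0 - a, 0.5 * (a - b)
    disc = math.exp(-r * dt)

    def price(i, j):
        # Node (i, j): j steps down from the top of level i shifts ln S by (i - j) * dx.
        return S0 * math.exp((i - j) * dx)

    # Payoffs at maturity (level n has 2n + 1 nodes).
    if kind == "american_put":
        values = [max(K - price(n, j), 0.0) for j in range(2 * n + 1)]
    else:  # European call
        values = [max(price(n, j) - K, 0.0) for j in range(2 * n + 1)]

    # Discounted expectations back through the lattice.
    for i in range(n - 1, -1, -1):
        nxt = values
        values = []
        for j in range(2 * i + 1):
            cont = disc * (pu * nxt[j] + pm * nxt[j + 1] + pd * nxt[j + 2])
            if kind == "american_put":
                cont = max(cont, K - price(i, j))   # early exercise, Equation (10)
            values.append(cont)
    return values[0]

# Assumed example values echoing those used later in the chapter.
print(trinomial_option(S0=0.80, K=0.70, T=0.5, r=0.06, sigma=0.2, n=4, kind="european_call"))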
Starting from the payoff values Cn,j, we apply Equations (9), and (10) or (11), at every time step and node of the trinomial-tree to obtain the option price, down to the root value C0,0. We now model grid resources based on the transient availability3 of the grid compute cycles, the availability of compute cycles, and the volatility of the prices associated with the compute cycles. Given a maturity date t and the expectation of the risk-neutral value Ê, the future price F(t) of a contract on grid resources can be expressed as (see for example (Hull, 2006)):
F(t) = Ê[S(t)] = S(0) e^(∫0^t μ(τ) dτ)
(12)
Consider a trinomial model (see e.g., (Hull, 2006), (Cox et al., 1979)) of the asset price: in a small time interval Δt, the asset price increases by Δx, remains the same, or decreases by Δx, with probabilities pu (up movement), pm (steady move, staying at the middle), and pd (downward movement). Figure 1 shows a one-step trinomial tree and Figure 2(b) shows a multi-step trinomial tree.
GRID COMPUTE RESOURCES PRICING Consider some grids Gi = g1, g2, …, gn and the compute commodities that exist in those grids, CCi = cc1, cc2, …, ccm. Suppose we have set base prices (some assumed base values) Pi = p1, p2, …, pn; then we can set up a Grid Resources Utilization and Pricing (GRUP) matrix. For the grid resources utilization of several grids and several resources, we have:
GRUP =
p1^(g1)cc1   p1^(g1)cc2   …   p1^(g1)ccm
p2^(g2)cc1   p2^(g2)cc2   …   p2^(g2)ccm
…
pn^(gn)cc1   pn^(gn)cc2   …   pn^(gn)ccm
(13)
where each occurrence in Equation (13) is a trinomial tree that represents the price of a grid compute commodity. At each level l = 0, 1, …, n − 1, a solution for the best exercise is required; therefore each occurrence, with nodes j = 1, 2, …, (2l + 1), requires large computational resources of the grid because of its large size. In other words, the problem of finding prices of grid resources is itself large and would require a large amount of grid computing power. To price the multi-resources system, we suppose a real option depends on some other variables such as the expected growth rate gccμ and the volatility gccσ respectively. Then if we let
dgcci / gcci = gccμ dt + gccσ dzi
(14)
for any number of derivatives of gcc such as (gcc1, gcc2, …, gccn)
with prices (p1, p2, …, pn) respectively, we have: d ln S = dpi / pi = μi dt + σi dz
(15)
where the variables gcci = {the set of resources}. Applying the price variant factor pf for pricing options, we have:
d ln S = [gcc(t) − pf ln S] dt + [stochastic term]
(16)
Where σdz is called the stochastic term. The strength of the pf is determined by the value of its membership function (high for pf > 0). For a multi-asset problem, we have: d ln Si = [gcci(t) − pf ln Si] dt + σi dzi |i=1,2,…,n
(17)
The value of gcc(t) is determined such that F(t) = Ê[S(t)], i.e., the expected value of S is equal to the future price. A scenario similar to what we may get is a user who suspects that he might need more compute cycles (bandwidth) in 3, 6, and 9 months from today and therefore decides to pay some amount $s upfront to hold a position for the expected increase. We illustrate this process using a 3-step trinomial process. Suppose the spot price for bandwidth is $sT per bit per second (bps) and the projected 3, 6, and 9 month future prices are $s1, $s2, and $s3 respectively. In this scenario, the two uncertainties are the amount of bandwidth that will be available and the price per bit. However, we can obtain an estimate for the stochastic process for bandwidth prices by substituting some reasonably assumed values of pf and σ (e.g., pf = 10%, σ = 20%) in Equation (16) and obtain the value of S from Equation (17). Suppose Vl,j represents the option value at level l, for l = 0, 1, …, n − 1, and node j, for j = 1, 2, …, (2l + 1) (for a trinomial lattice only); i.e., V1,1 represents the option value at level 1 and at pu. Similarly, in our simulation, using the base price values that we assume, we obtain option values for the trinomial tree at time steps of 2, 4, 8, and 16.
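As a rough illustration of Equation (16), the sketch below Euler-discretizes the log-price process with the assumed values pf = 10% and σ = 20% mentioned above; the constant level gcc_level and the starting price are placeholders, since gcc(t) is not specified numerically here.

import math
import random

def simulate_log_price(S0, gcc_level, pf, sigma, T, steps, seed=1):
    """Euler simulation of d ln S = [gcc(t) - pf * ln S] dt + sigma dz (Equation (16)),
    treating gcc(t) as a constant level for simplicity."""
    random.seed(seed)
    dt = T / steps
    x = math.log(S0)
    path = [S0]
    for _ in range(steps):
        dz = random.gauss(0.0, math.sqrt(dt))        # Brownian increment
        x += (gcc_level - pf * x) * dt + sigma * dz  # drift plus stochastic term
        path.append(math.exp(x))
    return path

# Assumed demonstration values: a 9-month horizon split into three 3-month steps.
print(simulate_log_price(S0=1.0, gcc_level=0.05, pf=0.10, sigma=0.20, T=0.75, steps=3))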
Fuzzy Logic Framework We express the value of the gcc flexibility opportunities as: gcc: tut = tn
(18)
where tn denotes the time-dimensional space, given as 0 ≤ tn ≤ 1, and tut describes the corresponding utilization time. If tn = 0, gcc usage is “now” or “today”; if tn = 1, gcc has a usage flexibility opportunity for “the future”, where the future is not to exceed 6 months (say). Users often request and utilize gcc at extremely high computing power but only for a short time, for tut = tn ≈ 0. Therefore, disbursing the gcc on-demand and satisfying users’ Quality of Service (QoS) requires that the distributed resources be over-committed or under-committed (for tn = 1 or 0 respectively) in order to satisfy the conditions specified in the Service Level Agreement (SLA) document. Such extreme conditions (for example, holding gcc over a long time) incur some cost in the form of storage. Therefore, we express utilization time tn as a membership function of a fuzzy set T. A fuzzy set is defined (see for example (Bojadziew & Bojadziew, 1997)) as: T = {(t, μT(t)) | t ∈ T, μT(t) ∈ [0, 1]}.
(19)
Thus, given that T is a fuzzy set in a time domain (the time-dimensional space), then μT(tn) is called the membership function of the fuzzy set T which specifies the degree of membership (between 0 and 1) to which tn belongs to the fuzzy set T. We express the triangular fuzzy membership function as follows:
Figure 3. SHARCNET: Used memory vs. number of jobs
μT(tn) =
  1,                  for x = b
  (x − a)/(b − a),    for a ≤ x ≤ b
  (c − x)/(c − b),    for b ≤ x ≤ c
  0,                  otherwise, i.e., if x ∉ [a, c]
(20)
Where [a, c] is called the universe of discourse, or the entire life of the option. Therefore, for every gcc at utilization time tn, the availability of the gcc, expressed as a membership function value, is compared to the stated QoS conditions given in the SLA document. An SLA document (Pantry & Griffiths, 1997) describes the agreed-upon services provided by an application system to ensure that it is reliable, secure, and available to meet the needs of the business it supports. The SLA document consists of the technicalities and requirements specific to service provisioning, e.g., the expected processor cycles, QoS parameters, some legal and financial aspects such as violation penalties, and utilization charges for resource use. The implication of a service constraint that guarantees QoS and meets the specified SLA conditions within a set of intermittently available gcc is a system that compromises the basic underlying design objective of the grid as a commercial computing service resource (Yeo & Buyya, 2007). Therefore, Equation (18) becomes: gcc: tut = tn |QoS ≈ SLA
(21)
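A small sketch of the triangular membership function of Equation (20); the breakpoints a, b, and c are placeholders for the universe of discourse chosen by the modeller, not values prescribed by the chapter.

def triangular_membership(x, a, b, c):
    """Triangular fuzzy membership (Equation (20)): 0 outside [a, c], peaking at b."""
    if x < a or x > c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)    # rising edge, reaching 1 at x = b
    return (c - x) / (c - b)        # falling edge back to 0 at x = c

# Example: utilization time tn on [0, 1] with full membership at tn = 0.5 (assumed values).
for tn in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(tn, triangular_membership(tn, a=0.0, b=0.5, c=1.0))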
To satisfy QoS-SLA requirements, we evaluate existing grid utilization behavior from utilization traces. Based on the observed values of resource demands from the utilization traces, we obtain results from the SHARCNET traces and observe utilization for memory and CPU time. To price the gcc-s, we run the trinomial lattice using the following model parameters: for example, for a one-step trinomial tree we use K = $0.70, S = $0.80, T = 0.5, r = 0.06, σ = 0.2, and Nj = 2N + 1. We extend our study by varying the volatility σ and the number of time steps N = 4, 8, 16, 24. For a 6 month contract, for example, N = 3 would mean a 2 month step size and N = 12 would mean a 2 week step size. Unlike stock prices, we need not go for very small step sizes. The relationships between used CPU time and memory in the grid and the number of jobs requesting their use are depicted in Figures 2 and 3 respectively for SHARCNET. The trace analysis shows
Figure 4. Option value for RAM—in-the-money
that SHARCNET has a symmetrically skewed effect of CPU time on the number of jobs served by the grid. Although SHARCNET delivers a larger proportion of jobs, it experiences sharp drops in the number of jobs it serves. The times/dates of low CPU availability are due to waste, waiting, or priority jobs served by the grid, or any combination of these. If we compare the CPU usage characteristics displayed by SHARCNET in Figure 2 and the memory utilization in Figure 3, we observe that a particular application (such as those involving signal processing/image rendering) which requires high CPU as well as high memory from the grid will not necessarily run optimally. If this application is run on the SHARCNET grid (for example), it would run using sufficient CPU but under a depleted memory condition. In our experiments, we simulate the grid compute commodities (gcc) and monitor users’ requests for utilization. For a call option, we simulate the effects of time on exercising the option to use one of the gcc-s such as memory (RAM), hard disk (HD), and CPU. We start with memory (one of the gcc-s) using the following parameters: S = $6.849 × 10–7, T = 0.5, r = 0.06, N = 4, 8, 16, 24, σ = 0.2, and Nj = 2N + 1; we vary K such that we can have in-the-money and out-of-the-money conditions. These values reflect the market value of this raw infrastructure, in general. We are not certain about the type of RAM available in the example grids. However, one can easily map the above parametric values to correspond to the infrastructure available in the grids. This is true for other gcc-s such as CPU and hard disk discussed later. We obtain option values and study the variation over several step sizes to determine the effects of the fluctuations (uncertainty) that exist between the total period of the option contract and the time of exercise on option value. Figure 4 shows an in-the-money option value for RAM. It shows an option value that increases with the number of time steps. As the number of steps grows, the option value reaches a steady state. The actual value indicates that entering the contract is beneficial for the user while still generating reasonable revenue for the grid provider. This is an indication that at any given time, a user’s actual cost of utilizing the grid resources is the sum of the base cost and an additional cost which depends on the time of utilization of the gcc. However, for an equilibrium service-profit system, we impose a price modulation factor, a price variant factor called pf (see Section 3.2). The value of the pf depends on changes/variations in the technology
Figure 5. Execution time for various commodities
or architecture of the grid infrastructure. These variations are unknown prior to exercising the options to hold the use of a grid resource, and hence determining the exact price of gcc in real life is uncertain and hard to predict. Therefore, to maximize gcc utilization (ut), with more computing facilities and the same technology we set the value of pf(ut) to 0.1, and with new technology pf = 1.0. The fuzzified boundary of pf is constructed as pf(ut) = [0.1, 1.0] to facilitate fuzzification. Our model, therefore, adjusts the price of the use of grid resources by (pf(ut))–1 (for the grid operator) while providing quality service. For example, applying pf reverses an unprofitable late exercise of an out-of-the-money option value to an early exercise of an in-the-money option value in 10% adjustments. Figure 11 shows a corresponding out-of-the-money option value for CPU. Similarly, we obtain from our simulation the option values for both in-the-money and out-of-the-money conditions for CPU using the parameters S = $68.49 and K = $68.47 and $80.47 (all values scaled at ×10–6), simulated for varying time steps of 4, 8, 16 and 24; these results are not shown. We repeat this for the various grids and for the various gcc-s, first individually and then using a combination of the individual gcc-s. Figure 5 shows the execution time for HD, CPU, and RAM at various time steps. From the figures of option values we observed that by 24 steps the option value reaches a steady state, and hence we did not experiment beyond 24 steps. Since the number of nodes to be computed increases with the number of steps, the time required to reach a steady state in option value also increases, as shown in Figure 5. Our interest in the design and development of an equilibrium service-profit grid resources pricing model is centered in particular on levels where resource utilization in the grid shows depleted values and does not sufficiently provide the service quality necessary to guarantee a user high QoS. The depleted resources (from the traces) are the memory utilization levels in SHARCNET. In these circumstances, a user’s QoS must be guaranteed. We use our price varying factor pf discussed earlier to modulate the
effective gcc prices by awarding incentives in the form of dividends to users who require composite resources.
CONCLUSION AND FUTURE WORK We use the behavior of the grid resources utilization patterns observed from the traces to develop a novel pricing model as a real option problem. Our two important contributions are: (1) option value determination for grid resources utilization, and the determination of the best point of exercise of the option to utilize any of the grid resources; this helps the user as well as the grid operator to optimize resources for profitability; (2) our study also incorporates a price variant factor, which controls the price of the resources and ensures that at any time the grid users get the maximum value at the best prices and that the operators also generate reasonable revenue at the current base spot price settings. Our future work will focus on the larger problem of pricing grid resources for applications that utilize heterogeneous resources across heterogeneous grids and cloud computing. For example, if an application requires memory in one grid and CPU time from another grid simultaneously, then we will have to deal with a more complex, computationally intensive, and multi-dimensional option pricing problem. This would require a more complex optimization of the solution space of the grid resources utilization matrix as well as determining the best node (time) to exercise the option.
REFERENCES
Black, F., & Scholes, M. (1973). The Pricing of Options and Corporate Liabilities. The Journal of Political Economy, 81(3). doi:10.1086/260062
Bojadziew, G., & Bojadziew, M. (1997). Fuzzy Logic for Business, Finance, and Management Modeling (2nd Ed.). Singapore: World Scientific Press.
Boyle, P. P. (1986). Option Valuation Using a Three Jump Process. International Options Journal, 3(2).
Buyya, R., Abramson, D., & Venugopal, S. (2005). The Grid Economy. Proceedings of the IEEE, 93(3).
Buyya, R., Giddy, J., & Abramson, D. (2000). An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications. Proceedings of the 2nd Workshop on Active Middleware Services, Pittsburgh, PA.
Carlsson, C., & Fullér, R. (2003). A Fuzzy Approach to Real Option Valuation. Fuzzy Sets and Systems, 139.
Cox, J. C., Ross, S., & Rubinstein, M. (1979). Option Pricing: A Simplified Approach. Journal of Financial Economics, 7(3).
Foster, I., & Kesselman, C. (1999). The Grid: Blueprint for a New Computing Infrastructure. San Francisco: Morgan Kaufmann Publishers, Inc.
Foster, I., Kesselman, C., Tsudik, G., & Tuecke, S. (1998). A Security Architecture for Computational Grids. ACM Conference on Computer and Communications Security.
Gray, A. A., Arabshahi, P., Lamassoure, E., Okino, C., & Andringa, J. (2004). A Real Option Framework for Space Mission Design. Technical report, National Aeronautics and Space Administration (NASA).
Hull, J. C. (2006). Options, Futures, and Other Derivatives (6th Edition). Upper Saddle River, NJ: Prentice Hall.
Krishna, V., & Perry, M. (2007). Efficient Mechanism Design.
Merton, R. C. (1973). Theory of Rational Option Pricing. The Bell Journal of Economics and Management Science, 4(1). doi:10.2307/3003143
Mutz, A., Wolski, R., & Brevik, J. (2007). Eliciting Honest Value Information in a Batch-Queue Environment. In The 8th IEEE/ACM International Conference on Grid Computing (Grid 2007), Austin, Texas, USA.
Palankar, M., Onibokun, A., Iamnitchi, A., & Ripeanu, M. (2007). Amazon S3 for Science Grids: A Viable Solution? Poster: 4th USENIX Symposium on Networked Systems Design and Implementation (NSDI’07).
Pantry, S., & Griffiths, P. (1997). The Complete Guide to Preparing and Implementing Service Level Agreements (1st Ed.). London: Library Association Publishing.
Schiffmann, W., Sulistio, A., & Buyya, R. (2007). Using Revenue Management to Determine Pricing of Reservations. Proc. 3rd International Conference on e-Science and Grid Computing (eScience 2007), Bangalore, India, December 10-13.
SHARCNET. (2008). Shared Hierarchical Academic Research Computing Network (SHARCNET).
Thulasiram, R. K., Litov, L., Nojumi, H., Downing, C. T., & Gao, G. R. (2001). Multithreaded Algorithms for Pricing a Class of Complex Options. Proceedings (CD-ROM) of the International Parallel and Distributed Processing Symposium (IPDPS), San Francisco, CA.
Yeo, C. S., & Buyya, R. (2007). Integrated Risk Analysis for a Commercial Computing Service. Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), IEEE CS Press, Los Alamitos, CA, USA.
KEY TERMS AND DEFINITIONS
Distributed Computing: Grid resources as they relate to geographical regions, which is a factor in terms of availability and computability.
Fuzzy Support for QoS: A decision support system based on managing the uncertainties associated with grid resource availability.
Grid Computing: A computing grid is a system that delivers the processing power of a massively parallel computation and facilitates the deployment of resource-intensive applications.
Price Adjustments: A control/feedback structure that modulates grid resource prices with the specific objective of benefiting users and grid operators; its value depends on current technology or market trends.
Real Option Model: A mathematical framework, similar to financial options but characterized by uncertainty in decision flexibility in a known future, for determining project viabilities.
Resource Management: This refers to the provision of grid resources to users at the time of requested utilization.
Resource Pricing: A fair share of the grid resources that depends highly on availability (monitored by the price variant factor) rather than on the market forces of demand and supply.
ENDNOTES
1. QoS describes a user’s perception of a service relative to a set of predefined service conditions contained in a Service Level Agreement (SLA) that is necessary to achieve a user-desired service quality.
2. An SLA (Pantry & Griffiths, 1997) is a legal contract in which a resource provider (say a grid operator) agrees to deliver an acceptable minimum level of QoS to the users.
3. A reserved quantity at a certain time (tn–1) may be unavailable at tn.
Chapter 22
The State of the Art and Open Problems in Data Replication in Grid Environments
Mohammad Shorfuzzaman, University of Manitoba, Canada
Rasit Eskicioglu, University of Manitoba, Canada
Peter Graham, University of Manitoba, Canada
ABSTRACT Data Grids provide services and infrastructure for distributed data-intensive applications that need to access, transfer and modify massive datasets stored at distributed locations around the world. For example, the next-generation of scientific applications such as many in high-energy physics, molecular modeling, and earth sciences will involve large collections of data created from simulations or experiments. The size of these data collections is expected to be of multi-terabyte or even petabyte scale in many applications. Ensuring efficient, reliable, secure and fast access to such large data is hindered by the high latencies of the Internet. The need to manage and access multiple petabytes of data in Grid environments, as well as to ensure data availability and access optimization are challenges that must be addressed. To improve data access efficiency, data can be replicated at multiple locations so that a user can access the data from a site near where it will be processed. In addition to the reduction of data access time, replication in Data Grids also uses network and storage resources more efficiently. In this chapter, the state of current research on data replication and arising challenges for the new generation of data-intensive grid environments are reviewed and open problems are identified. First, fundamental data replication strategies are reviewed which offer high data availability, low bandwidth consumption, increased fault tolerance, and improved scalability of the overall system. Then, specific algorithms for selecting appropriate replicas and maintaining replica consistency are discussed. The impact of data replication on job scheduling performance in Data Grids is also analyzed. A set of appropriate metrics including access latency, bandwidth savings, server load, and storage overhead for use in making critical
DOI: 10.4018/978-1-60566-661-7.ch022
comparisons of various data replication techniques is also discussed. Overall, this chapter provides a comprehensive study of replication techniques in Data Grids that not only serves as a tool for understanding this evolving research area but also provides a reference to which future efforts may be mapped.
INTRODUCTION The popularity of the Internet as well as the availability of powerful computers and high-speed network technologies is changing the way we use computers today. These technology opportunities have also led to the possibility of using distributed computers as a single, unified computing resource, leading to what is popularly known as Grid Computing (Kesselman & Foster, 1998). Grids enable the sharing, selection, and aggregation of a wide variety of resources including supercomputers, storage systems, data sources, and specialized devices that are geographically distributed and owned by different organizations for solving large-scale computational and data intensive problems in science, engineering, and commerce (Venugopal, Buyya, & Ramamohanarao, 2006). Data Grids deal with providing services and infrastructure for distributed data-intensive applications that need to access, transfer and modify massive datasets stored across distributed storage resources. For example, scientists working in areas as diverse as high energy physics, bioinformatics, and earth observations need to access large amounts of data. The size of these data is expected to be terabyte or even petabyte scale for some applications. Maintaining a local copy of data on each site that needs the data is extremely expensive. Also, storing such huge amounts of data in a centralized manner is almost impossible due to extensively increased data access time. Given the high latency of wide-area networks that underlie many Grid systems, and the need to access or manage several petabytes of data in Grid environments, data availability and access optimization are key challenges to be addressed. An important technique to speed up data access for Data Grid systems is to replicate the data in multiple locations, so that a user can access the data from a site in his vicinity (Venugopal et al., 2006). Data replication not only reduces access costs, but also increases data availability for most applications. Experience from parallel and distributed systems design shows that replication promotes high data availability, lower bandwidth consumption, increased fault tolerance, and improved scalability. However, the replication algorithms used in such systems cannot always be directly applied to Data Grid systems due to the wide-area (mostly hierarchical) network structures and special data access patterns in Data Grid systems that differ from traditional parallel systems. In this chapter, the state of the current research on data replication and its challenges for the new generation of data-intensive grid environments are reviewed and open problems are discussed. First, different data replication strategies are introduced that offer efficient replica1 placement in Data Grid systems. Then, various algorithms for selecting appropriate replicas and maintaining replica consistency are discussed. The impact of data replication on job scheduling performance in Data Grids is also investigated. The main objective of this chapter, therefore, is to provide a basis for categorizing present and future developments in the area of replication in Data Grid systems. This chapter also aims to provide an understanding of the essential concepts of this evolving research area and to identify important and outstanding issues for further investigation.
The remainder of this chapter is organized as follows. First, an overview of the data replication problem is presented, describing the key issues involved in data replication. In the following section, progress made to date in the area of replication in Data Grid systems is reviewed. Following this, a critical comparison of data placement strategies, probably the core issue affecting replication efficiency in Data Grids, is provided. A summary is then given and some open research issues are identified.
OVERVIEW OF REPLICATION IN DATA GRID SYSTEMS The efficient management of huge distributed and shared data resources across Wide Area Networks (WANs) is a significant challenge for both scientific research and commercial applications. The Data Grid as a specialization and extension of the Grid (Baker, Buyya, & Laforenza, 2006) provides a solution to this problem. Essentially, Data Grids (Chervenak, Foster, Kesselman, Salisbury, & Tuecke, 2000) deal with providing services and infrastructure for distributed data-intensive applications that need to access, transfer and modify massive datasets stored in distributed storage resources. At the minimum, a Data Grid provides two basic functions: a high-performance, reliable data transfer mechanism, and a scalable replica discovery and management mechanism. Depending on application requirements, other services may also be needed (e.g. security, accounting, etc.). Grid systems typically involve loosely coupled jobs that require access to a large number of datasets. Such a large volume of datasets has posed a challenging problem in how to make the data more easily and efficiently available to the users of the systems. In most situations, the datasets requested by a user’s job cannot be found at the local nodes in the Data Grid. In this case, data must be fetched from other nodes in the grid which causes high access latency due to the size of the datasets and the wide-area nature of the network that underlies most grid systems. As a result, job execution time can become very high due to the delay of fetching data (often over the Internet). Replication (Ranganathan & Foster, 2001b) of data is the most common solution used to address access latency in Data Grid systems. Replication results in the creation of copies of data files at many different sites in the Data Grid. Replication of data has been demonstrated to be a practical and efficient method to achieve high network performance in distributed environments, and it has been applied widely in the areas of distributed databases and some Internet applications (Ranganathan & Foster, 2001b; Chervenak et al., 2000). Creating replicas can effectively reroute client requests to different replica sites and offer remarkably higher access speed than a single server. At the same time, the workload of the original server is distributed across the replica servers and, therefore, decreases significantly. Additionally, the network load is also distributed across multiple network paths thereby decreasing the probability of congestion related performance degradation. In these ways, replication plays a key role in improving the performance of data-intensive computing in Data Grids.
The Replication Process and Components The use of replication in Data Grid systems speeds up data access by replicating the data at multiple locations so that a user can access data from a site in his vicinity (Venugopal et al., 2006). Replication of data, therefore, aims to reduce both access latency and bandwidth consumption. Replication can also help in server load balancing and can enhance reliability by creating multiple copies of the same data. Replication is, of course, limited by the amount of storage available at each site in the Data Grid and
Figure 1. A replica management architecture
by the bandwidth available between those sites. A replica management system, therefore, must ensure access to the required data while managing the underlying storage and network resources. A replica management system, shown in Figure 1, consists of storage nodes that are linked to each other via high-performance data transport protocols. The replica manager directs the creation and management of replicas according to the demands of the users and the availability of storage, and a catalog (or directory) keeps track of the replicas and their locations. The catalog can be queried by applications to discover the number and locations of available replicas of a given file.
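As a rough illustration of the catalog component just described, the fragment below maps logical file names to the sites holding replicas; the class and method names are hypothetical and do not correspond to any particular Data Grid middleware.

from collections import defaultdict

class ReplicaCatalog:
    """Hypothetical catalog mapping a logical file name to the sites holding replicas."""

    def __init__(self):
        self._locations = defaultdict(set)   # logical name -> set of site identifiers

    def register(self, logical_name, site):
        self._locations[logical_name].add(site)

    def unregister(self, logical_name, site):
        self._locations[logical_name].discard(site)

    def lookup(self, logical_name):
        """Return the sites that currently hold a replica of the given file."""
        return sorted(self._locations[logical_name])

catalog = ReplicaCatalog()
catalog.register("lfn://experiment/run42.dat", "site-A")
catalog.register("lfn://experiment/run42.dat", "site-B")
print(catalog.lookup("lfn://experiment/run42.dat"))   # ['site-A', 'site-B']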
Issues and Challenges in Data Replication Although the necessity of replication in Data Grid systems is evident, its implementation entails several issues and challenges such as selecting suitable replicas, maintaining replica consistency, and so on. The following fundamental issues are identified:
a. Strategic placement of replicas is needed to obtain maximum gains from replication according to the objectives of applications.
b. The degree of replication must be selected to require the minimum number of replicas without reducing the performance of applications.
c. Replica selection identifies the replica that best matches the user's quality of service (QoS) requirements and, perhaps, achieves one or more system-wide management objectives.
d. Replica consistency management ensures that the multiple copies (i.e., replicas) of a given file are kept consistent in the presence of multiple concurrent updates.
e. The impact of replication on the performance of job scheduling must also be considered.
Figure 2. Taxonomy of the issues in data replication
Figure 2 presents a visual taxonomy of these issues which will be used in the next subsections.
Replica Placement Although data replication is one of the major optimization techniques for promoting high data availability, low bandwidth consumption, increased fault tolerance, and improved scalability, the problem of replica placement has not been well studied for large-scale Grid environments. To obtain the maximum possible gains from file replication, strategic placement of the file replicas in the system is critical. The replica placement service is the component of a Data Grid architecture that decides where in the system a file replica should be placed. The overall file replication problem consists of making the following decisions (Ranganathan & Foster, 2001b): (1) which files should be replicated; (2) when and how many replicas should be created; (3) where these replicas should be placed in the system. Replication methods can be classified as static or dynamic (M. Tang, Lee, Yeo, & Tang, 2005). For static replication, after a replica is created, it will exist in the same place until it is deleted manually by users or its “replica duration” expires. The drawback of static replication is evident – when client access patterns change greatly in the Data Grid, the benefits brought by replicas will decrease sharply. In contrast, dynamic replication takes into consideration changes in the Data Grid environment and automatically creates new replicas for popular data files or moves the replicas to other sites when necessary to improve performance.
Replica Selection A system that includes replicas also requires a mechanism for selecting and locating them at file access time. Choosing and accessing appropriate replicas is very important to optimize the use of grid resources. A replica selection service discovers the available replicas and selects the “best” replica given the user’s location and quality of service (QoS) requirements. Typical QoS requirements when doing replica selection might include access time as well as location, security, cost, and other constraints. The replica selection problem can be divided into two sub-problems (Rahman, Barker, & Alhajj, 2005): 1) discovering the physical location(s) of a file given a logical file name, and 2) selecting the best replica from a set based on some selection criteria. Network performance can play a major role when selecting a replica. Slow network access limits the efficiency of data transfer regardless of client and server implementation. One optimization technique to select the best replica from different physical locations is by examining the available (or predicted)
bandwidth between the requesting computing element and the various storage elements that hold replicas. The best site, in this case, is the one that has the minimum transfer time required to transport the replica to the requesting site. Although network bandwidth plays a major role in selecting the best replica, other factors including additional characteristics of data transfer (most notably, latency), replica host load, and disk I/O performance are important as well.
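A minimal sketch of the bandwidth-based selection heuristic described above: each candidate site is scored by an estimated transfer time built from predicted bandwidth and latency, and the site with the smallest estimate is chosen. The site names and numbers are illustrative assumptions only.

def estimated_transfer_time(file_size_mb, bandwidth_mbps, latency_s):
    # Transfer-time estimate: startup latency plus file size divided by predicted bandwidth.
    return latency_s + (file_size_mb * 8.0) / bandwidth_mbps

def select_best_replica(file_size_mb, candidates):
    """Pick the replica site with the lowest estimated transfer time."""
    return min(
        candidates,
        key=lambda c: estimated_transfer_time(file_size_mb, c["bandwidth_mbps"], c["latency_s"]),
    )

# Illustrative candidate sites assumed to hold the requested replica.
sites = [
    {"site": "site-A", "bandwidth_mbps": 100.0, "latency_s": 0.05},
    {"site": "site-B", "bandwidth_mbps": 500.0, "latency_s": 0.20},
]
print(select_best_replica(file_size_mb=2048, candidates=sites)["site"])   # site-B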
Replica Consistency Consistency and synchronization problems associated with replication in Data Grid systems are not well addressed in the existing research, with files often being regarded as read-only. However, as grid solutions are increasingly used by a number of applications, requirements will arise for mechanisms that maintain the consistency of replicated data that can change over time. The replica consistency problem deals with concurrent updates made to multiple replicas of a file. When one file is updated, all other replicas then have to have the same contents and thus provide a consistent view. Consistency therefore requires some form of concurrency control. Replica consistency is a traditional issue in distributed systems, but it introduces new problems in Data Grid systems. Traditional consistency implementations such as invalidation protocols, distributed locking mechanisms, atomic operations, and two-phase commit protocols are not necessarily suitable for Data Grid environments because of the long delays introduced by the use of a wide-area network and the high degree of autonomy of Data Grid resources (Domenici, Donno, Pucciani, Stockinger, & Stockinger, 2004). For example, in a Data Grid, the replicas of a file may be distributed over different countries. So, if one node which holds a replica is not available when the update operation is performed, the whole updating process could fail.
The Impact of Data Replication on Job Scheduling Dealing with the large number of data files that are geographically distributed causes many challenges in a Data Grid. One that is not commonly considered is scheduling jobs to take data location into account when determining job placement. The locations of data required by a job clearly impact grid scheduling decisions and performance (M. Tang, Lee, Yeo, & Tang, 2006). Traditional job schedulers for grid systems are responsible for assigning incoming jobs to compute nodes in such a way that some evaluative conditions are met, such as the minimization of the overall execution time of the jobs or the maximisation of throughput or utilisation. Such systems generally take into consideration the availability of compute cycles, job queue lengths, and expected job execution times, but they typically do not consider the location of data required by the jobs. Indeed, the impact of data and replication management on job scheduling behaviour has largely remained unstudied. Data intensive applications such as High Energy Physics and Bioinformatics require both Computational Grid and Data Grid features. Performance improvements for these applications can be achieved by using a Computational Grid that provides a large number of processors and a Data Grid that provides efficient data transport and data replication mechanisms. In such environments, effective resource scheduling is a challenge. One must consider not only the abundance of computational resources but also data locations. A site that has enough available processors may not be the optimal choice for computation if it doesn't have the required data nearby. (Allocated processors might wait a long time to access the remote data.) Similarly, a site with local copies of required data is not a good place to compute if it doesn't have
Figure 3. Taxonomy of the replica placement algorithms
adequate computational resources. An effective scheduling mechanism is required that will allow the shortest access to the required data, thereby reducing the data access time. Since creating data replicas can significantly reduce the data access cost, a tighter integration of job scheduling and automated data replication can bring substantial improvement in job execution performance.
DATA REPLICATION: STATE OF THE ART As mentioned earlier, data replication becomes more challenging because of some unique characteristics of Data Grid systems. This section surveys existing replication strategies in Data Grids and the issues involved in replication that will form a basis for future discussion of open issues in the next section.
Replica Placement Strategies With the high latency of wide-area networks that underlies most Grid systems, and the need to access and manage multiple petabytes of data, data availability and access optimization become key challenges to be addressed. Hence, most of the existing replica placement algorithms focus on at least two types of objective functions for placing replicas in Data Grid systems. The first type of replica placement strategy looks towards decreasing the data access latency and the network bandwidth consumption. The other type of replica placement strategy focuses on how to improve system reliability and availability. Figure 3 shows a taxonomy of the replica placement algorithms based on the realized objective functions together with references to papers in each category.
Algorithms Focusing on Access Latency and Bandwidth Consumption Ranganathan and Foster (Ranganathan & Foster, 2001b, 2001a) present and evaluate different replication strategies for a hierarchical Data Grid architecture. These strategies are defined depending on when, where, and how replicas are created and destroyed in a hierarchically structured grid environment. They test six different replication strategies: 1) No Replication: only the root node holds replicas; 2) Best Client: a replica is created for the client who accesses the file the most; 3) Cascading: a replica is created on the path from the root node to the best client; 4) Plain Caching: a local copy is stored upon initial request; 5) Caching plus Cascading: combines plain caching and cascading; 6) Fast Spread: file copies are stored at each node on the path from the root to the best client. They show that the cascading strategy reduces response time by 30% over plain caching when data access patterns contain both temporal and geographical locality. When access patterns contain some locality, Fast Spread saves significant bandwidth over other strategies. These replication algorithms assume that popular files at one site are also popular at others. The client site counts hops to each site that holds replicas, and the model selects the site that is the least number of hops from the requesting client; but it does not consider current network bandwidth and also limits the model to a hierarchical grid. The proposed replication algorithms can be refined so that the time interval and threshold of replication change automatically based on user behaviour.
Lamehamedi et al. (Lamehamedi, Szymanski, Shentu, & Deelman, 2002; Lamehamedi, Szymanski, Shentu, & Deelman, 2003) study replication strategies where the replica sites can be arranged in different topologies such as a ring, tree or hybrid. Each site or node maintains an index of the replicas it hosts and the other locations that it knows about that host replicas of the same files. Replication decisions are made based on a cost model that evaluates both the data access costs and performance gains of creating each replica. The estimation of costs and gains is based on factors such as run-time accumulated read/write statistics, response time, bandwidth, and replica size. The replication strategy places a replica at a site that minimises the total access costs including both read and write costs for the datasets. The write cost considers the cost of updating all the replicas after a write at one of the replicas. They show via simulation that the best results are achieved when the replication process is carried out closest to the users.
Bell et al. (W. H. Bell et al., 2003) present a file replication strategy based on an economic model that optimises the selection of sites for creating replicas. Replication is triggered based on the number of requests received for a dataset. Access mediators receive these requests and start auctions to determine the cheapest replicas. A Storage Broker (SB) participates in these auctions by offering a “price” at which it will sell access to a replica if it is available. If the replica is not available at the local storage site, then the broker starts an auction to replicate the requested file onto its storage if it determines that having the dataset is economically feasible. Other SBs then bid with the lowest prices that they can offer for the file. The lowest bidder wins the auction but is paid the amount bid by the second-lowest bidder.
In subsequent research, Bell et al. (W. Bell et al., 2003) describe the design and implementation of a Grid simulator, OptorSim. In particular, OptorSim allows the analysis of various replication algorithms. The goal is to evaluate the impact of the choice of an algorithm on the throughput of typical grid jobs. The authors implemented a simple remote access heuristic and two traditional cache replacement algorithms (oldest file deletion and least accessed file deletion). Their simulation was constructed assuming that the grid consists of several sites, each of which may provide computational and data-storage resources for submitted jobs. Each site consists of zero or more Computing Elements and zero or more Storage Elements. Computing Elements run jobs, which use the data in files stored on Storage Elements. A Resource Broker controls the scheduling of jobs to Computing Elements.
Figure 4. An example of the history and node relations
Sites without Storage or Computing Elements act as network routing nodes. Various algorithms were compared to a novel algorithm (W. H. Bell et al., 2003) based on an economic model. The comparison was based on several grid scenarios with various workloads. The results obtained from OptorSim suggest that the economic model performs at least as well as traditional methods. However, the economic model shows marked performance improvements over other algorithms when data access patterns are sequential.
Sang-Min Park et al. (Park, Kim, Ko, & Yoon, 2003) propose a dynamic replication strategy, called BHR (Bandwidth Hierarchy based Replication), to reduce data access time by avoiding network congestion in a Data-Grid network. The BHR algorithm benefits from 'network-level locality', which indicates that the required file is located at the site that has the broadest bandwidth to the site of the job's execution. In Data Grids, some sites may be located within a region where sites are linked closely. For instance, a country or province/state might constitute a network region. Network bandwidth between sites within a region will be broader than bandwidth between sites across regions. That is, a hierarchy of network bandwidth may appear in the Internet. If the required file is located in the same region, less time will be consumed to fetch the file. Thus, the benefit of network-level locality can be exploited. The BHR strategy reduces data access time by maximizing this network-level locality.
Rahman et al. (Rahman, Barker, & Alhajj, 2005b) present a replica placement algorithm that considers both the current state of the network and file requests. Replication is started by placing the master files at one site. Then the expected "utility" or "risk index" is calculated for each site that does not currently hold a replica and then one replica is placed on the site that optimizes the expected utility or risk. The proposed algorithm based on utility selects a candidate site to host a replica by assuming that future requests and current load will follow current loads and user requests. Conversely, the algorithm using a risk index identifies sites far from all other sites and assumes a worst case scenario whereby future requests will primarily originate from that distant site, thereby attempting to provide good access throughout the network. One major drawback of these strategies is that the algorithms select only one site per iteration and place a replica there. Grid environments can be highly dynamic and thus there might be a sudden burst of requests such that a replica needs to be placed at multiple sites simultaneously to quickly satisfy the large spike of requests.
Two dynamic replication mechanisms (M. Tang, Lee, Yeo, & Tang, 2005) are proposed for a multitier architecture for Data Grids: Simple Bottom-Up (SBU) and Aggregate Bottom-Up (ABU). The SBU
algorithm replicates any data file that exceeds a pre-defined threshold of access rate as close as possible to the clients. The main shortcoming of SBU is its lack of consideration of the relationships among historical access records. To address this problem, ABU was designed, which takes into account access histories of files used by sibling nodes and aggregates the access records of similar files so that these frequently accessed files are replicated first. This process is repeated until the root is reached. An example of a data file access history and the network topology of the related nodes is shown in Figure 4. The history indicates that node N1 has accessed file A five times, while N2 and N3 have accessed B four and three times, respectively. Nodes N1, N2 and N3 are siblings and their parent node is P1. If we assume that the SBU algorithm is adopted and the given threshold is five, the last two records in the history will be skipped and only the first record will be processed. The result is that file A will be created in node P1 if it has enough space, and file B will not be replicated. Considering this example it is clear that the decision of SBU is not optimal, because from the perspective of the whole system, file B, which is accessed seven times by nodes N2 and N3, is more popular than A, which is only accessed five times by node N1. Hence, the better solution is to replicate file B to P1 first, then replicate file A to P1 if it still has enough space available. The Aggregate Bottom-Up (ABU) algorithm works in this aggregated fashion. With a hierarchical topology, a client searches for files upward from the client toward the root. In addition, the root replicates the needed data at every node. Therefore, access latency can be improved significantly. On the other hand, significant storage space may be used. Storage space utilization and access latency must be traded off against each other.
Rahman et al. (Rahman, Barker, & Alhajj, 2005a) propose a multi-objective approach to address the replica placement problem in Data Grid systems. A grid environment is highly dynamic, so predicting user requests and network load, a priori, is difficult. Therefore, if only a single objective is considered, variations in user requests and network load will have larger impacts on system performance. Rahman et al. use two models, the p-median and p-center models (Hakami, 1999), for selecting the candidate sites at which to host replicas. The p-median model places replicas at sites that optimize the request-weighted average response time (which is the time required to transfer a file from the nearest replication site). The response time is zero if a local copy exists. The request-weighted response time is calculated by multiplying the number of requests at a particular site by the response time for that site. The average is calculated by averaging the request-weighted response times for all sites. The p-center model selects candidate sites to host replicas by minimizing the maximum response time. Rahman et al. consider a multi-objective approach that combines the p-center and p-median objectives to decide where to place replicas.
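Returning to the Figure 4 example, the fragment below contrasts the SBU rule (each child's count is tested against the threshold on its own) with the ABU rule (sibling counts are aggregated at the parent before ranking). It is a simplified sketch of the published algorithms, with hypothetical function names.

from collections import Counter

def sbu_candidates(history, threshold):
    """SBU: a file qualifies only if a single child's access count meets the threshold."""
    return sorted({f for _, f, count in history if count >= threshold})

def abu_candidates(history, threshold):
    """ABU: aggregate sibling access counts per file, then rank by aggregated popularity."""
    totals = Counter()
    for _, f, count in history:
        totals[f] += count
    return [f for f, total in totals.most_common() if total >= threshold]

# Access history from Figure 4: (node, file, access count).
history = [("N1", "A", 5), ("N2", "B", 4), ("N3", "B", 3)]
print(sbu_candidates(history, threshold=5))   # ['A']       -- only A crosses the threshold alone
print(abu_candidates(history, threshold=5))   # ['B', 'A']  -- B (4 + 3 = 7) is replicated first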
Algorithms Focusing on System Reliability and Availability Once bandwidth and computing capacity become relatively cheap, data access time can decrease dramatically. How to improve the system reliability and availability then becomes the focal point for replication algorithms. Lei and Vrbsky (Lei & Vrbsky, 2006) propose a replica strategy to improve availability when storage resources are limited without increasing access time. To better express system data availability, Lei and Vrbsky introduce two new measures: the file missing rate and the bytes missing rate. The File Missing Rate (FMR) represents the number of files potentially unavailable out of all the files requested by all the jobs. The Bytes Missing Rate (BMR) represents the number of bytes potentially unavailable out of the total number of bytes requested by all jobs. Their replication strategy is aimed at minimizing the data missing rate. To minimize the FMR and BMR, their proposed strategy makes the replica and placement decisions based on the benefits received
from replicating the file in the long term. If the requested file is not at a site, it is replicated at the site if there is enough storage space. If there is not enough free space to store the replica, an existing file must be replaced. Their replication algorithm can be enhanced by differentiating between the file missing rate and the bytes missing rate in the grid when file sizes are not uniform.
Ranganathan et al. (Ranganathan, Iamnitchi, & Foster, 2002) present a dynamic replication strategy that creates copies based on trade-offs between the cost and the future benefits of creating a replica. Their strategy is designed for peer-to-peer environments where there is a high degree of unreliability and hence considers the minimum number of replicas that might be required given the probability of a node being up and the accuracy of information possessed by a site in a peer-to-peer network. In their approach, peers create replicas automatically in a decentralized fashion, as required to meet availability goals. The aim of the framework is to maintain a threshold level of availability at all times. Each peer in the system possesses a model of the peer-to-peer storage system that it can use to determine how many replicas of any file are needed to maintain the desired availability. Each peer applies this model to the (necessarily incomplete and/or inaccurate) information it has about the system state and replication status of its files to determine if, when, and where new replicas should be created. The result is a completely decentralized system that can maintain performance guarantees. These advantages come at the price of accuracy since nodes make decisions based on partial information, which sometimes leads to unnecessary replication. Simulation results show that the redundancy in action associated with distributed authority is more evident when nodes are highly unreliable.
An analytical model for determining the optimal number of replica servers is presented by Schintke and Reinefeld (Schintke & Reinefeld, 2003) to guarantee a given overall reliability given unreliable system components. Two views are identified: the requester, who requires a guaranteed availability of the data (local view), and the administrator, who wants to know how many replicas are needed and how much disk space they would occupy in the overall system (global view). Their model captures the characteristics of peer-to-peer-like environments as well as those of grid systems. Empirical simulations confirm the accuracy of this analytical model.
Abawajy (Abawajy, 2004) addresses the file replication problem while focusing on the issue of strategic placement of the replicas, with the objectives of increased availability of the data and improved response time while distributing load equally. Abawajy proposes a replica placement service called Proportional Share Replication (PSR). The main idea underlying the PSR policy is that each file replica should serve an approximately equal number of requests in the system. The objective is to place the replicas on a set of sites systematically in such a way that file access parallelism is increased while the access costs are decreased. Abawajy argues that no replication approach balances the load of data requests within the system both at the network and host levels.
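The availability-driven models above can be made concrete under a deliberately simplified assumption of independent node failures: if each node holding a replica is up with probability p, then n replicas give availability 1 - (1 - p)^n. The sketch below computes the smallest n meeting a target; it is only an illustration of the reasoning, not the actual models of Ranganathan et al. or Schintke and Reinefeld, which also account for inaccurate state information and storage costs.

import math

def min_replicas(target_availability, node_up_probability):
    """Smallest n such that 1 - (1 - p)^n >= target, assuming independent
    node failures.  A simplification of the models discussed above."""
    if node_up_probability >= 1.0:
        return 1
    allowed_all_down = 1.0 - target_availability
    n = math.log(allowed_all_down) / math.log(1.0 - node_up_probability)
    return max(1, math.ceil(n))

# e.g. 99.9% availability with nodes that are each up 80% of the time
print(min_replicas(0.999, 0.80))  # -> 5 replicas under these assumptions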
Simulation results show that file replication improves the performance of data access but the gains depend on several factors including where the file replicas are located, burstiness of the request arrivals, packet losses and file sizes. To use distributed replicas efficiently and to improve the reliability of data transfer, Wang et al. (C. Wang, Hsu, Chen, & Wu, 2006) propose an efficient multi-source data transfer algorithm for data replication, whereby a data replica can be assembled in parallel from multiple distributed data sources in a fashion that adapts to various network bandwidths. The goal is to minimize the data transfer time by scheduling sub-transfers among all replica sites. All replica sites must deliver their source data continuously to maximize their aggregated bandwidth, and all sub-transfers of source data should, ideally, be fully overlapped throughout the replication. Experimental results show that their algorithm can obtain more aggregated bandwidth, reduce connection overheads, and achieve superior load balance.
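The goal of keeping all sub-transfers overlapped can be approximated by splitting the file among the replica sites in proportion to their estimated bandwidths, so that every site finishes at roughly the same time. The sketch below is a static simplification with hypothetical site names; the algorithm of Wang et al. adjusts the schedule continuously as observed bandwidths change.

def proportional_split(file_size_mb, bandwidth_mbps):
    """Assign each replica site a share of the file proportional to its
    estimated bandwidth, so sub-transfers finish at roughly the same time."""
    total = sum(bandwidth_mbps.values())
    return {site: file_size_mb * bw / total for site, bw in bandwidth_mbps.items()}

print(proportional_split(1000, {"siteA": 80, "siteB": 40, "siteC": 20}))
# {'siteA': 571.4..., 'siteB': 285.7..., 'siteC': 142.8...}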
Algorithms Focusing on Overall Grid Performance Although a substantial amount of work has been done on data replication in Grid systems, most of it has focused on infrastructure for replication and mechanisms for creating and deleting replicas. However, to obtain the maximum benefit from replication, a strategic placement of replicas considering many factors is essential. Notably, different sites may have different service quality requirements. Therefore, quality of service is an important additional factor in overall system performance. Lin et al. (Lin, Liu, & Wu, 2006) address the problem of data replica placement in Data Grids given traffic patterns and locality requirements. They consider several important issues. First, the replicas should be placed in proper server locations so that the workload on each server is balanced. Another important issue is choosing the optimal number of replicas when the maximum workload capacity for each replica server is known. The denser the distribution of replicas is, the shorter the distance a client site needs to travel to access a data copy. However, maintaining multiple copies of data in Grid systems is expensive, and therefore, the number of replicas must be bounded. Clearly, optimizing the access cost of data requests and reducing the cost of replication are two conflicting goals. Finding a balance between them is a challenging task. Lin et al. also consider the issue of service locality. Each user may specify the minimum distance he will accept to the nearest data server. This serves as a locality assurance that users may specify, and the system must make sure that within the specified range there is a server to answer any file request. Lin et al. assume a hierarchical Data Grid model. In such a hierarchical Data Grid model, all the request traffic may reach the root, if not satisfied by a replica. This introduces additional complexity for the design of an efficient algorithm for replica placement in Grid systems when network congestion is one of the objective functions to be optimized. Tang and Xu (X. Tang & Xu, 2005) suggest a QoS-aware replica placement approach to cope with quality-of-service issues. They provide two heuristic algorithms for general graphs, and a dynamic programming solution for a tree topology. Every edge uses the distance between the two end-points as a cost function. The distance between two nodes is used as a metric for quality (i.e. access time) assurance. A request must be answered by a server that is within the distance specified by the request. Every request knows the nearest server that has the replica and the request takes the shortest path to reach the server. Their goal has been to find a replica placement that satisfies all requests without violating any range constraint, and that minimizes the update and storage costs at the same time. They show that this QoS-aware replica placement problem is NP-complete for general graphs. Wang et al. (H. Wang, Liu, & Wu, 2006) study the QoS-aware replica placement problem and provide a new heuristic algorithm to determine the positions of the replicas to improve system performance and satisfy the quality requirements specified by the users simultaneously. Their model is based on general graphs and their algorithm starts by finding the cover set (Revees, 1993) of every server in the network. In the second phase, the algorithm identifies and deletes super cover sets in the network. Finally, it inserts replicas into the network iteratively until all servers are satisfied. 
Experiment results indicate that the algorithm efficiently finds near-optimal solutions so that it can be deployed in various realistic environments. However, the study does not consider the workload capacity of the servers.
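The essence of these QoS-aware placement schemes can be conveyed with a greedy sketch: treat every server as a candidate replica location, and repeatedly place a replica where it satisfies the largest number of servers whose distance requirement is not yet met. This is only a simplified illustration with hypothetical inputs, not the cover-set algorithm of Wang et al. or the dynamic program of Tang and Xu.

def greedy_qos_placement(servers, distance, qos_range):
    """Greedy sketch: place replicas until every server has one within its
    QoS range.  distance[s][c] is the distance from server s to candidate c,
    and qos_range[s] is the largest distance server s will tolerate."""
    unsatisfied = set(servers)
    placement = []
    while unsatisfied:
        best = max(servers,
                   key=lambda c: sum(1 for s in unsatisfied
                                     if distance[s][c] <= qos_range[s]))
        covered = {s for s in unsatisfied if distance[s][best] <= qos_range[s]}
        if not covered:          # defensive: give each remaining server its own copy
            placement.extend(sorted(unsatisfied))
            break
        placement.append(best)
        unsatisfied -= covered
    return placement

dist = {"s1": {"s1": 0, "s2": 3, "s3": 9},
        "s2": {"s1": 3, "s2": 0, "s3": 5},
        "s3": {"s1": 9, "s2": 5, "s3": 0}}
print(greedy_qos_placement(["s1", "s2", "s3"], dist, {"s1": 4, "s2": 4, "s3": 4}))
# ['s1', 's3']: a replica at s1 serves s1 and s2, while s3 needs its own copy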
Figure 5. A taxonomy of replica selection algorithms and selected papers
Replica Selection Algorithms
To improve replica retrieval we must determine the best replica location using a replica selection technique. Such techniques attempt to select the single best server to provide optimum transfer rates. This can be challenging because bandwidth quality can vary unpredictably due to the shared nature of the Internet. Another approach is to use co-allocation technology (Vazhkudai, 2003) to download data. Co-allocation of data transfers enables the clients to download data from multiple locations by establishing multiple connections in parallel. This can improve the performance compared to single-server approaches and helps to mitigate Internet congestion problems. Figure 5 shows a taxonomy of replica selection algorithms based on the method used for retrieving the replicas distributed in the system.
Algorithms Based on Selecting the Best Replica
Vazhkudai et al. (Vazhkudai, Tuecke, & Foster, 2001) discuss the design and implementation of a high-level replica selection service that uses information regarding replica location and user preferences to guide selection from among storage replica alternatives. An application that requires access to replicated data begins by querying an application-specific metadata repository, specifying the characteristics of the desired data. The metadata repository maintains associations between representative characteristics and logical files, thus enabling the application to identify logical files based on application requirements rather than by a possibly unknown file name. Once the logical file has been identified, the application uses the replica catalog to locate all replica locations containing physical file instances of this logical file, from which it can choose a suitable instance for retrieval. Vazhkudai et al. use Globus (Foster, 2006) information service capabilities concerning storage system properties to collect dynamic information to improve and optimize the selection process.
Chervenak et al. (Chervenak, 2002) characterize the requirements for a Replica Location Service (RLS) and describe a Data Grid architectural framework, Giggle (GIGa-scale Global Location Engine), within which a wide range of RLSs can be defined. An RLS is composed of a Local Replica Catalog (LRC) and a Replica Location Index (RLI). The LRC maps logical identifiers to physical locations and vice versa. It periodically sends out information to other RLSs about its contents (mappings) via a
soft-state propagation method. Collectively, the LRCs provide a complete and locally consistent record of global replicas. The RLI contains a set of pointers from logical identifier to LRC. The RLS uses the RLIs to find LRCs that contain the requested replicas. RLIs may cover a subset of LRCs or cover the entire set of LRCs. To select the best replica Rahman et al. (Rahman, K.Barker, & Alhajj, 2005) design an optimization technique that considers both network latency and disk state. They present a model that uses a simple data mining approach to select the best replica from a number of sites that hold replicas. Previous history of data transfers can help in predicting the best site to hold a replica. Rahman et al’s approach is one such predictive technique. In their technique, when a new request arrives for the best replica, all previous data are examined to find a subset of previous file requests that are similar and then they are used to predict the best site to hold the replica. The proposed model shows significant performance improvement for sequential and unitary random file access patterns. The client node always contacts the site found by the classifier and requests the file, regardless of the accuracy of the classification result. Switching from a classification method to a traditional one is not considered even when the classification result is far from ideal. Hence, the system performance will decrease for inaccurate file accesses. Future work could be done on designing an adaptive algorithm so that the algorithm can switch to a traditional approach for consecutive file transfers when it encounters misclassification. Sun et al. (M. Sun, Sun, Lu, & Yu, 2005) propose an ant optimization algorithm for file replica selection in Data Grids. The general idea of the ant-based approach is to use an ant colony optimization algorithm to decide which data file replicas should be accessed when a job requires data resources. The ant algorithm (Dorigo, 1992) is a meta-heuristic method which mimics the behavior of how real ants find the shortest path from their nest to a food source. The main idea is to mimic the pheromone trail used by real ants as a medium for communication and feedback among ants. The goal of using the ant-based approach is to exploit the ant algorithm to decide which data file replicas should be accessed when a job requires data resources. For the selection of a data replica the ant uses pheromone information which reflects the efficiency of previous accesses. The algorithm has been implemented and the advantages of the new ant algorithm have been investigated using the grid simulator OptorSim (W. Bell et al., 2003). Their evaluation demonstrates that their ant algorithm can reduce data access latency, decrease bandwidth consumption and distribute storage site load.
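The pheromone mechanism can be illustrated with a small sketch: each replica site carries a pheromone value that decays over time and is reinforced after fast transfers, and sites are selected with probability proportional to their pheromone. This is a deliberately simplified illustration of the idea, with hypothetical names, not the published algorithm of Sun et al.

import random

class AntReplicaSelector:
    """Toy pheromone-based replica selection in the spirit of the ant approach
    described above: sites that delivered data quickly accumulate pheromone
    and are picked with higher probability, while evaporation lets stale
    information fade.  A sketch, not the published algorithm."""

    def __init__(self, sites, evaporation=0.1):
        self.pheromone = {site: 1.0 for site in sites}
        self.evaporation = evaporation

    def choose(self):
        sites = list(self.pheromone)
        weights = [self.pheromone[s] for s in sites]
        return random.choices(sites, weights=weights, k=1)[0]

    def feedback(self, site, transfer_time):
        # Evaporate everywhere, then deposit pheromone inversely
        # proportional to the observed transfer time.
        for s in self.pheromone:
            self.pheromone[s] *= (1.0 - self.evaporation)
        self.pheromone[site] += 1.0 / max(transfer_time, 1e-6)

selector = AntReplicaSelector(["siteA", "siteB", "siteC"])
site = selector.choose()
selector.feedback(site, transfer_time=12.5)   # hypothetical measured seconds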
Algorithms Using Co-Allocation Mechanism Vazhkudai (Vazhkudai, 2003) developed several co-allocation mechanisms to enable parallel downloading of files. The most interesting one is called Dynamic Co-Allocation. The dataset that the client wants is divided into “k” disjoint blocks of equal size. Each available server is assigned to deliver one block in parallel. When a server finishes delivering a block, another block is requested, and so on, until the entire file is downloaded. Faster servers can deliver the data quickly, thus serving larger portions of the file requested when compared to slower servers. This approach exploits the partial copy feature of GridFTP (Allcock, 2003) provided by the Globus Toolkit (Foster, 2006) to reduce the total transfer time. One drawback of this approach is that faster servers must wait for the slowest server to deliver the final block. This ‘idle-time drawback’ is common to existing co-allocation strategies. It is important to reduce the differences in completion time among replica servers to achieve the best possible performance. Chang et al. (Chang, Wang, & Chen, 2005) suggest an improvement to dynamic co-allocation to address the problem of faster servers waiting for slower ones. Their work is based on a co-allocation
architecture coupled with prediction techniques. They propose two techniques: (1) abort and retransfer, and (2) one by one co-allocation. These techniques can increase the volume of data requested from faster servers and reduce the volume of data fetched from slower servers thereby balancing the load and individual completion times. The Abort and Retransfer scheme allows the aborting of the slowest server transfer so the work can be moved to faster servers. This can dynamically change the allocation condition based on the dynamic conditions of the transfer. When all data blocks are assigned, the procedure will check the remaining transfer time of the slowest server. If the remaining time is longer than the time of transferring the last data block from the fastest server, the final data block will be re-assigned to the fastest server. One by one co-allocation focuses on preventing the problematic allocation to the slowest server a-priori. One by one co-allocation is a pre-scheduling method used to allocate the data blocks to be transferred to the available servers. By using a prediction technique, the transfer time of each server is estimated. The data blocks are then assigned to the fastest server with the lowest transfer time in each round. Further, if one server is assigned to transfer more than one data block in an earlier round, its total transfer time is accumulated. Yang et al. (Yang, Yang, Chen, & Wang, 2006) propose a dynamic co-allocation scheme based on a co-allocation grid data transfer architecture called the Recursive-Adjustment Co-Allocation scheme that reduces the idle time spent waiting for the slowest server and improves data transfer performance. Their co-allocation scheme works by continuously adjusting each replica server’s workload to correspond to its real-time bandwidth during file transfers. Yang et al. also provide a function that enables users to define a “final block threshold”, according to the characteristics of their Data Grid environment to avoid continuous over adjustment. Usually, a complete file is replicated to many Grid sites for local access (including when co-allocation is used). However, a site may only need certain parts of a given replica. Therefore, to use the storage system efficiently, it may be desirable for a grid site to store only part(s) of a replica. Chang and Chen (Chang & Chen, 2007) propose a concept called fragmented replicas where, when doing replication, a site can store only those partial contents that are needed locally. This can greatly reduce the storage space wasted in storing unused data. Chang and Chen also propose a block mapping procedure to determine the distribution of blocks in every available server for later replica retrieval. Using this procedure, a server can provide its available partial replica contents to other members in the grid system since clients can retrieve a fragmented replica directly by using the block mapping procedure. Given the block mapping procedure, co-allocation schemes (Vazhkudai, 2003; Chang et al., 2005) can be used to retrieve data sets from the available servers given the added constraint that only specific servers will hold a particular fragment. Simulation results show that download performance is improved in their fragmented replication system. Chang and Chen (Chang & Chen, 2007) assume that the blocks in a fragmented replica are contiguous. If they were not, then the data structure to represent the fragmented replica and the algorithm for retrieval would be more complicated. 
Also, the proposed algorithms do not always find an optimal solution as explained by the authors. It would be interesting to determine if a worst case performance bound exists for these algorithms. When multiple replicas exist, a client uses a replica selection mechanism to find the “best” source from which to download. However, this simple approach may not yield the best performance and reliability because data is received from only one replica server. To avoid this problem, Zhou et al. (Zhou, Kim, Kim, & Yeom, 2006) developed ReCon, a fast and reliable replica retrieval system for Data Grids that
acquires data not only from the best source but from other sources as well. Through concurrent transfer, they are able to achieve significant performance improvement when retrieving a replica. ReCon also provides fault-tolerant replica retrieval since multiple replication sites are employed. For fast replica retrieval, Zhou et al. considered various fast retrieval algorithms, among which probe-based retrieval appears to be the best approach, providing twice the transfer rate of the best replica server chosen by the replica selection service. Probe-based retrieval predicts the future network throughput of the replica servers by sending probing messages to each server. This allows them to select replicas that will provide fast access. For reliable replica retrieval, they introduce a recursive scheduling mechanism, which provides fault-tolerant retrieval by rescheduling failed sub-transfers.
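The basic Dynamic Co-Allocation loop described at the beginning of this subsection can be sketched as follows. It is a minimal illustration assuming a hypothetical fetch_block(server, i) callable that downloads block i from a given replica server; it is not an interface of any of the systems cited above. Because workers pull blocks on demand, faster servers naturally deliver more blocks, but the idle-time drawback remains: the transfer finishes only when the slowest outstanding block arrives.

from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue

def co_allocated_download(servers, num_blocks, fetch_block):
    """Dynamic co-allocation sketch: the dataset is split into num_blocks
    disjoint blocks; each replica server repeatedly takes the next pending
    block, so faster servers end up delivering more of the file.
    fetch_block(server, i) is a hypothetical callable returning block i's bytes."""
    pending = Queue()
    for i in range(num_blocks):
        pending.put(i)
    blocks = {}

    def worker(server):
        while True:
            try:
                i = pending.get_nowait()
            except Empty:
                return
            blocks[i] = fetch_block(server, i)

    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        for server in servers:
            pool.submit(worker, server)
    # all workers have finished when the pool exits; reassemble in block order
    return b"".join(blocks[i] for i in range(num_blocks))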
Replica Consistency Algorithms As mentioned earlier, the replica consistency problem in Data Grid systems deals with the update synchronization of multiple copies (replicas) of a file. When one file is updated, all other replicas have to be synchronized to have the same contents and thus provide a consistent view. Different algorithms for maintaining such consistency have been proposed in the literature. Replication consistency algorithms have traditionally been classified into strong and weak consistency. A strong consistency algorithm (Duvvuri, Shenoy, & Tewari, 2000) ensures that all the replicas have exactly the same content (synchronized systems) before any transaction is carried out. In an unreliable network like the Internet, with a large number of replicas, latency can become high so it becomes impractical to use such systems. Strong consistency algorithms are suitable for systems with few replicas, and on a reliable, low latency network where a large amount of bandwidth is available. In contrast, weak consistency algorithms (Golding, 1992) maintain approximate consistency of the replicas sacrificing data freshness in a controlled way to improve availability and performance. They are very useful in systems where it is not necessary for all the replicas to be totally consistent for carrying out transactions (systems that can withstand a certain degree of inconsistency). In the context of weak consistency, the fast consistency algorithm (Elias & Moldes, 2002b) prioritizes replicas with high demand in such a way that a large number of clients receive fresh content. As described by Elias and Moldes (Elias & Moldes, 2002a), this algorithm gives high performance in one zone of high demand, but in multiple zones the performance may become poor. To improve this poor performance, Elias and Moldes (Elias & Moldes, 2003) propose an election algorithm based on demand, whereby the replicas in each zone of high demand select leader replicas that subsequently construct a logical topology, linking all the replicas together. In this way, changes are able to reach all the high demand replicas without the low demand zones forming a barrier to prevent this from happening. Two coherence protocols for Data Grids were introduced by Sun and Xu (Y. Sun & Xu., 2004) called lazy-copy and aggressive-copy. In the lazy-copy based protocol, replicas are only updated as needed if someone accesses them. This can save network bandwidth by avoiding transferring up-to-date replicas every time some modifications are made. However, the lazy-copy protocol has to pay penalties in terms of access delay when inter-site updating is required. For the aggressive-copy protocol, replicas are always updated immediately when the original file is modified. In other words, full consistency for replicas is guaranteed in aggressive-copy, whereas partial consistency is applied to lazy-copy. Compared with lazy-copy, access delay time can be reduced by an aggressive-copy based mechanism without suffering from long update time during each replica access. Nevertheless, full consistency with frequent replica updates could consume a considerable amount of network bandwidth. Furthermore, some updates may
be unnecessary because it is probable that they will never be used.
Chang and Chang (Chang & Chang, 2006) propose an innovative and effective architecture called the Adaptable Replica Consistency Service (ARCS), which deals with the replica consistency problem to provide better performance and load balance for file replication in Data Grids. The ARCS architecture works by modifying the two previously described coherence protocols. Chang and Chang make use of the concept of network regions in (Park et al., 2003) to develop the scheme. Several grid sites located closely together are organized into a grid group called a grid region. A 'Region Server' is responsible for the consistency service within a region. Each region is connected via the Internet. Each grid region has at most one master replica, and multiple master replicas are distributed over the grid regions. A region server must be aware of the location of the other master replicas to maintain full consistency among all master replicas. Whenever a master replica is modified in a certain grid region, the update is propagated to the other connected region servers for their master replicas with the aid of a file locking mechanism. Thus, a master replica within a grid region has the latest information at all times. Each secondary replica can update its contents more efficiently in accordance with the master replica if its region has a master replica. Simulation results show that ARCS is superior to the coherence protocols described in (Y. Sun & Xu., 2004).
Belalem and Slimani (Belalem & Slimani, 2006, 2007) propose a hybrid model to manage the consistency of replicas in large-scale systems. The model combines two existing approaches to consistency management, optimistic and pessimistic (Saito & Levy, 2000; Saito & Shapiro, 2005). The pessimistic approach prohibits any access to a replica unless it is provably up to date. The main advantage of this approach is that all replicas converge at the same time, which guarantees high consistency of data; hence, any problem of divergence is avoided. On the contrary, the optimistic approach allows access to any replica at any time regardless of the state of the replica sets, which might be incoherent. This also means that the approach can cause replica contents to diverge. Optimistic techniques require a follow-up phase to detect and correct divergences among replicas by converging them toward a coherent state. Pessimistic replication and optimistic replication are two contrasting replication models, and the work of Belalem and Slimani tries to benefit from the advantages of both. Optimistic principles are used to ensure replica consistency within each site in the grid individually. Global consistency, i.e., consistency between sites, is covered by the application of algorithms inspired by the pessimistic approach. Their model aims to substantially reduce the communication time between sites to achieve replica consistency, increase the effectiveness of consistency management, and more importantly, be adaptive to changes in large systems.
Domenici et al. (Domenici, Donno, Pucciani, & Stockinger, 2006) propose a Replica Consistency Service, CONStanza, that is general enough to be suitable for most types of applications in a grid environment and which meets the general requirements for grid middleware, such as performance, scalability, reliability, and security.
Their proposed replica consistency service allows for replica updates in a single-master scenario with lazy update synchronization. Two types of replicas are maintained which have different semantics and access permissions for end-users. The first is a master replica that can be updated by end-users of the system. The master replica is the one that is, by definition, always up-to-date. The other is a secondary replica (also referred to as secondary copy) that is updated/synchronized by CONStanza with a certain delay to eventually have the same contents as the master replica. Obviously, the longer the update propagation delay is, the more unsynchronized the master and the secondary replica are, and the higher is the probability of experiencing stale reads on secondary replicas. This service
provides users with the ability to update data using a certain consistency delay parameter (hence relaxed consistency) to adapt to specific application requirements and tolerances.
Figure 6. A taxonomy of algorithms considering data scheduling and associated papers
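The relaxed-consistency behavior described above, a lazily synchronized secondary copy that may serve slightly stale data within a configured delay, can be illustrated with the following sketch. The class and field names are hypothetical and this is not the CONStanza interface; it only shows how a consistency-delay parameter bounds staleness in a single-master setting.

import time

class MasterReplica:
    """Single master copy; every write bumps the version and timestamp."""
    def __init__(self, content):
        self.content, self.version, self.updated_at = content, 0, time.time()

    def write(self, content):
        self.content = content
        self.version += 1
        self.updated_at = time.time()

class SecondaryReplica:
    """Lazily synchronized secondary copy: a read refreshes from the master
    only once the configured consistency delay has elapsed since the last
    master update, so reads inside that window may return stale data."""
    def __init__(self, master, consistency_delay):
        self.master = master
        self.consistency_delay = consistency_delay
        self.content, self.version = master.content, master.version

    def read(self):
        stale = self.version < self.master.version
        delay_over = time.time() - self.master.updated_at >= self.consistency_delay
        if stale and delay_over:
            self.content, self.version = self.master.content, self.master.version
        return self.content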
Impact of Data Replication on Job Scheduling Effective scheduling can reduce the amount of data transferred across the Internet by dispatching a job to where the needed data files are available. Assume a job is scheduled to be executed at a particular compute node. When job scheduling is coupled to replication and the data has to be fetched from remote storage, the scheduler can create a copy of the data at the point of computation so that future requests for the same file that come from the neighborhood of the compute node can be satisfied more quickly. Further, in the future, any job dealing with that particular file can be preferentially scheduled at that compute node if it is available. In a decoupled scheduler, the job is scheduled to a suitable computational resource and a suitable replica location is then identified to request the required data from. In this case, the storage requirement is transient, that is, disk space is required only for the duration of execution. Figure 6 shows a taxonomy of replication algorithms considering data scheduling. A comparison of decoupled against coupled strategies by Ranganathan and Foster (Ranganathan & Foster, 2002) has shown that decoupled strategies promise increased performance and reduce the complexity of designing algorithms for Data Grid environments. He et al. (He, Sun, & Laszewski, 2003) deal with the problem of Integrated Replication and Scheduling (IRS). They couple job scheduling and data scheduling. At the end of periodic intervals when jobs are scheduled, the popularity of required files is calculated and then used by the data scheduler to replicate data for the next set of jobs. While these may or may not share the same data requirements as the previous set there is often a high probability that they will. The importance of data locality in job scheduling was also realized by Ranganathan and Foster (Ranganathan & Foster, 2004). They proposed a Data Grid architecture based on three main components: External Scheduler (ES), Local Scheduler (LS) and Dataset Scheduler (DS). An ES receives job submissions from the user, then it decides the remote site to which the job should be sent depending on its scheduling strategy. The LS of each site decides how to schedule all the jobs assigned to it, using
its local resources. The DS keeps track of the popularity of each dataset currently available and makes data replication decisions. Using this architecture, Ranganathan and Foster developed and evaluated various replication and scheduling strategies. Their results confirmed the importance of data locality in scheduling jobs. Dheepak et al. (Dheepak, Ali, Sengupta, & Chakrabarti, 2005) have created several scheduling techniques based on a developed replication strategy. The scheduling strategies are Matching based Job Scheduling (MJS), Cost base Job Scheduling (CJS) and Latency based Job Scheduling (LJS). In MJS, the jobs are scheduled to those sites that have the maximum match in terms of data. For example, if a job requests ‘n’ files, and in a site, all those files are already present, then the amount of data in bytes corresponding to those ‘n’ files represents the match corresponding to the job request. In CJS, the cost of scheduling a job onto a site is defined as the combined cost of moving the data to the site, the time to compute the job at the site, and the wait time in the queue at the site. The job is scheduled onto the site which has the minimum cost. Finally, LJS takes the latency experienced into account before taking the scheduling decision. The cost of scheduling in this case includes the latency involved in scheduling the current job based on the current data locations, and also the latency involved due to the current queue. Simulation results show that among the strategies, LJS and CJS perform similarly and MJS performs less well. Venugopal et al. (Venugopal & Buyya, 2005) propose a scheduling algorithm that considers two cost metrics: an economic budget and time. The algorithm tries to optimize one of them given a bound on the other, e.g., spend a budget as small as possible, while not missing any deadline. The incoming applications consist of a set of independent tasks each of which requires a computational resource and accesses a number of data sets located on different storage sites. The algorithm assumes every data set has only one copy in the Grid, so that the resource selection is only for computational resources, taking into account the communication costs from data storage sites to different computation sites as well as the actual computation costs. Instead of doing a thorough search in a space whose size is exponential in the number of data sets requested by a task, the resource selection procedure simply performs a local search which only guarantees that the current mapping is better than the previous one. In this way, the cost of the search procedure is linear. The drawback of this strategy is that it is not guaranteed to find a feasible schedule even if there is one. As we have seen, data can be decomposed into multiple independent sub datasets and distributed for parallel execution and access. Most of the existing studies on scheduling in Data Grids do not consider this possibility which is typical in many data intensive applications. Kim and Weissman (Kim & Weissman, 2004), however, exploit such parallelism to achieve desired performance levels when scheduling large Data Grid applications. When parallel applications require multiple data files from multiple data sources, the scheduling problem is challenging in several dimensions; how should data be decomposed, should data be moved to computation or vice-versa, and which computing resources should be used. 
The problem can be optimally solved by adding some constraints (e.g., decomposing data into sub datasets of only the same size). Another approach is to use heuristics such as those based on optimization techniques (e.g. genetic algorithms, simulated annealing, and tabu search). Kim and Weissman propose a novel genetic algorithm (GA) based approach to address the scheduling of decomposable Data Grid applications, where communication and computation are considered at the same time. Their proposed algorithm is novel in two ways. First, it automatically balances load, that is, data in this case, onto communication/ computation resources while generating a near optimal schedule. Second, it does not require a job to be pre-decomposed. This algorithm is a competitive choice for scheduling large Data Grid applications
in terms of both scheduling overhead and the quality of solutions when compared to other algorithms. However, this work does not consider the case of multiple jobs competing for shared resources. Tang et al. (M. Tang, Lee, Tang, & Yeo, 2005; M. Tang et al., 2006) propose a Data Grid architecture supporting efficient data replication and job scheduling. The computing sites are organized into individual domains according to the network structure, and a replica server is placed in each domain. Two centralized dynamic replication algorithms with different replica placement methods and a distributed dynamic replication algorithm are proposed. At regular intervals, the dynamic replication algorithms exploit the data access history for popular data files and compute the replication destinations to improve data access performance for grid jobs. Coupled with these replication algorithms, the grid scheduling heuristics of Shortest Turnaround Time (STT), Least Relative Load and Data Present are proposed. For each incoming job, the STT heuristic estimates the turnaround time on every computing site and assigns a new job to the site that provides the shortest turnaround time. The Least Relative Load heuristic assigns a new job to the computing site that has the least relative load. This scheduling heuristic attempts to balance the workloads for all computing sites in the Data Grid. Finally, the Data Present heuristic considers the data location as the major factor when assigning the job. Simulation results demonstrate that the proposed algorithms can shorten the job turnaround time greatly. Analyzing earlier work, Dang and Lim (Dang & Lim, 2007) identified two shortcomings in earlier work. The first is not considering the relationships among data file and between the data files and jobs. By replicating a set of files that has high probability to be used together on nearby resources, they expect that the jobs using these files will be scheduled to that small area. The second is a limitation in the use of the Dataset Scheduler (DS) (Ranganathan & Foster, 2004). Instead of just tracking data popularity, the DS plays the role of an independent scheduler. They propose a tree of data types in which the relationship between data in the same category and relationship between nearby categories are defined. By means of this, a correlation between data is extracted. The idea is then to gather data that is ‘related’ to a small region so that any job requiring such data will be executed inside that region. This reduces the cost to transfer data to the job execution site, therefore, improves the job execution performance. Desprez et al. (Desprez & Vernois, 2006) describe an algorithm that combines data management and scheduling via a steady state approach. Using a model of the grid platform, the number of requests as well as their distribution, and the number and size of data files, they define a linear programming problem to satisfy the constraints at every level of the platform in steady-state. The solution of this linear program provides a placement for the data files on the servers as well as, for each kind of job, the server on which they should be executed. However, this heuristic approach for approximating an integer solution to the linear program does not always give the best mapping of data and can potentially give results that are far from the optimal value of the objective function. Chang et al. 
(Chang, Chang, & Lin, 2007) developed a job scheduling policy, called Hierarchical Cluster Scheduling (HCS), and a dynamic data replication strategy, called Hierarchical Replication Strategy (HRS), to improve data access efficiency in a cluster structured grid. Their HCS scheduling policy considers the locations of required data, the access cost and the job queue length of a computing node. HCS uses hierarchical scheduling that takes cluster information into account to reduce the search time for an appropriate computing node. HRS integrates the previous replication strategy with the job scheduling policy to increase the chance of accessing data at a nearby node. The probability of scheduling the same type of job to the same cluster will be rather high in their proposed scheduling algorithm, leading to possible load balancing problems. The consideration of system load balancing with other
scheduling factors will be an important direction for future research. In addition, balancing between data access time, job execution time, and network capabilities also needs to be studied further.
Some recent work has addressed data movement in task scheduling. The current research has developed along two directions: allocating the task to where the data is, and moving the data to where the task is. He and Sun (He & Sun, 2005) incorporate data movement into task scheduling using a newly introduced data structure called the Data Distance Table (DDT) to measure the dynamic data movement cost, and integrate this cost into an extended Min-Min (He et al., 2003) scheduling heuristic. A data replica based algorithm is dynamically adjusted to place data on under-utilized sites before any possible load imbalance occurs. Based on the DDT, a data-conscious task scheduling heuristic is introduced to minimize data access delay. Experimental results show that their data-conscious dynamic adjusting scheduling heuristic outperforms the general Min-Min technique significantly for data-intensive applications, especially when the critical data sets are unevenly distributed.
Khanna et al. (Khanna et al., 2006) address the problem of efficient execution of a batch of data-intensive tasks with batch-shared I/O behavior on coupled storage and compute clusters. They approach the problem in three stages. The first stage, called sub-batch selection, partitions a batch of tasks into sub-batches such that the total size of the files required for a sub-batch does not exceed the available aggregate disk space on the compute cluster. The second stage accepts a sub-batch as input and yields an allocation of the tasks in the sub-batch onto the nodes of the compute cluster to minimize the sub-batch execution time. The third stage orders the tasks allocated to each node at runtime and dynamically determines what file transfers need to be done and how they should be scheduled to minimize end-point contention on the storage cluster. Two scheduling schemes are proposed to solve this three-stage problem. The first approach formulates the sub-batch selection problem using a 0-1 Integer Programming (IP) formulation. The second stage is also modeled as a 0-1 IP formulation to determine the mapping of tasks to nodes, the source and destination nodes for all replications, and the destination nodes for all remote transfers. The second approach, called BiPartition, employs a bi-level hypergraph partitioning based scheduling heuristic that formulates the sharing of files among tasks as a hypergraph. The BiPartition approach results in slightly longer batch execution times, but is much faster than the IP-based approach. Thus, the IP-based approach is attractive for small workloads, while the BiPartition approach is preferable for large-scale workloads.
Lee and Zomaya (Lee & Zomaya, 2006) propose a novel scheduling algorithm called the Shared Input data based Listing (SIL) algorithm for Data-intensive Bag-of-Tasks (DBoT) applications on grids. The algorithm uses a set of task lists that are constructed taking data sharing patterns into account and that are reorganized dynamically based on the performance of resources during the execution of the application. The primary goal of this dynamic listing is to minimize data transfers, thus shortening the overall completion time of DBoT applications.
The SIL algorithm also attempts to reduce serious schedule increases (that occur because of inefficient task/host assignments) by using task duplication. The SIL algorithm consists of two major phases. The task grouping phase groups tasks into a set of lists based on their data sharing patterns, associates these task lists with sites, and further breaks and/or associates them with hosts. Then the scheduling phase assigns tasks to hosts dynamically reorganizing task lists and duplicates tasks once all tasks are scheduled but while some tasks are still running. Additionally, Santos-Neto et al. (Santos-Neto, Cirne, Brasileiro, & Lima, 2004) have developed a Storage Affinity (SA) algorithm which tries to minimize data transfers by making scheduling decisions incorporating the location of data previously transferred. In addition, they consider task replication as
soon as a host becomes available, that is, between the time the last unscheduled task gets assigned and the time the last running task completes its execution. The SA algorithm determines task/host assignments based on a "storage affinity metric". The storage affinity of a task to a host is the amount of the task's input data already stored at the site to which the host belongs. Although the scheduling decision SA makes is between task and host, storage affinity is calculated between task and site. This is because in the grid model used for SA, each site in the grid uses a single data repository that is accessed by all the hosts in the site. For each scheduling decision, the SA algorithm calculates storage affinity values for all unscheduled tasks and dispatches the task with the largest storage affinity value. If none of the tasks has a positive storage affinity value, one of them is scheduled at random. By the time this initial scheduling is completed, all the hosts will be busy running the same number of tasks. On the completion of any of the running tasks, the SA algorithm starts task replication. Each of the remaining running tasks is then considered for replication and the best one is selected. The selection decision is based on the storage affinity value and the number of replicas available.
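The dispatch rule just described can be sketched directly. The data structures below (tasks as mappings from input file name to size, and a per-site set of stored files) are hypothetical, and the replication phase of SA is omitted; the sketch only shows the storage-affinity metric and the initial scheduling decision.

import random

def storage_affinity(task_inputs, site_files):
    """Bytes of the task's input data already stored at a site."""
    return sum(size for f, size in task_inputs.items() if f in site_files)

def pick_next_task(unscheduled, host_site, site_contents):
    """Send the task with the largest storage affinity to the free host's
    site; if no task has a positive affinity, pick one at random.
    Tasks are assumed to be dicts mapping input file name -> size in bytes."""
    scored = [(storage_affinity(t, site_contents[host_site]), t)
              for t in unscheduled]
    best_affinity, best_task = max(scored, key=lambda pair: pair[0])
    return best_task if best_affinity > 0 else random.choice(unscheduled)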
COMPARISON OF REPLICA PLACEMENT STRATEGIES
In this section, we summarize current and past research on different replica placement techniques for Data Grid environments. Several important factors, such as grid infrastructure, data access patterns, and network traffic conditions, are taken into account when choosing a replica placement strategy. In the presence of diverse and varying grid characteristics it is difficult to create a common ground for comparison of different strategies. To gain insight into the effectiveness of different replication strategies, we compare them by considering metrics including access latency, bandwidth savings, and server workload.
Response Time: The time that elapses from when a node sends a request for a file until it receives the complete file. If a local copy of the file exists, the response time is assumed to be zero.
Bandwidth Consumption: The bandwidth consumed by data transfers that occur when a node requests a file and when a server creates a replica at another node.
Server Workload: The amount of work done by the servers. Ideally, the replicas should be placed so that the workload on each server is balanced.
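Under the definitions above, the three metrics can be computed from a request trace as in the following sketch; the trace format is hypothetical and serves only to make the definitions precise.

def evaluate_placement(trace, local_files):
    """trace: list of (node, file, response_time_s, bytes_moved, server) for
    file requests; local_files maps node -> set of files it already holds.
    Local hits contribute zero response time and no transferred bytes."""
    response_times, bandwidth, server_load = [], 0, {}
    for node, f, rt, nbytes, server in trace:
        if f in local_files.get(node, set()):
            response_times.append(0.0)
        else:
            response_times.append(rt)
            bandwidth += nbytes
            server_load[server] = server_load.get(server, 0) + 1
    avg = sum(response_times) / len(response_times) if response_times else 0.0
    return avg, bandwidth, server_load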
Comparison of Replica Placement Algorithms
We start with the initial work (Ranganathan & Foster, 2001b) on replication strategies proposed for hierarchical Data Grids. Among these strategies, Fast Spread shows relatively consistent performance and is best both in terms of access latency and bandwidth consumption given random access patterns. The disadvantage is that it has high storage requirements. The entire storage space at each tier is fully utilized by Fast Spread. If, however, there is sufficient locality in the access patterns, Cascading would work better than the others in terms of both access latency and bandwidth consumption. The Best Client algorithm is naive and illustrates the worst case performance among those presented in (Ranganathan & Foster, 2001b). An improvement to the Cascading technique is the Proportional Share Replica policy (Abawajy, 2004). The method is a heuristic one that places replicas at the "optimal" locations by assuming that the
number of sites and the total number of replicas to be distributed are already known. Firstly, an ideal load distribution is calculated and then replicas are placed on candidate sites that can service replica requests slightly greater than or equal to that ideal load. This technique was evaluated based on mean response time (mean access latency). Simulation results show that it performs better than the cascading technique with increased availability of data and considers load sharing among replica servers. Unfortunately, the approach is unrealistic for most scenarios and is inflexible once placement decisions have been made. With the aim of improving the performance of data access given varying workloads, dynamic replication algorithms were presented by Tang et al. (M. Tang, Lee, Yeo, & Tang, 2005). In their paper, two dynamic replication algorithms, Simple Bottom-Up (SBU) and Aggregate Bottom-Up (ABU), were proposed for a multi-tier Data Grid. Their simulation results show that both algorithms can reduce the average response time of data access significantly compared to static replication methods. ABU can achieve great performance improvements for all access patterns even if the available storage size of the replication server is relatively small. Comparing the two algorithms to Fast Spread, the dynamic replication strategy ABU proves to be superior. As for SBU, although the average response time of Fast Spread (Ranganathan & Foster, 2001b) is better in most cases, Fast Spread’s replication frequency may be too high to be useful in the real world. A multi-objective approach to dynamic replication placement exploiting operations research techniques was proposed in (Rahman, Barker, & Alhajj, 2005a). In this method, replica placement decisions are made considering both the current network status and data request patterns. Dynamic maintainability is achieved by considering replica relocation cost. Decisions to relocate are made when a performance metric degrades significantly in the last specific number of time periods. Their technique was evaluated in terms of request-weighted average response time, but the performance results were not compared to any of the other existing replication techniques. The BHR (Park et al., 2003) dynamic replication strategy focuses on ‘network-level locality’ by trying to place the targeted file at a site that has broad bandwidth to the site of job execution. The BHR strategy was evaluated in term of job execution time (which includes access latency) with varying bandwidths and storage spaces using the OptorSim (W. Bell et al., 2003) simulator. The simulation results show that it can even outperform aggressive replication strategies like LRU Delete and Delete Oldest (W. Bell et al., 2003) in terms of data access time especially when grid sites have relatively small storage capacity and a clear hierarchy of bandwidths. Lin et al. (Lin et al., 2006) is one of the relatively few replication efforts that focus on overall grid performance. Their proposed placement algorithm, targeted for a tree-based network model, finds optimal locations for replicas so that the workload among the replicas is balanced. They also propose a new algorithm to determine the minimum number of replicas required when the maximum workload capacity of each replica server is known. All these algorithms ensure that QoS requirements from the users are satisfied. Subsequent work by Wang et al. (H. 
Wang et al., 2006) addresses the replica placement problem when the underlying network is a general graph, instead of a tree. Their experimental results indicate that their proposed algorithm efficiently finds near-optimal solutions.
SUMMARY AND OPEN PROBLEMS
This survey has reviewed data replication in grid systems, considering the issues and challenges involved in replication, with a primary focus on replica placement, which is at the core of all replication strategies.
Although replication in parallel and distributed systems has been intensively studied, new challenges in grid environments make replication an interesting ongoing topic and many research efforts are underway in this area. We have identified heterogeneity, dynamism, system reliability and availability, and the impact of data replication on scheduling as the primary challenges addressed by current research in grid replication. We also find that the evolution of Data Grid architectures (e.g., support for a variety of grid structure models, fragmented replicas, co-allocation mechanisms and data sharing) provides an opportunity to implement sophisticated data replication algorithms providing specific benefits. In addition to enhancements to classic replication algorithms, new methodologies have been applied, such as grid economic models and nature-inspired heuristics (e.g., genetic and ant algorithms). Due to the characteristics of grid systems and the challenges involved in replication, there are still many open issues related to data replication on the grid. Without any specific assumptions, we find the following general issues deserving of additional future exploration.
Fragmented Replication
The use of fragmented replicas in replica placement and selection is a recent research trend. As mentioned when discussing algorithms that use co-allocation methods, the problem with current strategies that deal with fragmented replicas is increased complexity. Usually, the blocks in a fragmented replica are considered to be contiguous. If they were not, then the data structure to represent the fragmented replica and the algorithm for retrieval would be more complicated. Also, the proposed algorithms (Chang & Chen, 2007) do not always find an optimal solution. It would be interesting to determine whether a worst-case performance bound exists for these algorithms. Finding efficient ways to handle fragmented replica updates would also be an interesting area for future research.
Algorithms that are Adaptive to Performance Variation
It will likely be important to come up with a suite of adaptive job placement and data movement algorithms that can dynamically select strategies depending on current and predicted grid conditions. The limitations of current rescheduling algorithms for Data Grids are high cost and lack of consideration of dependent tasks. For jobs whose turn-around times are large, rescheduling can improve performance dramatically. However, rescheduling is itself costly, especially when there are data dependencies among tasks rather than independent applications. In addition, many related problems must also be considered: when should the rescheduling mechanisms be invoked, what measurable parameters should be used to decide whether rescheduling will be profitable, and where should tasks be migrated to? Research on rescheduling for Data Grids is largely an open field for future work.
Enhanced Algorithms Combining Computation and Data Scheduling
Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling, which suggests possible opportunities for future work. Consideration of data staging in grid scheduling has an impact on the choice of a computational node for a task. The situation could be far more complex if there were multiple copies of data, and data dependencies among the tasks were considered. As discussed, the work in (Kim & Weissman, 2004) on scheduling decomposable Data Grid applications does not consider the case of multiple jobs competing for shared resources, which would be
an interesting topic for future research. Also, combined computation and data scheduling may lead to possible load balancing problems (e.g., the probability of scheduling the same type of job to the same cluster is high in the scheduling algorithm proposed in (Chang et al., 2007)). Thus, consideration of system load balancing with different scheduling factors will be an important future research direction.
New Models of Grid Architecture
Grid-like complex distributed environments cannot always be organized and controlled in a hierarchical manner. Any central directory service would inevitably become a performance bottleneck and a single point of failure. Rather, in the future, many of these systems will likely be operated in a self-organizing way, using replicated catalogs and a mechanism for the autonomous generation and placement of replicas at different sites. As discussed earlier, one open question for replica placement in such environments is how to determine replica locations when the network is a general graph instead of a tree. It is important to consider the properties of such graphs and derive efficient algorithms for use with them. The design of efficient algorithms for replica placement in grid systems when network congestion is one of the objective functions to be optimized also needs to receive further consideration.
Increased Collaboration Using VO-based Data Grids
Foster et al. (Foster, Kesselman, & Tuecke, 2001) have proposed a grid architecture for resource sharing among different entities based around the concept of Virtual Organizations (VOs). A VO is formed when different organizations pool resources and collaborate to achieve a common goal. A VO defines the resources available to the participants, the rules for accessing and using those resources, and the conditions under which they may be used. A VO also provides protocols and mechanisms for applications to determine the suitability and accessibility of available resources. The existence of VOs impacts the design of Data Grid architectures in many ways. For example, a VO may be stand-alone or may be composed of a hierarchy of regional, national and international VOs. In the latter case, the underlying Data Grid may have a corresponding hierarchy of repositories, and the replica discovery and management system might be structured accordingly. More importantly, sharing of data collections is guided by the relationships that exist between the VOs that own each of the collections. While Data Grids may be built around VOs, current technologies do not provide many of the capabilities required for enabling collaboration between participants. For example, the tree structure of many replication mechanisms inhibits direct copying of data between participants that reside on different branches. Replication systems, therefore, will likely need to follow hybrid topologies that involve peer-to-peer links between different branches for enhanced collaboration. With the use of VOs, efforts have moved towards community-based scheduling in which schedulers follow policies that are set at the VO level and enforced at the resource level through service level agreements and allocation quotas (Dumitrescu & Foster, 2004). Since communities are formed by pooling of resources by participants, resource allocation must ensure fair shares to everyone. This requires community-based schedulers that assign quotas to each of the users based on priorities and resource availability. Individual user schedulers should then submit jobs taking into account the assigned quotas and could negotiate with the central scheduler for a quota increase or a change in priorities. It could also be possible to swap or reduce quotas to gain resource share in the future. Users are able to plan ahead for future resource requirements by advance reservation of resources.
This community-based scheduling, combined with enhanced Data Grid capabilities for collaboration, will introduce new challenges to efficient replica placement in Data Grids and also the need to reduce replication cost.
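The following toy sketch illustrates the quota-based community scheduling idea described above: a VO-level scheduler derives per-user quotas from priorities, and user-level schedulers may only submit work while within their quota. Class and parameter names are hypothetical, and negotiation and advance reservation are not modelled.

```python
class CommunityScheduler:
    """Toy sketch of VO-level quota enforcement: the VO scheduler assigns each
    user a share of the resources pooled by the VO, and user-level schedulers
    may only submit work while they are within their quota."""

    def __init__(self, total_cpu_hours):
        self.total = total_cpu_hours
        self.quota = {}     # user -> assigned CPU hours
        self.used = {}      # user -> consumed CPU hours

    def assign_quotas(self, priorities):
        weight = sum(priorities.values())
        for user, p in priorities.items():
            self.quota[user] = self.total * p / weight
            self.used.setdefault(user, 0.0)

    def submit(self, user, cpu_hours):
        if self.used[user] + cpu_hours > self.quota[user]:
            return False                     # user must negotiate a larger quota
        self.used[user] += cpu_hours
        return True

scheduler = CommunityScheduler(total_cpu_hours=1000)
scheduler.assign_quotas({"alice": 3, "bob": 1})
print(scheduler.submit("alice", 500), scheduler.submit("bob", 400))  # True False
```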
CONCLUSION
Data Grids are being adopted widely for sharing data and for collaboratively managing and executing large-scale scientific applications that process large data sets, some of which are distributed around the world. However, ensuring efficient and fast access to such huge and widely distributed data is hindered by the high latencies of the Internet upon which many Data Grids are built. Replication of data is the most common solution to this problem. In this chapter, we have studied, characterized, and categorized the issues and challenges involved in such data replication systems. In doing so, we have tried to provide insight into the architectures, strategies and practices that are currently used in Data Grids for data replication. Also, through our characterization, we have attempted to highlight some of the shortcomings in the work done and to identify gaps in the current architectures and strategies. These represent some of the directions for future research in this area. This chapter provides a comprehensive study of replication in Data Grids that should not only serve as a tool for understanding this area but also as a reference by which future efforts can be classified.
REFERENCES
Abawajy, J. (2004). Placement of file replicas in data grid environments. In Proceedings of international conference on computational science (Vol. 3038, pp. 66-73). Allcock, W. (2003, Mar). GridFTP protocol specification. Global Grid Forum Recommendation GFD.20. Baker, M., Buyya, R., & Laforenza, D. (2002). Grids and grid technologies for wide-area distributed computing. Software: Practice and Experience, 32, 1437–1466. doi:10.1002/spe.488 Belalem, G., & Slimani, Y. (2006). A hybrid approach for consistency management in large scale systems. In Proceedings of the international conference on networking and services (pp. 71–76). Belalem, G., & Slimani, Y. (2007). Consistency management for data grid in optorsim simulator. In Proceedings of the international conference on multimedia and ubiquitous engineering (pp. 554–560). Bell, W., Cameron, D., Capozza, L., Millar, P., Stockinger, K., & Zini, F. (2003). Optorsim - A grid simulator for studying dynamic data replication strategies. International Journal of High Performance Computing Applications, 17, 403–416. doi:10.1177/10943420030174005 Bell, W. H., Cameron, D. G., Carvajal-Schiaffino, R., Millar, A. P., Stockinger, K., & Zini, F. (2003). Evaluation of an economy-based file replication strategy for a data grid. In Proceedings of the 3rd IEEE/ACM international symposium on cluster computing and the grid.
Chang, R., & Chang, J. (2006). Adaptable replica consistency service for data grids. In Proceedings of the third international conference on information technology: New generations (ITNG’06) (pp. 646–651). Chang, R., Chang, J., & Lin, S. (2007). Job scheduling and data replication on data grids. Future Generation Computer Systems, 23(7), 846–860. doi:10.1016/j.future.2007.02.008 Chang, R., & Chen, P. (2007). Complete and fragmented replica selection and retrieval in data grids. Future Generation Computer Systems, 23(4), 536–546. doi:10.1016/j.future.2006.09.006 Chang, R., Wang, C., & Chen, P. (2005). Replica selection on co-allocation data grids. In Proceedings of the second international symposium on parallel and distributed processing and applications (Vol. 3358, pp. 584–593). Chervenak, A. (2002). Giggle: A framework for constructing scalable replica location services. In Proceedings of the IEEE supercomputing (pp. 1–17). Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2000). The Data Grid: Towards an architecture for the distributed management and analysis of large scientific datasets. Journal of Network and Computer Applications, 23, 187–200. doi:10.1006/jnca.2000.0110 Dang, N. N., & Lim, S. B. (2007). Combination of replication and scheduling in data grids. International Journal of Computer Science and Network Security, 7(3). Desprez, F., & Vernois, A. (2006). Simultaneous scheduling of replication and computation for dataintensive applications on the grid. Journal of Grid Computing, 4(1), 66–74. doi:10.1007/s10723-0059016-2 Dheepak, R., Ali, S., Sengupta, S., & Chakrabarti, A. (2005). Study of scheduling strategies in a dynamic data grid environment. In Distributed Computing - IWDC 2004 (Vol. 3326). Berlin: Springer. Domenici, A., Donno, F., Pucciani, G., & Stockinger, H. (2006). Relaxed data consistency with CONStanza. In Proceedings of the sixth IEEE international symposium on cluster computing and the grid (pp. 425–429). Domenici, A., Donno, F., Pucciani, G., Stockinger, H., & Stockinger, K. (2004, Nov). Replica consistency in a Data Grid. Nuclear Instruments and Methods in Physics Research, 534, 24–28. doi:10.1016/j. nima.2004.07.052 Dorigo, M. (1992). Optimization, learning and natural algorithms (Tech. Rep.). Ph.D. Thesis, Politecnico di Milano, Milan, Italy. Dumitrescu, C., & Foster, I. (2004). Usage policy-based CPU sharing in virtual organizations. In Proceedings of the fifth IEEE/ACM international workshop on grid computing (pp. 53–60). Duvvuri, V., Shenoy, P., & Tewari, R. (2000). Adaptive leases: A strong consistency mechanism for the World Wide Web. In Proceedings of IEEE INFOCOM (pp. 834–843).
Elias, J. A., & Moldes, L. N. (2002a). Behaviour of the fast consistency algorithm in the set of replicas with multiple zones with high demand. In Proceedings of symposium in informatics and telecommunications. Elias, J. A., & Moldes, L. N. (2002b). A demand based algorithm for rapid updating of replicas. In Proceedings of IEEE workshop on resource sharing in massively distributed systems (pp. 686– 691). Elias, J. A., & Moldes, L. N. (2003). Generalization of the fast consistency algorithm to a grid with multiple high demand zones. In Proceedings of international conference on computational science (ICCS 2003) (pp. 275–284). Foster, I. (2006). Globus toolkit version 4: Software for service-oriented systems. In Proceedings of the international conference on network and parallel computing (pp. 2–13). Foster, I., Kesselman, C., & Tukcke, S. (2001). The anatomy of the grid: Enabling scalable virtual organizations. International Journal of High Performance Computing Applications, 15(3), 200–222. doi:10.1177/109434200101500302 Golding, R. A. (1992, Dec). Weak-consistency group communication and membership (Tech. Rep.). Computer and Information Sciences, University of California, Ph.D. Thesis. Hakami, S. (1999). Optimum location of switching centers and the absolute centers and medians of a graph. Operations Research, 12, 450–459. doi:10.1287/opre.12.3.450 He, X., & Sun, X. (2005). Incorporating data movement into grid task scheduling. In Proceedings of grid and cooperative computing (pp. 394–405). He, X., Sun, X., & Laszewski, G. (2003). QoS guided Min-Min heuristic for grid task scheduling. Journal of Computer Science and Technology, Special Issue on Grid Computing, 18 (4). Kesselman, C., & Foster, I. (1998). The Grid: Blueprint for a new computing infrastructure. San Francisco: Morgan Kaufmann Publishers. Khanna, G., Vydyanathan, N., Catalyurek, U., Kurc, T., Krishnamoorthy, S., Sadayappan, P., et al. (2006). Task scheduling and file replication for data-intensive jobs with batch-shared I/O. In Proceedings of high-performance distributed computing (HPDC) (pp. 241–252). Kim, S., & Weissman, J. B. (2004). A genetic algorithm based approach for scheduling decomposable data grid applications. In Proceedings of international conference on parallel processing (Vol. 1, pp. 405–413). Lamehamedi, H., & Szymanski, B. shentu, Z., & Deelman, E. (2002). Data replication strategies in grid environments. In Proceedings of the fifth international conference on algorithms and architectures for parallel processing (pp. 378–383). Lamehamedi, H., Szymanski, B., Shentu, Z., & Deelman, E. (2003). Simulation of dynamic data replication strategies in data grids. In Proceedings of the international parallel and distributed processing symposium (pp. 10–20).
Lee, Y. C., & Zomaya, A. Y. (2006). Data sharing pattern aware scheduling on grids. In Proceedings of International Conference on Parallel Processing, (pp. 365–372). Lei, M., & Vrbsky, S. V. (2006). A data replication strategy to increase data availability in data grids. In Proceedings of the international conference on grid computing and applications (pp. 221–227). Lin, Y., Liu, P., & Wu, J. (2006). Optimal placement of replicas in data grid environments with locality assurance. In Proceedings of the 12th International Conference on Parallel and Distributed Systems (ICPADS’06), 01, 465–474. Park, S., Kim, J., Ko, Y., & Yoon, W. (2003). Dynamic data grid replication strategy based on Internet hierarchy. In Proceedings of the second international workshop on grid and cooperative computing (GCC’2003). Rahman, R. M., Barker, K., & Alhajj, R. (2005). Replica selection in grid environment: A data-mining approach. In Proceedings of the ACM symposium on applied computing (pp. 695–700). Rahman, R. M., Barker, K., & Alhajj, R. (2005a). Replica placement in data grid: A multi-objective approach. In Proceedings of the international conference on grid and cooperative computing (pp. 645–656). Rahman, R. M., Barker, K., & Alhajj, R. (2005b). Replica placement in data grid: Considering utility and risk. In Proceedings of the international conference on information technology: Coding and computing (ITCC’05) (Vol. 1, pp. 354–359). Ranganathan, K., & Foster, I. (2001a). Design and evaluation of dynamic replication strategies for a high performance data grid. In Proceedings of the international conference on computing in high energy and nuclear physics (pp. 260-263). Ranganathan, K., & Foster, I. (2002). Decoupling computation and data scheduling in distributed data intensive applications. In Proceedings of the 11th international symposium for high performance distributed computing (HPDC) (pp. 352–358). Ranganathan, K., & Foster, I. (2003). Simulation studies of computation and data scheduling algorithms for data grids. Journal of Grid Computing, 1(1), 53–62. doi:10.1023/A:1024035627870 Ranganathan, K., & Foster, I. T. (2001b). Identifying dynamic replication strategies for a high-performance data grid. In Proceedings of the International Workshop on Grid Computing (GRID’2001) (pp. 75–86). Ranganathan, K., Iamnitchi, A., & Foster, I. (2002). Improving data availability through dynamic modeldriven replication in large peer-to-peer communities. In Proceedings of the 2nd IEEE/ACM international symposium on cluster computing and the grid (CCGRID’02) (pp. 376–381). Revees, C. (1993). Moderm heuristic techniques for combinatorial problems. Oxford, UK: Oxford Blackwell Scientific Publication. Saito, Y., & Levy, H. M. (2000). Optimistic replication for internet data services. In Proceedings of international symposium on distributed computing (pp. 297–314).
Saito, Y., & Shapiro, M. (2005). Optimistic replication. ACM Computing Surveys, 37(1), 42–81. doi:10.1145/1057977.1057980 Santos-Neto, E., Cirne, W., Brasileiro, F., & Lima, A. (2004). Exploiting replication and data reuse to efficiently schedule data-intensive applications on grids. In Proceedings of 10th workshop on job scheduling strategies for parallel processing (Vol. 3277, pp. 210–232). Schintke, F., & Reinefeld, A. (2003). Modeling replica availability in large data grids. Journal of Grid Computing, 1(2), 219–227. doi:10.1023/B:GRID.0000024086.50333.0d Sun, M., Sun, J., Lu, E., & Yu, C. (2005). Ant algorithm for file replica selection in data grid. In Proceedings of the first international conference on semantics, knowledge, and grid (SKG 2005) (pp. 64–66). Sun, Y., & Xu, Z. (2004). Grid replication coherence protocol. In Proceedings of the 18th international parallel and distributed processing symposium (pp. 232–239). Tang, M., Lee, B., Tang, X., & Yeo, C. K. (2005). Combining data replication algorithms and job scheduling heuristics in the data grid. In Proceedings of European conference on parallel computing (pp. 381–390). Tang, M., Lee, B., Yeo, C., & Tang, X. (2005). Dynamic replication algorithms for the multi-tier data grid. Future Generation Computer Systems, 21(5), 775–790. doi:10.1016/j.future.2004.08.001 Tang, M., Lee, B., Yeo, C., & Tang, X. (2006). The impact of data replication on job scheduling performance in the data grid. Future Generation Computer Systems, 22(3), 254–268. doi:10.1016/j. future.2005.08.004 Tang, X., & Xu, J. (2005). QoS-aware replica placement for content distribution. IEEE Transactions on Parallel and Distributed Systems, 16(10), 921–932. doi:10.1109/TPDS.2005.126 Vazhkudai, S. (2003, Nov). Enabling the co-allocation of grid data transfers. In Proceedings of the fourth international workshop on grid computing (pp. 41–51). Vazhkudai, S., Tuecke, S., & Foster, I. (2001). Replica selection in the globus data grid. In Proceedings of the first IEEE/ACM international conference on cluster computing and the grid (CCGRID 2001) (pp. 106–113). Venugopal, S., & Buyya, R. (2005, Oct). A deadline and budget constrained scheduling algorithm for escience applications on data grids. In Proceedings of the 6th international conference on algorithms and architectures for parallel processing (ICA3PP-2005) (pp. 60–72). Venugopal, S., Buyya, R., & Ramamohanarao, K. (2006). A taxonomy of data grids for distributed data sharing, management, and processing. ACM Computing Surveys, 1, 1–53. Wang, C., Hsu, C., Chen, H., & Wu, J. (2006). Efficient multi-source data transfer in data grids. In Proceedings of the sixth IEEE international symposium on cluster computing and the grid (CCGRID’06) (pp. 421–424). Wang, H., Liu, P., & Wu, J. (2006). A QoS-aware heuristic algorithm for replica placement. Journal of Grid Computing, 96–103.
Yang, C., Yang, I., Chen, C., & Wang, S. (2006). Implementation of a dynamic adjustment mechanism with efficient replica selection in data grid environments. In Proceedings of the ACM symposium on applied computing (pp. 797–804). Zhou, X., Kim, E., Kim, J. W., & Yeom, H. Y. (2006). ReCon: A fast and reliable replica retrieval service for the data grid. In Proceedings of IEEE international symposium on cluster computing and the grid (pp. 446–453).
KEY TERMS AND THEIR DEFINITIONS
Access Latency: Access latency is the time that elapses from when a node sends a request for a file until it receives the complete file.
Data Grids: Data Grids primarily deal with providing services and infrastructure for distributed data-intensive applications that need to access, transfer and modify massive datasets stored in distributed storage resources.
Job Scheduling: Job scheduling assigns incoming jobs to compute nodes in such a way that some evaluative conditions are met, such as the minimization of the overall execution time of the jobs.
Replica Consistency: The replica consistency problem deals with the update synchronization of multiple copies (replicas) of a file.
Replica Placement: The replica placement service is the component of a Data Grid architecture that decides where in the system a file replica should be placed.
Replica Selection: A replica selection service discovers the available replicas and selects the best replica that matches the user's location and quality of service (QoS) requirements.
Replication: Replication is an important technique to speed up data access for Data Grid systems by replicating the data in multiple locations, so that a user can access the data from a site in his vicinity.
ENDNOTE 1
A replica may be a complete or a partial copy of the original dataset.
Chapter 23
Architectural Elements of Resource Sharing Networks
Marcos Dias de Assunção, The University of Melbourne, Australia
Rajkumar Buyya, The University of Melbourne, Australia
ABSTRACT This chapter first presents taxonomies on approaches for resource allocation across resource sharing networks such as Grids. It then examines existing systems and classifies them under their architectures, operational models, support for the life-cycle of virtual organisations, and resource control techniques. Resource sharing networks have been established and used for various scientific applications over the last decade. The early ideas of Grid computing have foreseen a global and scalable network that would provide users with resources on demand. In spite of the extensive literature on resource allocation and scheduling across organisational boundaries, these resource sharing networks mostly work in isolation, thus contrasting with the original idea of Grid computing. Several efforts have been made towards providing architectures, mechanisms, policies and standards that may enable resource allocation across Grids. A survey and classification of these systems are relevant for the understanding of different approaches utilised for connecting resources across organisations and virtualisation techniques. In addition, a classification also sets the ground for future work on inter-operation of Grids.
INTRODUCTION
Since the formulation of the early ideas on meta-computing (Smarr & Catlett, 1992), several research activities have focused on mechanisms to connect worldwide distributed resources. Advances in distributed computing have enabled the creation of Grid-based resource sharing networks such as TeraGrid (Catlett, Beckman, Skow, & Foster, 2006) and Open Science Grid (2005). These networks, composed of multiple resource providers, enable collaborative work and sharing of resources such as computers,
storage devices and network links among groups of individuals and organisations. These collaborations, widely known as Virtual Organisations (VOs) (Foster, Kesselman, & Tuecke, 2001), require resources from multiple computing sites. In this chapter we focus on networks established by organisations to share computing resources. Despite the extensive literature on resource allocation and scheduling across organisational boundaries (Butt, Zhang, & Hu, 2003: Grimme, Lepping, & Papaspyrou, 2008; Iosup, Epema, Tannenbaum, Farrellee, & Livny, 2007; Ranjan, Rahman, & Buyya, 2008; Fu, Chase, Chun, Schwab, & Vahdat, 2003; Irwin et al., 2006; Peterson, Muir, Roscoe, & Klingaman, 2006; Ramakrishnan et al., 2006; Huang, Casanova, & Chien, 2006), existing resource sharing networks mostly work in isolation and with different utilisation levels (Assunção, Buyya, & Venugopal, 2008; Iosup et al., 2007), thus contrasting with the original idea of Grid computing (Foster et al., 2001). The early ideas of Grid computing have foreseen a global and scalable network that would provide users with resources on demand. We have previously demonstrated that there can exist benefits for Grids to share resources with one another such as reducing the costs incurred by over-provisioning (Assunção & Buyya, in press). Hence, it is relevant to survey and classify existing work on mechanisms that can be used to interconnect resources from multiple Grids. A survey and classification of these systems are important in order to understand the different approaches utilised for connecting resources across organisations and to set the ground for future work on inter-operation of resource sharing networks, such as Grids. Taxonomies on resource management systems for resource sharing networks have been proposed (Iosup et al., 2007; Grit, 2005). Buyya et al. (2000) and Iosup et al. (2007) have described the architectures used by meta-scheduler systems and how jobs are directed to the resources where they execute. Grit (2005) has classified the roles of intermediate parties, such as brokers, in resource allocation for virtual computing environments. This chapter extends existing taxonomies, thus making the following contributions: •
•
It examines additional systems and classifies them under a larger property spectrum namely resource control techniques, scheduling considering virtual organisations and arrangements for resource sharing. It provides classifications and a survey of work on resource allocation and scheduling across organisations, such as centralised scheduling, meta-scheduling and resource brokering in Grid computing. This survey aims to show different approaches to federate organisations in a resource sharing network and to allocate resources to its users. We also present a mapping of the surveyed systems against the proposed classifications.
BACKGROUND Several of the organisational models followed by existing Grids are based on the idea of VOs. The VO scenario is characterised by resource providers offering different shares of resources to different VOs via some kind of agreement or contract; these shares are further aggregated and allocated to users and groups within each VO. The life-cycle of a VO can be divided into four distinct phases namely creation, operation, maintenance, and dissolution. During the creation phase, an organisation looks for collaborators and then selects a list of potential partners to start the VO. The operation phase is concerned with resource management, task distribution, and usage policy enforcement (Wasson & Humphrey, 2003; Dumitrescu & Foster, 2004). The maintenance phase deals with the adaptation of the VO, such as al-
location of additional resources according to its users’ demands. The VO dissolution involves legal and economic issues such as determining the success or failure of the VO, intellectual property and revocation of access and usage privileges. The problem of managing resources within VOs in Grid computing is further complicated by the fact that resource control is generally performed at the job level. Grid-based resource sharing networks have users with units of work to execute, also called jobs; some entities decide when and where these jobs will execute. The task of deciding where and when to run the users’ work units is termed as scheduling. The resources contributed by providers are generally clusters of computers and the scheduling in these resources is commonly performed by Local Resource Management Systems (LRMSs) such as PBS (2005) and SGE (Bulhões, Byun, Castrapel, & Hassaine, 2004). Scheduling of Grid users’ applications and allocation of resources contributed by providers is carried out by Grid Resource Management Systems (GRMSs). A GRMS may comprise components such as: • • •
• Meta-schedulers, which communicate with LRMSs to place jobs at the provider sites;
• Schedulers that allocate resources considering how providers and users are organised in virtual organisations (Dumitrescu & Foster, 2005); and
• Resource brokers, which represent users or organisations by scheduling and managing job execution on their behalf.
These components interact with providers’ LRMSs either directly or via interfaces provided by the Grid middleware. The Grid schedulers can communicate with one another in various ways, which include via sharing agreements, hierarchical scheduling, Peer-to-Peer (P2P) networks, among others. Recently, utility data centres have deployed resource managers that allow the partitioning of physical resources and the allocation of raw resources that can be customised with the operating system and software of the user’s preference. This partitioning is made possible by virtualisation technologies such as Xen (Barham et al., 2003; Padala et al., 2007) and VMWare1. The use of virtualisation technologies for resource allocation enables the creation of customised virtual clusters (Foster et al., 2006; Chase, Irwin, Grit, Moore, & Sprenkle, 2003; Keahey, Foster, Freeman, & Zhang, 2006). The use of virtualisation technology allows for another form of resource control termed containment (Ramakrishnan et al., 2006), in which remote resources are bound to the users’ local computing site on demand. The resource shares can be exchanged across sites by intermediate parties. Thereby, a VO can allocate resources on demand from multiple resource providers and bind them to a customised environment, while maintaining it isolated from other VOs (Ramakrishnan et al., 2006). In the following sections, we classify existing systems according to their support to the life-cycle of VOs, their resource control techniques and the mechanisms for inter-operation with other systems. We also survey representative work and map them according to the proposed taxonomies.
CLASSIFICATIONS FOR GRID RESOURCE MANAGEMENT SYSTEMS Buyya et al. (2000) and Iosup et al. (2007) have classified systems according to their architectures and operational models. We present their taxonomy in this section because it classifies the way that schedulers can be organised in a resource sharing network. We have included a new operational model to the taxonomy (i.e. hybrid of job routing and job pulling). Moreover, systems with similar architecture may
Figure 1. Taxonomy on Grid resource management systems
still differ in terms of the mechanisms employed for resource sharing, the self-interest of the system’s participants, and the communication model. A Grid system can use decentralised scheduling wherein schedulers communicate their decisions with one another in a co-operative manner, thus guaranteeing the maximisation of the global utility of the system. On the other hand, a broker may represent a particular user community within the Grid, can have contracts with other brokers in order to use the resources they control and allocate resources that maximise its own utility (generally given by the achieved profit). We classify the arrangements between brokers in this section. Furthermore, systems can also differ according to their resource control techniques and support to different stages of the VO life-cycle. This section classifies resource control techniques and the systems’ support for virtual organisations. The attributes of GRMSs and the taxonomy are summarised in Figure 1.
Architecture and Operational Models of GRMSs This section describes several manners in which schedulers and brokers can be organised in Grid systems. Iosup et al. (2007) considered a multiple cluster scenario and classified the architectures possibly used
as Grid resource management systems. They classified the architectures in the following categories: •
•
• •
•
• Independent clusters - each cluster has its LRMS and there is no meta-scheduler component. Users submit their jobs to the clusters of the organisations to which they belong or on which they have accounts. We extend this category by including single-user Grid resource brokers. In this case, the user sends her jobs to a broker, which on behalf of the user submits jobs to clusters the user can access.
• Centralised meta-scheduler - there is a centralised entity to which jobs are forwarded. Jobs are then sent by the centralised entity to the clusters where they are executed. The centralised component is responsible for determining which resources are allocated to the job and, in some cases, for migrating jobs if the load conditions change.
• Hierarchical meta-scheduler - schedulers are organised in a hierarchy. Jobs arrive either at the root of the hierarchy or at the LRMSs.
• Distributed meta-scheduler - cluster schedulers can share jobs that arrive at their LRMSs with one another. Links can be defined either in a static manner (i.e. by the system administrator at the system's startup phase) or in a dynamic fashion (i.e. peers are selected dynamically at runtime). Grit (2007) discusses the types of contracts that schedulers (or brokers) can establish with one another.
• Hybrid distributed/hierarchical meta-scheduler - each Grid site is managed by a hierarchical meta-scheduler. Additionally, the root meta-schedulers can share the load with one another.
This classification is comprehensive since it captures the main forms through which schedulers and brokers can be organised in resource sharing networks. However, some categories can be further extended. For example, the site schedulers can be organised in several decentralised ways and use varying mechanisms for resource sharing, such as a mesh network in which contracts are established between brokers (Irwin et al., 2006; Fu et al., 2003) or via a P2P network with a bartering-inspired economic mechanism for resource sharing (Andrade, Brasileiro, Cirne, & Mowbray, 2007). Iosup et al. also classified a group of systems according to their operational model; the operational model corresponds to the mechanism that ensures jobs entering the system arrive at the resource in which they run. They have identified three operational models: • • •
• Job routing, whereby jobs are routed by the schedulers from the arrival point to the resources where they run through a push operation (scheduler-initiated routing);
• Job pulling, through which jobs are pulled from a higher-level scheduler by resources (resource-initiated routing); and
• Matchmaking, wherein jobs and resources are connected to one another by the resource manager, which acts as a broker matching requests from both sides.
We add a fourth category to the classification above in which the operational model can be a hybrid of job routing and job pulling. Examples of such cases include those that use a job pool to (from) which jobs are pushed (pulled) by busy (unoccupied) site schedulers (Grimme et al., 2008). (See Figure 2).
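A minimal sketch of this hybrid model is shown below: overloaded site schedulers push jobs into a shared pool and idle schedulers pull from it. The threshold rule and FIFO pool are illustrative assumptions rather than the mechanism of any particular system.

```python
from collections import deque

class JobPool:
    """Shared pool for the hybrid routing/pulling model: busy site schedulers
    push jobs into the pool, unoccupied ones pull from it."""

    def __init__(self):
        self.pool = deque()

    def push(self, job):
        self.pool.append(job)

    def pull(self):
        return self.pool.popleft() if self.pool else None

class SiteScheduler:
    def __init__(self, name, capacity):
        self.name, self.capacity, self.queue = name, capacity, []

    def submit(self, job, pool):
        if len(self.queue) >= self.capacity:
            pool.push(job)              # busy: route the job to the shared pool
        else:
            self.queue.append(job)

    def idle_cycle(self, pool):
        while len(self.queue) < self.capacity:
            job = pool.pull()           # unoccupied: pull work from the pool
            if job is None:
                break
            self.queue.append(job)

pool = JobPool()
a, b = SiteScheduler("siteA", 2), SiteScheduler("siteB", 2)
for j in ["j1", "j2", "j3", "j4"]:
    a.submit(j, pool)                   # siteA overflows j3 and j4 into the pool
b.idle_cycle(pool)
print(a.queue, b.queue)                 # ['j1', 'j2'] ['j3', 'j4']
```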
Figure 2. Architecture models of GRMSs
Arrangements Between Brokers in Resource Sharing Networks
This section describes the types of arrangements that can be established between clusters in resource sharing networks when decentralised or semi-decentralised architectures are in place. It is important to distinguish the way links between sites are established, and their communication pattern, from the mechanism used for negotiating the resource shares. We classify the work according to the communication model in the following categories:
•
522
P2P network - the sites of the resource sharing network are peers in a P2P network. They use the network to locate sites where the jobs can run (Butt et al., 2003; Andrade, Cirne, Brasileiro, & Roisenberg, 2003). Bilateral sharing agreements - sites establish bilateral agreements through which a site can locate another suitable site to run a given job. The redirection or acceptance of jobs occurs only
• •
between sites that have a sharing agreement (Epema, Livny, Dantzig, Evers, & Pruyne, 1996). Shared spaces - sites co-ordinate resource sharing via shared spaces such as federation directories and tuple spaces (Grimme et al., 2008; Ranjan et al., 2008). Transitive agreements - this is similar to bilateral agreements. However, a site can utilise resources from another site with which it has no direct agreement (Fu et al., 2003; Irwin et al., 2006).
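The difference between bilateral and transitive agreements can be illustrated with a small sketch: if bilateral agreements are edges of a graph, purely bilateral sharing restricts a site to its direct neighbours, whereas transitive sharing lets it reach any site connected through a chain of agreements. The graph and site names below are hypothetical.

```python
from collections import deque

def reachable_sites(agreements, origin):
    """Sites whose resources `origin` can use when agreements are transitive:
    a breadth-first walk over the bilateral-agreement graph. Under purely
    bilateral sharing, only the direct neighbours would be usable."""
    seen, frontier = {origin}, deque([origin])
    while frontier:
        site = frontier.popleft()
        for peer in agreements.get(site, []):
            if peer not in seen:
                seen.add(peer)
                frontier.append(peer)
    seen.discard(origin)
    return seen

agreements = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(sorted(agreements["A"]))                     # bilateral only: ['B']
print(sorted(reachable_sites(agreements, "A")))    # transitive: ['B', 'C', 'D']
```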
Although existing work can present similar communication models or similar organisational forms for brokers or schedulers, the resource sharing mechanisms can differ. The schedulers or brokers can use mechanisms for resource sharing from the following categories: •
•
•
• System centric - the mechanism is designed with the goal of maximising the overall utility of the participants. Such mechanisms aim to, for example, balance the load between sites (Iosup et al., 2007) and prevent free-riding (Andrade et al., 2007).
• Site centric - brokers and schedulers are driven by the interest of maximising the utility of the participants within the site they represent without the explicit goal of maximising the overall utility across the system (Butt et al., 2003; Ranjan, Harwood, & Buyya, 2006).
• Self-interested - brokers act with the goal of maximising their own utility, generally given by profit, yet satisfying the requirements of their users. They do not take into account the utility of the whole system (Irwin et al., 2006).
Resource Control Techniques The emergence of virtualisation technologies has resulted in the creation of testbeds wherein multiplesite slices (i.e. multiple-site containers) are allocated to different communities (Peterson et al., 2006). In this way, slices run concurrently and are isolated from each other. This approach, wherein resources are bound to a virtual execution environment or workspace where a service or application can run, is termed here as a container model. Most of the existing Grid middleware employ a job model in which jobs are routed until they reach the sites’ local batch schedulers for execution. It is clear that both models can co-exist, thus an existing Grid technology can be deployed in a workspace enabled by container-based resource management (Ramakrishnan et al., 2006; Montero, Huedo, & Llorente, 2008). We classify systems in the following categories: • •
Job model - this is the model currently utilised by most of the Grid systems. The jobs are directed or pulled across the network until they arrive at the nodes where they are finally executed. Container-based - resource managers in this category can manage a cluster of computers within a site by means of virtualisation technologies (Keahey et al., 2006; Chase et al., 2003). They bind resources to virtual clusters or workspaces according to a customer’s demand. They commonly provide an interface through which one can allocate a set of nodes (generally virtual machines) and configure them with the operating system and software of choice. ◦ Single-site - these container-based resource managers allow the user to create a customised virtual cluster using shares of the physical machines available at the site. These resource managers are termed here as single-site because they usually manage the resources of one administrative site (Fontán, Vázquez, Gonzalez, Montero, & Llorente, 2008; Chase et al.,
2003), although they can be extended to enable container-based resource control at multiple sites (Montero et al., 2008). Multiple-site - existing systems utilise the features of single-site container-based resource managers to create networks of virtual machines on which an application or existing Grid middleware can be deployed (Ramakrishnan et al., 2006). These networks of virtual machines are termed here as multiple-site containers because they can comprise resources bound to workspaces at multiple administrative sites. These systems allow a user to allocate resources from multiple computing sites thus forming a network of virtual machines or a multiple-site container (Irwin et al., 2006; Shoykhet, Lange, & Dinda, 2004; Ruth, Jiang, Xu, & Goasguen, 2005; Ramakrishnan et al., 2006). This network of virtual machines is also referred to as virtual Grid (Huang et al., 2006) or slice (Peterson et al., 2006).
Some systems such as Shirako (Irwin et al., 2006) and VioCluster (Ruth, McGachey, & Xu, 2005) provide container-based resource control. Shirako also offers resource control at the job level (Ramakrishnan et al., 2006) by providing a component that is aware of the resources leased. This component gives recommendations on which site can execute a given job.
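A rough sketch of what a container-style allocation interface might look like is given below: a request describes how many nodes a VO wants, with which software image and for how long, and a broker splits it into per-site tickets to be redeemed with the site authorities. The API, class names and splitting policy are assumptions made for illustration and do not reproduce Shirako's or Cluster on Demand's actual interfaces.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class LeaseRequest:
    """Hypothetical request for a multiple-site container: a set of virtual
    machines bound to the requesting VO for a fixed interval."""
    vo: str
    nodes: int
    image: str            # operating system / software stack to deploy
    start: datetime
    duration: timedelta

@dataclass
class Ticket:
    site: str
    request: LeaseRequest

def provision(request, free_capacity):
    """Toy broker policy: split the request across sites in order of free
    capacity and return one ticket per site, to be redeemed with each site
    authority for an actual lease."""
    tickets, needed = [], request.nodes
    for site, free in sorted(free_capacity.items(), key=lambda kv: -kv[1]):
        if needed == 0:
            break
        granted = min(free, needed)
        if granted:
            tickets.append(Ticket(site, LeaseRequest(request.vo, granted, request.image,
                                                     request.start, request.duration)))
            needed -= granted
    if needed:
        raise RuntimeError("not enough capacity to honour the request")
    return tickets

req = LeaseRequest("bio-vo", 24, "debian-globus4", datetime(2009, 6, 1, 8), timedelta(hours=12))
print([(t.site, t.request.nodes) for t in provision(req, {"siteA": 16, "siteB": 20})])
```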
Taxonomy on Virtual Organisations The idea of user communities or virtual organisations underlies several of the organisational models adopted by Grid systems and guides many of the efforts on providing fair resource allocation for Grids. Consequently, the systems can be classified according to the VO awareness of their scheduling and resource allocation mechanisms. One may easily advocate that several systems, that were not explicitly designed to support VOs, can be used for resource management within a VO. We restrict ourselves to provide a taxonomy that classifies systems according to (i) the VO awareness of their resource allocation and scheduling mechanisms; and (ii) the provision of tools for handling different issues related to the VO life-cycle. For the VO awareness of scheduling mechanisms we can classify the systems in: • •
• Multiple VOs - those scheduling mechanisms that perform scheduling and allocation taking into consideration the various VOs existing within a Grid; and
• Single VO - those mechanisms that can be used for scheduling within a VO.
Furthermore, the idea of VO has been used in slightly different ways in the Grid computing context. For example, in the Open Science Grid (OSG), VOs are recursive and may overlap. We use several criteria to classify VOs as presented in Figure 3. With regard to dynamism, we classify VOs as static and dynamic (Figure 3). Although Grid computing is mentioned as the enabler for dynamic VOs, it has been used to create more static and long-term collaborations such as APAC (2005), EGEE (2005), the UK National e-Science Centre (2005), and TeraGrid (Catlett et al., 2006). A static VO has a pre-defined number of participants and its structure does not change over time. A dynamic VO presents a number of participants that changes constantly as the VO evolves (Wesner, Dimitrakos, & Jeffrey, 2004). New participants can join, whereas existing participants may leave. A dynamic VO can be stationary or mobile. A stationary VO is generally composed of highly specialised resources including supercomputers, clusters of computers, personal computers and data resources. The
Figure 3. Taxonomy on Grid facilitated VOs
components of the VO are not mobile. In contrast, a mobile VO is composed of mobile resources such as Personal Digital Assistants (PDAs), mobile phones. The VO is highly responsive and adapts to different contexts (Wesner et al., 2004). Mobile VOs can be found in disaster handling and crisis management situations. Moreover, a VO can be hybrid, having both stationary and mobile components. Considering goal-orientation, we divide VOs into two categories: targeted and non-targeted (Figure 3). A targeted VO can be an alliance or collaboration created to explore a market opportunity or achieve a common research goal. A VO for e-Science collaboration is an example of a targeted VO as the participants have a common goal (Hey & Trefethen, 2002). A non-targeted VO is characterised by the absence of a common goal; it generally comprises participants who pursue different goals, yet benefit from the VO by pooling resources. This VO is highly dynamic because participants can leave when they achieve their goals. VOs can be short-, medium- or long-lived (Figure 3). A short-lived VO lasts for minutes or hours. A medium-lived VO lasts for weeks and is formed, for example, when a scientist needs to carry out experiments that take several days to finish. Data may be required to carry out such experiments. This scenario may be simplified if the VO model is used; the VO may not be needed as soon as the experiments have been carried out. A long-lived VO is formed to explore a market opportunity (goal-oriented) or to pool resources to achieve disparate objectives (non-targeted). Such endeavours normally last from months to years; hence, we consider a long-lived VO to last for several months or years. As discussed in the previous section, the formation and maintenance of a VO present several challenges. These challenges have been tackled in different ways, which in turn have created different formation and
maintenance approaches. We thus classify the formation and membership, or maintenance, as centralised and decentralised (Figure 3). The formation and membership of a centralised VO is controlled by a trusted third party, such as Open Science Grid (2005) or the Enabling Grids for E-SciencE (2005). OSG provides an open market where providers and users can advertise their needs and intentions; a provider or user may form a VO for a given purpose. EGEE provides a hierarchical infrastructure to enable the formation of VOs. On the other hand, in a decentralised controlled VO, no third party is responsible for enabling or controlling the formation and maintenance. This kind of VO can be complex as it can require multiple Service Level Agreements (SLAs) to be negotiated among multiple participants. In addition, the monitoring of SLAs and commitment of the members are difficult to control. The VO also needs to self-adapt when participants leave or new participants join. Regarding the enforcement of policies, VOs can follow different approaches, such as hub or democratic. This is also referred to as topology. Katzy et al. (2005) classify VOs in terms of topology, identifying the following types: chain, star or hub, and peer-to-peer. Sairamesh et al. (2005) identify business models for VOs; the business models are analogous to topologies. However, by discussing the business models for VOs, the authors are concerned with a larger set of problems, including enforcement of policies, management, trust and security, and financial aspects. In our taxonomy, we classify the enforcement and monitoring of policies as star or hub, democratic or peer-to-peer, hierarchical, and chain (Figure 3). Some projects such as Open Science Grid (2005) and EGEE (2005) aim to establish consortiums or clusters of organisations, which in turn allow the creation of dynamic VOs. Although not very related to the core issues of VOs, they aim to address an important problem: the establishment of trust between organisations and the means for them to look for and find potential partners. These consortiums can be classified as hierarchical and market-like (Figure 3). A market-like structure is any infrastructure that offers a market place, which organisations can join and present interests in starting a new collaboration or accepting to participate in an ongoing collaboration. These infrastructures may make use of economic models such as auctions, bartering, and bilateral negotiation.
A SURVEY OF EXISTING WORK This section describes relevant work of the proposed taxonomy in more detail. First, it describes work on a range of systems that have a decentralised architecture. Some systems present a hierarchy of scheduling whereby jobs are submitted to the root of the hierarchy or to their leaves; in either case, the jobs execute at the leaves of the hierarchical structure. Second, this section presents systems of hierarchical structure, resource brokers and meta-scheduling frameworks. During the past few years, several Gridbased resource sharing networks and other testbeds have been created. Third, we discuss the work on inter-operation between resource sharing networks. Finally, this section discusses relevant work focusing on VO issues.
Distributed Architecture Based Systems Condor Flocking: The flocking mechanism used by Condor (Epema et al., 1996) provides a software approach to interconnect pools of Condor resources. The mechanism requires manual configuration of sharing agreements between Condor pools. Each pool owner and each workstation owner maintains full control of when their resources can be used by external jobs.
The developers of Condor flocking opted for a layered design for the flocking mechanism, which enables the Condor’s Central Manager (CM) (Litzkow, Livny, & Mutka, 1988) and other Condor machines to remain unmodified and operate transparently from the flock. The basis of the flocking mechanism is formed by Gateway Machines (GW). There is at least one GW in each Condor pool. GWs act as resource brokers between pools. Each GW has a configuration file describing the subset of connections it maintains with other GWs. Periodically, a GW queries the status of its pool from the CM. From the list of resources obtained, the GW makes a list of those resources that are idle. The GW then sends this list to the other GWs to which it is connected. Periodically, the GW that received this list chooses a machine from the list, and advertises itself to the CM with the characteristics of this machine. The flocking protocol (which is a modified version of the normal Condor protocol) allows the GWs to create shadow processes that so that a submission machine is under the impression of contacting the execution machine directly. Self-Organizing Flock of Condors: The original flocking scheme of Condor has the drawback that knowledge about all pools with which resources can be shared need to be known a priori before starting Condor (Epema et al., 1996). This static information poses limitations regarding the number of resources available and resource discovery. Butt et al. (2003) introduced a self-organising resource discovery mechanism for Condor, which allows pools to discover one another and resources available dynamically. The P2P network used by the flocking mechanism is based on Pastry and takes into account the network proximity. This may result in saved bandwidth in data transfer and faster communications. Experiments with this implementation considering four pools with four machines each were provided. Additionally, simulation results demonstrated the performance of the flocking mechanism when interconnecting 1,000 pools. Shirako: Shirako (Irwin et al., 2006) is a system for on-demand leasing of shared networked resources across clusters. Shirako’s design goals include: autonomous providers, who may offer resources to the system on a temporary basis and retain the ultimate control over them; adaptive guest applications that lease resources from the providers according to changing demand; pluggable resource types, allowing participants to include various types of resources, such as network links, storage and computing; brokers that provide guest applications with an interface to acquire resources from resource providers; and allocation policies at guest applications, brokers and providers, which define the manner resources are allocated in the system. Shirako utilises a leasing abstraction in which authorities representing provider sites offer their resources to be provisioned by brokers to guest applications. Shirako brokers are responsible for coordinating resource allocation across provider sites. The provisioning of resources determines how much of each resource each guest application receives, when and where. The site authorities define how much resource is given to which brokers. The authorities also define which resources are assigned to serve requests approved by a broker. When a broker approves a request, it issues a ticket that can be redeemed for a lease at a site authority. 
The ticket specifies the type of resource, the number of resource units granted and the interval over which the ticket is valid. Sites issue tickets for their resources to brokers; the brokers’ polices may decide to subdivide or aggregate tickets. A service manager is a component that represents the guest application and uses the lease API provided by Shirako to request resources from the broker. The service manager determines when and how to redeem existing tickets, extend existing leases, or acquire new leases to meet changing demand. The system allows guest applications to renew or extend their leases. The broker and site authorities match accumulated pending requests with resources under the authorities’ control. The broker prioritises requests
and selects resource types and quantities to serve them. The site authority assigns specific resource units from its inventory to fulfill lease requests that are backed by a valid ticket. Site authorities use Cluster on Demand (Chase et al., 2003) to configure the resources allocated at the remote sites. The leasing abstraction provided by Shirako is a useful basis to co-ordinate resource sharing for systems that create distributed virtual execution environments of networked virtual machines (Keahey et al., 2006; Ruth, Rhee, Xu, Kennell, & Goasguen, 2006; Adabala et al., 2005; Shoykhet et al., 2004). Ramakrishnan et al. (2006) used Shirako to provide a hosting model wherein Grid deployments run in multiple-site containers isolated from one another. An Application Manager (AM), which is the entry point of jobs from a VO or Grid, interacts with a Grid Resource Oversight Coordinator (GROC) to obtain a recommendation of a site to which jobs can be submitted. The hosting model uses Shirako’s leasing core. A GROC performs the functions of leasing resources from computing sites and recommending sites for task submission. At the computing site, Cluster on Demand is utilised to provide a virtual cluster used to run Globus 4 along with Torque/MAUI. VioCluster: VioCluster is a system that enables dynamic machine trading across clusters of computers (Ruth, McGachey, & Xu, 2005). VioCluster introduces the idea of virtual domain. A virtual domain, originally comprising its physical domain of origin (i.e. a cluster of computers), can grow in the number of computing resources, thus dynamically allocating resources from other physical domains according to the demands of its user applications. VioCluster presents two important system components: the creation of dynamic virtual domains and the mechanism through which resource sharing is negotiated. VioCluster uses machine and network virtualisation technology to move machines between domains. Each virtual domain has a broker that interacts with other domains. A broker has a borrowing policy and a lending policy. The borrowing policy determines under which circumstances the broker will attempt to obtain more machines. The lending policy governs when it is willing to let another virtual domain make use of machines within its physical domain. The broker represents a virtual domain when negotiating trade agreements with other virtual domains. It is the broker’s responsibility to determine whether trades should occur. The policies for negotiating the resources specify: the reclamation, that is, when the resources will be returned to their home domain; machine properties, which represent the machines to be borrowed; and the machines’ location as some applications require communication. The borrowing policy must be aware of the communication requirements of user applications. Machine virtualisation simplifies the transfer of machines between domains. When a machine belonging to a physical domain B is borrowed by a virtual domain A, it is utilised to run a virtual machine. This virtual machine matches the configuration of the machines in physical domain A. Network virtualisation enables the establishment of virtual network links connecting the new virtual machine to the nodes of domain A. For the presented prototype, PBS is used to manage the nodes of the virtual domain. PBS is aware of the computers’ heterogeneity and never schedules jobs on a mixture of virtual and physical machines. 
The size of the work queue in PBS was used as a measure of the demand within a domain. OurGrid: OurGrid (Andrade et al., 2003) is a resource sharing system organised as a P2P network of sites that share resources equitably in order to form a Grid to which they all have access. OurGrid was designed with the goal of easing the assembly of Grids, thus it provides connected sites with access to the Grid resources with a minimum of guarantees needed. OurGrid is used to execute Bag-of-Tasks (BoT) applications. BoT are parallel applications composed of a set of independent tasks that do not communicate with one another during their execution. In contrast to other Grid infrastructures, the system
does not require offline negotiations if a resource owner wants to offer her resources to the Grid. OurGrid uses a resource exchange mechanism termed network of favours. A participant A is doing a favour to participant B when A allows B to use her resources. According to the network of favours, every participant does favours to other participants expecting the favours to be reciprocated. In conflicting situations, a participant prioritises those who have done favours to it in the past. The more favours a participant does, the more it expects to be rewarded. The participants locally account their favours and cannot profit from them in another way than expecting other participants to do them some favours. Detailed experiments have demonstrated the scalability of the network of favours (Andrade et al., 2007), showing that the larger the network becomes, the more fair the mechanism performs. The three participants in the OurGrid’s resource sharing protocol are clients, consumers, and providers. A client requires access to the Grid resources to run her applications. The consumer receives requests from the client to find resources. When the client sends a request to the consumer, the consumer first finds the resources able to serve the request and then executes the tasks on the resources. The provider manages the resources shared in the community and provides them to consumers. Delegated Matchmaking:Iosup et al. (2007) introduced a matchmaking protocol in which a computing site binds resources from remote sites to its local environment. A network of sites, created on top of the local cluster schedulers, manages the resources of the interconnected Grids. Sites are organised according to administrative and political agreements so that parent-child links can be established. Then, a hierarchy of sites is formed with the Grid clusters at the leaves of the hierarchy. After that, supplementary to the hierarchical links, sibling links are established between sites that are at the same hierarchical level and operate under the same parent site. The proposed delegated matchmaking mechanism enables requests for resources to be delegated up and down the hierarchy thus achieving a decentralised network. The architecture is different from work wherein a scheduler forwards jobs to be executed on a remote site. The main idea of the matchmaking mechanism is to delegate ownership of resources to the user who requested them through this network of sites, and add the resources transparently to the user’s local site. When a request cannot be satisfied locally, the matchmaking mechanism adds remote resources to the user’s site. This simplifies security issues since the mechanism adds the resources to the trusted local resource pool. Simulation results show that the mechanism leads to an increase in the number of requests served by the interconnected sites. Grid Federation:Ranjan et al. (2005) proposed a system that federates clusters of computers via a shared directory. Grid Federation Agents (GFAs), representing the federated clusters, post quotes about idle resources (i.e. a claim stating that a given resource is available) and, upon the arrival of a job, query the directory to find a resource suitable to execute the job. The directory is a shared-space implemented as a Distributed Hash Table (DHT) P2P network that can match quotes and user requests (Ranjan et al., 2008). An SLA driven co-ordination mechanism for Grid superscheduling has also been proposed (Ranjan et al., 2006). 
GFAs negotiate SLAs and redirect requests through a Contract-Net protocol. GFAs use a greedy policy to evaluate resource requests. A GFA is a cluster resource manager and has control over the cluster's resources. GFAs engage in bilateral negotiations for each request they receive, without considering network locality.

Askalon: Siddiqui et al. (2006) introduced a capacity planning architecture with a three-layer negotiation protocol for advance reservation on Grid resources. The architecture is composed of allocators that make reservations of individual nodes and co-allocators that reserve multiple nodes for a single Grid application. A co-allocator receives requests from users and generates alternative offers that the user
can utilise to run her application. A co-allocation request can comprise a set of allocation requests, each allocation request corresponding to an activity of the Grid application. A workflow with a list of activities is an example of a Grid application requiring co-allocation of resources. Co-allocators aim to agree on Grid resource sharing. The proposed co-ordination mechanism produces contention-free schedules either by eliminating conflicting offers or by lowering the objective level of some of the allocators.

GRUBER/DI-GRUBER: Dumitrescu et al. (2005) highlighted that challenging usage policies can arise in VOs that comprise participants and resources from different physical organisations. Participants want to delegate access to their resources to a VO, while maintaining such resources under the control of local usage policies. They seek to address the following issues:

• How usage policies are enforced at the resource and VO levels.
• What mechanisms are used by a VO to ensure policy enforcement.
• How the distribution of policies to the enforcement points is carried out.
• How policies are made available to VO job and data planners.
They have proposed a policy management model in which participants can specify the maximum percentage of resources delegated to a VO. A VO in turn can specify the maximum percentage of resource usage it wishes to delegate to a given VO's group. Based on this model, they have proposed a Grid resource broker termed GRUBER (Dumitrescu & Foster, 2005); a small illustrative sketch of this two-level policy check appears after the component list below. The GRUBER architecture is composed of four components, namely:

• Engine: implements several algorithms to detect available resources.
• Site monitoring: one of the data providers for the GRUBER engine, responsible for collecting data on the status of Grid elements.
• Site selectors: tools that communicate with the engine and provide information about which sites can execute the jobs.
• Queue manager: resides on the submitting host and decides how many jobs should be executed and when.
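The percentage-based delegation model above naturally maps onto a two-level admission check: a site caps the fraction of its resources that a VO may consume, and the VO caps the fraction of its allocation that any one of its groups may consume. The sketch below is purely illustrative and is not GRUBER's implementation; all names (SitePolicy, VOPolicy, can_admit) and the CPU-based example are assumptions.

# Illustrative two-level usage-policy check in the spirit of the model above.
# All names (SitePolicy, VOPolicy, can_admit) are hypothetical.

from dataclasses import dataclass

@dataclass
class SitePolicy:
    total_cpus: int      # CPUs owned by the physical site
    vo_share: dict       # VO name -> max fraction of the site delegated to that VO

@dataclass
class VOPolicy:
    group_share: dict    # group name -> max fraction of the VO's allocation

def can_admit(site, vo_policy, vo, group, vo_usage, group_usage, requested):
    """Return True if 'requested' CPUs for (vo, group) respect both policy levels."""
    vo_cap = site.vo_share.get(vo, 0.0) * site.total_cpus
    group_cap = vo_policy.group_share.get(group, 0.0) * vo_cap
    return vo_usage + requested <= vo_cap and group_usage + requested <= group_cap

# Example: a site delegates at most 40% of 100 CPUs to the 'atlas' VO;
# the 'analysis' group may use at most half of that delegation.
site = SitePolicy(total_cpus=100, vo_share={"atlas": 0.4})
vo = VOPolicy(group_share={"analysis": 0.5})
print(can_admit(site, vo, "atlas", "analysis", vo_usage=10, group_usage=10, requested=5))  # True
print(can_admit(site, vo, "atlas", "analysis", vo_usage=10, group_usage=18, requested=5))  # False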
Users who want to execute jobs do so by sending them to submitting hosts. The integration of existing external schedulers with GRUBER is performed at the submitting hosts. The external scheduler utilises GRUBER either as the queue manager that controls the start time of jobs and enforces VO policies, or as a site recommender. The second case is applicable if the queue manager is not available. DI-GRUBER, a distributed version of GRUBER, has also been presented (Dumitrescu, Raicu, & Foster, 2005). DI-GRUBER works with multiple decision points, which gather information to steer resource allocations defined by Usage Service Level Agreements (USLAs). These points make decisions on a per-job basis to comply with resource allocations to VO groups. The authors advocated that 4 to 5 decision points were enough to handle the job scheduling of a Grid 10 times larger than Grid3 at the time the work was carried out (Dumitrescu, Raicu, & Foster, 2005).

Other important work: Balazinska et al. (2004) have proposed a load balancing mechanism for Medusa. Medusa is a stream processing system that allows the migration of stream processing operators from overloaded to under-utilised resources. The request offloading is performed based on the marginal cost of the request. The marginal cost for a participant is the increase (or decrease) in the participant's cost curve caused by the acceptance (or removal) of the request among the requests it serves.
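To make the notion of marginal cost concrete, the sketch below uses a hypothetical quadratic cost curve and hypothetical function names; it is not Medusa's actual cost model or contract mechanism.

# Illustration of marginal-cost based offloading; the quadratic cost curve and
# all function names are assumptions made for this example.

def cost(load):
    """Hypothetical convex cost curve: cost grows faster as a node becomes loaded."""
    return load ** 2

def marginal_cost(current_load, request_load):
    """Increase in cost if the request is accepted on top of the current load."""
    return cost(current_load + request_load) - cost(current_load)

def should_offload(local_load, request_load, unit_price):
    """Offload when paying a partner's price is cheaper than the local marginal cost."""
    return unit_price * request_load < marginal_cost(local_load, request_load)

# An overloaded node (load 9) offloads a request of size 2 if a partner charges 5 per unit:
print(marginal_cost(9, 2))         # 40
print(should_offload(9, 2, 5))     # True: paying 10 beats a marginal cost of 40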
NWIRE (Schwiegelshohn & Yahyapour, 1999) links various resources to a metacomputing system, also termed a meta-system. It also enables scheduling in these environments. A meta-system comprises interconnected MetaDomains. Each MetaDomain is managed by a MetaManager that manages a set of ResourceManagers. A ResourceManager interfaces with the scheduler at the cluster level. The MetaManager permanently collects information about all of its resources. It handles all requests inside its MetaDomain and works as a resource broker to other MetaDomains. In this way, requests received by a MetaManager can be submitted either by users within its MetaDomain or by other MetaManagers. Each MetaManager contains a scheduler that maps requests for resources to a specific resource in its MetaDomain.

Grimme et al. (2008) have presented a mechanism for collaboration between resource providers by means of job interchange through a central job pool. According to this mechanism, a cluster scheduler adds to the central pool jobs that cannot be started immediately. After scheduling local jobs, a local scheduler can schedule jobs from the central pool if resources are available.

Dixon et al. (2006) have provided a tit-for-tat or bartering mechanism based on local, non-transferable currency for resource allocation in large-scale distributed infrastructures such as PlanetLab. The currency is maintained locally within each domain in the form of credit given to other domains for providing resources in the past. This creates pair-wise relationships between administrative domains. The mechanism resembles OurGrid's network of favours (Andrade et al., 2003). The information about exchanged resources decays with time, so that recent behaviour is more important (a minimal sketch of such decay-weighted credit is given after this group of examples). Simulation results showed that, for an infrastructure like PlanetLab, the proposed mechanism is fairer than the free-for-all approach currently adopted by PlanetLab.

Graupner et al. (2002) have introduced a resource control architecture for federated utility data centres. In this architecture, physical resources are grouped in virtual servers and services are mapped to virtual servers. The meta-system is the upper layer implemented as an overlay network whose nodes contain descriptive data about the two layers below. Allocations change according to service demand, which requires the control algorithms to be reactive and to deliver quality solutions. The control layer performs allocation of services to virtual server environments and its use has been demonstrated by a capacity control example for a homogeneous Grid cluster.
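As mentioned above, the bartering mechanism of Dixon et al. keeps pair-wise credit that decays with time so that recent behaviour weighs more. The sketch below illustrates one way such decay-weighted credit could be accounted for; the exponential decay, the half-life parameter and all names are assumptions made for illustration, not the published algorithm.

# Sketch of decay-weighted, pair-wise credit. The exponential decay and all
# names (CreditLedger, record_favour, priority) are illustrative assumptions.

import math

class CreditLedger:
    def __init__(self, half_life_days=30.0):
        self.half_life = half_life_days
        self.credit = {}  # remote domain -> (credit value, time of last update in days)

    def _decay(self, value, elapsed_days):
        """Old credit loses half of its weight every 'half_life_days'."""
        return value * math.exp(-math.log(2) * elapsed_days / self.half_life)

    def record_favour(self, domain, amount, now):
        """Remote 'domain' provided 'amount' of resources to us at time 'now'."""
        value, last = self.credit.get(domain, (0.0, now))
        self.credit[domain] = (self._decay(value, now - last) + amount, now)

    def priority(self, domain, now):
        """Domains with more (recent) credit are served first under contention."""
        value, last = self.credit.get(domain, (0.0, now))
        return self._decay(value, now - last)

ledger = CreditLedger()
ledger.record_favour("domain-a.example", 10.0, now=0.0)
ledger.record_favour("domain-b.example", 10.0, now=55.0)
# Sixty days in, the older favour has decayed more, so domain-b is preferred.
print(ledger.priority("domain-a.example", now=60.0) < ledger.priority("domain-b.example", now=60.0))  # True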
Hierarchical Systems, Brokers and Meta-Scheduling

This section describes some systems that are organised in a hierarchical manner. We also describe work on Grid resource brokering and frameworks that can be used to build meta-schedulers.

Computing Center Software (CCS): CCS (Brune, Gehring, Keller, & Reinefeld, 1999) is a system for managing geographically distributed high-performance computers. It consists of three components, namely: the CCS, which is vendor-independent resource management software for local HPC systems; the Resource and Service Description (RSD), used by the CCS to specify and map hardware and software components of computing environments; and the Service Coordination Layer (SCL), which co-ordinates the use of resources across computing sites. The CCS controls the mapping and scheduling of interactive and parallel jobs on massively parallel systems. It uses the concept of an island, wherein each island has components for user interface, authorisation and accounting, scheduling of user requests, access to the physical parallel system, system control, and management of the island. At the meta-computing level, the Center Resource Manager (CRM) exposes scheduling and brokering features of the islands. The CRM is a management tool atop the CCS islands. When a user submits an application, the CRM maps the user request to the static and dynamic
information on resources available. Once the resources are found, the CRM requests the allocation of all required resources at all the islands involved. If not all resources are available, the CRM either re-schedules the request or rejects it. The Center Information Server (CIS) is a passive component that contains information about resources and their statuses, and is analogous to the Globus Metacomputing Directory Service (MDS) (Foster & Kesselman, 1997). It is used by the CRM to obtain information about available resources. The Service Co-ordination Layer (SCL) is located one level above the local resource management systems. The SCL co-ordinates the use of resources across the network of islands. It is organised as a network of co-operating servers, wherein each server represents one computing centre. The centres determine which resources are made available to others and retain full autonomy over them.

EGEE Workload Management System (WMS): The EGEE WMS (Vázquez-Poletti, Huedo, Montero, & Llorente, 2007) has a semi-centralised architecture. One or more schedulers can be installed in the Grid infrastructure, each providing scheduling functionality for a group of VOs. The EGEE WMS components are: the User Interface (UI), from which the user dispatches jobs; the Resource Broker (RB), which uses Condor-G (Frey, Tannenbaum, Livny, Foster, & Tuecke, 2001); the Computing Element (CE), which is the cluster front-end; the Worker Nodes (WNs), which are the cluster nodes; the Storage Element (SE), used to store job files; and the Logging and Bookkeeping service (LB), which registers job events.

Condor-G: Condor-G (Frey et al., 2001) leverages software from Globus and Condor and allows users to utilise resources spanning multiple domains as if they all belonged to one personal domain. Although Condor-G can be viewed as a resource broker itself (Venugopal, Nadiminti, Gibbins, & Buyya, 2008), it can also provide a framework to build meta-schedulers. The GlideIn mechanism of Condor-G is used to start a daemon process on a remote resource. The process uses standard Condor mechanisms to advertise the resource availability to a Condor collector process, which is then queried by the Scheduler to learn about available resources. Condor-G uses Condor mechanisms to match locally queued jobs to the resources advertised by these daemons and to execute them on those resources. Condor-G submits an initial GlideIn executable (a portable shell script), which in turn uses GSI-authenticated GridFTP to retrieve the Condor executables from a central repository. By submitting GlideIns to all remote resources capable of serving a job, Condor-G can guarantee optimal queuing times to user applications.

Gridbus Broker: The Gridbus Grid resource broker (Venugopal et al., 2008) is a user-centric broker that provides scheduling algorithms for both computing- and data-intensive applications. In Gridbus, each user has her own broker, which represents the user by (i) selecting resources that meet the user's quality of service constraints, such as execution deadline and budget; (ii) submitting jobs to remote resources; and (iii) copying input and output files. Gridbus interacts with various Grid middleware (Venugopal et al., 2008).

GridWay: GridWay (Huedo, Montero, & Llorente, 2004) is a Globus-based resource broker that provides a framework for execution of jobs in a 'submit and forget' fashion. The framework performs job submission and execution monitoring.
Job execution adapts itself to dynamic resource conditions and application demands in order to improve performance. The adaptation is performed through application migration following performance degradation, sophisticated resource discovery, requirements change, or remote resource failure. The framework is modular, wherein the following modules can be set on a per-job basis: resource selector, performance degradation evaluator, prolog, wrapper and epilog. The names of the first two modules are intuitive, so we describe here only the last three. During prolog, the component
responsible for job submission (i.e. the submission manager) submits the prolog executable, which configures the remote system and transfers the executable and input files. In the case of a restart of an execution, the prolog also transfers restart files. The wrapper executable is submitted after the prolog and wraps the actual job in order to obtain its exit code. The epilog is a script that transfers the output files and cleans the remote resource. GridWay also enables the deployment of virtual machines in a Globus Grid (Rubio-Montero, Huedo, Montero, & Llorente, 2007). The scheduling and selection of suitable resources is performed by GridWay, whereas a virtual workspace is provided for each Grid job. A pre-wrapper phase is responsible for performing advanced job configuration routines, whereas the wrapper script starts a virtual machine and triggers the application job on it.

KOALA: Mohamed and Epema (in press) have presented the design and implementation of KOALA, a Grid scheduler that supports resource co-allocation. The KOALA Grid scheduler interacts with cluster batch schedulers for the execution of jobs. The work proposes an alternative to advance reservation at local resource managers for cases where reservation features are not available. This alternative allows processors to be allocated from multiple sites at the same time.

SNAP-Based Community Resource Broker: The Service Negotiation and Acquisition Protocol (SNAP)-based community resource broker uses an interesting three-phase commit protocol. SNAP is proposed because traditional advance reservation facilities cannot cope with the fact that availability information may change between the moment at which resource availability is queried and the time when the reservation of resources is actually performed (Haji, Gourlay, Djemame, & Dew, 2005). The three phases of the SNAP protocol consist of (i) a step in which resource availability is queried and probers are deployed, which inform the broker in case the resource status changes; (ii) then, the resources are selected and reserved; and (iii) after that, the job is deployed on the reserved resources.

Platform Community Scheduler Framework (CSF): CSF (2003) provides a set of tools that can be utilised to create a Grid meta-scheduler or a community scheduler. The meta-scheduler enables users to define the protocols to interact with resource managers in a system-independent manner. The interface with a resource manager is performed via a component termed the Resource Manager (RM) Adapter. An RM Adapter interfaces with a cluster resource manager. CSF supports the GRAM protocol to access the services of resource managers that do not support the RM Adapter interface. Platform's LSF and MultiCluster products leverage the CSF to provide a framework for implementing meta-scheduling. Grid Gateway is an interface that integrates Platform LSF and CSF. A scheduling plug-in for the Platform LSF scheduler decides which LSF jobs are forwarded to the meta-scheduler. This decision is based on information obtained from an information service provided by the Grid Gateway. When a job is forwarded to the meta-scheduler, the job submission and monitoring tools dispatch the job and query its status information through the Grid Gateway. The Grid Gateway uses the job submission, monitoring and reservation services from the CSF. Platform MultiCluster also allows multiple clusters using LSF to forward jobs to one another transparently to the end-user.

Other important work: Kertész et al.
(2008) introduced a meta-brokering system in which the meta-broker, invoked through a Web portal, submits jobs, monitors job status and copies output files using brokers from different Grid middleware, such as the NorduGrid Broker and EGEE WMS. Kim and Buyya (2007) tackle the problem of fair-share resource allocation in hierarchical VOs. They provide a model for hierarchical VO environments based on a resource sharing policy, as well as a heuristic solution for fair-share resource allocation in hierarchical VOs.
Inter-Operation of Resource Sharing Networks

This section discusses relevant attempts to enable inter-operation between resource sharing networks.

PlanetLab: PlanetLab (Peterson et al., 2006) is a large-scale testbed that enables the creation of slices, that is, distributed environments based on virtualisation technology. A slice is a set of virtual machines, each running on a unique node. The individual virtual machines that make up a slice contain no information about the other virtual machines in the set and are managed by the service running in the slice. Each service deployed on PlanetLab runs on a slice of PlanetLab's global pool of resources. Multiple slices can run concurrently and each slice is like a network container that isolates services from other containers. The principals in PlanetLab are:

• Owner: organisation that hosts (owns) one or more PlanetLab nodes.
• User: researcher who deploys a service on a set of PlanetLab nodes.
• PlanetLab Consortium (PLC): centralised trusted intermediary that manages nodes on behalf of a group of owners and creates slices on those nodes on behalf of a group of users.
When the PLC acts as a Slice Authority (SA), it maintains the state of the set of system-wide slices for which the PLC is responsible. The SA provides an interface through which users register themselves, create slices, bind users to slices, and request the slice to be instantiated on a set of nodes. The PLC, acting as a Management Authority (MA), maintains a server that installs and updates the software running on the nodes it manages and monitors these nodes for correct behaviour, taking appropriate action when anomalies and failures are detected. The MA maintains a database of registered nodes. Each node is affiliated with an organisation (owner) and is located at a site belonging to the organisation. The MA provides an interface used by node owners to register their nodes with the PLC and allows users and slice authorities to obtain information about the set of nodes managed by the MA. PlanetLab's architecture has evolved to enable decentralised control or federations of PlanetLabs (Peterson et al., 2006). The PLC has been split into two components, namely the MA and the SA, which allows PLC-like entities to evolve these two components independently. Therefore, autonomous organisations can federate and define peering relationships with each other. For example, establishing peering relationships with other infrastructures is one of the goals of PlanetLab Europe (2008). A resource owner may choose an MA to which it wants to provide resources. MAs, in turn, may blacklist particular SAs. An SA may trust only certain MAs to provide it with the virtual machines it needs for its users. This enables various types of agreements between SAs and MAs. It is also important to mention that Ricci et al. (2006) have discussed issues related to the design of a general resource allocation interface that is sufficiently wide for allocators in a large variety of current and future testbeds. An allocator is a component that receives as input the users' abstract description of the required resources and the resource status from a resource discoverer, and produces allocations performed by a deployment service. The goal of an allocator is to allow users to specify characteristics of their slice in high-level terms and find resources to match these requirements. The authors have described their experience in designing PlanetLab and Emulab and, among several important issues, have advocated that:
• In future infrastructures, several allocators may co-exist, and it might be difficult for them to do so without interfering with one another;
• With the current proportional-share philosophy of PlanetLab, where multiple management services can co-exist, allocators do not have guarantees over any resources;
• Thus, co-ordination between the allocators may be required.
Grid Interoperability Now - Community Group (GIN-CG): GIN-CG (2006) has been working on providing interoperability between Grids by developing components and adapters that enable secure and standard job submissions, data transfers, and information queries. These efforts provide the basis for load management across Grids by facilitating standard job submission and request redirection. They also enable secure access to resources and data across Grids. Although GIN-CG's efforts are relevant, its members also highlight the need for common allocation and brokering of resources across Grids.2

InterGrid: Assunção et al. (2008) have proposed an architecture and policies to enable the interoperation of Grids. This set of architecture and policies is termed the InterGrid. The InterGrid is inspired by the peering agreements between Internet Service Providers (ISPs). The Internet is composed of competing ISPs that agree to allow traffic into one another's networks. These agreements between ISPs are commonly termed peering and transit arrangements (Metz, 2001). In the InterGrid, a Resource Provider (RP) contributes a share of computational resources, storage resources, networks, application services or other types of resource to a Grid in return for regular payments. An RP has local users whose resource demands need to be satisfied, yet it delegates provisioning rights over spare resources to an InterGrid Gateway (IGG) by providing information about the available resources in the form of free time slots (Assunção & Buyya, 2008). A free time slot includes information about the number of resources available, their configuration and the time frame over which they will be available. The control over resource shares offered by providers is performed via a container model, in which the resources are used to run virtual machines. Internally, each Grid may have a resource management system organised in a hierarchical manner. However, for the sake of simplicity, experimental results consider that RPs delegate provisioning rights directly to an IGG (Assunção & Buyya, in press). A Grid has pre-defined peering arrangements with other Grids, managed by IGGs, through which they co-ordinate the use of resources of the InterGrid. An IGG is aware of the terms of the peering with other Grids; provides Grid selection capabilities by selecting a suitable Grid able to provide the required resources; and replies to requests from other IGGs. The peering arrangement between two Grids is represented as a contract. Request redirection policies determine which peering Grid is selected to process a request and at what price the processing is performed (Assunção & Buyya, in press).

Other important work: Boghosian et al. (2006) have performed experiments using resources from more than one Grid for three projects, namely Nektar, SPICE and Vortonics. The applications in these three projects require massive numbers of computing resources only achievable through Grids of Grids. Although resources from multiple Grids were used during the experiments, they emphasised that several human interactions and negotiations are required in order to use federated resources. The authors highlighted that even if interoperability at the middleware level existed, it would not guarantee that federated Grids can be utilised for large-scale distributed applications because there are important additional requirements, such as compatible and consistent usage policies, automated advance reservations and co-scheduling. Caromel et al.
(2007) have proposed the use of a P2P network to acquire resources dynamically from a Grid infrastructure (i.e. Grid'5000) and desktop machines in order to run compute-intensive applications.
The communication between the P2P network and Grid'5000 is performed through SSH tunnels. Moreover, the allocation of nodes for the P2P network uses the deployment framework of ProActive by deploying Java Virtual Machines on the allocated nodes. In addition to GIN-CG's efforts, other Grid middleware interoperability approaches have been presented. Wang et al. (2007) have described a gateway approach to achieve interoperability between gLite (2005) (the middleware used in EGEE) and CNGrid GOS (2007) (the middleware of the Chinese National Grid (2007)). The work focuses on job management interoperability, but also describes interoperability between the different protocols used for data management as well as resource information. In the proposed interoperability approach, gLite is viewed as a type of site job manager by GOS, whereas the submission to GOS resources by gLite is implemented in a different manner; an extended job manager is instantiated for each job submitted to a GOS resource. The extended job manager sends the whole batch job to be executed in the CNGrid.
Virtual Organisations

We have also carried out a survey on how projects address different challenges in the VO life-cycle. Two main categories of projects have been identified: facilitators for VOs, which provide means for building clusters of organisations, hence enabling collaboration and the formation of VOs; and enablers for VOs, which provide middleware and tools to help in the formation, management, maintenance and dissolution of VOs. The classification is not strict because a project can fall into both categories, providing software for enabling VOs and working as a consortium that organisations can join to start more dynamic collaborations. We divide our survey into three parts: middleware and software infrastructure for enabling VOs; consortiums and charters that facilitate the formation of VOs; and other relevant work that addresses issues related to a VO's life-cycle.
Enabling Technology

Enabling a VO means providing the required software tools to help in the different phases of the life-cycle of a VO. As we present in this section, due to the complex challenges in the life-cycle, many projects do not address all the phases. We discuss relevant work in this section.

The CONOISE Project: CONOISE (Patel et al., 2005) uses a marketplace (auctions) for the formation of VOs (Norman et al., 2004). The auctions are combinatorial; combinatorial auctions allow a good degree of flexibility so that VO initiators can specify a broad range of requirements. A combinatorial auction allows multiple units of a single item or multiple items to be sold simultaneously. However, combinatorial auctions lack means for bid representation and efficient clearing algorithms to determine prices, quantities and winners. As demonstrated by Dang (2004), clearing combinatorial auctions is an NP-Complete problem. Thus, polynomial and sub-optimal auction clearing algorithms for combinatorial auctions have been proposed (a generic greedy illustration is sketched below). Stakeholders in VOs enabled by CONOISE are called agents. As an example of VO formation, a user may request a service from an agent, which in turn verifies whether it is able to provide the service requested at the time specified. If the agent cannot provide the service, it looks for the Service Providers (SPs) offering the service required. The Requesting Agent (RA) then starts a combinatorial auction and sends a call for bids to SPs. Once the RA receives the bids, it determines the best set of partners and then starts the formation of the VO. Once the VO is formed, the RA becomes the VO manager.
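As noted above, optimal winner determination in a combinatorial auction is NP-complete, so sub-optimal heuristics are used in practice. The sketch below shows one generic greedy clearing strategy; it is not CONOISE's algorithm, and the bid format and scoring rule are assumptions made for illustration.

# Generic greedy clearing of a combinatorial (bundle) auction. Bids are ranked by
# price per item and accepted only if their items do not overlap with items already
# sold, giving a feasible but possibly sub-optimal outcome.

def greedy_clear(bids):
    """bids: list of (bidder, set_of_items, price). Returns the accepted bids."""
    ranked = sorted(bids, key=lambda b: b[2] / len(b[1]), reverse=True)
    sold, winners = set(), []
    for bidder, items, price in ranked:
        if not items & sold:
            winners.append((bidder, items, price))
            sold |= items
    return winners

bids = [
    ("sp1", {"cpu", "storage"}, 12.0),
    ("sp2", {"cpu"}, 8.0),
    ("sp3", {"storage", "network"}, 9.0),
]
print(greedy_clear(bids))  # sp2 and sp3 win; sp1 is rejected because 'cpu' is already sold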
An agent that receives a call for bids has the following options: (a) it can decide not to bid in the auction; (b) it can bid using its own resources; (c) it may bid using resources from an existing collaboration; (d) it may identify the need to start a new VO to provide the extra resources required. Note that calls for bids are recursive. CONOISE uses cumulative scheduling based on a Constraint Satisfaction Program (CSP) to model the decision process of an agent. CONOISE also focuses on the operation and maintenance phases of VOs. Once a VO is formed, it uses principles of coalition formation for distributing tasks amongst the member agents (Patel et al., 2005). An algorithm for coalition structure generation, which is bounded from the optimal, is presented and evaluated (Dang, 2004). Although not very focused on authorisation issues, the CONOISE project also deals with issues regarding trust and reputation in VOs by providing reputation and policing mechanisms to ensure a minimum quality of service.

The TrustCoM Project: TrustCoM (2005) addresses issues related to the establishment of trust throughout the life-cycle of VOs. Its members envision that the establishment of Service Oriented Architectures (SOAs) and dynamic open electronic marketplaces will allow dynamic alliances and VOs among enterprises to respond quickly to market opportunities. The establishment of trust, not only at a resource level but also at a business process level, is hence of importance. In this light, TrustCoM aims to provide a framework for trust, security and contract management to enable on-demand and self-managed dynamic VOs (Dimitrakos, Golby, & Kearley, 2004; Svirskas, Arevas, Wilson, & Matthews, 2005). The framework extends current VO membership services (Svirskas et al., 2005) by providing means to: (i) identify potential VO partners through reputation management; (ii) manage users according to the roles defined in the business process models that VO partners perform; (iii) define and manage the SLA obligations on security and privacy; and (iv) enable the enforcement of policies based on the SLAs and contracts. From a corporate perspective, Sairamesh et al. (2005) provide examples of business models on the enforcement of security policies and VO management. While the goal is to enable dynamic VOs, TrustCoM focuses on the security requirements for the establishment of VOs composed of enterprises. Studies and market analyses to identify the main issues and requirements to build a secure environment in which VOs form and operate have been performed.
Facilitators or Breeding Environments

In order to address the problem of trust between organisations, projects have created federations and consortiums that physical organisations or Grids can join to start VOs based on common interests. We describe the main projects in this field and explain some of the technologies they use.

Open Science Grid (OSG): OSG (2005) can be considered a facilitator for VOs, because the project aims at forming a cluster or consortium of organisations and suggests that they follow a policy that states how collaboration takes place and how a VO is formed. To join the consortium and consequently form a VO, it is necessary to have a minimum infrastructure and preferably use the middleware suggested by OSG. In addition, OSG provides tools to check the status of and monitor existing VOs. OSG facilitates the formation of VOs by providing an open-market-like infrastructure that allows the consortium members to advertise their resources and goals and establish VOs to explore their objectives. The VO concept is used in a recursive manner; VOs may be composed of sub-VOs. For more information we refer to the Blueprint for the OSG (2004). A basic infrastructure must be provided to form a VO, including a VO Membership Service (VOMS) and operation support. The operation support's main goal is to provide technical support services at
the request of a member site. As OSG intends to federate across heterogeneous Grid environments, the resources of the member sites and users are organised in VOs under the contracts that result from negotiations among the sites, which in turn have to follow the consortium's policies. Such contracts are defined at the middleware layer and can be negotiated in an automated fashion; however, thus far there is no easily responsive means to form a VO and the formation requires complex multilateral agreements among the involved sites. OSG middleware uses VOMS to support authorisation services for VO members, hence helping in the maintenance and operation phases. Additionally, for the sake of scalability and ease of administration, the Grid User Management System (GUMS) facilitates the mapping of Grid credentials to site-specific credentials. GUMS and VOMS provide means to facilitate authorisation in the operation and maintenance phases. GridCat provides maps and statistics on running jobs and the storage capacity of the member sites. This information can guide schedulers and brokers on job submission and in turn facilitate the operation phase. Additionally, MonALISA (MONitoring Agents using a Large Integrated Services Architecture) (Legrand et al., 2004) has been utilised to monitor computational nodes, applications and network performance of the VOs within the consortium.

Enabling Grids for E-sciencE (EGEE): Similarly to OSG, EGEE (2005) federates resource centres to enable a global infrastructure for researchers. EGEE's resource centres are hierarchically organised: an Operations Manager Centre (OMC) located at CERN, Regional Operations Centres (ROC) located in different countries, Core Infrastructure Centres (CIC) and Resource Centres (RC) responsible for providing resources to the Grid. A ROC carries out activities such as supporting deployment and operations, negotiating SLAs within its region and organising certification authorities. CICs are in charge of providing VO-services, such as maintaining VO-Servers and registration; VO-specific services such as databases, resource brokers and user interfaces; and other activities such as accounting and resource usage. The OMC interfaces with international Grid efforts. It is also responsible for activities such as approving connections with new RCs, promoting cross-trust among CAs, and enabling cooperation and agreements with user communities, VOs and existing national and regional infrastructures. To join EGEE, in addition to the installation of the Grid middleware, there is a need for a formal request and further assessment by special committees. Once the application is considered suitable for EGEE, a VO will be formed. Accounting is based on the use of resources by members of the VO. EGEE currently utilises LCG-2/gLite (2005).
Other Important Work

Resource allocation in a VO depends on, and is driven by, many conditions and rules: the VO can be formed by physical organisations under different, sometimes conflicting, resource usage policies. Participating organisations provide their resources to the VO, which can be defined in terms of SLAs, and agree to enforce VO-level policies defining who has access to the resources in the VO. Different models can be adopted for the negotiation and enforcement of SLAs. One model relies on a trusted VO manager. Resource providers supply resources to the VO according to SLAs established with the VO manager. The VO manager in turn assigns resource quotas to VO groups and users based on a commonly agreed VO-level policy. In contrast, a VO can follow a democratic or P2P sharing approach, in which "you give what you can and get what others can offer" or "you get what you give" (Wasson & Humphrey, 2003). Elmroth and Gardfjäll (2005) presented an approach for enabling Grid-wide fair-share scheduling. The work introduces a scheduling framework that enforces fair-share policies on a Grid-wide scale. The
policies are hierarchical in the sense that they can be subdivided recursively to form a tree of shares. Although the policies are hierarchical, they are enforced in a flat and decentralised manner. In the proposed framework, resources have local policies and divide the available resources among given VOs. These local policies have references to the VO-level policies. Although the proposed framework and algorithm do not require a centralised scheduler, they may impose some overhead for locally caching global usage information.
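A tree of shares of the kind described above can be flattened into global target shares by multiplying the fractions along each path, and a group's scheduling priority can then be derived from how far its actual usage lies below its target. The sketch below is a minimal illustration under these assumptions; the structure and names are hypothetical and do not reproduce the published framework.

# Sketch of a recursively subdivided share tree: each level splits its parent's
# share, and leaves obtain a global target share by multiplying fractions along
# the path. Names and structure are illustrative assumptions.

def flatten_shares(node, parent_share=1.0, prefix=""):
    """node: dict mapping child name -> (fraction_of_parent, child_dict or None).
    Returns {path: global target share} for the leaves."""
    targets = {}
    for name, (fraction, children) in node.items():
        path = f"{prefix}/{name}"
        share = parent_share * fraction
        if children:
            targets.update(flatten_shares(children, share, path))
        else:
            targets[path] = share
    return targets

def fair_share_priority(target, used, total_used):
    """Groups that have consumed less than their target share get higher priority."""
    actual = used / total_used if total_used else 0.0
    return target - actual

policy = {
    "vo-physics": (0.6, {"simulation": (0.5, None), "analysis": (0.5, None)}),
    "vo-bio": (0.4, None),
}
print(flatten_shares(policy))
# {'/vo-physics/simulation': 0.3, '/vo-physics/analysis': 0.3, '/vo-bio': 0.4}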
MAPPING OF SURVEYED WORK AGAINST THE TAXONOMIES

This section presents the mapping of the surveyed projects against the proposed taxonomies. For simplicity, only selected work from those surveyed is included in the tables presented in this section. Table 1 classifies existing work according to its architecture and operational model. Gridbus Broker, GridWay, and the SNAP-based community resource broker are resource brokers that act on behalf of users to submit jobs to Grid resources to which they have access. They follow the operational model based on job routing. Although GridWay provides means for the deployment of virtual machines, this deployment takes place on a job basis (Rubio-Montero et al., 2007). DI-GRUBER, VioCluster, Condor flocking and CSF have a distributed-scheduler architecture in which brokers or meta-schedulers have bilateral sharing agreements between them (Table 2). OurGrid and the Self-organising flock of Condors utilise P2P networks of brokers or schedulers, whereas Grid federation uses a P2P network to build a shared space utilised by providers and users to post resource claims and requests respectively (Table 2). VioCluster and Shirako enable the creation of virtualised environments in which job routing or job pulling based systems can be deployed. However, in these last two systems, resources are controlled at the level of containment or virtual machines. Table 2 summarises the communication models and sharing mechanisms utilised by distributed-scheduler based systems. Shirako uses transitive agreements in which brokers can exchange claims of resources issued by site authorities who represent the resource providers. It allows brokers to delegate access to resources multiple times. The resource control techniques employed by the surveyed systems are summarised in Table 3. As described beforehand, VioCluster and Shirako use containment-based resource control, whereas the remaining systems utilise the job model. EGEE WMS and DI-GRUBER take into account the scheduling of jobs according to the VOs to which users belong and the shares contributed by resource providers. The other systems can be utilised to form a single VO wherein jobs can be controlled on a user basis. The support of the various works for the VO life-cycle phases is depicted in Table 4. We select a subset of the surveyed work, particularly the work that focuses on VO-related issues such as their formation and operation. DI-GRUBER and gLite schedule jobs by considering the resource shares of multiple VOs. EGEE and OSG also work as facilitators of VOs by providing consortiums that organisations can join to start VOs (Table 5). However, the process is not automated and requires the establishment of contracts between the consortium and the physical resource providers. Shirako enables the creation of virtualised environments spanning multiple providers, which can be used for hosting multiple VOs (Ramakrishnan et al., 2006). The systems' characteristics and the VOs they enable are summarised in Table 5. Conoise and Akogrimo allow the formation of dynamic VOs in which the VO can be started by a user utilising a mobile device.
Table 1. GRMSs according to architectures and operational models

System | Architecture | Operational Model
SGE and PBS | Independent clusters | Job routing
Condor-G | Independent clusters* | Job routing
Gridbus Broker | Resource Broker | Job routing
GridWay | Resource Broker | Job routing**
SNAP-Based Community Resource Broker | Resource Broker | Job routing
EGEE WMS | Centralised | Job routing
KOALA | Centralised | Job routing
PlanetLab | Centralised | N/A***
Computing Center Software (CCS) | Hierarchical | Job routing
GRUBER/DI-GRUBER | Distributed/static | Job routing
VioCluster | Distributed/static | N/A***
Condor flocking | Distributed/static | Matchmaking
Community Scheduler Framework | Distributed/static | Job routing
OurGrid | Distributed/dynamic | Job routing
Self-organising flock of Condors | Distributed/dynamic | Matchmaking
Grid federation | Distributed/dynamic | Job routing
Askalon | Distributed/dynamic | Job routing
SHARP/Shirako | Distributed/dynamic | N/A***
Delegated Matchmaking | Hybrid | Matchmaking

* Condor-G provides software that can be used to build meta-schedulers.
** GridWay also manages the deployment of virtual machines.
*** PlanetLab, VioCluster and Shirako use resource control at the containment level, even though they enable the creation of virtual execution environments on which systems based on job routing can be deployed.
Table 2. Classification of GRMSs according to their sharing arrangements

System | Communication Pattern | Sharing Mechanism
GRUBER/DI-GRUBER | Bilateral agreements | System centric
VioCluster | Bilateral agreements | Site centric
Condor flocking | Bilateral agreements | Site centric
OurGrid | P2P network | System centric
Self-organising flock of Condors | P2P network | Site centric
Grid federation | Shared space | Site centric
Askalon | Bilateral agreements | Site centric
SHARP/Shirako | Transitive agreements | Self-interest
Delegated MatchMaking | Bilateral agreements | Site centric
Table 3. Classification of GRMSs according to their support for VOs and resource control

System | Support for VOs | Resource Control
EGEE WMS | Multiple VO | Job model
KOALA | Single VO | Job model
GRUBER/DI-GRUBER | Multiple VO | Job model
VioCluster | Single VO | Container model/multiple site*
Condor flocking | Single VO | Job model
OurGrid | Single VO | Job model
Self-organising flock of Condors | Single VO | Job model
Grid federation | Single VO | Job model
Askalon | Single VO | Job model
SHARP/Shirako | Multiple VO** | Container model/multiple site***
Delegated MatchMaking | Single VO | Job model

* VioCluster supports containment at both single site and multiple site levels.
** Shirako enables the creation of multiple containers that can in turn be used by multiple VOs, even though it does not handle issues on job scheduling amongst multiple VOs.
*** Shirako supports containment at both (i) the single site level, through Cluster on Demand, and (ii) the multiple-site level. Shirako also explores resource control at the job level by providing recommendations on the site in which jobs should be executed.
Table 4. Support to the phases of the VO's life-cycle by the projects analysed

Project Name | Creation | Operation | Maintenance | Dissolution | Support for short-term collaborations
OSG* | Partial | Partial | Not available | Not available | Not available
EGEE/gLite* | Partial | Available | Not available | Not available | Not available
CONOISE | Available | Available | Available | Not available | Available
TrustCoM | Mainly related to security issues | Mainly related to security issues | Not available | Not available | Not available
DI-GRUBER | Not available | Available | Partial** | Not available | Not available
Akogrimo*** | Partial | Partial | Partial | Partial | Partial
Shirako | Not available | Available | Available | Not available | Not available

* OSG and EGEE work as consortiums enabling trust among organisations and facilitating the formation of VOs. They also provide tools for monitoring the status of resources and job submissions. EGEE's WMS performs the scheduling taking into account multiple VOs.
** DI-GRUBER's policy decision points allow for the re-adjustment of the VOs according to the current resource shares offered by providers and the status of the Grid.
*** Akogrimo aims at enabling collaboration between doctors upon the patient's request or in case of a health emergency.
Table 5. Mapping of the systems against the proposed VO taxonomies

System | Dynamism | Goal Orientation | Duration | Control | Policy Enforcement | Facilitators
Conoise* | Dynamic/Hybrid | Targeted | Medium-lived | Decentralised | Democratic | N/A
TrustCoM** | Static | Targeted | Long-lived | N/A | N/A | N/A
GRUBER/DI-GRUBER | Static | Targeted | Long-lived | Decentralised | Decentralised*** | N/A
gLite/EGEE | Static | Targeted | Long-lived | Centralised | Centralised | Centralised+
Open Science Grid | Static | Targeted | Long-lived | Hierarchical | Centralised | Market-like
Akogrimo | Dynamic/Hybrid | Targeted | Short or Medium-lived | Decentralised | Democratic | N/A
Shirako | Dynamic | Non-targeted | Medium-lived | Decentralised | Democratic | N/A

* Conoise and Akogrimo allow a client using a mobile device to start a VO, thus the VO can comprise fixed and mobile resources.
** TrustCoM deals with security issues and does not provide tools for the management and policy enforcement in VOs.
*** DI-GRUBER uses a network of decision points to guide submitting hosts and schedulers about which resources can execute the jobs.
+ The EGEE Workload Management System is aware of the VOs and schedules jobs according to the VOs in the system.
The virtual environments enabled by Shirako can be adapted by leasing additional resources or terminating leases according to the demands of the virtual organisation it is hosting (Ramakrishnan et al., 2006). Resource providers in Shirako may offer their resources in return for economic compensation, meaning that the resource providers may not have a common target in solving a particular resource challenge. This makes the VOs non-targeted.
FUTURE TRENDS

Over the last decade, the distributed computing realm has been characterised by the deployment of large-scale Grids such as EGEE and TeraGrid. Such Grids have provided the research community with an unprecedented number of resources, which have been used for a wide range of scientific research. However, the hardware and software heterogeneity of the resources provided by the organisations within a Grid has increased the complexity of deploying applications in these environments. Recently, application deployment has been facilitated by the intensifying use of virtualisation technologies. The increasing ubiquity of virtual machine technologies has enabled the creation of customised environments atop a physical infrastructure and the emergence of new business models such as virtualised data centres and cloud computing. The use of virtual machines brings several benefits, such as server consolidation, the ability to create VMs to run legacy code without interfering with other applications' APIs, improved security through the creation of application sandboxes, dynamic provisioning of virtual machines to services, and performance isolation.
Existing virtual-machine based resource management systems can manage a cluster of computers within a site, allowing the creation of virtual workspaces (Keahey et al., 2006) or virtual clusters (Foster et al., 2006; Montero et al., 2008; Chase et al., 2003). They can bind resources to virtual clusters or workspaces according to a customer's demand. These resource managers allow the user to create customised virtual clusters using shares of the physical machines available at the site. In addition, current data centres are using virtualisation technology to provide users with the look and feel of tapping into a dedicated computing and storage infrastructure for which they are charged a fee based on usage (e.g. Amazon Elastic Compute Cloud3 and 3Tera4). These factors are resulting in the creation of virtual execution environments or slices that span both commercial and academic computing sites. Virtualisation technologies minimise many of the concerns that previously prevented the peering of resource sharing networks, such as the execution of unknown applications and the lack of guarantees over resource control. For the resource provider, substantial work is being carried out on the provisioning of resources to services and user applications. Techniques such as workload forecasts along with resource overbooking can reduce the need for over-provisioning a computing infrastructure. Users can benefit from the improved reliability, the performance isolation, and the environment isolation offered by virtualisation technologies. We are likely to see an increase in the number of virtual organisations enabled by virtual machines, thus allocating resources from both commercial data centres and research testbeds. We suggest that emerging applications will require the prompt formation of VOs, which are also quickly responsive and automated. VOs can have dynamic resource demands, which are quickly responded to by data centres relying on virtualisation technologies. There can also be an increase in business workflows relying on globally available messaging-based systems for process synchronisation5. Our current research focuses on connecting computing sites managed by virtualisation technologies to create distributed virtual environments that are used by user applications.
CONCLUSION

This chapter presents classifications and a survey of systems that can provide means for inter-operating resource sharing networks. It also provides taxonomies on Virtual Organisations (VOs) with a focus on Grid computing practices. Hence, we initially discussed the challenges in VOs and presented a background on the life-cycle of VOs and on resource sharing networks. This chapter suggests that future applications will require the prompt formation of VOs, which are also quickly responsive and automated. This may be enabled by virtualisation technology and corroborates the current trends towards multiple-site containers or virtual workspaces. Relevant work and technology in the area were presented and discussed.
ACKNOWLEDGMENT

We thank Marco Netto, Alexandre di Costanzo and Chee Shin Yeo for sharing their thoughts on the topic and helping to improve the structure of this chapter. We are grateful to Mukaddim Pathan for proofreading a preliminary version of this chapter. This work is supported by research grants from the Australian Research Council (ARC) and the Australian Department of Innovation, Industry, Science and Research (DIISR). Marcos' PhD research is partially supported by NICTA.
REFERENCES A Blueprint for the Open Science Grids. (2004, December). Snapshot v0.9. Adabala, S., Chadha, V., Chawla, P., Figueiredo, R., Fortes, J., & Krsul, I. (2005, June). From virtualized resources to virtual computing Grids: the In-VIGO system. Future Generation Computer Systems, 21(6), 896–909. doi:10.1016/j.future.2003.12.021 Andrade, N., Brasileiro, F., Cirne, W., & Mowbray, M. (2007). Automatic Grid assembly by promoting collaboration in peer-to-peer Grids. Journal of Parallel and Distributed Computing, 67(8), 957–966. doi:10.1016/j.jpdc.2007.04.011 Andrade, N., Cirne, W., Brasileiro, F., & Roisenberg, P. (2003). OurGrid: An approach to easily assemble Grids with equitable resource sharing. In 9th Workshop on Job Scheduling Strategies for Parallel Processing (Vol. 2862, pp. 61–86). Berlin/Heidelberg: Springer. Australian Partnership for Advanced Computing (APAC) Grid. (2005). Retrieved from http://www.apac. edu.au/programs/GRID/index.html. Balazinska, M., Balakrishnan, H., & Stonebraker, M. (2004, March). Contract-based load management in federated distributed systems. In 1st Symposium on Networked Systems Design and Implementation (NSDI) (pp. 197-210). San Francisco: USENIX Association. Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., et al. (2003). Xen and the art of virtualization. In 19th ACM Symposium on Operating Systems Principles (SOSP ’03) (pp. 164–177). New York: ACM Press. Boghosian, B., Coveney, P., Dong, S., Finn, L., Jha, S., Karniadakis, G. E., et al. (2006, June). Nektar, SPICE and vortonics: Using federated Grids for large scale scientific applications. In IEEE Workshop on Challenges of Large Applications in Distributed Environments (CLADE). Paris: IEEE Computing Society. Brune, M., Gehring, J., Keller, A., & Reinefeld, A. (1999). Managing clusters of geographically distributed high-performance computers. Concurrency (Chichester, England), 11(15), 887–911. doi:10.1002/ (SICI)1096-9128(19991225)11:15<887::AID-CPE459>3.0.CO;2-J Bulhões, P. T., Byun, C., Castrapel, R., & Hassaine, O. (2004, May). N1 Grid Engine 6 Features and Capabilities [White Paper]. Phoenix, AZ: Sun Microsystems. Butt, A. R., Zhang, R., & Hu, Y. C. (2003). A self-organizing flock of condors. In 2003 ACM/IEEE Conference on Supercomputing (SC 2003) (p. 42). Washington, DC: IEEE Computer Society. Buyya, R., Abramson, D., & Giddy, J. (2000, June). An economy driven resource management architecture for global computational power grids. In 7th International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2000). Las Vegas, AZ: CSREA Press. Caromel, D., di Costanzo, A., & Mathieu, C. (2007). Peer-to-peer for computational Grids: Mixing clusters and desktop machines. Parallel Computing, 33(4–5), 275–288. doi:10.1016/j.parco.2007.02.011
Catlett, C., Beckman, P., Skow, D., & Foster, I. (2006, May). Creating and operating national-scale cyberinfrastructure services. Cyberinfrastructure Technology Watch Quarterly, 2(2), 2–10. Chase, J. S., Irwin, D. E., Grit, L. E., Moore, J. D., & Sprenkle, S. E. (2003). Dynamic virtual clusters in a Grid site manager. In 12th IEEE International Symposium on High Performance Distributed Computing (HPDC 2003) (p. 90). Washington, DC: IEEE Computer Society. Chinese National Grid (CNGrid) Project Web Site. (2007). Retrieved from http://www.cngrid.org/ CNGrid GOS Project Web site. (2007). Retrieved from http://vega.ict.ac.cn Dang, V. D. (2004). Coalition Formation and Operation in Virtual Organisations. PhD thesis, Faculty of Engineering, Science and Mathematics, School of Electronics and Computer Science, University of Southampton, Southampton, UK. de Assunção, M. D., & Buyya, R. (2008, December). Performance analysis of multiple site resource provisioning: Effects of the precision of availability information [Technical Report]. In International Conference on High Performance Computing (HiPC 2008) (Vol. 5374, pp. 157–168). Berlin/Heidelberg: Springer. de Assunção, M. D., & Buyya, R. (in press). Performance analysis of allocation policies for interGrid resource provisioning. Information and Software Technology. de Assunção, M. D., Buyya, R., & Venugopal, S. (2008, June). InterGrid: A case for internetworking islands of Grids. [CCPE]. Concurrency and Computation, 20(8), 997–1024. doi:10.1002/cpe.1249 Dimitrakos, T., Golby, D., & Kearley, P. (2004, October). Towards a trust and contract management framework for dynamic virtual organisations. In eChallenges. Vienna, Austria. Dixon, C., Bragin, T., Krishnamurthy, A., & Anderson, T. (2006, September). Tit-for-Tat Distributed Resource Allocation [Poster]. The ACM SIGCOMM 2006 Conference. Dumitrescu, C., & Foster, I. (2004). Usage policy-based CPU sharing in virtual organizations. In 5th IEEE/ACM International Workshop on Grid Computing (Grid 2004) (pp. 53–60). Washington, DC: IEEE Computer Society. Dumitrescu, C., & Foster, I. (2005, August). GRUBER: A Grid resource usage SLA broker. In J. C. Cunha & P. D. Medeiros (Eds.), Euro-Par 2005 (Vol. 3648, pp. 465–474). Berlin/Heidelberg: Springer. Dumitrescu, C., Raicu, I., & Foster, I. (2005). DI-GRUBER: A distributed approach to Grid resource brokering. In 2005 ACM/IEEE Conference on Supercomputing (SC 2005) (p. 38). Washington, DC: IEEE Computer Society. Dumitrescu, C., Wilde, M., & Foster, I. (2005, June). A model for usage policy-based resource allocation in Grids. In 6th IEEE International Workshop on Policies for Distributed Systems and Networks (pp. 191–200). Washington, DC: IEEE Computer Society. Elmroth, E., & Gardfjäll, P. (2005, December). Design and evaluation of a decentralized system for Grid-wide fairshare scheduling. In 1st IEEE International Conference on e-Science and Grid Computing (pp. 221–229). Melbourne, Australia: IEEE Computer Society Press.
Enabling Grids for E-sciencE (EGEE) project. (2005). Retrieved from http://public.eu-egee.org. Epema, D. H. J., Livny, M., van Dantzig, R., Evers, X., & Pruyne, J. (1996). A worldwide flock of condors: Load sharing among workstation clusters. Future Generation Computer Systems, 12(1), 53–65. doi:10.1016/0167-739X(95)00035-Q Fontán, J., Vázquez, T., Gonzalez, L., Montero, R. S., & Llorente, I. M. (2008, May). OpenNEbula: The open source virtual machine manager for cluster computing. In Open Source Grid and Cluster Software Conference – Book of Abstracts. San Francisco. Foster, I., Freeman, T., Keahey, K., Scheftner, D., Sotomayor, B., & Zhang, X. (2006, May). Virtual clusters for Grid communities. In 6th IEEE International Symposium on Cluster Computing and the Grid (CCGRID 2006) (pp. 513–520). Washington, DC: IEEE Computer Society. Foster, I., & Kesselman, C. (1997, Summer). Globus: A metacomputing infrastructure toolkit. The International Journal of Supercomputer Applications, 11(2), 115–128. Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the Grid: Enabling scalable virtual organizations. The International Journal of Supercomputer Applications, 15(3), 200–222. Frey, J., Tannenbaum, T., Livny, M., Foster, I. T., & Tuecke, S. (2001, August). Condor-G: A computation management agent for multi-institutional Grids. In 10th IEEE International Symposium on High Performance Distributed Computing (HPDC 2001) (pp. 55–63). San Francisco: IEEE Computer Society. Fu, Y., Chase, J., Chun, B., Schwab, S., & Vahdat, A. (2003). SHARP: An architecture for secure resource peering. In 19th ACM Symposium on Operating Systems Principles (SOSP 2003) (pp. 133–148). New York: ACM Press. gLite - Lightweight Middleware for Grid Computing. (2005). Retrieved from http://glite.web.cern.ch/ glite. Graupner, S., Kotov, V., Andrzejak, A., & Trinks, H. (2002, August). Control Architecture for Service Grids in a Federation of Utility Data Centers (Technical Report No. HPL-2002-235). Palo Alto, CA: HP Laboratories Palo Alto. Grid Interoperability Now Community Group (GIN-CG). (2006). Retrieved from http://forge.ogf.org/ sf/projects/gin. Grimme, C., Lepping, J., & Papaspyrou, A. (2008, April). Prospects of collaboration between compute providers by means of job interchange. In Job Scheduling Strategies for Parallel Processing (Vol. 4942, p. 132-151). Berlin / Heidelberg: Springer. Grit, L. E. (2005, October). Broker Architectures for Service-Oriented Systems [Technical Report]. Durham, NC: Department of Computer Science, Duke University. Grit, L. E. (2007). Extensible Resource Management for Networked Virtual Computing. PhD thesis, Department of Computer Science, Duke University, Durham, NC. (Adviser: Jeffrey S. Chase)
546
Architectural Elements of Resource Sharing Networks
Haji, M. H., Gourlay, I., Djemame, K., & Dew, P. M. (2005). A SNAP-based community resource broker using a three-phase commit protocol: A performance study. The Computer Journal, 48(3), 333–346. doi:10.1093/comjnl/bxh088 Hey, T., & Trefethen, A. E. (2002). The UK e-science core programme and the Grid. Future Generation Computer Systems, 18(8), 1017–1031. doi:10.1016/S0167-739X(02)00082-1 Huang, R., Casanova, H., & Chien, A. A. (2006, April). Using virtual Grids to simplify application scheduling. In 20th International Parallel and Distributed Processing Symposium (IPDPS 2006). Rhodes Island, Greece: IEEE. Huedo, E., Montero, R. S., & Llorente, I. M. (2004). A framework for adaptive execution in Grids. Software, Practice & Experience, 34(7), 631–651. doi:10.1002/spe.584 Iosup, A., Epema, D. H. J., Tannenbaum, T., Farrellee, M., & Livny, M. (2007, November). Inter-operating Grids through delegated matchmaking. In 2007 ACM/IEEE Conference on Supercomputing (SC 2007) (pp. 1–12). New York: ACM Press. Irwin, D., Chase, J., Grit, L., Yumerefendi, A., Becker, D., & Yocum, K. G. (2006, June). Sharing networked resources with brokered leases. In USENIX Annual Technical Conference (pp. 199–212). Berkeley, CA: USENIX Association. Katzy, B., Zhang, C., & Löh, H. (2005). Virtual organizations: Systems and practices. In L. M. Camarinha-Matos, H. Afsarmanesh, & M. Ollus (Eds.), (p. 45-58). New York: Springer Science+Business Media, Inc. Keahey, K., Foster, I., Freeman, T., & Zhang, X. (2006). Virtual workspaces: Achieving quality of service and quality of life in the Grids. Science Progress, 13(4), 265–275. Kertész, A., Farkas, Z., Kacsuk, P., & Kiss, T. (2008, April). Grid enabled remote instrumentation. In F. Davoli, N. Meyer, R. Pugliese, & S. Zappatore (Eds.), 2nd International Workshop on Distributed Cooperative Laboratories: Instrumenting the Grid (INGRID 2007) (pp. 303–312). New York: Springer US. Kim, K. H., & Buyya, R. (2007, September). Fair resource sharing in hierarchical virtual organizations for global Grids. In 8th IEEE/ACM International Conference on Grid Computing (Grid 2007) (pp. 50–57). Austin, TX: IEEE. Legrand, I., Newman, H., Voicu, R., Cirstoiu, C., Grigoras, C., Toarta, M., et al. (2004, SeptemberOctober). Monalisa: An agent based, dynamic service system to monitor, control and optimize Grid based applications. In Computing in High Energy and Nuclear Physics (CHEP), Interlaken, Switzerland. Litzkow, M. J., Livny, M., & Mutka, M. W. (1988, June). Condor – a hunter of idle workstations. In 8th International Conference of Distributed Computing Systems (pp. 104–111). San Jose, CA: Computer Society. Metz, C. (2001). Interconnecting ISP networks. IEEE Internet Computing, 5(2), 74–80. doi:10.1109/4236.914650 Mohamed, H., & Epema, D. (in press). KOALA: A co-allocating Grid scheduler. Concurrency and Computation.
547
Architectural Elements of Resource Sharing Networks
Montero, R. S., Huedo, E., & Llorente, I. M. (2008, September/October). Dynamic deployment of custom execution environments in Grids. In 2nd International Conference on Advanced Engineering Computing and Applications in Sciences (ADVCOMP ’08) (pp. 33–38). Valencia, Spain: IEEE Computer Society. National e-Science Centre. (2005). Retrieved from http://www.nesc.ac.uk. Norman, T. J., Preece, A., Chalmers, S., Jennings, N. R., Luck, M., & Dang, V. D. (2004). Agentbased formation of virtual organisations. Knowledge-Based Systems, 17, 103–111. doi:10.1016/j. knosys.2004.03.005 Open Science Grid. (2005). Retrieved from http://www.opensciencegrid.org Open Source Metascheduling for Virtual Organizations with the Community Scheduler Framework (CSF) (Tech. Rep.) (2003, August). Ontario, Canada: Platform Computing. OpenPBS. The portable batch system software. (2005). Veridian Systems, Inc., Mountain View, CA. Retrieved from http://www.openpbs.org/scheduler.html Padala, P., Shin, K. G., Zhu, X., Uysal, M., Wang, Z., Singhal, S., et al. (2007, March). Adaptive control of virtualized resources in utility computing environments. In 2007 Conference on EuroSys (EuroSys 2007) (pp. 289-302). Lisbon, Portugal: ACM Press. Patel, J., Teacy, L. W. T., Jennings, N. R., Luck, M., Chalmers, S., & Oren, N. (2005). Agent-based virtual organisations for the Grids. International Journal of Multi-Agent and Grid Systems, 1(4), 237–249. Peterson, L., Muir, S., Roscoe, T., & Klingaman, A. (2006, May). PlanetLab Architecture: An Overview (Tech. Rep. No. PDN-06-031). Princeton, NJ: PlanetLab Consortium. PlanetLab Europe. (2008). Retrieved from http://www.planet-lab.eu/. Ramakrishnan, L., Irwin, D., Grit, L., Yumerefendi, A., Iamnitchi, A., & Chase, J. (2006). Toward a doctrine of containment: Grid hosting with adaptive resource control. In 2006 ACM/IEEE Conference on Supercomputing (SC 2006) (p. 101). New York: ACM Press. Ranjan, R., Buyya, R., & Harwood, A. (2005, September). A case for cooperative and incentive-based coupling of distributed clusters. In 7th IEEE International Conference on Cluster Computing. Boston, MA: IEEE CS Press. Ranjan, R., Harwood, A., & Buyya, R. (2006, September). SLA-based coordinated superscheduling scheme for computational Grids. In IEEE International Conference on Cluster Computing (Cluster 2006) (pp. 1–8). Barcelona, Spain: IEEE. Ranjan, R., Rahman, M., & Buyya, R. (2008, May). A decentralized and cooperative workflow scheduling algorithm. In 8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID 2008). Lyon, France: IEEE Computer Society. Ricci, R., Oppenheimer, D., Lepreau, J., & Vahdat, A. (2006, January). Lessons from resource allocators for large-scale multiuser testbeds. SIGOPS Operating Systems Review, 40(1), 25–32. doi:10.1145/1113361.1113369
548
Architectural Elements of Resource Sharing Networks
Rubio-Montero, A., Huedo, E., Montero, R., & Llorente, I. (2007, March). Management of virtual machines on globus Grids using GridWay. In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007) (pp. 1–7). Long Beach, USA: IEEE Computer Society. Ruth, P., Jiang, X., Xu, D., & Goasguen, S. (2005, May). Virtual distributed environments in a shared infrastructure. IEEE Computer, 38(5), 63–69. Ruth, P., McGachey, P., & Xu, D. (2005, September). VioCluster: Virtualization for dynamic computational domain. In IEEE International on Cluster Computing (Cluster 2005) (pp. 1–10). Burlington, MA: IEEE. Ruth, P., Rhee, J., Xu, D., Kennell, R., & Goasguen, S. (2006, June). Autonomic live adaptation of virtual computational environments in a multi-domain infrastructure. In 3rd IEEE International Conference on Autonomic Computing (ICAC 2006) (pp. 5-14). Dublin, Ireland: IEEE. Sairamesh, J., Stanbridge, P., Ausio, J., Keser, C., & Karabulut, Y. (2005, March). Business Models for Virtual Organization Management and Interoperability (Deliverable A - WP8&15 WP - Business & Economic Models No. V.1.5). Deliverable document 01945 prepared for TrustCom and the European Commission. Schwiegelshohn, U., & Yahyapour, R. (1999). Resource allocation and scheduling in metasystems. In 7th International Conference on High-Performance Computing and Networking (HPCN Europe ’99) (pp. 851–860). London, UK: Springer-Verlag. Shoykhet, A., Lange, J., & Dinda, P. (2004, July). Virtuoso: A System For Virtual Machine Marketplaces [Technical Report No. NWU-CS-04-39]. Evanston/Chicago: Electrical Engineering and Computer Science Department, Northwestern University. Siddiqui, M., Villazón, A., & Fahringer, T. (2006). Grid capacity planning with negotiation-based advance reservation for optimized QoS. In 2006 ACM/IEEE Conference on Supercomputing (SC 2006) (pp. 21–21). New York: ACM. Smarr, L., & Catlett, C. E. (1992, June). Metacomputing. Communications of the ACM, 35(6), 44–52. doi:10.1145/129888.129890 Svirskas, A., Arevas, A., Wilson, M., & Matthews, B. (2005, October). Secure and trusted virtual organization management. ERCIM News (63). The TrustCoM Project. (2005). Retrieved from http://www.eu-trustcom.com. Vázquez-Poletti, J. L., Huedo, E., Montero, R. S., & Llorente, I. M. (2007). A comparison between two grid scheduling philosophies: EGEE WMS and Grid Way. Multiagent and Grid Systems, 3(4), 429–439. Venugopal, S., Nadiminti, K., Gibbins, H., & Buyya, R. (2008). Designing a resource broker for heterogeneous Grids. Software, Practice & Experience, 38(8), 793–825. doi:10.1002/spe.849 Wang, Y., Scardaci, D., Yan, B., & Huang, Y. (2007). Interconnect EGEE and CNGRID e-infrastructures through interoperability between gLite and GOS middlewares. In International Grid Interoperability and Interoperation Workshop (IGIIW 2007) with e-Science 2007 (pp. 553–560). Bangalore, India: IEEE Computer Society.
549
Architectural Elements of Resource Sharing Networks
Wasson, G., & Humphrey, M. (2003). Policy and enforcement in virtual organizations. In 4th International Workshop on Grid Computing (pp. 125–132). Washington, DC: IEEE Computer Society. Wesner, S., Dimitrakos, T., & Jeffrey, K. (2004, October). Akogrimo - the Grid goes mobile. ERCIM News, (59), 32-33.
ENDNOTES 1 2
3 4 5
550
http://www.vmware.com/ The personal communication amongst GIN-CG members is online at: http://www.ogf.org/pipermail/ gin-ops/2007-July/000142.html http://aws.amazon.com/ec2/ http://www.3tera.com/ http://aws.amazon.com/sqs/
Section 6
Optimization Techniques
Chapter 24
Simultaneous MultiThreading Microarchitecture

Chen Liu, Florida International University, USA
Xiaobin Li, Intel® Corporation, USA
Shaoshan Liu, University of California, Irvine, USA
Jean-Luc Gaudiot, University of California, Irvine, USA
ABSTRACT

Due to the conventional sequential programming model, the Instruction-Level Parallelism (ILP) that modern superscalar processors can exploit is inherently limited. Hence, multithreading architectures have been proposed to exploit Thread-Level Parallelism (TLP) in addition to conventional ILP. By issuing and executing instructions from multiple threads at each clock cycle, Simultaneous MultiThreading (SMT) achieves some of the best possible system resource utilization and, accordingly, higher instruction throughput. In this chapter, the authors describe the origin of the SMT microarchitecture and compare it with other multithreading microarchitectures. They identify several key aspects of high-performance SMT design: fetch policy, handling of long-latency instructions, resource sharing control, and synchronization and communication. They also describe some potential benefits of the SMT microarchitecture: SMT for fault-tolerance and SMT for secure communications. Given the need to support sequential legacy code and the emergence of new parallel programming models, the authors believe the SMT microarchitecture will play a vital role as we enter the multi-thread, multi-/many-core processor design era.
INTRODUCTION

Ever since the first integrated circuits (ICs) were independently invented by Jack Kilby (Nobel Prize Laureate in Physics in 2000) of Texas Instruments and Robert Noyce (co-founder of Intel®) around 50 years ago, we have witnessed exponential growth across the whole semiconductor industry.
Figure 1. Moore’s Law: Transistor count increase
Moore's Law and Memory Wall

The semiconductor industry has been driven by Moore's law (Moore, 1965) for about 40 years, with continuing advancements in VLSI technology. Moore's law states that the number of transistors on a single chip doubles every TWO years, as shown in Figure 1, which is based on data from both Intel® and AMD. A corollary of Moore's law states that the feature size of chip manufacturing technology keeps decreasing, dropping by roughly one half approximately every FIVE years (i.e., by about a quarter every two years), based on our observation shown in Figure 2. As the number of transistors on a chip grows exponentially, we have reached the point where more than one billion transistors can be placed on a single chip. For example, the Dual-Core Itanium® 2 from Intel® integrates more than 1.7 billion transistors (Intel, 2006). How to efficiently utilize this huge transistor budget is a challenging task that has recently preoccupied many researchers and system architects from both academia and industry.

Processor and memory integration technologies both follow Moore's law. Memory latency, however, is increasing drastically relative to processor speed. This is often referred to as the "Memory Wall" problem (Hennessy, 2006). Indeed, Figure 3 shows that CPU performance increases at an average rate of 55% per year, while memory performance increases at a much lower average rate of 7% per year. There is no sign that this gap will be remedied in the near future. Even though processor speed is continuously increasing and processors can handle more instructions in one clock cycle, we will continue to experience considerable performance degradation each time we need to access memory. Pipeline stalls will occur when the data does not arrive soon enough after it has been requested from memory.
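To make the two growth rates concrete, they can be written as simple compound-growth formulas (a back-of-the-envelope illustration based only on the rates quoted above, not on exact industry data):

\[ N(t) \approx N_0 \cdot 2^{\,t/2}, \qquad \frac{\mathrm{CPU\ performance}(t)}{\mathrm{Memory\ performance}(t)} \approx \left(\frac{1.55}{1.07}\right)^{t} \approx 1.45^{t} \]

where t is measured in years and N_0 is the initial transistor count. At roughly 1.45^t, the processor-memory gap widens by about a factor of 40 per decade, which is why the "Memory Wall" dominates so many of the design decisions discussed in this chapter.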
Figure 2. Moore’s Law: Feature size decrease
Figure 3. Memory wall
Overcoming the Limits of Instruction-Level Parallelism

Modern superscalar processors are capable of fetching multiple instructions at the same time and executing as many instructions as there are functional units, exploiting the Instruction-Level Parallelism (ILP) that inherently exists even in otherwise sequential programs. Furthermore, in order to extract more instructions that can be executed in parallel, these processors employ dynamic instruction scheduling and have been equipped with larger instruction windows than ever. Even though increasing the size of the instruction window would increase to some extent the amount of ILP that a superscalar processor can deliver, control and data dependencies among instructions, branch mispredictions, and long-latency operations such as memory accesses limit the effective size of the instruction window. For SPEC benchmark programs (http://www.spec.org/), a basic instruction block typically consists of up to 25 instructions (Huang, 1999); however, the average block size for integer programs has remained small (Mahadevan, 1994), around 4-5 instructions (Marcuello, 1999). Wall (1991) also pointed out that most representative application programs do not have an intrinsic ILP higher than 7 instructions per cycle, even with unbounded resources and optimistic assumptions. Hence, if many slots in the instruction window are occupied by instructions depending on a preceding instruction suffering from a cache miss for its input operands, the effective size of the instruction window is quite small: only a few instructions can be issued due to the lack of Instruction-Level Parallelism. Therefore, the performance achieved by such processors is far below the theoretical peak as a result of poor resource utilization. For example, even a superscalar processor with a fetch width of eight instructions, derived from the MIPS R10000 processor (Yeager, 1996) and equipped with out-of-order execution and speculation, provided an Instructions Per Cycle (IPC) reading of only 2.7 for a multi-programming workload (multiple independent programs), and 3.3 for a parallel-programming workload (one parallelized program), despite a potential of eight (Eggers, 1997).
BACKGROUND

As the design of modern microprocessors, whether of superscalar or Very Long Instruction Word (VLIW) architecture, has been pushed to its limit, the performance gain that can be achieved is diminishing due to limited Instruction-Level Parallelism, even with deeper (in terms of pipeline stages) and wider (in terms of fetch/execute/retire bandwidth) pipeline designs (Culler, 1998; Eggers, 1997; Hennessy, 2006). Needless to say, the performance of a superscalar processor depends on how many independent instructions are delivered to both the front-end (all the stages before execution) and the back-end stages of the pipeline. Due to the sequential programming model, most software programs are written without giving consideration to parallelizing the code. This introduces practical problems when it comes to executing those programs because of the many control and data dependencies. This has compelled hardware architects to focus on breaking the barriers introduced by limited ILP:
• One approach entails performing speculative execution in order to deliver more Instruction-Level Parallelism. Many techniques for speculative execution have been studied to alleviate the impact of control dependencies among instructions. As the pipeline of microprocessors becomes wider and deeper, however, the penalty of incorrect speculation increases significantly.
• The other approach entails exploiting Thread-Level Parallelism (TLP) as well as ILP. If we can break the boundary among threads and execute instructions from multiple threads, there is a better chance to find instructions ready to execute.
Multithreading Microarchitectures

Multithreading microarchitectures can be classified by their method of thread switching: coarse-grain multithreading, fine-grain multithreading, Chip Multi-Processing (CMP), and Simultaneous MultiThreading (SMT). Different implementation methods can significantly affect the behavior of the application. In coarse-grain multithreading and fine-grain multithreading, at each cycle, we still execute instructions from a single thread only. In Chip Multi-Processing and Simultaneous MultiThreading, at each cycle we execute instructions from multiple threads concurrently.
Hardware-Supported Multithreading

The original idea of hardware-supported multithreading was to increase performance by overlapping communication operations with computation operations in parallel architectures, without any intervention from the software (Culler, 1998). Based on the frequency of thread swapping operations, hardware-supported multithreading can be divided into two categories:

• Coarse-grain multithreading (or blocked multithreading): a new thread is selected for execution only when a long-latency event occurs for the current thread, such as an L2 cache miss or a remote communication request. The advantage of coarse-grain multithreading is that it masks the otherwise wasted slots with the execution of another thread. The disadvantage is that when there are multiple short-latency events, the context switch overhead is high. Due to the limited ILP, the issue slot is not fully utilized when executing one thread. The MIT Alewife is implemented using this technique (Agarwal, 1995).
• Fine-grain multithreading (or interleaved multithreading): a new thread is selected for execution at every clock cycle, compared with coarse-grain multithreading, which only switches context on long-latency events. The advantage is that it does not require extra logic to detect long-latency events, and it handles both long-latency and short-latency events because the context switch happens anyway. The disadvantage is, again, the context switch overhead. Due to single-thread execution at every clock cycle, the issue slot is not fully utilized either. HEP (Smith, 1981), HORIZON (Thistle, 1988), and TERA (Alverson, 1990) all belong to this category.
Chip Multi-Processing

A CMP processor normally consists of multiple single-thread processing cores. As each core executes a separate thread, concurrent execution of multiple threads, hence TLP, is realized (Hammond, 1997). However, the resources on each core are not shared with others, even though a shared L2 cache (or L3 cache if there is one) is common in CMP designs. Each core is relatively simpler than a heavy-weight superscalar processor. The width of the pipeline of each core is smaller, so the pressure to exploit Instruction-Level Parallelism is reduced. Because of the simpler pipeline design, each core does not need to run at a high frequency, directly leading to a reduction in power consumption. Actually, this is one of the reasons why the industry shifted from single-core to multi-core processor design: the "Power Wall" problem. We cannot keep adding transistors to the processor and rely solely on raw frequency increases to claim it as the next-generation processor, simply because the power density would be prohibitive. On the other hand, with a multi-core design we avoid the "Power Wall" because we can now operate at a lower frequency while the power consumption increases only linearly with the number of cores. For a CMP processor, however, if the application program cannot be effectively parallelized, the cores will be under-utilized because we cannot find enough threads to keep all the cores busy at one time. In the worst case, only one core is working and we cannot execute across the cores to utilize the idle functional units of other cores.

There are actually two categories of CMP design: homogeneous multi-core and heterogeneous multi-core. In a homogeneous multi-core, we have identical cores on the same die. For example, the Stanford Hydra (Hammond, 2000) integrates four MIPS-based processors on a single die. The IBM POWER4 (Tendler, 2002) is a 2-way CMP design. The Intel® Core™2 Duo processor series, Core™2 Extreme Quad-Core processors, and AMD Opteron™ Quad-Core processors are new-generation CMP designs. Heterogeneous multi-core is an asymmetric design: there is normally one or more general-purpose processing core(s) and multiple specialized, application-specific processing units. The IBM Cell/B.E.™ belongs to this category.
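The "Power Wall" argument above can be made more concrete with the standard first-order model of dynamic power (a textbook approximation, not a figure taken from this chapter):

\[ P_{dynamic} \approx \alpha \cdot C \cdot V_{dd}^{2} \cdot f \]

where \alpha is the activity factor, C the switched capacitance, V_{dd} the supply voltage, and f the clock frequency. Because sustaining a higher frequency generally requires a higher supply voltage, pushing f on a single core drives power up superlinearly, whereas adding cores at a fixed, moderate frequency and voltage increases power only about linearly with the core count, which is the trade-off described above.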
Simultaneous MultiThreading

Superscalar and VLIW architectures are often equipped with more functional units than the width of the pipeline, because of a more aggressive execution model. Often, not all functional units are active at the same time because there are not enough instructions to execute due to limited ILP. Simultaneous MultiThreading has been proposed as an architectural technique whose goal is to efficiently utilize the resources of a superscalar machine without introducing excessive additional control overhead. An SMT processor is still one physical processor, but it is made to appear like multiple logical processors. In an effort to reduce hardware implementation overhead, most of the pipeline resources are shared, including instruction queues and functional units. Only the hardware parts necessary to retain the thread context are duplicated, e.g., the program counter (PC), register files, and branch predictors, as shown in Figure 4 (Kang, 2004). By allowing one processor to execute two or more threads concurrently, a Simultaneous MultiThreading microarchitecture can exploit both Instruction-Level Parallelism and Thread-Level Parallelism, accordingly achieving improved instruction throughput (Burns, 2002; Lee, 2003; Nemirovsky, 1991; Shin, 2003; Tullsen, 1995; Tullsen, 1996; Yamamoto, 1995). The multiple threads can come either from a parallelized program (parallel-programming workload) or from multiple independent programs (multi-programming workload). With the help of multiple thread contexts that keep track of the dynamic status of each thread, SMT processors have the ability to fetch, issue, and execute instructions from multiple threads at every clock cycle, taking advantage of the vast number of functional units that neither superscalar nor VLIW processors can absorb. Also, because of TLP, the pressure to exploit ILP within a single thread is reduced, and aggressive speculative execution is no longer needed, which reduces the chance of wrong-path execution. Hence, Simultaneous MultiThreading is one of the most efficient architectures for utilizing the vast computing power of such a microprocessor, achieving optimal system resource utilization and higher performance.
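A minimal structural sketch of this split between replicated and shared resources is given below. The field names and sizes are illustrative assumptions for a 2-thread SMT core, not the parameters of any particular processor:

#include <stdint.h>

#define NUM_THREADS   2      /* hardware thread contexts (assumed) */
#define IFQ_ENTRIES  32
#define ROB_ENTRIES 128

/* Per-thread state that an SMT core must replicate to hold a context. */
struct thread_context {
    uint64_t pc;                   /* program counter                     */
    uint64_t arch_regs[32];        /* architectural register file         */
    uint64_t branch_history;       /* per-thread branch predictor state   */
};

/* Resources shared by all hardware threads at every cycle. */
struct smt_core {
    struct thread_context ctx[NUM_THREADS]; /* replicated                 */
    uint32_t ifq[IFQ_ENTRIES];               /* shared instruction fetch queue */
    uint32_t rob[ROB_ENTRIES];               /* shared reorder buffer      */
    int      alu_busy[4];                    /* shared functional units    */
};

Only the thread_context portion grows with the number of hardware threads; the expensive queues and functional units are provisioned once and time-shared.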
Figure 4. SMT vs. CMP
The difference in scheduling among superscalar, CMP, and SMT is shown in Figure 5: CMP exploits TLP by executing different threads in parallel on different processing cores, while SMT exploits TLP by simultaneously issuing instructions from different threads with a large issue width on a single processor.

Figure 5. Resource utilization comparison of different microarchitectures
From the graph we can see that SMT processors inherently decrease the horizontal and vertical waste by executing instructions fetched from different threads (Eggers, 1997). They can provide enhanced performance in terms of instruction throughput as a result of making better use of the resources.
Commercial Implementation of SMT

SMT has been an active research area for more than a decade and has also met with some commercial success. Among others, embryonic implementations can be found in the designs of the CDC 6600 (Thornton, 1970), the HEP (Smith, 1981), the TERA (Alverson, 1990), the HORIZON (Thistle, 1988), and the APRIL (Agarwal, 1990) architectures, in which some concept of multithreading or Simultaneous MultiThreading already exists. The first major commercial development of SMT was embodied in the DEC 21464 (EV-8) (Preston, 2002); however, it never made it into production after DEC was acquired by Compaq. The Intel® Pentium® 4 processor at 3.06 GHz or higher (Hinton, 2001) and the Intel® Xeon® processor families (Marr, 2002) are the first modern desktop/server processors to implement SMT, with a basic 2-thread SMT engine (named Hyper-Threading (HT) technology by Intel®). When multiple threads are available, two threads can be executed simultaneously; if there is only one thread to execute, the resources can be combined together as if it were one single processor. Intel® claims its Hyper-Threading technology implementation requires only 5% hardware overhead, while providing up to a 65% performance improvement (Marr, 2002). This matches exactly the stated implementation goal of Hyper-Threading: the smallest hardware overhead with a high enough performance gain (Marr, 2002). Recently we see a trend to blur the boundary between CMP and SMT: the multi-core, multi-thread processor. For example, the IBM POWER5 (Sinharoy, 2005) is such an implementation, with multiple cores on a single chip, each of which is a 2-thread SMT engine. MIPS Technologies designed an SMT system called "MIPS MT"; one implementation of this architecture has 8 cores, and each core is a 4-thread SMT engine. All these examples demonstrate the power and popularity of SMT.
SMT DESIGN ASPECTS

With the concept of SMT in mind, this section dives into the unique design aspects of this microarchitecture. The techniques used to boost the performance of SMT processors can be roughly divided into the following categories: fetch policy, handling of long-latency instructions, resource sharing control, and synchronization and communication.
Thread Selection Policy

Just like superscalar machines, the performance of an SMT processor is affected by the "quality" of the instructions injected into the pipeline. There are two critical aspects to this observation:

• First, if the instructions fetched and/or executed have dependencies among each other or if they have long latencies, the ILP and TLP which can be exploited will be limited. This will result in a clogging of the instruction window and a stalling of the front-end stages.
• Second, if the instructions fetched and/or executed belong to the wrong path, these instructions will compete with the instructions from the correct path for system resources in both the front-end and the back-end, which degrades the overall performance and power efficiency.
Therefore, how to fill the front-end stages of an SMT processor with "high-quality" instructions from multiple threads is a critical decision which must be made at each cycle. Tullsen et al. (1996) suggested the following priority-based thread-scheduling policies for SMT microarchitectures that surpass the simple Round-Robin policy:

• BRCOUNT policy, which prioritizes the threads according to the number of unresolved branches in the front-end of the pipeline.
• MISSCOUNT policy, which prioritizes the threads according to the number of outstanding D-Cache misses.
• ICOUNT policy, which prioritizes the threads according to the number of instructions in the front-end stages.
• IQPOSN policy, which prioritizes the threads according to which one has the oldest instruction in the instruction queue.
Among those, the ICOUNT policy was found to provide the best performance in terms of overall instruction throughput. The reason is that the ICOUNT variable can indicate the current performance of a thread to some extent. However, the ICOUNT policy does not take speculative execution into account, because it does not consider that after an instruction has been injected into the pipeline, it may be discarded whenever a conditional branch preceding the instruction has been determined to have been incorrectly predicted. ICOUNT fails to distinguish between the instructions discarded in the intermediate stages due to incorrect speculation and the ones normally retired from the pipeline. Furthermore, the ICOUNT policy does not handle long-latency instructions well. If one thread has a temporarily low ICOUNT, it does not necessarily mean that a cache miss will not happen to the current instructions from that thread. As a result, the ICOUNT variable may incorrectly reflect the respective activities of the threads. This is one of the reasons why the sustained instruction throughput obtained under the ICOUNT-based policy still remains significantly lower than the possible peak. Sometimes, a priority-based fetch policy can cause uneven execution of the threads; consider the case where one thread has very few cache misses while another has frequent misses. In an effort to avoid biased execution so that all the threads can progress equally, Raasch et al. (1999) proposed a priority-rotating scheme in an attempt to increase the execution of instructions from less efficient threads when threads are of equal priority. However, the performance of this scheme is not as good as anticipated: the throughput falls short of the ICOUNT policy, and sometimes even of the Round-Robin policy. The authors suggested reinforcing the scheme by including a branch confidence estimator in the fetch decision-making process.
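A minimal sketch of ICOUNT-style fetch selection is shown below. The per-thread counter array, the stall flag, and the tie-breaking behavior are illustrative assumptions rather than the exact hardware of Tullsen et al. (1996):

#include <limits.h>

#define NUM_THREADS 4

/* icount[t]: number of instructions thread t currently holds in the
 * front-end stages (decode/rename/issue queues).                      */
static int icount[NUM_THREADS];
static int stalled[NUM_THREADS];   /* e.g., thread blocked on an I-cache miss */

/* Pick the thread allowed to fetch next cycle: the runnable thread
 * with the fewest instructions occupying the front end.               */
int icount_select(void)
{
    int best = -1, best_count = INT_MAX;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (!stalled[t] && icount[t] < best_count) {
            best_count = icount[t];
            best = t;
        }
    }
    return best;   /* -1 means no thread can fetch this cycle */
}

Because the counters shrink only when instructions leave the front end, a thread that clogs the queues automatically loses fetch priority, which is the property the text attributes to ICOUNT.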
Handling Long-Latency Instructions

Due to the "Memory Wall" problem, there is a major factor that affects resource distribution in SMT microarchitectures: long-latency instructions such as load misses. These instructions will clog the pipeline unless the data can be pre-fetched from memory. When one thread has injected many instructions into the pipeline and a load miss happens, the missing load and the instructions depending on it will not be able to move forward at all. Thus, the residency of those instructions in the pipeline does not necessarily translate into increased overall instruction throughput. On the contrary, they pollute the instruction window and waste system resources which could otherwise be utilized by instructions from other threads. Taking into consideration the severe "damage" these instructions can cause, an SMT processor must be aware of the execution of long-latency instructions. Since ICOUNT does not handle long-latency instructions well, Tullsen et al. (2001) proposed two fetch policies that can better deal with those instructions. One is STALL, which immediately stops fetching from a thread once a cache miss has been detected. The other is FLUSH, which flushes the instructions from threads with long-latency loads out of the pipeline, rather than letting them occupy system resources while waiting for the completion of the long-latency operations. In both schemes, however, the detection of long-latency operations comes too late (after an L2 miss), and flushing out all the instructions already fetched into the pipeline is not a power-efficient solution. There are several other techniques that attempt to advance the handling of those long-latency instructions and hence improve SMT performance. In DG (El-Moursy, 2003), when the number of outstanding L1 data cache misses from a thread is beyond a preset threshold, fetching from that thread is prohibited. However, L1 cache misses do not necessarily lead to L2 cache misses; therefore, stalling a thread in such a case may be too severe and would cause unnecessary stalls and resource under-use. It has thus been proposed in DWarn (Cazorla, IPDPS, 2004) to use L1 cache misses as an indicator of L2 cache misses and give those threads with cache misses a lower fetch priority instead of stalling them. This allows DWarn to act in a controlled manner on L1 misses before L2 misses even happen, so as to reduce resource under-use and avoid harming a thread when L1 misses do not lead to L2 misses.
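The difference between stalling on a detected long-latency miss and merely demoting a thread on early L1-miss warnings can be sketched as a simple priority function. This is a hedged illustration in the spirit of STALL and DWarn; the penalty constant and the status fields are assumptions:

#include <limits.h>

struct thread_status {
    int l1d_misses_outstanding;  /* pending L1 data-cache misses                */
    int l2_miss_pending;         /* nonzero once a long-latency L2 miss is known */
    int icount;                  /* instructions currently in the front end     */
};

/* Lower return value = higher fetch priority; INT_MAX = do not fetch.
 * STALL-like: a detected L2 miss blocks fetch from the thread entirely.
 * DWarn-like: outstanding L1 misses only demote the thread's priority. */
int fetch_priority(const struct thread_status *s)
{
    if (s->l2_miss_pending)
        return INT_MAX;              /* stall fetch for this thread          */
    int prio = s->icount;            /* base priority: ICOUNT                */
    if (s->l1d_misses_outstanding > 0)
        prio += 1000;                /* warned: lower priority, not a stall  */
    return prio;
}

A fetch stage would then pick, each cycle, the runnable thread with the lowest returned priority.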
Resource Partitioning among Multiple Threads

If we want to exploit more TLP, we need multiple threads to co-exist in the pipeline. At the same time, competition for system resources among these threads is also introduced. The overall performance of an SMT processor depends on many factors; how to distribute the resources among multiple threads is certainly one of the key issues in achieving better performance. Nevertheless, there are different opinions when it comes to this specific problem. Sometimes a dynamic sharing method is applied to the system resources at every pipeline stage in SMT microarchitectures (Eggers, 1997; Tullsen, 1995; Tullsen, 1996), which means threads compete for the resources and there is no quota on the resources that one single thread can utilize. At other times, all the major queues are statically partitioned (Koufaty, 2003; Marr, 2002), so that each thread has its own portion of the resources and there is no overlap. In most of the fetch policy studies, dynamic sharing was normally used and assumed to be capable of maximizing resource utilization and the corresponding performance. Fetch policy alone achieves the resource distribution function in an extremely indirect and limited way. Upon a load miss, the pipeline of a superscalar processor will simply stall after running out of instructions before the operand from memory returns. For SMT processors, on a load miss, other thread(s) can still proceed because of the TLP, but in a "handicapped" way. This is due to the fact that the instructions from the thread with a cache miss will occupy system resources in the pipeline, which directly translates into a reduction in the amount of system resources that other thread(s) can utilize. This is what we call "mutual-hindrance" execution. Hence, we do need direct control over system resources in order to achieve what we call "mutual-benefit" execution. This would allow us to avoid resources being unevenly distributed among threads, which could cause pipeline clogging. An investigation of the impact of different system resource partitioning mechanisms on SMT processors was performed by Raasch et al. (2003). Various system resources, like the instruction queue, ReOrder Buffer (ROB), issue bandwidth, and commit bandwidth, are studied under different partitioning mechanisms.
Figure 6. Fetch prioritizing and throttling scheme
The authors concluded that the true power of SMT lies in its ability to issue and execute instructions from different threads at every clock cycle. If those resources are partitioned among threads, it severely impairs the ability of SMT to exploit TLP. Hence, the issue bandwidth has to be shared all the time. They also observed that partitioning the storage queues, like the ROB, has little impact on the overall system performance. DCRA (Cazorla, MICRO, 2004) was proposed in an attempt to dynamically allocate the resources among threads by dividing the execution of each thread into different phases, using instruction and cache miss counts as indicators. The study shows that DCRA achieves around an 18% performance gain over ICOUNT in terms of the harmonic mean. Hill-Climbing (Choi, 2006) dynamically allocates the resources based on the current performance of each thread, which is fed back into the resource-allocation engine. It uses its hill-climbing algorithm to sample several different resource distributions first to find the local optimum and then adopts that distribution. It achieves slightly higher performance (2.4%) than DCRA but is certainly the most expensive scheme in terms of execution overhead when it comes to finding the local optimum. There is also the concern of how to establish that the local optimum is the global optimum. Liu C. et al. (2008) extended this work by proposing several different resource sharing control schemes and combining them with the front-end fetch policy to enforce the resource distribution. They also studied the impact on the overall performance caused by enforcing resource sharing control on both the front-end and the back-end of the pipeline. They introduced a two-level decision-making process. The widely accepted ICOUNT policy is still used for thread prioritizing in order to select the candidate thread to fetch instructions from in the next clock cycle. On top of the ICOUNT policy, another variable, the Occupancy Counter, is adopted. Each thread occupying a resource currently monitored is associated with a designated Occupancy Counter. At every clock cycle, more instructions from a given thread are fed into the queue; also, some instructions from the thread leave the queue and are passed on to the next stage of the pipeline or retire. The value of the Occupancy Counter is updated after comprehensively evaluating the number of instructions from that thread in the specific queue every cycle. If, after updating, the value of the Occupancy Counter of a running thread is greater than its assigned resource cap, the fetching of instructions from that thread will be stalled in the next clock cycle, even if it has the highest priority under the ICOUNT policy. This allows the throttling of selected thread(s) after prioritizing, which enforces the resource sharing control schemes among multiple threads, as shown in Figure 6. Four different resource sharing control mechanisms have been proposed (a simple sketch of this two-level mechanism follows the list below):
•
•
D-Share: Both Instruction Fetch Queue (IFQ) and ROB are in the dynamic sharing mode, just like other system resources. No throttling. IFQ-Fen2: Enforcing the sharing control on IFQ. Cap is set to be half of the IFQ entries, and other system resources are in the dynamic sharing mode. Throttling based on Occupancy Counter of IFQ. ROB-Fen: Enforcing the sharing control on ROB, Cap is set to be half of the ROB entries, while other system resources are in the dynamic sharing mode. Throttling based on Occupancy Counter of ROB. Dual-Fen: Enforcing the sharing control on both IFQ and ROB, Cap is set to be half of the IFQ or ROB entries, and other system resources are in the dynamic sharing mode. Throttling based on Occupancy Counters of either IFQ or ROB.
It is found that controlling the resource sharing of either IFQ or ROB is not sufficient if implemented alone. However, when controlling the resource sharing of both IFQ and ROB, the Dual-Fen scheme can yield an average performance gain of 38% when compared with the dynamic sharing case. The average L1 D-Cache miss rate has been reduced by 33%. The average time during which an instruction resides in the pipeline has been reduced by 34%. This demonstrates the power of the resource sharing control mechanism for SMT microarchitectures.
SMT Synchronization and Communication

When multiple processes share data, their accesses to the shared data must be serialized according to the program semantics so as to avoid errors caused by non-deterministic data access behavior. Conventional synchronization mechanisms in Symmetric MultiProcessing (SMP) designs are constrained by long synchronization latency, resource contention, and synchronization granularity. Synchronization latency is determined by where synchronization operations take place: for conventional SMP machines that perform synchronization operations in memory, it can take hundreds of cycles to complete one synchronization operation. Resource contention exists in many of the existing synchronization operations, e.g., test-and-set and compare-and-swap. These operations rely on a polling mechanism, which introduces serious contention problems: when multiple processes are attempting to lock a shared variable in memory, only one process will succeed, while all other attempts are pure overhead. In addition, contention may lead to deadlock situations that require extra mechanisms for deadlock prevention, which further degrade system performance. Furthermore, due to the long latency associated with each synchronization operation, most synchronization operations in SMP designs are coarse-grained. Thus, a data structure such as an array needs to be locked as a whole for synchronization, even though only one array element is actually under synchronization at any instant of parallel execution. This results in unnecessary serialization of accesses to data structures and restricts the parallelization of programs (Liu, 2007).
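To make the contention problem concrete, a conventional test-and-set spin lock is sketched below using C11 atomics. Every failed acquisition attempt keeps polling the shared lock word and consuming fetch and issue slots, which is precisely the overhead that the SMT-specific mechanisms discussed next try to eliminate (a generic illustration, not code from any of the cited systems):

#include <stdatomic.h>

/* A conventional spin lock: acquisition polls a shared memory word.
 * Initialize the flag with ATOMIC_FLAG_INIT before first use.        */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* test-and-set loop: every failed iteration is pure overhead and
     * keeps competing for shared SMT fetch/issue resources.           */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;  /* spin */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}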
Figure 7. Microarchitecture of the Godson-2 SMT processor
The granularity and performance of synchronization operations determine the degree of parallelism that can be extracted from a program. Hence, conventional coarse-grained synchronization operations cannot exploit the fine-grained parallelism that is required for SMT designs. As demonstrated by Tullsen et al. (1999), an SMT processor differs from a conventional multiprocessor in several crucial ways which influence the design of SMT synchronization:

• Threads share data in the L1 cache, instead of in memory as in SMP designs, implying a much lower synchronization latency.
• Hardware thread contexts on an SMT processor share functional units, thus synchronization and communication of data can be much more effective than through memory. Based on this characteristic, one possible way of synchronizing is through direct register access between two threads.
• Threads on an SMT processor compete for all fetch and execution resources each cycle, thus synchronization mechanisms that consume any shared resources without making progress can impede other threads. In the extreme case, when one thread demands blocking synchronization while holding all the resources, such as all instruction window entries, a deadlock would occur.
•
• Low Latency: this can be easily achieved because threads in SMT share data in the L1 cache. As mentioned before, one possibility is synchronization through direct register access, but this may complicate the hardware design needed to avoid deadlock situations.
• Fine-Grained: the degree of parallelism that can be exploited in a parallel computing system is limited by the granularity of synchronization. To achieve high performance, the SMT design must be capable of handling fine-grained synchronization.
• Minimum Contention: conventional synchronization mechanisms such as spin locks require either spinning or retrying, thus consuming system resources. This effect is highly undesirable. To achieve high performance, stalled threads must use zero processor resources.
• Deadlock Free: blocked threads must release processor resources to allow execution progress.
One interesting SMT synchronization mechanism is implemented in the Godson-2 SMT processor. As shown in Figure 7 (Li, 2006), the Godson-2 SMT processor supports the simultaneous execution of two threads, and each thread owns its individual program counter, logical registers, and control registers. Other system resources, including the various queues, the pipeline path, the functional units, and the caches, are shared between the two threads. The Godson-2 SMT processor implements full/empty synchronization to pass messages between threads at the register level. Each register has an associated full/empty bit, and each register can be read and written by synchronized read and write instructions. Communication and synchronization through registers meets the goal of low latency; also, the granularity of synchronization in this case is at the single-register level, which meets the goal of fine granularity. On the other hand, the full/empty scheme may result in deadlock. Consider a synchronized read instruction that, after being decoded, sits in the register renaming stage while the register it reads is empty (not ready or not produced). If this instruction waits in the register renaming stage for the register it reads to be set to full, it will block the pipeline and result in a deadlock. One solution to this problem is to hold synchronized read/write instructions in the instruction buffer in the decode stage and rename the register, to obtain the correct physical register number, only after the register is full (ready or produced). This approach avoids blocking the whole pipeline and thus prevents deadlocks. Furthermore, this synchronization mechanism is contention-free because once a synchronized read operation is issued, the thread is blocked and does not consume any processor resources until the operation is retired. Another interesting SMT synchronization approach has been proposed by Tullsen et al. (1999). This approach uses hardware-based blocking locks such that a thread which fails to acquire a lock blocks and frees all the resources it is using, except for the hardware context itself; further, a thread that releases a lock causes the blocked thread to be restarted. The implementation of this scheme consists of two hardware primitives, Acquire and Release, and one hardware data structure, a lock box. The Acquire operation acquires a memory-based lock and does not complete until the lock has been acquired. The Release operation releases the lock if no other thread is waiting; otherwise, the next waiting thread is unblocked. The lock box contains one entry per context, and each entry contains the address of the lock, a pointer to the lock instruction that blocked, and a valid bit. The scheme works as follows: when a thread fails to acquire a lock, the lock address and instruction pointer are stored in the lock box entry, and the thread is flushed from the processor after the lock instruction. When another thread releases the lock, the blocked thread is found in the lock box and its execution is resumed; in the meantime, this thread's lock box entry is invalidated. This approach has low latency and is fine-grained because synchronization takes place at the level of the L1 cache and the size of the data can be adjusted. Also, when a thread is blocked, all its instructions are flushed from the instruction queue, thus guaranteeing execution progress and freedom from deadlock. In addition, this approach imposes minimal contention because once Acquire fails, the thread is blocked and consumes no processor resources. As indicated by Liu S. et al. (2008), modern applications may not contain enough ILP due to data dependencies among instructions. Nevertheless, value prediction techniques are able to exploit the
inherent data redundancy in application programs. Specifically, value prediction techniques are able to predict the value to be produced before the instruction executes, so execution can move on with the correctly predicted value. Value prediction requires extra hardware resources, and it also requires a recovery mechanism for when a value is not correctly predicted. SMT is a natural platform for value prediction because the system is underutilized precisely when there is not enough ILP. When this happens, a speculative thread can be triggered to perform value prediction on the underutilized resources, which allows execution to proceed if the value is correctly predicted. Value prediction techniques in the context of the SMT architecture have been studied by Gontmakher et al. (2006) and Tuck et al. (2005). In (Tuck, 2005), it is shown that by allowing value-speculative execution to proceed in a separate thread, value prediction is able to overcome data dependencies present in traditional computing paradigms; with value prediction techniques, a 40% performance gain has been reported. In (Gontmakher, 2006), the authors examine the interaction of speculative execution with thread-related operations and develop techniques to allow thread-related operations to be executed speculatively; the results demonstrate a 25% performance improvement.
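The basic value-prediction contract can be sketched as follows: produce a predicted value early, let dependent work proceed speculatively, then verify against the real result and squash on a mismatch. A simple last-value predictor is assumed here purely for illustration; it is not the specific mechanism evaluated in the cited studies:

#include <stdint.h>
#include <stdbool.h>

#define PRED_ENTRIES 1024

/* Last-value predictor: remember the value an instruction produced last time. */
static uint64_t last_value[PRED_ENTRIES];

static uint64_t predict(uint64_t pc)
{
    return last_value[pc % PRED_ENTRIES];
}

/* Called when the real result becomes available.
 * Returns true if speculative work based on the prediction may commit;
 * false means the dependent work must be squashed and replayed.        */
static bool verify(uint64_t pc, uint64_t actual)
{
    bool ok = (last_value[pc % PRED_ENTRIES] == actual);
    last_value[pc % PRED_ENTRIES] = actual;   /* train the predictor */
    return ok;
}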
POTENTIAL BENEFITS OF SMT

We have discussed a number of design issues. We will now address some potential incidental benefits of SMT microarchitectures beyond strict performance improvement.
SMT for Fault-Tolerance

One possible SMT application is to design microprocessors resistant to transient faults (Li X., 2006). The multi-thread execution paradigm inherently provides the spatial and temporal redundancy necessary for fault-tolerance. We can run two copies of the same thread on an SMT processor and compare the results in order to detect any transient fault which may have occurred in the meantime. Upon detection of an error, this allows the processor state to be rolled back to a known safe point and the instructions to be retried, thereby resulting in an error-free execution. This means that temporal redundancy is inherently implemented by SMT: for instance, assume a soft error occurred in a functional unit (FU) when executing an instruction from thread #1. Even though the FUs are typically shared between active threads, since the soft error is assumed to be transient, as long as the same instruction from thread #2 is executed at a different moment, the results of the redundant execution from the two copied threads will not match. Furthermore, if any fault in the pipeline is detected, the checkpoint information can then be used to return the processor to a state corresponding to a fault-free point. After that, the processor can retry the instructions from the point of recovery. Nevertheless, this basic idea comes at a cost. Generally speaking, it requires the redundant execution of the two copied threads to provide appropriate fault detection coverage for a given processor component. Hence, the higher the desired fault detection coverage, the more redundant execution is required. However, redundant execution inevitably comes at the cost of performance overhead, added hardware, increased design complexity, etc. Consequently, how to trade fault detection coverage off against the added costs is essential for the practicality of the basic idea. Specifically, consider the need to generate redundant executing threads: given a general five-stage pipeline comprised of instruction fetch, decode, issue, execute and retire stages, all stages can be exploited for that requirement. Take the fetch stage as
Figure 8. Functional diagram of the fault-tolerant SMT data path
an example; we can generate the redundant threads by fetching instructions twice. Since the instruction fetch stage is the first pipeline stage, the redundant execution would then cover all the pipeline stages and, thus, the largest possible fault detection coverage could be achieved. However, allowing two redundant threads to fetch instructions would possibly end up halving the effective fetch bandwidth; consequently, that halved fetch bandwidth would be an upper bound on the maximum pipeline throughput. Additionally, the redundant thread generated in the fetch stage would then compete not only for the decode bandwidth, the issue bandwidth, and the retire bandwidth, but also for Issue Queue (IssueQ) and ROB capacity, which are all identified as key factors that affect the performance of the redundant execution. Conversely, we can re-issue the retired instructions from the ROB back to the functional units for redundant execution. In doing so, the bandwidth and spatial occupancy contention in the IssueQ and ROB can be relieved, and thus the performance overhead can be lowered. However, this retire-stage-based design comes at the price of smaller fault detection coverage: only the execution stage would be covered. Given these trade-off considerations, we can simply fetch the instructions once and then immediately copy the fetched instructions to generate the redundant thread. In doing this, there is no need to partition the fetch bandwidth between the redundant threads. Moreover, we can rely on dispatch thread scheduling and redundant thread reduction to relieve the contention in the IssueQ and ROB; both techniques lower the performance overhead. Other than the design trade-off, another issue associated with the basic idea is the need to prevent deadlocks. In a fault-tolerant SMT design, two copies of the same thread are now cooperating with each other, and such cooperation could cause deadlocks. We present a systematic deadlock analysis and conclude that as long as the ROB, the Load Queue (LQ), and the Store Queue (SQ) (the instruction issue queues for load and store instructions, respectively) have allocated some dedicated entries to the trailing thread, the deadlock situations identified can be prevented. Based on this conclusion, we propose two ways to prevent any deadlock situation: one is to statically allocate entries in the ROB, LQ, and SQ for the redundant thread copy; the other is to dynamically monitor for deadlocks.
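The fault-detection core of this basic idea reduces to a retire-time comparison of the leading and trailing copies of each dynamic instruction, with a rollback to the last checkpoint on a mismatch. The sketch below is a simplified illustration; the structure names and the commit/recover hooks are assumptions:

#include <stdint.h>
#include <stdbool.h>

struct retired_insn {
    uint64_t seq;      /* sequential number assigned at fetch */
    uint64_t result;   /* value produced by the instruction   */
};

/* Compare the leading-thread (LT) and trailing-thread (TT) copies of
 * the same dynamic instruction. A mismatch signals a transient fault. */
static bool results_match(const struct retired_insn *lt,
                          const struct retired_insn *tt)
{
    return lt->seq == tt->seq && lt->result == tt->result;
}

static void commit_or_recover(const struct retired_insn *lt,
                              const struct retired_insn *tt)
{
    if (results_match(lt, tt)) {
        /* fault-free: commit architectural state and advance the checkpoint */
    } else {
        /* transient fault detected: restore the last checkpoint and
         * re-execute both copies from the point of recovery            */
    }
}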
Lowering the Performance Overhead

As discussed, to lower the performance overhead we can simply fetch the instructions once and then immediately copy the fetched instructions in order to generate the redundant thread. However, in doing so, faults in three major components of the fetch stage, namely the I-Cache, the Program Counter, and the Branch Prediction Units (BPU), might not be covered. In particular, any transient faults which happen inside the I-Cache might not be detected. To protect the I-Cache, however, we can implement Error Correcting Code (ECC)-like mechanisms that are very effective at handling transient faults in memory structures.
Further, a fault occurring in the BPUs will have no effect on the functional correctness of program execution; the critical PCs, however, must also be protected by ECC-like mechanisms. As shown in Figure 8, the instruction copy operation is simple: the fetched instructions are simply buffered into two instruction queues; hence, the copy operation would neither lengthen the pipeline cycle time nor require another pipeline stage. To be specific, each instruction fetched can be bound to a sequential number and a unique thread ID. For instructions that are stored in the IFQ, the "leading thread" (LT) is used as their thread ID, whereas for those stored in another IFQ, called the trace queue (traceQ), the "trailing thread" (TT) is used. It should be noted that traceQ also serves in the two performance-overhead-lowering techniques which will be described in detail in the following subsections. Focusing on our redundant execution mode, the key factors that affect the performance of redundant execution can be identified as contention for bandwidth as far as issue, execution, and retire operations are concerned, as well as capacity contention in the IssueQ and ROB. We now address these types of resource contention by introducing four schemes to make TT as lightweight as possible (remember that executing TT is merely for fault detection purposes). In doing so, the competition for the IssueQ, ROB, and FUs can be reduced. The first scheme we propose is to prevent mispredicted TT instructions from being dispatched for execution. This is based on the observation that the number of dynamically mispredicted instructions may be a significant portion of the total fetched instructions. For example, Kang et al. (2008) observed that nearly 16.2% to 28.8% of the instructions fetched would be discarded from the pipeline even with high branch prediction accuracy. Hence, if we could prevent mispredicted instructions in TT from being dispatched, the effective utilization of the IssueQ, ROB, and FUs would be accordingly improved. Based on this observation, we leverage LT branch resolution results to completely prevent the mispredicted instructions in TT from being dispatched. It should also be noted that in this design neither a branch outcome queue nor a branch prediction queue is needed. Specifically, when a branch instruction is encountered in traceQ, the dispatch operation checks its prediction status: if its prediction outcome has been resolved by its counterpart from LT, we continue its dispatch operation; otherwise, the TT dispatch operation is paused. In order not to pause the TT dispatch operation, LT must be executed ahead of TT. This LT-ahead-of-TT execution mode is called staggered execution. To set up the TT branch instruction status (the initial status is set as "unresolved"), every completed branch instruction from LT searches traceQ to match its TT counterpart. We should note here that the sequential numbers provide the means for matching the two redundant threads' instructions. As we have seen, each instruction fetched is associated with a sequential number at first, and then the fetched instruction is replicated to generate the redundant thread. In doing so, two copied instructions will have the same sequential number in different threads. It should also be noted that such a sequential number feature has been implemented, for example, in the Alpha and PowerPC processors.
If the branch has been correctly predicted, the status of the matched counterpart TT branch instruction is set to "resolved". Conversely, if the branch has been mispredicted, LT performs its usual branch misprediction recovery and, at the same time, flushes all instructions inside traceQ that are located behind the matched counterpart branch instruction. In other words, LT performs the branch misprediction recovery for both LT and TT; TT does not recover from any branch misprediction by itself. After recovery, the status of the TT branch instruction is set to "resolved". In the second scheme, we adopt the Load Value Queue (LVQ) design (Reinhardt, 2000) and include it in our design as shown in Figure 8. Basically, when an LT load fetches data from the cache (or the main memory), the fetched data and the associated matching tag are also buffered into the LVQ. Instead
of accessing the memory hierarchy, the TT loads simply check the LVQ for matching entries and obtain the fetched data from there. In doing so, TT avoids D-Cache miss penalties and in turn improves its performance. Note that in order to fully benefit from the LT data prefetching, we must guarantee that LT is always ahead of TT, which again requires the staggered execution mode.

The third scheme applies thread scheduling at the dispatch stage. It is well known that there are many idle slots in the execution pipeline, and the redundant execution should exploit those idle slots as much as possible in order to circumvent the performance-affecting contention identified above. To exploit the idle slots, we must ensure that whenever one thread is idle for any reason, the execution resources are promptly allocated to the other thread, which can utilize them more efficiently. The ICOUNT policy (Tullsen, 1996) was proposed to schedule threads so as to fill the IssueQ with issuable instructions, i.e., to restrict threads from clogging the IssueQ. However, we argue that it is the dispatch stage that directly feeds the IssueQ with useful instructions; scheduling threads at the dispatch stage therefore reacts more promptly to thread idleness in the IssueQ. We thus modify the ICOUNT policy as follows (see also Figure 8): at each clock cycle, we count the number of instructions from LT and TT that are still waiting in the IssueQ, and a higher dispatch priority is assigned to the thread with the lower instruction count. More specifically, with a dispatch rate of eight instructions per cycle, the selected thread is allowed to dispatch as many instructions as possible (up to eight); if the selected thread leaves any dispatch slots unused, the alternate thread consumes the remaining slots. This policy is denoted "ICOUNT.2.8.dispatch".

While developing techniques to make TT as simple as possible, we found that a staggered execution mode is beneficial for those techniques. To that end, we propose the fourth scheme, "slack dispatch": in the instruction dispatch stage, if the selected thread is TT, we check the instruction distance between LT and TT. If the distance is less than a predefined threshold, we skip the TT dispatch operation and continue buffering TT in traceQ. This means that the size of traceQ (its number of entries) must satisfy: sizeof(traceQ) > sizeof(IFQ) + predefined distance.

Moreover, for fault-detection purposes, all retired LT instructions and their execution results are buffered into the checking queue (chkQ), as shown in Figure 8; TT is then responsible for triggering the result comparison. We further assume that the register file of TT is protected by ECC-like mechanisms, so that, if a fault is detected, the register file state of TT can be used to recover that of LT.
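The LVQ behavior of the second scheme can be sketched at the software level as follows. This is a minimal illustration: the fixed queue size and the use of the shared sequential number as the matching tag are our own assumptions, not details specified by the original design.

#include <stdint.h>
#include <stdbool.h>

#define LVQ_SIZE 32

typedef struct { uint64_t seq; uint64_t data; bool valid; } lvq_entry_t;
static lvq_entry_t lvq[LVQ_SIZE];

/* Called when an LT load completes its cache/memory access:
 * the loaded value is buffered together with its matching tag. */
bool lvq_insert(uint64_t seq, uint64_t data)
{
    for (int i = 0; i < LVQ_SIZE; i++)
        if (!lvq[i].valid) {
            lvq[i] = (lvq_entry_t){ seq, data, true };
            return true;
        }
    return false;   /* LVQ full: the LT load must stall */
}

/* Called when the matching TT load issues: instead of accessing the
 * memory hierarchy, it looks the value up in the LVQ. */
bool lvq_consume(uint64_t seq, uint64_t *data)
{
    for (int i = 0; i < LVQ_SIZE; i++)
        if (lvq[i].valid && lvq[i].seq == seq) {
            *data = lvq[i].data;
            lvq[i].valid = false;   /* entry released for reuse */
            return true;
        }
    return false;   /* LT not far enough ahead: the TT load waits */
}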
Deadlock Analysis and Prevention

As pointed out before, the two copies of a thread cooperate with each other for fault checking and recovery. However, if not carefully synchronized, such cooperation could result in deadlock situations in which neither copy can make any progress. To prevent this, a detailed analysis and appropriate synchronization mechanisms are necessary. Resource sharing is one of the underlying conditions of deadlocks, and there is indeed much resource sharing between the two thread copies. For example, the IssueQ is a shared hardware resource for which both thread copies contend. The availability of issue slots is another type of resource sharing: the issue bandwidth is dynamically partitioned between the two thread copies. Take chkQ as an example: only if there is a free entry in chkQ can LT retire an instruction and back up the retiring instruction and its execution result there. On the other hand, an entry in chkQ can only be freed by TT: only after an instruction has been retired and compared can the corresponding entry in chkQ be released. Further, due to the similarity between dispatch and issue operations, we combine
Figure 9. Resource allocation graph for the fault-tolerant SMT deadlock analysis
them under the term "issue resource" in the discussion which follows. Based on Figure 9, we can list all possible circular wait conditions. However, some conditions obviously do not end up in a deadlock (e.g., "LT → traceQ → TT → SQ → LT"). After exhausting the list, we describe all the possible deadlock scenarios as follows:

1. LT → chkQ → TT → issue resource → LT
Scenario: When chkQ is full, LT cannot retire its instructions, so the LT instructions that are ready to retire are simply stalled in the ROB. If that stalling ends with the ROB full of instructions from LT (a case that can be exacerbated by the fact that LT is favored by the dispatch thread scheduling policy to maintain the staggered execution mode), the instruction dispatch operation is blocked and TT is stalled in traceQ. Consequently, no corresponding instructions from TT can catch up to release the chkQ entries, and a deadlock can occur. In summary, the condition for this deadlock situation is the following: Observation 1: When chkQ is full and ROB is full of instructions from LT, a deadlock happens.

2. LT → LVQ → TT → issue resource → LT
Observation 2: When LVQ is full and LQ is full of instructions from LT, a deadlock happens. Similarly, the stalled load instructions could end up filling the ROB, blocking the instruction dispatch operation. Hence, a further deadlock observation follows: Observation 3: When LVQ is full, ROB is full, and there are no load instructions from TT in the ROB, a deadlock happens.
3. LT → SQ → TT → issue resource → LT
Observation 4: When SQ is full of instructions from LT, a deadlock happens.

Based on the above systematic deadlock analysis, we propose two mechanisms to handle the possible deadlock situations: static hardware resource partitioning and dynamic deadlock monitoring.

In static hardware resource partitioning, where each thread has its own allocated resources, the identified deadlock conditions can be broken so that the deadlock is prevented. For example, we can partition the ROB in order to prevent the deadlock situation identified in Observation 1: if some entries of the ROB are reserved for TT, the TT dispatch operations can continue because, when chkQ is full, the partitioned ROB cannot be full of instructions from LT. The dispatched TT instructions will subsequently be issued and their execution completed; after completion, they trigger the result comparison and free the corresponding chkQ entries if the operation was found to be fault-free. Once some chkQ entries have been freed, LT can make progress again. Moreover, we find that only three hardware resources (ROB, LQ, and SQ) need to be partitioned in order to prevent all the deadlock situations we identified: partitioning the ROB breaks the deadlock situation of Observation 1, since the ROB can then never be full of instructions from LT, so TT will be dispatched and chkQ entries will be released; partitioning the LQ similarly breaks the deadlock situation of Observation 2; and partitioning the SQ breaks the deadlock situation of Observation 4. Now consider Observation 3: when LVQ is full, an LT load instruction LD_k in the LQ cannot be issued. However, since the ROB is now partitioned between LT and TT, the stalled load instruction LD_k in the ROB only blocks LT from being dispatched. In other words, the TT dispatch operation is not blocked by the stalled load instruction LD_k; thus, for example, another load instruction LD_i from TT will be dispatched, which will then release the LVQ entry occupied by the counterpart load instruction LD_i from LT. Once free LVQ entries are available, the stalled LT load instruction LD_k can be issued. In summary, we have the following observation: Observation 5: For each of ROB, LQ, and SQ, allocating some dedicated entries for TT prevents the deadlock situations identified.

It should be noted, however, that static hardware resource partitioning has some performance impact on the SMT, particularly when partitioning ROB, LQ, and SQ. To mitigate this impact, we allocate only the minimum number of entries for TT that is needed to prevent deadlocks, and the remainder of each queue is shared between LT and TT. Hence, the maximum number of entries available to LT is the total queue size minus the reserved entries, whereas the maximum number of entries available to TT is the total queue size.

From the deadlock analysis we can also conclude that if we can dynamically regulate the progress of LT such that neither ROB nor LQ nor SQ can be filled with instructions only from LT, the identified deadlock situations can be prevented. As illustrated in Figure 10, we dynamically count the number of instructions from LT in ROB, LQ, and SQ, respectively, and a caution signal is generated if at least one of the counts exceeds the corresponding predefined occupancy threshold. As long as the caution signal is raised, the dispatch thread scheduling policy holds LT back from being dispatched. The resulting policy is listed in Algorithm 1.

Algorithm 1. Dispatch thread scheduling policy (with dynamic deadlock monitoring)
  Apply ICOUNT.2.8.dispatch policy;
  if ((selected thread is LT) AND (IFQ not empty) AND (no caution signal))
      Dispatch from IFQ;
  else if ((distance between LT and TT meets the predefined staggered execution mode) AND
           (traceQ is not empty) AND (not an unresolved branch instruction))
      Dispatch from traceQ;
  else
      Nothing to be dispatched.
Figure 10. Dynamic monitoring for the deadlock prevention
To be specific, the comprehensive dispatch thread scheduling policy we developed is listed in Algorithm 1: first, we apply the ICOUNT.2.8.dispatch policy. If the selected thread is LT, we must then check whether the IFQ is empty, since no instruction can be dispatched from an empty IFQ; furthermore, we need to make sure no caution signal has been generated, and if there is such a signal we must stop dispatching from LT. On the other hand, if the selected thread is TT, we check the following conditions before dispatching TT: (1) the staggered execution mode requirement is met; (2) traceQ is not empty; (3) we are not encountering an unresolved branch instruction. It should be noted that the dynamic deadlock monitoring approach offers higher design flexibility than static resource partitioning: by adjusting the predefined occupancy thresholds, we can manipulate the resource allocation between the cooperating threads. However, this flexibility comes at the cost of additional hardware as well as a more complicated thread scheduling policy.
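To make the monitoring concrete, the following C sketch shows one way the caution signal and the per-cycle decision of Algorithm 1 could be expressed, e.g. in a simulator. The data structures, thresholds, and function names are illustrative assumptions and are not part of the original proposal.

typedef struct {
    int rob_lt, lq_lt, sq_lt;                 /* current LT occupancy  */
    int rob_thresh, lq_thresh, sq_thresh;     /* predefined thresholds */
} occupancy_t;

/* Caution signal: raised if any LT occupancy count exceeds its threshold. */
static int caution_signal(const occupancy_t *o)
{
    return o->rob_lt > o->rob_thresh ||
           o->lq_lt  > o->lq_thresh  ||
           o->sq_lt  > o->sq_thresh;
}

typedef enum { DISPATCH_NONE, DISPATCH_LT, DISPATCH_TT } dispatch_t;

/* One dispatch decision per cycle, following Algorithm 1. */
dispatch_t select_dispatch(int icount_lt, int icount_tt,     /* IssueQ counts */
                           int ifq_empty, int traceq_empty,
                           int lt_tt_distance, int stagger_threshold,
                           int head_branch_unresolved,
                           const occupancy_t *occ)
{
    /* ICOUNT.2.8.dispatch: prefer the thread with fewer instructions
     * still waiting in the IssueQ. */
    int selected_lt = (icount_lt <= icount_tt);

    if (selected_lt && !ifq_empty && !caution_signal(occ))
        return DISPATCH_LT;                  /* dispatch from IFQ    */

    if (lt_tt_distance >= stagger_threshold &&    /* slack dispatch   */
        !traceq_empty && !head_branch_unresolved)
        return DISPATCH_TT;                  /* dispatch from traceQ */

    return DISPATCH_NONE;
}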
SMT for Secure Communication

Another possibility is to exploit SMT microarchitectures for secure communication. Traditionally, computer security focuses on the prevention and avoidance of software attacks. Nevertheless, the PC architecture is too trusting of its code environment, which makes PCs vulnerable to hardware attacks (Huang, 2003). For instance, to take control of a processor, a hacker can mount a man-in-the-middle attack that injects malicious code into the system bus connecting the processor and the memory module. One approach to counter these attacks is memory authentication, which guarantees data integrity. Memory authentication is normally performed in three steps (Yan, 2006). First, all memory lines are brought to the processor for authentication tag computation. Then, these lines are sent back to memory. Finally, each time a line is brought to the processor at run time, the authentication tag is recomputed and compared. This approach takes extra CPU cycles for authentication, generates extra bus traffic, and is vulnerable at system start time. To compensate for the performance overhead, many proposals add extra pipeline stages or hardware units for authentication (Shi, 2004). However, the extra hardware overhead involved makes trusted systems affordable only to high-end users. Is it worth the hardware overhead? Financial institutions can afford to spend hundreds of thousands of dollars on trusted systems, but this is too much for ordinary PC users. Is it worth the performance overhead? Large trusted systems can afford to spend 60% of their cycle time scrutinizing every instruction received, but this is certainly not acceptable for ordinary PC users either.

To address these issues, we propose the Patrolling Thread (PT) for instruction memory authentication in an SMT microarchitecture. We choose to authenticate only instruction memory because the most common attack is malicious code injection, which makes instruction memory the "Achilles' heel" of computer (hardware) security. Also, instruction memory is one-way traffic (read-only), which makes security schemes easier to implement. Our proposed scheme incurs little performance overhead because it utilizes idle resources for the authentication computation by employing the SMT technique. Furthermore, since PT uses only existing pipeline stages and resources, little hardware overhead is necessary. In addition, by dedicating a hardware thread to system security, our approach provides tunable security levels so that the system can operate under different requirements and environments.

Even though SMT exploits TLP in addition to ILP, the pipeline utilization still cannot reach 100%. The patrolling thread can therefore take advantage of unused pipeline resources to execute the instruction memory authentication algorithm, thereby minimizing the impact on regular program execution. If an incoming instruction does not pass the authentication test, a warning is issued and the system stops taking in any more instructions from memory until recovery. To accommodate different security requirements and performance overheads, we have proposed three different schemes to implement the patrolling thread (the sketch after this list illustrates the selection logic):
i. Regular-checking scheme: serves as the baseline scheme, e.g., check one in every ten incoming instruction lines. This approach introduces some performance overhead and can secure the system even when utilization is high, but a small number of malicious instructions could still sneak in. In most situations this approach is secure enough, because one malicious instruction is usually not enough to cause disastrous effects on the system. For instance, if n instructions are required to hack the system, then as long as the patrolling thread catches one line of malicious instructions before all n instructions enter the processor, the system is safe.
ii. Self-checking scheme: the patrolling thread examines the incoming instruction lines only if there are free pipeline slots available. This scheme incurs no performance overhead, but it becomes vulnerable when the system utilization is kept high.
iii. Secured-checking scheme: the patrolling thread is scheduled to authenticate every incoming instruction line regardless of the system utilization.
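A compact way to view the three schemes is as a per-line decision, as in the C sketch below. The sketch is purely illustrative: the one-in-ten sampling interval follows the example given for the regular-checking scheme, and the names are our own.

typedef enum { REGULAR_CHECKING, SELF_CHECKING, SECURED_CHECKING } pt_scheme_t;

/* Decide whether the patrolling thread authenticates this incoming
 * instruction line. 'line_counter' counts incoming lines; 'free_slot'
 * indicates that the pipeline has an idle slot this cycle. */
int should_authenticate(pt_scheme_t scheme, unsigned line_counter, int free_slot)
{
    switch (scheme) {
    case REGULAR_CHECKING:              /* e.g. one line in every ten */
        return (line_counter % 10) == 0;
    case SELF_CHECKING:                 /* only when the pipeline has slack */
        return free_slot;
    case SECURED_CHECKING:              /* every line, regardless of load */
        return 1;
    }
    return 0;
}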
For the authentication algorithm we choose the One-Key CBC MAC (OMAC), a block-cipher-based message authentication algorithm for messages of any bit length (Iwata, 2003). Using this algorithm, a block cipher and a tag length need to be specified, and both parties (the Memory Management Unit (MMU) and the processor in our case) share the same secret key for tag generation and tag verification. In the proposed PT approach, when a memory line is requested, the MMU generates a tag for the line, and the processor can check the line's authenticity by verifying its tag. We assume that the MMU is able to generate authentication tags on the fly, since it has been demonstrated that the MMU can be modified to carry out more sophisticated security operations, such as encryption and decryption (Gilmont, 1999).

We now give a brief analysis of the probability of detecting malicious code using the patrolling thread scheme. The question can be summarized as follows: if we have m lines of instructions coming in, n of which are malicious, and we perform memory authentication on k lines of instructions, what is the probability P(Detection) of catching one line of malicious code? The probability that the first line we authenticate is malicious is

P(Detection_1) = n/m.

If the first memory line passes the authentication, we choose a second memory line to check, so that

P(Detection_2) = (m-n)/m × n/(m-1).

The first factor represents the event that the first checked line is one of the m-n genuine lines; the second time, we pick one of the remaining m-1 memory lines to authenticate. The probability that we catch the malicious code on the third check is

P(Detection_3) = (m-n)/m × (m-n-1)/(m-1) × n/(m-2),

and so on until the k-th check:

P(Detection_k) = (m-n)/m × (m-n-1)/(m-1) × ... × n/(m-(k-1)).
Figure 11. P(Detection) with m = 10
Figure 12. Pipeline with patrolling thread
P(Detection) is the summation of all the above terms, which we can write as

P(Detection) = Σ_{i=1}^{k} [ C(m-i, n) · n ] / [ C(m, n) · (m-i-n+1) ],

where C(m, n) denotes the binomial coefficient. Based on this equation, we plot P(Detection) for the scenario in which memory authentication is performed on m = 10 memory lines while the malicious code ratio and the detection ratio vary, as shown in Figure 11. If the detection ratio (DT) is 0.1 (corresponding to the regular-checking scheme), the probability of detecting the malicious code is directly proportional to the malicious code ratio. If DT is 1 (corresponding to the secured-checking scheme), the malicious code will always be detected, hence P(Detection) appears as a horizontal line. Finally, for the self-checking scheme, P(Detection) lies between those of the previous two schemes.

PT is designed in a fashion similar to the detector thread proposed by Shin et al. (2006). As shown in Figure 12, PT's initial program image is loaded via DMA into the PT RAM by the OS during boot. Once loaded, PT can start running from its own reset address, depending on the patrolling scheme we choose. Whenever an instruction cache miss occurs, the MMU sends in the instruction line together with its authentication tag. The instructions are then sent to the instruction cache as usual, while the corresponding security tag is sent to the PT RAM. The tag and the instructions share the same memory address, and the PT RAM and the instruction cache use the same cache indexing algorithm; this ensures that the tag and its corresponding instructions remain associated with each other. The PT-enabled fetch unit decides which thread to fetch from, the patrolling thread or the regular thread(s). Whenever the patrolling thread finishes the authentication process with a pass, nothing happens and the pipeline continues to flow. If, however, the authentication fails, an alert is raised and the whole pipeline is flushed; the program counter(s) are rolled back to the last known good position and execution restarts.
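Returning to the detection probability derived above, a quick numerical check can be done with the small C program below, which accumulates the same probability term by term using the product form of each term rather than binomial coefficients. The program is an illustration we add here, not part of the original study.

#include <stdio.h>

/* P(Detection): probability of catching at least one of n malicious
 * lines when k of the m incoming lines are authenticated (sampling
 * without replacement), accumulated term by term as in the text. */
double p_detection(int m, int n, int k)
{
    double miss_so_far = 1.0;   /* probability that the first i-1 checks hit only genuine lines */
    double p = 0.0;
    for (int i = 1; i <= k; i++) {
        p += miss_so_far * (double)n / (double)(m - (i - 1));
        miss_so_far *= (double)(m - n - (i - 1)) / (double)(m - (i - 1));
    }
    return p;
}

int main(void)
{
    /* The Figure 11 setting uses m = 10 incoming lines; k = 1 and
     * k = 10 correspond to detection ratios of 0.1 and 1.0. */
    for (int n = 1; n <= 5; n++)
        printf("n=%d  k=1: %.3f  k=10: %.3f\n",
               n, p_detection(10, n, 1), p_detection(10, n, 10));
    return 0;
}

As expected, the k = 10 column always evaluates to 1.0, matching the horizontal line of the secured-checking scheme.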
TRENDS AND CONCLUSIONS

With major general-purpose processor manufacturers transitioning from single-core to multi-core processor production, one basic obstacle lies before us: are we truly ready for the multi-core era? With most programs still developed under a sequential programming model, the extent of the Instruction-Level Parallelism we can exploit is very limited. Many on-chip resources simply remain idle, and we fall considerably short of fully utilizing the vast computing power of those chips. To solve this problem, we need to revolutionize the whole computing paradigm, from incorporating parallel programming into application development to hardware designs that better facilitate parallel execution. At the same time, the Simultaneous MultiThreading microarchitecture model has proven capable of maximizing the utilization of on-chip resources and hence achieving improved performance.
Here we see a perfect match: utilizing the multi-threaded microarchitecture to harness the computing power of multi-core chips, exploiting Thread-Level Parallelism (TLP) in addition to ILP. We expect near-future processors to be of the multi-threaded, multi-/many-core kind. SMT fits both the homogeneous and the heterogeneous multi-/many-core system case, with one or many cores running multiple threads. Because of limited ILP, the main thread normally cannot use all the system resources. On the other hand, there is demand from all those "house-keeping" functions, such as on-chip resource usage analysis, data synchronization, and routing decision making, that assist the execution of the main thread. With an SMT microarchitecture, we can achieve better system utilization when running those multiple threads together, and hence an overall performance improvement.

According to the "Refrigerator Theory" of Professor Yale Patt from the University of Texas at Austin, another trend for future heterogeneous multi-/many-core processor design is to include some application-specific cores in addition to the general-purpose processing cores. For example, AMD's vision of "fusion" (AMD, 2008) and the next-generation Intel® "Larrabee" processor (Seiler, 2008) both target a combined GPU/CPU design. These specific cores can be used as performance boosters for specific applications to achieve an overall performance improvement. In order to utilize these specific cores effectively, we need:
• An instruction set enhancement, to add dedicated instructions that best exploit these special cores.
• Compiler improvements, to extract more work that can be run on these special cores.
• Operating system assistance, to make the scheduler aware of these special cores for better job scheduling.
For some power-constrained applications, we may need to put those specific cores into sleep mode in order to reduce the power consumption during normal execution, and then power them back on when the need arises. If that is the case, however, the sleep-state enter/exit latency becomes a factor that should not be overlooked. Unless a core will be idle for a considerably extended period of time, the gain obtained from running the specific core(s) may not justify the latency of the core mode change (from active to sleep/deep sleep or vice versa). What is more, putting a core into deep sleep is not a trivial job in terms of hardware overhead. Due to these limiting factors, this approach needs to be considered cautiously by system architects. SMT technology has become one of the de facto features of modern microprocessors. In this chapter, we examined this important technology from its motivation to its design aspects and applications. We strongly believe that, if utilized effectively, SMT will continue to play a critical role in future multi-/many-core processor design.
REFERENCES

Agarwal, A., Bianchini, R., Chaiken, D., Johnson, K. L., Kranz, D., Kubiatowicz, J., et al. (1995). The MIT Alewife machine: architecture and performance. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA’95), S. Margherita Ligure, Italy, (pp. 2-13). New York: ACM Press.
Agarwal, A., Lim, B.-H., Kranz, D., & Kubiatowicz, J. (1990). April: a processor architecture for multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA’90), (pp. 104-114), Seattle, WA: ACM Press. Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., & Smith, B. (1990) The Tera computer system. In Proceedings of the 4th International Conference on Supercomputing (ICS’90), (pp. 1-6). Amsterdam: ACM Press. Burns, J., & Gaudiot, J.-L. (2002). SMT layout overhead and scalability. IEEE Transactions on Parallel and Distributed Systems, 13(2), 142–155. doi:10.1109/71.983942 Cazorla, F. J., Ramirez, A., Valero, M., & Fernandez, E. (2004). Dcache Warn: an I-fetch policy to increase SMT efficiency. In Proceedings of the 18th International Parallel & Distributed Processing Symposium (IPDPS’04), (pp. 74-83). Santa Fe, NM: IEEE Computer Society Press. Cazorla, F. J., Ramirez, A., Valero, M., & Fernandez, E. (2004). Dynamically controlled resource allocation in SMT processors. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’04), (pp. 171-182). Portland, OR: IEEE Computer Society Press. Choi, S., & Yeung, D. (2006). Learning-based SMT processor resource distribution via hill-climbing. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA’06), (pp. 239-251), Boston: IEEE Computer Society Press. Culler, D. E., Singh, J. P., & Gupta, A. (1998) Parallel computer architecture: a hardware/software approach, (1st edition). San Francisco: Morgan Kaufmann. Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., Stamm, R. L., & Tullsen, D. M. (1997). Simultaneous multithreading: a platform for next-generation processors. IEEE Micro, 17(5), 12–19. doi:10.1109/40.621209 El-Moursy, A., & Albonesi, D. H. (2003). Front-end policies for improved issue efficiency in SMT processors. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA’03), (pp. 31-40). Anaheim, CA: IEEE Computer Society Press. Gilmont, T., Legat, J.-D., & Quisquater, J.-J. (1999). Enhancing the security in the memory management unit. In Proceedings of the 25th EuroMicro Conference (EUROMICRO’99). 1, 449-456. Milan, Italy: IEEE Computer Society Press. Gontmakher, A., Mendelson, A., Schuster, A., & Shklover, G. (2006) Speculative synchronization and thread management for fine granularity threads. In Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA’06), (pp. 278-287). Austin, TX: IEEE Computer Society Press. Hammond, L., Hubbert, B. A., Siu, M., Prabhu, M. K., Chen, M., & Olukotun, K. (2000). The Stanford Hydra CMP. IEEE Micro, 20(2), 71–84. doi:10.1109/40.848474 Hammond, L., Nayfeh, B. A., & Olukotun, K. (1997). A single-chip multiprocessor. IEEE Computer, 30(9), 79–85.
Hennessy, J., & Patterson, D. (2006). Computer architecture: a quantitative approach (4th Ed.). San Francisco: Morgan Kaufmann. Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., & Roussel, P. (2001). The microarchitecture of the Pentium 4 processor. Intel® Technology Journal, 5(1), 1-13. Huang, A. (2003). Hacking the Xbox: an introduction to reverse engineering, (1st Ed.). San Francisco: No Starch Press. Huang, J., & Lilja, D. J. (1999). Exploiting basic block value locality with block reuse. Proceedings of 5th International Symposium on High-Performance Computer Architecture (HPCA’99), (pp. 106-114). Orlando, FL: IEEE Computer Society Press. Intel News Release. (2006). New dual-core Intel® Itanium® 2 processor doubles performance, reduces power consumption. Santa Clara, CA: Author. Iwata, T., & Kurosawa, K. (2003). OMAC: One-Key CBC MAC. In 10th International Workshop on Fast Software Encryption (FSE’03), (LNCS Vol. 2887/2003, pp. 129-153), Lund, Sweden. Berlin/Heidelberg: Springer. Kang, D.-S. (2004). Speculation-aware thread scheduling for simultaneous multithreading. Doctoral Dissertation, University of Southern California, Los Angeles, CA. Kang, D.-S., Liu, C., & Gaudiot, J.-L. (2008). The impact of speculative execution on SMT processors. International Journal of Parallel Programming, 36(4), 361–385. doi:10.1007/s10766-007-0052-3 Koufaty, D., & Marr, D. (2003). Hyperthreading technology in the Netburst microarchitecture. IEEE Micro, 23(2), 56–65. doi:10.1109/MM.2003.1196115 Lee, S.-W., & Gaudiot, J.-L. (2003). Clustered microarchitecture simultaneous multithreading. In 9th International Euro-Par Conference on Parallel Processing (Euro-Par’03), (LNCS Vol. 2790/2004, pp. 576-585), Klagenfurt, Austria. Berlin/Heidelberg: Springer.
Liu, S., & Gaudiot, J.-L. (2008). The potential of fine-grained value prediction in enhancing the performance of modern parallel machines. In Proceedings of the 13th IEEE Asia-Pacific Computer Systems Conference (ACSAC’08), (pp. 1-8). Hsinchu, Taiwan: IEEE Computer Society Press. Mahadevan, U., & Ramakrishnan, S. (1994) Instruction scheduling over regions: A framework for scheduling across basic blocks. In Proceedings of the 5th International Conference on Compiler Construction (CC’94), Edinburgh, (LNCS Vol. 786/1994, pp. 419-434). Berlin/Heidelberg: Springer. Marcuello, P., & Gonzalez, A. (1999) Exploiting speculative thread-level parallelism on a SMT processor. In Proceedings of the 7th International Conference on High-Performance Computing and Networking (HPCN Europe’99), Amsterdam, the Netherlands, (LNCS Vol. 1593/1999, pp. 754-763) Berlin/ Heidelberg: Springer. Marr, D.T., Binns, F., Hill, D.L., Hinton, G., Koufaty, D.A, Miller, J.A., & Upton, M. (2002). Hyperthreading technology architecture and microarchitecture. Intel® Technology Journal, 6(1), 4-15. Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics Magazine, 38(8). Nemirovsky, M. D., Brewer, F., & Wood, R. C. (1991). DISC: dynamic instruction stream computer. In Proceedings of the 24th Annual International Symposium on Microarchitecture (MICRO’91), Albuquerque, NM (pp. 163-171). New York: ACM Press. Preston, R. P., Badeau, R. W., Bailey, D. W., Bell, S. L., Biro, L. L., Bowhill, W. J., et al. (2002). Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading. In Digest of Technical Papers of the 2002 IEEE International Solid-State Circuits Conference (ISSCC’02), San Francisco, CA (Vol. 1, pp. 334-472). New York: IEEE Press. Raasch, S. E., & Reinhardt, S. K. (1999). Applications of thread prioritization in SMT processors. In Proceedings of the 3rd Workshop on Multithreaded Execution and Compilation (MTEAC’99), Orlando, FL. Raasch, S. E., & Reinhardt, S. K. (2003). The impact of resource partitioning on SMT processors. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques (PACT’03), (pp. 15–25). New Orleans, LA: IEEE Computer Society. Reinhardt, S., & Mukherjee, S. (2000). Transient fault detection via simultaneous multithreading. In ACM SIGARCH Computer Architecture News: Special Issue: Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA’00), (pp. 25-36). Vancouver,Canada: ACM Press Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., & Dubey, P. (2008). Larabee: a many-core x86 architecture for visual computing. [TOG]. ACM Transactions on Graphics, 27(3). doi:10.1145/1360612.1360617 Shi, W., Lee, H.-H., Ghosh, M., & Lu, C. (2004). Architectual support for high speed protection of memory integrity and confidentiality in multiprocessor systems. In Proceedings of the 13th International Conference on Parallel Architectures and Computation Techniques (PACT’04), Antibes Juan-les-Pins, France (pp.123-134). New York: IEEE Computer Society.
Shin, C.-H., & Gaudiot, J.-L. (2006). Adaptive dynamic thread scheduling for simultaneous multithreaded architectures with a detector thread. Journal of Parallel and Distributed Computing, 66(10), 1304–1321. doi:10.1016/j.jpdc.2006.06.003 Shin, C.-H., Lee, S.-W., & Gaudiot, J.-L. (2003). Dynamic scheduling issues in SMT architectures. In Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS’03), Nice, France, (p. 77b). New York: IEEE Computer Society. Sinharoy, B., Kalla, R. N., Tendler, J. M., Eickemeyer, R. J., & Joyner, J. B. (2005). Power5 system microarchitecture. IBM Journal of Research and Development, 49(4/5), 505–521. Smith, B. J. (1981). Architecture and applications of the HEP multiprocessor computer system. In SPIE Proceedings of Real Time Signal Processing IV, 298, 241-248. Tendler, J. M., Dodson, J. S. Jr, Fields, J. S., Le, H., & Sinharoy, B. (2002). Power4 system microarchitecture. IBM Journal of Research and Development, 46(1), 5–25. Thistle, M. R., & Smith, B. J. (1988). A processor architecture for Horizon. In Proceedings of the 1988 ACM/IEEE conference on Supercomputing (SC’88), Orlando, FL, (pp. 35-41). New York: IEEE Computer Society Press. Thornton, J. E. (1970). Design of a computer - the Control Data 6600. Upper Saddle River, NJ: Scott Foresman & Co. Tuck, N., & Tullsen, D. M. (2005). Multithreaded value prediction. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA’05), (pp. 5-15), San Francisco: IEEE Computer Society. Tullsen, D. M., & Brown, J. A. (2001). Handling long-latency loads in a simultaneous multithreading processor. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO’01), (pp. 318–327). Austin, TX: IEEE Computer Society. Tullsen, D. M., Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., & Stamm, R. L. (1996). Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture (ISCA’96), Philadelphia, (pp. 191–202). New York: ACM Press. Tullsen, D. M., Eggers, S. J., & Levy, H. M. (1995). Simultaneous multithreading: maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA’95), Santa Margherita Ligure, Italy (pp. 392-403). New York: ACM Press. Tullsen, D. M., Lo, J. L., Eggers, S. J., & Levy, H. M. (1999). Supporting fine-grained synchronization on a simultaneous multithreading processor. In Proceedings of the 5th International Symposium on High Performance Computer Architecture (HPCA’99), Orlando, FL (pp. 54-58). New York: IEEE Computer Society. Wall, D. W. (1991). Limits of instruction-level parallelism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA (ASPLOS-IV), (pp. 176-188). New York: ACM Press.
White Paper, A. M. D. (2008). The industry-changing impact of accelerated computing. Yamamoto, W., & Nemirovsky, M. (1995). Increasing superscalar performance through multistreaming. In Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques (PACT’95), (pp. 49-58). Limassol, Cyprus: IFIP Working Group on Algol. Yan, C., Rogers, B., Englender, D., Solihin, Y., & Prvulovic, M. (2006). Improving cost, performance, and security of memory encryption and authentication. In Proceedings of 33rd Annual International Symposium on Computer Architecture (ISCA’06), (pp. 179-190). Boston: IEEE Computer Society Press. Yeager, K. C. (1996). The MIPS R10000 superscalar microprocessor. IEEE Micro, 16(2), 28–40. doi:10.1109/40.491460
KEY TERMS AND DEFINITIONS

Cache Coherence: The integrity of the data stored in local caches of a shared resource.
Fault Tolerance: The property that enables a system (often computer-based) to continue operating properly in the event of the failure of (or one or more faults within) some of its components.
Fetch Policy: A mechanism which determines which thread(s) to fetch instructions from when executing multiple threads.
Instruction-Level Parallelism: A measure of how many of the operations in a computer program can be performed simultaneously.
Simultaneous Multithreading: A technique to improve overall efficiency by executing instructions from multiple threads simultaneously in order to better utilize the resources provided by modern processor architectures.
Microarchitecture: A description of the electrical circuits of a processor that is sufficient to completely describe the operation of the hardware.
Resource Sharing Control: A mechanism which allows the distribution of various resources in the pipeline among multiple threads.
Secure Communication: Means by which information is shared with varying degrees of certainty so that third parties cannot know what the content is.
Synchronization: Timekeeping which requires the coordination of events to operate a system in unison.
Thread-Level Parallelism: A measure of how many operations across multiple threads can be performed simultaneously.
ENDNOTES
1. Some literature refers to 18 months. However, the official Moore's law website of Intel®, and even an interview with Dr. Gordon Moore, confirms the two-year figure.
2. In Chinese, Fen means "Divide" or "Partition".
3. As long as we detect one line of malicious code, we will trigger an alert.
Chapter 25
Runtime Adaption Techniques for HPC Applications Edgar Gabriel University of Houston, USA
ABSTRACT

This chapter discusses runtime adaption techniques targeting high-performance computing applications. In order to exploit the capabilities of modern high-end computing systems, applications and system software have to be able to adapt their behavior to hardware and application characteristics. Using the Abstract Data and Communication Library (ADCL) as the driving example, the chapter shows the advantage of using adaptive techniques to exploit characteristics of the network and of the application. This makes it possible to reduce the execution time of applications significantly and avoids having to maintain different architecture-dependent versions of the source code.
INTRODUCTION

High Performance Computing (HPC) has reshaped science and industry in many areas. Recent groundbreaking achievements in biology, drug design and medical computing would not have been possible without the usage of massive computational resources. However, software development for HPC systems is currently facing significant challenges, since many of the software technologies applied in the last ten years have reached their limits. The number of applications capable of efficiently using several thousand processors or achieving a sustained performance of multiple teraflops is very limited, and such applications are usually the result of many person-years of optimization for a particular platform. These optimizations are, however, often not portable. As an example, an application optimized for a commodity PC cluster often performs poorly on an IBM Blue Gene or the NEC Earth Simulator. Among the problems application developers face are the wide variety of available hardware and software components, such as
• Processor type and frequency, number of processors per node and number of cores per processor,
• Size and performance of the main memory, cache hierarchy,
• Characteristics and performance of the network interconnect,
• Operating system, device drivers and communication libraries,
and the influence of each of these components on the performance of their application. Hence, an end-user faces a unique execution environment on each parallel machine he uses. Even experts struggle to fully understand the correlations between hardware and software parameters of the execution environment and their effect on the performance of a parallel application.
Motivating Example

In the following, we would like to clarify the dilemma of an application developer using a realistic and common example. Consider a regular 3-dimensional finite difference code using an iterative algorithm to solve the resulting system of linear equations. The parallel equation solver consists of three different operations requiring communication: scalar products, vector norms and matrix-vector products. Although the first two operations do have an impact on the scalability of the algorithm, the dominating operation from the communication perspective is the matrix-vector product. The communication pattern occurring in this operation is neighborhood communication, i.e. each process has to exchange data with its six neighboring processes multiple times per iteration of the solver. Depending on the execution environment and some parameters of the application (e.g. the problem size), different implementations of the very same communication pattern can lead to optimal performance.

We analyze the execution times for 200 iterations of the equation solver applied to a steady problem using 32 processes on the same number of processors of a state-of-the-art PC cluster, for two different problem sizes (32×32×32 and 64×32×32 mesh points per process) and two different network interconnects (4x InfiniBand and Gigabit Ethernet). The neighborhood communication has been implemented in four different ways, named here fcfs, fcfs-pack, ordered, and overlap. While the nodes/processors have been allocated exclusively for these measurements using a batch scheduler, the network interconnect was shared with other applications using the same PC cluster.

The results indicate that already for this simple test case on a single platform, three different implementations of the neighborhood communication lead to the best performance of this application. Although the differences between the implementations are not dramatic over the InfiniBand interconnect, fcfs shows the best performance for both problem sizes on this network. This implementation initiates all required communications simultaneously using asynchronous communication, followed by a Waitall operation on all pending messages. For the Gigabit Ethernet interconnect, however, the fcfs approach seems to congest the network. Instead, the implementation overlapping communication and computation (overlap) shows the best performance for the small problem size (6.2 seconds for overlap vs. 6.6 seconds for fcfs, 7.5 seconds for fcfs-pack, and 8.1 seconds for ordered), while the ordered algorithm, which limits the number of messages concurrently in flight, is the fastest implementation for the large problem size on this network interconnect (14.7 seconds for ordered vs. 26.9 seconds for fcfs, 19.9 seconds for fcfs-pack and 23.4 seconds for overlap). The implementation considered to be the fastest one over the InfiniBand network thus leads to a performance penalty of nearly
80% over Gigabit Ethernet. An application developer implementing the neighborhood communication using a particular, fixed algorithm will inevitably give up performance on certain platforms.
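As an illustration, the fcfs variant described above (post all asynchronous receives and sends at once, then wait on all of them) could be sketched roughly as follows using MPI. The function signature and the buffer layout are our own assumptions and are not code taken from the chapter.

#include <mpi.h>

/* One halo exchange with the six face neighbors: post all receives and
 * all sends immediately, then wait on everything (fcfs). */
void exchange_fcfs(double *sendbuf[6], double *recvbuf[6], int count[6],
                   int neighbor[6], MPI_Comm comm)
{
    MPI_Request reqs[12];
    int nreq = 0;

    for (int i = 0; i < 6; i++)
        if (neighbor[i] != MPI_PROC_NULL)
            MPI_Irecv(recvbuf[i], count[i], MPI_DOUBLE, neighbor[i],
                      0, comm, &reqs[nreq++]);

    for (int i = 0; i < 6; i++)
        if (neighbor[i] != MPI_PROC_NULL)
            MPI_Isend(sendbuf[i], count[i], MPI_DOUBLE, neighbor[i],
                      0, comm, &reqs[nreq++]);

    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
}

The ordered and overlap variants would instead limit the number of outstanding messages or interleave the message progress with the inner-domain computation, respectively.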
Problem Description

As demonstrated above, the wide variety in hardware and software leads to an inherent limitation: any code sequence or communication operation which contributes significantly to the overall execution time of an application will inevitably give up performance if the operation is hard-coded in the source code, i.e. if the code sequence or communication operation does not have the ability to adapt its behavior at runtime to changing conditions. Traditional tuning approaches have fundamental limitations and are not capable of solving the problem in a satisfactory manner.
Specific Goals

The goal of this chapter therefore is to present dynamic runtime optimization techniques applied in high performance computing. Runtime adaption in HPC serves two purposes: first, it allows tuning the performance of a code in order to exploit the capabilities of the hardware. At the same time, it simplifies software maintenance, since an application developer does not have to maintain multiple different versions of his code for different platforms. The chapter focuses on one specific project, the Abstract Data and Communication Library (ADCL). ADCL enables the creation of self-optimizing applications by allowing an application to register alternative versions of a particular function. Furthermore, ADCL offers several pre-defined operations allowing for seamless optimization of frequently occurring communication patterns in MPI-parallel applications. Although not fundamentally restricted to collective communication operations, most operations optimized through ADCL are collective in nature. In the following, we discuss the related work in this area, present the concept of ADCL, and give performance results obtained in three different scenarios using ADCL.
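Before turning to related work, the general idea of registering several implementations of one operation and letting a runtime library select the fastest can be sketched as follows. The interface shown is purely illustrative and is not ADCL's actual API; a real library would additionally need outlier filtering and a collective agreement step (e.g. an MPI_Allreduce) so that all processes pick the same winner.

#include <mpi.h>
#include <float.h>

typedef void (*exchange_fn)(void *data);    /* one candidate implementation */

typedef struct {
    exchange_fn fn[8];
    double      avg_time[8];
    int         nfn, trials_per_fn, calls, winner;
} fnset_t;

void fnset_init(fnset_t *s, int trials_per_fn)
{
    s->nfn = 0; s->trials_per_fn = trials_per_fn;
    s->calls = 0; s->winner = -1;
    for (int i = 0; i < 8; i++) s->avg_time[i] = 0.0;
}

void fnset_register(fnset_t *s, exchange_fn f) { s->fn[s->nfn++] = f; }

/* Each call either times one of the registered candidates or, once the
 * evaluation phase is over, simply invokes the selected winner. */
void fnset_call(fnset_t *s, void *data)
{
    if (s->winner >= 0) { s->fn[s->winner](data); return; }

    int idx = (s->calls / s->trials_per_fn) % s->nfn;
    double t0 = MPI_Wtime();
    s->fn[idx](data);
    s->avg_time[idx] += (MPI_Wtime() - t0) / s->trials_per_fn;

    if (++s->calls == s->nfn * s->trials_per_fn) {   /* evaluation done */
        double best = DBL_MAX;
        for (int i = 0; i < s->nfn; i++)
            if (s->avg_time[i] < best) { best = s->avg_time[i]; s->winner = i; }
    }
}

In contrast to this toy sketch, ADCL performs the selection inside the regular execution of the application and applies statistical filtering before committing to a winner, as discussed below.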
BACKGROUND

In the last couple of years, many projects have dealt with optimizing collective operations in High Performance Computing. For the subsequent discussion, projects are categorized as applying either static tuning, i.e. approaches which typically lead to software components that cannot alter their behavior during execution, or dynamic tuning, in which the software/application adapts its behavior at runtime as a reaction to varying conditions.
Static Tuning of Applications

Most projects applying static tuning follow one of two approaches to determine the best performing implementation for a particular operation: they either apply a pre-execution tuning step by testing the performance of different versions of the same operation for various message lengths and process counts, or they rely on performance prediction using sophisticated communication models to compare different algorithms. We discuss representatives, advantages and disadvantages of both approaches in the next paragraphs.
Among the best known projects representing the first approach are the Automatically Tuned Linear Algebra Software (ATLAS) (Whaley, 2005) and the Automatically Tuned Collective Communications (ATCC) (Pjesivac-Grbovic, 2007) framework. ATLAS is a library providing optimized implementations of the Basic Linear Algebra Software (BLAS) library routines. As one of the very first projects acknowledging the wide variety of hardware and software components, ATLAS uses an extensive configuration step to determine, for each operation, the best performing implementation from a given pool of available algorithms on a specific platform and with a given compiler. Furthermore, based on additional information such as cache sizes, ATLAS determines optimal internal parameters such as the blocking factor for blocked algorithms. As a result of the configuration step, the ATLAS library will only contain the routines known to deliver the best performance on that platform. Similarly to ATLAS, ATCC determines the optimal algorithms for MPI's collective operations on a given platform by using a parallel configuration step. During this configure step, several implementations of each collective operation are tested and the fastest algorithm for each message length is stored in a configuration file. The resulting set of algorithms and parameters for the platform is then used during the execution of the application. In order to minimize the size of the configuration file, ATCC uses quad-tree encoding to represent the overall decision tree; this encoding is also used to decide which algorithm to use for message sizes and process counts that have not been tested in the parallel configure step.

Projects such as ATLAS and ATCC face a number of fundamental drawbacks. First, the tuning procedure itself often takes more time than running an individual application. Thus, in case the system administrators of a cluster do not reserve the corresponding time slots to tune these libraries in advance (and typically they will only reserve limited time, not multiple days, to tune e.g. the MPI collective operations exhaustively on a multi-thousand node cluster), end-users themselves will very probably not use their valuable compute time to perform these time consuming operations. Additionally, several factors influencing the performance of the application can only be determined while executing the application. These factors include process placement by the batch scheduler in combination with non-uniform network behavior (Evans, 2003), resource utilization due to the fact that some resources such as the network switch or file systems are shared by multiple applications, operating system jitter leading to a slow-down of a subset of the processes utilized by a parallel job (Petrini, 2003), and application characteristics such as communication volumes and frequencies. Furthermore, some projects have also highlighted the influence of process arrival patterns on the performance of collective communication operations: depending on the work that each process has to perform, the order in which processes start to execute a collective operation varies strongly from application to application. Thus, the algorithm determined to lead to the best performance using a synthetic benchmark might in fact be suboptimal in a real application (Faraj, 2007).

The second common approach, used e.g. by the MagPIe project (Kielmann, 1999), compares the predicted execution time of various algorithms for a given operation using performance models.
Although some of the communication models used such as LogP (Culler, 1993) and LogGP (Alexandrov, 1995) are highly sophisticated, these projects ultimately suffer from three limitations. Firstly, it is often hard to determine some parameters of (sophisticated) communication models. As an example, no approach is published as of today which derives a reasonable estimate of the receive-overhead in the LogGP model (Hoefler, 2007). Second, while it is possible to develop a performance model for a simple MPI-level communication operation, more complex functions involving alternating and irregular sequences of computation and communication have hardly been modeled as of today. Lastly, all models have their fundamental limitations and break-down scenarios, since they represent simplifications of the real world behavior of the machines. Thus, while modeling collective communication operations can improve the
understanding of performance characteristics for various algorithms, tuning complex operations based on these models is fundamentally limited.
Dynamic Tuning of Applications

The dynamic optimization and tuning problem is related to multiple research areas in various domains. Starting from the lower level of the software hierarchy, most runtime optimization problems are represented as an empirical search procedure, with the boundary condition that any evaluation of the results has to be computationally inexpensive in order to minimize the overhead introduced by the runtime optimization itself. Depending on the type of the parameters tuned during the runtime optimization, various approaches from optimization theory (Gill, 1993) can be applied as well, e.g. the method of steepest descent in the case of a continuous, real-valued parameter. Vuduc (Vuduc, 2004) provides an excellent overview of various algorithms. On top of the search algorithms, statistical methods are often used to remove outliers and analyze the performance results of the various alternatives tested during the search. These algorithms vary in their complexity and range from simple inter-quartile range methods to sophisticated algorithms from cluster analysis and robust statistics. Benkert (Benkert, 2008) gives a good overview and a performance comparison of different approaches. Finally, since most approaches used in runtime optimization are separated into an evaluation phase, in which the runtime library uses certain events or instances in order to learn the best approach for those operations, and a phase applying the knowledge determined earlier, theories from machine learning (Witten, 2005) can often be applied as well. Once again, the main constraint for applying machine learning algorithms is that the overhead introduced by the learning algorithm itself has to be very low for a runtime library.

A vast body of research on code optimization is furthermore available in the field of compilers. As an example, ADAPT (Voss, 2000) introduces runtime adaption and optimization by providing different variants of a code sequence. During a sampling phase, the runtime environment explores the performance of the different versions and decides which one performs best. The runtime environment can furthermore invoke a separate dynamic code generator, which delivers new alternative code versions that can be loaded dynamically. Despite the significant progress in these areas, the number of projects applying automated (runtime) tuning techniques in HPC is still very limited. Among those projects are FFTW (Frigo, 2005), PhiPAC (Vuduc, 2004), STAR-MPI (Faraj, 2006), and SALSA (Dongarra, 2003). In the following, we detail three of these projects which utilize advanced adaptation techniques, and compare them to various aspects of ADCL (Gabriel, 2007).
FFTW

The FFTW (Fastest Fourier Transform in the West) library optimizes sequential and parallel Fast Fourier Transform (FFT) operations. To compute an FFT, the application first has to invoke a 'planner' step specifying the problem which has to be solved. Depending on an argument passed by the application to the planner routine, the library measures the actual runtime of many different implementations and selects the fastest one (FFTW_MEASURE). In case many transforms of the same size are executed in an application, this 'plan' delivers the optimal performance for all subsequent FFTs. Since creating a plan
can be time consuming, FFTW also provides a mode of operation in which the planner quickly comes up with a good estimate, which might however not necessarily be the optimal plan (FFTW_ESTIMATE). The decision procedure is initiated just once by the user; thus, FFTW performs the runtime optimization upfront in the planner step without performing any useful work. In contrast to the approach taken by FFTW, ADCL integrates the runtime selection logic into the regular execution of the application. The ADCL approach therefore enables the library to restart the runtime selection logic in case the observed performance deviates significantly from the performance measured during the tuning step, e.g. due to changing network conditions. FFTW also has a notion of historic learning, namely a feature called Wisdom. The user can export experience gathered in previous runs into a file and reload it in subsequent executions. However, the wisdom concept in FFTW lacks any notion of related problems, i.e. wisdom can only be reused for exactly the same problem size that was used to generate it. Furthermore, the wisdom functionality does not include any mechanism which helps to recognize outdated or invalid wisdom, e.g. if the platform used for collecting the wisdom is significantly different from the platform used when reloading the wisdom.
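The planner-based workflow described above looks roughly as follows in FFTW's C interface. The surrounding function and the loop are illustrative; note also that planning with FFTW_MEASURE may overwrite the contents of the input and output arrays, so data should be filled in only after the plan has been created.

#include <complex.h>
#include <fftw3.h>

/* Plan once with FFTW_MEASURE (the planner times candidate algorithms),
 * then reuse the plan for many transforms of the same size. */
void transform_many(int n, int repetitions)
{
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

    for (int r = 0; r < repetitions; r++) {
        /* ... fill 'in' with the next data set ... */
        fftw_execute(plan);
        /* ... consume 'out' ... */
    }

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
}

Replacing FFTW_MEASURE with FFTW_ESTIMATE selects the quick, heuristic planning mode mentioned above.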
STAR-MPI

STAR-MPI incorporates runtime optimization of collective communication operations, providing an API similar to the one defined in the MPI specification. Using an Automatic Empirical Optimization Software (AEOS) approach, the library performs dynamic tuning of each collective operation by determining the performance of all available algorithms in a repository. Once performance data for all available algorithms has been gathered, STAR-MPI determines the most efficient algorithm. STAR-MPI tunes different instances/call-sites of each operation separately. In order to achieve this goal, the prototypes of the STAR-MPI collective operations have been extended by an additional argument, namely an integer value uniquely identifying each call-site. This is however hidden from applications by using pre-processor directives to redirect the MPI calls to their STAR-MPI counterparts. Similarly to all projects focusing on runtime adaption techniques, the largest overhead in STAR-MPI comes from the initial evaluation of the underperforming algorithms and from the distributed decision logic, which is necessary to ensure that all processes agree on the final 'winner'. While STAR-MPI does a good job of minimizing the latter by introducing only a single collective global reduction, the first item, i.e. the testing of underperforming implementations, is highly evident in STAR-MPI due to the independent optimization of all operations per call-site. In contrast to that, ADCL allows both per-call-site optimization and concatenating performance data of multiple call-sites for the same operation and message length. One approach used in STAR-MPI to minimize the problem outlined in the previous paragraph is to introduce a grouping of algorithms. STAR-MPI initially compares a single algorithm from each of the available groups. After the winner group has been determined, the library fine-tunes the performance by evaluating all other available algorithms within the winner group. As described later, ADCL further extends the notion of grouping implementations using an attribute concept, which allows characterizing algorithms and alternative implementations without enforcing the participation of an algorithm in a single group.
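The call-site redirection described above can be pictured with the following sketch; the function name Tuned_Allreduce, the bookkeeping array and the use of __LINE__ as a call-site identifier are purely illustrative and do not reflect the actual STAR-MPI interface.

#include <mpi.h>

static double site_time[1024];          /* accumulated time per call site (illustrative) */

static int Tuned_Allreduce (void *sb, void *rb, int cnt, MPI_Datatype dt,
                            MPI_Op op, MPI_Comm comm, int site)
{
    double t  = MPI_Wtime ();
    /* a real tuning library would select among several algorithms here */
    int   ret = PMPI_Allreduce (sb, rb, cnt, dt, op, comm);
    site_time[site % 1024] += MPI_Wtime () - t;
    return ret;
}

/* Redirect the standard MPI call; __LINE__ serves as a (file-local) call-site id. */
#define MPI_Allreduce(sb, rb, cnt, dt, op, comm) \
        Tuned_Allreduce (sb, rb, cnt, dt, op, comm, __LINE__)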
SALSA

The Self-Adapting Large-scale Solver Architecture (SALSA) aims at providing the best suited linear or non-linear system solver to an application. Using the characteristics of the application matrix, the system contacts a knowledge database and provides an estimate of the best solver to use. Among the characteristics used for choosing the right solver are structural properties of the matrix (e.g. maximum and minimum number of non-zeros per row), matrix norms such as the 1-norm or the Frobenius norm, and spectral properties. Recently, the authors have applied algorithms from machine learning, such as boosting algorithms and alternating decision trees, to improve the prediction quality of the system (Bhowmick, in press). The decision algorithm has been trained using a large set of matrices from various application domains. Among the interesting features of this approach is that the algorithm is capable of handling missing features for the prediction, e.g. in case some norms are considered too expensive to be calculated at runtime. The main drawback of the approach within the context of this chapter lies in the fact that the training steps have to be executed before running the application, due to the computational complexity of the corresponding operations. The problem is however softened by the fact that the knowledge database is by design reusable across multiple runs/executions.
THE ABSTRACT DATA AND COMMUNICATION LIBRARY

The Abstract Data and Communication Library (ADCL) enables the creation of self-optimizing applications by either registering alternative versions of a particular function or by using predefined operations capable of self-optimization. ADCL uses the initial iterations of the application to determine the fastest available code version. Once performance data on a sufficient number of versions is available, the library makes a decision on which alternative to use throughout the rest of the execution. From the conceptual perspective, ADCL takes advantage of two characteristics of most scientific applications:

1. Iterative execution: most parallel, scientific applications are centered around a large loop, and therefore execute the same code sequence over and over again. Consider for example an application which solves a time-dependent partial differential equation (PDE). These problems are often solved by discretizing the PDE in space and time, and by solving the resulting system of linear equations for each time step. Depending on the application, iteration counts can reach six-digit numbers.
2. Collective execution: most large-scale parallel applications are based on data decomposition, i.e. all processes execute the same code sequence on different data items. Processes are typically also synchronized, i.e. all processes are in the same loop iteration. This synchronization is often required for numerical reasons and is enforced by communication operations.
Description of the ADCL API

The ADCL API offers high-level interfaces for application-level collective operations. These are required in order to be able to switch the implementation of the corresponding collective operation within the library without modifying the application itself. The main objects within the ADCL API are:
• ADCL_Topology: provides a description of the process topology and neighborhood relations within the application.
• ADCL_Vector: specifies the data structures to be used during the communication. The user can for example register a data structure such as a matrix with the ADCL library, detailing how many dimensions the object has, the extent of each dimension, the number of halo-cells, and the basic datatype of the object.
• ADCL_Function: each ADCL function corresponds to an actual implementation of a particular operation.
• ADCL_Fnctset: a collection of ADCL functions providing the same functionality. ADCL provides pre-defined function-sets, such as for neighborhood communication (ADCL_FNCTSET_NEIGHBORHOOD). The user can however also register their own functions in order to utilize the ADCL runtime selection logic.
• ADCL_Attribute: abstraction for a particular characteristic of a function/implementation. Each attribute is represented by the set of possible values for this characteristic.
• ADCL_Attrset: an ADCL attribute-set is a collection of ADCL attributes. An ADCL function-set can have an ADCL attribute-set attached to it, in which case all functions in the function-set have to provide valid values for each attribute in the attribute-set.
• ADCL_Request: combines a process topology, a function-set and a vector object. The application can initiate a communication by starting a particular ADCL request.
The following code sequence gives a simple example of ADCL code, using a 2-D neighborhood communication on a 2-D process topology. The application first generates a 2-D process topology using an MPI Cartesian communicator. When the application registers a multi-dimensional matrix with ADCL, the library generates a vector object. Combining the process topology, the vector object and the predefined function set ADCL_FNCTSET_NEIGHBORHOOD allows the library to determine automatically which portions of the vector have to be transferred to which process. Afterwards, each call to ADCL_Request_start initiates a neighborhood communication.
double vector[...][...];
ADCL_Vector   vec;
ADCL_Topology topo;
ADCL_Request  request;

/* Generate a 2-D process topology */
MPI_Cart_create (MPI_COMM_WORLD, 2, cart_dims, periods, 0, &cart_comm);
ADCL_Topology_create (cart_comm, &topo);

/* Register a 2D vector with ADCL */
ADCL_Vector_register (ndims, vec_dims, NUM_HALO_CELLS, MPI_DOUBLE, vector, &vec);

/* Combine description of data structure and process topology */
ADCL_Request_create (vec, topo, ADCL_FNCTSET_NEIGHBORHOOD, &request);

/* Main application loop */
for (i = 0; i < NUM_ITERATIONS; i++) {
    /* ... computation on the local portion of the vector ... */
    ADCL_Request_start (request);   /* initiates the neighborhood communication */
}
Technical Concept

Two key components of ADCL are the algorithm used to determine which versions of a particular operation shall be tested, and the logic used to decide efficiently across multiple processes on the best performing version. In the following, we give some details on both components.
Distributed Decision Logic

A fundamental assumption within ADCL is that the library has multiple alternative versions of a particular functionality available to choose from. These alternatives are stored as different functions in the same function-set. The number of alternatives can range from a few (e.g. the user providing three different versions of a parallel matrix-multiply operation) to many millions, in case the user is exploring different values for internal or external parameters, such as various buffer sizes, loop unroll depths etc. As of today, ADCL incorporates two different strategies for version selection at runtime. The first one incorporates a simple brute-force search, which evaluates all available alternatives. An alternative version selection algorithm is used if the user annotates the implementations with a set of attributes/attribute values. These attributes are used to reduce the time taken by the runtime selection procedure, by tuning each attribute separately. Independently of the version selection approach used by the library, the collective decision logic of ADCL has to compare performance data of multiple functions gathered on different processes. The challenge lies in the fact that, in the most general case, processes only have access to their own performance data, and performance data for the same code version might in fact differ significantly across processes. Distributing the performance data of all processes for all versions to all other processes is however not feasible, since the costs for communicating these large volumes of data would often offset the performance benefits achieved by runtime tuning. The approach taken by the library relies therefore on data reduction, i.e. each process provides only a single value for each alternative version of the code section being optimized. In order to detail the algorithm, let us assume that ADCL gathers $n$ measurements/data points for each version $i$ on each process $j$, and let us denote the execution time of the $k$-th measurement by $t(i,j,k)$. In an initial step, the library removes outliers, i.e. measurements not fulfilling the condition

C: \quad t(i,j,k) < b \cdot \min_k t(i,j,k)

with $b$ being a well-defined constant, from the data set. This leads to a filtered subset $M_f(i,j) = \{\, t(i,j,k) \mid t(i,j,k) \text{ fulfills } C \,\}$ of measurements with cardinality $n_f(i,j)$. Then, the performance measurements for each version are analyzed locally on each process and characterized by the local average execution time
m(i,j) = \frac{1}{n} \sum_{k} t(i,j,k)

and its filtered counterpart

m_f(i,j) = \frac{1}{n_f(i,j)} \sum_{k \in M_f(i,j)} t(i,j,k)

as estimates of the mean value. In a global reduction operation, the library determines for each version the maximum average execution time across all processes,

m(i) = \max_j m(i,j) \qquad m_f(i) = \max_j m_f(i,j)

considering all respectively only filtered data, and the maximum number of outliers $n_o(i)$ over all processes,

n_o(i) = \max_j n_o(i,j).

This reduction is motivated by a fundamental law in parallel computing, which states that the performance of a (synchronous) application is determined by the slowest process/processor. Finally, the library selects the maximum execution time including or excluding outliers by

r(i) = \begin{cases} m_f(i) & \text{if } n_o(i) \le n_{mo} \\ m(i) & \text{else} \end{cases}

depending on whether the maximum number of outliers allowed is exceeded or not. The algorithm $i'$ fulfilling $r(i') = \min_i r(i)$ is chosen as the best one. Assuming that the runtime environment produces reproducible performance data over the lifetime of an application, this algorithm is guaranteed to find the fastest of the available implementations for the current tuple of {problem size, runtime environment, versions tested}.
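The local filtering and the single reduction per implementation can be summarized in the following sketch; the function and variable names as well as the bound factor B are illustrative and not part of the actual ADCL implementation.

#include <mpi.h>
#include <float.h>

#define B 1.5    /* assumed value of the outlier bound factor b */

/* t[n] holds the n local measurements of one implementation i on process j. */
double reduce_version_time (const double *t, int n, MPI_Comm comm)
{
    double tmin = DBL_MAX, sum = 0.0, fsum = 0.0, m, mf, local[2], global[2];
    int k, nf = 0;

    for (k = 0; k < n; k++) if (t[k] < tmin) tmin = t[k];
    for (k = 0; k < n; k++) {
        sum += t[k];
        if (t[k] < B * tmin) { fsum += t[k]; nf++; }   /* condition C */
    }
    m  = sum / n;                       /* unfiltered local mean           */
    mf = (nf > 0) ? fsum / nf : m;      /* filtered local mean             */

    local[0] = m;  local[1] = mf;
    /* one reduction per implementation: the slowest process dominates     */
    MPI_Allreduce (local, global, 2, MPI_DOUBLE, MPI_MAX, comm);

    /* the caller would also track the outlier count (n - nf) and choose   */
    /* between global[0] and global[1] depending on the outlier threshold  */
    return global[1];
}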
Attribute Based Tuning

ADCL extends the algorithm grouping concept used in STAR-MPI by introducing the formal notion of attributes. The main idea behind this concept is that any implementation of a collective communication
operation has certain implicit requirements on the hardware and software environment in order to achieve the expected performance. As an example, ADCL uses as of today three attributes in order to characterize an implementation for the neighborhood communication function-set:

1. Number of simultaneous communication partners: this attribute characterizes how many communication operations are initiated at once. For the neighborhood communication, the values currently supported by ADCL are all (ADCL attribute value aao) and one (pair). This parameter is typically bound by the network/switch.
2. Handling of non-contiguous messages: supported values are MPI derived datatypes (ddt) and pack/unpack (pack). The optimal value for this parameter will depend on the MPI library and some hardware characteristics.
3. Data transfer primitive: a total of eight different data transfer primitives are available in ADCL as of today, which can be categorized as either blocking communication (e.g. MPI_Send, MPI_Recv), non-blocking/asynchronous communication (e.g. MPI_Isend, MPI_Irecv), or one-sided operations (e.g. MPI_Put, MPI_Get). Which data transfer primitive will deliver the best performance depends on the implementation of the corresponding function in the MPI library and potentially on hardware support (e.g. for one-sided communication).
Please note that not all combinations of attributes necessarily lead to feasible implementations. As an example, blocking data transfer primitives such as MPI_Send/Recv cannot be used for implementations having more than one simultaneous communication partner, since this would potentially result in a deadlock. Therefore, a total of 20 implementations are currently available within ADCL for the n-dimensional neighborhood communication. Further attributes, such as the capability of the library/environment to overlap communication and computation, will be added in the near future. In order to speed up the selection logic, an alternative runtime heuristic based on the attributes characterizing an implementation has been developed. The heuristic is based on the assumption that the fastest implementation for a given problem size on a given execution environment is also the implementation having 'optimal' values for the attributes in the given scenario. Therefore, the algorithm tries to determine the optimal value for each attribute used to characterize an implementation. Once the optimal value for an attribute has been found, the library removes all implementations not having the required value for the corresponding attribute and thus shrinks the list of available implementations. In order to explain the approach in more detail, let us assume that an implementation is characterized by N attributes. Each attribute i, i = 1,...,N, has nv(i) possible values v(i, j), j = 1,...,nv(i). The library assumes that the optimal value kopt(i) for an attribute i has been found if rc(i) measurements confirm this hypothesis. In order to be able to deduce the optimal value of a single attribute from a set of measurements, the library only compares the execution times of implementations whose attributes differ only in this particular attribute. To clarify this approach, assume that we have to deal with four different attributes (N = 4), and want to determine the best value for the second attribute. We assume that this attribute has three distinct values (nv(2) = 3), e.g. v(2, 1) = 1, v(2, 2) = 2, and v(2, 3) = 3. Since the values of all attributes except for the second one are kept constant, we assume that any performance differences between the three implementations can be attributed to the second attribute. The library determines collectively across all processes which of the three implementations has the lowest average execution time, using the same approach as outlined in the previous subsection. If we assume as an example that the implementation with the attribute values [v(1, j'), 3, v(3, j''), v(4, j''')] has the lowest average
execution time, the library would develop the hypothesis that 3 is the optimal value for the second attribute. At this point, only one set of measurements confirms the hypothesis that 3 is the optimal value for the second attribute. Thus, the confidence value in this hypothesis is set to 1. Typically, a hypothesis has to be confirmed by more than one set of measurements before ADCL considers the hypothesis to be probably correct. Thus, an additional set of measurements with differing (but constant) values for one of the other attributes has to be gathered, e.g. by using v(3, j''+1) as the value for the third attribute. If the new set of measurements confirms the result of the previous set, the confidence value for the hypothesis is increased. If another attribute value is determined to be the best one for this set of measurements, the confidence value for the original performance hypothesis is decreased by one. Once a hypothesis reaches the required number of confirmations, the library removes all implementations which do not have the optimal value for the corresponding attribute and shrinks the list of available implementations. Please note that if the measurements do not converge toward an optimal value for an attribute, no implementation will be removed based on this attribute.
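The bookkeeping behind this hypothesis-testing scheme can be sketched as follows; the data structure and function names are illustrative only and do not correspond to the actual ADCL internals.

struct attr {
    int nvalues;        /* number of possible values nv(i)                 */
    int best_value;     /* current hypothesis for the optimal value        */
    int confidence;     /* confirmations gathered so far                   */
    int required;       /* rc(i): confirmations needed to accept the value */
};

/* Called after one set of measurements in which only this attribute varied;
   winner_value is the attribute value of the fastest implementation found. */
void update_hypothesis (struct attr *a, int winner_value)
{
    if (a->confidence == 0 || winner_value == a->best_value) {
        a->best_value = winner_value;
        a->confidence++;
    } else {
        a->confidence--;                /* contradicting measurement        */
    }

    if (a->confidence >= a->required) {
        /* remove all implementations whose value for this attribute       */
        /* differs from a->best_value, shrinking the candidate list        */
    }
}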
Ongoing Research

In addition to the topics described so far, active research is currently being performed in multiple areas. In this subsection we would like to highlight some of these topics. The first research direction is related to the version management and selection within ADCL. As described in the previous paragraphs, ADCL utilizes as of today either a brute-force search strategy, which is applied in case a function-set is not characterized by any attributes, or the per-attribute search strategy. Both approaches can however be improved further. The brute-force search can be extended by algorithms containing early stopping criteria, as described in (Vuduc, 2004). This approach helps to reduce the number of alternative versions tested by randomly selecting versions to test and by providing a measure of when the worst implementations have been excluded with a certain probability. The main restriction of the per-attribute based search strategy as of today is that it assumes that attributes are fundamentally not correlated. Depending on the usage scenario and the attributes defined by the application, this assumption is not necessarily correct. However, there is a broad body of work in experimental design theory, namely the 2^k factorial design algorithms, which provides an excellent framework for the version management of correlated attributes. ADCL is currently being extended to include 2^k factorial design algorithms for correlated attributes. A second active research area is to introduce the notion of historic learning in ADCL, i.e. to develop mechanisms to propagate the results of various optimizations from one run to another. Historic learning in ADCL extends the wisdom concept of FFTW in multiple areas: first, the historic data in ADCL is always accompanied by a high-level description of the architecture used to determine the results of the optimization. This is necessary in order to develop mechanisms which can automatically discard data in the historic database, e.g. in case an application is run on a different network than the one used when a particular optimization was performed. Discarding historic data is furthermore supported by introducing the notion of expected performance. In case the historic data suggests that a particular version of a function-set leads to the optimal performance, ADCL can also generate an estimate of the execution time for that operation. If the measured execution time for the operation deviates from the predicted execution time by more than a given threshold, the library automatically discards the results and starts a new optimization, assuming that the runtime conditions have changed compared to the original assumptions.
Lastly, ADCL introduces the notion of related problems, in order to have the ability to deduce results from similar problems solved on similar machines. Related problems in this context are defined by introducing a function-set specific distance measure. As an example, for the neighborhood communication the library uses a Euclidean distance between two problems, using the vector of the data sizes transmitted to each process as the base measure.
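One plausible form of such a distance measure, assuming $s_j^{(1)}$ and $s_j^{(2)}$ denote the amounts of data the two problems transmit to neighbor process $j$, is the Euclidean norm of their difference (the exact definition used by ADCL may differ):

d(P_1, P_2) = \sqrt{ \sum_{j} \left( s_j^{(1)} - s_j^{(2)} \right)^2 }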
PERFORMANCE EVALUATION

In the following, we present performance results for three different scenarios. The first one discusses the optimization of the three-dimensional neighborhood communication as it often occurs in scientific applications. The second scenario describes the results achieved for tuning a parallel matrix-matrix multiply kernel. Finally, the third scenario describes the usage of ADCL within the context of a tool optimizing the system parameters of the Open MPI library.
Optimizing 3-D Nearest Neighbor Communication

In the following, we analyze the effect of using different implementations of the neighborhood communication on the performance of a parallel, iterative solver as often applied in scientific applications. The software used in this section solves a set of linear equations that stem from the discretization of a partial differential equation (PDE) using central differences. The parallel implementation subdivides the computational domain into subdomains of equal size. The processes are mapped onto a regular three-dimensional Cartesian topology. Due to the discretization scheme, a processor has to communicate with at most six processors to perform a matrix-vector product. For the subsequent analysis, the code has been modified such that it makes use of the ADCL library, i.e. the sections of the source code which established the 3-D process topology and the neighborhood communication routines have been replaced by their ADCL counterparts. In order to evaluate the correctness of the decision made by the runtime selection logic, we additionally executed the same application using a single implementation at a time, circumventing the runtime selection logic of ADCL. In order to make the conditions as comparable as possible, the reference data was produced within the same batch scheduler allocation and thus had the same node assignments. We will refer to these measurements as verification runs throughout the rest of this section. During each verification run, the execution times of 700 iterations per implementation were stored and subsequently averaged over all three runs. Depending on the machine, we have executed three different problem sizes, namely a small test case with (32 x 32 x 32) mesh points per process, a medium test case with (64 x 32 x 32) mesh points per process, and a large test case with (64 x 64 x 32) mesh points per process. Since most MPI libraries do not show performance advantages for MPI put/get operations compared to two-sided communication on a typical PC cluster, and in order to simplify our analysis, we have configured ADCL for the following tests without the one-sided data transfer primitives. This leaves twelve implementations of the 3-D neighborhood communication for the runtime selection logic to choose from. The number of tests required to evaluate an implementation has been set to 30. Tests have been executed on six platforms in total:
Table 1. Best performing implementation on different architectures and for different problem sizes

Architecture            # of proc   Pb Size/Proc   Best implementation
DataStar p655 (DS)      64          64x32x32       SendIrecv_aao
                        64          64x64x32       SendIrecv_aao
                        128         64x32x32       SendIrecv_pair_pack
                        128         64x64x32       IsendIrecv_aao
                        256         64x32x32       IsendIrecv_aao
                        256         64x64x32       IsendIrecv_aao
                        512         64x32x32       SendIrecv_pair_pack
                        512         64x64x32       IsendIrecv_aao
IBM Blue Gene/L (BG)    128         64x32x32       SendIrecv_pair_pack
                        128         64x64x32       IsendIrecv_aao_pack
                        256         64x32x32       SendIrecv_pair_pack
                        256         64x64x32       IsendIrecv_aao_pack
                        512         64x32x32       SendIrecv_pair_pack
                        512         64x64x32       IsendIrecv_aao_pack
NEC SX8 (SX)            16          64x32x32       Sendrecv_pair
                        16          64x64x32       Sendrecv_pair
                        32          64x32x32       SendIrecv_pair_pack
                        32          64x64x32       SendIrecv_pair_pack
                        64          64x32x32       Sendrecv_pair_pack
                        64          64x64x32       SendIrecv_pair_pack
SharkIB (ShIB)          32          32x32x32       IsendIrecv_aao
                        32          64x64x32       SendIrecv_pair_pack
                        48          32x32x32       SendIrecv_aao
                        48          64x64x32       IsendIrecv_aao
SharkGE (ShGE)          32          32x32x32       IsendIrecv_aao
                        32          64x64x32       IsendIrecv_aao_pack
                        48          32x32x32       IsendIrecv_aao
                        48          64x64x32       Sendrecv_pair
CacauGE (CGE)           64          64x32x32       SendRecv_pair_pack
1. IBM Blue Gene/L: The Blue Gene system at the San Diego Supercomputing Center consists of 3,072 compute nodes with 6,144 processors. Each node consists of two PowerPC processors that run at 700 MHz and share 512 MB of memory. All compute nodes are connected by two high-speed networks: a 3-D torus for point-to-point message passing and a global tree for collective message passing. We ran tests using 128, 256 and 512 processes.
2. NEC SX8: The installation at the High Performance Computing Center in Stuttgart, Germany (HLRS) consists of 72 nodes, each node having 8 vector processors of 16 GOPS peak (2 GHz) and 128 GB of main memory. The nodes are interconnected by an IXS switch. Each node can send and receive data with 16 GB/s in each direction. We executed tests with 16, 32 and 64 processes.
Figure 1. Performance overhead of the slowest vs. fastest implementation on each platform and for each problem size
3. DataStar p655: The p655 partition of the DataStar cluster at the San Diego Supercomputing Center has 272 8-way compute nodes: 176 nodes with 1.5 GHz Power4+ CPUs and 16 GB of memory, and 96 nodes with 1.7 GHz Power4+ CPUs and 32 GB of memory. The nodes are connected by an IBM high-speed Federation switch. The tests executed and presented in this subsection include runs with 64, 128, 256 and 512 processes.
4. SharkIB: This cluster consists of 24 nodes, each node having a dual-core AMD Opteron processor and 2 GB of main memory. The nodes are interconnected by a 4x InfiniBand network. We present results for 32 and 48 processes on 16 and 24 nodes, respectively.
5. SharkGE: This is the same cluster as described in the previous item, using however the Gigabit Ethernet network interconnect.
6. CacauGE: Cacau is a 200-node, dual-processor Intel EM64T cluster at the HLRS. Although the main network of the cluster is a 4x InfiniBand interconnect, we used for the subsequent analysis the secondary network of the cluster, namely a hierarchical Gigabit Ethernet network. This network consists of six 48-port switches; each 48-port switch has four links to the upper-level 24-port Gigabit Ethernet switch. Thus, this network has a 12:1 blocking factor. We have executed tests using 64 processors on 64 nodes in order to ensure that communication between the processes has to use two or more of the 48-port switches.
Figure 2. Comparison of the largest problems run on each platform (left) and of various platforms and problem sizes (right)
Table 1 summarizes for each platform and problem size the implementation of the neighborhood communication which leads to the overall best performance. In the 29 different test cases presented here, seven out of the twelve implementations available in ADCL for this communication pattern turn out to lead to the minimal execution time of the application. Most notably, on basically all platforms tested, the best performing implementation changes depending on both the number of processors and the problem size per process. In the following, we show that (1) hard-coding a particular code sequence can lead to a significant performance penalty on any platform; (2) pre-tuning the code on one platform does not lead to portable performance on other platforms; and (3) ADCL is capable of generating portable code with minimal overhead compared to manually tuned versions. Since most applications hard-code the neighborhood communication using a sequence of Send/Receive operations, we show in the following diagram the performance implication of potentially using a suboptimal code version for the neighborhood communication on the overall performance of the code. For this, we show in Figure 1 the maximum performance penalty that an application code could face in that scenario by comparing the performance of the best vs. the worst performing implementation. The penalty an application faces in this scenario depends on the platform used. While most of the platforms analyzed show a performance penalty in the range of 5-20% in this test, some platforms show a more dramatic sensitivity to the implementation, such as both platforms using Gigabit Ethernet networks, for which the execution time nearly doubles in the worst case. The NEC SX8 also shows a significant sensitivity to the implementation, with an additional overhead of more than 60% depending on the number of processes and the problem size. Next, we would like to quantify the penalty an application would pay when using on one platform a code version which has been tuned on another platform. We detail two scenarios. First, for each platform we choose the largest problem on the largest number of processors that we ran, and evaluate what the performance penalty would be to use an implementation which has been determined to be the winner on any of the other platforms. In Figure 2, each entry represents the performance penalty of the application running on the platform shown on the x-axis when using the implementation determined to be the winner on the platform shown on the corresponding y-axis. As an example, the bar in the first row, third column
Figure 3. Performance difference between the manually tuned and an automatically tuned code version using ADCL
shows the performance penalty for the application running the large problem size on the DataStar cluster using the 'winner' function determined on the SX8 using 64 processes. The most remarkable result of Figure 2 (left) is that the winner functions of SharkGE and CacauGE cause significant performance penalties on the high-performance interconnects used on DataStar, IBM Blue Gene and NEC SX8. The performance penalty ranges from 1.74% up to 58.99%. Vice versa, the implementations chosen by these machines lead to significant performance penalties on SharkGE and CacauGE, increasing the execution time by up to 90% compared to their fastest implementation. This result is especially relevant, since many large-scale scientific applications are originally developed and tuned on a smaller cluster within the institute where the authors reside. Typically, these smaller clusters utilize Gigabit Ethernet network interconnects. Our results indicate that, when moving to a large-scale system at a remote site, the code tuned for the Gigabit Ethernet network might in fact pay a significant performance penalty when run without modifications. In the second scenario, we focus on only three architectures, namely DataStar, IBM Blue Gene and the NEC SX8. We analyze the execution times for the medium problem size for all available numbers of processes. The results are presented using the same format as previously. Although the performance penalty for many scenarios shown in the right part of Figure 2 is negligible, there are notable exceptions. As an example, applying the winner function of the 64 processes case on DataStar onto the 128 processes
case on Blue Gene/L leads to a 5% increase in the execution time of that simulation. Similarly, the best performing implementation in the 256 processes test case on the Blue Gene would lead to an 11.49% increase in the execution time for the 256 processes test case on the DataStar architecture. Last but not least, the implementation leading to the best performance in the 64 processes test case on the SX8 would lead to a performance penalty of more than 15% on the same architecture for the same problem size per process but for the 32 processes test case.
ADCL Performance Results

So far, we have documented the fact that the performance of an application depends on the implementation of the neighborhood communication, the hardware architecture, the application problem size, and the number of processes. In the following, we would like to show that using the ADCL runtime selection logic leads in most cases to close-to-optimal performance. Figure 3 documents the average overhead of the application when using ADCL compared to the performance of the application using the fastest implementation for that particular scenario. Figure 3 distinguishes between the two runtime selection logics of ADCL, namely the brute-force search strategy and the attribute-based search strategy. The main result of Figure 3 is that the execution time of the application when using ADCL with its runtime adaption features is in fact very close to the optimal performance determined in the verification runs. The overhead introduced by ADCL is in the vast majority of the test cases below 1%. This (minimal) overhead stems from two facts: first, during the initial iterations of the application, ADCL evaluates some of the implementations which show a suboptimal performance on that platform for that particular problem size. Second, ADCL incorporates a distributed decision algorithm in order for all processes to agree on the same implementation as the 'winner'. This distributed decision algorithm requires one allreduce operation per implementation. Furthermore, the attribute-based search strategy shows in virtually all test scenarios a lower overhead due to the reduced number of implementations being tested. There are two notable exceptions to the results above: firstly, the hierarchical Gigabit Ethernet network used on Cacau and, secondly, the 48 processes test case on SharkGE when using the attribute-based search strategy and the large problem size. In the first scenario, despite the fact that the ADCL runtime selection logic did determine the correct implementation as the winner in all three runs, the performance penalty for using a suboptimal implementation during the learning phase turned out to be tremendous: the overall execution time of the application increased by 72% compared to using the optimal implementation from the very beginning. This result highlights the necessity for additional, improved runtime selection algorithms which can further reduce the time required to determine the fastest algorithm. For the second scenario, a more detailed analysis shows that in two out of the three runs which have been used to calculate the average overhead in Figure 3, the attribute-based search strategy did reveal a very good performance, showing only a minor overhead compared to the optimal execution time. However, in the third run, the system seemed to face some perturbations which led to a wrong decision by the ADCL runtime selection logic. Using a suboptimal implementation for that test case resulted in a significant overhead. ADCL as of today relies on the fact that the data gathered during the training phase is representative for the overall execution. In case this assumption turns out to be wrong, as happened in the third run, the runtime selection logic will make a suboptimal decision. In order to handle this scenario, ADCL will be extended by a monitoring subsystem in the near future. In case the performance data of the 'winner' implementation deviates significantly from the performance data
Figure 4. Performance of various algorithms for the parallel matrix-matrix multiply operation and the corresponding ADCL results
gathered during the learning phase, ADCL will be able to re-start the runtime selection logic and thus correct an erroneous decision. However, this component is not yet available as of today.
Tuning a Parallel Matrix-Matrix Multiply Kernel

Matrix-matrix multiplication is a common operation in many applications from graph theory, numerical algorithms, digital control, and signal processing. Within the framework of a master's thesis (Huang, 2007), three different kernels for a parallel matrix-matrix multiply operation have been implemented. The code used as the basis for this analysis assumes that the matrices are decomposed among the processes using a 1-D decomposition, i.e. each process holds a certain number of columns of the overall matrix. During execution, a process calculates a partial result using the sub-matrices it currently has access to. Using some form of communication, the processes then exchange their sub-matrices and successively perform the same calculation on new sub-matrices, adding the new results to the previous ones. From the computational perspective, Cannon's algorithm is used to implement the parallel matrix-matrix multiplication. The main difference between the versions is the approach taken for the communication between the processes. The three communication patterns explored within the thesis can be described as follows:

• Synchronous: In this version, the algorithm performs the computation on the local part of the matrices, followed by a circular shift of the sub-matrices of B (a minimal sketch of this pattern is given after this list). Thus, after the first communication step, process 1 holds the sub-matrices of B which had originally been assigned to process 0, process 2 holds the sub-matrices of B originally assigned to process 1, etc. This sequence of computation and communication is repeated p times in a p-process scenario. After the final computation, an additional shift operation is required in order to have the original assignment of the sub-matrices of B in place. This implementation is called synchronous, since there is no overlap between the communication and the computation.
• Overlapping: The main difference of this version from the previous one is that the code tries to overlap the communication occurring in the shift operations and the computation. For this, an additional temporary buffer is required to hold a sub-matrix. Using a double-buffering concept, the buffer given by the sub-matrix B and the temporary buffer are used in an alternating fashion for communication and computation.
• Broadcast: This implementation avoids the circular shift operations for transferring sub-matrices of B. Instead, process i broadcasts its sub-matrix in iteration i to all other processes, and each process performs the corresponding part of the computation.
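The following sketch illustrates the synchronous variant referenced in the first bullet; the routine local_partial_product stands for a user-supplied computation on the currently held blocks, and all names are illustrative rather than taken from the thesis code.

#include <mpi.h>

/* user-supplied: accumulates into C the partial product of the local block
   of A with the block of B currently held, taking 'step' into account */
void local_partial_product (const double *A, const double *B, double *C,
                            int block, int rank, int step);

void mm_synchronous (double *local_A, double *local_B, double *local_C,
                     int block, MPI_Comm comm)
{
    int rank, size, step, right, left;

    MPI_Comm_rank (comm, &rank);
    MPI_Comm_size (comm, &size);
    right = (rank + 1) % size;
    left  = (rank - 1 + size) % size;

    for (step = 0; step < size; step++) {
        /* compute with the sub-matrix of B currently held locally */
        local_partial_product (local_A, local_B, local_C, block, rank, step);

        /* circular shift of B; no overlap of communication and computation */
        MPI_Sendrecv_replace (local_B, block * block, MPI_DOUBLE,
                              right, 0, left, 0, comm, MPI_STATUS_IGNORE);
    }
    /* after 'size' shifts the blocks of B are back with their original owners */
}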
The three versions of the matrix-matrix multiplication described in this section have been integrated with ADCL in order to let the ADCL runtime selection logic decide dynamically which algorithm performs best for a given application and hardware configuration. Note that ADCL deals in this subsection with a user-defined function-set, in contrast to the pre-defined function-set used in the previous subsection. For our tests, we used the SharkIB and SharkGE clusters described in the previous subsection. We show here performance results for matrix sizes of 1600x1600 and 3200x3200, using 8, 16 and 32 processes (Figure 4). The results obtained indicate that the algorithm using the broadcast operation for disseminating the sub-matrices performs significantly slower than the implementations labeled synchronous and overlap. Over InfiniBand, synchronous typically achieves a slightly better performance than overlap, while over Gigabit Ethernet overlap is somewhat faster. ADCL chose in all instances the right implementation. The execution time of the operations when using ADCL is however somewhat higher than that of the corresponding best implementation, due to the fact that the library also has to test the under-performing broadcast version. This is especially evident for the 32 process test case over Gigabit Ethernet, where the penalty for using the broadcast version is tremendous.
Using ADCL to Tune System Parameters of Open MPI

The last usage scenario of ADCL described in this chapter deals with tuning runtime parameters of a communication library such as Open MPI (Gabriel, 2004). Open MPI supports a large number of runtime parameters, which allow an end-user or system administrator to tune specifics of the library, such as network parameters, settings for collective operations or processor affinity options, without having to recompile the library. The Open Tool for Parameter Optimization (OTPO) (Chaarawi, 2008), the result of a joint project between Cisco Systems and the University of Houston, allows the user to specify a certain number of Open MPI parameters, desired values or ranges of values to be explored, and the benchmark to be executed with each run. The result of an OTPO run is a collection of Open MPI parameters and the corresponding optimal values, which lead to the minimal execution time of the benchmark previously specified. Internally, OTPO maps Open MPI parameters to ADCL attributes and creates an ADCL function-set. Each function in the function-set executes the same sequence, namely spawning a new process which starts an MPI job with the corresponding Open MPI parameters. This allows OTPO to take advantage of the regular ADCL mechanisms to determine the best performing function in the function-set. In (Chaarawi, 2008) we demonstrated how this tool can be used to tune the InfiniBand parameters of Open MPI. Two separate sets of tests have been executed, one exploiting the shared-receive-queue feature of InfiniBand, the second using a separate receive queue per process. For the sake of clarity, we focused on tuning only four parameters, which leads to 825 possible combinations for the first scenario and 275 possible combinations for the second scenario. The test code executed for both scenarios consisted of the NetPipe (Turner, 2002) benchmark.
The results reveal a small number of parameter sets that resulted in the lowest latency (3.77µs and 3.78µs), namely four parameter sets for shared receive queues and six parameter sets for per-peer receive queues. However, there was a significant number of parameter combinations leading to results within 0.05µs of the best latency. These results highlight that, typically, the optimization process using OTPO will not deliver a single set of parameters leading to the best performance, but will result in groups of parameter sets leading to similar performance.
CONCLUSION

This chapter presented a library capable of adapting the behavior of an application at runtime, which allows for tuning the performance of a particular code section by switching between different implementations. The library has been used in various scenarios, such as tuning the neighborhood communication of scientific applications, tuning parallel matrix-matrix multiply operations, or adjusting the InfiniBand parameters of the Open MPI library. ADCL does not only allow for seamless tuning of an application at runtime, but also helps from the software engineering perspective, since it avoids having to maintain different code versions for different platforms. These two features combined make us believe that runtime adaptation techniques such as those used in ADCL are among the most promising approaches for successfully using and exploiting Petascale architectures.
REFERENCES

Alexandrov, A., Ionescu, M. F., Schauser, K. E., & Scheiman, C. (1995). LogGP: Incorporating long messages into the LogP model. In Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures (pp. 95–105). New York: ACM Press.
Benkert, K., Gabriel, E., & Resch, M. M. (2008). Outlier Detection in Performance Data of Parallel Applications. In the 9th IEEE International Workshop on Parallel Distributed Scientific and Engineering Computing (PDESC), Miami, Florida, USA.
Bhowmick, S., Eijkhout, V., Freund, Y., Fuentes, E., & Keyes, D. (in press). Application of Machine Learning in Selecting Sparse Linear Solver. Submitted for publication to the International Journal on High Performance Computing Applications.
Chaarawi, M., Squyres, J., Gabriel, E., & Feki, S. (2008). A Tool for Optimizing Runtime Parameters of Open MPI. Accepted for publication in EuroPVM/MPI, September 7-10, Dublin, Ireland.
Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K. E., Santos, E., Subramonian, R., & von Eicken, T. (1993). LogP: Towards a realistic model of parallel computation. In Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming (pp. 1–12). New York: ACM Press.
Dongarra, J. J., & Eijkhout, V. (2003). Self-Adapting Numerical Software for Next-Generation Applications. International Journal of High Performance Computing Applications, 17(2), 125–131. doi:10.1177/1094342003017002002
Evans, J. J., Hood, C. S., & Gropp, W. D. (2003). Exploring the Relationship Between Parallel Application Run-Time Variability and Network Performance. In Proceedings of the Workshop on High-Speed Local Networks (HSLN), IEEE Conference on Local Computer Networks (LCN) (pp. 538-547).
Faraj, A., Yuan, X., & Lowenthal, D. (2006). STAR-MPI: Self tuned adaptive routines for MPI collective operations. In ICS '06: Proceedings of the 20th Annual International Conference on Supercomputing (pp. 199-208). New York: ACM Press.
Faraj, A., Patarasuk, P., & Yuan, X. (2007). A Study of Process Arrival Patterns for MPI Collective Operations. International Conference on Supercomputing (pp. 168-179).
Frigo, M., & Johnson, S. (2005). The Design and Implementation of FFTW3. Proceedings of the IEEE, 93(2), 216–231. doi:10.1109/JPROC.2004.840301
Gabriel, E., Fagg, G., Bosilca, G., Angskun, T., Dongarra, J. J., Squyres, J. M., et al. (2004). Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In D. Kranzlmueller, P. Kacsuk, & J. J. Dongarra (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface (LNCS, Vol. 3241, pp. 97-104). Berlin: Springer.
Gabriel, E., & Huang, S. (2007). Runtime optimization of application level communication patterns. In Proceedings of the 2007 International Parallel and Distributed Processing Symposium, 12th International Workshop on High-Level Parallel Programming Models and Supportive Environments (p. 185).
Gill, P. E., Murray, W., & Wright, M. H. (1993). Practical Optimization. London: Academic Press Ltd.
Hoefler, T., Lichei, A., & Rehm, W. (2007). Low-Overhead LogGP Parameter Assessment for Modern Interconnect Networks. Proceedings of the IPDPS, Long Beach, CA, March 26-30. New York: IEEE.
Huang, S. (2007). Applying Adaptive Software Technologies for Scientific Applications. Master's thesis, Department of Computer Science, University of Houston, Houston, TX.
Jain, R. K. (1991). The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. New York: Wiley.
Kielmann, T., Hofman, R. F. H., Bal, H. E., Plaat, A., & Bhoedjang, R. A. F. (1999). MagPIe: MPI's collective communication operations for clustered wide area systems. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'99), 34(8), 131-140.
Petrini, F., Kerbyson, D. J., & Pakin, S. (2003). The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q. Proceedings of the 2003 ACM/IEEE Conference on Supercomputing.
Pjesivac-Grbovic, J., Bosilca, G., Fagg, G. E., Angskun, T., & Dongarra, J. J. (2007). MPI Collective Algorithm Selection and Quadtree Encoding. Parallel Computing, 33(9), 613–623. doi:10.1016/j.parco.2007.06.005
Turner, D., & Chen, X. (2002). Protocol-dependent message-passing performance on Linux clusters. Proceedings of the 2002 IEEE International Conference on Linux Clusters (pp. 187-194). New York: IEEE Computer Society.
Voss, M. J., & Eigenmann, R. (2000). ADAPT: Automated De-coupled Adaptive Program Transformation. International Conference on Parallel Processing, Toronto, Canada (p. 163).
Vuduc, R., Demmel, J., & Bilmes, J. A. (2004). Statistical Models for Empirical Search-Based Performance Tuning. International Journal of High Performance Computing Applications, 18(1), 65–94. doi:10.1177/1094342004041293
Whaley, R. C., & Petitet, A. (2005). Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software, Practice & Experience, 35(2), 101–121. doi:10.1002/spe.626
Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd Ed.). San Francisco: Morgan Kaufmann.
KEY TERMS AND DEFINITIONS

Adaptive Applications: Applications capable of changing their behavior, switching to alternate code sections or changing to different values for certain parameters at runtime, as a response to different input data or changing conditions.
Decision Algorithms: Algorithms used to compare different versions of the same functionality, while executing the application, with respect to a particular metric such as execution time.
Dynamic Tuning: Tuning of a code sequence or function during the execution of the real application.
Static Tuning: Tuning of a code sequence or function before executing the real application.
Chapter 26
A Scalable Approach to Real-Time System Timing Analysis
Alan Grigg
Loughborough University, UK
Lin Guan
Loughborough University, UK
ABSTRACT

This chapter describes a real-time system performance analysis approach known as reservation-based analysis (RBA). The scalability of RBA is derived from an abstract (target-independent) representation of system software components, their timing and resource requirements and run-time scheduling policies. The RBA timing analysis framework provides an evolvable modeling solution that can be instigated in early stages of system design, long before the software and hardware components have been developed, and continually refined through successive stages of detailed design, implementation and testing. At each stage of refinement, the abstract model provides a set of best-case and worst-case timing 'guarantees' that will be delivered subject to a set of scheduling 'obligations' being met by the target system implementation. An abstract scheduling model, known as the rate-based execution model, then provides an implementation reference model; compliance with this model will ensure that the imposed set of timing obligations is met by the target system.
INTRODUCTION

A key requirement in the development of a real-time system is the ability to demonstrate that the final target system has met its specified timing requirements. This can be carried out by constructing a model of system timing behaviour that can be used to make predictions about worst-case performance in terms of maximum response times, communication delays, delay variations and resource utilisation. In order to develop a modelling solution that is scalable, however, the timing analysis model must be constructed with this goal specifically in mind.
Scalability of the timing analysis solution can ultimately be achieved by meeting two key modelling objectives:

• Partitionable analysis: The ability to model the timing behaviour of parts of the system independently from other parts and, correspondingly, limit the scope of re-analysis required in the event of localised changes;
• Evolvable analysis: The ability to assess the timing behaviour of the system incrementally throughout the development process (and through later modifications), based on the information available at each stage of development.
Partitionable Analysis

It was recognised by Audsley et al (1993) that the holistic nature of timing analysis models for distributed real-time systems (the inability to analyse the system in part due to circular inter-dependencies in the model) arises due to a combination of functional and physical integration effects. It therefore follows that the model could be partitioned by placing appropriate restrictions on the manner in which integration of the system software and hardware components is performed, as follows:

• Functional partitioning: Constraints on data communication and ordering/precedence relationships between related software components;
• Physical partitioning: Constraints on resource sharing policies (scheduling and communication protocols).
Functional partitioning can be implemented by breaking up end-to-end transactions (sequences of processing and communication activities required to meet some higher level system objective) into individual components based purely on knowledge of the end-to-end timing requirements of the system (and the transaction topology). The net result is that the timing behaviour of groups of software components allocated to individual processors (and other resources) can potentially be analysed independently. Functional partitioning, in its extreme, results in a 'federated' timing analysis model where resource boundaries in the physical architecture dictate the partitions in the timing model. The resulting model can only be applied/evaluated after the physical integration stage of system development has been performed, i.e. the allocation of software components to hardware processors (and other resources) has been defined and the details of the scheduling solution and communication protocols have been finalised. Physical partitioning can be implemented by applying resource scheduling mechanisms that support temporal partitioning between the components being scheduled. The net result is that the timing behaviour of functionally-dependent groups of components, i.e. end-to-end transactions, can be analysed independently. This offers some key advantages over functional partitioning. Firstly, for systems that embody a degree of safety-critical functionality, this means that there is inherent temporal isolation between safety-critical and non-safety-critical software components that share processing resources and/or communication media. Secondly, and more generally from the perspective of an overall system engineering process, there is potential for supporting independent development and verification of groups of functionally-related components for a target computing environment that is physically integrated. Moreover, the analysis can potentially be applied before the physical integration stage of development has been performed, i.e. much earlier in the development life of the system.
The Reservation-Based Analysis (RBA) method described in this chapter uses a physical partitioning approach throughout but also supports functional partitioning with the guideline that this is to be used sparingly. Before going on to describe RBA, the second of the two key modelling requirements is discussed below.
Evolvable Analysis

No particular form of development life-cycle is assumed here, merely the notion that requirements, design, implementation/integration and testing/verification stages are involved, where these stages may be performed with a degree of concurrency and iteration. More specifically, it is assumed that the following activities are involved in the process:

• Definition of system-level timing requirements;
• Decomposition and refinement of these requirements during system design and implementation;
• Development/acquisition and integration of software and hardware components;
• Verification of system timing properties against stated requirements.
Timing analysis models are normally developed in a ‘bottom-up’ manner, i.e. the model is not finalised until after the implementation and integration details of the system have been decided. Hence, it is not possible to assess the timing behaviour of the system until late in the development process. Furthermore, subsequent changes to the implementation or integration details of the system, particularly changes to the scheduling or communication protocols involved, are likely to impact on the details of the timing model. The results of analysis performed at this late stage of development are, of course, essential to support final verification of the system timing requirements. Any deficiencies discovered at this late stage, however, can give rise to significant re-work, involving possible re-consideration of the artefacts produced from the integration, implementation, design or even requirements phases of development, depending on the severity of the problem. The costs associated with re-work can be a major factor in the overall development cost of industrial real-time systems. The amount of re-work associated with the development and verification of the timing properties of the system could be reduced by making the notion of timing analysis more integral to the systems engineering process as a whole, allowing it to be applied throughout the development of the system, starting much earlier in the process. This would allow an ongoing assessment of emerging system timing properties relative to specified timing requirements and also provide progressive guidance on the selection of future design/implementation details at successive stages of development. Two fundamental issues can be identified regarding the provision of an evolvable timing analysis model through the system life-cycle: • •
• How to perform timing analysis with a lack of system integration and implementation details during the earlier stages of development;
• How to deal with a continually evolving definition of the system, with different parts of the system evolving at different rates.
Timing Analysis without System Integration and Implementation Details

In the earlier stages of development, the timing-related information for all parts of the system will be scarce. Whilst the system level timing requirements may be reasonably well understood and decomposed to varying extent during design stages, the ability to perform timing analysis relies fundamentally on the notion of resources - some media through which the functions of the system can be performed. Processing resource details are used as a basis for calculating worst-case execution times of software components. Similarly, details of the communication media (and associated protocols) are used to characterise worst-case communication delays. In order to perform timing analysis prior to the system implementation stage, it is therefore necessary to work with an implementation-independent, abstract model of system resources. Such a model could provide a means to perform timing analysis in the earlier stages of system development using estimations or assumptions about system integration and implementation details that will not become concrete until a later date. This abstract model must be defined to provide a sufficient (although pessimistic) basis for performing timing analysis but without over-constraining the final implementation of the system. When implementation details are eventually finalised, the results of timing analysis performed via the abstract model could then be verified. This gives rise to a two-stage approach to timing analysis:
• Abstract timing analysis: Performed during the definition and decomposition stages of development on the basis of the abstract resource model; the net result is a set of worst-case (and best-case) guarantees regarding the timing behaviour of the system that are subject to a set of obligations being met by the final implementation of the system;
• Target-specific timing analysis: Performed during the system implementation and integration stages of system development, the aim being to demonstrate that the set of obligations imposed during the abstract timing analysis phase have actually been met.
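To make the hand-off between the two stages concrete, the following minimal Python sketch (an illustration only; the record fields and function names are assumptions, not part of the RBA definition) treats the guarantees produced by the abstract analysis as obligations that are later checked against target-specific figures.

```python
from dataclasses import dataclass

@dataclass
class TimingObligation:
    """One obligation emitted by the abstract timing analysis stage (illustrative fields)."""
    activity: str       # activity identifier, e.g. "lambda_1,1"
    r_min: float        # minimum localised response time assumed by the abstract model
    r_max: float        # maximum localised response time (budget) assumed by the abstract model
    d_min: float        # minimum I/O separation assumed by the abstract model

def target_specific_check(ob: TimingObligation,
                          measured_r_min: float,
                          measured_r_max: float) -> bool:
    """Target-specific verification (assumed reading): the implementation meets the
    abstract obligation if its measured response times stay inside the budgeted range."""
    return measured_r_min >= ob.r_min and measured_r_max <= ob.r_max

# Example: an obligation produced early in development, verified once code exists.
# The numbers are taken from the level 1 worked example later in the chapter.
ob = TimingObligation(activity="lambda_1,1", r_min=20, r_max=26, d_min=15)
print(target_specific_check(ob, measured_r_min=21, measured_r_max=24))  # True
```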
By appropriate construction of the timing model, the guarantees generated during the abstract timing analysis phase can be inferred automatically at the target-specific analysis stage, i.e. there should be no need to reconstruct the results of the system-wide abstract analysis after the target-specific details have been finalised and verified. Significantly, the opportunity can be taken here to define a consistent (unified) model for system-wide analysis, i.e. for end-to-end timing analysis across all system resources. This is a key attribute of the RBA framework. It is normally the case that different types of system resources (processing and communication media) and resource access policies (scheduling and communication protocols) are modelled in different ways. The use of consistent abstractions improves the composability and ultimately scalability of the system-wide timing analysis model.
Timing Analysis for an Evolving System Definition

The development of a timing analysis model for application through the system life-cycle must inherently address the problem of dealing with an evolving definition of the system. Throughout the development process, different parts of the system will mature at different rates, leading to an inconsistent level of detail regarding the timing properties of the system. The timing model must not only be able to represent
the various parts of the system at different levels of detail but also still be evaluated to determine system timing behaviour based on such information. This can be achieved by structuring the real-time transaction model as a hierarchy, where successive levels of the hierarchy capture the results of successive stages of system evolution. At any given stage of development, the hierarchy needs to capture all known ordering/precedence relationships between the entities that describe the system at that stage and all known resource requirements (if any). Ideally, the nodes of the hierarchy should be of a form that is consistent throughout the development of the system in order to avoid any problems associated with transforming the timing properties of the system between different representations at different stages of development. The proposal to partition the timing model in the physical domain, as described above, suggests that the nodes of the hierarchy should be characterised by indivisible resource usage requirements (and the corresponding delay/response times) associated with computational and communication elements within the system. In this way, functionally unrelated parts of the system can potentially be modelled, analysed and modified independently and therefore allowed to evolve at different rates throughout the development process. Similarly, in the later stages of system development, the analysis should be applicable to partially integrated subsets of the system in order that verification of system timing properties can occur in step with the integration process itself, rather than waiting for the complete system to be finalised before the timing behaviour of any one part of it can be verified. In terms of the two-stage approach described above, the timing guarantees generated from the abstract timing analysis model at successive stages of refinement can be taken to represent timing obligations on subsequent stages of development. As the model is evolved, timing obligations are refined to generate increasingly detailed implementation constraints on the final system.
Abstract Timing Analysis Model

The system level, end-to-end timing characteristics are captured, developed and analysed in terms of transactions. A transaction captures the temporal relationship between:

• A set of input events from the external environment of the system (or from other transaction(s));
• A set of output events to the external environment of the system (or to some other transaction(s)).
In the true sense, transaction input and output events correspond to the arrival or dispatch of control signals and/or information from or to the external interfaces of the system. In order to better support the engineering of large-scale real-time systems, however, such events can relate to other transactions within the system as well as the external environment of the system. This allows particular end-to-end requirements of the system to be modelled as multiple transactions where practical engineering constraints dictate, such as the need to allocate responsibilities across industrial partnerships and sub-contractors. This practical concern is supported without compromising the system timing model due to the uniform nature adopted for expressing transaction structural and timing properties, as explained below.
Real-Time Transaction Topology

For reasons of supporting an evolving definition of the system during development, the structure of the transaction model is hierarchical. The body of a transaction at any stage in its development is expressed in the form of an acyclic, directed, nested graph whose leaf nodes capture the concurrent processing and communication elements of the transaction, termed activities. Non-leaf nodes in the hierarchy are referred to as nested transactions. The edges (or arcs) of the graph capture the precedence (ordering) and nesting relationships within the transaction:

• Precedence relationships describe the required order of execution of activities and nested transactions;
• Nesting relationships capture strict refinement (specification-implementation) relationships that arise during the evolution of the transaction model.
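As an illustration of this structure, the following Python sketch (illustrative only; the class and helper names are assumptions, not code from the chapter) represents a transaction as a nested, acyclic directed graph whose leaf nodes are activities, together with a depth-first flattening of the kind described below.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """A node of the transaction hierarchy: an activity (leaf) or a nested transaction."""
    name: str                                                # e.g. "1,2" for lambda_{1,2}
    children: List["Node"] = field(default_factory=list)     # nesting (refinement) relationships
    successors: List[str] = field(default_factory=list)      # precedence relationships, by name
    guard: Optional[str] = None                              # guard function Q, e.g. "all" or "any"

    @property
    def is_activity(self) -> bool:
        return not self.children                             # leaf nodes are activities

def flatten(node: Node) -> List[Node]:
    """Depth-first traversal listing the leaf activities of a (nested) transaction,
    mirroring the 'flat' visualisation discussed in the text."""
    if node.is_activity:
        return [node]
    leaves: List[Node] = []
    for child in node.children:
        leaves.extend(flatten(child))
    return leaves

# A fragment of the example transaction: lambda_1 refined into lambda_{1,1..3},
# with lambda_{1,3} guarded on completion of both predecessors.
lam_1 = Node("1", children=[
    Node("1,1", successors=["1,3"]),
    Node("1,2", successors=["1,3"]),
    Node("1,3", guard="all"),
])
print([n.name for n in flatten(lam_1)])  # ['1,1', '1,2', '1,3']
```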
Consider the transaction λ depicted in Figure 1. This illustrates the evolving definition of a transaction from an initial system level description (referred to as the level 0 model) through two successive stages/levels of refinement. At the first stage of refinement, the level 0 nested transaction λ_1 is implemented via three level 1 nested transactions, each of which is implemented as a set of level 2 activities at the second stage of refinement. In the general case, let λ_{i,..,k} denote any arbitrarily nested transaction or activity within transaction λ. This notation is sufficient to define the topology of the transaction for the purpose of performing timing analysis (once the timing characteristics of the individual activities have been defined - see section 2.2).

Figure 1. Example transaction topology
Further structural information, however, can optionally be provided regarding the nature of precedence and nesting relationships as a means of reducing the pessimism in the analysis. This is achieved via the association of a guard function Q_{i,..,k} with each node λ_{i,..,k} describing the conditions that trigger its arrival for execution. In many cases, Q_{i,..,k} is implicit given the form of relationship involved. For example, the arrival of λ_{i,..,k} dependent on completion of a single predecessor does not require Q_{i,..,k} to be defined at all since the constraint is wholly described by the topology of the directed graph. In other cases, however, Q_{i,..,k} is not sufficiently defined from the graph topology alone and can, if required, be expressed explicitly. For example, Q_{1,3} could be defined such that the arrival of λ_{1,3} is triggered only upon completion of the two predecessors, λ_{1,1} and λ_{1,2}, or, alternatively, on completion of either one of those predecessors. The choice of guard functions will have an impact on the timing analysis in terms of improving the accuracy of the model. To this end, a basic categorisation of precedence and nesting relationships, from which more complex transaction topologies can be constructed, is given below. Consider first the case of precedence relationships, which can be categorised as follows:

• One-to-many: Where the completion of a single activity or nested transaction triggers the arrival of one or more successors (as illustrated in Figure 1 within the level 0 model);
• Many-to-one: Where the completion of one or more concurrent activities or nested transactions is required to trigger the arrival of a single successor (as illustrated in Figure 1 within the level 1 model).
The simple case of one-to-one precedence, where the completion of a single activity or nested transaction triggers the arrival of a single successor, is a special case of each of these classes and actually represents the intersection between the two. An analogous categorisation can be defined for nesting relationships as follows, observing that these relationships involve successive (referred to as parent and child) levels of the transaction hierarchy:

• One-to-many: Where the arrival of a single nested transaction at the parent level triggers the arrival of one or more concurrent activities at the child level (as illustrated in Figure 1 in the transition from the level 1 to the level 2 model at λ_{1,2}) or, traversing the hierarchy in the opposite direction, where the completion of a single activity at the child level triggers the completion of one or more concurrent nested transactions at the parent level;
• Many-to-one: Where the arrival of one or more concurrent nested transactions at the parent level triggers the arrival of a single activity at the child level or, traversing the hierarchy in the opposite direction, where the completion of one or more concurrent activities at the child level triggers the completion of a single nested transaction at the parent level (as illustrated in Figure 1 in the reverse transition from the level 2 to the level 1 model at λ_{1,1}).
For ease of future reference, nesting relationships that are directed from parent to child will be referred to as descending and those that are directed from child to parent will be referred to as ascending. The purpose of the nesting relationships in the transaction topology is to reflect the ongoing evolution of the system during development and later in service. For convenience, transactions can also be visualised at any stage of refinement in an equivalent ‘flat’ form, i.e. with nesting relationships transformed into precedence relationships. This is achieved by performing a depth-first traversal of the nested
transaction graph and successively replacing each nested transaction with its lower level sub-graph; the sub-graph inheriting all higher level nesting and precedence relationships from the nested transaction that it replaces. For example, Figure 2 illustrates the example transaction with the nesting relationship that stems from λ1,1 resolved. Applying this process repeatedly to resolve all nesting relationships for the example transaction gives the resultant flat topology illustrated in Figure 3. Note that this flattening of the transaction hierarchy is merely for user visualisation purposes and does not affect the transaction from the perspective of the timing model.
Figure 2. Example transaction with nesting partially resolved
Figure 3. Example transaction with nesting fully resolved
Real-Time Transaction Properties

A set of timing parameters must be assigned for each transaction in order to perform timing analysis. A key consideration here is the need for the timing model to be evolvable and, to this end, the timing parameters that represent the nested transaction and the activity are defined such that these objects are interchangeable. In other words, any nested transaction can be implemented in terms of activities (and further nested transactions) and, vice versa, an activity can be replaced by a nested transaction and become the subject of further evolution. In this way, the same timing analysis approach can be applied to predict the behaviour of a transaction throughout its evolution. The only distinction between activities and nested transactions is that activities, since these represent leaf nodes of the transaction graph, must define their resource requirements directly, whereas nested transactions inherit such characteristics from the activities they embody. Otherwise, the parameters via which timing behaviour is represented and observed are the same for a single activity, a group of activities, a nested transaction and all the way up the hierarchy to a level 0 transaction. In the general case, for any arbitrarily nested transaction or activity, λ_{i,..,k}, with level 0 'parent' λ_i, the timing properties are captured as follows:

• input jitter, J^{in}_{i,..,k} - the maximum width of the time window that spans the arrival of all input events associated with λ_{i,..,k};
• output jitter, J^{out}_{i,..,k} - the maximum width of the time window that spans the deliverance of all output events associated with λ_{i,..,k};
• minimum I/O separation, d_{i,..,k} - the minimum separation in time (delay) between the input and output event windows of λ_{i,..,k};
• minimum inter-arrival time, a_{i,..,k} - the minimum separation in time between the input event windows associated with successive instances of λ_{i,..,k}.
Figure 4. Timing properties of transaction/activity λ_{i,..,k}

The relationships between these parameters are depicted in Figure 4. In Figure 4, the minimum inter-arrival time of λ_{i,..,k} is shown to be greater than the latest completion time of any λ_{i,..,k} output event, corresponding to the case where any one instance of λ_{i,..,k} will always complete before the next instance arrives. This is merely to keep the illustration simple and is not actually a constraint on the model. For example, a pilot display generation application is likely to require
a minimum frame update time that is significantly less than the worst-case latency of the data being displayed. The model supports this type of behaviour, though there are some other constraints imposed on model parameters as follows, for reasons as given:

• d_{i,..,k} > -J^{in}_{i,..,k}: A necessary condition arising from the basic restriction that, for any given instance of λ_{i,..,k}, the output event window cannot begin before the input event window begins;
• a_{i,..,k} > 0: A constraint imposed to ensure that the input event windows associated with successive instances of λ_{i,..,k} are totally ordered.
End-to-End Timing Analysis

End-to-end timing analysis can be performed at any stage of evolution of the transaction model, based on the information specified at that stage. Clearly, the analysis results will become more accurate as the definition of the model/system is evolved during development. The starting point for describing this timing analysis is to express the relationship between the basic timing parameters as specified and the overall delays accrued, at each level of nesting in the transaction definition. Let r_{i,..,k} and R_{i,..,k} denote the minimum and maximum accrued delays (response times) associated with any nested transaction or activity λ_{i,..,k}. The following relationships are observed:

d_{i,..,k} = r_{i,..,k} - J^{in}_{i,..,k}    (1)

J^{out}_{i,..,k} = J^{in}_{i,..,k} + (R_{i,..,k} - r_{i,..,k})    (2)
Figure 5. Delay relationships for transaction/activity λ_{i,..,k}

These relationships are clarified by the illustration in Figure 5. The values r_{i,..,k} and R_{i,..,k} must then be specified for each leaf node of the hierarchy, i.e. for each activity; these activity level delays are referred to as localised delays. The end-to-end analysis can then proceed by recursively descending the nested transaction topology/graph definition, accounting at each stage for the impact of nesting relationships, precedence relationships and localised delays on the overall end-to-end delays. The same approach can be taken to determining accrued delay variation, i.e. jitter. This will be illustrated by example later in the section. Ultimately, all accrued delay and jitter values relate back to the input event window for the level 0 transaction, λ. That said, the end-to-end timing model is constructed such that the delays and jitter accrued across any nested transaction or activity can be calculated relative to those inherited at its time
of arrival, i.e. the relative impact of a given stage in the transaction can be observed. In the transaction depicted in Figure 1, for example, the activities λ_{1,2,1} and λ_{1,2,2} will each inherit accrued delay and jitter values on their arrival via a nesting relationship from their parent transaction λ_{1,2}. In the general case, let d^{in}_{i,..,k} denote the accrued delay inherited by λ_{i,..,k} upon its arrival and, in turn, let d^{out}_{i,..,k} denote the accrued delay that λ_{i,..,k} exports to its successors upon completion. These values are related as follows:

d^{out}_{i,..,k} = d^{in}_{i,..,k} + r_{i,..,k}    (3)
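A small Python sketch of Equations (1) to (3) may help fix the parameter relationships; the helper names are illustrative, and the numerical values are taken from the level 0 worked example that appears later in the chapter.

```python
def min_io_separation(r_min: float, j_in: float) -> float:
    """Equation (1): minimum I/O separation d = r - J_in."""
    return r_min - j_in

def output_jitter(j_in: float, r_min: float, r_max: float) -> float:
    """Equation (2): accrued output jitter from input jitter and the localised delay range."""
    return j_in + (r_max - r_min)

def exported_delay(d_in: float, r_min: float) -> float:
    """Equation (3): accrued delay exported to successors, d_out = d_in + r."""
    return d_in + r_min

# Values matching lambda_1 in the worked example: J_in = 5, r = 40, R = 47, d_in = -5.
print(min_io_separation(40, 5))   # 35
print(output_jitter(5, 40, 47))   # 12
print(exported_delay(-5, 40))     # 35
```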
This relationship is illustrated in Figure 6. Notice in the diagram that the term d^{in}_{i,..,k} actually represents the separation in time between two input windows, rather than one input and one output window. This is consistent with the use of d to denote minimum I/O separation since the input event window of activity λ_{i,..,k} is equivalent to the output window that describes the combined output jitter of its predecessors. Equation (3) provides a means by which minimum localised delay values calculated at the activity level can be consolidated into the end-to-end delay calculation. The consolidation of maximum local delays is taken care of by the output jitter calculation given in Equation (2). The manner in which accrued delays and jitter are inherited (and exported) for any particular λ_{i,..,k} depends on the form of precedence or nesting relationships involved. In the previous section, a number of fundamental forms of such relationships were identified. In the following circumstances, accrued delays and jitter are directly inherited (unchanged) in the direction of the relationship:

• One-to-many precedence relationship: when λ_{i,..,k} is one of many successors to λ_{i,..,j} (Figure 7):

d^{in}_{i,..,k} = d^{out}_{i,..,j}    (4)

J^{in}_{i,..,k} = J^{out}_{i,..,j}    (5)

• One-to-many nesting relationship (descending): when λ_{i,..,k} is one of many child activities whose arrival is triggered by that of the parent transaction λ_{i,..,j} (Figure 8):

d^{in}_{i,..,k} = d^{in}_{i,..,j}    (6)

J^{in}_{i,..,k} = J^{in}_{i,..,j}    (7)

• One-to-many nesting relationship (ascending): when λ_{i,..,k} is one of many parent transactions whose completion is triggered by that of the child activity λ_{i,..,j} (Figure 9):

d^{out}_{i,..,k} = d^{out}_{i,..,j}    (8)

J^{out}_{i,..,k} = J^{out}_{i,..,j}    (9)

Figure 6. Accrued delay relationships for transaction/activity λ_{i,..,k}

Figure 7. Delay inheritance - one-to-many precedence relationship

Figure 8. Delay inheritance - one-to-many nesting relationship (descending)
In circumstances other than those cases illustrated above, however, delay and jitter inheritance is less straightforward, depending ultimately on the form of guard function Q_{i,..,k} that resides over the arrival of λ_{i,..,k}. Without knowledge of Q_{i,..,k}, the exact values for the inherited delay and jitter parameters cannot be determined, but the smallest 'safe' range of values can be stated as follows:

• Many-to-one precedence relationship: when λ_{i,..,j} is one of many predecessors to λ_{i,..,k} (Figure 10):

min_j(d^{out}_{i,..,j}) ≤ d^{in}_{i,..,k} ≤ max_j(d^{out}_{i,..,j})    (10)

min_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{in}_{i,..,k} ≤ J^{in}_{i,..,k} ≤ max_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{in}_{i,..,k}    (11)

• Many-to-one nesting relationship (descending): when λ_{i,..,j} is one of many parent transactions whose arrival is required to trigger that of the child activity λ_{i,..,k} (Figure 11):

min_j(d^{in}_{i,..,j}) ≤ d^{in}_{i,..,k} ≤ max_j(d^{in}_{i,..,j})    (12)

min_j(d^{in}_{i,..,j} + J^{in}_{i,..,j}) - d^{in}_{i,..,k} ≤ J^{in}_{i,..,k} ≤ max_j(d^{in}_{i,..,j} + J^{in}_{i,..,j}) - d^{in}_{i,..,k}    (13)

• Many-to-one nesting relationship (ascending): when λ_{i,..,j} is one of many child activities whose completion is required to trigger that of the parent transaction λ_{i,..,k} (Figure 12):

min_j(d^{out}_{i,..,j}) ≤ d^{out}_{i,..,k} ≤ max_j(d^{out}_{i,..,j})    (14)

min_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{out}_{i,..,k} ≤ J^{out}_{i,..,k} ≤ max_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{out}_{i,..,k}    (15)

Note again that these bounds are derived without knowledge of Q_{i,..,k} and are 'safe' but pessimistic:

• The stated lower bounds correspond to the case where Q_{i,..,k} is defined such that the arrival of λ_{i,..,k} is triggered upon completion of any one λ_{i,..,j};
• The stated upper bounds correspond to the case where Q_{i,..,k} is defined such that the arrival of λ_{i,..,k} is triggered upon completion of all λ_{i,..,j}.

Figure 9. Delay inheritance - one-to-many nesting relationship (ascending)

Figure 10. Delay inheritance - many-to-one precedence relationship
In practice, the form of Q_{i,..,k} could be defined and refined in line with the development of the associated transaction, and this information can be used to reduce the pessimism of the accrued delay and jitter bounds compared to those determined by Equations (10) to (15). This can be illustrated by considering the two extreme cases that are used as the basis for deriving those equations, as given below. In the first case, where Q_{i,..,k} is defined such that the arrival of λ_{i,..,k} is triggered upon completion of any one λ_{i,..,j}, Equations (10) to (15) can be reduced as follows:
• Many-to-one precedence relationship: when λ_{i,..,j} is one of many predecessors to λ_{i,..,k}:

d^{in}_{i,..,k} = min_j(d^{out}_{i,..,j})    (10a)

J^{in}_{i,..,k} = min_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{in}_{i,..,k}    (11a)

• Many-to-one nesting relationship (descending): when λ_{i,..,j} is one of many parent transactions whose arrival is required to trigger that of the child activity λ_{i,..,k}:

d^{in}_{i,..,k} = min_j(d^{in}_{i,..,j})    (12a)

J^{in}_{i,..,k} = min_j(d^{in}_{i,..,j} + J^{in}_{i,..,j}) - d^{in}_{i,..,k}    (13a)

• Many-to-one nesting relationship (ascending): when λ_{i,..,j} is one of many child activities whose completion is required to trigger that of the parent transaction λ_{i,..,k}:

d^{out}_{i,..,k} = min_j(d^{out}_{i,..,j})    (14a)

J^{out}_{i,..,k} = min_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{out}_{i,..,k}    (15a)

Figure 11. Delay inheritance - many-to-one nesting relationship (descending)

Figure 12. Delay inheritance - many-to-one nesting relationship (ascending)
In the second case, where Q_{i,..,k} is defined such that the arrival of λ_{i,..,k} is triggered only upon completion of all λ_{i,..,j}, Equations (10) to (15) can be reduced as follows:

• Many-to-one precedence relationship: when λ_{i,..,j} is one of many predecessors to λ_{i,..,k}:

d^{in}_{i,..,k} = max_j(d^{out}_{i,..,j})    (10b)

J^{in}_{i,..,k} = max_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{in}_{i,..,k}    (11b)

• Many-to-one nesting relationship (descending): when λ_{i,..,j} is one of many parent transactions whose arrival is required to trigger that of the child activity λ_{i,..,k}:

d^{in}_{i,..,k} = max_j(d^{in}_{i,..,j})    (12b)

J^{in}_{i,..,k} = max_j(d^{in}_{i,..,j} + J^{in}_{i,..,j}) - d^{in}_{i,..,k}    (13b)

• Many-to-one nesting relationship (ascending): when λ_{i,..,j} is one of many child activities whose completion is required to trigger that of the parent transaction λ_{i,..,k}:

d^{out}_{i,..,k} = max_j(d^{out}_{i,..,j})    (14b)

J^{out}_{i,..,k} = max_j(d^{out}_{i,..,j} + J^{out}_{i,..,j}) - d^{out}_{i,..,k}    (15b)
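The guard-dependent inheritance rules can be expressed compactly in code. The sketch below (illustrative; the function name and the 'any'/'all' encoding of Q are assumptions) implements Equations (10)/(11) reduced to the two extreme cases (10a)/(11a) and (10b)/(11b) for a many-to-one precedence relationship; the nesting cases follow the same pattern with the appropriate delay terms.

```python
from typing import List, Tuple

def inherit_from_predecessors(preds: List[Tuple[float, float]],
                              guard: str) -> Tuple[float, float]:
    """Inherited accrued delay and jitter for a many-to-one precedence relationship.

    preds -- list of (d_out, J_out) pairs exported by the predecessors
    guard -- 'any' (Equations 10a/11a) or 'all' (Equations 10b/11b)
    """
    starts = [d for d, _ in preds]           # earliest possible end of each predecessor window
    tails = [d + j for d, j in preds]        # latest possible end of each predecessor window
    if guard == "any":
        d_in = min(starts)
        j_in = min(tails) - d_in
    elif guard == "all":
        d_in = max(starts)
        j_in = max(tails) - d_in
    else:
        raise ValueError("guard must be 'any' or 'all'")
    return d_in, j_in

# Level 1 example from later in the chapter: lambda_{1,3} guarded on completion of both
# predecessors, with (d_out, J_out) = (15, 11) for lambda_{1,1} and (20, 9) for lambda_{1,2}.
print(inherit_from_predecessors([(15, 11), (20, 9)], guard="all"))  # (20, 9)
```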
Timing Model Initialisation and Finalisation

Initialisation of the end-to-end timing analysis model involves the assignment of input jitter and minimum I/O separation parameters for all nested transactions and activities that directly service the input events of the transaction. For example, for the transaction depicted in Figure 1, this relates to the nested transaction λ_1. In the general case, the following assignments are made for each level 0 λ_i whose arrival is triggered directly by some transaction input event:

d^{in}_i = -J^{in}    (16)

J^{in}_i = J^{in}    (17)

where J^{in} is the transaction level input jitter, i.e. the maximum variation in arrival time over all transaction input events. Finalisation of the end-to-end analysis involves the assignment of transaction level minimum I/O separation and output jitter values. Transaction level values are determined by consolidating the values of the same parameters for all nested transactions and activities that directly service the output events of the transaction. For example, for the transaction depicted in Figure 1, this relates to the activities λ_2, λ_3 and λ_4. In the general case, the following expressions are evaluated over all level 0 λ_i that relate directly to transaction output events:

min_i(d^{out}_i) ≤ d ≤ max_i(d^{out}_i)    (18)

min_i(d^{out}_i + J^{out}_i) - d ≤ J^{out} ≤ max_i(d^{out}_i + J^{out}_i) - d    (19)

where d and J^{out} are the transaction level minimum I/O separation and output jitter, respectively.
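The boundary conditions can likewise be evaluated mechanically. The following sketch (illustrative helper names, not code from the chapter) applies Equations (16)/(17) for initialisation and computes the admissible ranges implied by Equations (18)/(19) for finalisation; the output-side values used are those assigned in the worked example that follows.

```python
from typing import List, Tuple

def initialise(j_in_transaction: float) -> Tuple[float, float]:
    """Equations (16)/(17): (d_in_i, J_in_i) for each level 0 node triggered
    directly by a transaction input event."""
    return -j_in_transaction, j_in_transaction

def finalisation_bounds(outputs: List[Tuple[float, float]]):
    """Equations (18)/(19): admissible range for the transaction-level minimum I/O
    separation d and, for a chosen d, the admissible range for the output jitter J_out.

    outputs -- (d_out_i, J_out_i) for all level 0 nodes servicing transaction output events
    """
    starts = [d for d, _ in outputs]
    tails = [d + j for d, j in outputs]
    d_range = (min(starts), max(starts))          # Equation (18)

    def j_out_range(d: float) -> Tuple[float, float]:
        return min(tails) - d, max(tails) - d     # Equation (19)

    return d_range, j_out_range

# Output-side values assigned in the worked example (lambda_2, lambda_3, lambda_4):
d_range, j_out_range = finalisation_bounds([(40, 25), (45, 17), (55, 15)])
print(initialise(5))     # (-5, 5)
print(d_range)           # (40, 55)
print(j_out_range(40))   # (22, 30)
```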
Example Transaction Definition and Decomposition

In order to evaluate the end-to-end timing model at any stage of refinement, values must be assigned to the localised (activity level) delay parameters r_{i,..,k} and R_{i,..,k} for each activity λ_{i,..,k}. In general terms, this can be done in one of two ways:
• By assigning budgeted values for each r_{i,..,k} and R_{i,..,k}, e.g. based on knowledge of the transaction timing requirements;
• By assigning actual values for each r_{i,..,k} and R_{i,..,k}, e.g. based on actual measurement or static analysis of code.
The latter approach is clearly only applicable when the target hardware and software implementation are complete (or at least underway). The former approach is what is required during the early stages of system development and evolution and is used in the example that follows. To start, consider the assignment of the following end-to-end timing properties to the example transaction - in practice, such details could be extracted from an overall statement of system level end-to-end timing requirements (Figure 13 and Figure 14). The initial level 0 model of the transaction is depicted in Figure 14 - in practice, this information could be extracted from the top level software architecture design. From the set of boundary conditions given in Equations (16) to (19), it is straightforward to assign a set of values to the corresponding level 0 timing attributes. Firstly, from the transaction input conditions given in Equations (16) and (17), values are directly inferred for the input jitter and initial accrued delay of λ_1:

d^{in}_1 = -5   J^{in}_1 = 5
Figure 13. Example - transaction timing requirements
Figure 14. Example - Level 0 Model
Figure 15. Example partial assignment of Level 0 attributes
From the transaction output conditions given in Equations (18) and (19), suitable values can be found for the output jitter and final accrued delay of λ_2, λ_3 and λ_4:

d^{out}_2 = 40   J^{out}_2 = 25
d^{out}_3 = 45   J^{out}_3 = 17
d^{out}_4 = 55   J^{out}_4 = 15

In this example, the parameters have been assigned such that the output event window of the transaction is exactly spanned by the set of activity level output windows. This means that the corresponding transaction level timing requirements have been met exactly, rather than leaving an element of redundancy in the transaction level requirements relative to the level 0 model timing properties. Beyond that, and the satisfaction of the boundary conditions given in Equations (16) to (19), the actual values assigned are somewhat arbitrary and chosen purely for the purposes of illustration. Figure 15 illustrates this (partial) assignment of level 0 timing attributes. The rest of the level 0 timing attributes can be assigned on the basis of the level 0 topology details and the appropriate means of accounting for precedence relationships as defined in Equations (4) to (9). In practice, additional application-specific information could be taken into account here. In this example, values have been assigned as explained below. Given the one-to-many precedence relationship between λ_1 and its successor activities λ_2, λ_3 and λ_4, Equation (5) implies that the parameters J^{out}_1, J^{in}_2, J^{in}_3 and J^{in}_4 should all be assigned the same value.
Given that jitter tends to increase in the direction of control flow along the transaction (unless specific jitter control mechanisms are introduced such as by using time-triggered releases), this value should be less than any of the output jitter values already assigned for activities λ_2, λ_3 and λ_4. For the purposes of this example, the following assignment has been made:

J^{out}_1 = J^{in}_2 = J^{in}_3 = J^{in}_4 = 12

From Equation (4), the positions of the time windows whose widths are defined by the above jitter values are all fixed by the size of d^{out}_1. Hence, the parameters d^{out}_1, d^{in}_2, d^{in}_3 and d^{in}_4 should all be assigned the same value. Given the topology of the transaction, a reasonable assignment for illustration purposes would be:

d^{out}_1 = d^{in}_2 = d^{in}_3 = d^{in}_4 = 35

The level 0 timing attributes are now sufficiently defined to fix the position of all input and output windows in the level 0 topology. Figure 16 illustrates this (full) assignment of level 0 timing attributes. The final stage of level 0 transaction definition is to derive the set of timing obligations that are to be inherited as constraints on the next stage of model refinement (or implementation if so desired). These obligations are in the form of a set of minimum and maximum response times and minimum I/O separation values and can be determined (uniquely) for all level 0 activities from the application of Equations (3), (2) and (1), respectively:

r_1 = 40   R_1 = 47   d_1 = 35
r_2 = 5    R_2 = 18   d_2 = -7
r_3 = 10   R_3 = 15   d_3 = -2
r_4 = 20   R_4 = 23   d_4 = 8

The level 0 model is now completely defined. To illustrate how the approach supports further refinement of the timing model, a second stage of decomposition is now illustrated. The level 1 model for the nested transaction λ_1 is depicted in Figure 17. From the statement of λ_1 timing attributes above and the set of Equations (4) to (15), it is straightforward to assign a set of values to the corresponding level 1 timing attributes. Firstly, given the one-to-many nesting relationship (descending) between λ_1 and its child input activities λ_{1,1} and λ_{1,2}, Equations (6) and (7) give:

d^{in}_{1,1} = -5   J^{in}_{1,1} = 5
d^{in}_{1,2} = -5   J^{in}_{1,2} = 5
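The level 0 obligations quoted above follow mechanically from the assigned attributes. A minimal Python sketch (illustrative names; the attribute values are those of the worked example, applying Equations (3), (2) and (1) rearranged) is given below.

```python
def obligations(d_in: float, j_in: float, d_out: float, j_out: float):
    """Derive the timing obligations (r, R, d) for one node from its assigned
    accrued delay/jitter attributes."""
    r = d_out - d_in          # Equation (3): d_out = d_in + r
    R = r + (j_out - j_in)    # Equation (2): J_out = J_in + (R - r)
    d = r - j_in              # Equation (1): d = r - J_in
    return r, R, d

# Level 0 attributes assigned in the example: (d_in, J_in, d_out, J_out) per activity.
level0 = {
    "lambda_1": (-5, 5, 35, 12),
    "lambda_2": (35, 12, 40, 25),
    "lambda_3": (35, 12, 45, 17),
    "lambda_4": (35, 12, 55, 15),
}
for name, attrs in level0.items():
    print(name, obligations(*attrs))
# lambda_1 (40, 47, 35), lambda_2 (5, 18, -7), lambda_3 (10, 15, -2), lambda_4 (20, 23, 8)
```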
Figure 16. Example - full assignment of Level 0 attributes
Figure 17. Example - Level 1 Model for λ1
Given the one-to-one nesting relationship (ascending) between λ_1 and its child output activity λ_{1,3}, Equations (8) and (9) give:

d^{out}_{1,3} = 35   J^{out}_{1,3} = 12
Figure 18. Example - partial assignment of Level 1 attributes for λ_1

Figure 18 illustrates this (partial) assignment of level 1 timing attributes for λ_1. The rest of the level 1 timing attributes for λ_1 can be assigned on the basis of the level 1 topology details and the appropriate means of accounting for precedence relationships as defined in Equations (4) to (9). Assuming Q_{1,3} has been specified such that the arrival of λ_{1,3} will be triggered only upon completion of both predecessors λ_{1,1} and λ_{1,2}, Equations (10b) and (11b) can be applied to assign level 1 parameter values as described below.

Given that jitter tends to increase in the direction of control flow along the transaction, J^{in}_{1,3} is assigned an appropriate intermediate value between the specified input and output jitter values for λ_1. J^{out}_{1,1} and J^{out}_{1,2} are then assigned with knowledge of Q_{1,3} and the corresponding relationship with J^{in}_{1,3} as expressed in Equation (11b). This expression implies that larger jitter values than J^{in}_{1,3} can be assigned to some (though not all) predecessors in a many-to-one precedence relationship, so long as the output window terminates no later than the required successor input window. On this basis, J^{out}_{1,1} has been assigned a larger value than J^{in}_{1,3}, which means that this must be taken into account in the assignment of d^{out}_{1,1} (see below). This leads to the following assignments:

J^{in}_{1,3} = J^{out}_{1,2} = 9   J^{out}_{1,1} = 11

An appropriate intermediate value can be assigned for d^{in}_{1,3} to fix the position of the input window for λ_{1,3}. The accrued minimum delay requirements can then be specified for λ_{1,1} and λ_{1,2} such that the latest completion time for λ_{1,1} is less than that of λ_{1,2}:

d^{in}_{1,3} = d^{out}_{1,2} = 20   d^{out}_{1,1} = 15

Figure 19 illustrates the full assignment of level 1 timing attributes for λ_1. Once again, a set of timing obligations can be determined for the purposes of further refinement or direct implementation. These obligations are specified in the form of a set of minimum and maximum
response times and minimum I/O separation values for all level 1 activities by application of Equations (3), (2) and (1), respectively:

r_{1,1} = 20   R_{1,1} = 26   d_{1,1} = 15
r_{1,2} = 25   R_{1,2} = 29   d_{1,2} = 20
r_{1,3} = 15   R_{1,3} = 18   d_{1,3} = 6

Figure 19. Example - full assignment of Level 1 attributes for λ_1

A final stage of decomposition for the example transaction gives the set of level 2 timing attributes for the nested transaction λ_1 as depicted in Figure 20. In practice, refinement of the transaction and its timing attributes could continue until the required level of detail is obtained. Clearly, the topological details generated at each stage of refinement and the number of refinement stages performed are dependent on the nature of the application and the software design/implementation approach. The final set of timing obligations for λ_1 can now be determined via Equations (3), (2) and (1); these are given in Table 1.
Figure 20. Example - final assignment of Level 2 attributes for λ1
Table 1. Example transaction (localised timing attributes for λ_1)

              r_{i,..,k}   d_{i,..,k}   R_{i,..,k}
λ_{1,1,1}         15           10           18
λ_{1,1,2}          5           -3            8
λ_{1,1,3}          5           -3           12
λ_{1,2,1}         10            5           20
λ_{1,2,2}         20           15           22
λ_{1,2,3}          5           -2            7
λ_{1,3,1}         13            4           15
λ_{1,3,2}          2           -9            3

Assuming no further refinement of the transaction prior to its implementation, this final set of timing obligations represents a set of constraints on the implementation of the system. During the implementation stages, however, the timing model could be evolved further in the same manner as above as a means of supporting more progressive implementation (and integration) of the final system. It should be observed during any such evolution, however, that the stage at which the timing model becomes target-specific or integration-specific, i.e. appropriate to a particular scheduling or communication regime, is the stage at which it ceases to support changes to the target system without the need to restate the model. When the final implementation and integration details of the system are stabilised, the timing model must be verified, i.e. the timing obligations defined in the abstract timing model must be shown to be safe. This requires some form of localised timing analysis model, i.e. a model to determine activity
level delay and jitter characteristics based on some notion of what constitutes a resource - processor or communication medium. This is almost where the transition begins towards a target-specific model. RBA permits the transition to be deferred a little longer, however, by adopting a rate-based execution model – an abstract model of run-time scheduling behaviour. This abstract scheduling model can then be implemented using either cyclic or priority-based scheduling. This next stage is described below.
RATE-BASED EXECUTION MODEL

The RBA rate-based execution model is a generalised form of scheduling model that provides independence from the final target implementation and integration details of the system, including the precise form of the final run-time scheduling solution. This abstract scheduling model can be used to guide the final target scheduling solution to preserve the performance predictions of the abstract timing model. A range of compliant scheduler implementation schemes will be described later. Let {λ_j; j = 1,..,n} denote the set of activities allocated to a shared system resource and denote the associated set of timing obligations by {(C_j, v_j, R_j); j = 1,..,n}, where C_j is the maximum execution time (or analogous communication bandwidth requirement), v_j is the minimum required rate of execution and R_j is the worst-case response time requirement. The rate-based execution model defines the following simple linear relationship between these parameters:

v_j = C_j / R_j    (20)
An analogous set of best-case parameters is also defined by {(c_j, V_j, r_j); j = 1,..,n}. The objective of any compliant implementation scheme is thus to maintain the run-time execution rate of each activity within the required range [v_j, V_j], as illustrated in Figure 21. To illustrate the application of the rate-based execution model (and subsequent implementation schemes) by example, Table 2 presents a set of timing attributes for the GAP task set (Locke, 1991). Each GAP 'task' is modelled as a single RBA activity since there is no benefit in further decomposition in this example. All GAP tasks are periodic with period T_j = R_j, except for τ_10 which is sporadic with
minimum inter-arrival time a_10 = 200. Since no input jitter is specified for the periodic tasks, it is assumed that a_j = T_j for these tasks. Conversely, assigning a_10 = T_10 for the sporadic task (the value of 200 shown in brackets in the table) gives a total task set utilisation requirement of 83.5%. The set of minimum execution rates is derived from Equation (20) but, since all GAP tasks are periodic with period = deadline (T_j = D_j), then {v_j = U_j; j = 1,..,16}. The total bandwidth reservation requirement is therefore equal to the total utilisation requirement of the task set, i.e. 83.5%, which would be schedulable on a single processor by an 'exact' implementation of the rate-based execution model.
Basic Schedule Implementation Scheme

A form of cyclic schedule implementation scheme can be used to directly implement the RBA rate-based execution model. This allows the run-time scheduling solution for a system to be derived directly from an RBA target-independent timing analysis model for the system without compromising the original timing requirements. The simplest form of such a scheme has the following attributes:

• A fixed cycle time Δ ≤ min_j R_j;
• A fixed time and duration of execution δ_j for each activity λ_j within each cycle;
• The restriction R_j ≤ a_j for each activity λ_j.
Consequently, each activity will execute for exactly δ_j time units in any interval of size Δ, i.e. not necessarily aligned to the minor cycle. The actual order of execution of activities within each cycle is arbitrary. Moreover, the execution time δ_j allocated to an activity within a cycle does not need to be contiguous. It is necessary to assign an appropriate value for Δ and for each δ_j such that the timing obligations for each activity λ_j are met. The following scheme can be applied to achieve this but note that other valid assignments will normally exist for a given set of timing obligations. An example is developed alongside the description of the scheme by considering the activity λ_7 from the GAP task set.

Figure 21. Valid execution space
Table 2. Example task set

 j   Function                  C_j    R_j    v_j = U_j
 1   Radar Track Filter          2     25    0.08
 2   RWR Contact Mgt.            5     25    0.2
 3   Data Bus Poll Device        1     40    0.025
 4   Weapon Aiming               3     50    0.06
 5   Radar Target Update         5     50    0.1
 6   Nav. Update                 8     59    0.1355
 7   Display Graphic             9     80    0.1125
 8   Display Hook Update         2     80    0.025
 9   Target Update               5    100    0.05
10   Weapon Protocol             1    200    0.005
11   Nav. Steering Cmds.         3    200    0.015
12   Display Stores Update       1    200    0.005
13   Display Keyset              1    200    0.005
14   Display Stat. Update        3    200    0.015
15   BET E Status Update         1   1000    0.001
16   Nav. Status                 1   1000    0.001
Firstly, define the normalised response time value R̄_j ≤ R_j as follows:

R̄_j = ⌊R_j / Δ⌋ Δ    (21)

From Equation (20), define the corresponding normalised execution rate v̄_j as follows:

v̄_j = C_j / R̄_j    (22)

It can be seen from Equation (20) that v̄_j ≥ v_j since R̄_j ≤ R_j. Subsequently, assign δ_j the minimum value that will guarantee λ_j to meet its normalised response time requirement R̄_j:

δ_j = ⌈v̄_j Δ⌉    (23)

In the final schedule, each activity λ_j will consequently be executed at a guaranteed minimum rate v^δ_j given as follows:

v^δ_j = δ_j / Δ    (24)
Hence, for the example task, a value of Δ = 25 gives R̄_7 = 75, v̄_7 = 0.12, δ_7 = 3 and v^δ_7 = 0.12. Since each activity will be executed at a rate which is no less than that specified by its minimum rate requirement, the worst-case response time can be guaranteed for any worst-case execution time in the range [0, C_j]. This makes final verification for a specific target implementation very straightforward. Denoting the target-specific resource requirement of each λ_j by C*_j gives the target-specific feasibility test:

C*_j ≤ C_j    (25)
This test is independent from the choice of Δ and, for any given activity, independent from the timing and resource requirements of the other activities allocated to the shared resource. The test also allows simple re-verification of λ_j following any software implementation changes that impact on the value of C*_j. A target-independent feasibility test (which could be applied as a resource allocation constraint) for the set of activities as a whole is as follows:

Σ_{j=1}^{n} v^δ_j ≤ 1    (26)

Since the value of v^δ_j is only dependent upon the timing obligations for activity λ_j, the test can be applied incrementally, i.e. to accept or reject the addition of a 'new' activity to an existing set by comparing its final rate requirement with the remaining capacity available, independent from the actual rate requirements of activities that already exist in the schedule (that are already guaranteed). Hence, denoting the new activity by λ_{n+1}, the following acceptance test can be applied:

v^δ_{n+1} ≤ 1 - Σ_{j=1}^{n} v^δ_j    (27)
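A minimal sketch of the two tests follows (illustrative function names; the rate values used are the final rates that appear for the GAP task set in Table 3 under Δ = 25).

```python
from typing import List

def feasible(rates: List[float]) -> bool:
    """Equation (26): target-independent feasibility test over final execution rates v^delta_j."""
    return sum(rates) <= 1.0

def accept_new(rates: List[float], new_rate: float) -> bool:
    """Equation (27): incremental acceptance test for a new activity."""
    return new_rate <= 1.0 - sum(rates)

# Under the basic cyclic scheme the GAP final rates sum to 1.20, so the full set is
# rejected on a single processor (cf. Table 3).
gap_final_rates = [0.08, 0.20, 0.04, 0.08, 0.12, 0.16, 0.12, 0.04,
                   0.08, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04]
print(feasible(gap_final_rates))            # False
print(accept_new(gap_final_rates[:9], 0.04))  # True: 92% used, a 4% activity still fits
```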
Neither form of the test can be applied until the value of Δ is fixed, since this determines the value of each v^δ_j. When the value of Δ is fixed, typically at design-time, this could be taken into account in the assignment of timing obligations at the final stage of decomposition of the end-to-end transactions. For example, the value of Δ impacts on the efficiency of the final bandwidth allocation, as discussed below.

For any activity, the inefficiency of the scheme increases as the worst-case response time requirement decreases relative to the minimum inter-arrival time. This inefficiency is manifest in the final scheduling solution as an over-allocation of bandwidth compared to that which is actually required, as stipulated by the true utilisation requirement of the activity. This arises since the minimum inter-arrival time is not recognised in the construction of the cyclic schedule beyond the assumption that it is greater than the original response time requirement, i.e. R_j ≤ a_j. Consequently, sufficient capacity is reserved in the schedule to execute each activity once in any time interval of duration R_j (irrespective of the minimum inter-arrival time). Any over-allocated bandwidth, however, along with any that is allocated but unused at run-time due to variation in execution times, can potentially be 'reclaimed' at run-time. Reclamation of unused bandwidth is discussed later in the paper. Alternatively, the over-allocation of bandwidth can be exploited to give a larger upper bound for the target-specific worst-case computation time for λ_j by restating the target-specific feasibility test, as previously stated in Equation (25), to give:

C*_j ≤ C^δ_j    (28)
where C^δ_j represents the actual maximum computation time allocated in the cyclic schedule over the time duration R̄_j and is given as follows:

C^δ_j = (R̄_j / Δ) δ_j    (29)
Note that the value of C^δ_j will automatically be integer since δ_j is integer and R̄_j is exactly divisible by Δ. This larger upper bound on the target-specific worst-case computation time can then be exploited to give a (specified) margin for error in either:

• The actual execution-time of λ_j at run-time compared to the specified value C_j, such that transient over-run of the activity can be tolerated;
• The worst-case computation time of a software component procured from some third party compared to the specified value C_j, such that failure of the supplier to meet the original specification can be tolerated to a limited extent.
The final target-independent response time R^δ_j for λ_j, given the original computation time budget C_j and the final bandwidth allocation due to the cyclic scheduling solution, can be stated as follows:

R^δ_j = ⌈C_j / δ_j⌉ Δ    (30)

Hence, the target-dependent response time R*_j for λ_j, given an actual target-specific computation time value C*_j ≤ C^δ_j, can be stated as follows:

R*_j = ⌈C*_j / δ_j⌉ Δ    (31)
Note that the scheme is exact, in the sense that the allocated bandwidth is both necessary and sufficient to meet true worst-case utilisation requirements, only under certain conditions. This is the case when there are nil effects from rounding in Equations (21) and (23) – the ‘sufficient but not necessary’ stages of the calculation. In the general case, the degree of inefficiency of the scheme is dependent upon the actual timing requirements of the activities.
Example Application of Basic Cyclic Scheme

The basic scheme can be applied to determine an RBA-compliant schedule by first selecting an appropriate value for the cycle time Δ. Then, for each activity λ_j (a short computational sketch of these steps follows the list):

• Determine the 'normalised' response time R̄_j;
• Determine the 'normalised' execution rate v̄_j;
• Determine the time δ_j for which the task must be executed in each cycle Δ;
• Derive the guaranteed response time R^δ_j;
• Derive the minimum run-time execution rate v^δ_j;
• Derive the guaranteed computation time C^δ_j.
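The steps can be reproduced with a few lines of code. The sketch below (illustrative only; not code from the chapter) applies Equations (21) to (24), (29) and (30) to a single activity and recovers the worked values for GAP task 7 quoted earlier; mapping it over the whole task set reproduces Tables 3 to 5.

```python
import math

def basic_cyclic(C: float, R: float, cycle: float):
    """Basic cyclic scheme for one activity, assuming cycle time Delta = cycle.

    Returns (R_bar, v_bar, delta_j, v_delta, C_delta, R_delta)."""
    R_bar = math.floor(R / cycle) * cycle          # (21) normalised response time
    v_bar = C / R_bar                              # (22) normalised execution rate
    delta_j = math.ceil(C * cycle / R_bar)         # (23) minimum per-cycle execution time
    v_delta = delta_j / cycle                      # (24) guaranteed minimum rate
    C_delta = (R_bar / cycle) * delta_j            # (29) allocated computation time
    R_delta = math.ceil(C / delta_j) * cycle       # (30) guaranteed response time
    return R_bar, v_bar, delta_j, v_delta, C_delta, R_delta

# GAP task 7 (Display Graphic): C = 9, R = 80, cycle time 25.
print(basic_cyclic(9, 80, 25))   # (75, 0.12, 3, 0.12, 9.0, 75)
```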
Observing the original schedule construction constraint Δ ≤ min_j R_j, assign Δ = 25. This leads to the solution given in Table 3, Table 4 and Table 5.

A number of observations can be made from these results. From Table 3, the sum of the initial execution rate parameters (v_j) corresponds exactly to the total utilisation requirement of the task set (83.51%). This arises since the worst-case response time of every task is equal to its minimum inter-arrival time. After defining a cycle time of Δ = 25, the sum of the rate parameters (v̄_j) corresponds to a total bandwidth allocation of 88.37%, a noticeable but reasonable increase compared to the true requirement. At the final stage of calculation, however, the need to provide integer values for the final rate parameters (v^δ_j) gives rise to a significant over-allocation of bandwidth due to the combination of rounding effects for the overall task set. The final bandwidth allocation is 120% and, hence, the extent of the over-allocation is sufficient to make the task set no longer schedulable on a single processor (by this scheme). The cyclic schedule has been constructed, however, to allow individual activities to be removed (or have their timing attributes changed) without affecting other activities in the schedule. Hence, it is straightforward to reduce the task set to one that is schedulable on a single processor by simply removing one or more activities (to be reallocated to another processor) until the final bandwidth allocation is less than 100%. The ability to manipulate the schedule in this manner is a considerable benefit in the context of engineering larger-scale real-time systems.

A counter effect of bandwidth over-allocation is an equivalent reduction in worst-case response times (R^δ_j) compared to the stated requirements (R_j), as can be seen in Table 4. For example, λ_15 has a final bandwidth allocation of 4% (equivalent to its execution rate of 0.04) compared to its stated requirement of 0.1%. The corresponding reduction in its worst-case response time is apparent in the final value of 25 compared to an original requirement of 1000.

The over-allocation of bandwidth is due to the restriction that every task is executed (for a duration δ_j) in every cycle Δ, as reflected in the final computation times (C^δ_j) given in Table 5. This restriction leads to a simpler (and more readily modifiable) scheduling solution but can be lifted to allow a more flexible scheme to be defined in favour of reducing the bandwidth over-allocation. Such a scheme is described and illustrated in the next section. Note that the basic scheme does not compromise the true timing requirements of the task set – there is no imposition of false iteration rates for the purposes of constructing a schedule (a criticism often levelled at cyclic scheduling solutions). Furthermore, the schedule is incrementally modifiable such that
schedulability can be maintained following activities being added, removed or modified by merely ensuring that the final bandwidth allocation is less than 100% (and that the choice of Δ is still suitable).
Bandwidth Server-Based Implementation Scheme

As suggested above, it is possible to reduce the bandwidth over-allocation associated with the basic cyclic implementation scheme by relaxing the constraint that every activity must be offered the chance to execute in every cycle. This gives rise to the cyclic bandwidth server scheme. The starting point is once again the selection of a cycle time Δ subject to the same constraint. Then define the server activity λ_S(δ_S, N_S) as a notional activity that is allocated δ_S execution time units in every Δ cycle but does not actually consume that allocation itself. Instead, the server offers the resource to other activities so that these can execute with an effective cycle time of N_S Δ. The total bandwidth of the server can then be used to execute a number of activities that individually have relatively low bandwidth requirements and that would otherwise be allocated a disproportionate amount of bandwidth by the basic scheme. Assuming that the server executes its allocated activities in a fixed order, an activity λ_j allocated to a λ_S(δ_S, N_S) server will execute for a duration δ_j spread over an interval of N_S Δ. The cyclic server exploits the fact that the basic scheme, and its analysis, does not require the execution time allocated to an activity within a scheduling cycle to be contiguous. The analysis associated with the cyclic server is, therefore, exactly analogous to that for the basic cyclic scheme but with Δ replaced by N_S Δ.
Table 3. Example - cyclic schedule implementation (execution rate parameters)

 j    U_j      v_j      v̄_j      v^δ_j
 1    0.0800   0.0800   0.0800   0.0800
 2    0.2000   0.2000   0.2000   0.2000
 3    0.0250   0.0250   0.0400   0.0400
 4    0.0600   0.0600   0.0600   0.0800
 5    0.1000   0.1000   0.1000   0.1200
 6    0.1356   0.1356   0.1600   0.1600
 7    0.1125   0.1125   0.1200   0.1200
 8    0.0250   0.0250   0.0267   0.0400
 9    0.0500   0.0500   0.0500   0.0800
10    0.0050   0.0050   0.0050   0.0400
11    0.0150   0.0150   0.0150   0.0400
12    0.0050   0.0050   0.0050   0.0400
13    0.0050   0.0050   0.0050   0.0400
14    0.0150   0.0150   0.0150   0.0400
15    0.0010   0.0010   0.0010   0.0400
16    0.0010   0.0010   0.0010   0.0400
 Σ    0.8351   0.8351   0.8837   1.2000
Table 4. Example - cyclic schedule implementation (response time parameters)

 j    R_j    R̄_j    R^δ_j
 1     25     25     25
 2     25     25     25
 3     40     25     25
 4     50     50     50
 5     50     50     50
 6     59     50     50
 7     80     75     75
 8     80     75     50
 9    100    100     75
10    200    200     25
11    200    200     75
12    200    200     25
13    200    200     25
14    200    200     75
15   1000   1000     25
16   1000   1000     25
Table 5. Example - cyclic schedule implementation (computation time parameters)

 j    δ_j    C_j    C^δ_j
 1     2      2      2
 2     5      5      5
 3     1      1      1
 4     2      3      4
 5     3      5      6
 6     4      8      8
 7     3      9      9
 8     1      2      3
 9     2      5      8
10     1      1      8
11     1      3      8
12     1      1      8
13     1      1      8
14     1      3      8
15     1      1     40
16     1      1     40
Hence, the derivation of δ_j for an activity λ_j executed via a cyclic server λ_S(δ_S, N_S) is given by Equations (21) to (23) with Δ replaced by N_S Δ. Similarly, Equations (24), (29) and (30) can be applied with Δ replaced by N_S Δ to determine the final rate, computation time and response time values, respectively. For this reason, the cyclic server method is actually a generalisation of the basic cyclic scheme described previously, where multiple cycle times are supported. For the general case of activity execution via a cyclic server λ_S(δ_S, N_S), the expression for determining the normalised response time for λ_j is adapted as follows:

R̄_j = ⌊R_j / (N_S Δ)⌋ N_S Δ    (32)

The corresponding normalised execution rate v̄_j is then found as before by Equation (22). The minimum execution time δ_j per interval N_S Δ that will guarantee λ_j to meet its normalised response time requirement R̄_j (and therefore its true requirement R_j) is given by:

δ_j = ⌈v̄_j N_S Δ⌉    (33)

Allocating δ_j execution time units per interval N_S Δ in the final schedule means that each activity λ_j will be executed at a guaranteed minimum rate v^δ_j as follows:

v^δ_j = δ_j / (N_S Δ)    (34)

The allocated computation time C^δ_j over a time interval of duration R̄_j is as follows:

C^δ_j = (R̄_j / (N_S Δ)) δ_j    (35)

The final target-independent response time R^δ_j for λ_j, given the original computation time budget C_j and the final bandwidth allocation due to the cyclic scheduling solution, is given as follows:

R^δ_j = ⌈C_j / δ_j⌉ N_S Δ    (36)
To illustrate by example, consider activity λ_15 from the GAP case study (which was shown above to suffer a factor of 40 bandwidth over-allocation when the basic scheme is applied). For example, a cyclic server λ_S(1,8) allocated to serve activity λ_15 leads to the following results from successive application of Equations (32), (22), (33) and (34): R̄_15 = 1000, v̄_15 = 0.001, δ_15 = 1 and v^δ_15 = 0.005. This represents a factor of 5 over-allocation, a significant improvement compared to the basic scheme but still quite poor, although the remaining server capacity could be used to service further activities. The use of a server λ_S(1,40) dedicated to λ_15 would be required to give an exact allocation for the single activity alone.
condition when one or more activities are executed via servers. A sufficient test can be produced by replacing the combined execution rates of the activities executed via servers with the total capacities of the corresponding servers, where the total capacity vS of a server λS(δS,NS) is given by adaptation of Equation (24):
vS = δS / Δ    (37)
A simple test for feasible allocation of server bandwidth is given as follows:

Σ(k=1..m) vkd ≤ vS    (38)
where vkd denotes the final execution rate of each of the m activities λk allocated to the server. For a server λS(δS,NS) with period NS and set of allocated activities {λk; k = 1,..,nS}, the value of δS can be derived from the set of activity execution times {δk in NSΔ; k = 1,..,nS}:

δS = ⌈(Σ(k=1..nS) δk) / NS⌉    (39)
Observing that the total time required to execute non-server-based activities in the basic cyclic schedule is equivalent to that of a cyclic server λS(δS,1), referred to as the base level server, the final bandwidth requirement Ψ is given as follows:
Ψ = (1/Δ) Σ(i=1..nS) δiS    (40)
given the set of servers {λiS(δiS, NiS); i = 1,..,nS} that includes the base level server. This expression is sufficient-but-not-necessary since, depending on the actual server periods and utilisation figures, the bandwidth requirements of a lower rate server could, in practice, be absorbed within the spare capacity of a higher rate server. In such cases, the bandwidth requirements of a lower rate server can be effectively eliminated from the total bandwidth calculation (as illustrated by example later in the chapter).
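As a concrete illustration of the calculations above, the following sketch evaluates Equations (32) to (36) for a single activity allocated to a cyclic server. It is not taken from the RBA tool set; the class and parameter names are assumptions, and the formulas follow the reconstructed equations as given above. Running it for activity λ15 under a λS(1,8) server with Δ = 25 reproduces the values quoted in the λ15 example above (normalised response time 1000, δ15 = 1, final rate 0.005).

// A minimal sketch (not from the original chapter) of the cyclic server
// calculations as reconstructed in Equations (32)-(36). Variable names
// (delta, cycle, nS) are illustrative assumptions.
final class CyclicServerAllocation {

    /** Result of allocating one activity to a cyclic server. */
    record Allocation(double normalisedResponse, int delta, double finalRate, double finalResponse) {}

    /**
     * @param C     original computation time budget Cj
     * @param R     required worst-case response time Rj
     * @param cycle basic cycle time (Delta)
     * @param nS    server multiple NS, so the effective cycle is nS * cycle
     */
    static Allocation allocate(double C, double R, double cycle, int nS) {
        double period = nS * cycle;                       // effective cycle NS*Delta
        double rBar   = Math.floor(R / period) * period;  // Eq. (32): normalised response time
        double vBar   = C / rBar;                         // Eq. (22): normalised execution rate
        int    delta  = (int) Math.ceil(vBar * period);   // Eq. (33): execution time per NS*Delta
        double vFinal = delta / period;                   // Eq. (34): guaranteed minimum rate
        double rFinal = Math.ceil(C / delta) * period;    // Eq. (36): final response time
        return new Allocation(rBar, delta, vFinal, rFinal);
    }

    public static void main(String[] args) {
        // Activity lambda-15 of the GAP example served by lambda-S(1,8), Delta = 25:
        // expected: normalised response 1000, delta = 1, final rate 0.005.
        System.out.println(allocate(1, 1000, 25, 8));
    }
}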
Example Application of Server-Based Scheme This example illustrates the use of the cyclic server method to improve bandwidth allocation compared to the basic cyclic implementation scheme. Assuming the same basic cycle time Δ=25, define a server λS(2,40) to execute the low utilisation activities {λ10, …, λ16}. Table 6 shows the improved results under this scheme (the values of other parameters not shown in the table are the same as before under the basic cyclic scheme).
The total capacity of the server λS(2,40) is given by Equation (37) as: vS = 0.08. Hence, 92% of the total processor capacity is available for non-server-based activities {λ1, …, λ9} and 8% for server-based activities {λ10, …, λ16}. So, whilst the total bandwidth allocation is more efficient than for the basic scheme - 97.5% compared to 120%, this is not sufficient to guarantee feasibility on a single processor – it is also necessary to show separately that activities {λ1, …, λ9} can be executed within their 92% allocation and that activities {λ10, …, λ16} can be executed within their 8% allocation. From Table 6, the combined allocation for activities {λ1, …, λ9} turns out to be exactly 92% and the combined allocation for activities {λ10, …, λ16} is 5.5%. Hence, the complete set of activities is schedulable on a single processor under this scheme. The improved efficiency of this scheme is also reflected in the increased number of activities that have been allocated the exact bandwidth to meet their requirements – 10 out of the 16 activities now, compared to only 4 previously.
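To make the feasibility argument concrete, the short sketch below applies Equations (37) to (39) to the figures used in this example. It is an illustrative check rather than part of the original framework; the helper names are assumptions, and the rates passed in for the server-based activities are the final allocations quoted above (totalling 5.5%).

// Hypothetical helper (not from the chapter) checking the server sizing used in the
// example: a server lambda-S(deltaS, NS) has capacity deltaS/Delta (Eq. 37), which must
// cover the final rates of the activities it serves (Eq. 38); Eq. (39) sizes deltaS
// from the per-period execution times of those activities.
final class ServerFeasibility {

    static double serverCapacity(int deltaS, double cycle) {        // Eq. (37)
        return deltaS / cycle;
    }

    static boolean ratesFit(double[] finalRates, double capacity) { // Eq. (38)
        double sum = 0.0;
        for (double v : finalRates) sum += v;
        return sum <= capacity;
    }

    static int requiredDeltaS(int[] deltasPerPeriod, int nS) {      // Eq. (39)
        int total = 0;
        for (int d : deltasPerPeriod) total += d;
        return (int) Math.ceil((double) total / nS);
    }

    public static void main(String[] args) {
        double capacity = serverCapacity(2, 25);   // lambda-S(2,40) with Delta = 25 -> 0.08
        double[] serverRates = {0.0050, 0.0150, 0.0050, 0.0050, 0.0150, 0.0050, 0.0050};
        System.out.println(capacity);                        // 0.08
        System.out.println(ratesFit(serverRates, capacity)); // true (sum = 0.055 <= 0.08)
    }
}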
Introducing Priorities to Improve Resource Bandwidth Allocation and System Responsiveness It is now shown that the RBA rate-based execution model and cyclic implementation scheme can co-exist alongside a static priority-based scheduling regime to provide a flexible three-tier run-time execution model as follows:
• High priority activities that execute according to a static priority-based regime;
• RBA-compliant activities that execute according to the cyclic RBA implementation scheme, subject to interference from the set of high priority activities;
• Low priority activities that execute according to a static priority-based regime, subject to interference from the set of high priority activities and the set of RBA-compliant activities.
The motivation for this combined scheme is two-fold. Firstly, the high priority ‘band’ can be used to schedule activities with short response requirements compared to their minimum inter-arrival time without incurring bandwidth over-allocation. Secondly, the low priority band can be used to execute activities in the bandwidth that is over-allocated by the RBA cyclic/server scheme, plus any remaining capacity of the resource, thus reclaiming such bandwidth.
Introducing High Priority Activities for Improved Responsiveness Activities executed in the high priority band will execute according to a static priority-based scheduling regime in accordance with their relative priorities and always in preference to activities in the RBA band. These activities can be verified by static priority-based response time analysis given in (Audsley, 1993). The rate-based execution model and cyclic implementation schemes must be extended, however, to cater for interference effects due to the execution of high priority activities. The rate-based execution model can be adapted to recognise interference effects using an analogous approach to that for response time analysis for static priority-based scheduling. The solution is simply to add a worst-case interference time to the actual worst-case response time or, analogously, to subtract the interference delay from the required worst-case response time (deadline). The required minimum execution rate for an activity λj subjected to worst-case interference Ij is thus stated as follows:
Table 6. Example - cyclic server implementation (improved allocation for {λ10, …, λ16})

j     vjd      Rjd    Cj    Cjd
1     0.0800   25     2     2
2     0.2000   25     5     5
3     0.0400   25     1     1
4     0.0800   50     4     2
5     0.1200   50     6     3
6     0.1600   50     8     4
7     0.1200   75     9     3
8     0.0400   50     3     1
9     0.0800   75     8     2
10    0.0050   200    8     8
11    0.0150   200    8     1
12    0.0050   200    8     1
13    0.0050   200    8     1
14    0.0150   200    8     1
15    0.0050   200    40    1
16    0.0050   200    40    1
Σ     0.9750
vj = Cj / (Rj − Ij)    (41)
For an activity λj that shares a resource with a set of high priority activities {λk; k = 1,..,nH}, Ij equates to the interference term stated in the response time expression for static priority-based scheduling. Hence, given a set of timing attributes for the set of high priority activities, Ij can be determined as follows:

Ij = Σ(k=1..nH) ⌈(Rj + Jkin) / Tk⌉ Ck    (42)
Introducing high priority activities and interference into the cyclic implementation scheme has two effects (the scheme is otherwise unchanged). Firstly, the initial assignment of Δ is now subject to the constraint Δ ≤ minj(Rj − Ij). Secondly, for the general case of activity execution via a cyclic server
λS(δS,NS), the expression for determining the normalised response time for λj, Equation (32), is adapted as follows:
R̄j = ⌊(Rj − Ij) / (NSΔ)⌋ NSΔ    (43)
Introducing Low Priority Activities for Improved Bandwidth Allocation The problem of bandwidth over-allocation has been highlighted in the series of examples given earlier. This problem occurs in the target-independent RBA rate-based execution model due to the calculation of activity execution rates based on response time requirements rather than minimum inter-arrival times. The problem is then compounded in the cyclic implementation scheme due to the need for a common cycle time and integer execution times within this cycle time (or some multiple of the cycle time when cyclic servers are used). This motivates the consideration of bandwidth reclamation via the execution of activities outside the RBA scheme according to a priority-based regime. This new set of priority-based activities is referred to as 'low priority' since none of these can pre-empt any RBA activity nor any high priority activity. The RBA cyclic scheme itself does not actually require modification. The low priority activities can be guaranteed (or rejected) by adapting the response time analysis for static priority-based scheduling as shown below. Given a set of high priority activities {λk; k = 1,..,nH}, a set of RBA activities {λj; j = 1,..,n} and a set of low priority activities {λi; i = 1,..,nL}, the following response time can be stated for a given low priority activity λl:

Rl = Cl + Σ(k=1..nH) ⌈(Rl + Jkin) / Tk⌉ Ck + Σ(j=1..n) ⌈(Rl + Jjin) / Tj⌉ Cj + Σ(i=1..l−1) ⌈(Rl + Jiin) / Ti⌉ Ci    (44)
Note that the difference between this expression and the response time analysis for static priority-based scheduling given in (Audsley, 1993) is merely notational: the interference term is decomposed into three 'bands' to reflect the composite nature of the scheme.
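The recurrence in Equation (44) has Rl on both sides and is therefore solved iteratively, exactly as in conventional static priority response time analysis. The sketch below is an illustrative evaluation of that fixed-point iteration; it is not part of the RBA tool set, and the class, record and parameter names are assumptions.

// Illustrative fixed-point evaluation of Equation (44) for one low priority activity.
// Each interfering activity (high priority band, RBA band, or higher-priority low
// priority band) contributes ceil((R + Jin) / T) * C of interference; the iteration
// stops at a fixed point or when the deadline is exceeded (infeasible).
final class LowPriorityResponseTime {

    /** Timing attributes of one interfering activity: period T, input jitter Jin, computation C. */
    record Interferer(double period, double jitter, double comp) {}

    /**
     * @param C           computation time of the low priority activity under analysis
     * @param deadline    its deadline Dl (used as the divergence bound)
     * @param interferers all activities in the three higher bands of Equation (44)
     * @return the worst-case response time, or -1 if it exceeds the deadline
     */
    static double responseTime(double C, double deadline, Interferer[] interferers) {
        double r = C;
        while (true) {
            double next = C;
            for (Interferer a : interferers) {
                next += Math.ceil((r + a.jitter()) / a.period()) * a.comp();
            }
            if (next == r) return r;          // fixed point reached
            if (next > deadline) return -1;   // infeasible
            r = next;
        }
    }
}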
Example of Reclaiming Over-Allocated Bandwidth Consider the GAP task set extended by the introduction of a set of low priority activities subject to deadline monotonic priority assignment (Table 7). Equation (44) then gives the response times shown in Table 8. All low priority activities are thus feasible since all response times are less than their corresponding deadlines. The total bandwidth requirement for the set of low priority activities is 12.8%. This has effectively been reclaimed from the over-allocated bandwidth for the set of RBA activities, whose true requirement is 83.5% but final allocation is 97.5% (or, including spare server capacity, exactly 100%).
Table 7. Example - low priority timing attributes

i     Ci    Ti     Di    Ui
1     3     200    150   0.015
2     8     200    180   0.04
3     25    500    400   0.05
4     4     500    450   0.008
5     15    1000   800   0.015
Table 8. Example - low priority response times

i     Ri    Feasible?
1     140   Yes
2     148   Yes
3     384   Yes
4     388   Yes
5     789   Yes
RELATED WORK A number of scheduling schemes that support bandwidth-based (or, analogously, rate-based) expression of timing and resource requirements have previously been proposed for multimedia applications. These schemes offer a degree of abstraction from the target platform in the way that requirements are specified but are invariably aimed at dynamic applications and generally require the use of dynamic earliest-deadline-first (EDF) scheduling at run-time. Examples of such schemes include generalised processor sharing (GPS) (Parekh, 1994), virtual clock (Yau, 1996), constant utilisation server (Deng, 1999) and weighted fair queuing (Demers, 1989). Due to the reliance on EDF, however, the final bandwidth allocation (or execution rate) granted to each 'task' is dependent on the actual degree of competition for resources at run-time – as the total demand on a resource increases, the bandwidth reserved for a given task will decrease in absolute terms. Such solutions are more accurately referred to as proportional share methods than bandwidth reservation methods and are not suitable for dependable applications that require a priori performance guarantees. See (Grigg, 2002) for a comprehensive survey of related work.
SUMMARY RBA provides a target-independent timing analysis framework for application during the definition and decomposition stages of real-time system development, based on an abstract representation of target system processing and communication resources. Application of the abstract model provides a set of best-case and worst-case timing ‘guarantees’ that will be delivered subject to a set of scheduling ‘obligations’ being met by the target system implementation. An abstract scheduling model, known as the rate-based execution model then provides an implementation reference model with which compliance will ensure that the imposed set of timing obligations will be met by the target system. The end-to-end timing properties of the system are captured, decomposed and analysed in terms of real-time transactions. The transaction model is hierarchical, in the form of an acyclic, directed, nested graph, capturing an evolving system definition during development. The leaf nodes of the graph capture the concurrent processing and communication elements within the transaction, termed activities; non-leaf nodes are referred to as nested transactions. The edges of the graph capture the precedence and nesting relationships within the transaction. The parameters via which timing behaviour is represented and observed are the same for a single activity, a group of related activities, a nested transaction and a system level transaction, thus providing a highly composable and scalable model of real-time system performance.
End-to-end delays and jitter are determined by a depth-first traversal of each transaction graph, accounting for activity level delays, precedence relationships and nesting relationships. In the earlier stages of system development, activity level delays can be specified directly in the form of budgets. Later in development, these delays can be determined via some form of localised timing analysis model. When the target platform implementation details are finally fixed, these delays can be verified. A number of further developments of the RBA framework and implementation schemes are being investigated. This includes extending the cyclic server implementation scheme to support ‘nested’ or ‘hierarchical’ bandwidth servers as a means of further reducing the extent of bandwidth over-allocation. Other work is beginning to investigate RBA-compliant support for scheduling communication network resources, initially focusing on ATM networks for future avionics applications. Work is also underway to develop RBA process and tool support for technology transfer into the sponsoring customer’s organization. Tool support is being implemented as an extension to the customer’s software design environment rather than as a separate standalone tool.
REFERENCES
Audsley, N. C., Burns, A., Richardson, M. F., Tindell, K., & Wellings, A. (1993). Applying New Scheduling Theory to Static Priority Pre-emptive Scheduling. Software Engineering Journal, 8(5).
Demers, A., Keshav, S., & Shenker, S. (1989). Analysis and Simulation of a Fair Queuing Algorithm. Proceedings of ACM SIGCOMM.
Deng, Z., Liu, J.W.S., Zhang, L., Mouna, S., & Frei, A. (1999). An Open Environment for Real-Time Applications. Real-Time Systems Journal, 16(2/3).
Grigg, A. (2002). Reservation-Based Timing Analysis – A Partitioned Timing Analysis Model for Distributed Real-Time Systems (YCST-2002-10). York, UK: University of York, Dept. of Computer Science.
Locke, C. D., Vogel, D. R., & Mesler, T. J. (1991). Building A Predictable Avionics Platform in Ada. In Proceedings of IEEE Real-Time Systems Symposium.
Parekh, A. K., & Gallager, R. G. (1994). A Generalised Processor Sharing Approach to Flow Control in Integrated Services Networks. IEEE Transactions on Networking, 2(2).
Yau, D. K. Y., & Lam, S. S. (1996). Adaptive Rate-Controlled Scheduling for Multimedia Applications. In Proceedings of ACM Multimedia Conference.
ENDNOTE
The assignment can easily be shown to be unique by inspection of Equations (3), (2) and (1).
Chapter 27
Scalable Algorithms for Server Allocation in Infostations Alan A. Bertossi University of Bologna, Italy M. Cristina Pinotti University of Perugia, Italy Romeo Rizzi University of Udine, Italy Phalguni Gupta Indian Institute of Technology Kanpur, India
ABSTRACT The server allocation problem arises in isolated infostations, where mobile users going through the coverage area require immediate high-bit rate communications such as web surfing, file transferring, voice messaging, email and fax. Given a set of service requests, each characterized by a temporal interval and a category, an integer k, and an integer hc for each category c, the problem consists in assigning a server to each request in such a way that at most k mutually simultaneous requests are assigned to the same server at the same time, out of which at most hc are of category c, and the minimum number of servers is used. Since this problem is computationally intractable, a scalable 2-approximation online algorithm is exhibited. Generalizations of the problem are considered, which contain bin-packing, multiprocessor scheduling, and interval graph coloring as special cases, and admit scalable on-line algorithms providing constant approximations.
INTRODUCTION An infostation is an isolated pocket area with small coverage (about a hundred of meters) of high bandwidth connectivity (at least a megabit per second) that collects information requests of mobile users DOI: 10.4018/978-1-60566-661-7.ch027
Table 1. Examples of actual time intervals to serve different kinds of requests

Category                           Size (kbps)   Time (s) – low rate   Time (s) – high rate
FTP download                       10000         100                   10
Video stream                       5000          50                    5
Audio stream, E-mail attachment    512           5                     0.5
E-mail, Web browsing               64            0.6                   0.06
and delivers data while users are going through the coverage area. The available bandwidth usually depends on the distance between the mobile user and the center of the coverage area: increasing with decreasing distance. An infostation represents a way in the current generation of mobile communication technology for supporting at many-time many-where high-speed and high-quality services of various categories, like web surfing, file transferring, video messaging, emails and fax. It has been introduced to reduce the cost per bit on wireless communications, and hence to encourage the exchange of ever increasing volumes of information. Infostations are located along roadways, at airports, in campuses, and they provide access ports to Internet and/or access to services managed locally (Goodman, Borras, Mandayam, & Yates, 1997; Wu, Chu, Wine, Evans, & Frenkiel, 1999; Zander, 2000; Jayram, Kimbrel, Krauthgamer, Schieber, & Sviridenko, 2001). It is desirable that the infostation be resource scalable, that is able to easily expand and contract its resource pool to accomodate a heavier or lighter load in terms of number and kind of users, and/or category of services. Indeed, the mobile user connection lasts for a temporal interval, which starts when the user first senses the infostation’s presence and finishes when it leaves the coverage area. Depending on the mobility options, three kinds of users are characterized: drive-through, walk-through, and sit-through. According to the mobility options, the response time must be immediate for drive-through, slightly delayed for walk-through, and delayed for sit-through. In general, several communication paradigms are possible: communications can be either broadcast or dedicated to a single user, data can be locally provided or retrieved from a remote gateway, and the bit-rate transmission can be fixed or variable, depending on the infostation model and on the mobility kind of the user. Each mobile user going through the infostation may require a data service out of a finite set of possible service categories available. The admission control, i.e., the task of deciding whether or not a certain request will be admitted, is essential. In fact, a user going through an infostation to obtain a (toll) service is not disposed to have its request delayed or refused. Hence, the service dropping probability must be kept as low as possible. For this purpose, many admission control and bandwidth allocation schemes for infostations maintain a pool of servers so that when a request arrives it is immediately and irrevocably assigned to a server thus clearing the service dropping probability. Precisely, once a request is admitted, the infostation assigns a temporal interval and a proper bandwidth for serving the request, depending on the service category, on the size of the data required and on the mobility kind of the user, as shown in Table 1 for a sample of requests with their actual parameters. Moreover, the infostation decides whether the request may be served locally or through a remote gateway. In both cases, a server is allocated on demand to the request during the assigned temporal interval. The request is immediately assigned to its server without knowing the future, namely with no knowledge of the next request. Requests are thus served on-line, that is in an ongoing manner as they become available. Each server, selected out of the predefined server pool, may serve more than one request simultane-
ously but it is subject to some architecture constraints. For example, no more than k requests could be served simultaneously by a local server supporting k infrared channels or by a gateway server connected to k infostations. Similarly, no more than h services of the same category can be delivered simultaneously due to access constraints on the original data, such as software licenses, limited on-line subscriptions and private access. This chapter considers the infostation equipped with a large pool of servers, and concentrates on the server allocation problem where one has to determine how many servers must be reserved to on-line satisfy the requests of drive-through users, so that the temporal, architectural and data constraints are not violated. In particular, it is assumed that the isolated infostation controls in a centralized way all the decisions regarding the server allocation. Moreover, the pool of servers of the infostation is localized in the center of the coverage area, and therefore the distance from a mobile user and any server in the pool is the same. In other words, all the servers are equivalent to serve a mobile user, independent of the user proximity. In details, a service request r will be modeled by a service category cr and a temporal interval Ir = [sr, er) with starting time sr and ending time er. Two requests are simultaneous if their temporal intervals overlap. The input of the problem consists of a set R of service requests, a bound k on the number of mutually simultaneous requests to be served by the same server at the same time, and a set C of service categories with each category c characterized by a bound hc. The output is a mapping from the requests in R to the servers that uses the minimum possible number of servers to assign all the requests in R subject to the constraints that the same server receives at most k mutually simultaneous requests at the same time (k-constraint), out of which at most hc are of category c (h-constraint). In this chapter, we refer to this problem as the Server Allocation with Bounded Simultaneous Requests (Bertossi, Pinotti, Rizzi, & Gupta, 2004). It is worthy to note that, equating servers with bins, and requests with items, the above problem is similar to a generalization of Bin-Packing, known as Dynamic Bin-Packing (Coffman, Galambos, Martello, & Vigo, 1999), where in addition to size constraints on the bins, the items are characterized by an arrival and a departure time, and repacking of already packed items is allowed each time a new item arrives. The problem considered in this chapter, in contrast, does not allow repacking and has capacity constraints also on the bin size for each category. Furthermore, equating servers with processors and requests with tasks, the above problem becomes a generalization of deterministic multiprocessor scheduling with task release times and deadlines (Lawler & Lenstra, 1993) where in addition each processor can execute more than one task at the same time, according to the k-constraints and h-constraints. Moreover, equating servers with colors and requests with intervals, our problem is a generalization of the classical interval graph coloring (Golumbic, 1980), but with the additional k-constraints and h-constraints. Another generalization of interval graph coloring has been introduced for modelling a problem involving an optical line system (Winkler & Zhang, 2003), which reduces to ours where only the k-constraint is considered. 
Finally, a weighted generalization of interval coloring has been introduced (Adamy & Erlebach, 2004) where there is only the k-constraint, namely, where each interval has a weight in [0,1] and the sum of the weights of the overlapping intervals which are colored the same cannot exceed 1. Further generalizations of such a weighted version were also considered (Bertossi, Pinotti, Rizzi, & Gupta, 2004). This chapter surveys the complexity results as well as the main scalable on-line algorithms for the Server Allocation with Bounded Simultaneous Requests problem, which are published in the literature (Adamy & Erlebach, 2004; Winkler & Zhang, 2003; Bertossi, Pinotti, Rizzi, & Gupta, 2004). Briefly, the rest of this chapter is structured as follows. The first section shows that the Server Allocation with
Bounded Simultaneous Requests problem is computationally intractable and therefore a solution using the minimum number of servers cannot be found in polynomial time. The second section deals with α-approximation algorithms, that is polynomial time algorithms that provide solutions which are guaranteed to never be greater than α times the optimal solutions. In particular, a 2-approximation on-line algorithm is exhibited, which asymptotically gives a (2 – h/k)-approximation, where h is the minimum among all the hc’s. Finally, a generalization of the problem is considered in the third section, where each request r is also characterized by an integer bandwidth rate wr, and the bounds on the number of simultaneous requests to be served by the same server are replaced by bounds on the sum of the bandwidth rates of the simultaneous requests assigned to the same server. For this problem, on-line scalable algorithms are illustrated which give a constant approximation.
COMPUTATIONAL INTRACTABILITY The Server Allocation with Bounded Simultaneous Requests problem on a set R = {r1,...,rn} of requests can be formulated as a coloring problem on the corresponding set I = {I1,…,In} of temporal intervals. Indeed, equating servers with colors, the original server allocation problem is equivalent to the following coloring problem: Problem 1 (Interval Coloring with Bounded Overlapping). Given a set I of intervals each belonging to a category, an integer k, and an integer hc £ k for each category c, assign a color to each interval in such a way that at most k mutually overlapping intervals receive the same color (k-constraint), at most hc mutually overlapping intervals all having category c receive the same color (h-constraint), and the minimum number of colors is used. To prove that Problem 1 is computationally intractable, the following simplified decisional formulation of Problem 1 was considered, where |C| = 4, k = 2, and hc = 1 for each category c. Problem 2 (Interval Coloring with Bounded Overlapping and Four Categories). Given a set I of intervals each belonging to one of four categories, and an integer b, decide whether b colors are enough to assign a color to each interval in such a way that at most two mutually overlapping intervals receive the same color and no two overlapping intervals with the same category receive the same color. In (Bertossi, Pinotti, Rizzi, & Gupta, 2004), Problem 2 was proved to be NP-complete by exhibiting a polynomial time reduction from the 3-Satisfiability (3SAT) problem, a well-known NP-complete problem (Garey & Johnson, 1979): Problem 3 (3SAT). Given a boolean formula B in conjunctive normal form, i.e. as a product of clauses, over a set U of boolean variables, such that each clause is the sum of exactly 3 literals, i.e. direct or negated variables, decide whether there exists a truth assignment for U which satisfies B. Theorem 1. Interval Coloring with Bounded Overlapping and Four Categories is NP-complete. By the above result, Problem 2, and hence the Server Allocation with Bounded Simultaneous Requests problem, is computationally intractable. Therefore, one is forced to abandon the search for fast algorithms that find optimal solutions. Thus, one can devise fast algorithms that provide sub-optimal solutions which are fairly close to optimal. This strategy is followed in the next section, where a scalable polynomial-time approximation algorithm is exhibited for providing sub-optimal solutions that will never differ from the optimal solution by more than a specified percentage. Moreover, further negative results have been proved (Bertossi, Pinotti, Rizzi, & Gupta, 2004). Assume that the intervals in I arrive one by one, and are indexed by non-decreasing starting times. When
an interval Ii arrives, it is immediately and irrevocably colored, and the next interval Ii+1 becomes known only after Ii has been colored. If multiple intervals arrive at the same time, then they are colored in any order. An algorithm that works in such an ongoing manner is said on-line (Karp, 1992). On-line algorithms are opposed to off-line algorithms, where the intervals are not colored as they become available, but they are all colored only after the entire sequence I of intervals is known. While Theorem 1 shows that Problem 1 is computationally intractable even if there are only four categories, k = 2, and hc = 1 for each category, the following result shows also that there is no optimal on-line algorithm even when the number of categories becomes three. Theorem 2. There is no optimal on-line algorithm for the Interval Coloring with Bounded Overlapping problem even if there are only 3 categories, k = 2, and h1 = h2 = h3 = 1.
ALGORITHM FOR INTERVAL COLORING WITH BOUNDED OVERLAPPING Since there are no fast algorithms that find optimal solutions for Problem 1, on-line algorithms providing sub-optimal solutions are considered. An α-approximation algorithm for a minimization problem is a polynomial-time algorithm producing a solution of value appr(x) on input x such that, for all the inputs x, appr(x) ≤ α * opt(x), where opt(x) is the value of the optimal solution on x. In other words, the approximate solution is guaranteed to never be greater than α times the optimal solution (Garey & Johnson, 1979). For the sake of simplicity, from now on, appr(x) and opt(x) will be simply denoted by appr and opt, respectively. A simple polynomial-time on-line algorithm for the Interval Coloring with Bounded Overlapping problem can be designed based on the following greedy strategy:
Algorithm Greedy(Ii): Color Ii with any already used color which does not violate the k-constraints and the h-constraints. If no color can be reused, then use a brand new color.
Theorem 3. Algorithm Greedy provides a 2-approximation for the Interval Coloring with Bounded Overlapping problem.
Proof. Let appr = φ be the solution given by the algorithm and assume that the colors 1,…, φ have been introduced in this order. Let Ir = [sr, er) be the first interval colored φ. Let Ω1 be the set of intervals in I containing sr and let Ω2 be the set of intervals in I containing sr whose category is cr. Clearly, Ω2 is contained in Ω1. Let ω1 and ω2 be the cardinalities of Ω1 and Ω2, respectively. Clearly, opt ≥ ⌈ω1/k⌉ and opt ≥ ⌈ω2/hcr⌉. Color φ was introduced to color Ir because, for every 1 ≤ γ ≤ φ − 1, at least one of the following two conditions held:
1. Exactly k intervals in Ω1 have color γ;
2. Exactly hcr intervals in Ω2 have color γ.
For i = 1 and 2, let ni be the number of colors in {1,…, φ − 1} for which Condition i holds (if for a color both conditions hold, then choose one of them arbitrarily). Hence, n1 + n2 = φ − 1 or, equivalently,
appr = φ = n1 + n2 + 1. Clearly, ω1 ≥ k·n1 + hcr·n2 + 1 and ω2 ≥ hcr·n2 + 1. Therefore:

opt ≥ max{⌈ω1/k⌉, ⌈ω2/hcr⌉} ≥ max{⌈(k·n1 + hcr·n2 + 1)/k⌉, ⌈(hcr·n2 + 1)/hcr⌉} ≥ max{n1 + (h/k)·n2, n2 + 1}

where h = min{h1,…,h|C|}.
If n2 + 1 ≥ n1 + (h/k)·n2, then n1 ≤ (1 − h/k)·n2 + 1 and:

appr/opt ≤ (n1 + n2 + 1)/(n2 + 1) ≤ ((1 − h/k)·n2 + n2 + 2)/(n2 + 1) = 2 − (h/k)·n2/(n2 + 1) ≤ 2.

If n2 + 1 ≤ n1 + (h/k)·n2, then:

appr/opt ≤ (n1 + n2 + 1)/(n1 + (h/k)·n2) ≤ (n1 + n1 + (h/k)·n2)/(n1 + (h/k)·n2) = 1 + n1/(n1 + (h/k)·n2) ≤ 2.
Therefore, Algorithm Greedy gives a 2-approximation. QED Actually, a stronger result has been proved (Bertossi, Pinotti, Rizzi, & Gupta, 2004): Theorem 4. Algorithm Greedy asymptotically provides a (2 – h/k)-approximation for the Interval Coloring with Bounded Overlapping problem, where h = min{h1,…, h|C|}. Moreover, such an asymptotic bound is the best possible, even in the very special case that h = 1, k = 2, and no interval contains another interval: Theorem 5. Algorithm Greedy admits no α-approximation with α < 2 – 1/k for the Interval Coloring with Bounded Categories problem, even if min{h1,…, h|C|} = 1, k = 2, and no interval is properly contained within another interval. Finally, the result below shows that the Greedy algorithm is optimal in some special cases. Theorem 6. Algorithm Greedy is optimal for the Interval Coloring with Bounded Overlapping problem when either •
åh c ÎC
•
650
c
£ k , or
hc = k for all c ϵ C.
Proof. Let φ be the solution given by the Greedy algorithm and assume without loss of generality that φ ≥ 2, since otherwise the solution is trivially optimal. As in the proof of Theorem 3, let Ir = [sr, er) be the first interval colored φ, let Ω1 be the set of intervals in I containing sr, let Ω2 be the set of intervals in I containing sr with category cr, and let ω1 = |Ω1| and ω2 = |Ω2|. Recall that φ was introduced to color Ir because, for every 1 ≤ γ ≤ φ − 1, at least one of the following two conditions held:
1. Exactly k intervals in Ω1 have color γ;
2. Exactly hcr intervals in Ω2 have color γ.
When Σ(c∈C) hc ≤ k, it is easy to see that if Condition 1 is true for any color γ then Condition 2 is also true. Indeed, by hypothesis, the only way to exhaust a color is to have exactly hcr intervals of category cr all colored γ. Therefore, ω2 ≥ (φ − 1)·hcr + 1 and opt ≥ ⌈ω2/hcr⌉ = φ.
When hc = k for all c ∈ C, it is easy to see that any γ cannot be reused only if Condition 1 is true. Thus, ω1 ≥ (φ − 1)·k + 1 and opt ≥ ⌈ω1/k⌉ = φ.
In conclusion, in both cases the Greedy algorithm provides the optimal solution. QED
Note that, in Theorem 6, when hc = k for all c ∈ C, the h-constraint is redundant, since it is dominated by the k-constraint. When hc = k = 1 for all c ∈ C, the Greedy algorithm reduces to the well-known optimal algorithm for coloring interval graphs (Golumbic, 1980). Moreover, when hc = k for all c ∈ C and k > 1, Problem 1 is the same as the Generalized Interval Graph Coloring problem (Winkler & Zhang, 2003). As regards the time complexity of the Greedy algorithm, the following result holds:
Theorem 7. Algorithm Greedy requires O(1) time to color each interval Ir.
Proof. The algorithm employs |C| palettes P1, …, P|C|, one for each category. The generic palette Pc is implemented as a double linked list and stores all the colors that can be assigned to a new interval of category c. For each color γ, a record Rγ with |C| + 1 counters and |C| pointers is maintained. For each category c, the corresponding counter Rγ.countc stores how many intervals of category c can still be colored γ (such a counter is initialized to hc). Moreover, there is an additional counter Rγ.kcount (initialized to k) storing how many intervals of any category can still be colored γ. Finally, for each category c, there is a pointer to the position of color γ in Pc. The algorithm uses a global counter, initialized to 0, to keep track of the overall number of colors used. When a brand new color is needed, the global counter is incremented. Let γ be the new value of the global counter. Then, a new record Rγ is initialized, color γ is inserted in all the palettes, and the pointers of Rγ to the palettes are updated. This requires O(|C|) time.
When a new interval Ii starts, say of category ci, it is colored in O(1) time by any color γ available in palette Pci. Then, the counters Rγ.countci and Rγ.kcount are decremented. If Rγ.countci becomes 0, then color γ is deleted from Pci. Whereas, if Rγ.kcount becomes 0, then color γ is deleted from all the palettes. In the worst case, O(|C|) time is needed.
When interval Ii ends, the counters Rγ.countci and Rγ.kcount are incremented, where γ is the color of Ii. If Rγ.kcount becomes 1, then color γ is inserted in all the palettes Pc for which Rγ.countc is greater than 0. Instead, if Rγ.kcount is larger than 1, then color γ is inserted in Pci if Rγ.countci becomes 1. Again,
in the worst case, O(|C|) time is needed. Since |C| is a constant, O(1) time is required to color each single interval Ii. QED
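For illustration, the bookkeeping described in the proof of Theorem 7 can be sketched as follows. This is an assumed reconstruction rather than the authors' implementation: palettes are kept as ordinary sets instead of the doubly linked lists needed for the strict O(1) bound, and all class and method names are invented for the example.

import java.util.*;

// Illustrative sketch of Algorithm Greedy with the per-color counters of Theorem 7.
// Palettes hold the colors still usable for each category; start() is called when an
// interval arrives and end() when it expires.
final class GreedyIntervalColoring {
    private final int k;                 // k-constraint
    private final int[] h;               // h-constraint per category
    private final List<int[]> catCount = new ArrayList<>();   // per color: remaining slots per category
    private final List<Integer> kCount  = new ArrayList<>();  // per color: remaining slots overall
    private final List<Set<Integer>> palettes;                 // usable colors per category

    GreedyIntervalColoring(int k, int[] h) {
        this.k = k;
        this.h = h.clone();
        this.palettes = new ArrayList<>();
        for (int c = 0; c < h.length; c++) palettes.add(new LinkedHashSet<>());
    }

    /** Called when an interval of the given category starts; returns its color. */
    int start(int category) {
        Set<Integer> usable = palettes.get(category);
        int color = usable.isEmpty() ? newColor() : usable.iterator().next();
        int[] cc = catCount.get(color);
        cc[category]--;
        kCount.set(color, kCount.get(color) - 1);
        if (cc[category] == 0) usable.remove(color);
        if (kCount.get(color) == 0)
            for (Set<Integer> p : palettes) p.remove(color);
        return color;
    }

    /** Called when an interval of the given category and color ends. */
    void end(int category, int color) {
        int[] cc = catCount.get(color);
        cc[category]++;
        kCount.set(color, kCount.get(color) + 1);
        for (int c = 0; c < h.length; c++)
            if (cc[c] > 0 && kCount.get(color) > 0) palettes.get(c).add(color);
    }

    private int newColor() {
        catCount.add(h.clone());            // h_c slots per category
        kCount.add(k);                      // k slots overall
        int color = kCount.size() - 1;
        for (Set<Integer> p : palettes) p.add(color);
        return color;
    }
}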
ALGORITHM FOR WEIGHTED INTERVAL COLORING Consider now a generalization of the Server Allocation with Bounded Simultaneous Requests problem, where each request r is also characterized by an integer bandwidth rate wr, and the bounds on the number of simultaneous requests to be served by the same server are replaced by bounds on the sum of the bandwidth rates of the simultaneous requests assigned to the same server. Such a problem can be formulated as a weighted generalization of Problem 1 as follows. Problem 4 (Weighted Interval Coloring with Bounded Overlapping). Given a set I of intervals, with each interval Ir characterized by a category cr and an integer weight wr, an integer k, and an integer hc ≤ k for each category c, assign a color to each interval in such a way that the sum of the weights for mutually overlapping intervals receiving the same color is at most k (k-constraint), the sum of the weights for mutually overlapping intervals of category c receiving the same color is at most hc (h-constraint), and the minimum number of colors is used. More formally, denote by I[t] the set of intervals which are active at instant t, that is, I[t] = {Ir ∈ I: sr ≤ t ≤ er}; I[c] the set of intervals belonging to the same category c, that is, I[c] = {Ir ∈ I: cr = c}; I(γ) the set of intervals colored γ; I(γ)[t] = I(γ)∩ I[t], namely, the set of intervals colored γ and active at instant t; and I(γ)[t][c] = I(γ)[t] ∩ I[c], namely, the set of intervals of category c, colored γ, and active at instant t.
Then, the k-constraints and h-constraints can be stated as follows:
Σ(Ir ∈ I(γ)[t]) wr ≤ k for all γ and t (k-constraints),

Σ(Ir ∈ I(γ)[t][c]) wr ≤ hc for all γ, t, and c (h-constraints).
Note that Problem 1 is a particular case of Problem 4, where wr = 1 for each interval Ir. When considering only the k-constraints and normalizing each weight wr in [0,1], Problem 4 is a generalization of that introduced in (Adamy & Erlebach, 2004) where a 195-approximate solution is provided under a particular on-line notion, namely, when the intervals are not given by their arrival time, but by some externally specified order. An approximation on-line algorithm for Problem 4, which contains Bin-Packing as a special case (Coffman, Galambos, Martello, & Vigo, 1999), is presented below.
Algorithm First-Color(Ii): Color interval Ii with the smallest already used color which does not violate the k-constraints and the h-constraints. If no color can be reused, then use a brand new color.
The following result has been proved in (Bertossi, Pinotti, Rizzi, & Gupta, 2004).
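A sketch of how the First-Color rule might be realised for the weighted problem is given below. It is illustrative only (the chapter's own implementation keeps the palettes as heaps, as noted after Theorem 8); the class layout, integer weights and linear scan over used colors are assumptions made for clarity.

import java.util.*;

// Illustrative sketch of Algorithm First-Color for Problem 4. Each color keeps its
// remaining overall weight budget (k) and per-category budgets (h_c) over the
// currently active intervals; expired intervals are assumed to have been released
// via release() before a new interval is colored.
final class FirstColorWeighted {
    private final int k;
    private final int[] h;
    private final List<int[]> catLoad = new ArrayList<>();  // per color: active weight per category
    private final List<Integer> load  = new ArrayList<>();  // per color: total active weight

    FirstColorWeighted(int k, int[] h) { this.k = k; this.h = h.clone(); }

    /** Colors a newly arrived interval of the given category and weight. */
    int color(int category, int weight) {
        for (int c = 0; c < load.size(); c++) {              // smallest feasible used color
            if (load.get(c) + weight <= k && catLoad.get(c)[category] + weight <= h[category]) {
                return charge(c, category, weight);
            }
        }
        load.add(0);                                          // brand new color
        catLoad.add(new int[h.length]);
        return charge(load.size() - 1, category, weight);
    }

    /** Releases the weight of an interval (of the given color) that has ended. */
    void release(int color, int category, int weight) {
        load.set(color, load.get(color) - weight);
        catLoad.get(color)[category] -= weight;
    }

    private int charge(int color, int category, int weight) {
        load.set(color, load.get(color) + weight);
        catLoad.get(color)[category] += weight;
        return color;
    }
}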
Theorem 8. Algorithm First-Color asymptotically provides a constant approximation for the Weighted Interval Coloring with Bounded Overlapping problem.
The worst approximation constant proved by Theorem 8 is 5k/h when k/h > 8/5, and 8 otherwise (by the way, an 8-approximation could be achieved even in the case that k/h > 8/5, but by a different, off-line algorithm). It is worth noting that in the case there are no h-constraints on the total weight of mutually overlapping intervals of the same category, the First-Color algorithm yields a 4-approximation. As regards the time complexity of algorithm First-Color, an implementation similar to that described in Theorem 7 can be used, where the palettes are maintained as heaps. Then, it is easy to see that a single interval can be colored in O(log φ) time, where φ is the total number of colors used.
FURTHER GENERALIZATIONS Consider now two further generalizations of the Server Allocation with Bounded Simultaneous Requests problem, where each request r is characterized by real bandwidths, normalized in [0,1] for analogy with the Bin-Packing problem (Coffman, Galambos, Martello, & Vigo, 1999). In the first generalization, which contains Multi-Dimensional Bin-Packing as a special case, each request r is characterized by a k-dimensional bandwidth rate wr = ( wr(1) , …, wr(k ) ), where the c-th component specifies the bandwidth needed for the c-th category and k is the number of categories, i.e. k = |C|. The overall sum of the bandwidth rates of the simultaneous requests of the same category assigned to the same server at the same time is bounded by 1, which implies that the total sum of the bandwidth rates over all the categories is bounded by k. Such a generalized problem can be formulated as the following variant of the interval coloring problem. Problem 5 (Multi-Dimensional Weighted Interval Coloring with Unit Overlapping). Given a set I of intervals, with each interval Ir characterized by a k-dimensional weight wr = ( wr(1) , …, wr(k ) ), where wr(c ) ∈ [0,1], for 1 ≤ c ≤ k, assign a color to each interval in such a way that the overall sum of the weights of the same category for mutually overlapping intervals receiving the same color is bounded by 1 and the minimum number of colors is used. More formally, according to the notations introduced in the previous section, the constraints of Problem 5 can be stated as follows:
Σ(Ir ∈ I(γ)[t][c]) wr(c) ≤ 1 for all γ, t, and c.
Note that the above constraints are in fact h-constraints and, when added up over all the categories in C, imply the following redundant k-constraint:

Σ(c=1..k) Σ(Ir ∈ I(γ)[t][c]) wr(c) ≤ k for all γ and t,
which is analogous to the k-constraint of Problem 4. Problem 5 can also be solved on-line by the First-Color algorithm introduced in the previous section. Theorem 9. Algorithm First-Color provides a 4k-approximation for the Multi-Dimensional Weighted Interval Coloring with Unit Overlapping problem. It is worth mentioning that the above problem, when considered as an off-line problem, is APX-hard since it contains Multi-Dimensional Bin-Packing as a special case, which has been shown to be APX-hard (Woeginger, 1997) already for k = 2. Therefore, there is no polynomial time approximation scheme (PTAS) that solves the problem within every fixed constant α (that is, one different polynomial time approximation algorithm for each constant α) unless P = NP. In the second generalization, instead, each request r is characterized by a gender bandwidth rate gr,cr associated to the category cr and by a bandwidth rate wr. The overall sum of the bandwidth rates of the simultaneous requests assigned to the same server at the same time is bounded by 1, as well as the overall sum of the gender bandwidth rates of the simultaneous requests of the same category assigned to the same server at the same time, which is also bounded by 1. This generalized problem can be formulated as the following variant of the interval coloring problem. Problem 6 (Double Weighted Interval Coloring with Unit Overlapping). Given a set I of intervals, with each interval Ir characterized by a gender bandwidth gr,cr ∈ (0,1] associated to the category cr and by a bandwidth weight wr ∈ (0,1], assign a color to each interval in such a way that the overall sum of the gender weights for mutually overlapping intervals of the same category receiving the same color is bounded by 1 (h-constraint), the overall sum of the bandwidth weights for mutually overlapping intervals receiving the same color is bounded by 1 (k-constraint), and the minimum number of colors is used. Formally, the constraints of Problem 6 are given below:
Σ(Ir ∈ I(γ)[t]) wr ≤ 1 for all γ and t,

Σ(Ir ∈ I(γ)[t][c]) gr,cr ≤ 1 for all γ, t, and c.
Note that Problem 6 is a generalization of Bin-Packing, and hence it is NP-hard. However, Problem 6 can again be solved on-line by the First-Color algorithm introduced in the previous section. Theorem 10. Algorithm First-Color provides a constant approximation and, asymptotically, an 11-approximation for the Double Weighted Interval Coloring with Unit Overlapping problem.
CONCLUSION This chapter has considered several scalable on-line approximation algorithms for problems arising in isolated infostations, where user requests characterized by categories and temporal intervals have to be assigned to servers in such a way that a bounded number of simultaneous requests are assigned to the same server and the number of servers is minimized. However, several questions still remain open. For instance, one could lower the approximation bounds derived for the problems reviewed in this chapter. Moreover, it is still an open question to determine whether the NP-hardness result reported in this chapter still holds when k = 2, there are only 3 categories, and h1 = h2 = h3 = 1. Finally, one could consider the
scenario in which the number of servers is given as input, each request has a deadline, and the goal is to minimize the overall completion time for all the requests.
REFERENCES Adamy, U., & Erlebach, T. (2004). Online coloring of intervals with bandwidth (LNCS Vol. 2909, pp. 1–12). Berlin: Springer. Bertossi, A. A., Pinotti, M. C., Rizzi, R., & Gupta, P. (2004). Allocating servers in infostations for bounded simultaneous requests. Journal of Parallel and Distributed Computing, 64, 1113–1126. doi:10.1016/ S0743-7315(03)00118-7 Coffman, E. G., Galambos, G., Martello, S., & Vigo, D. (1999). Bin-packing approximation algorithms: Combinatorial analysis. In D. Z. Du & P. M. Pardalos, (Ed.), Handbook of Combinatorial Optimization, (pp. 151–207). Dondrecht, the Netherlands: Kluwer. Garey, M. R., & Johnson, D. S. (1979). Computers and Intractability. San Francisco: Freeman. Golumbic, M. C. (1980). Algorithmic Graph Theory and Perfect Graphs. New York: Academic Press. Goodman, D. J., Borras, J., Mandayam, N. B., & Yates, R. D. (1997). INFOSTATIONS: A new system model for data and messaging services. Proceedings of the 47th IEEE Vehicular Technology Conference (VTC), Phoenix, AZ, (Vol. 2, pp. 969–973). Jayram, T. S., Kimbrel, T., Krauthgamer, R., Schieber, B., & Sviridenko, M. (2001). Online server allocation in server farm via benefit task systems. Proceedings of the ACM Symposium on Theory of Computing (STOC’01), Crete, Greece, (pp. 540–549). Karp, R. M. (1992). Online algorithms versus offline algorithms: How much is it worth to know the future? In J. van Leeuwen, (Ed.), Proceedings of the 12th IFIP World Computer Congress. Volume 1: Algorithms, Software, Architecture, (pp. 416–429). Amsterdam: Elsevier. Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., & Shmoys, H. (1993). Sequencing and Scheduling: Algorithms and Complexity. Amsterdam: North-Holland. Winkler, P., & Zhang, L. (2003). Wavelength assignment and generalized interval graph coloring. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA’03), Baltimore, MD, (pp. 830–831). Woeginger, G. J. (1997). There is no asymptotic PTAS for two-dimensional vector packing. Information Processing Letters, 64, 293–297. doi:10.1016/S0020-0190(97)00179-8 Wu, G., Chu, C. W., Wine, K., Evans, J., & Frenkiel, R. (1999). WINMAC: A novel transmission protocol for infostations. Proceedings of the 49th IEEE Vehicular Technology Conference (VTC), Houston, TX, (Vol. 2, pp. 1340–1344).
Zander, J. (2000). Trends and challenges in resource management future wireless networks. In Proceedings of the IEEE Wireless Communications and Networks Conference (WCNC), Chicago, (Vol. 1, pp. 159–163).
KEY TERMS
α-Approximation Algorithm: An algorithm producing a solution which is guaranteed to be no worse than α times the best solution.
Bin-Packing: A combinatorial problem in which objects of different volumes must be packed into a finite number of bins of given capacity in a way that minimizes the number of bins used.
Infostation: An isolated pocket area with small coverage of high bandwidth connectivity that delivers data on demand to mobile users.
Interval Graph Coloring: A combinatorial problem in which colors have to be assigned to intervals in such a way that two overlapping intervals are colored differently and the minimum number of colors is used. Such a problem corresponds to coloring the vertices of an interval graph, that is, a graph representing the intersections of the set of intervals.
Multiprocessor Scheduling: A method by which tasks are assigned to processors.
On-Line Algorithm: An algorithm that processes its input data sequence in an ongoing manner, that is, as the data become available, without knowledge of the entire input sequence.
Scalable Algorithm: An algorithm able to maintain the same efficiency when the workload grows.
Server Allocation: An assignment of servers to the user requests.
Section 7
Web Computing
Chapter 28
Web Application Server Clustering with Distributed Java Virtual Machine1 King Tin Lam The University of Hong Kong, Hong Kong Cho-Li Wang The University of Hong Kong, Hong Kong
ABSTRACT Web application servers, being today’s enterprise application backbone, have warranted a wealth of J2EE-based clustering technologies. Most of them however need complex configurations and excessive programming effort to retrofit applications for cluster-aware execution. This chapter proposes a clustering approach based on distributed Java virtual machine (DJVM). A DJVM is a collection of extended JVMs that enables parallel execution of a multithreaded Java application over a cluster. A DJVM achieves transparent clustering and resource virtualization, extolling the virtue of single-system-image (SSI). The authors evaluate this approach through porting Apache Tomcat to their JESSICA2 DJVM and identify scalability issues arising from fine-grain object sharing coupled with intensive synchronizations among distributed threads. By leveraging relaxed cache coherence protocols, we are able to conquer the scalability barriers and harness the power of our DJVM’s global object space design to significantly outstrip existing clustering techniques for cache-centric web applications.
INTRODUCTION Scaling applications in web server environment is a fundamental requisite for continued growth of ebusiness, and is also a pressing challenge to most web architects when designing large-scale enterprise systems. Following the success of the Java 2 Platform, Enterprise Edition (J2EE), the J2EE world has developed an alphabet soup of APIs (JNDI, JMS, EJB, etc) that programmers would need to slurp down if they are to cluster their web applications. However, comprehending the bunch of these APIs and the clustering technologies shipped with J2EE server products is practically daunting for even those expeDOI: 10.4018/978-1-60566-661-7.ch028
rienced programmers. Besides the extra configuration and setup time, intrusive application rework is usually required for the web applications to behave correctly in the cluster environment. Therefore, there is still much room for researchers to contribute improved clustering solutions for web applications. In this chapter, we introduce a generic and easy-to-use web application server clustering approach coming out from the latest research in distributed Java virtual machines. A Distributed Java Virtual Machine (DJVM) fulfills the functions of a standard JVM in a distributed environment, such as clusters. It consists of a set of JVM instances spanning multiple cluster nodes that work cooperatively to support parallel execution of a multithreaded Java application. The Java threads created within one program can be distributed to different nodes and perform concurrently to exploit higher execution parallelism. The DJVM abstracts away the low-level clustering decisions and hides the physical boundaries across the cluster nodes from the application layer. All available resources in the distributed environment, such as memory, I/O and network bandwidth can be shared among distributed threads for solving more challenging problems. The design of DJVM adheres to the standard JVM specification, so ideally all applications that follow the original Java multithreaded programming model on a single machine can now be clustered across multiple servers in a virtually effortless manner. In the past, various efforts have been conducted in extending JVM to support transparent and parallel execution of multithreaded Java programs on a cluster of computers. Among them, Hyperion (Antoniu et al., 2001) and Jackal (Veldema et al., 2001) compile multithreaded Java programs directly into distributed applications in native code, while Java/DSM (Yu & Cox, 1997), cJVM (Aridor, Factor, & Teperman, 1999), and JESSICA (Ma, Wang, & Lau, 2000) modify the underlying JVM kernel to support cluster-wide thread execution. These DJVM prototypes debut as proven parallel execution engines for high-performance scientific computing over the last few years. Nevertheless, their leverage to clustering real-life applications with commercial server workloads has not been well-studied. We strive to bridge this gap by presenting our experience in porting the Apache Tomcat web application server on a DJVM called JESSICA2. A wide spectrum of web application benchmarks modeling stock quotes, online bookstore and SOAP-based B2B e-commerce are used to evaluate the clustering approach using DJVMs. We observe that the highly-threaded execution of Tomcat involves enormous fine-grain object accesses to Java collection classes such as hash tables all over the request handling cycles. This presents the key hurdles to scalability when the thread-safe object read/write operations and the associated synchronizations are performed in a cluster environment. To overcome this issue, we employ a home-based hybrid cache coherence protocol to support object sharing among the distributed threads. For cache-centric applications that cache hot and heavyweight web objects at the application-level, we find that by using JESSICA2, addition of nodes can grow application cache hits linearly, significantly outperforming the share-nothing approach using web server load balancing plug-in. 
This is attributed to our global object space (GOS) architecture that virtualizes network-wide memory resources for caching the application data as a unified dataset for global access by all threads. Clustering HTTP sessions over the GOS enables effortless cluster-wide session management and leads to a more balanced load distribution across servers than the traditional sticky-session request scheduling. Our coherence protocol also scales better than the session replication protocols adopted in existing Tomcat clustering. Hence, most of the benchmarked web applications show better or equivalent performance compared with the traditional clustering techniques. Overall, the DJVM approach emerges as a more holistic, cost-effective and transparent clustering technology that disappears from the application programmer’s point of view. With efficient protocol support for shared object access, such a middleware-level clustering solution is suitable for scaling most
web applications in a cluster environment. Maturing of the DJVM technology would bring about stronger server resource integration and open up new vistas of clustering advances among the web community. The rest of the chapter is organized as follows. In Section 2, we survey the existing web application clustering technologies. Section 3 presents the system architecture of our JESSICA2 DJVM. In Section 4, we describe Tomcat execution on top of the JESSICA2 DJVM. Section 5 discusses JESSICA2’s global object space design and implementation. In Section 6, we evaluate the performance of Tomcat clustering using the DJVM. Section 7 reviews the related work. Section 8 concludes this chapter and suggests some possible future work.
EXISTING WEB APPLICATION CLUSTERING TECHNOLOGIES In the web community, clustering is broadly viewed as server load balancing and failover. Here, we discuss several widely adopted clustering technologies under the umbrella of J2EE. The most common and cost-effective way of load balancing is to employ a front-end web server with load balancing plug-ins such as Apache mod_jk (ASF, 2002) to dispatch incoming requests to different application servers. The plug-ins usually support sticky-sessions to maintain a user session entirely on one server. This solution makes cluster resource utilization more restricted and is not robust against server failures. More advanced solutions need to support application state sharing among servers. Large-scale J2EE server products generally ship with clustering support for HTTP sessions and stateful session beans. One traditional approach is to serialize the session contents and persist the states to a data store like a relational database or a shared file system. However, this approach is not scalable. In-memory session replication is an improved technique, also based on Java serialization, which marshals session-bound objects into byte streams for sending to peer servers by means of some group communication service such as JGroups (Ban, 1997) (based on point-to-point RMI or IP multicast). Such a technique has been implemented in common web containers such as Tomcat. However, scalability issues are still present in group-based synchronous replication, especially with the general all-to-all replication protocols, which are only efficient in very small clusters. Enterprise JavaBeans (EJB) is a server-side component architecture for building modular enterprise applications. Yet the EJB technology itself and its clustering are both complicated. Load balancing among EJB containers can be achieved by distributed method call, messaging or name services, which correspond to the three specifications: Remote Method Invocation (RMI), Java Message Service (JMS) and Java Naming and Directory Interface (JNDI). In particular, JNDI is an indispensable element of EJB clustering as EJB access normally starts with a lookup of its home interface in the JNDI tree. For clients to look up clustered objects, EJB containers implement some global JNDI services (e.g. a cluster-wide shared JNDI tree) and ship with special RMI compilers to generate replica-aware stubs for making user-defined EJBs "cluster-aware". The stub contains the list of accessible target EJB instances and code for load balancing and failover among the instances. EJB state changes are serialized and replicated to peer servers after the related transaction commits or after each method invocation. Undoubtedly, this clustering technology is expensive, complicated and imposes application design restrictions. In recent years, a growing trend in web application development has been to adopt lightweight containers such as the Spring Framework (Johnson, 2002) as the infrastructural backbone instead of the EJB technology. Under such a paradigm, business objects are just plain old Java objects (POJOs)
implementing data access logic and running in web containers like Tomcat. Caching POJOs in a collection object like a Hashtable is also a common practice for saving long-latency access to database and file systems. To support clustering of POJOs, which conform to no standard interface, it seems almost inevitable that application programmers have to rework their application code to use extra APIs to synchronize object replicas among the JVMs. Though distributed caching libraries (Perez, 2003) can facilitate POJO clustering, these solutions again rely on Java serialization and require complex configurations. The cache sizes they support are usually bounded by single-node memory capacity as a result of employing simplistic all-to-all synchronization and full replication protocols. Although the clustering solutions surveyed so far have their own merits, most of them share several significant shortcomings.
• Restrictions on application design: Many object sharing mechanisms rely on Java serialization, which poses restrictions on application design and implementation; objects that do not serialize cleanly cannot easily work in a cluster environment.
• Possible loss of referential integrity: Most solutions suffer a break of referential integrity, since deserialization creates clones of the replicated object graph and may lose the original object identity. That is why, when a shared object undergoes changes, it must be put back into the container object by an explicit call like setAttribute() to reflect the new referential relation (see the sketch after this list). Likewise, consistency problems occur when attributes with cross-references in HttpSession are modified and unmarshaled separately.
• Costly communication: Object serialization is known to be hugely costly in performance. It performs a coarse trace and clones a lot of objects even for one field change, so there is a certain limit on the number and sizes of objects that can be bound in a session.
• No global signaling/coordination support: Subtle consistency problems arise when some design patterns and services are migrated to clusters. For example, the singleton pattern sharing a single static instance among threads, as well as some synchronization code, becomes localized to each server, losing global coordination. Event-based services like timers make no sense if they are not executed on a single platform. Only a few products (e.g. JBoss's clustered singleton facility) ship with configurable cluster-wide coordination support to ease these situations.
• Lacking global resource sharing: Most clustering solutions in the web domain put little focus on global integration of resources. They cannot provide a global view of the cluster resources such as memory, so each standalone server just does its own work without cooperation and may not fully exploit resources.
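To make the re-put and referential-integrity issues above concrete, the following is a minimal, hypothetical sketch (not taken from the chapter); the helper method and the Cart class are assumptions for illustration only.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import javax.servlet.http.HttpSession;

// Hypothetical session-bound POJO, used only to illustrate the pattern.
class Cart implements Serializable {
    private final List<String> items = new ArrayList<String>();
    void addItem(String isbn) { items.add(isbn); }
}

class CartHelper {
    static void addToCart(HttpSession session, String isbn) {
        Cart cart = (Cart) session.getAttribute("cart");
        cart.addItem(isbn);                 // in-place mutation of the shared object
        // On a single JVM the mutation alone is sufficient. Under serialization-based
        // session replication, the container typically marshals an attribute only when
        // setAttribute() is invoked, so the object must be explicitly "put back":
        session.setAttribute("cart", cart);
        // On a peer server the replica is a deserialized clone; any other object that
        // referenced the original Cart no longer points to this clone -- the loss of
        // referential integrity described above.
    }
}
```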
JESSICA2 DISTRIBUTED JVM JESSICA2 (Zhu, Wang, & Lau, 2002) is a DJVM designed to support transparent parallel execution of multithreaded Java applications in a networked cluster environment. It was developed based on the Kaffe JVM (Wilkinson, 1998). The acronym JESSICA2 stands for Java-Enabled Single-System-Image Computing Architecture version 2; this architecture promotes the single-system image (SSI) notion when connecting Java with clusters. Such a design concept helps take away from application developers the burden of clustering by hand. The key advantage of using JESSICA2 is its provision of transparent clustering services which require no source code modification or bytecode preprocessing. It will
automatically take care of thread distribution, data consistency of the shared objects and I/O redirection so that the program will run under an SSI illusion with the integrated computing power, memory and I/O capacity of the cluster. Figure 1 shows the system architecture of the JESSICA2 DJVM.

Figure 1. JESSICA2 DJVM System Architecture

JESSICA2 bundles a number of salient features, extended from the standard JVM, that realize the SSI services. To execute a Java application on JESSICA2, a tailored command is called to start the master JVM on the local host and the worker JVMs on remote nodes, based on the specified list of hostnames. In each JVM, a class loader is responsible for importing bytecode data (of both the basic Java class library classes and the application classes) into its method area, where a Java thread can look up a specific method to invoke. The class loader of JESSICA2 is extended to support remote class loading, which ensures that when a worker JVM cannot find a class file locally, it can request the class bytecode on demand and fetch the initialized static data from the master JVM through network communication. This feature greatly simplifies cluster-wide deployment of Java applications and hence transparently provides the web farming support which traditionally requires application server extensions to fulfill. When the Java threads of the application are started, the thread scheduler of the JVM will put their contexts (e.g. program counter and other register values) into the execution engine in turn. The Java methods invoked by the running thread will be compiled by the Just-In-Time (JIT) compiler into native code for high-speed execution. JESSICA2 incorporates a cluster-aware JIT compiler to support lightweight Java thread migration across node boundaries to assist global thread scheduling. Java threads will be assigned to each worker JVM at startup time in a round-robin manner to strike a rough initial load balance. Dynamic load balancing during runtime can be done by migrating Java threads that run into computation hotspots to the less loaded nodes. For detecting hotspots, each JVM instance
has a load monitor daemon that periodically wakes up and sends the current load status, such as CPU and memory utilization, to the master JVM, which is then able to make thread migration decisions with a global resource view. Java threads migrated to remote JVMs may still be carrying references to objects under the source JVM heaps. For seamless object visibility, JESSICA2 employs a special heap-level service called the Global Object Space (GOS) to support location-transparent object access. Objects can be shared among distributed threads over the GOS as if they were under a single JVM heap. For this to happen, the GOS implements object packing functions to transform object graphs into byte streams for shipping to the requesting nodes. The shipped object data will be saved as a cache copy under the local heap of the requesting node. Caching improves data access locality but leads to cache consistency issues. To tackle the problem of stale data, the GOS employs release-consistent memory models stemming from software Distributed Shared Memory (DSM) systems to preserve correct memory views on shared objects across reads/writes done by distributed threads. JESSICA2 offers parallel I/O and location-transparent file access. We extend JESSICA2 to support a transparent I/O redirection mechanism so that I/O requests (file and socket access) can be virtually served at any node. Our system does not rely on shared distributed file systems such as NFS, nor does it require a single IP address for all the nodes in the running cluster. Rather, we extend each JVM to run a transparent I/O redirection mechanism that redirects non-home I/O operations on files or sockets to their home nodes. To attain I/O parallelism atop transparency, read-only file operations and connectionless network I/O can be done at the local nodes concurrently without redirection. Finally, all inter-node communication activities required by the subsystems at upper layers, like the GOS and I/O redirection, are supported by a common module called the host manager, which wraps up the underlying TCP communication functions with connection caching and message compression optimizations. On the whole, we can see that the DJVM is a rather generic middleware system that supports parallel execution of any Java program. Since the unveiling of DJVMs, their application domains have remained mostly in scientific computing. They were used to support multithreaded Java programs that are programmed in a data-parallel manner. These applications tend to be simple and embarrassingly parallel, so that DJVMs could offer good scalability. However, far more mainstream applications are business-oriented, centered on server-side platforms and run atop some Java application server. Their object access and synchronization patterns are far more complex. In the next sections, we will elaborate on the common runtime characteristics of application servers and their impact on DJVM performance through a case study of Apache Tomcat running on JESSICA2.
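JESSICA2's load monitor lives inside the modified JVM and its code is not shown in this chapter; the following is only a Java-level sketch of the idea, with all class and method names being illustrative assumptions rather than the actual implementation.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.net.Socket;

// Hypothetical per-node load monitor: wakes up periodically and reports the node's
// load status to the master, which makes global thread migration decisions.
public class LoadMonitorDaemon implements Runnable {
    private final String masterHost;
    private final int masterPort;
    private final long periodMillis;

    public LoadMonitorDaemon(String masterHost, int masterPort, long periodMillis) {
        this.masterHost = masterHost;
        this.masterPort = masterPort;
        this.periodMillis = periodMillis;
    }

    @Override
    public void run() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        Runtime rt = Runtime.getRuntime();
        while (true) {
            try {
                double cpuLoad = os.getSystemLoadAverage();          // -1 if unavailable
                long usedHeap = rt.totalMemory() - rt.freeMemory();  // memory utilization
                try (Socket s = new Socket(masterHost, masterPort);
                     DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
                    out.writeDouble(cpuLoad);                        // report to the master JVM
                    out.writeLong(usedHeap);
                }
                Thread.sleep(periodMillis);                          // periodic wake-up
            } catch (IOException | InterruptedException e) {
                return;                                              // stop on failure/shutdown
            }
        }
    }

    public static void start(String host, int port) {
        Thread t = new Thread(new LoadMonitorDaemon(host, port, 2000), "load-monitor");
        t.setDaemon(true);
        t.start();
    }
}
```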
APACHE TOMCAT ON DISTRIBUTED JVM Apache Tomcat is a Java servlet container developed at the Apache Software Foundation (ASF). It serves as the official reference implementation of the Java Servlet and JavaServer Page (JSP) specifications. Tomcat is the world’s most widely used open-source servlet engine and has been used by renowned companies like WalMart, E*Trade Securities and The Weather Channel to power their large-scale and mission-critical web applications in production systems. As a common design in many servers, Tomcat maintains a thread pool to avoid thread creation cost for every short-lived request as well as to give an upper bound to the overall system resource usage.
Upon an incoming connection, a thread is scheduled from the pool to handle it. The web container then performs various processing such as HTTP header parsing, session handling, web context mapping and servlet class loading. The request eventually reaches the servlet code which implements application logic such as form data processing, database querying, HTML/XML page generation, etc. Finally, the response is sent back to the client. This request service cycle is complex, comes across many objects throughout the container hierarchy and imposes multithreading challenges on the DJVM runtime. Being a classical and large-scale web application server, Tomcat reflects an important class of real-life object-oriented server execution patterns that are summarized as follows.
1. I/O-intensive workload: Most web server workloads are I/O-bound and composed of short-lived request processing. The per-request computation-to-communication ratio is usually small.
2. Highly-threaded: It is common for a server instance to be configured with a large number of threads, typically a few tens to a hundred per server, to hide I/O blocking latency.
3. High read/write ratios: Shaped by customer buying behaviors and e-business patterns, web applications usually exhibit a high read/write ratio, say around 90/10; the dominant reads come from browsing while only a few writes, owing to ordering, happen over a period.
4. Long-running: Typically a server application runs for an indefinitely long time, processing requests received from the client side.
5. High utilization of the collection framework: Tomcat makes extensive use of Java collection classes like Hashtable and Vector to store information (e.g. web contexts, sessions, attributes, MIME types, status codes, etc.). They are accessed frequently when checking, mapping and searching operations happen inside the container. To reduce object creation and garbage collection costs, many application servers apply the object pooling technique and use collection classes to implement the object pools (a minimal sketch of such a collection-backed pool follows this list).
6. Fine-grain object access: Fine-grain object access has two implications here: (1) the object size is small; (2) the interval between object accesses to the heap is short. Unlike many scientific applications, which have well-structured objects of at least several hundred bytes, Tomcat contains an abundance of small objects (about 80 bytes on average in our experience) throughout the container hierarchy. Object accesses are very frequent due to the object-oriented design of Tomcat.
7. Complex object graph with irregular reference locality: Some design patterns such as facade and chain of interceptors used in Tomcat yield ramified object connectivity, cross-referencing and irregular reference locality among objects throughout the container hierarchy. By property 5, heavy use of Java Hashtable or HashMap also intensifies the irregularity of reference locality, as hash entries are accessed in a shuffling pattern, contrasting with the consecutive memory access pattern in array-based scientific computations.
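The collection-backed object pooling pattern named in property 5 can be made concrete with a small, hypothetical sketch (the class below is an illustration of the pattern, not Tomcat code):

```java
import java.util.Stack;

// Hypothetical collection-backed object pool. java.util.Stack is itself synchronized,
// and the explicit block below makes the isEmpty/pop pair atomic: every check-out and
// check-in goes through one pool-wide monitor.
class ResourcePool<T> {
    private final Stack<T> idle = new Stack<T>();

    void checkIn(T resource) {
        idle.push(resource);                        // frequent, fine-grain access
    }

    T checkOut() {
        synchronized (idle) {                       // single pool-wide lock
            return idle.isEmpty() ? null : idle.pop();
        }
    }
}
```

Under a DJVM, the single monitor on such a pool becomes a cluster-wide lock, which is exactly the contention problem elaborated in the following sections.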
Figure 2 depicts the execution of the Tomcat application server on top of a 4-node cluster. When executed atop JESSICA2, Tomcat is exposed to an SSI view of the integrated resources of the cluster nodes as if it were running on one powerful server. A customized Tomcat startup script is used to bring up the server, running atop the master JVM. The script is tailored to supply the DJVM runtime parameters (e.g. the port number for master-worker handshaking) and to read a host configuration file which defines the hostnames or IP addresses of the worker nodes the DJVM would span. When the server spawns a pool of threads, the threads will be migrated to the worker nodes. They will load the classes of the Java library, Tomcat and the web applications deployed dynamically through
the cluster-aware class loader of JESSICA2. In this way, "virtual" web application server instances are set up on the worker nodes.

Figure 2. Execution of Tomcat on JESSICA2 DJVM

The virtual server instances pull workload continuously from the master node by accepting and handling incoming connections through transparent I/O redirection. On each worker node, I/O operations (accept, read/write and close) performed on the shared server socket object (wrapped in the pooled TCP connector) will be redirected to the master node, where the socket was bound to the outside world. Most other I/O operations can be performed on I/O objects created locally, so each cluster node can serve web page requests and database queries in parallel. When a client request is accepted, the context manager of Tomcat will match it to the target web application context. If the request carries session state such as a cookie, the standard manager will search for the allocated session object in the sessions hash table. In essence, all Tomcat container objects, including the context manager, the standard manager, the sessions hash table and the web contexts allocated in the master JVM heap, are transparently shared among the distributed threads by means of the underlying GOS service mentioned in section 3. When a thread first accesses a non-local object reference, it will encounter an access fault and send a fetch request to the object's home node. The home node will respond with the up-to-date object data and export the local object as the home copy of the shared object. Cluster-wide data consistency will be enforced on the home copy and all cache copies derived from it thereafter. Since each thread will be able to see the shared object updates made by others through synchronization, the global shared heap creates an implicit cooperative caching effect among the threads. The power of this effect can be exemplified by collection classes like hash tables. As illustrated, all HTTP sessions stored in a Tomcat-managed hash table can be globally accessible. The responsibility for maintaining HTTP session data consistency across servers is transparently shifted to the GOS layer. In other words, every server is eligible to handle requests belonging to any client session. This leads to more freedom of choice in request scheduling policies over sticky-sessions load
balancing, which can run into hotspots. Another useful scenario is using the GOS to augment the effective cache size of an application-level in-memory Java cache (e.g., a hash table for looking up database query results). The fact that every thread sees the cache entries created by one another contributes secondary (indirect) application cache hits through remote object access. The cache size can now scale linearly with additional nodes, so we can greatly take the load off the bottlenecked database tier by caching more data at the application tier. The DJVM approach inherits most advantages of clusters. However, the aforesaid server runtime properties bring additional design challenges to the DJVM runtime. First, I/O-intensive workloads are known to be more difficult to scale efficiently over a cluster. Second, the high thread count implies higher blocking latency if contention occurs, and more memory overhead results from any per-thread protocol data structures. The high read/write ratio is positive news for the GOS as it implies shared writes are limited, so our protocols can exploit this property as a design tradeoff. Next, for long-running applications, we need to make sure the memory overhead induced by the coherence protocol data structures grows slowly, so that garbage-collection cycles stay infrequent. Property 5 puts up the biggest barrier to scalability. Frequent synchronizations on the globally shared thread pool and object pools produce intensive remote locking overhead. Worse still, these pools are usually built from Java collection classes which are not scalable. For example, fine-grain accesses to the hash entries of a Java hash table are all bottlenecked on the single map-wide lock, whose contention is much intensified by distributed locking. Finally, properties 6 and 7 together generate enormous numbers of remote access round-trips and demand smart object prefetching techniques for aggregating fine-grain communications. These observations call for a renovation of JESSICA2's global object space (GOS) architecture.
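As an illustration of the map-wide lock problem just mentioned, the fragment below contrasts the single-monitor java.util.Hashtable with java.util.concurrent.ConcurrentHashMap (discussed again in Section 5.3); the session-table usage here is a hypothetical example, not Tomcat code.

```java
import java.util.Hashtable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class LookupTables {
    // Every get()/put() below synchronizes on one monitor; under a DJVM that monitor
    // turns into a cluster-wide lock, serializing all request-handling threads.
    static final Map<String, Object> coarseSessions = new Hashtable<String, Object>();

    // ConcurrentHashMap guards different bucket ranges with separate lock segments and
    // uses volatile fields for reads, so there is no single map-wide bottleneck.
    static final Map<String, Object> fineSessions =
            new ConcurrentHashMap<String, Object>();

    static void register(String id, Object session) {
        coarseSessions.put(id, session);   // coarse-grain: one lock for the whole map
        fineSessions.put(id, session);     // fine-grain: per-segment lock only
    }
}
```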
GLOBAL OBJECT SPACE In this section, we elaborate on the design and implementation of our enhanced GOS system. We discuss the structure of the extended JVM heap, a home-based cache coherence protocol tailored for managing locks and a cluster-wide sequential consistency protocol for handling volatile field updates.
5.1 Overview of the Extended JVM Heap To support cluster-wide object modification propagation and consistency maintenance, the heap of the standard JVM should be extended to make it "cluster-aware". In JESSICA2, each JVM heap is logically divided into two areas: the master heap area and the cache heap area. The master heap area essentially rides on the unmodified JVM heap, storing ordinary local objects. To make it "cluster-aware", when the local objects are being shared with some remote threads, they are exported as home objects with special flags marked in their object headers. The cache heap area manages cache objects brought from the master heap of a peer node. It consists of extra data structures for maintaining cluster-wide data consistency. The original GOS follows an intuitive design in which each thread has its own cache heap area, resembling the thread-private working memory of the Java memory model (JMM). This design prevents local threads from interfering with each other's cache copies (such as during invalidations) but wastes precious memory space on redundant per-thread cache copies on the same node. So we adopt a unified cache design in the enhanced GOS which allows all local threads running on a single node to share a common cache copy. This design not only makes better use of available memory resources but
also reduces remote object fetching, since when a thread faults in an object, other peer threads at the same node requesting the same object can find it in place. We also switch to a release consistency memory model in which the dominant read-only objects are never invalidated, so the interference among local threads is practically small. These modifications can potentially accommodate a high server thread count and achieve better memory utilization. Figure 3 shows the internal data structures of the extended JVM heap.

Figure 3. GOS internal data structures

The object header of every object is augmented with special fields such as the cache pointer. A local or home object has a null cache pointer, whereas a cache object has its cache pointer pointing to an internal data structure called the cache header that contains the state and home information of the object. A node-level hash table (shared by all local threads) is used to manage and look up cache headers during fetching events. In order to tell the home nodes of the modifications made on cache objects, each thread maintains a dirty list that records the ids of cache objects it has modified. At synchronization points, updates made on the dirty objects are flushed back to their home nodes. A similar per-node volatile dirty list is used to record updates on objects with volatile fields, which are maintained by a separate single-writer protocol to be explained in section 5.3. The object state is composed of two bits: valid/invalid and clean/dirty. The JIT compiler is tweaked to perform inline checking on each cache object access to see if its state is valid for read/write. A read/write on an invalid object will trigger the appropriate interface functions to fault-in the up-to-date copy from its home. For efficiency, the software check is injected as a tiny assembly code fragment into the relevant bytecode instructions (GETFIELD, PUTFIELD, AALOAD, AASTORE, etc.), testing the last two bits of the cache pointer. Valid object accesses passing the check do not impose any GOS interface function call overhead and are thus as fast as local object accesses. Creating a single-heap illusion for distributed threads entails an advanced distributed cache coherence protocol design, as it has to be compliant with the Java memory model that defines the memory consistency semantics across multiple threads. The Java language provides two synchronization constructs for
the programmers to render thread-safe code – the synchronized and volatile keywords. The synchronized keyword guarantees atomicity and memory visibility for a code fragment or method, while volatile ensures that threads can see the latest values of volatile variables. We will discuss our enhancements of the GOS for handling the two types of synchronization in Sections 5.2 and 5.3 respectively.
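For readers less familiar with the two constructs, a minimal, generic Java example (not JESSICA2-specific) of their single-JVM semantics:

```java
class SharedCounter {
    private int hits;                 // guarded by this object's monitor
    private volatile boolean open;    // visibility only, no mutual exclusion

    // Entering the synchronized method acquires the object's lock, guaranteeing both
    // atomicity of the increment and visibility of prior writes to other threads.
    synchronized void recordHit() {
        hits++;
    }

    synchronized int getHits() {
        return hits;
    }

    // A write to a volatile field is immediately visible to subsequent reads by any
    // thread, but compound actions on it are not atomic.
    void shutdown() {
        open = false;
    }

    boolean isOpen() {
        return open;
    }
}
```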
5.2 Home-based Lazy Release Consistency Protocol Entering and exiting a synchronized block or method correspond to acquiring and releasing the lock associated with the synchronized object. To fulfill the Java memory model, the original GOS implements an intuitive solution that works as follows. Upon a lock release, all updates to cache objects are flushed to their home nodes. Upon a lock acquire, all cache objects are invalidated, so later accesses will fault in the up-to-date copies from the home nodes. However, this would incur significant object fault-in overheads after every lock acquire. Thus, we renovate the original global object space by adopting a more relaxed home-based lazy release consistency (HLRC) memory model. Contrary to the intuitive solution, upon a lock acquire we confine invalidations to cache copies of shared objects that have been modified by other nodes only, rather than invalidating the total cache heap area. Our home-based cache coherence protocol guarantees memory visibility based on Lazy Release Consistency (LRC) (Keleher, Cox, & Zwaenepoel, 1992). LRC delays the propagation of modifications to a node until it performs a lock acquire. Lock acquire and release delimit the start and end of an interval. Specifically, LRC ensures that the node can see the memory changes performed in other nodes' intervals according to the happens-before-1 partial order (Adve & Hill, 1993), which is basically given by the local node's locking order and the shared lock transfer event. This means all memory updates preceding the release performed by a node should be made visible to the node that acquires the same lock. HLRC is similar to LRC in the sense of lock management but shapes the modification propagation into home-based patterns. Memory updates are communicated based on a multiple-writer protocol implemented using the twin-and-diff technique, which allows two or more threads to modify different parts (i.e. different fields or array portions) of the same shared object concurrently without conflict. In this technique, a twin copy is made as a data snapshot before the first write to a cache object in the current interval. Upon a shared lock release, for each dirty cache object the modified part, i.e. the diff, is differentiated from the twin. The diff is eagerly flushed to the corresponding home node, keeping the home copy always up-to-date. The thread can then safely discard the twins and diffs and close the interval. When the lock is acquired by another thread, the releaser passes write notices along with the lock grant to the acquirer. The acquirer uses the write notices to invalidate the corresponding cached objects. It also saves the write notices so that they can be passed on to the next acquirer, enforcing the happens-before partial order. A later access on an invalidated cache object will fault in the up-to-date copy from its home. Here, we have to deal with some tricky data-race problems arising from sharing a unified cache copy among local threads. First, for systems of object-based granularity as in our case, field-level false sharing may occur, since protecting different fields of one object by different locks is reckoned as well-synchronized in Java. For example, while one thread T1 holds a lock for modifying field A of a cache copy and puts it in the dirty state, another local thread T2 may acquire a lock for modifying field B of the same object.
If another node has modified field B using the same lock, then T2 will invalidate that cache copy and fault-in the home copy, overwriting those pending modifications made by T1. Second, in systems with object prefetching, it is possible for one thread faulting in a home object A
with object B prefetched to overwrite the pending modifications on the shared cache copy B made by another thread. Currently, we deal with these hazards by reconciling the timestamp field associated with each object to resolve detectable version conflicts and by incorporating techniques similar to two-way diffing (Stets et al., 1997). For home objects, local reads/writes can be done directly without generating and applying diffs. This benefit is usually known as the home effect (Zhou, Iftode, & Li, 1996). A minor overhead that home nodes still need to pay is to keep a record of the local writes for the next remote acquiring thread to invalidate the relevant cache copies. Locking of home objects resembles locking of local objects if the lock ownership has not been given to any remote node. Otherwise, it has to wait for the lock release done by the last remote acquirer. Compared with homeless protocols, the advantages of HLRC are: 1. the home effect for reducing high diffing overheads; 2. fewer messages, since an object fault can always be satisfied by one round-trip to the home instead of diff request messages to multiple writers; 3. no diff accumulation and so no need for garbage collection of diffs. Hence, this becomes our protocol design choice for shorter latency seen by I/O-bound workloads and less garbage accruing from long-running server applications. Nevertheless, we depart from the usual HLRC implementations in some aspects. To track and enforce the happens-before partial order, traditional HLRC implementations rely heavily on vector timestamps to derive the exact minimal intervals (and write notices) that the acquirer must apply. While this ensures the most relaxed invalidation, it entails complex data structures like the interval records in TreadMarks (Keleher et al., 1994) or the bins database (Iosevich & Schuster, 2005) to keep the stacks of vectors. The storage size occupied by them scales with the number of lock operations on shared objects. For lock-intensive applications, these stacks can grow quickly and consume enormous space. For long-running server applications, the problem becomes more critical, and systems that rely on pre-allocation schemes such as cyclic bins buffers (Iosevich & Schuster, 2005) will ultimately run out of space and result in runtime failure. Discarding interval records is possible if they have already been applied to all nodes. But some nodes may never acquire a particular lock while other nodes acquire it intensively. This issue cannot be ignored, particularly in multithreaded protocols where the length of the vector timestamp scales with the number of threads. For highly-threaded applications like Tomcat, this has scalability impacts on both memory and network bandwidth. Therefore, our protocol eschews the use of vector timestamps. Rather, we employ object-level scalar timestamps to assist in deriving the set of write notices. The basic idea is illustrated by Figure 4. Each node maintains a data structure called the timestamp map, which is essentially a hash table recording all shared objects that have once been modified. Each map entry consists of the object id, a scalar timestamp and an n-bit binary vector (n being the number of nodes), and is used to derive the corresponding write notice formatted as a couple (object id, timestamp). The n-bit binary vector is used to keep track of which nodes have applied the write notice (0 = not yet; 1 = applied). If all the n nodes have applied the write notice, it is considered obsolete and can be discarded.
The size of this map scales with the number of modified shared objects rather than the number of lock operations on shared objects. Repetitive locking on the same object will not generate separate interval records but simply updates the same entry in the timestamp map. Due to the high read/write ratios of web applications, the number of modified shared objects is limited. The timestamp map also undergoes a periodic shrinking phase to clean up obsolete entries, so the map is practically small in size most of the time. Upon a shared lock release, modifications will be recorded into the local node's timestamp map. When a lock transfer happens, all non-obsolete map entries will be extracted as write notices and passed from the releaser to the acquirer. They will also be saved in the acquirer's map. Write notices with a
newer timestamp will overwrite an old map entry, if any, and reset its n-bit vector to all zeros so that future acquirers will learn of the changes. Without tracking the exact partial order, the set of write notices sent to an acquirer may not be minimal and may include modifications that happen after the release of the lock being acquired. The drawback is that some cache objects at the acquirer side may be invalidated earlier than necessary. However, this effect is insignificant, since if the thread is really going to access the invalidated cache objects, it eventually needs to see the modifications. The effect also does not accrue, owing to our periodic cleanup of obsolete map entries and the selective invalidations based on object timestamp comparison.

Figure 4. Timestamp map for implementing HLRC
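A compact Java-level sketch of the timestamp map just described may help; the real structure lives inside the JVM runtime, so the class and field names below are illustrative assumptions only.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// One entry per modified shared object, instead of one interval record per lock operation.
class TimestampMap {
    static final class Entry {
        long timestamp;          // scalar timestamp of the latest recorded write
        final BitSet applied;    // bit i set => node i has applied this write notice
        Entry(int numNodes) { applied = new BitSet(numNodes); }
    }

    private final int numNodes;
    private final Map<Long, Entry> entries = new HashMap<Long, Entry>(); // objectId -> entry

    TimestampMap(int numNodes) { this.numNodes = numNodes; }

    // Called at a shared lock release for every dirty object.
    void recordWrite(long objectId, long timestamp) {
        Entry e = entries.get(objectId);
        if (e == null) {
            e = new Entry(numNodes);
            entries.put(objectId, e);
        }
        if (timestamp > e.timestamp) {
            e.timestamp = timestamp;
            e.applied.clear();               // newer write: no node has applied it yet
        }
    }

    // Called when node `nodeId` applies the write notice (objectId, timestamp).
    void markApplied(long objectId, int nodeId) {
        Entry e = entries.get(objectId);
        if (e != null) {
            e.applied.set(nodeId);
        }
    }

    // Periodic shrinking phase: discard entries that every node has already applied.
    void shrink() {
        Iterator<Entry> it = entries.values().iterator();
        while (it.hasNext()) {
            if (it.next().applied.cardinality() == numNodes) {
                it.remove();
            }
        }
    }
}
```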
5.3 Volatile Consistency Protocol Most DJVM prototype implementations enforce the cluster-wide semantics of the volatile construct in a way that is stricter than necessary. For straightforward compatibility, the volatile construct is usually treated as if it were a lock, thus introducing unnecessary mutual exclusivity to the application. The latest Java concurrent utility package (JSR166, 2004), particularly the ConcurrentHashMap class shipped along with it, employs segment-based locks plus volatile count and object value fields to guard different hash bucket ranges. This advanced data structure offers much more scalable throughput than the conventional Java Hashtable. However, such a good design for concurrency will be smothered if the underlying DJVM handles the volatile fields as locks. So we decided to tailor consistency support to volatile fields. Our new protocol for maintaining cluster-wide volatile field consistency is a passive concurrent-read exclusive-write (CREW) protocol. It enforces sequential consistency to ensure that the next reader thread can see the updates made by the last writer on the same object. To implement this model, we need to assign a manager for each object with volatile fields, and it is naturally the home node where the object is created. For ease of explanation, we refer to an object with volatile fields as a volatile object. The home node needs to maintain two states on the home copy of a volatile object, readable and exclusive, as well as a list, called the copyset, of the nodes that currently hold a valid cache copy of the object. When the home node receives a fetch request from a node on a readable volatile object, the node's id will be added to the copyset list of the home copy. The consistency of a volatile object relies on the active writer to tell the readers of such an update. When a thread wants
to write the object, whether to the home copy or a cache copy, it must first gain the exclusive right on it from its home node. Before the exclusive right is granted to the candidate writer, the home will broadcast invalidations to all members of the copyset and clean up the copyset. The writer will record its modified objects into the per-node volatile dirty list. The exclusive right is returned to the home when the modification (diff) is flushed. Reads/writes on home objects similarly need to go through the state check, except that they are done directly on the object data without diff generation and flushing. There is no need to generate any write notices because volatile cache copies are passively invalidated by the home when a writer exists. Upon a read on an invalid volatile object, the reader needs to contact the home for the latest copy and join the copyset again. If the state of the home copy is exclusive, the fetch request will be put into a queue pointed to by the volatile object header. When the writer returns the diff and the exclusive right to the home, the home will turn the object state back to readable and reply to all queued readers with the updated object data. As long as the state of a cached volatile object stays valid, its consistency is guaranteed and the thread can directly trust it until an invalidation is received when some writer exists. This is the beauty of the protocol: it results in much better concurrency. Reads on a valid volatile object are pure local operations without remote locks or any communication. Given the high read/write ratios of web applications, this design tradeoff shifts the communication overhead from the dominant reads to the writes.
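The home-node side of the CREW protocol just described can be summarized by the following sketch; it is purely illustrative (the actual protocol lives inside the JVM runtime), and all class, method and message names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Illustrative home-node bookkeeping for one volatile object under the CREW protocol.
class VolatileHomeEntry {
    enum State { READABLE, EXCLUSIVE }

    private State state = State.READABLE;
    private final Set<Integer> copyset = new HashSet<Integer>();       // nodes with a valid cache copy
    private final Queue<Integer> waitingReaders = new ArrayDeque<Integer>();

    // A node fetches the object for reading.
    synchronized void onReadFetch(int nodeId) {
        if (state == State.READABLE) {
            copyset.add(nodeId);
            replyWithObjectData(nodeId);
        } else {
            waitingReaders.add(nodeId);       // queued until the writer returns the diff
        }
    }

    // A candidate writer asks for the exclusive right.
    synchronized void onWriteRequest(int writerNodeId) {
        for (int reader : copyset) {
            sendInvalidation(reader);         // invalidate every member of the copyset
        }
        copyset.clear();
        state = State.EXCLUSIVE;
        grantExclusiveRight(writerNodeId);
    }

    // The writer flushes its diff and returns the exclusive right.
    synchronized void onDiffFlush(byte[] diff) {
        applyDiff(diff);
        state = State.READABLE;
        Integer reader;
        while ((reader = waitingReaders.poll()) != null) {
            copyset.add(reader);
            replyWithObjectData(reader);      // answer all queued readers with fresh data
        }
    }

    // Message-layer stubs, omitted in this sketch.
    private void replyWithObjectData(int nodeId) { }
    private void sendInvalidation(int nodeId) { }
    private void grantExclusiveRight(int nodeId) { }
    private void applyDiff(byte[] diff) { }
}
```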
PERFORMANCE ANALYSIS In this section, we present the performance results obtained by running Tomcat on JESSICA2.
6.1 Experimental Setup Our experimental platform consists of three tiers: 1. web tier: a 2-way Xeon SMP server with 4GB RAM running the master JVM of JESSICA2 with Apache Tomcat 3.2.4 started on it; 2. application tier: a cluster of eight x86-based PCs with 512 MB RAM serving as the DJVM worker nodes; 3. data tier: a cluster of four x86-based PCs with 2GB RAM supporting MySQL Database Server 5.0.45. All nodes run under Fedora Core 1 (kernel 2.4.22). A Gigabit Ethernet switch is used to link up the three tiers, while nodes within the same tier are connected by Fast Ethernet networks. The initial and maximum heap sizes of each worker JVM are set to 128MB and 256MB respectively. Each database node holds the same dataset replica, with MySQL replication enabled to synchronize data updates across database servers in nearly real time. Jakarta JMeter 2.2 is used to synthesize varying workloads to stress the testing platform. Table 1 shows the application benchmark suite that we use to evaluate our clustering approach using the DJVM. The benchmarks are designed to model real-life web application patterns.
1. Bible-quote characterizes applications like text search engines, news archives and company catalogs. The servlet application is I/O-intensive, serving document retrievals and search requests over a set of text files of books.
2. Stock-quote models stock market data providers. We follow the trend of web services that deliver price data as XML messages. The application reads stock price data matching the input date range from the database and formats the query result into an XML response.
Table 1. Application Benchmark Suite

Application      | Object Sharing          | Workload Nature               | I/O
Bible-quote      | No sharing              | I/O-intensive                 | Text files
Stock-quote      | No sharing              | Relatively compute-intensive  | Database
Stock-quote/RSA  | No sharing              | Relatively compute-intensive  | Database
SOAP-order       | HTTP session            | I/O-intensive                 | Database
TPC-W            | HTTP session            | I/O-intensive                 | Database and image files
Bulletin-search  | Cached database records | Memory-intensive              | Database
3. Stock-quote/RSA is a secure version of Stock-quote involving compute-intensive operations of 1024-bit RSA encryption on the price data.
4. SOAP-order models a B2B e-commerce web service. A SOAP engine is needed to support the service. We choose Apache SOAP 2.3.1 and deploy it to Tomcat. The application logic is to parse a SOAP message enclosing securities order placements, validate the user account and order details, and then put the successful transactions into the database.
5. TPC-W is a standard transactional web benchmark specification. It models an online bookstore with session-based workloads and a mix of static and dynamic web interactions. We adopt the Java servlet implementation developed by (ObjectWeb, 2005) but tailor the utility class for data access by disabling the default database connection pooling and utilizing thread-local storage to cache connections instead.
6. Bulletin-search emulates a search engine in a bulletin board or web forum system. We take the data dump from the RUBBoS benchmark (ObjectWeb, 2004) to populate the database. The application maintains a hash-based LRU-cache map of the results of costly database searches, and is thus memory-intensive (a minimal sketch of such a cache follows this list). In order not to lift the garbage collection frequency too much, we impose a capacity limit on the cache map, taking up about one-fourth of the local JVM heap.
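The chapter does not list Bulletin-search's cache code; a minimal sketch of a capacity-bounded, hash-based LRU cache map of the kind described could look as follows (the class name and the capacity figure are assumptions, and real code would additionally synchronize accesses to the map).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical application-level LRU cache for costly database search results.
class SearchResultCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    SearchResultCache(int capacity) {
        super(16, 0.75f, true);          // access-order = true gives LRU behaviour
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;        // bound the map so GC frequency stays low
    }
}

// Usage (illustrative): one map per server; under the GOS the entries created by any
// thread become visible to all threads, yielding the indirect cache hits of Table 3.
// SearchResultCache<String, Object> cache = new SearchResultCache<String, Object>(512);
```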
The original Tomcat is ported to JESSICA2 with a few customizations as follows: 1. the shared thread pool is disbanded. We replace the original thread pool by a simpler implementation which spawns a static count of non-pooled threads based on the server configuration file. 2. several shared object pools (e.g. static mapping tables for MIME types and status codes) are disintegrated into thread-local caches. The total lines of modified code including the new thread pool source file we introduce are less than 370 (about 0.76% of the Tomcat source base).
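Customization 2 above — breaking a JVM-wide shared lookup table into thread-local caches — can be sketched as follows; this is an illustration of the idea only, not the actual change made to the Tomcat source, and the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Instead of one shared, synchronized MIME-type table, whose monitor would become a
// cluster-wide lock on the DJVM, each thread keeps its own copy of the read-mostly data.
class MimeTypes {
    private static final ThreadLocal<Map<String, String>> LOCAL =
            new ThreadLocal<Map<String, String>>() {
                @Override
                protected Map<String, String> initialValue() {
                    Map<String, String> m = new HashMap<String, String>();
                    m.put("html", "text/html");      // effectively constant mapping data
                    m.put("gif", "image/gif");
                    return m;
                }
            };

    static String lookup(String extension) {
        return LOCAL.get().get(extension);           // no shared monitor, no remote lock
    }
}
```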
6.2 Scalability Study In this experiment, we measure the maximum throughputs and average response times obtained by scaling the number of worker nodes from two to eight. The speedup is calculated by dividing the baseline runtime of Tomcat on Kaffe JVM 1.0.7 by the parallel runtime of Tomcat on JESSICA2. Figure 5 shows the results obtained for each benchmark. We can see that most of the applications scale well and achieve efficiency ranging from 66% (SOAP-order) to 96.7% (Stock-quote). Bible-quote, Stock-quote and Stock-quote/RSA show almost linear speedup because they belong to the class of stateless applications, undergoing true parallelism without any GOS communications between the JVMs. In particular, Stock-quote
and Stock-quote/RSA involve operations of coarser work granularity, such as string manipulations and RSA encryptions, and hence attain nearly perfect scalability more readily. The relatively poorer speedups seen by SOAP-order and TPC-W are expected, as they are stateful applications and involve GOS overheads when sharing HTTP session objects among JVM heaps. We will further discuss the limited speedup obtained by SOAP-order in section 6.5. Bulletin-search shows a nonlinear but steepening speedup curve when the number of worker nodes scales out, due to the implicit cooperative cache effect given by the GOS that we described in section 4. As the nodes scale, when the cluster-wide aggregated available memory becomes large enough to accommodate most of the data objects cached in the application, the cache benefit contributes a sharp rise in speedup. Further study on this effect is given in section 6.4. Table 2 shows the cluster-wide thread count used in each application and the overall protocol messaging overheads inside JESSICA2 in the 8-node configuration. The count of I/O redirections is proportional to the request throughput and generally does not have an impact on scalability. The higher number of GOS protocol messages explains the poorer scalability of an application when cross-checked against Figure 5. Bulletin-search is an exceptional case, as its performance is determined more by its cooperative caching benefits, which can outweigh the cost of GOS communications.

Figure 5. Scalability and average response times obtained by Tomcat on JESSICA2
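For reference, the speedup and efficiency figures quoted above follow the usual definitions; the worked example below assumes (our reading, not stated explicitly in the text) that the quoted efficiencies refer to the eight-node runs.

```latex
S(N) = \frac{T_{\mathrm{Kaffe}}}{T_{\mathrm{JESSICA2}}(N)}, \qquad
E(N) = \frac{S(N)}{N}
% e.g. an efficiency of E(8) = 96.7\% corresponds to a speedup of
% S(8) = 0.967 \times 8 \approx 7.7
```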
Table 2. Protocol message overheads of JESSICA2 DJVM

Application      | # Threads | # GOS Messages / Sec | # I/O Redirections / Sec
Bible-quote      | 80        | 0                    | 2006
Stock-quote      | 80        | 0                    | 1791
Stock-quote/RSA  | 80        | 0                    | 275
SOAP-order       | 16        | 979                  | 146
TPC-W            | 40        | 351                  | 1413
Bulletin-search  | 16        | 483                  | 297

6.3 Comparison with Existing Tomcat Clustering A control experiment is conducted on the same platform to compare the DJVM approach with an existing clustering method for Tomcat using web load balancing plug-ins. We run an instance of Apache web server 2.0.53 on the web tier and eight standalone Tomcat servers on the application tier of our platform. The web server is connected to the Tomcat servers via the mod_jk connector 1.2.18 with sticky-session enabled (in-memory session replication is not supported in this comparison). The cluster-wide total number of threads and the heap size configurations in this experiment are equal to those used in the DJVM approach. Figure 6 shows the throughputs obtained by the two clustering approaches on eight nodes. We can see that both solutions achieve similar performance (within ±8%) for the stateless web applications (Bible-quote, Stock-quote and Stock-quote/RSA). These applications exhibit embarrassing parallelism and will not gain much advantage from the GOS. So, putting the GOS aside, we can expect both solutions to perform more or less the same, because our transparent I/O redirection and mod_jk's socket forwarding are functionally alike in dispatching requests and collecting responses. Yet, extra overheads could be incurred in our solution when transferring big chunks of data via fine-grain I/O redirections and during object state checks. TPC-W performs about 11% better on the DJVM than with mod_jk. One reason is that servers sharing sessions over the GOS are no longer restricted to handling requests bound to their sticky sessions, whereas load hotspots can happen intermittently when using mod_jk. On the other hand, SOAP-order performs 26% poorer on JESSICA2 than with mod_jk. The main factor that pulls down the performance is that the SOAP library has some code performing fairly intensive synchronization in every request processing cycle. We will see later that the overhead breakdown presented in Section 6.5 echoes this factor. Bulletin-search performs 8.5 times better on the DJVM due to application cache hits augmented by the GOS. We will explain why the DJVM approach has significantly outperformed the existing solution in the next section.
Figure 6. Comparison of Tomcat on DJVM and existing Tomcat clustering
Table 3. Bulletin-search's cache size setting and hit rates augmented by GOS

No. of Nodes | Cache Size (# Cache Entries) | Relative Cache Size | Total Hit Rate | Indirect Hit Latency (ms) | Cost Ratio of Miss : Indirect Hit | Throughput Speedup
1            | 512                          | 12.5%               | 18.6%          | N/A                       | N/A                               | N/A
2            | 931                          | 22.7%               | 33.9%          | 9.07                      | 40.79                             | 1.26
4            | 1862                         | 45.5%               | 59.3%          | 8.18                      | 45.23                             | 2.02
8            | 3724                         | 90.9%               | 90.7%          | 11.74                     | 31.52                             | 7.96
6.4 Effect of Implicit Cooperative Caching Bulletin-search exemplifies the class of web applications that can exploit the GOS to virtualize a large heap for caching application data. Table 3 shows the application cache hits obtained by Bulletin-search when the number of cluster nodes scales from one to eight. With the GOS, the capacity setting of the cache map can be increased in proportion to the node count, beyond the single-node limit, since different portions of the map are stored under different heaps. This is not possible without the GOS. Upon the creation of a new cache entry, the object reference installed in the map becomes visible to all threads across synchronization points, so redundant caching is eliminated. Threads can exploit indirect (or global) cache hits in case the desired object is not in the local heap, easing the database bottleneck. We can see from Figure 7 that the overall hit rate keeps rising along with the scaling of the worker nodes of the DJVM, and most of the cache hits are contributed by the indirect hits once the single-node capacity has been exceeded. This is the reason why our approach achieves multifold higher throughput than the existing clustering approach, in which there are only direct (local) hits that level off or even drop slightly no matter how many nodes are added. Here we define a term called relative cache size (RCS) that refers to the percentage of the aggregated cache size (combining all nodes) relative to the total size of the data set. When the RCS is below 50% in the 4-node case, the achievable cache hit rate is only around 60% and the 40% misses get no improvement,
such that the application obtains a speedup of merely two. But when the RCS exceeds a certain level (e.g. 90% in the 8-node case), most of the requests are fulfilled by the global cache instead of going through the database tier. This explains the non-uniform scalability curve of this application in Figure 5.

Figure 7. Composition of application cache hits in Bulletin-search with GOS

Table 4. GOS overhead breakdown (# messages / sec)

GOS Message Type  | SOAP-order | TPC-W | Bulletin-search
Lock acquire      | 198        | 48    | 61
Lock release      | 198        | 48    | 61
Flush             | 217        | 70    | 92
Static data fetch | 18         | 10    | 0
Object fault-in   | 197        | 99    | 160
Array fault-in    | 79         | 50    | 105
6.5 GOS Overhead Breakdowns Table 4 shows the GOS overhead breakdowns in terms of message count per second for the three stateful applications. Figure 8 supplements this with a percentage breakdown of the message count as well as the message latency. Lock acquire and release messages are issued when locking a remote object. Flush messages are sent upon lock releases, but the flush message count is slightly higher than the lock release count because in some cases updates are flushed to more than one home. The other overheads are related to access faults, which translate into communications with the corresponding home nodes. It is obvious that SOAP-order involves much more remote locking overhead than the other applications. Our further investigation finds that one utility class of the deployed SOAP library induces, for each request, about five to six remote locks on several shared hash tables and four remote locks on
ServletContextFacade coming from the facade design pattern of Tomcat. Such heavy cluster-wide synchronization overheads justify the relatively poorer scalability of this application.

Figure 8. GOS percentage overhead

Table 5. Cluster-wide locking overheads

Application      | # Local Locks / Sec | # Remote Locks / Sec | % Remote Locks Under Contention | Ratio of Local : Remote Locks
SOAP-order       | 232631              | 198                  | 35%                             | 1175:1
TPC-W            | 240470              | 48                   | 45%                             | 5010:1
Bulletin-search  | 27380               | 61                   | 6.5%                            | 449:1

Table 5 presents the local and remote locking rates for each application. We can see that local locks are far more numerous than remote locks. The main reason is that, in Java-based servers, thread-safe reads/writes on I/O stream objects are exceptionally frequent, producing tremendous numbers of local locks. While local lock latency is very short (our benchmarks show an average of 0.2us), remote lock latency is at least several thousand times longer in commodity clusters; yet remote locks are practically much fewer in most web applications. Another piece of information given by Table 5 is that SOAP-order and TPC-W have about 35% to 45% of their remote locks under cluster-wide contention, thus prolonging the wait time before locks are granted. This is why lock acquire is the dominant part of the message latency for these two applications in Figure 8.
RELATED WORK Despite the boom of software DSM and the later DJVM research, it seems there have been only a few attempts at transparently supporting real-life server applications by means of shared virtual memory systems. Even fewer are success cases demonstrating good scalability, even though some of them relied on non-commodity hardware to support their systems. Shasta (Scales & Gharachorloo, 1997) is a fine-grained software DSM system that uses binary code instrumentation techniques extensively to transparently extend memory accesses to have cluster-wide semantics. The Oracle 7.3 database server was ported to Shasta running on SMP clusters, albeit without success in achieving good scalability. They used the TPC-B and TPC-D database benchmarks, which model online transaction processing and decision support queries respectively. TPC-B failed to scale at all due to too frequent updates, while TPC-D managed only a speedup slightly above one on three servers connected by a non-commodity Memory Channel network. To some extent, their experience and results exhibit many limitations of implementing a single system image at the operating system level, compared to our approach of clustering at the middleware level. For example, a relaxed memory consistency model usually cannot be adopted at the operating system level, since the correctness of binary applications often relies on the consistency model imposed by the hardware, which is generally much stricter than the Java memory model. Being able to adopt a relaxed memory model such as HLRC in our case is very important to server applications which may be intensive in synchronization. cJVM (Aridor et al., 1999) is one of the earliest DJVMs, designed with the intent of enabling large multithreaded server applications such as Jigsaw to run transparently on a cluster. cJVM operates in interpreter mode; it employs a master-proxy model and a method shipping approach to support object sharing among
distributed threads. The system relies on proxy objects to redirect field access and method invocation to the node where the object's master copy resides. This model basically conforms to sequential consistency and is not efficient, since every object access and method invocation may require communication, although some optimization techniques were developed to avoid needless shipping. In contrast, our DJVM runs in JIT-compilation mode and conforms to release consistency, both propelling faster execution. In (Aridor et al., 2000), cJVM was evaluated by running pBOB (Portable Business Object Benchmark), a multithreaded business benchmark inspired by TPC-C, on a 4-node cluster connected by non-commodity Myrinet. They obtained an efficiency of around 80%. However, it is unclear whether cJVM would perform as well if JIT compilation were enabled and commodity Ethernet were used, as in our case. Terracotta (Zilka, 2006) is a JVM-level clustering product that has been on the market for a couple of years. It applies bytecode instrumentation techniques similar to JavaSplit (Factor, Schuster, & Shagin, 2003) to a predefined list of common products and to user-defined classes for clustering among multiple Java application instances. Users need to manually specify shared classes as distributed shared objects (DSOs) and their cluster-aware concurrency semantics. Contrasting with our SSI-oriented approach, this configuration-driven approach may impair user transparency and create subtle semantic violations. Terracotta uses a hub-and-spoke architecture that requires setting up a central server, namely the "L2 server", to store all DSOs and to coordinate heap changes (field-level diffs) across JVMs. At synchronization points, changes on a DSO have to be sent to the L2 server, which forwards the changes to all other clustered JVMs under the DSO's copyset to keep all replicas consistent. Our home-based protocol needs to keep only the home copy up-to-date by flushing diffs; the next acquirer then sees the changes by faulting in the whole object. Terracotta's centralized architecture may make the cluster susceptible to a global bottleneck when scaling out. Removing the bottleneck requires forklift upgrades on the L2 server (i.e. vertical scaling), which spoil the virtue of horizontal scaling using commodity hardware. We believe a home-based peer-to-peer protocol is a more scalable architecture for distributed object sharing.
CONCLUSION AND FUTURE WORK In this chapter, we introduce a new transparent clustering approach using distributed JVMs (DJVMs) for web application servers such as Apache Tomcat. A DJVM couples a group of extended JVMs to distribute a multithreaded Java application over a cluster. It realizes transparent clustering without introducing new APIs and incorporates most of the advantages of an SSI-centric system, such as global resource integration and coordination. Using DJVMs to cluster web application servers can improve the ease of web application clustering and global resource utilization; both have been poorly addressed in most existing clustering solutions in the web community. We port Tomcat to the JESSICA2 DJVM to validate this clustering approach. Our study addresses new challenges in supporting web application servers, whose runtime properties as modern object-oriented servers differ markedly from those of the classical scientific applications evaluated in previous DJVM projects. The key challenge lies in making the system scale with a large number of threads and offering efficient shared memory support for fine-grain object sharing among the JVMs. We enhance the cache coherence protocol design accordingly in several aspects: (1) adopting a unified cache among local threads for better memory utilization; (2) implementing a timestamp-assisted HLRC protocol to ensure release consistency of shared objects; and (3) enforcing sequential consistency among cluster-wide volatile fields via a concurrent-read exclusive-write (CREW) protocol. These improvements result in more relaxed
coherence maintenance and higher concurrency. Our experimental results have shown the significant cache hit rates obtained by using the global object space (GOS) to cache a large application dataset with automatic consistency guarantees. Several trends favor the adoption of DJVM clustering technology. First, today's web applications are becoming increasingly resource-intensive due to security enhancements, more complicated business logic and XML-based standards. The collaborative computing paradigm provided by DJVMs becomes vital for generating a helpful cache effect across cluster nodes and thus for efficient resource usage. Second, application logic tends to increase in complexity, and more and more application frameworks are POJO-based. Clustering at the application level, and adopting the proprietary clustering mechanisms shipped with particular application server products, will tend to be laborious and error-prone, if not infeasible. We foresee that DJVMs, typifying generic clustering middleware systems, will gain wider user acceptance. Third, the design and development of user applications, server programs and library support nowadays place more emphasis on scalability than ever. Once scalability and performance portability are no longer obstacles, the cost-effectiveness of DJVMs should have a catalytic effect, with more applications readily adopting DJVM technology. In future work, we will investigate solutions to enhance fine-grain object sharing efficiency in the DJVM environment. In our research plans, we would consider incorporating transactional consistency (Hammond et al., 2004) into the cluster-wide memory coherence protocol.
REFERENCES Adve, S. V., & Hill, M. D. (1993). A Unified Formalization of Four Shared-Memory Models. IEEE Transactions on Parallel and Distributed Systems, 4(6), 613–624. doi:10.1109/71.242161 Antoniu, G., Bougé, L., Hatcher, P., MacBeth, M., McGuigan, K., & Namyst, R. (2001). The Hyperion system: Compiling multithreaded Java bytecode for distributed execution. Parallel Computing, 27(10), 1279–1297. doi:10.1016/S0167-8191(01)00093-X Aridor, Y., Factor, M., & Teperman, A. (1999). cJVM: A Single System Image of a JVM on a Cluster. Paper presented at the Proceedings of the 1999 International Conference on Parallel Processing. Aridor, Y., Factor, M., Teperman, A., Eilam, T., & Schuster, A. (2000). Transparently obtaining scalability for Java applications on a cluster. Journal of Parallel and Distributed Computing, 60(10), 1159–1193. doi:10.1006/jpdc.2000.1649 ASF. (2002). The Apache Tomcat Connector. Retrieved June 18, 2008, from http://tomcat.apache.org/ connectors-doc/ Ban, B. (1997). JGroups - A Toolkit for Reliable Multicast Communication. Retrieved June 18, 2008, from http://www.jgroups.org/javagroupsnew/docs/index.html Factor, M., Schuster, A., & Shagin, K. (2003). JavaSplit: a runtime for execution of monolithic Java programs on heterogenous collections of commodity workstations. Paper presented at the Proceedings of the IEEE International Conference on Cluster Computing.
Hammond, L., Wong, V., Chen, M., Carlstrom, B. D., Davis, J. D., & Hertzberg, B. (2004). Transactional Memory Coherence and Consistency. SIGARCH Comput. Archit. News, 32(2), 102. doi:10.1145/1028176.1006711 Iosevich, V., & Schuster, A. (2005). Software Distributed Shared Memory: a VIA-based implementation and comparison of sequential consistency with home-based lazy release consistency: Research Articles. Software, Practice & Experience, 35(8), 755–786. doi:10.1002/spe.656 Johnson, R. (2002). Spring Framework - a full-stack Java/JEE application framework. Retrieved June 18, 2008, from http://www.springframework.org/ JSR166. (2004). Java concurrent utility package in J2SE 5.0 (JDK1.5). Retrieved June 24, 2008, from http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html Keleher, P., Cox, A. L., Dwarkadas, S., & Zwaenepoel, W. (1994). TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. Paper presented at the Proceedings of Winter 1995 USENIX Conference. Keleher, P., Cox, A. L., & Zwaenepoel, W. (1992). Lazy release consistency for software distributed shared memory. Paper presented at the Proceedings of the 19th annual international symposium on Computer architecture. Ma, M. J. M., Wang, C. L., & Lau, F. C. M. (2000). JESSICA: Java-enabled single-system-image computing architecture. Journal of Parallel and Distributed Computing, 60(10), 1194–1222. doi:10.1006/ jpdc.2000.1650 ObjectWeb. (2004). RUBBoS: Bulletin Board Benchmark. Retrieved June 19, 2008, from http://jmob. objectweb.org/rubbos.html ObjectWeb. (2005). TPC-W Benchmark (Java Servlets version). Retrieved June 19, 2008, from http:// jmob.objectweb.org/tpcw.html Perez, C. E. (2003). Open Source Distributed Cache Solutions Written in Java. Retrieved June 24, 2008, from http://www.manageability.org/blog/stuff/distributed-cache-java Scales, D. J., & Gharachorloo, K. (1997). Towards transparent and efficient software distributed shared memory. Paper presented at the Proceedings of the sixteenth ACM symposium on Operating systems principles. Stets, R., Dwarkadas, S., Hardavellas, N., Hunt, G., Kontothanassis, L., & Parthasarathy, S. (1997). Cashmere-2L: software coherent shared memory on a clustered remote-write network. SIGOPS Oper. Syst. Rev., 31(5), 170–183. doi:10.1145/269005.266675 Veldema, R., Hofman, R. F. H., Bhoedjang, R., & Bal, H. E. (2001). Runtime optimizations for a Java DSM implementation. Paper presented at the Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande. Wilkinson, T. (1998). Kaffe - a clean room implementation of the Java virtual machine. Retrieved 2002, from http://www.kaffe.org/
Yu, W., & Cox, A. (1997). Java/DSM: A Platform for Heterogeneous Computing. Concurrency (Chichester, England), 9(11), 1213–1224. doi:10.1002/(SICI)1096-9128(199711)9:11<1213::AIDCPE333>3.0.CO;2-J Zhou, Y., Iftode, L., & Li, K. (1996). Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems. SIGOPS Oper. Syst. Rev., 30(SI), 75-88. Zhu, W., Wang, C. L., & Lau, F. C. M. (2002). JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support. Paper presented at the Proceedings of the IEEE International Conference on Cluster Computing. Zilka, A. (2006). Terracotta - JVM Clustering, Scalability and Reliability for Java. Retrieved June 19, 2008, from http://www.terracotta.org
KEY TERMS AND DEFINITIONS
Copyset: The current set of nodes or threads that hold a valid cached copy of an object. This data structure is kept at the home node of the object and is used for sending invalidations in a single-writer-multiple-reader cache coherence protocol.
Distributed Java Virtual Machine (DJVM): A parallel execution environment composed of a collaborative set of extended Java virtual machines spanning multiple cluster nodes for running a multithreaded Java application.
Global Object Space (GOS): A virtualized memory address space for location-transparent object access and sharing across distributed threads. The GOS for distributed Java virtual machines is built upon a distributed shared heap architecture.
Java Memory Model (JMM): A memory (consistency) model that defines the legal behaviors of multithreaded Java code with respect to shared memory. The JMM serves as a contract between programmers and the JVM.
Lazy Release Consistency (LRC): The most widely adopted memory consistency model in software distributed shared memory (DSM), in which the propagation of shared page/object modifications (in the form of invalidations or updates) is delayed until lock-acquire time.
Implicit Cooperative Caching (ICC): A helpful cache effect created by distributed threads through cluster-wide accesses to a collection of shared object references.
ENDNOTE 1
This research was supported by Hong Kong RGC grant (HKU7176/06E) and China 863 grant (2006AA01A111).
Chapter 29
Middleware for Community Coordinated Multimedia
Jiehan Zhou, University of Oulu, Finland
Zhonghong Ou, University of Oulu, Finland
Junzhao Sun, University of Oulu, Finland
Mika Rautiainen, University of Oulu, Finland
Mika Ylianttila, University of Oulu, Finland
DOI: 10.4018/978-1-60566-661-7.ch029
ABSTRACT Community Coordinated Multimedia (CCM) envisions a novel paradigm that enables the user to consume multiple media through requesting multimedia-intensive Web services via diverse display devices, converged networks, and heterogeneous platforms within a virtual, open and collaborative community. These trends yield new requirements for CCM middleware. This chapter aims to systematically and extensively describe middleware challenges and opportunities to realize the CCM paradigm by reviewing the activities of middleware with respect to four viewpoints, namely mobility-aware, multimedia-driven, service-oriented, and community-coordinated.
INTRODUCTION With the popularity of mobile devices (e.g. mobile phone, camera phone, PDA), the advances of mobile ad hoc networks (e.g. enterprise networks, home networks, sensor networks), and the rapidly increasing amount of end user-generated multimedia content (e.g. audio, video, animation, text, image), human experience is being enhanced and extended by the consumption of multimedia content and multimedia
services over mobile devices. This enhanced human experience paradigm is generalized in this chapter under the term Community Coordinated Multimedia, abbreviated as CCM. The emerging CCM communication is characterized by pervasive or wireless access to multimedia-intensive Web services for collaboratively aggregating, sharing, and viewing TV broadcasting/multicasting services or on-demand audiovisual content over mobile devices. Thus the end user's experience is enhanced and extended by mobile multimedia communication with transparency in networking, location, synchronization, group communication, coordination, collaboration, etc. (Zhou et al, 2008a). Middleware plays a key role in offering this transparent networking, location, synchronization, group communication, coordination, collaboration, etc. In this chapter, middleware is perceived as a software layer that sits above the network operating system and below the application layer. It encapsulates the knowledge from the presentation and session layers of the OSI model that provides control over the dialogues/connections (sessions) and the understanding of syntax and semantics between distributed applications, and it abstracts the heterogeneity of the underlying environment between distributed applications. This chapter presents a survey and an initial design of P2P service-oriented community coordinated multimedia middleware. This work is a part of the EUREKA ITEA2 project CAM4Home1, a metadata-enabled content delivery and service framework. The chapter investigates technological CCM middleware challenges and opportunities from four viewpoints that describe the CCM: mobility-aware, multimedia-driven, service-oriented, and community-coordinated. These are the most prominent characteristics of CCM applications. The following middleware categories are identified for addressing challenges and opportunities in the CCM paradigm:
• Middleware for mobility management. The middleware for mobility management aims to provide mobile access to distributed multimedia applications and services, and addresses the limitations caused by terminal heterogeneity, network resource limitation, and node mobility.
• Middleware for multimedia computing and communication. The middleware for multimedia computing and communication aims to provide standard formats, specifications and techniques for representing all multimedia types in digital form, handling compressed digital video and audio data, and delivering streams.
• Middleware for service computing and communication. The middleware for service computing and communication aims to provide specifications and standards in the context of Web services to achieve the service-oriented multimedia computing paradigm, covering service description, interaction, discovery, and composition.
• Middleware for community computing and communication. The middleware for community computing and communication aims to provide the standards and principles which govern the participation of peers in the community and the messaging models.
The remainder of the chapter is organized as follows: Section 2 defines concepts relevant to CCM and middleware. Section 3 illustrates a generic CCM scenario. Section 4 analyzes the requirements of middleware for CCM. Section 5 presents a middleware architecture design for CCM. Section 6 surveys middleware technology for CCM with respect to the mobility-aware, multimedia-driven, service-oriented, and community-coordinated viewpoints. Section 7 discusses future trends in the evolution of CCM. Finally, Section 8 concludes the chapter.
Figure 1. Middleware ontology in the context of CCM
DEFINITIONS This section specifies a few concepts relevant to the CCM paradigm and CCM middleware as follows:
Multimedia: Represents a synchronized presentation of bundled media types, such as text, graphics, images, audio, video, and animation.
Community: Is generally defined as a group of a limited number of people held together by common interests and understandings, a sense of obligation and possibly trust (Bender, 1982).
Community Coordinated Multimedia (CCM): A CCM system maintains a virtual community for the consumption of CCM multimedia elements, i.e. both content generated by end users and content from professional multimedia providers (e.g., Video on Demand). The consumption involves a series of interrelated multimedia-intensive processes such as content creation, aggregation, annotation, etc. In the context of CCM, these interrelated multimedia-intensive processes are encapsulated into Web services, instead of multimedia applications, namely multimedia-intensive services, or briefly multimedia services.
Standard: Refers to an accepted industry standard. Protocol refers to a set of governing rules in communication between computing endpoints. A specification is a document that proposes a standard.
Middleware: The key technology which integrates two or more distributed software units and allows them to exchange data via heterogeneous computing and communication devices (Quasy, 2004). In this chapter, middleware is perceived as an additional software layer in the OSI model encapsulating knowledge from the presentation and session layers, and consisting of standards, specifications, forms, and protocols for multimedia, service, mobility and community computing and communication. Figure 1 illustrates the middleware ontology and its relationship to the other defined concepts.
Figure 2. An example scenario for CCM as presented in (Zhou et al, 2008b)
USAGE SCENARIO CCM envisions that user experiences are enriched and extended by the collaborative consumption of multimedia services through the interplay of two key enabling technologies: Web services and P2P technology. Figure 2 illustrates the CCM paradigm. The usage sequence of the CCM paradigm is given in (Zhou et al, 2008b).
CCM MIDDLEWARE REQUIREMENTS Figure 3 illustrates the four major viewpoints which provide an organizing outline for the specification of the CCM middleware that supports emerging CCM application solutions (e.g. mobile content creation, online learning, collaborative multimedia creation, etc.). These four viewpoints are the mobility-aware, multimedia-driven, service-oriented, and community-coordinated perspectives.
Figure 3. CCM middleware viewpoints
The requirements associated with each viewpoint together comprise the complete requirements for the CCM middleware specifications. Mobility-aware CCM. Ubiquitous computing is becoming prominent as small portable devices, and the wireless networks that support them, become more and more pervasive. In the CCM system, content can be consumed, created, analyzed, aggregated, and transmitted over mobile devices. Services, too, can be requested, discovered, invoked and interacted with over mobile devices. Examples of such mobile devices are portable computers, PDAs, and mobile phones. Mobile communication is the key technical enabler allowing mobile service computing to deliver services, including conventional voice and video on demand over broadband 3G connections. In the last decade, the mobile communication industry has been one of the most flourishing sectors within the ICT industry. Mobile communication is a solution that enables flexible integration of smart mobile phones, such as camera phones, with other computer systems. Smart phones make it possible to access services anywhere and anytime. In the context of mobility-aware CCM, mobile communication systems and infrastructures play an important role in delivering services cost-efficiently and with guaranteed QoS to support users' activities on the move. Therefore, CCM middleware systems must provide context management, dynamic configuration, connection management, etc. to facilitate anytime, anywhere service access. Multimedia-driven CCM. Digital convergence between audiovisual media, high-speed networks and smart devices is becoming a reality and helps make media content more directly manageable by computers (ERCIM, 2005). It presents new opportunities for CCM applications to enhance, enrich and extend users' daily experiences. CCM applications are multimedia intensive. These applications include multimedia content creation, annotation, aggregation, and sharing. CCM is expected to facilitate multimedia content management through multimedia-intensive automatic or semi-automatic applications (e.g. automatic content classification and annotation), or services. The nature of multimedia-driven CCM
yields the requirements for multimedia representation, compression, and streaming delivery. Service-driven CCM. The CCM system is expected to employ service-oriented computing (Krafzig et al, 2005, Erl, 2005) and content delivery networking (Vakali et al, 2003, Dixit et al, 2004) technologies for delivering content by broadcasting (e.g. music, TV, radio). By providing Web sites and tool suites, the CCM system enables all end users to access content and services with the desired quality and functionality. This vision takes advantage of the registration, discoverability, composability, open standards, etc. inherent in the service-orientation approach. Service orientation perceives functions as distinct service units, which can be registered, discovered, and invoked over a network. Technically, a service consists of a contract, one or more interfaces, and an implementation (Krafzig et al, 2005). The service-orientation principles for modern software system design are captured and promoted by contemporary service-oriented architecture (SOA), which introduces standardization for service registration, semantic messaging platforms, and Web services technology. Therefore the CCM middleware system must provide description, discovery, and interaction mechanisms in the context of multimedia services. Multimedia services dealing with multimedia computing, such as content annotation, are networked and made available through service discovery. Community-coordinated CCM. CCM attempts to build an online community for end users and service providers by providing the means of managing community memberships and members' contacts. As illustrated in the scenario, end user Bob aggregates content with information that is relevant to a common subject and sends it to his friend Alice in the community, who is also interested in it. Moreover, CCM attempts to provide the end user with individual and customized content and services by managing CCM users' preferences and profiles. A CCM system usually has a large user base for sharing multimedia. The user base is usually organized in terms of specific interest ties and membership management. In order for community coordination in multimedia sharing to succeed, the CCM middleware system must provide the standards and principles which govern the participation of peers in the community (peer management), preference management, and various messaging models.
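The register/discover/invoke cycle of service orientation described above can be sketched in Java. The interfaces and names below are illustrative assumptions made for this discussion, not part of any CCM or SOA specification:

```java
import java.util.List;

// Hypothetical contract of a multimedia-intensive service (e.g. automatic annotation).
interface AnnotationService {
    List<String> annotate(String contentUri);   // returns descriptive tags for the content
}

// Minimal registry abstraction: services are registered under a name and later
// discovered by consumers that only know the contract, not the implementation.
interface ServiceRegistry {
    void register(String name, Object endpoint);
    <T> T discover(String name, Class<T> contract);
}

class ServiceConsumerExample {
    static void consume(ServiceRegistry registry) {
        AnnotationService service = registry.discover("annotation", AnnotationService.class);
        List<String> tags = service.annotate("http://example.org/videos/42");
        System.out.println(tags);                // tags could drive aggregation or sharing
    }
}
```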
MIDDLEWARE ARCHITECTURE FOR CCM Based on the analysis of the middleware requirements from the four viewpoints, a middleware architecture for CCM is introduced in Figure 4. It comprises four layered categories which abstract the components for computing and communication into the multimedia, service, mobility, and community management perspectives. From top to bottom, the CCM middleware architecture consists of the Community-coordinated, Service-oriented, Multimedia-driven, and Mobility-aware layers. Each layer is a collection of related standards and specifications that provides services to manage multimedia data representations, sessions, and end-to-end collaboration in a heterogeneous computing environment (Table 1). The lowest layer is the mobility-aware middleware, which aims to provide multimedia service access to a multimedia user equipped with a portable device, anytime and anywhere. Due to the limitations on terminal heterogeneity, network resources, and node mobility, this mobility-aware middleware layer must meet various requirements such as context, resource, and connection management. The second layer is the multimedia-driven middleware, which establishes the context of video and audio representation, encoding standards, and communication protocols for audio, video, and data. In this sense, the multimedia-driven middleware layer contains specifications and standards for multimedia representation and multimedia communication.
Figure 4. The middleware architecture for CCM
Table 1. Overview of the middleware architecture for CCM

Middleware layer | Specification | Keywords | Related CCM viewpoints
Mobility-aware | Context management, connection management, dynamic configuration, and adaptivity | Context-awareness, network connection, mobile nodes | Mobility- and pervasiveness-aware CCM
Multimedia-driven | Multimedia representation, compression, and communication | Multimedia description language, audio, image, video codec, and multimedia streaming | Multimedia-driven CCM
Service-oriented | Service description languages, messaging formats, discovery mechanisms | Service description, service discovery, service composition | Service-oriented CCM
Community-coordinated | Principles and rules for grouping and messaging management | Peer, group, messaging modes, etc. | Community-coordinated CCM
The third layer is the service-oriented middleware, which comprises specifications and standards that allow traditional software applications to exchange data with one another as they participate in multimedia business processes. These specifications include XML, SOAP, WSDL, UDDI, etc. By using service-oriented middleware, traditional multimedia applications are transformed into multimedia services which are made accessible over a network and can be combined and reused in the development of multimedia applications. The top layer is the community-coordinated middleware, which establishes the technical context that allows any peer connected to a network to exchange messages and collaborate independently of the underlying network topology. This ultimately leads to the idea of creating a community-coordinated multimedia communication environment for social, professional, educational, or other purposes.
SURVEY OF MIDDLEWARE FOR CCM This section elaborates the middleware layers identified above. On the one hand, it describes the state of the art of middleware with respect to the four major viewpoints, i.e. multimedia-driven, mobility-aware, service-oriented, and community-coordinated CCM. On the other hand, it presents a feasible and integrated middleware solution to meet the generic CCM requirements specified in Section 4.
MIDDLEWARE FOR MOBILITY-AWARE CCM Limitations and Requirements The rapid growth of wireless technologies and the development of smaller and smaller portable devices have led to the widespread use of mobile computing. Any user equipped with a portable device is able to access any service anytime and anywhere. Mobile access to distributed applications and services brings with it a great number of new issues. The limitations caused by the inherent characteristics of mobility are as follows:
• Terminal heterogeneity (Gaddah et al, 2008). Mobile devices usually have diverse physical capabilities, e.g. CPU processing power, storage capacity, power consumption, etc. For example, laptops have much more storage capacity and faster CPUs, while pocket PCs and mobile phones usually have far fewer resources available. Although mobile terminal technology has progressed rapidly in recent years, it is still impossible to make mobile devices as capable as fixed terminals. Hence, middleware should be designed to achieve optimal resource utilization.
• Network resource limitation. Compared with fixed networks, the performance of wireless networks (GPRS, UMTS, beyond-3G networks, Wi-Fi/WiMAX, HiperLAN, Bluetooth, etc.) varies significantly depending on the protocols and technologies used. Meanwhile, mobile devices may encounter sharp drops in network bandwidth, high interference, or temporary disconnection when moving between areas. Therefore, the middleware should be designed in a way that intrinsically takes into account the optimization of the limited network resources.
• Mobility. When mobile devices move from one place to another, they will have to deal with different types of networks, services, and security policies. In turn, this requires applications to behave accordingly in order to handle the various dynamic changes of environment parameters. Hence, the design of middleware should take the mobility of nodes into consideration as well.
The various requirements of middleware for mobility-aware CCM are as follows:
• Context management. The characteristics of mobile networks in the CCM environment are intermittent network connections and limited network bandwidth. Disconnection can happen frequently, either for an active reason, i.e. saving power, or for a passive reason, such as temporary loss of coverage or high interference. In order to deal with disconnection effectively and efficiently, the context of the middleware should be disclosed to the upper application layer instead of being hidden from it, to make application development much easier (a minimal sketch of such a context model is given after this list). Bellavista (Bellavista et al, 2007) summarized the context in a mobile computing environment as three different categories: network context, device context, and user context. The network context consists of the adopted wireless technology, the available bandwidth, the addressing protocol, etc. The device context includes details on the status of the available resources, such as CPU, battery, memory, etc. The user context, in its turn, is composed of information related to the location, preferences and QoS requirements of the user.
• Dynamic reconfiguration (Gaddah et al, 2008). During the CCM application lifetime, dynamic changes in infrastructure facilities, e.g. the availability of certain services, will require the application behavior to be altered accordingly. Therefore, dynamic reconfiguration is needed in such an environment. It can be achieved by adding a new functionality or changing an existing one at CCM application runtime. To support dynamic reconfiguration, middleware should be able to detect changes in the available resources and adopt corresponding approaches to deal with them. Reflective middleware is the widespread solution adopted to solve this problem.
• Connection management. User mobility and intermittent wireless signals result in frequent disconnection and reconnection of mobile devices, which is exceptional in fixed distributed systems. Therefore, middleware for mobility-aware CCM should adopt a different connection management mechanism from that of fixed distributed systems. An asynchronous communication mechanism is usually used to decouple the client from the server, of which the tuple space system is one of the typical solutions. Another issue related to connection management is the provision of services based on the concept of a session. In this case, a proxy can be adopted to hide the disconnection from the service layer.
• Resource management. Mobile devices are characterized by their limited resources, such as battery, CPU, memory, etc. Hence, mobile middleware for CCM should be lightweight enough to avoid overloading the mobile devices. Currently, middleware platforms designed for fixed distributed systems, e.g. CORBA, are too heavy to run on mobile devices, as they usually provide a number of functionalities which are not necessarily needed on resource-limited devices. Modular middleware design is widely adopted to make the middleware more lightweight.
• Adaptability. In mobile CCM, adaptability mainly refers to the ability to adapt to context changes dynamically. According to the currently available resources, adaptability allows middleware to
optimize the system behavior by choosing the corresponding protocol suite that better suits the current environment, integrating new functionalities and behaviors into the system, and so on.
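As a concrete illustration of the network/device/user context categories listed above, the following Java sketch shows one way a mobility-aware middleware layer might expose context changes to the application layer. All class, method and key names here are our own illustrative assumptions, not from any existing middleware product:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Minimal context-manager sketch: probes publish context values, applications subscribe.
public class ContextManager {
    public interface Listener {
        void onContextChanged(String category, String key, Object value);
    }

    private final Map<String, Object> context = new ConcurrentHashMap<>();
    private final List<Listener> listeners = new CopyOnWriteArrayList<>();

    public void addListener(Listener listener) { listeners.add(listener); }

    // Called by probes such as a bandwidth monitor ("network"), a battery sensor
    // ("device") or a location/preference source ("user").
    public void update(String category, String key, Object value) {
        context.put(category + "." + key, value);
        for (Listener l : listeners) {
            l.onContextChanged(category, key, value);   // context is exposed, not hidden
        }
    }

    public Object get(String category, String key) {
        return context.get(category + "." + key);
    }
}
```

An adaptive streaming component could, for example, register a listener on a hypothetical "network.bandwidthKbps" key and lower its video bitrate when the reported bandwidth drops, which is the kind of adaptability described in the last item above.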
STATE OF THE ART IN MIDDLEWARE FOR MOBILITY-AWARE CCM Traditional middleware for fixed distributed systems is too heavy to be used in mobile computing environments. In order to provide new solutions, research has progressed along two distinct directions in the last decade (Cotroneo et al, 2007): (1) extending traditional middleware implementations with primitive mobile-enabled capabilities (e.g. Wireless CORBA (Kangasharju, 2002)), and (2) proposing middleware that adopts mobile-enabled computing models (e.g. Lime (Murphy et al, 2001)). The former adopts a more effective computing model, but does not effectively overcome the intrinsic limitation of the synchronous remote procedure call; the latter adopts decoupled interaction mechanisms, but fails to provide a high-level and well-understood computing model abstraction (Migliaccio, 2006, Quasy, 2004). Following similar categories, we divide middleware for mobility-aware CCM into four categories: extended traditional middleware, reflective middleware, tuple space middleware, and context-aware middleware. Extending traditional middleware. To be able to operate within existing fixed networks, object-oriented middleware has been extended to mobile environments. Wireless CORBA is a CORBA specification for wireless access and terminal mobility (OMG, 2008). Its overall system architecture is divided into three separate domains: the home domain, the terminal domain, and the visited domain. In ALICE (Haahr et al, 1999), in order to support a client/server architecture in nomadic environments, mobile devices with the Windows CE operating system and GSM sensors were adopted. The main focus of the existing examples of extending traditional middleware is on the provision of services from a backbone network to the network edge, i.e. mobile devices. Therefore, the main concerns are how to deal with connectivity and how to exchange messages. However, in cases where the networks are unstructured and the services have to be provided by the mobile devices themselves, traditional middleware does not work well and new paradigms have to be put forward. This has motivated the birth of, for example, reflective middleware, tuple space middleware, and context-aware middleware. Reflective middleware. The primary motivation of reflective middleware is to increase adaptability to a changing environment. A reflective system consists of two levels referred to as the meta-level and the base level. The former performs computation on the objects residing in the lower level, while the latter performs computation on the application domain entities (Gaddah et al, 2008). Open-ORB (Blair et al, 2002) and Globe (Steen et al, 1999) are two examples of middleware which utilize the concept of reflection. Tuple space middleware. The characteristics of the wireless propagation environment make the synchronous communication mechanism, typical of most traditional distributed systems, unsuitable for mobile applications. One solution for this is the so-called tuple space. A tuple space is a globally shared, associatively addressed memory space that is organized as a bag of tuples (Ciancarini, 1996). Client processes can create tuples using a write operation and then retrieve them using a read operation. LIME (Murphy et al, 2001), TSpaces (Wyckoff et al, 1998), and JavaSpace (Bishop et al, 2002) are examples of tuple space based systems.
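A few lines of Java convey the decoupled interaction style that makes tuple spaces attractive for intermittently connected devices. This is a generic sketch in the spirit of Linda-style systems; it does not reproduce the actual API of LIME, TSpaces or JavaSpaces:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Predicate;

// Generic tuple-space sketch: producers and consumers are decoupled in time and space,
// since a tuple persists in the space until some process retrieves it.
public class TupleSpace {
    private final List<Object[]> tuples = new CopyOnWriteArrayList<>();

    // write: deposit a tuple; the producer may disconnect immediately afterwards.
    public void write(Object... tuple) {
        tuples.add(tuple);
    }

    // read: return (without removing) the first tuple matching a template, or null.
    // A real system would typically block or time out instead of returning null.
    public Object[] read(Predicate<Object[]> template) {
        for (Object[] t : tuples) {
            if (template.test(t)) return t;
        }
        return null;
    }

    // take: like read, but removes the matching tuple, giving hand-off semantics.
    public Object[] take(Predicate<Object[]> template) {
        for (Object[] t : tuples) {
            if (template.test(t) && tuples.remove(t)) return t;
        }
        return null;
    }
}
```

A mobile producer could, for instance, write("photo", uri) before losing connectivity, and a consumer that reconnects later can still take the matching tuple, which is exactly the asynchronous decoupling called for under connection management above.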
Context-aware middleware. Mobile systems are characterized by a dynamic execution context due to the mobility of the devices. The context information has to be exposed to the application layer to make it adaptable to the corresponding changes which happen at the lower levels. Context-aware computing was first proposed by (Schilit et al, 1994, Haahr et al, 1999). Since then, a great deal of research
Table 2. Overview of standards and specifications for multimedia representation

Specification | Key notes | Role in CCM
DCMI | Element, qualifier, application, profile | Document description
DICOM | Image specific | Image specific
SMDL | Music data, SGML-based | Music data
MULL | Course, XML-based | Multimedia course preparation
MRML | Multimedia, XML-based | Multimedia
EDL | Sessions, XML-based | Session description
SMIL | Audiovisual, XML-based | AV data
SMEF | Media, SMEF-DM | Media description
P/Meta | Metadata, XML-based, P/Meta scheme | Metadata framework
SMPTE | Metadata, XML-based | Metadata framework
MXF | Audio-visual, SMPTE-based | AV data description
SVG | Describing 2D graphics, XML-based | Describing 2D graphics
TV-Anytime | Audio-visual, descriptor, preferences | AV data description
MPEG-7 | Multimedia content data, interactive, integrated audio-visual | Multimedia description
MPEG-21 | Common framework, multimedia delivery chain, digital item | Common multimedia description framework
interest has been directed at this field, but most of it focuses on location awareness; e.g. Nexus (Fritsch et al, 2000) was designed to provide various kinds of location-aware applications. Some other approaches have investigated the feasibility of utilizing reflection in the context of mobile systems to offer dynamic context-awareness and adaptation mechanisms (Roman et al, 2001).
MIDDLEWARE FOR MULTIMEDIA-DRIVEN CCM In order to support multimedia content transmission over various networks, the issues of semantic multimedia representation, multimedia storage capacity, and delivery time delay must be taken into consideration. Middleware for multimedia-driven CCM aims to abstract the knowledge about multimedia representation and communication, and comprises specifications and standards for multimedia representation, compression, and communication.
STANDARDS FOR MULTIMEDIA REPRESENTATION AND COMPRESSION Table 2 presents an overview of standards and specifications for multimedia representation. A brief description of these specifications follows; details are given in the corresponding references. Dublin Core Metadata Initiative (DCMI). In the Dublin Core (DC) (Stuart et al, 2008), the description of information resources is created using Dublin Core elements, and may be refined or further explained by a qualifier. Qualification schemes are used to ensure a minimum level of metadata interoperability. No formal syntax rules are defined. DCMI evolution involves extending the element
Table 3. Some compression standards for multimedia

Specification | Key notes | Role in CCM
JPEG | Image, discrete cosine transform-based, codec specification, ISO standard | Compression for single images
JPEG-2000 | Image, wavelet-based, greater decompression time than JPEG | Compression for single images
MPEG-1 | Lossy video and audio compression, MP3, ISO standard | Compression for video and audio
MPEG-2 | Lossy video and audio compression, popular DTV format, ISO standard | Compression for video and audio
MPEG-4 | AV compression for web, CD distribution, voice and TV applications | Compression for video and audio
set, the description of images, standardization, special interest areas, and the metadata scheme. The Digital Imaging and Communications in Medicine (DICOM) standard (ACR, 2008) is used for the exchange of images and related information. The DICOM standard provides several kinds of support, including support for image exchange between senders and receivers, support for retrieving image information, and image management. The Standard Music Description Language (SMDL) (ISO/IEC, 2008) defines an architecture for the representation of music information, either alone or in conjunction with text, graphics, or other information needed for publishing or business purposes. The MUltimedia Lecture description Language (MULL) (Polak et al, 2001) makes it possible to modify and control a remote presentation. The Multimedia Retrieval Markup Language (MRML) (MRML, 2008) aims to unify access to multimedia retrieval and management software components in order to extend their capabilities. The Event Description Language (EDL) (Rodriguez, 2002) describes advanced multimedia sessions to support multimedia service management, provision and operation. The Synchronized Multimedia Integration Language (SMIL) (SMIL, 2005) enables simple authoring of interactive audiovisual presentations, which integrate streaming audio and video with images, text, or any other media type. The Standard Media Exchange Framework (SMEF) (BBC, 2005) is defined by the BBC to support and enable media asset production, management, and delivery. P/Meta (Hopper, 2002) is developed for content exchange by providing the P/Meta Scheme, which consists of common attributes and transaction sets for P/Meta members such as content creators and distributors. The Metadata Dictionary & Sets Registry (SMPTE) (SMPTE, 2004) creates the Metadata Dictionary (MDD) and a sets registry. The MDD dynamic document encompasses all the data elements considered relevant by the industry. The sets registry describes the business purpose and the structure of the sets. The Material eXchange Format (MXF) (Pro-MPEG, 2005) aims to support the interchange of audio-visual material with associated data and metadata. Scalable Vector Graphics (SVG) (Watt et al, 2003) describes 2D graphics and graphical applications in XML. It contains two parts: (1) an XML-based file format and (2) a programming API for graphical applications. TV-Anytime metadata (TV-Anytime, 2005) consists of the attractors/descriptors used, e.g. in Electronic Program Guides (EPG) or in Web pages, to describe content. The Multimedia Content Description Interface (MPEG-7) (ISO/IEC, 2003) describes multimedia content data in a way that supports some degree of interpretation of the information's meaning, and that can be passed on to, or accessed by, a device or computer code. The MPEG-21 multimedia framework (Burnett, 2006) identifies and defines the key elements needed to support the multimedia delivery chain. Table 3 presents widely used compression techniques that are in part competitive and in part complementary. Details about these standards and specifications are given below.
Table 4. Protocols and specifications for multimedia communication

Middleware | Specification
Remote Procedure Call (RPC) | Procedure-oriented call, synchronous interaction model
Remote Method Invocation (RMI) | Object-oriented RPC, object-oriented references
Message Oriented Middleware | Message oriented communication, asynchronous interaction model
Stream-oriented communication | Continuous asynchronous, synchronous, isochronous, and QoS-specified multimedia transmission
The ISO JPEG standard (William et al, 1992) defines how an image is compressed into a stream of bytes using the discrete cosine transform and decompressed back into an image. It also defines the file format used to contain that stream. JPEG 2000 (Taubman et al, 2001) is an image compression standard that advances beyond JPEG. It is based on wavelet compression, which requires a longer decompression time than JPEG but allows more sophisticated progressive downloads. MPEG-1 (Harte et al, 2006) is a standard for lossy compression of video and audio. It has been used, for example, as the standard for video CDs, although later video disc formats adopted newer codecs. It also contains the well-known MP3 audio compression. MPEG-2 (Harte et al, 2006) describes several lossy video and audio compression methods for various purposes. It is widely used in the terrestrial, cable, and satellite digital television formats. MPEG-4 (Harte et al, 2006) defines newer compression techniques for audio and video data. H.264/AVC (Richardson et al, 2003), also known as MPEG-4 Part 10, is a video compression standard widely utilized in modern mobile TV standards and specifications. Its audio counterpart is AAC, defined in MPEG-4 Part 3.
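From an application's point of view, applying one of these codecs is usually a matter of calling into a codec library. As a minimal illustration, Java's standard ImageIO API can re-encode a losslessly stored frame as DCT-based JPEG; the file names below are placeholders:

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

// Minimal JPEG compression example using the JDK's built-in image codecs.
public class JpegCompressExample {
    public static void main(String[] args) throws IOException {
        BufferedImage frame = ImageIO.read(new File("frame.png"));   // lossless input
        ImageIO.write(frame, "jpg", new File("frame.jpg"));          // lossy JPEG output
    }
}
```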
MIDDLEWARE FOR MULTIMEDIA COMMUNICATION CCM multimedia applications reflect the trend of local multimedia systems expanding towards distributed solutions. Applications such as multimedia creation, aggregation, consumption, and others require high-speed networks with a high transfer rate. Multimedia communication places several requirements on services and protocols; e.g. the processing of AV data needs to be bounded by deadlines or by a time interval. Multimedia communication standards and protocols can be categorized into Remote Procedure Call (RPC) based (Nelson, 1981, Tanenbaum et al, 2008), Message Oriented Middleware (MOM) based (Tanenbaum et al, 2008, Quasy, 2004), Remote Method Invocation (RMI) based (Tanenbaum et al, 2008), and stream based (Tanenbaum et al, 2008, Halsall, 2000). They define the middleware alternatives for multimedia communication (Table 4). Details about the protocols and specifications are given below with relevant references. Remote procedure call (RPC) (Nelson, 1981, Tanenbaum et al, 2008) allows a software program to invoke a subroutine or procedure that executes on another computer. With remote procedure calls, the programmer writes the subroutine call in the same way whether it is local or remote. Remote Method Invocation (RMI) (Nelson, 1981, Tanenbaum et al, 2008) is another RPC paradigm, based on distributed objects. In the case of Java RMI, the programmer can create applications consisting of Java objects hosted on different computers. Message-oriented middleware (MOM) (Tanenbaum et al, 2008, Quasy, 2004) typically supports asynchronous calls between the client and server. With its message queue mechanism, MOM reduces the involvement of application developers. For example, applications send a subject-based
Figure 5. The relationship between Web service technologies
message to logical contact points, or indicate their interest in a specific type of message. As seen in the CCM scenario, CCM multimedia communication involves multiple media types such as audio and video. It therefore becomes necessary for CCM to use stream-oriented middleware for streaming multimedia, whose purpose is to support continuous asynchronous, synchronous, isochronous, and QoS-specified media transmission. Examples of stream-oriented middleware (Tanenbaum et al, 2008) are MPEG-TS (Harte, 2006), the Resource ReSerVation Protocol (RSVP) (Liu et al, 2006), and the Real-time Transport Protocol (RTP) (Perkins, 2003). MPEG-TS (Harte, 2006) is designed to allow multiplexing of digital video and audio and to synchronize the output. RSVP (Liu et al, 2006) is a transport-layer protocol designed to reserve resources across a network for an integrated-services Internet. RTP (Perkins, 2003) defines a standardized packet format for delivering audio and video over the Internet.
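The RMI style mentioned earlier in this section is straightforward to demonstrate with the JDK's java.rmi package. The following sketch is our own illustrative example (the MediaCatalog service and the host name are hypothetical), showing how a remote object is exported, registered, and then invoked synchronously by a client:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Remote interface: every method can fail with a RemoteException.
interface MediaCatalog extends Remote {
    String lookupTitle(int mediaId) throws RemoteException;
}

class MediaCatalogImpl implements MediaCatalog {
    public String lookupTitle(int mediaId) { return "clip-" + mediaId; }
}

class CatalogServer {
    public static void main(String[] args) throws Exception {
        MediaCatalog stub = (MediaCatalog) UnicastRemoteObject.exportObject(new MediaCatalogImpl(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("catalog", stub);                 // publish the remote object
    }
}

class CatalogClient {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.getRegistry("server-host", 1099);
        MediaCatalog catalog = (MediaCatalog) registry.lookup("catalog");
        System.out.println(catalog.lookupTitle(42));      // synchronous remote invocation
    }
}
```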
MIDDLEWARE FOR SERVICE-ORIENTED CCM This section discusses middleware for service-oriented CCM, which consists of the standards and specifications that govern the conversion of conventional multimedia applications into a service-oriented computing environment. These standards and specifications are based on several notable Web service technologies, i.e. XML (Ray, 2003), WSDL (WSDL, 2005, Erl, 2005), UDDI (UDDI, 2004), SOAP (SOAP, 2003), and BPEL (Thatte, 2003). See Figure 5. The eXtensible Markup Language (XML) (Ray, 2003), which like HTML derives from SGML, is used to represent information objects consisting of elements (e.g. tags and attributes). XML defines the syntax for markup languages. XML Schema allows the definition of languages in a machine-readable format. The Web Services Description Language (WSDL) (WSDL, 2005) is an XML-based language for describing Web services in a machine-understandable form. WSDL describes and exposes a web service using the major elements portType, message, types, and binding. The portType element describes the operations performed by a web service. The message element defines the data elements of an operation. The types element
defines the data types used by the web service. The binding element defines the message format and protocol details for each port. Universal Description Discovery and Integration (UDDI) (UDDI, 2004) is regarded as a specification of the service, service definition, and metadata 'hub' for service-oriented architecture. Various structural templates are provided by UDDI for representing data about business entities, their services, and the mechanisms for governing them. The UDDI upper service model consists of a BusinessEntity (who), a BusinessService (what), a BindingTemplate (how and where) and a tModel (service interoperability). The XML Schema Language is used in UDDI to formalize its data structures. The Simple Object Access Protocol (SOAP) (SOAP, 2003) is regarded as a protocol for exchanging XML-based messages over computer networks. One of the most common SOAP messaging patterns is the Remote Procedure Call (RPC) pattern, in which the client communicates with the server by sending a request/response message. The Business Process Execution Language for Web Services (BPEL, also WS-BPEL or BPEL4WS) (Thatte, 2003) provides a flexible way to define business processes composed of services. BPEL supports executable processes and abstract processes. Executable processes allow specifying the exact details of business processes and can be executed by an orchestration engine. The process descriptions for business protocols are called abstract processes, which allow specifying the public message exchanges between parties. With BPEL, complex business processes can be defined in an algorithmic manner (Thatte, 2003).
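To connect these specifications to code, the sketch below uses JAX-WS (bundled with Java SE 6 through 10, and available as a separate library for later releases) to expose a plain Java class as a SOAP web service; the runtime generates the WSDL with its portType, message, types and binding elements. The service name, operation and URL are illustrative assumptions only:

```java
import javax.jws.WebService;
import javax.xml.ws.Endpoint;

// Hypothetical multimedia-intensive service exposed over SOAP via JAX-WS.
@WebService
public class TranscodeService {

    // Becomes a WSDL operation; its parameters and return value become message parts.
    public String transcode(String contentUri, String targetFormat) {
        return contentUri + "." + targetFormat;   // placeholder for real transcoding logic
    }

    public static void main(String[] args) {
        // Publishing the endpoint typically makes the generated WSDL available at
        // http://localhost:8080/transcode?wsdl for clients and UDDI-style registries.
        Endpoint.publish("http://localhost:8080/transcode", new TranscodeService());
    }
}
```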
MIDDLEWARE FOR COMMUNITY-COORDINATED CCM In the CCM scenario, the end users' experience is enriched and extended by community-coordinated multimedia. In the case of a community-coordinated TV channel, TV viewers can watch video clips which they themselves have uploaded, add comments and even vote on them. The user preference profiles are maintained in the community coordinator. The moderator moderates the incoming videos and compiles a playlist for the TV program. The moderator also filters the incoming comments and chooses which ones can be shown on the program. This section discusses the standards and principles that govern the participation of peers in the community (peer management); messaging models, e.g. point-to-point, publish/subscribe, multicast and broadcast; as well as profile management, new P2PSIP features, coordination, etc.
CLASSIFICATION OF COMMUNITIES From a technical point of view, user communities can be classified into private and public communities, as done in the JXTA project (Oaks et al, 2002). When the purpose of communities is taken into account, these two fundamental classes can be further divided at least into social, commercial, and professional communities (Koskela et al, 2008), which receive some attention in attribute-based systems. In practice, there will probably be situations where the members of a public community do not want to reveal their membership to nodes outside of their sub-community. These kinds of communities, where a part of the members do not publish their membership to the main overlay, are called partially private communities. However, to maintain the community, at least one peer member must publish its membership to the main overlay (Koskela, 2008).
REQUIREMENTS FOR COMMUNITY MIDDLEWARE The requirements of middleware for community support are initially specified as messaging management and peer management. Messaging management. Messaging management is crucial for community middleware, as it provides the basic communication methods based on the messaging models. Peer management. The peer management functionality manages the formation of the peer group, the scale of the community, and the joining and leaving of peers.
SURVEY ON MIDDLEWARE TECHNOLOGY FOR COMMUNITY-COORDINATED CCM Messaging models. A solid understanding of the available messaging models is crucial to understanding the unique capabilities each one provides. Four main messaging models are commonly available: unicast, broadcast, multicast, and anycast. The unicast model, also known as the point-to-point messaging model, provides a straightforward exchange of messages between software entities. Broadcast is a very powerful mechanism used to disseminate information between anonymous message consumers and producers. It provides a one-to-many distribution mechanism where the number of receivers is not limited. The multicast model (Pairot et al, 2005) is a variation of broadcast. It works by sending a multicast message to a specific group of members. The main difference between broadcast and multicast is that multicast sends messages only to the members of a subscribed group, while broadcast sends messages to everyone without any membership limitation. The broadcast model can also be implemented as a publish/subscribe messaging model so that it resembles multicast. The anycast model means sending an anycast notification to a group, which causes the group member closest to the sender in the network to answer, as long as it satisfies a given condition (Pairot et al, 2005). This feature is very useful for retrieving object replicas from the service network. Peer management protocols. For the purpose of interoperability and other peer management functionality, the Internet Engineering Task Force (IETF) recently founded a Peer-to-Peer Session Initiation Protocol (P2PSIP) working group. The P2PSIP working group is chartered to develop protocols and mechanisms for the use of the Session Initiation Protocol (SIP) in settings where the service of establishing and managing sessions is principally handled by a collection of intelligent endpoints, rather than by centralized servers as in SIP as currently deployed (P2PSIP, 2008). There are two different kinds of nodes in P2PSIP networks: 'P2PSIP peers' and 'P2PSIP clients'. P2PSIP peers participate in the P2PSIP overlay network, provide routing information to other peers, etc. P2PSIP clients do not participate in the P2PSIP overlay network, but instead utilize the services provided by the peers to locate users and resources. In this way, P2PSIP can determine the correct destination of SIP requests through this distributed mechanism. The other functionalities, e.g. session management, messaging, and presence, are performed using conventional SIP. The work of the P2PSIP working group is still in progress, but it has put forward some peer protocols, such as RELOAD (Jennings et al, 2008) and SEP (Jiang et al, 2008), for the management of peers, and two client protocols (Pascual et al, 2008, Song, 2008) to manage the clients. Furthermore, JXTA also supports the community concept, which it calls a 'group', and provides a dedicated Membership Service to manage group-related issues.
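The four messaging models can be summarized as a single community-messaging abstraction. The Java interface below is purely illustrative (the names are ours, not taken from JXTA or any P2PSIP draft), but it shows the distinct delivery semantics a community-coordinated middleware layer would have to implement:

```java
// Illustrative abstraction of the four messaging models discussed above.
public interface CommunityMessaging {

    // Unicast (point-to-point): exactly one named peer receives the message.
    void unicast(String peerId, byte[] message);

    // Broadcast: every reachable peer receives the message, with no membership filter.
    void broadcast(byte[] message);

    // Multicast: only peers that have subscribed to the group receive the message.
    void multicast(String groupId, byte[] message);

    // Anycast: the nearest group member satisfying the request answers, e.g. to
    // fetch an object replica from the service network.
    void anycast(String groupId, byte[] message);
}
```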
FUTURE TRENDS The trend of CCM is towards delivering multimedia services with customized quality over heterogeneous networks, which enables multimedia services to be adapted to any IP-based mobile and P2P content delivery network. Future work on middleware for CCM is identified as follows:
• Context-aware middleware. Context-aware middleware provides mobile applications with the necessary knowledge about the execution context in order to let them adapt to dynamic changes in mobile conditions. However, most current systems focus only on location awareness, so there is as yet no middleware which can fully support all the requirements of mobile applications; further research is still needed.
• QoS-aware middleware. The strong motivation for QoS-aware middleware comes from stringent QoS requirements such as predictability, latency, efficiency, scalability, dependability, and security. The goal is to help accelerate the software process by making it easier to integrate parts together and by shielding developers from many inherent and accidental complexities, such as platform and language heterogeneity, resource management, and fault tolerance (Quasy, 2004). The extensions of the Web service specifications, i.e. the WS-* specifications (Erl, 2005), provide a means to assert control over QoS management.
• Middleware for multimedia service delivery over 4G networks. The motivation for 4G network operators comes from providing multimedia services for mobile devices. Incorporating IMS (Camarillo, 2006) into mobile multimedia services is part of the vision for evolving mobile networks beyond GSM.
• Middleware for multimedia service delivery over P2P SIP. P2P technologies have been widely used on the Internet for file sharing and other applications including VoIP, instant messaging, and presence. This research continues the study of community middleware and extends the capability of delivering multimedia services to mobile devices over P2P networks, in particular by employing SIP session management.
CONCLUSION Community Coordinated Multimedia presents a novel usage paradigm for consuming multimedia by requesting multimedia-intensive Web services via diverse terminal devices, converged networks, and heterogeneous platforms within a virtual, open and collaborative community. In order to realize this paradigm, this chapter focused on addressing the key enabling technology of middleware for CCM. It started with the definition of concepts relevant to CCM and the specification of a middleware ontology in the context of CCM. Then a generic CCM scenario was described and the requirements for CCM middleware were analyzed with respect to the characteristics of mobility-aware, multimedia-driven, service-oriented, and community-coordinated CCM. A middleware architecture for CCM was introduced to address the requirements from the four viewpoints, and each part of the middleware architecture was surveyed. Finally, the future trends in the evolution of CCM middleware were discussed.
ACKNOWLEDGMENT This work is being carried out in the EUREKA ITEA2 CAM4Home project funded by the Finnish Funding Agency for Technology and Innovation (Tekes).
REFERENCES p2psip working group. (2008). Peer-to-Peer Session Initiation Protocol Specification. Retrieved June 15th, 2008, from http://www.ietf.org/html.charters/p2psip-charter.html ACR-NEMA. (2005). DICOM (Digital Image and Communications in Medicine). Retrieved June 15th, 2008, from http://medical.nema.org/ BBC. (2005). SMEF- Standard Media Exchange Framework. Retrieved June 15th, 2008, from http:// www.bbc.co.uk/guidelines/smef/.15th June, 2008. Bellavista, P., & Corradi, A. (2007). The Handbook of Mobile Middleware. New York: Auerbach publications. Bender, T. (1982). Community and Social Change in America. Baltimore, MD: The Johns Hopkins University Press. Bishop, P., & Warren, N. (2002). JavaSpaces in Practice. New York: Addison Wesley. Blair, G. S., Coulson, G., & Blair, L. DuranLimon, H., Grace, P., Moreira, R., & Parlavantzas, N. (2002). Reflection, self-awareness and self-healing in OpenORB. In WOSS ‘02 Proceedings of the First Workshop on Self-Healing Systems, (pp. 9-14). Burnett, I. (2006). MPEG-21: Digital Item Adaptation - Coding Format Independence, Chichester, UK. Retrieved 15th June, 2008, from http://www.ipsi.fraunhofer.de/delite/projects/mpeg7/Documents/ mpeg21-Overview4318.htm#_Toc523031446. Ciancarini, P. (1996). Coordination Models and Languages as Software Integrators. SCM Comput. Surv., 28(2), 300–302. doi:10.1145/234528.234732 Cotroneo, D., Migliaccio, A., & Russo, S. (2007). The Esperanto Broker: a communication platform for nomadic computing systems. Software, Practice & Experience, 37(10), 1017–1046. doi:10.1002/ spe.794 Dixit, S., & Wu, T. (2004). Content Networking in the Mobile Internet. New York: John Wiley & Sons. ERCIM. (2005). Multimedia Informatics. ERCIM News, 62. Erl, T. (2005). Service-Oriented Architecture (SOA): Concepts, Technology, and Design. Upper Saddle River, NJ: Prentice Hall.
Fritsch, D., Klinec, D., & Volz, S. (2000). NEXUS positioning and data management concepts for location aware applications. In the 2nd International Symposium on Telegeoprocessing (Nice-Sophia-Antipolis, France), (pp. 171-184). Gaddah, A., & Kunz, T. (2003). A survey of middleware paradigms for mobile computing. Carleton University and Computing Engineering [Research Report]. Retrieved June 15th, 2008, from http://www. sce.carleton.ca/wmc/middleware/middleware.pdf Gonzalo, C., & García-Martín, M.-A. (2006). The 3G IP Multimedia Subsystem (IMS): Merging the Internet and the Cellular Worlds. New York: Wiley. Haahr, M., Cunningham, R., & Cahill, V. (1999). Supporting CORBA applications in a mobile environment. In MobiCom ‘99: Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking, (pp. 36-47). Halsall, F. (2000). Multimedia Communications: Applications, Networks, Protocols and Standards (Hardcover). New York: Addison Wesley. Harte, L., Wiblitzhouser, A., & Pazderka, T. (2006). Introduction to MPEG; MPEG-1, MPEG-2 and MPEG-4. Fuquay Varina, NC: Althos Publishing. Hopper, R. (2002). P/Meta - metadata exchange scheme. Retrieved June 15th, 2008, from http://www. ebu.ch/trev_290-hopper.pdf ISO/IEC. (1995). SMDL (Standard Music Description Language) Overview. Retrieved June 15th, 2008, from http://xml.coverpages.org/gen-apps.html#smdl ISO/IEC. (2003). MPEG-7 Overview. Retrieved June 15th, 2008, from http://www.chiariglione.org/ mpeg/standards/mpeg-7/mpeg-7.htm. Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., & Schulzrinne, H. (2008). REsource LOcation And Discovery (RELOAD). Retrieved June 15th, 2008, from http://tools.ietf.org/id/draft-bryan-p2psipreload-04.txt. Jiang, X.-F., Zheng, H.-W., Macian, C., & Pascual, V. (2008). Service Extensible P2P Peer Protocol. Retrieved June 15th, 2008, from http://tools.ietf.org/id/draft-jiang-p2psip-sep-01.txt Kangasharju, J. (2002). Implementing the Wireless CORBA Specification. PhD Disertation, Computer Science Department, University of Helsinki, Helsinki, Finland. Retrieved June 15th, 2008, from http:// www.cs.helsinki.fi/u/jkangash/laudatur-jjk.pdf Koskela, T., Kassinen, O., Korhonen, J., Ou, Z., & Ylianttila, M. (2008). Peer-to-Peer Community Management using Structured Overlay Networks. In the Proc. of International Conference on Mobile Technology, Applications and Systems, September 10-12, Yilan, Taiwan. Krafzig, D., Banke, K., & Slama, D. (2005). Enterprise SOA: Service-Oriented Architecture Best Practices. Upper Saddle River, NJ: Prentice Hall. Liu, C., Qian, D., Liu, Y., Li, Y., & Wang, C. (2006). RSVP Context Extraction in IP Mobility Environments. Vehicular Technology Conference, 2006, VTC 2006-Spring, IEEE 63rd, (Vol. 2, pp. 756-760).
Matjaz, B. J. (2008). BPEL and Java. Retrieved June 15th, 2008, from http://www.theserverside.com/ tt/articles/article.tss?l=BPELJava Migliaccio, A. (2006). The Design and Development of a Nomadic Computing Middleware: the Esperanto Broker. PhD Dissertation, Department of Computer and System Engineering, Federico II, University of Naples, Naples, Italy. MRML. (2003). MRML- Multimedia Retrieval Markup Language. Retrieved June 15th, 2008, from http://www.mrml.net/ Murphy, A. L., Picco, G. P., & Roman, G. (2001). LIME: a middleware for physical and logical mobility. 21st International Conference on Distributed Computing Systems, (pp. 524-533). Nelson, B. J. (1981). Remote Procedure Call. Palo Alto, CA: Xerox - Palo Alto Research Center. Oaks, S., Traversat, B., & Gong, L. (2002). JXTA in a Nutshell. Sebastopol, CA: O’Reilly Media, Inc. OMG. (2002). Wireless Access and Terminal Mobility in CORBA Specification. Retrieved June 15th, 2008, from http://www.info.fundp.ac.be/~ven/CIS/OMG/new%20documents%20from%20OMG%20 on%20CORBA/corba%20wireless.pdf Pairot, C., Garcia, P., Rallo, R., Blat, J., & Gomez Skarmeta, A. F. (2005). The Planet Project: collaborative educational content repositories on structured peer-to-peer grids. CCGrid 2005, IEEE International Symposium on Cluster Computing and the Grid, (Vol. 1, pp. 35-42). Pascual, V., Matuszewski, M., Shim, E., Zheng, H., & Song, Y. (2008). P2PSIP Clients. Retrieved June 15th, 2008, from http://tools.ietf.org/id/draft-pascual-p2psip-clients-01.txt Pennebaker, W. B., & Mitchell, J. L. (1992). JPEG: Still Image Data Compression Standard (Digital Multimedia Standards). Berlin: Springer. Perkins, C. (2003). RTP: Audio and Video for the Internet. New York: Addison-Wesley. Polak, S., Slota, R., Kitowski, J., & Otfinowski, J. (2001). XML-based Tools for Multimedia Course Preparation. Archiwum Informatyki Teoretycznej i Stosowanej, 13, 3–21. Pro-MPEG. (2005). Material eXchange Format (MXF). Retrieved 15th June, 2008, from http://www. pro-mpeg.org. Quasy, H. M. (2004). Middleware for Communications. Chichester, UK: John Wiley Sons ltd. Ray, E. (2003). Learning XML. Sebastopol, CA: O’Reilly Media, Inc. Richardson, I., & Richardson, I. E. G. (2003). H.264 and MPEG-4 Video Compression: Video Coding for Next Generation Multimedia. Chichester, UK: Wiley. Rodriguez, B. (2002). EDLXML serialization. Retrieved 15th June, 2008, from download.sybase.com/ pdfdocs/prg0390e/prsver39edl.pdf Roman, M., Kon, F., & Campbell, R. (2001). Reflective Middleware: From your Desk to your Hand. IEEE Communications Surveys, 2(5).
Schilit, B., Adams, N., & Want, R. (1994). Context-aware computing applications. In Proceedings of Mobile Computing Systems and Applications, (pp. 85-90). SMIL/ W3C. (2005). SMIL- Synchronized Multimedia Integration Language. Retrieved June 15th, 2008 from http://www.w3.org/AudioVideo/ SMPTE. (2004). Metadata dictionary registry of metadata element descriptions. Retrieved June 15th, 2008, from http://www.smpte-ra.org/mdd/rp210-8.pdf SOAP/W3C. (2003). SOAP Version 1.2 Part 1: Messaging Framework. Retrieved June 15th, 2008, from Http://www.w3.org/TR/2003/REC-soap12-part1-20030624/ Song, Y., Jiang, X., Zheng, H., & Deng, H. (2008). P2PSIP Client Protocol. Retrieved June 15th, 2008, from http://tools.ietf.org/id/draft-jiang-p2psip-sep-01.txt. Steen, van M., Homburg, P., & Tanenbaum, A. S. (1999). Globe: a wide area distributed system. Concurrency, IEEE [See also IEEE Parallel & Distributed Technology], 7, 70-78. Stuart, W., & Koch, T. (2000). The Dublin Core Metadata Initiative: Mission, Current Activities, and Future Directions, (Vol. 6). Retrieved June 15th, 2008, from http:/www/dlib.org/dlib/december00/ weibel/12weibel.html Tanenbaum, A. S., & Steen, M. V. (2008). Distributed Systems: Principles and Paradigms. Upper Saddle River, NJ: Prentice Hall. Taubman, D., & Marcellin, M. (2001). JPEG2000: Image Compression Fundamentals, Standards and Practice. Berlin: Springer. Thatte, S. (2003). BPEL4WS, business process execution language for web services. Retrieved June 15th, 2008, from http://xml.coverpages.org/ni2003-04-16-a.html TV-Anytime. (2005). TV-Anytime. Retrieved June 15th, 2008, from http://www.tv-anytime.org UDDI. (2004). UDDI Version 3.0.2. Retrieved June 15th, 2008, from http://www.Oasis-Open.org/committees/uddi-spec/doc/spec/v3/uddi-v3.0.2-20041019.Htm Vakali, A., & Pallis, G. (2003). Content Delivery Networks: Status and Trends. IEEE Internet Computing, (November 6): 68–74. doi:10.1109/MIC.2003.1250586 Watt, A. Lilley Chris, & J., Daniel. (2003). SVG Unleashed. Indianapolis, IN: SAMS. WSDL/W3C. (2005). WSDL: Web Services Description Language (WSDL) 1.1. Retrieved June 15th, 2008, from http://www.w3.org/TR/wsdl. Wyckoff, P., McLaughry, S. W., Lehman, T. J., & Ford, D. A. (1998). T Spaces. IBM Systems Journal, 37, 454–474. Zhou, J., Ou, Z., Rautiainen, M., & Ylianttila, M. (2008b). P2P SCCM: Service-oriented Community Coordinated Multimedia over P2P. In Proceedings of 2008 IEEE International Conference on Web Services, Beijing, China, September 23-26, (pp. 34-40).
Zhou, J., Rautiainen, M., & Ylianttila, M. (2008a). Community coordinated multimedia: Converging content-driven and service-driven models. In proceedings of 2008 IEEE International Conference on Multimedia & Expo, June 23-26, 2008, Hannover, Germany.
KEY TERMS AND DEFINITIONS
Community: Generally defined as a group of a limited number of people held together by common interests and understandings, a sense of obligation and possibly trust.
Community Coordinated Multimedia (CCM): A CCM system maintains a virtual community for the consumption of CCM multimedia elements, i.e. both content generated by end users and content from professional multimedia providers (e.g., Video on Demand). The consumption involves a series of interrelated multimedia-intensive processes such as content creation, aggregation, annotation, etc. In the context of CCM, these interrelated multimedia-intensive processes are encapsulated into Web services rather than multimedia applications, namely multimedia-intensive services, briefly multimedia services.
Middleware: The key technology which integrates two or more distributed software units and allows them to exchange data via heterogeneous computing and communication devices. In this chapter, middleware is perceived as an additional software layer in the OSI model encapsulating knowledge from the presentation and session layers, consisting of standards, specifications, forms, and protocols for multimedia, service, mobility and community computing and communication.
Multimedia: A synchronized presentation of bundled media types, such as text, graphics, images, audio, video, and animation.
Standard: An accepted industry standard.
Protocol: A set of governing rules in communication between computing endpoints.
Specification: A document that proposes a standard.
ENDNOTE 1
http://www.cam4home-itea.org/
Section 8
Mobile Computing and Ad Hoc Networks
Chapter 30
Scalability of Mobile Ad Hoc Networks Dan Grigoras University College Cork, Ireland Daniel C. Doolan Robert Gordon University, UK Sabin Tabirca University College Cork, Ireland
ABSTRACT This chapter addresses scalability aspects of mobile ad hoc network management and of clusters built on top of such networks. Mobile ad hoc networks are created by mobile devices, without the help of any infrastructure, for the purpose of communication and service sharing. As a key supporting service, the management of mobile ad hoc networks is identified as an important aspect of their exploitation. Obviously, management must be simple, effective, reliable, scalable, and consume as few resources as possible. The first section of this chapter discusses different incarnations of the management service of mobile ad hoc networks, considering the above-mentioned characteristics. Cluster computing is an interesting computing paradigm that, by aggregating network hosts, provides more resources than are available on any one of them. Clustering mobile and heterogeneous devices is not an easy task, as is shown in the second part of the chapter. Both sections include innovative solutions, proposed by the authors, for the management and clustering of mobile ad hoc networks.
INTRODUCTION In this chapter, we discuss the concept of scalability applied to Mobile Ad hoc NETworks (MANET). MANETs are temporarily formed networks of mobile devices without the support of any infrastructure. One of the most important characteristics of MANETs is the unpredictable evolution of their configuration. The number of member nodes within a MANET can vary immensely over a short time interval, from tens to thousands and vice-versa. Therefore, the scalability of network formation and management,
mobile middleware and applications is a key factor in evaluating the overall MANET effectiveness. The large diversity and high penetration of mobile wireless devices make their networking a very important aspect of their use. By self-organizing in mobile ad hoc networks, heterogeneous devices can communicate, share their resources and services and run new and more complex distributed applications. Mobile applications such as multiplayer games, personal health monitoring, emergency and rescue, vehicular nets and control of home/office networks illustrate the potential of mobile ad hoc networks. However, the complexity of these networks brings new challenges regarding the management of heterogeneity, mobility, communication and scarcity of resources, all of which have an impact on scalability. The scalability property of complex distributed systems does not have a general definition and evaluation strategy. Within the realm of MANET, scalability can refer to several aspects, from performance at the application layer to the way scarce resources are consumed. For example, a mobile system is not scalable if battery energy is exhausted by demanding management operations. A mobile middleware service is not scalable if it does not meet the mobile clients’ requirements with similar performance, irrespective of their number or mobility patterns. All current MANET deployments or experiments involve a small number of devices, at most of the order of a few tens, but, in the future, hundreds, thousands or even more devices will congregate and run the same application(s). Therefore it is essential to consider the strategies by which scalability will be provided to the network and application layers such that any number of devices and clients will be accommodated with the same performance. When used, mobile middleware systems will also be required to be scalable. In the following, the most important aspects of scalability with regard to mobile ad hoc networks will be reviewed, considering how a MANET can be managed cost-effectively and how an important application of large distributed systems, clustering, can be implemented in a scalable manner on MANET. This chapter is organized as follows. The first section discusses the management service of mobile ad hoc networks and innovative means for making it a scalable service. The rapid change of MANET membership impacts on node address management. Additionally, frequent operations such as split and merge require address management as well. Therefore, MANET management is mostly the management of node addresses. As potentially large networks, MANETs can be used as the infrastructure that supports mobile cluster computing. Consequently, the second section is dedicated to cluster computing on MANET and its related scalability issues.
MANAGEMENT OF MANET The Set of Management Operations for MANET An ad hoc network is a dynamic system whose topology and number of member nodes can change at any time. The MANET scenario assumes that there will always be one node which sets up the network, followed by other nodes which join the network by acquiring unique addresses. This is, for example, the strategy of Bluetooth, where the initial node, also known as the master, creates the network identity and allocates individual addresses to up to seven new members of the network as they join it.
During its lifetime, any MANET can be characterized by the following set of management operations:
• Network setup, usually executed by one (initial) node that also creates the MANET identity;
• Join/leave, when a node joins or leaves an existing MANET;
• Merge of two or more MANETs, where the result is one larger MANET;
• Split of a MANET into two or more sub-MANETs;
• Termination, when nodes leave the network and the MANET ceases to exist.
As, due to nodes’ mobility, these operations can be quite frequent, MANET management becomes complex and resource-consuming, especially in terms of node addresses, battery energy and bandwidth. Therefore it is important to study the strategies proposed for MANET organization with respect to the way these critical resources are managed. Although one of the main goals of mobile platforms is to minimize energy consumption, for example by introducing more sleep states for the CPU and the idle thread in the operating system, less attention is paid to the cost of MANET management operations in terms of resources. For example, if many messages are used for a basic join operation, there will be high energy and bandwidth costs for each joining node and for the entire system. A good strategy would use a minimum number of short messages and be scalable, while a poor strategy fails to manage a large number of devices. Currently, there are two main technologies, Bluetooth (Bluetooth, 2008) and IEEE 802.11x (WiFi, 2008), used to create MANETs. Almost all new mobile phones, PDAs and laptops are Bluetooth-enabled, making them potential members of a Bluetooth MANET. However, Bluetooth accepts only eight devices in a piconet (this is the name of the Bluetooth MANET), as the standard adopted only three bits for addressing a node. The first node, setting up the piconet, becomes its master. It creates the piconet identity and allocates addresses to nodes that join the network. Regarding the merge operation, Bluetooth does not have any specific procedure for it. However, a node that is within the range of two piconets can switch from one to the other. Alternating membership between the two piconets creates the potential for the node to act as a gateway – the larger network is called a scatternet. There is no provision for split, as this was probably not considered a likely event for such a small network. There is no clear protocol for piconet termination, this operation being left to the user’s intervention. The lack of provision for MANET management in the case of Bluetooth is explained by its primary goal of eliminating cables rather than creating P2P mobile networks. However, the increasing popularity of Bluetooth may lead to the necessity of managing large scatternets (collections of interconnected piconets) and a rethink of this technology. The 802.11x standards cover only the physical and link layers. A WiFi MANET is generally an IP-based network.
Solutions for the Management of IP-based MANET The management of an IP-based MANET is the management of the IP addresses. Considering the role of IP addresses as a pointer to a unique computer/node in a certain network, it is easy to understand the difficulty of porting this concept to the mobile area, where nodes can change network membership often. Several solutions were proposed for IP address allocation to mobile nodes. The simplest use the current features of the protocol. Either, there is a DHCP server that has a pool of IP addresses, or mobile nodes are self-configuring. In the former situation, it is assumed that one node runs the DHCP server,
the pool has enough IP addresses for all potential candidates to membership, and the DHCP host will belong to MANET for its entire lifetime. Obviously, all these assumptions are difficult to guarantee. In the latter situation, each node picks an IP address from the link-local range or from a private range of addresses and then checks for duplicates. From time to time, nodes will search for a DHCP server and if this is present, will ask for an IP address. Although much simpler, this strategy assumes that nodes share the same link and the size of the network is not too large. Otherwise, duplicate checks would dominate the network communication. Moreover, join and merge can be committed only after duplicate check confirms the lack of duplicates. If these operations are frequent, the management of IP addresses will consume a lot of the most important resources of mobile nodes, battery energy and bandwidth. IP-based MANET termination is either triggered by the return of all IP addresses to the pool, or by a user defined protocol.
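The self-configuration alternative just described can be pictured with the following sketch: a node picks a tentative address from the IPv4 link-local range (169.254.0.0/16) and commits only after a duplicate check. The probeForDuplicate step is left as a stub, since real implementations rely on ARP-level probing that plain Java does not expose; the class and method names are illustrative.

```java
import java.net.InetAddress;
import java.util.concurrent.ThreadLocalRandom;

public class LinkLocalSelfConfig {

    // Pick a tentative address in 169.254.1.0 - 169.254.254.255 (the usable link-local range).
    static InetAddress pickTentativeAddress() throws Exception {
        int b3 = ThreadLocalRandom.current().nextInt(1, 255);
        int b4 = ThreadLocalRandom.current().nextInt(0, 256);
        return InetAddress.getByName("169.254." + b3 + "." + b4);
    }

    // Placeholder: a real node would probe the link (e.g. via ARP) and listen for a defence message.
    static boolean probeForDuplicate(InetAddress candidate) {
        return false; // assume the candidate address is free in this sketch
    }

    public static void main(String[] args) throws Exception {
        InetAddress addr;
        do {
            addr = pickTentativeAddress();
        } while (probeForDuplicate(addr));  // retry until no duplicate is detected
        System.out.println("Self-configured address: " + addr.getHostAddress());
    }
}
```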
MANETconf One of the earliest projects that offered a full solution to IP-based MANET management is MANETconf (Nesargi, 2002). This protocol assumes that all nodes use the same private address block, e.g., 10.0.0.0 to 10.255.255.255, and each node that requests to join a network will benefit of the services of an existing member of the network. This node, acting as a proxy, will allocate an IP address to the newly arrived node after checking with all the other nodes that that IP address is idle. Conflicts among multiple proxies trying to allocate IP addresses in the same time are solved by introducing priorities. The proxy with the lower IP address has priority over the other(s). Split and merge were considered as well. While split is managed by simply cleaning up the IP addresses of departed nodes, belonging to all the other partitions, merge requires a more elaborated algorithm. The authors associate with each partition an identity represented by a 2-tuple. The first element of the tuple is the lowest IP address in use in the partition. The second element is a universally unique identifier (UUID) proposed by the node with this lowest IP address. Each node in the partition stores the tuple. When two nodes come into each other’s radio range, they exchange their partitions identities. If they are different, a potential merge is detected. This operation will proceed by exchanging the sets of idle IP addresses and then broadcasting them to all the other members of each partition. If by merging there are conflicting addresses (duplicates) the node(s) with the lower number of active TCP connections will request a new IP address. MANETconf as a complete solution requires a lot of communication that increases with the size of the network. Therefore, we cannot consider this protocol a scalable solution to MANET management. IP addressing and the associated protocol are effective for wired networks and to a certain extent to Access Point-based mobile networks (Mobile IPv4, or IPv6), but difficult to manage in a MANET (Ramjee, 2005) (Tseng, 2003). The main difficulty arises from the fact that an IP address has no relevance for a mobile node that can change network membership frequently. Any such change may result in a new address allocated to that node, each time, followed by duplicate checks. More important than a numeric address is the set of services and resources made available to other peers by the node. In this respect, there are initiatives to introduce new ways of addressing mobile nodes of MANET (Adjie-Winoto, 1999), (Balazinska, 2002). One particular project deals with a service oriented strategy which builds on the assumption that MANET are mainly created for the purpose of sharing services and, in this context, IP addresses as an indication of location have no relevance (Grigoras, 2005). Then, service discovery, remote execution and service composition are the most important operations related to service sharing.
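The partition identity that MANETconf uses for merge detection can be modelled directly from the description above as a 2-tuple of the lowest IP address in use and the UUID proposed by that node. The class below is an illustrative model of that idea, not code taken from MANETconf.

```java
import java.util.Objects;
import java.util.UUID;

// Illustrative model of a MANETconf partition identity: (lowest IP in use, UUID from that node).
final class PartitionId {
    final String lowestIp;
    final UUID uuid;

    PartitionId(String lowestIp, UUID uuid) {
        this.lowestIp = lowestIp;
        this.uuid = uuid;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PartitionId)) return false;
        PartitionId other = (PartitionId) o;
        return lowestIp.equals(other.lowestIp) && uuid.equals(other.uuid);
    }

    @Override
    public int hashCode() {
        return Objects.hash(lowestIp, uuid);
    }

    // Two nodes coming into radio range exchange identities; a mismatch signals a potential merge.
    static boolean mergeDetected(PartitionId mine, PartitionId neighbours) {
        return !mine.equals(neighbours);
    }
}
```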
If the Internet Protocol is not used anymore, new transport protocols, probably simpler but still reliable if replacing TCP, have to be designed. To define the scope of MANET, a new concept of soft network identity was proposed in (Grigoras, 2007a). As this concept provides a totally new approach in the way MANET are managed and moreover assures scalability, we will explain it in the following section.
The Management of Non-IP MANET The difficulties and high cost of managing IP addresses led to the idea that there might be better ways to manage a MANET while preserving the requirements of communication and service provisioning among all nodes. Because a MANET is a system with a limited lifetime, it makes sense to allocate it an identity that is valid only as long as the MANET is active/alive. This identity is then used for all the management operations. The first node that organizes the MANET computes a network identity, net_id for short, based on its MAC address, date and time. It then attaches to it a time-to-live (TTL), an expectation of how long that network will be alive: {net_id, TTL}. This pair represents the soft identity state of the new network. For example, {112233, 200} corresponds to network 112233 with a living expectation of 200 seconds. A node joins the network after requesting, receiving and storing, from any one-hop neighbour already in the network, the net_id and updated TTL, for example {112233, 150}. The TTL is counted down by each node. When it times out, the associated net_id is cancelled, meaning that the node is no longer a member of the network, 112233 in our example. On the other hand, the TTL is prolonged when a message carrying the net_id in the header is received by the node. The significance is that messages mean activity and therefore the network should be kept alive (i.e. the node is, or can be, on an active path). To increase the chance of finding services of interest, a node may join as many MANETs as it wants – using the Greedy join algorithm (Grigoras, 2007a). All {net_id, TTL} records are cached and each of them is managed separately. If a node leaves a network, it may still be active in other network(s). Within a MANET, a node is uniquely addressed by its MAC address and its set of public and private services.
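A minimal sketch of the cached {net_id, TTL} record might look as follows; the field names and the millisecond-based expiry are illustrative choices, not taken from (Grigoras, 2007a).

```java
// Sketch of a cached {net_id, TTL} record; one instance is kept per network the node has joined.
class SoftNetId {
    final String netId;           // e.g. derived from the organizer's MAC address, date and time
    private long expiresAtMillis; // absolute expiry time derived from the TTL

    SoftNetId(String netId, long ttlSeconds) {
        this.netId = netId;
        this.expiresAtMillis = System.currentTimeMillis() + ttlSeconds * 1000;
    }

    // Called whenever a message carrying this net_id is received or routed: activity keeps the network alive.
    void refresh(long ttlSeconds) {
        expiresAtMillis = System.currentTimeMillis() + ttlSeconds * 1000;
    }

    // When the TTL times out, the record is cancelled and the node is no longer a member of that network.
    boolean expired() {
        return System.currentTimeMillis() >= expiresAtMillis;
    }
}
```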
The MANET Management Operations MANET setup and join are executed by the same algorithm: initially, the host broadcasts a message to join an existing network; if there is a reply, carrying the net_id and TTL, the host caches them and becomes a member of the network; if there is no reply within a join time interval, it will still wait for a delay interval for possible late replies and, then, if no reply was received, it will compute its own net_id and attach an TTL. The expectation is that other hosts will join this network and activity will start. Otherwise, the TTL counter will time out and the net_id will be cancelled. The host is free to start the procedure again. For example, this can be a background process that will assist a distributed application by providing the network infrastructure. The join and delay time intervals are two parameters whose initial values, picked by the user, can be updated depending on the environment (number of failures, mobility pattern etc). Merge is triggered by a host that receives messages carrying in the header a new net_id. This operation can be executed on demand or implicitly – two or more overlapping networks merge. In both cases, the
contact host will forward the new {net_id, TTL} pair to all peers. The merge can be mandatory or not. Obviously, islands of nodes may lose the membership by time out if they don’t receive/route messages. This behaviour was indeed noticed during simulation (Grigoras, 2007a). Split is simpler: all sub-networks preserve the net_id; if there is activity, TTL will be prolonged, otherwise it will time out. If split networks from the same network will come together again, the net_id is still the same. Termination is signalled by time out. Indeed, when there is no activity, the counter will time out and hosts will gracefully leave. The MANET management based on the soft net_id concept, presented here, is simple, uses the minimum number of messages (2), is scalable and offers a full solution. Experimental results (Grigoras, 2007b) showed not only that the net_id strategy uses the minimum number of messages for carrying out the management operations but also is scalable. Scalability is provided by the use of local operations. Indeed, when a node plans to join one or more networks, it simply broadcast its join request and then listens for offers. By storing the net_id, the node becomes a de facto member of the network and can now communicate with other nodes. No management operation requires global communication and this is a key rule for scalable distributed systems. As MANET is still a new networking model, there will be more and potentially better strategies for their management that will also be scalable - accept any number of new nodes with minimum consumption of resources.
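To summarize the combined setup/join behaviour described above, the following sketch puts the two timeouts and the fall-back network creation into code. The radio-layer primitives (broadcastJoinRequest, waitForOffer) and the concrete timeout values are assumptions of this sketch, not part of the published protocol description.

```java
import java.util.Optional;

class ManetBootstrap {
    long joinTimeoutMillis = 2_000;   // illustrative initial values; tunable by the user
    long delayTimeoutMillis = 1_000;
    long defaultTtlSeconds = 200;

    // Returns the net_id of the network this host ends up in (joined or newly created).
    String setupOrJoin() {
        broadcastJoinRequest();                                    // ask one-hop neighbours
        Optional<String> offer = waitForOffer(joinTimeoutMillis);  // a reply carries {net_id, TTL}
        if (offer.isEmpty()) {
            offer = waitForOffer(delayTimeoutMillis);              // allow for possible late replies
        }
        if (offer.isPresent()) {
            return offer.get();                                    // cache it and become a member
        }
        return computeNetIdFromMacDateTime();                      // nobody answered: create a new network
    }

    // The primitives below abstract the radio layer; they are placeholders for this sketch.
    void broadcastJoinRequest() { /* send a join request on the local broadcast channel */ }
    Optional<String> waitForOffer(long timeoutMillis) { return Optional.empty(); }
    String computeNetIdFromMacDateTime() { return "112233"; }
}
```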
CLUSTER COMPUTING ON MANET Global High Performance Mobile Computing The world of High Performance Computing (HPC) utilises the combined processing power of several inter-connected nodes to compute the solution of a complex problem within a reasonable timeframe. Presently the top rated HPC machine is IBM’s Blue Gene/L (IBM, 2007) comprising of 131,072 processors providing a peak performance of 596 Teraflops. According to performance projections it is expected that a Petaflop capable machine should be in place before 2009 (TOP500, 2007) and a ten Petaflop machine by 2012. The SETI@Home project is the most well know distributed Internet computing project in the world. It is just one of several projects (BOINC, 2008) that are part of the Berkeley Open Infrastructure for Network Computing (BOINC). The recent upgrade of the world’s largest radio telescope in Arecibo, Puerto Rico from where SETI@Home receives its data stream means a five hundred fold increase in the amount of data that needs to be processed (Sanders, 2008). This amounts to 300 gigabytes of data per day. The Seti@Home project uses a divide and conquer strategy implemented as a Client/Server architecture whereby client applications running on personal computers throughout the world carry out the task of processing the data and returning the results to the servers at Berkeley. The project has over five million registered volunteers, with over 201,147 users processing data blocks on a regular basis across 348,819 hosts. The project was running at 445.4 teraflops (SETIstats, 2008), the combined speed of all the BOINC projects was rated at 948.7 teraflops comprising 2,781,014 hosts (BOINCstats, 2008) as of 10th March 2008. The Folding@home project is similar to SETI@home using the processing power of volunteers from around the globe. The client statistics for Folding@home (as of 9th March 2008) had
Table 1. Comparison of mobile phone CPU speeds

Phone        Announced    OS                 CPU      JBenchmark ACE
Nokia N96    11/02/2008   Symbian OS v9.3    400Mhz   Unknown
Nokia N93    25/04/2006   Symbian OS v9.1    330Mhz   329Mhz
Nokia N70    27/04/2005   Symbian OS v8.1a   220Mhz   220Mhz
Nokia N73    25/04/2006   Symbian OS v9.1    206Mhz   221Mhz
Nokia 6680   14/02/2005   Symbian OS v8.0a   220Mhz   224Mhz
Nokia 6630   14/06/2004   Symbian OS v8.0a   220Mhz   227Mhz
Nokia 7610   18/03/2004   Symbian OS v7.0s   123Mhz   126Mhz
264,392 active nodes operating at 1,327 Teraflops (Folding@home, 2008), well over twice that of the world’s most powerful supercomputer. The bulk of this processing came from Playstation 3 gaming machines giving 1,048 Teraflops from 34,715 active nodes. Clearly for applications that require a high degree of processing, the architecture of distributing the work out to numerous clients can achieve processing speeds far in excess of the world’s top HPC machines. Scalability still poses many questions within the realm of HPC such as what type of architecture should a million-node system have, or how should an application be scheduled on a 1,024 core processor. Could the principle of client applications carrying out CPU intensive operations be feasible within the world of mobile computing? If so what possibilities may lie in store for the future of mobile distributed computation? The rate at which mobile phone technology is being adopted is astonishing, the first billion subscribers took 20 years, the next billion required just 40 months, while the third required a mere 24 months. It would appear that the world has an ever growing and insatiable hunger for mobile technology. November 29, 2007 saw a significant milestone in global mobile phone ownership when it was announced that mobile telephone subscriptions reached 50%, this amounts to over 3.3 billion subscribers (Reuters, 2007). Reports predict that subscriptions may be as high as five billion by 2012 (PortoResearch, 2008). Could the computing power of these billions and billions of mobile phones be harnessed? If so what would the combined computing power of all these devices be? The present number of phones outstrips the total number of processors of the world’s largest supercomputer by over 25,000 times. Mobile devices may have far less computing power than high-end server machines, but their sheer and ever growing number more that counteracts this, as well as their rapidly increasing processing capabilities. In January 2008, ARM announced that it achieved the ten billion processor milestone. The current rate at which these chips are being created is staggering with the annual run rate now estimated at three billion units per year (ARM, 2008). What level of computing power could the mobiles of today provide? In 1999 a 500 Mhz Pentium III machine had a capacity of about 1,354MIPS. In October 2005 ARM announced the 1 Ghz Cortex A8 processor capable of a whopping 2,000MIPS. Even the processors of a phone of five years ago are rated at 200MIPS. Table 1 gives a cross section overview of a selection of mobile phone types and their associated processor speeds. The table was compiled using information from the Nokia developer forum, the reviews section of the my-symbian.com website and was cross referenced with yet another site that provides detailed and up-to-date comparisons of mobile phone specifications (Litchfield S, 2008). The phones presented were also evaluated against JBenchmark’s ARM CPU Estimator (ACE) which
provides an accurate estimate of the processor's CPU speed. It is generally very difficult to obtain concrete and detailed information about a phone's specification. Most manufacturers neglect to provide detailed specifications, so that mobile phones are not weighed up by the common factors such as system memory, processor speed and persistent storage by which desktop/laptop machines are examined by consumers. This may change in time as phones gain more and more computing capabilities. A testament to the increasing power of the mobile device is Sun Microsystems' (Shankland, 2007) discontinuation of the use of Java Micro Edition in favour of a full-blown virtual machine of the kind that runs on the desktop systems of today. An article in late 2007 (Davis, 2007) considered the notion of cell-phone grid computing, and posed the question of whether Android, the new open-source mobile platform, could provide a foundation for it. It is therefore becoming evident that people are now becoming aware of the huge potential computing capabilities that the billions of mobile phones could provide. In summary, the combined might of all the world's mobile phones could be the most powerful supercomputer in the world if we could just harness their processing capabilities in a manner similar to the BOINC projects. One may of course say that processor-intensive computation would quickly drain the limited battery. Even if processing were carried out only while the phone was connected to mains power, it would still allow for probably two hours of solid processing per week. Given this, one would still have upwards of 40 million devices contributing their processing power at any instant, given the 3.3 billion mobile phone population at the end of 2007. Such extreme mobile parallel computing systems would of course be suitable only for hyper-parallel tasks that can be easily divided into millions or billions of distinct jobs. Third generation phones allow for relatively fast internet connectivity, with rates of several hundred kbit/s. The main prohibiting factors in the creation of a hyper-parallel mobile grid are the interconnectivity costs and people's willingness to participate. Costs are continually reducing, and as more and more people join BOINC-like projects they are realising that they can contribute to the solving of complex and processor-intensive problems. Moving into the future, science will tackle larger and larger problems that will require all the potential processing power we can muster to solve within a reasonable timeframe. It may take some time before we see a hyper-parallel globalised mobile grid, but on a smaller scale the alternative is to use the processing power of the phones within our local vicinity. This is where technologies such as Bluetooth and message passing come into their own.
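As a back-of-the-envelope check on the 40-million figure quoted above (the two-hours-per-week duty cycle is of course only an assumption): two hours of mains-connected processing out of a 168-hour week is a duty cycle of roughly 1.2%, and 3.3 × 10^9 × 2/168 ≈ 3.9 × 10^7, i.e. close to 40 million devices available at any given instant.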
Localised Mobile Parallel Computing The majority of today’s phones are Bluetooth-enabled as standard; they also have the ability of executing java based applications in the form of MIDlets. Most of these Bluetooth enabled devices allow for data transmission rates of up to 723kbit/s (Bluetooth 1.2) and have an effective range of 10 meters. These mobile phones are therefore perfect platforms for parallel computing tasks on a small scale. The standard Bluetooth Piconet allows for up to eight devices to be interconnected together. This functions in a star network topology using a Client/Server architecture. A star network is of limited use when one Client device wishes to communicate with another client device. In this case, all traffic will have to be routed through the Master device. The solution lies within the bedrock that is parallel computing today, the message passing interface, whereby any node is capable of communicating with any other node. In the mobile world this is achieved by firstly creating the standard star network topology after the process of device and service discovery have been carried out. With connections established to a central node
the process of creating the inter-client connections can take place, allowing a fully interconnected mesh network to be built up (Figure 1).

Figure 1. MMPI network structure for Piconet and Scatternet sized networks

A system called the Mobile Message Passing Interface (MMPI) allows such an infrastructure to be created and provides methods for both point-to-point and global communications (Doolan, 2006). Bluetooth itself is inherently Client/Server based, therefore when establishing a parallel world using the MMPI system it is necessary for the user to indicate whether the application should be started in Client or Server mode. In the case of a node started with a Client setting, its primary task is to create a Server object to advertise itself as being available. With all the Client nodes up and running, the remaining node can be started as a Master node, which will carry out the discovery process and coordinate the creation of the inter-client links. Bluetooth programming in itself changes the way a typical Client/Server system works, as the Client devices are required to establish server connections to allow the Server device to carry out the discovery process and establish Client connections to them. In standard Client/Server systems it is the Server application that is started first and left running, remaining in a constant loop awaiting incoming Client applications to connect to it, a web server being a typical example. To ensure correct inter-node communication each node maintains an array of connections to every other device within the world. This takes the form of a set of DataInputStreams and DataOutputStreams. Communication between nodes is achieved through a set of methods that abstracts the developer from dealing with streams, communication errors and so forth. In the case of point-to-point communication one simply needs to call a method such as send(…) to transmit a message to another node. The parameters that are passed are firstly the data to be transmitted (an array of data), an offset, the amount of data to send, the data type, and most importantly the id (rank) of the receiving device. Correspondingly, the receiving device must have a matching recv(…) method call to correctly receive the message from the source node. Can the MMPI system scale to a world larger than eight nodes? The Bluetooth Piconet allows for a maximum of eight interconnected devices; however, one may use the Scatternet architecture to build larger systems, by interconnecting two or more Piconets by way of a bridging node common to both. Using a Scatternet framework, the MMPI system can be scaled to allow for larger networks; for example, one could have a network of twelve, fifteen or even twenty devices that allows for inter-node communications between all nodes. This is achieved by the creation of a java class called
CommsCenter which forms the heart of the Scatternet MMPI system (Donegan, 2008). The CommsCenter receives raw data from the network and translates it into MMPI messages. These messages are passed on to the MMPI interface that is exposed to the developer by means of an additional intermediary class called the MMPINode. The purpose of this is to interface between the high level MMPI methods and the lower level communications; it also helps to take care of the discovery process. Messages that are sent out on to the Bluetooth network are fed up and down through this chain of classes that allows for the abstraction of lower level operations. Messages that are received by the CommsCenter are identified by the header and may take one of five forms: Bridge, Master, Slave, Confirm and Data. The first three are used for the establishment of the network structure to inform what role a specific device should take. The Confirm message is initiated on completion of the network formation process. The Data header is used for the transportation of inter-node messages. In the case that the number of devices that are discovered exceeds the limit of seven, then one of these devices will be chosen to act as a bridging node, therefore forming essentially two distinctive Piconets. The root node will carry out this selection process as it is aware of the number of active nodes that are advertising themselves of inclusion within the parallel world. The root node will build up a list of what devices are to be in each network, and in the case of devices that will appear in a network connected to a bridging node, a Bridging message will be sent to the bridge in question with a list of the node addresses to which it should establish connections to. A routing table is also established as many nodes may not have a direct connection to several of the other nodes in the world. A routing table is maintained by each node, which maintains an entry for every other node except itself. The entries in the table are an index to which node a message should be routed through in order to get to its destination. In the case of a slave node on one Piconet wishing to communicate to a slave node on another Piconet, the message is firstly transmitted to the Master node of the first Piconet (Figure 1). The Master will then forward the message on to the bridging node that interconnects the two networks, and is again forwarded to the Master node of the second Piconet. The message can then be finally sent on to its final destination (a slave node on the second Piconet). Figure 1 clearly shows the interconnections for MMPI running on Piconet sized (limited to eight or less) and Scatternet sized network. In the case of the Piconet sized worlds every node maintains direct connections with every other node. This differs greatly in the case of larger MMPI worlds where the network structure reverts back to a Scatternet structure, comprising of star network topologies interconnected by bridges. The Master node for each sub-network in this case must deal with the routing of messages between Slave nodes, therefore this node can easily become a communications bottleneck when there is a high amount of data transmissions. The larger incarnation of the MMPI architecture was developed in a Scatternet manner to keep the routing tables as simple as possible. This however, could be improved by creating inter-slave/bridge connections between each of the nodes in each sub-network. 
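As an aside, the per-node routing table described above can be reduced to a simple map from destination rank to next-hop neighbour; the class and method names below are illustrative, not the actual Scatternet MMPI classes.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative next-hop routing table for a Scatternet node (not the actual MMPI implementation).
class ScatternetRouting {
    private final int myRank;
    private final Map<Integer, Integer> nextHop = new HashMap<>(); // destination rank -> neighbour rank

    ScatternetRouting(int myRank) {
        this.myRank = myRank;
    }

    void addRoute(int destination, int viaNeighbour) {
        nextHop.put(destination, viaNeighbour);
    }

    // A slave-to-slave message across piconets hops: slave -> master -> bridge -> master -> slave.
    int forwardTo(int destination) {
        Integer hop = nextHop.get(destination);
        if (hop == null) {
            throw new IllegalStateException("no route from " + myRank + " to " + destination);
        }
        return hop; // the neighbour to which the Data message should be handed next
    }
}
```

With the direct inter-slave/bridge connections suggested above, more destinations would map straight to a one-hop neighbour and fewer messages would need to transit a Master node at all.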
The process of network formation and routing would be more complex for the initial creation of the world, but it would have the effect of reducing the bottleneck effect on the sub-network Master nodes in the case of inter-slave communications. The MMPI system can be used for a myriad of applications, from parallel processing and graphics to mLearning and multiplayer gaming. Due to the high level of abstraction it liberates the developer from java Bluetooth development, network formation, and the handling of data streams. One can develop a multi-node application very rapidly in comparison to a multi-node application developed from scratch. Instead of writing hundreds of lines to carry out discovery and establish connections, one needs only to
call the constructor of the MMPI system. Therefore one single line replaces hundreds; when carrying out communications between nodes one simply needs to call an appropriate method, be it for point-to-point or global communications. In the space of less than a dozen lines one can develop the necessary code to build a fully functional Bluetooth network and achieve communications between the nodes. This has several advantages, such as speeding up application development and allowing the developer to focus on the domain-specific task at hand, without having to worry about handling detailed communications issues. In the area of games development, a well-built single-user game can be transformed into a multiplayer game in a matter of hours, requiring minimal code changes. Many people enjoy playing computer games, but playing against another human player rather than an AI algorithm adds far more unpredictability to the game. The number of multiplayer Bluetooth-enabled games for mobile phones is quite limited; one reason is the need to keep a game compatible with as many devices as possible. The process of transforming a single-player game into a multiplayer game can also be time-consuming and require significant development resources, but this is no longer the case. Perhaps as more and more people invest in Bluetooth-enabled phones we will see a change in the market, with more multiplayer games being developed; as such, the MMPI system may prove to be of significant advantage to these developers, both in reducing development time and costs and in reducing code complexity.
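To picture the "less than a dozen lines" claim, the fragment below sketches a hypothetical MMPI-like facade and its use. The constructor behaviour and the send/recv signatures are reconstructed from the description in this section (data array, offset, count, type, destination/source rank) and are assumptions rather than the published MMPI API.

```java
// Hypothetical MMPI-like facade reconstructed from the description above; the real MMPI API may differ.
interface MmpiLike {
    int INT = 1;                                            // illustrative datatype tag
    int rank();                                             // this node's id within the parallel world
    int size();                                             // number of nodes in the world
    void send(int[] data, int offset, int count, int type, int destRank);
    void recv(int[] data, int offset, int count, int type, int srcRank);
    void finish();                                          // tear down the Bluetooth connections
}

class MmpiUsageSketch {
    static void distributeWork(MmpiLike mmpi) {
        int[] work = new int[64];
        if (mmpi.rank() == 0) {                             // the master scatters a block to every worker
            for (int dest = 1; dest < mmpi.size(); dest++) {
                mmpi.send(work, 0, work.length, MmpiLike.INT, dest);
            }
        } else {                                            // each worker posts the matching receive
            mmpi.recv(work, 0, work.length, MmpiLike.INT, 0);
        }
        mmpi.finish();
    }
}
```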
CONCLUSION In this chapter, we addressed aspects of the scalability of MANET management and MANET clusters. Regarding MANET management, the prevalent strategy is to use IP. However, managing IP addresses is resource-consuming and, for large MANETs, can become a nightmare. Our conclusion is that IP-based MANETs cannot be scalable. New approaches such as the net_id are simpler, use fewer resources and, more importantly, provide scalability. Clustering is an interesting solution for creating more powerful systems out of many basic devices. For example, mobile phones can generally be classed as having very limited resources, be it a combination of electrical power, processor and system memory. The use of parallel computing techniques allows these small devices to divide up large tasks among themselves and carry out jobs that would otherwise be impossible for a single device. One example would be a job with higher memory requirements than are available on a single device. Another and more imperative restriction is electrical power, whereby a task may take too long to process given a limited battery. The division of work among multiple nodes can spread the resource cost among a number of devices, allowing tasks impossible for one single device to be completed and the results obtained in a far shorter wall clock time. The amalgamation of Bluetooth and message passing paradigms to form Java-based mobile parallel applications is one solution to this problem, allowing mobile parallel computing to take place between a limited number of mobile devices. Perhaps in the not too distant future we will see the rise of the hyper-parallel globalised mobile grid as the information processing needs of research projects escalate. Supercomputing may no longer be the realm of high-end server farms, but ubiquitous throughout the world, with devices such as our phones, set-top boxes, desktop computers, and even our cars providing their free clock cycles to solve the data processing requirements of tomorrow.
REFERENCES Adjie-Winoto, W., Schwartz, E., Blakrishnan, H., & Lilley, J. (1999). The design and implementation of an intentional naming system. Operating Systems Review, 34(5), 186–201. doi:10.1145/319344.319164 ARM. (2008). ARM Achieves 10 Billion Processor Milestone. Retrieved March 10, 2008, from http:// www.arm.com/news/19720.html Balazinska, M., Blakrishnan, H., & Karger, D. (2002). INS/Twine: a scalable peer-to-peer architecture for intentional resource discovery. In Pervasive 2002, Zurich, Switzerland, August. Berlin: Springer Verlag. Bluetooth (2008). Retrieved November 2008 from www.bluetooth.com BOINC. (2008). Berkeley Open Infrastructure for Network Computing. Retrieved March 10, 2008 from http://boinc.berkeley.edu BOINCstats. (2008). Seti@home Project Statistics. Retrieved March 10, 2008, from http://boincstats. com/stats/project_graph.php?pr=sah Davis, C. (2007). Could Android open door for cellphone Grid computing? Retrieved March 10, 2008, from http://www.google-phone.com/could-android-open-door-for-cellphone-grid-computing-12217. php Donegan, B., Doolan, D. C., & Tabirca, S. (2008). Mobile Message Passing using a Scatternet Framework. International Journal of Computers, Communications & Control, 3(1), 51–59. Doolan, D. C., Tabirca, S., & Yang, L. T. (2006). Mobile Parallel Computing. In Proceedings of the Fifth International Symposium on Parallel and Distributed Computing (ISPDC 06), (pp. 161-167). Folding@home, (2008). Client statistics by OS. Retrieved March 10, 2008, from http://fah-web.stanford. edu/cgi-bin/main.py?qtype=osstats Grigoras, D. (2005). Service-oriented Naming Scheme for Wireless Ad Hoc Networks. In the Proceedings of the NATO ARW “Concurrent Information Processing and Computing”, July 3-10 2003, Sinaia, Romania, 2005, (pp. 60-73). Amsterdam: IOS Press Grigoras, D., & Riordan, M. (2007a). Cost-effective mobile ad hoc networks management. Future Generation Computer Systems, 23(8), 990–996. doi:10.1016/j.future.2007.04.001 Grigoras, D., & Zhao, Y. (2007b). Simple Self-management of Mobile Ad Hoc Networks. Proc of the 9th IFIP/IEEEInternational Conference on Mobile and Wireless Communication Networks, 19-21 September 2007, Cork, Ireland. IBM. (2007). Blue Gene. Retrieved March 10, 2008, from http://domino.research.ibm.com/comm/ research_projects.nsf/pages/bluegene.index.html Litchfield, S. (2008). A detailed comparison of Seires 60 (S60) Symbian smartphones. Retrieved March 10, 2008, from http://3lib.ukonline.co.uk/s60history.htm
Nesargi, S., & Prakash, R. (2002). MANETconf: Configuration of Hosts in a Mobile Ad Hoc Network. In Proceedings of the IEEE Infocom 2002, New York, June 2002. PortoResearch. (2008). Slicing Up the Mobile Services Revenue Pie. Retrieved March 10, 2008, from http://www.portioresearch.com/slicing_pie_press.html Ramjee, R., Li, L., La Porta, T., & Kasera, S. (2002). IP paging service for mobile hosts. Wireless Networks, 8, 427–441. doi:10.1023/A:1016534027402 Reuters (2007). Global cellphone penetration reaches 50 pct. Retrieved March 10, 2008, from http:// investing.reuters.co.uk/news/articleinvesting.aspx?type=media&storyID=nL29172095 Sanders, R. (2008). SETI@home looking for more volunteers. Retrieved 10 March, 2008, from http:// www.berkeley.edu/news/media/releases/2008/01/02_setiahome.shtml SETIstats. (2008). Seti@home Project Statistics. Retrieved March 10, 2008, from http://boincstats.com/ stats/project_graph.php?pr=bo Shankland, S. (2007). Sun starts bidding adieu to mobile-specific Java. Retrieved March 10, 2008, from http://www.news.com/8301-13580_3-9800679-39.html?part=rss&subj=news&tag=2547-1_3-0-20 TOP500. (2007). TOP 500 Supercomputer Sites, Performance Development, November 2007. Retrieved March 10, 2008 from http://www.top500.org/lists/2007/11/performance_development Tseng, Y-C., Shen, C-C. & Chen, W-T. (2003). Integrating Mobile IP with ad hoc networks. IEEE Computer, May, 48-55. WiFi (2008). Retrieved November 2008 from http://www.ieee802.org/11/
KEY TERMS AND DEFINITIONS
Bluetooth: An RF-based wireless communications technology that has very low power requirements, making it a suitable system for energy-conscious mobile devices. The JSR-82 Bluetooth API facilitates the development of Java-based Bluetooth applications.
IEEE 802.11x (WiFi): A set of standards defined by IEEE for wireless local area networks.
IP: The Internet Protocol is a data communication protocol used on packet-switched networks.
MANET: Mobile ad hoc network, a network temporarily created by mobile devices without any infrastructure support.
MMPI: The Mobile Message Passing Interface is a library designed to run on a Bluetooth piconet network. It facilitates the development of parallel programs, parallel graphics applications, multiplayer games and handheld multi-user mLearning applications.
Net_id: The mobile ad hoc network identity created by the mobile host which organizes it. It is a soft variable that is valid only as long as the network is active.
Chapter 31
Network Selection Strategies and Resource Management Schemes in Integrated Heterogeneous Wireless and Mobile Networks Wei Shen University of Cincinnati, USA Qing-An Zeng University of Cincinnati, USA
ABSTRACT An integrated heterogeneous wireless and mobile network (IHWMN) is introduced by combining different types of wireless and mobile networks (WMNs) in order to provide more comprehensive service, such as high bandwidth with wide coverage. In an IHWMN, a mobile terminal equipped with multiple network interfaces can connect to any available network, even to multiple networks at the same time. The terminal can also change its connection from one network to other networks while still keeping its communication alive. Although the IHWMN is very promising and a strong candidate for future WMNs, it brings a lot of issues because different types of networks or systems need to be integrated to provide seamless service to mobile users. In this chapter, the authors focus on some major issues in IHWMNs. Several novel network selection strategies and resource management schemes are also introduced for IHWMNs to provide better resource allocation for this new network architecture.
INTRODUCTION Wireless and mobile networks (WMNs) attract a lot of attention in both academic and industrial fields. They are also witnessing a great success in recent years. Generally, WMNs can be classified into two types, centralized (or infrastructure-based) and distributed (or infrastructure-less) WMNs.
Copyright © 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Network Selection Strategies and Resource Management Schemes
networks are the most widely deployed centralized WMNs and have evolved from the earliest 1G cellular networks to the current 2G/3G cellular networks. Generally, the service area of a cellular network is divided into multiple small areas called cells. Each cell has a central control unit referred to as a base station (BS). All the communications in the cellular network take place via the BSs. That is, communication in a cellular network must be relayed through a BS. The IEEE 802.11 WLAN (Wireless Local Area Network) is another type of centralized WMN, which has much smaller coverage compared to cellular networks. Because WLANs are easy to deploy and can provide high bandwidth service, they have experienced rapid growth and wide deployment since they were launched to the market. In a WLAN, the central control unit is called an access point (AP). Similar to cellular networks, the communications in a WLAN must go via the APs. The BSs or APs are connected to the backbone networks and provide connections to other external networks, such as the Public Switched Telephone Network (PSTN) and the Internet. Besides cellular networks and WLANs, there are also many other types of centralized WMNs, such as satellite networks, WiMAX, HiperLAN, etc. Unlike centralized WMNs, there is no fixed network structure in a decentralized (or distributed) WMN. The wireless and mobile ad hoc network is a typical distributed WMN that has attracted a lot of research interest recently (Agrawal, 2006). A wireless and mobile ad hoc network is dynamically created and maintained by its nodes. The nodes forward packets to/from each other via a common wireless channel without the help of any wired infrastructure. When a node needs to communicate with other nodes, it needs a routing discovery procedure to find a potential routing path to the destination node. Due to the frequent movement of communication nodes, the routing path between two communicating nodes is not fixed. When a relay node moves out of the transmission range of other communicating nodes, the current routing path is broken. As a result, another routing path has to be found in order to keep the communication alive. Wireless and mobile ad hoc networks are very useful in areas where a centralized WMN is not possible or is inefficient, such as disaster recovery and battlefields. Although there are many wireless and mobile networks and they are witnessing great success in recent years, different types of WMNs have different design goals and restrictions on wireless signal transmission, which limit the services they can provide. Therefore, they cannot satisfy all the communication needs of mobile users. For example, no single type of existing WMN is able to provide a comprehensive service such as high bandwidth with wide coverage. In order to provide more comprehensive services, the concept of an integrated heterogeneous wireless and mobile network (IHWMN) is introduced by combining different types of WMNs. On the other hand, a traditional mobile terminal supports only one network interface, which can connect to only one type of network. With the advance of software defined radio technology, it is now possible to integrate multiple WMN interfaces (multi-mode interfaces) into a single mobile terminal. Such a multi-mode terminal is able to access multiple WMNs if it is under the coverage of multiple WMNs. For example, a mobile terminal equipped with cellular network and WLAN interfaces can connect to the cellular network or the WLAN if both networks are available.
It can further connect to both networks at the same time. However, this is a big challenge, since effective and efficient schemes are required to manage the connections. It is obvious that the introduction of the IHWMN, as well as the multi-mode terminal, brings more flexible and plentiful access options to mobile users. Mobile users can connect to the most suitable network for different communication purposes. For example, a mobile user may connect to the cellular network for voice communication and connect to a WLAN to receive email and surf the Internet. However, there are many challenges, such as the architecture of network integration, network selection strategies,
handoff schemes, resource allocation, etc. These problems have to be solved before launching the IHWMN to the commercial market and enjoying its benefits. The major challenges for the IHWMN are:

• How to integrate different types of wireless and mobile networks? In the same geographic area, it is possible to have more than one network, as shown in Figure 1. If these networks belong to the same operator, the operator can obviously allocate the network resources in a centralized way. However, in most cases these networks may belong to different operators. These different networks may manage resources individually based on different policies, which may cause low resource utilization for the whole IHWMN. Therefore, how to integrate these different types of WMNs from different operators directly affects the performance of the whole system.
• How to manage radio resources in IHWMNs? Resource management becomes more complex in an IHWMN due to the diversity of the services (or traffic) provided by heterogeneous WMNs. Another challenging issue that has to be handled in a multiple-traffic system is fairness among different types of traffic and different types of networks. That is, some low priority traffic may obtain poor performance while some high priority traffic obtains over-qualified service. From the network's point of view, the throughput of some networks may saturate due to high volume traffic while other networks have much less traffic to handle.
• How to select a network in the IHWMN? When a mobile user with a multi-mode terminal generates a new (or originating) call in an IHWMN, a network selection strategy is needed to determine which network should be accessed.
• How to manage vertical handoff? When a mobile user roams in an IHWMN, the multi-mode terminal may change its connection from one network to another. Such a process is called vertical handoff, which will be discussed in the following section. A great challenge for the IHWMN is how to manage vertical handoff, since frequent vertical handoffs cause a heavy signaling burden and fluctuation of the service quality.
• How to efficiently manage multi-mode interfaces? Since the power consumption of each wireless network interface cannot be neglected even in the idle or power saving mode, the terminal cannot activate all the interfaces all the time. Therefore, algorithms are required to find the preferred networks promptly. Another problem of managing multi-mode interfaces is how to use multiple interfaces at the same time. This kind of technology is very useful for supporting services that require very high bandwidth. However, it brings many issues, such as bandwidth allocation among different types of networks, synchronization, etc.

Figure 1. An example of integrated heterogeneous wireless and mobile network
In this chapter, we focus on several issues in the IHWMN. The existing strategies and schemes are reviewed and compared. Several novel schemes are proposed to improve the performance of the IHWMN, which can be categorized as network selection strategies and resource management schemes. The issues addressed in this chapter provide many insights into characterizing the emerging problems in IHWMNs. The remainder of this chapter is organized as follows. We first introduce some basic definitions for the IHWMN and review the existing work in the next section. Then, we tackle the network selection and resource management problems in IHWMNs and provide several novel solutions for these problems. After that, we analyze potential directions for the next steps in IHWMN research.
BACKGROUND

Figure 1 shows an example of an integrated heterogeneous wireless and mobile network. The entire service area in Figure 1 is covered by a satellite network. There are two cells of a cellular network, which have a smaller cell size than the satellite network. Each cell of the cellular network has a BS that manages the communication in the cell. Several WLAN cells overlap with the cells of the cellular network, so some areas are covered by both networks. All these networks can be integrated as a whole, which is an IHWMN. A mobile user with a multi-mode terminal can enjoy multiple communication modes in the IHWMN. In the following, we give some basic definitions for the IHWMN. In a traditional WMN such as a cellular or WLAN network, an active mobile user (in communication) may move from one cell to another cell. In order to keep the communication alive, the connection has to be changed from one BS (or AP) to another BS (or AP). Such a process of changing the connection within the same network is called handoff (Wang, 2003). In this chapter, we define such a handoff as horizontal handoff. For example, the handoff between two adjacent cellular network cells in Figure 1 is a horizontal handoff. In an IHWMN, however, the connection may also be changed from one network to another network for better service, besides the horizontal handoff. Such a process of changing the connection between two different types of networks is called vertical handoff (Chen, 2004). For example, the handoff between the cellular network and the WLAN in Figure 1 is a vertical handoff. Since the vertical handoff happens between two different types of networks (or systems), it is also known as inter-network or inter-system handoff. Accordingly, the horizontal handoff can be called intra-network or intra-system handoff.
Compared to horizontal handoff, vertical handoff is more complex and brings a number of issues, such as vertical handoff decision and vertical handoff execution, which need to be handled carefully. A horizontal handoff usually has to be made in order to keep the communication alive in traditional cellular networks or WLANs; therefore, it is mandatory. Vertical handoff is more complicated than horizontal handoff and can be divided into two categories. If an active mobile user roams into a new network that provides better service than the current serving network, it may request a vertical handoff and change the connection to the better network. Unlike the horizontal handoff, this type of vertical handoff is optional and is called downward vertical handoff (DVH) (Chen, 2004). Furthermore, the mobile user may still keep the communication with the current serving network. On the other hand, when an active mobile user moves out of the coverage of the current serving network, it has to make a vertical handoff to another available network. This type of vertical handoff is called upward vertical handoff (UVH) (Chen, 2004). Similar to the horizontal handoff, the UVH is mandatory since a failed vertical handoff terminates the communication. In the following, we review the related work that has been done in the field of IHWMNs. Some research has been done to integrate different types of wireless and mobile networks. In (Salkintzis, 2002; Salkintzis, 2004), two different mechanisms, tight coupling and loose coupling, have been introduced to interconnect WLANs and cellular networks (GPRS and 3G networks). In the tight coupling mechanism, the WLAN connects to the GPRS core network like other Radio Access Networks (RANs). In other words, the traffic between the WLAN and other external communication networks goes through the core network of the cellular network. Therefore, the WLAN traffic places a burden on the core network of the cellular network. In the loose coupling mechanism, however, the WLAN is deployed as a complementary network to the cellular network, and the WLAN traffic does not go through the core network of the cellular network. The tight coupling mechanism requires that the WLAN and cellular networks belong to the same operator. By using the loose coupling mechanism, WLANs and cellular networks can be deployed individually and need not belong to the same operator, which is more flexible than the tight coupling mechanism. Additionally, the 3GPP (3rd Generation Partnership Project) (3GPP, 2007) working group has also discussed the requirements, principles, architectures, and protocols for interworking 3G networks and WLANs. In (Akyildiz, 2005), the authors proposed to use a third party to integrate different types of wireless and mobile networks. The third party, called a Network Inter-operating Agent, resides in the Internet and manages the vertical handoff between different types of networks. When a multi-mode terminal generates a call in an IHWMN, it requires a strategy to determine which network should be accessed. In (Stemm, 1998), a mobile user always selects the network with the highest available bandwidth among all the available networks during its communication; the only network selection concern for the mobile user is therefore bandwidth. From the user's point of view, this is good for the service quality. In (Nam, 2004), a network selection strategy that only considers the power consumption of mobile users has been introduced.
In order to maximize battery life, the mobile user selects the uplink and downlink from the 3G network or the WLAN that have the lowest power consumption. Consider, for example, a scenario in which the power consumption of the uplink in the 3G network is less than that of the uplink in the WLAN, while the power consumption of the downlink in the 3G network is larger than that of the downlink in the WLAN; in such a case, the terminal would use the 3G uplink together with the WLAN downlink. In (Wang, 1999), the authors have proposed a policy-enabled network selection strategy that combines several factors, such as bandwidth provision, price, and power consumption. A mobile user defines the "best" network based on his or her preferences. By setting different weights over different factors, a mobile user can calculate the total preference of each available network. The mobile
user connects to the network with the highest preference, which is its most desired network. In order to reduce the computational complexity of the cost function in (Wang, 1999), an optimization algorithm has been proposed in (Zhu, 2004). The authors have proposed another network selection algorithm in (Song, 2005) by using two mathematical methods: the analytic hierarchy process (AHP) and grey relational analysis (GRA). The AHP algorithm divides the complex network selection problem into a number of decision factors, and the optimal solution can be found by integrating the relative dominance among these factors. The GRA has also been used for selecting the best network for a mobile user. Although the above network selection strategies have their own advantages, they are all designed to meet individual mobile users' needs. That is, they are user-centric. Furthermore, they do not pay much attention to system performance, such as the blocking probability of originating calls and the forced termination probabilities of horizontal and vertical handoff calls. Generally, the vertical handoff can be divided into three phases: system discovery, vertical handoff decision, and vertical handoff execution (McNair, 2004). In the first phase, the multi-mode mobile terminal keeps searching for another network that can provide better service. Once such a network is found, the vertical handoff decision is made. The vertical handoff decision is a multiple-criteria process that involves many factors, such as bandwidth usage, monetary cost, QoS parameters, etc. The decision results also affect both the degree of user satisfaction and the system performance. If the vertical handoff decision has been made to change the connection to the new network, the context has to be switched to make the change smooth and user-transparent. Since the vertical handoff may not be mandatory and it incurs significant signaling messages, the decision algorithm is critical to the IHWMN. The vertical handoff decision is also seen as a network selection problem in some literature (Wang, 1999; Song, 2005). The number of users in a WMN after a successful vertical handoff is considered to affect the QoS of the IHWMN. A modified Elman neural network can be used to predict this number, and the predicted number of mobile users is fed into a fuzzy inference system to make a vertical handoff decision. With the rapid emergence of multimedia applications such as voice, video, and data, these different types of traffic should be supported in wireless and mobile networks. Generally, multiple traffic can be classified into real-time and non-real-time traffic based on its sensitivity to delay. The major challenge in supporting such multiple traffic is that different types of traffic are incorporated into one system and each type of traffic has its own distinct QoS requirements. For example, real-time traffic (such as voice and video) is delay-sensitive, while non-real-time traffic (such as data) is delay-tolerant. Therefore, an efficient resource management scheme to support multiple traffic has to treat them differently and satisfy their individual QoS requirements. In an integrated wireless and mobile network, resource management faces more challenges due to the diversity of the services provided by different types of wireless and mobile networks. Unfairness may happen among different types of traffic when handling multiple traffic.
That is, the performance of lower priority traffic should be improved once the higher priority traffic has been provided with satisfactory service. In (Pavlidou, 1994), the authors have presented different call admission control policies for voice and data traffic. Since data traffic is delay-insensitive while voice traffic has stringent access delay requirements, they have proposed to allow voice traffic to preempt data traffic. A priority queue is introduced to hold the preempted data calls. When a data call arrives, it is also put into the queue if there is not enough resource. Although their scheme improves the blocking probability of originating voice calls, it treats the originating calls and the handoff calls equally. Since terminating an ongoing call is more frustrating than blocking an originating call from a user's point of view, higher priority should be provided to the ongoing (handoff) calls. In (Wang, 2003), the authors have proposed an analytical model that
supports preemptive and priority reservation for handoff calls. Detailed performance analysis is also provided to give guidelines on how to configure system parameters to balance the blocking probability of originating calls and the forced termination probability of handoff calls. Multiple traffic with different QoS requirements has been discussed in (Xu, 2005). In order to support different types of traffic, a model that gives different priorities to different types of traffic has been designed. Their model allows the traffic with lower priority to be preempted by the traffic with higher priority, which can support DiffServ (Differentiated Services) in WMNs. Although all of the above resource management schemes achieve significant improvements in system performance, they focus on a single WMN, which may not efficiently support multiple traffic in an IHWMN. Compared to a single type of WMN, resource management in an IHWMN has to face more challenges due to the heterogeneity of the different types of WMNs. That is, different types of WMNs may have different resource management policies for the same type of traffic. The resource management scheme in (Park, 2003) treats real-time and non-real-time traffic differently in an integrated CDMA-WLAN network. For real-time traffic, vertical handoff is made as soon as possible to minimize the handoff delay. For non-real-time traffic, they consider that the amount of data being transmitted is more important than the delay. Therefore, the connection to the higher bandwidth network is kept as long as possible to maximize the throughput. In (Zhang, 2003), the authors have also proposed different vertical handoff policies for real-time and non-real-time traffic. Although all of the above schemes improve the system performance in certain respects, call-level performances such as the blocking probability of originating calls and the forced termination probability of handoff calls are not examined. Furthermore, in all of the above schemes, any type of traffic is switched to a higher bandwidth network when such a network becomes available. However, this policy may not be suitable for some delay-sensitive traffic because frequent handoffs may interrupt the ongoing communications. The goal of the proposed resource management scheme in (Liu, 2006) is to increase the user's data rate and decrease the blocking probability and the forced termination probability. A switch profit is used to encourage vertical handoff to a network that can offer more bandwidth. On the other hand, a handoff cost is used to prevent excessive vertical handoffs. The switch profit depends on the bandwidth gain obtained from the vertical handoff, while the handoff cost depends on the delay incurred by the vertical handoff. The simulation results show that their scheme can reduce the blocking probability and the forced termination probability. It also achieves better throughput and grade of service. Although the above schemes focus on resource management in IHWMNs, they only consider a single type of traffic. Therefore, they may not efficiently support multiple traffic in IHWMNs.
NETWORK SELECTION STRATEGIES AND RESOURCE MANAGEMENT SCHEMES

Cost-Function-Based Network Selection Strategies

System Model

As we mentioned before, most existing network selection strategies are user-centric and focus on individual users' needs. Our motivation is to design a network selection strategy from the system's perspective that can also meet certain individual users' needs. Before we discuss
how our proposed cost-function-based network selection (CFNS) strategy works, we briefly describe our system model. We consider an integrated heterogeneous wireless and mobile system having M different types of networks. We assume that the entire service area of the system is covered by network N1, which consists of many homogeneous cells and provides a low bandwidth service. Assume that network Ni (2 ≤ i ≤ M) is randomly distributed in the service area covered by network N1 and provides a higher bandwidth service than network N1. Network Ni (2 ≤ i ≤ M) has limited coverage, covering only some portion of the entire service area. For example, a cellular network N1 covers several WLANs (N2, ..., NM). For simplicity, we focus on one cell of network N1, called the marked cell, where some of the area is covered by several high bandwidth networks. Each cell of a higher bandwidth network Ni (2 ≤ i ≤ M) has an AP (access point), and each cell of network N1 has a BS (base station). We assume that each cell of network Ni (1 ≤ i ≤ M) has a circular cell shape with radius Ri. We denote the area covered by network Ni (2 ≤ i ≤ M) as area Ai (2 ≤ i ≤ M). In the overlapped areas, mobile users may have more than one connection option. We assume that each cell of network Ni (1 ≤ i ≤ M) has Bi bandwidth units. It is necessary to clarify that each bandwidth unit is a logical channel that can be allocated to a mobile user. We assume that mobile users are uniformly distributed in the service area. They move in all directions with equal probability. The moving speed V (a random variable) of a mobile user follows an arbitrary distribution with a mean value of E[V]. In the system, we assume that there are three types of calls, namely originating calls, horizontal handoff calls, and vertical handoff calls. An originating call is an initial call in the system, and a handoff call, either horizontal or vertical, is an ongoing call. When an active mobile user changes its connection from its current serving network Ni to network Nj (for any i, j), a handoff call (request) is generated in network Nj. If i = j, the handoff call is a horizontal handoff call. If i ≠ j, it is a vertical handoff call.
Cost-Function-Based Network Selection Strategy

When an originating call is generated, the proposed network selection strategy works as follows:

• If there is no free bandwidth unit, the originating call is blocked;
• If only one available network has free bandwidth units, the originating call is accepted by that network;
• If there is more than one available network having free bandwidth units, all these candidate networks are compared based on the network selection strategy and the originating call is accepted by the most desired network.
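As a rough, illustrative sketch (not part of the chapter's specification), this admission logic can be expressed as follows; the attribute free_units and the callback select_best are placeholder names introduced here, with select_best standing in for the network selection strategy (for CFNS, the cost function of Equation (1) introduced below).

```python
# Hypothetical sketch of the originating-call admission rules listed above.
# "free_units" and "select_best" are illustrative names, not from any standard API.

def admit_originating_call(available_networks, select_best):
    """Return the network that accepts the call, or None if the call is blocked."""
    candidates = [net for net in available_networks if net.free_units > 0]
    if not candidates:
        return None                 # no free bandwidth unit anywhere: call is blocked
    if len(candidates) == 1:
        return candidates[0]        # a single candidate: accept the call there
    return select_best(candidates)  # several candidates: pick the most desired network
```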
Since we focus on the network selection strategy in this chapter, horizontal handoff is handled in a traditional way, as in (Wang, 2003), and vertical handoff is handled as follows: when an active mobile user moves from an area covered by network Ni into an adjacent area covered by network Nj, it changes its connection from network Ni to network Nj if network Nj has a higher bandwidth than Ni and there are free bandwidth units in Nj. If the target area is not covered by network Ni, the mobile user has to change its connection to another available network. If there is no free bandwidth unit in the other available networks, the vertical handoff call is forcibly terminated. If there is more than one available network having free bandwidth units, the vertical handoff call is randomly accepted by any one of these networks.
Our proposed network selection strategy prefers an originating call to be accepted by a network with a low traffic load and a stronger received signal strength, which can achieve better traffic balance among different types of networks and good service quality. Consequently, we define a cost function that combines these two factors, traffic load and received signal strength. The cost of using network Ni for an originating call is defined as

Ci = wg · Gi + ws · Si, for i = 1, 2, ..., M,  (1)

where Gi is the complement of the normalized utilization of network Ni, Si is the relative received signal strength from network Ni, and wg and ws are the weights that express the preferences given to Gi and Si, with 0 ≤ wg, ws ≤ 1. The constraint between wg and ws is

wg + ws = 1.  (2)
The complement of the normalized utilization, Gi, is defined by

Gi = Bif / Bi, for i = 1, 2, ..., M,  (3)
where Bif is the number of available bandwidth units of network Ni and Bi is the total number of bandwidth units of network Ni. In general, a stronger received signal strength indicates better signal quality. Therefore, an originating call prefers to be accepted by a network with a higher received signal strength. However, it is difficult to compare received signal strengths among different types of networks because they have different maximum transmission powers and receiver thresholds. As a result, we propose to use a relative received signal strength to compare different types of WMNs. Si in Equation (1) is therefore defined by

Si = (Pic − Pith) / (Pimax − Pith), for i = 1, 2, ..., M,  (4)
where Pic is the current received signal strength from network Ni, Pith is the receiver threshold of network Ni, and Pimax is the maximum transmitted signal strength of network Ni. Note that we only consider path loss in the propagation model. Consequently, the received signal strength (in decibels) from network Ni is given by

Pic = Pimax − 10γ log(ri),  (5)

where ri is the distance between the mobile user and the BS (or AP) of network Ni, and γ is the fading factor, which is generally in the range [2, 6]. Therefore, the receiver threshold of network Ni is given by

Pith = Pimax − 10γ log(Ri).  (6)
The relative received signal strength from network Ni can then be rewritten as

Si = 1 − log(ri) / log(Ri), for i = 1, 2, ..., M.  (7)
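For completeness (this step is not spelled out above), substituting Equations (5) and (6) into Equation (4), and writing Pic, Pith, and Pimax as $P_i^{c}$, $P_i^{th}$, and $P_i^{max}$, gives Equation (7):

\[
S_i \;=\; \frac{P_i^{c} - P_i^{th}}{P_i^{max} - P_i^{th}}
   \;=\; \frac{\bigl(P_i^{max} - 10\gamma\log r_i\bigr) - \bigl(P_i^{max} - 10\gamma\log R_i\bigr)}{P_i^{max} - \bigl(P_i^{max} - 10\gamma\log R_i\bigr)}
   \;=\; \frac{10\gamma\bigl(\log R_i - \log r_i\bigr)}{10\gamma\log R_i}
   \;=\; 1 - \frac{\log r_i}{\log R_i}.
\]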
If an originating call has more than one connection option, the costs for all candidate networks are calculated using the cost function of Equation (1). The originating call is accepted by the network that has the largest cost, which indicates the "best" network. If there is more than one "best" network, the originating call is randomly accepted by any one of these "best" networks. In the following, we discuss two special cases of the proposed CFNS strategy, i.e., wg = 1 and wg = 0. When wg = 1, the cost function of Equation (1) only considers Gi. This gives rise to another network selection strategy, which we call the traffic balance-based network selection (TBNS) strategy. This network selection strategy tries to achieve the best traffic balance among different types of networks. In this case, when an originating call is generated and there is more than one network having free bandwidth units, the originating call is accepted by the network that has the largest Gi. That is, the call is accepted by the network that has more free bandwidth units. In the second case, when wg = 0, the proposed CFNS strategy gives rise to another network selection strategy, i.e., the received signal strength-based network selection (RSNS) strategy. In this case, the only concern in selecting a desired network is the received signal quality: when an originating call is generated in an area covered by more than one network, the call is accepted by the network that has the largest Si. Although our cost function in this chapter consists of only two factors, traffic load and received signal strength, it is easy to extend it to involve more factors, such as the access fee of using network Ni, in which case it can be rewritten as

Ci = wg · Gi + ws · Si + wφ · Φi,  (8)
and the access fee term Φi is given by

Φi = 1 − φi / φmax,  (9)
where φmax is the highest access fee that the mobile user is willing to pay and φi is the actual access fee for using network Ni. The mobile user does not connect to a network that charges more than φmax even if the network has free bandwidth units. wφ (0 ≤ wφ ≤ 1) is the weight for the access fee, with the constraint

wg + ws + wφ = 1.  (10)
Therefore, a network with a cheaper price has a larger cost, and the mobile user is more likely to be accepted by that network. In a similar way, other factors can also be included in the cost function after proper normalization.
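To make the strategy concrete, the following is a minimal sketch (our illustration, not the authors' implementation) of evaluating the cost of Equation (1) with Gi from Equation (3) and Si from Equation (7), and selecting the network with the largest cost; the class and attribute names are assumptions made for this example.

```python
import math

# Illustrative sketch of the CFNS strategy of Equation (1).
# The Network fields below are assumed names, not part of any standard.

class Network:
    def __init__(self, name, free_units, total_units, distance, radius):
        self.name = name
        self.free_units = free_units    # B_i^f: free bandwidth units
        self.total_units = total_units  # B_i: total bandwidth units
        self.distance = distance        # r_i: distance from the user to the BS/AP
        self.radius = radius            # R_i: cell radius

def cost(net, w_g, w_s):
    """C_i = w_g * G_i + w_s * S_i, with G_i and S_i from Equations (3) and (7)."""
    g = net.free_units / net.total_units                     # Equation (3)
    s = 1.0 - math.log(net.distance) / math.log(net.radius)  # Equation (7)
    return w_g * g + w_s * s

def cfns_select(candidates, w_g=0.5, w_s=0.5):
    """Accept the originating call in the candidate network with the largest cost."""
    return max(candidates, key=lambda net: cost(net, w_g, w_s))

# Example: a cellular cell and a WLAN cell both covering the user.
cellular = Network("cellular", free_units=10, total_units=50, distance=800.0, radius=1000.0)
wlan = Network("wlan", free_units=2, total_units=10, distance=40.0, radius=100.0)
best = cfns_select([cellular, wlan])
```

Setting w_g = 1 or w_g = 0 reduces this sketch to the TBNS and RSNS special cases, and additional normalized factors, such as the access fee term of Equation (9), could be added to the weighted sum in the same way.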
Numerical Results

We apply a Markov model to analyze the system performance of the proposed CFNS strategy. Due to space limitations, we do not provide the details of the performance analysis and results, which can be found in (Shen, 2007; Shen, 2008). In the following, we give some numerical results for the system performance of the proposed CFNS strategy. Comparing the major system performance measures, the CFNS strategy can achieve a tradeoff between the blocking probability of originating calls and the average received signal strength, which are very important for both the system and the users. This consideration of both system performance and users' needs is the major difference between our strategy and most existing strategies.
RESOURCE MANAGEMENT SCHEMES FOR MULTIPLE TRAFFIC

System Model

Since the IHWMN is a new concept, there has not been much research on resource management schemes to support multiple traffic in IHWMNs. In this section, we propose a novel resource management scheme to support real-time and non-real-time traffic in an IHWMN. The fairness issue between real-time and non-real-time traffic is also addressed to avoid unbalanced QoS provision to non-real-time traffic. The system model used in this section is similar to that of the last section, except that we consider two types of traffic: real-time and non-real-time. In this chapter, voice traffic is used as the real-time traffic and data traffic as the non-real-time traffic. Each bandwidth unit in different networks provides a different amount of bandwidth; the bandwidth provision in network N2 is much larger than that in network N1. In the following, we describe our scheme, starting with the handling of voice traffic.
Preemption-Based Resource Management Scheme

An ongoing voice call is forcibly terminated by a failed handoff, since voice traffic is delay-sensitive. Therefore, a resource management scheme needs to reduce the number of handoffs. On the other hand, a voice call only needs a low bandwidth. The call holding time of a voice call does not change even if a higher bandwidth channel is allocated. In other words, resource utilization is not efficient if a higher bandwidth channel is allocated to a voice call. Therefore, we assume that a voice call is accepted only by network N1, to prevent vertical handoff and the occupation of a higher bandwidth channel. As a result, there are only two types of voice call arrivals in our system, i.e., originating voice calls and horizontal handoff voice calls. A horizontal handoff voice request is generated in the marked cell when an active voice call user moves into the marked cell from a neighboring cell of network N1. When an originating voice request or horizontal handoff voice request is generated in the marked cell of network N1, it is accepted if there are free channels in the marked cell of network N1. We assume that voice traffic has a higher priority than data traffic, since voice traffic is delay-sensitive. Therefore, an incoming voice call, either originating or horizontal handoff, can preempt an ongoing data call in the marked cell of network N1 if there is no free bandwidth unit upon its arrival. We adopt a queue to hold the preempted data calls in the marked cell of network N1. Two concerns may arise in such a preemption-based
resource management scheme. First of all, excessive preemption easily results in unfairness between voice traffic and data traffic, which must be avoided. The other concern is the priority of horizontal handoff voice calls over originating voice calls. From a user's point of view, terminating a horizontal handoff voice call is more frustrating than blocking an originating voice call. Therefore, higher priority should be provided to horizontal handoff voice calls. In some channel reservation schemes, certain logical channels are exclusively reserved for handoff calls to provide such priority. In our scheme, however, unlike a reservation scheme, the originating and horizontal handoff voice calls completely share the resources. Therefore, we have to treat them differently during the preemptions in order to provide higher priority to horizontal handoff voice calls. In the following, we describe how the preemption works to differentiate the originating and horizontal handoff voice calls. Firstly, we do not want to terminate an ongoing data call in order to accept an incoming voice call. Therefore, the preemption fails if the queue of the marked cell of network N1 is full. We further propose two thresholds, VHmax and VOmax, to prevent excessive preemption. Both thresholds are defined as the maximum capacities of ongoing voice calls when an incoming voice call tries to make a preemption. VHmax is used for the preemption of horizontal handoff voice calls, and VOmax is used for the preemption of originating voice calls. In the following, we introduce how the preemption works with VHmax and VOmax. VHmax is a real value and can be represented as

VHmax = ⌊VHmax⌋ + (VHmax − ⌊VHmax⌋) = ⌊VHmax⌋ + αH,  (11)
where ⌊VHmax⌋ is the integral part of VHmax and αH is the decimal part of VHmax. When an incoming horizontal handoff voice call tries to make a preemption, the result of the preemption depends on the value of VHmax and the state of the marked cell, as follows:

• If the number of current ongoing voice calls is less than ⌊VHmax⌋, the incoming horizontal handoff call can successfully preempt ongoing data calls if there are ongoing data calls and the queue is not full;
• If the number of current ongoing voice calls is equal to ⌊VHmax⌋, the incoming horizontal handoff call can successfully preempt ongoing data calls only with probability αH if there are ongoing data calls and the queue is not full. In other words, the preemption fails with probability 1 − αH;
• If the number of current ongoing voice calls is larger than ⌊VHmax⌋, the preemption fails even if there are ongoing data calls and the queue is not full in the marked cell of network N1.
When implementing the above preemption scheme, the preemption succeeds (or fails) if the number of current ongoing voice calls is less (or larger) than ⌊VHmax⌋. If the number of current ongoing voice calls is equal to ⌊VHmax⌋, a random number is generated uniformly in the range [0, 1). If the generated random number is less than αH, the preemption succeeds. Otherwise, the preemption fails and the incoming horizontal handoff voice call is forcibly terminated. Similar to Equation (11), VOmax can be represented as
VOmax = ⌊VOmax⌋ + (VOmax − ⌊VOmax⌋) = ⌊VOmax⌋ + αO,  (12)

where ⌊VOmax⌋ is the integral part of VOmax and αO is the decimal part of VOmax. The preemption of originating voice calls works in the same way as the preemption of horizontal handoff voice calls, except that VOmax is used instead of VHmax. If the preemption fails, the incoming originating voice call is blocked. It is obvious that the thresholds VOmax and VHmax place a certain limitation on the preemption.

Unlike voice traffic, data traffic is delay-tolerant and benefits from a higher bandwidth channel. That is, a higher bandwidth channel can improve the throughput of a data call and reduce its holding time. Therefore, we assume that an originating data call always tries the highest bandwidth network first if more than one network is available. When an originating data call is generated in an overlapped area, it tries network Ni (2 ≤ i ≤ M) first. If there is no free channel in network Ni, the originating data call is put into the queue of network Ni (2 ≤ i ≤ M). When an originating data call is generated in an area that is covered only by network N1, it is accepted by network N1 if there are free channels in the marked cell of network N1. Otherwise, it is put into the queue of network N1 if the queue is not full, or terminated if the queue is full. If an active data call user moves into the marked cell from a neighboring cell of network N1, a horizontal handoff data request is generated if only network N1 is available. The horizontal handoff data call is accepted if there are free channels in the marked cell of network N1. Otherwise, it is put into the queue of the marked cell if the queue is not full, or terminated if the queue is full. A data call waiting in the queue of a neighboring cell of network N1 also generates a horizontal handoff data request in the marked cell of network N1 when its mobile user moves into the marked cell. If an active data call user in a singly covered area moves into an area covered by more than one network, a DVH (downward vertical handoff) request is generated in the higher bandwidth network Ni. A data call in the queue of network N1 also generates a DVH request in network Ni (2 ≤ i ≤ M) when its mobile user moves into a doubly covered area. If an active data call user in a higher bandwidth network Ni (2 ≤ i ≤ M) moves out of its coverage before call completion, it generates a UVH (upward vertical handoff) request in an available network Nj (i ≠ j). A data call in the queue of network Ni (2 ≤ i ≤ M) also generates a UVH request in an available network Nj when the mobile user moves out of the coverage of Ni. In the following, we define λODA as the average arrival rate of originating data calls in different areas, λOV as the average arrival rate of originating voice calls in the marked cell, and λHHV and λHHD as the average arrival rates of horizontal handoff voice and data calls, respectively. λDVH (λUVH) is the average arrival rate of downward (upward) vertical handoff data calls.
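The voice-call preemption test described above, based on the thresholds of Equations (11) and (12), can be sketched as follows (our illustration only; the parameter names are placeholders).

```python
import math
import random

# Illustrative sketch of the preemption test for an incoming voice call. An
# incoming voice call (horizontal handoff or originating) may preempt an
# ongoing data call only if there are ongoing data calls, the queue holding
# preempted data calls is not full, and the number of ongoing voice calls is
# below the relevant threshold (V_H^max for handoff calls, V_O^max for
# originating calls).

def preemption_succeeds(ongoing_voice, ongoing_data, queue_full, v_max):
    if ongoing_data == 0 or queue_full:
        return False
    integral = math.floor(v_max)        # integral part of V^max
    alpha = v_max - integral            # decimal part of V^max
    if ongoing_voice < integral:
        return True
    if ongoing_voice == integral:
        return random.random() < alpha  # succeeds with probability alpha
    return False                        # ongoing voice calls exceed the threshold
```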
Fairness Between Voice and Data Traffic

The aim of WMNs is to provide desired services to mobile users, which can be measured using QoS requirements. Two main QoS requirements of voice traffic are the blocking probability BOV of originating calls and the forced termination probability BHHV of handoff calls. For data traffic, the main QoS requirements include the blocking probability of originating data calls and the average delay. In our system, the QoS requirements of voice traffic are our main concern, but we also do not want to ignore the performance of data traffic. That is, the system provides the guaranteed BOV and BHHV to voice traffic and best-effort service to data traffic. In other words, BOV and BHHV must be less than certain thresholds. Therefore, we
define two probability thresholds, BOVth and BHHVth, where the blocking probability of originating voice calls must not be larger than BOVth and the forced termination probability of handoff voice calls must not be larger than BHHVth. Intuitively, by increasing the values of VOmax and VHmax, BOV and BHHV decrease, since the originating voice calls and the horizontal handoff voice calls obtain more priority. Since the resources are completely shared by voice and data traffic, the performance of data traffic deteriorates as the performance of voice traffic improves. If the QoS requirements of voice traffic have already been met, any further increase of VOmax and VHmax imposes unfairness on data traffic. In order to provide best-effort service to data traffic and guaranteed QoS to voice traffic, we have to find the minimum values of VOmax and VHmax that satisfy the QoS requirements of voice traffic. It is obvious that these minimum values of VOmax and VHmax result in the best performance for data traffic. A bisection algorithm is used to find the minimum values of VOmax and VHmax.

Figure 2. Performances of voice traffic with different VOmax
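As an illustration of how such a bisection search might look (the evaluation of the blocking or forced termination probability would come from the Markov model, which is only a placeholder here):

```python
# Illustrative bisection search for the minimum threshold whose resulting
# blocking / forced-termination probability meets the QoS target.
# probability_of(v_max) is a placeholder for the Markov-model evaluation and is
# assumed to be non-increasing in v_max (a larger threshold gives voice calls
# more preemption opportunities, lowering their blocking / termination probability).

def minimum_threshold(probability_of, target, lo, hi, tol=1e-2):
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if probability_of(mid) <= target:
            hi = mid   # mid already satisfies the QoS target: try a smaller threshold
        else:
            lo = mid   # mid violates the target: a larger threshold is needed
    return hi
```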
Numerical Results

Due to space limitations, we do not provide the details of how the performance metrics are obtained through Markov methods. Figure 2, Figure 3, Figure 4, and Figure 5 give numerical results showing the performance of our proposed schemes.

Figure 2 shows the blocking probability of originating voice calls and the forced termination probability of horizontal handoff voice calls with different VOmax and VHmax. The offered voice traffic load is fixed. With the increase of VOmax, the blocking probability of originating voice calls becomes smaller. For fixed VOmax, the forced termination probability of horizontal handoff voice calls improves significantly when VHmax increases. It is obvious that larger VOmax and VHmax can provide better performance for voice traffic. However, larger VOmax and VHmax result in the deterioration of data traffic: Figure 3 shows that the average delay becomes longer when VOmax and VHmax increase. Therefore, we have to find suitable VOmax and VHmax to provide the best service for data traffic.

Figure 3. Average delay of data calls with different VOmax

In the following, we examine the system performance under our optimum VOmax and VHmax. The total offered traffic load is fixed while the ratio between voice and data traffic changes. The QoS requirements of voice traffic are set to BOVth = 5% and BHHVth = 2%. In order to compare the performance of voice traffic using the optimum set of VOmax and VHmax with other sets, we define three sets of VOmax and VHmax: set 1 = {12.6, 14.6}, set 2 = {16.5, 19.1}, and set 3 = {20.2, 21.2}. Figure 4 shows the performance of voice traffic using the different sets of VOmax and VHmax. Only Set 3 and the optimum set can provide the guaranteed BOVth and BHHVth under any offered voice traffic load. When using the optimum set, the blocking probability of originating voice calls and the forced termination probability of horizontal handoff calls are not larger than the guaranteed values, i.e., BOVth and BHHVth. Set 2 can provide the guaranteed BOVth and BHHVth only at low traffic load. The QoS requirements of voice traffic cannot be met by Set 1, since the forced termination probability of horizontal handoff voice calls is larger than BHHVth under any offered voice traffic load. Figure 5 shows the average delay of data traffic. Although Set 3 can provide better service
for voice traffic than the other sets, it achieves the worst average delay for data traffic. Set 1 provides the best service for data traffic, as shown in Figure 5. However, it cannot provide satisfactory service for voice traffic. Compared to Set 2, the optimum set achieves better data traffic performance while both of them satisfy the QoS requirements of voice traffic. Therefore, the optimum set can provide the best service for data traffic while satisfying the QoS requirements of voice traffic.

Figure 4. Performances of voice traffic with different offered voice traffic load

Figure 5. Average delay of data calls with different offered voice traffic load
FUTURE TRENDS

In the next generation of wireless and mobile networks (beyond 3G, or 4G), cellular networks will still play a major role due to their dominant market share and good service quality. Other types of networks, such as WiMAX and WLAN, are also witnessing fast deployment. However, no single type of WMN can always provide the "best" service for every mobile user everywhere. Network integration is a promising way to offer the "best" service. Based on the type of user's service request and network availability, the mobile user can obtain the "best" service from IHWMNs. However, such integration still faces several challenges:

• Adjustment of bandwidth allocation in IHWMNs: In an IHWMN, different types of WMN provide different amounts of bandwidth. On the other hand, different types of traffic have different bandwidth requirements. Therefore, users may experience unstable QoS when vertical handoffs happen frequently. Bandwidth adjustment or smoothing algorithms are required to make the transition smooth and achieve more stable QoS.
• Application of bandwidth splitting in IHWMNs: Some applications, like Internet TV, have very high bandwidth requirements. As a result, a single type of WMN may not support such high bandwidth applications very well. Bandwidth splitting is an approach to solve this problem, where the whole bandwidth requirement is divided into several parts and different parts are serviced by different types of WMNs. However, such a splitting approach brings many issues, such as bandwidth splitting strategies and synchronization among different networks. Efficient algorithms are required to provide satisfying services through multiple networks at the same time.
• Adaptive bandwidth allocation in IHWMNs: In modern wireless and mobile systems, adaptive bandwidth allocation can be applied to accept more mobile users when the incoming traffic becomes heavy. However, it becomes more difficult when applied to support multiple services in IHWMNs. The system needs to decide whether to make a vertical handoff that may cause adaptive bandwidth reallocation, or to stay within the current serving network.
• Charging model and its effect: Different types of WMNs have different charging models. For example, some networks charge an access fee based on the amount of traffic, while others have a monthly charging plan. Such diversity of charging models will affect the user's preference. An approach that combines charging models and resource management schemes is urgently required.
CONCLUSION

In this chapter, we have reviewed the current research on integrated heterogeneous wireless and mobile networks. We have also proposed network selection strategies and resource management schemes for IHWMNs and analyzed their system performance. Unlike most existing network selection strategies, which are user-centric, our proposed CFNS (cost-function-based network selection) strategy is designed from the system's perspective and also considers users' needs. The numerical results showed that the proposed CFNS strategy can achieve a tradeoff between the blocking probability of originating calls and the average received signal strength. We also proposed a preemption-based resource management scheme to support voice and data traffic in IHWMNs, which takes advantage of the heterogeneity of traffic and networks and the moving nature of mobile users. In the proposed preemption scheme, two thresholds were set to differentiate the originating and horizontal handoff voice calls. In order to provide the best service for data traffic and guaranteed QoS for voice traffic, a bisection algorithm was used to find suitable thresholds. The numerical results showed that the proposed scheme can provide best-effort service to data traffic while satisfying the QoS requirements of voice traffic. Finally, we discussed the open issues for IHWMNs. We believe that the research topics and analytic methods presented in our work will contribute to the research and development of future IHWMNs.
REFERENCES

Agrawal, D. P., & Zeng, Q.-A. (2006). Introduction to wireless and mobile systems (2nd ed.). Florence, KY: Thomson.

Akyildiz, I., Mohanty, S., & Xie, J. (2005). A ubiquitous mobile communication architecture for next-generation heterogeneous wireless systems. IEEE Radio Communications, 43(6), 29–36. doi:10.1109/MCOM.2005.1452832

Chen, W., Liu, J., & Huang, H. (2004). An adaptive scheme for vertical handoff in wireless overlay networks. In IEEE International Conference on Parallel and Distributed Systems (ICPADS) (pp. 541-548). Washington, DC: IEEE.

3GPP TS 23.234 V7.5.0 (2007). 3GPP system to WLAN interworking, 3GPP Specification. Retrieved May 1, 2008, from http://www.3gpp.org

Liu, X., Li, V., & Zhang, P. (2006). Joint radio resource management through vertical handoffs in 4G networks. In IEEE GLOBECOM (pp. 1-5). Washington, DC: IEEE.

McNair, J., & Fang, Z. (2004). Vertical handoffs in fourth-generation multinetwork environments. IEEE Wireless Communications, 11(3), 8–15. doi:10.1109/MWC.2004.1308935

Nam, M., Choi, N., Seok, Y., & Choi, Y. (2004). WISE: Energy-efficient interface selection on vertical handoff between 3G networks and WLANs. In IEEE PIMRC 2004, 1 (pp. 692-698). Washington, DC: IEEE.

Park, H.-S., Yoon, S.-H., Kim, T.-Y., Park, J.-S., Do, M., & Lee, J.-Y. (2003). Vertical handoff procedure and algorithm between IEEE 802.11 WLAN and CDMA cellular network (LNCS, pp. 103-112). Berlin: Springer.

Pavlidou, F. N. (1994). Two-dimensional traffic models for cellular mobile systems. IEEE Transactions on Communications, 42(234), 1505–1511. doi:10.1109/TCOMM.1994.582831

Salkintzis, A. K. (2004). Interworking techniques and architectures for WLAN-3G integration toward 4G mobile data networks. IEEE Wireless Communications, 11(3), 50–61. doi:10.1109/MWC.2004.1308950

Salkintzis, A. K., Fords, C., & Pazhyannur, R. (2002). WLAN-GPRS integration for next generation mobile data networks. IEEE Wireless Communications, 9(5), 112–124. doi:10.1109/MWC.2002.1043861

Shen, W., & Zeng, Q.-A. (2007). Cost-function-based network selection strategy in heterogeneous wireless networks. In IEEE International Symposium on Ubiquitous Computing and Intelligence (UCI-07). Washington, DC: IEEE.

Shen, W., & Zeng, Q.-A. (2008). Cost-function-based network selection strategy in integrated heterogeneous wireless and mobile networks. To appear in IEEE Transactions on Vehicular Technology.

Song, Q., & Jamalipour, A. (2005). Network selection in an integrated wireless LAN and UMTS environment using mathematical modeling and computing techniques. IEEE Wireless Communications, 12(3), 42–48. doi:10.1109/MWC.2005.1452853

Stemm, M., & Katz, R. H. (1998). Vertical handoffs in wireless overlay networks. ACM Mobile Networking (MONET), Special Issue on Mobile Networking in the Internet, 3(4), 335–350. New York: ACM.

Wang, H., Katz, R., & Giese, J. (1999). Policy-enabled handoffs across heterogeneous wireless networks. In Mobile Computing Systems and Applications (WMCSA) (pp. 51-60).

Wang, J., Zeng, Q.-A., & Agrawal, D. P. (2003). Performance analysis of a preemptive and priority reservation handoff scheme for integrated service-based wireless mobile networks. IEEE Transactions on Mobile Computing, 2(1), 65–75. doi:10.1109/TMC.2003.1195152

Xu, Y., Liu, H., & Zeng, Q.-A. (2005). Resource management and QoS control in multiple traffic wireless and mobile Internet systems. Wiley's Journal of Wireless Communications and Mobile Computing (WCMC), 2(1), 971–982. doi:10.1002/wcm.360

Zhang, Q., Guo, C., Guo, Z., & Zhu, W. (2003). Efficient mobility management for vertical handoff between WWAN and WLAN. IEEE Communications Magazine, 41(11), 102–108. doi:10.1109/MCOM.2003.1244929

Zhu, F., & McNair, J. (2004). Optimizations for vertical handoff decision algorithms. In IEEE Wireless Communications and Networking Conference (WCNC) (pp. 867-872).
KEY TERMS AND DEFINITIONS

Fairness Among Different Types of Traffic: Due to the limitations of some resource management schemes, some traffic may be allocated too much resource while other traffic achieves very poor performance.

Integrated Heterogeneous Wireless and Mobile Networks: A new network architecture that combines different types of wireless and mobile networks to provide comprehensive services.

Multi-Mode Terminal: A terminal equipped with multiple network interfaces.

Multiple Traffic: The combination of different types of traffic, e.g., voice and data traffic in this chapter.

Network Selection Strategy: A strategy to determine which network should be connected to in an IHWMN.

Preemption: A resource allocation scheme that preempts ongoing lower priority traffic when higher priority traffic arrives and there is not enough resource in the system.

Resource Management: The allocation of radio resources, such as channels and bandwidth, to different types of traffic. Optimization of resource management can achieve better system performance.

Vertical Handoff: A switching process that changes the connection from one network to another, different type of network in integrated heterogeneous wireless and mobile networks.
Section 9
Fault Tolerance and QoS
Chapter 32
Scalable Internet Architecture Supporting Quality of Service (QoS)
Priyadarsi Nanda, University of Technology, Sydney (UTS), Australia
Xiangjian He, University of Technology, Sydney (UTS), Australia
ABSTRACT

The evolution of the Internet and its successful technologies has brought tremendous growth in business, education, research, etc. over the last four decades. With the dramatic advances in multimedia technologies and the increasing popularity of real-time applications, Quality of Service (QoS) support in the Internet has recently been in great demand. With the deployment of such applications over the Internet in recent years, and the trend towards managing them efficiently with a desired QoS in mind, researchers have been pursuing a major shift from the Internet's Best Effort (BE) model to a service-oriented model. Such efforts have resulted in Integrated Services (IntServ), Differentiated Services (DiffServ), Multi Protocol Label Switching (MPLS), Policy Based Networking (PBN), and many more technologies. The reality, however, is that such models have been implemented only in certain areas of the Internet, not everywhere, and many of them also face scalability problems when dealing with the huge number of traffic flows with varied priority levels in the Internet. As a result, an architecture that addresses the scalability problem while satisfying end-to-end QoS still remains a big issue for the Internet. In this chapter the authors propose a policy-based architecture which they believe can achieve scalability while offering end-to-end QoS in the Internet.
INTRODUCTION

The concept of Policy Based Networking has long been used by networks for controlling traffic flows and allocating network resources to various applications. A network policy defines how traffic, users, and/or applications should be treated differently within the network based on QoS parameters, and may include policy statements. In most cases, such statements are defined and managed manually by the
network administrator based upon the Service Level Agreements (SLA) between the network and its customers. Management of network devices for policy conditions to be satisfied is usually performed by a set of actions performed on various devices. For example, Internet Service Providers (ISPs) rely on network operators to monitor their networks and reconfigure the routers when necessary. Such actions may work well within the ISPs own network, but when considered across the Internet, may have serious effect in balancing traffic across many ISPs on an end-to-end basis. Hence, managing traffic over multiple Autonomous System (AS) domains requires an obvious need for change in the architecture for the current Internet and the way they function. Traffic control and policy management between these AS domains also encounter an additional set of challenges that are not present in the intra-domain case, including trust relationship between different competing ISPs. We demonstrated the architecture based on these heterogeneous policy issues and identified various architectural components which may contribute significantly towards simplification of traffic management over the Internet. Validity of the architecture and its deployment in the Internet heavily depends on the following factors: 1. 2. 3. 4. 5.
Service Level Agreements (SLAs) Autonomous Systems (ASs) relationship Traffic engineering and Internet QoS routing Internet wide resource and flow management Device configuration in support for QoS
The architecture takes the above-mentioned factors into account in an integrated way in order to support end-to-end QoS over the Internet. These factors, and the design objectives of our architecture, are discussed throughout this chapter. We first discuss the design objectives of the architecture. In section two, we introduce background on Internet topology and hierarchy, identify the relationships that exist between those hierarchies, and discuss how knowledge of the relationships between Autonomous Systems affects key design decisions. Section three provides an overview of our architecture with a brief description of its components. Section four summarizes the key features of the architecture and concludes the chapter.
DESIGN OBJECTIVES
The Service Level Agreement (SLA) is one of the first requirements for implementing a policy based network architecture in the Internet. With the growing demand for better QoS, AS domains and network operators need additional mechanisms to enforce strong SLAs at the various service boundaries. Hence, in order to achieve end-to-end QoS over the Internet, SLAs must be extended beyond the standard customer-provider relationships used in the past, and the architecture should incorporate the components needed to build such SLAs dynamically, spanning the different ASs on the end-to-end path. The current Internet is a collection of interconnected ASs, and the connections between ASs are strongly influenced by the relationships under which that connectivity is formed. Fundamentally, the relationships between ASs may be categorized as customer-provider, peer-to-peer, and sibling (Gao, 2001), and they are the driving force behind the economic benefits of individual domains. Most ASs try to balance load across particular links to their neighbors and peers by using traffic engineering approaches such as MPLS and ATM, policy routing decisions supported by the Border Gateway Protocol (BGP), or a combination of traffic engineering and Internet routing. However, there is no standard mechanism that individual networks can apply universally. One alternative for supporting better end-to-end QoS over the Internet is to deploy overlay networks. Such approaches are being used by various network service providers to support new applications and protocols without any change to the underlying network layer (Li & Mohapatra, 2004). Because overlay traffic uses application layer profiles, overlays can effectively use the Internet as a low-level infrastructure to provide high-level services to the end users of various applications, provided that the lower layers support adequate QoS mechanisms. Traffic engineering (Awduche, Chiu, Elwalid, Widjaja & Xiao, 2002) is crucial for any operational network, and the architecture reflects this by using BGP based parameter tuning in support of end-to-end QoS; we discuss various aspects of traffic engineering and its impact on the architecture in this chapter. Managing resources for QoS flows plays an important role in supporting multiple users with multiple service requirements and can be seen as a direct outcome of traffic engineering in the Internet. Because ASs act upon their own policy rules defined by their network administrators, achieving network-wide traffic engineering and resource management is difficult, though not impossible. Our proposed architecture is based upon a hierarchical resource management scheme (Simmonds & Nanda, 2002) which distributes the control of network functions across three levels. Policy Based Networking has been a major area of research in the past few years and continues to draw attention from researchers, network vendors, and service providers because of the increasing number of network services in the Internet, and the need for a policy based network architecture should be considered even more actively for current and future Internet use. Earlier work on policy based network management within the IETF resulted in two standard working groups: the Resource Allocation Protocol (RAP) working group and the policy framework working group (Yavatkar, Pendarakis & Guerin, 2000; Salsano, 2001). These standards describe policy mechanisms (mainly policy core schemas and QoS schemas that are currently used in policy servers, specifically to manage intra-domain traffic) but do not state how and where they should be applied within the current structure of the Internet. The following key components are addressed within the architecture using a bottom up approach:
1. Service differentiation mechanism
2. Network-wide traffic engineering
3. Resource availability to support QoS
4. Routing architecture that dynamically reflects any policy changes
5. Inter-domain, intra-domain and device level policy co-ordination mechanisms
In the bottom up approach, the network is made ready for policy compliance by configuring device level policies first and storing them in a database. A mapping function can then use the high level policies, based on business relationships, to pick the right devices, traffic routes, and other associated resources needed to satisfy the QoS of various services. On this basis the architecture is broadly divided into a three layer model, presented later in this chapter. The architecture is able to control network devices with policy rules both within and between AS domains dynamically, as opposed to the static procedures currently deployed in the Internet, and it works together with underlying technologies such as Diffserv, Intserv, and other lower layer support to achieve performance improvements over the Internet. Scalability is one of the important features of the architecture; to address it, each domain manages its so-called policy devices hierarchically in three groups:
1. The services supported within the network, with their related traffic characteristics such as bandwidth, delay, loss, and jitter for any traffic belonging to each specific service class
2. The devices (such as routers, switches, and servers) falling within each service class in support of QoS
3. The management of all the devices in the second group, by fine tuning traffic engineering through the proper selection of protocols and their related parameters
Currently, most networks with policy based management support QoS guaranteed services for intra-domain traffic only. In this architecture, however, we consider support for inter-domain QoS (Quoitin & Bonaventure, 2005; Agarwal, Chuah & Katz, 2003) under the assumption that QoS inside each network is still met. Our aim is therefore to present the design of a network architecture based on the policies negotiated between customer and service provider through a direct SLA, and to define a policy negotiation/coordination mechanism between ASs, because without such negotiation an end-to-end QoS model is difficult to realize, particularly at the service level. In this way the architecture allows network administrators to automate many labor intensive tasks and hence improve the overall QoS related performance of various services. The network architecture is designed to achieve the following objectives when deployed across the Internet:
• Scalability: Considering the intrinsic characteristics of various traffic types and their QoS requirements in the Internet, the architecture can incrementally scale to support a large number of users with heavy traffic volumes and real-time performance requirements. It also manages control plane activities by mapping them onto a three layer hierarchy with a clear definition of the communication between the layers.
• Efficient use of network resources: The architecture allocates resources at three levels: device level, network level, and application level. Interactions between these three levels are controlled through dedicated resource managers and their associated protocols. One key component of this resource management strategy is a resource availability mechanism that uses BGP community attribute announcements carrying specific values for the resources available within an AS domain (a sketch of one possible encoding appears at the end of this subsection).
• Provisioning QoS parameters between end nodes: The architecture does not restrict itself to specific technologies such as Intserv, Diffserv, or MPLS. It does, however, recommend using aggregated resource allocation strategies along the source and destination networks in the Internet. Such an approach simplifies overall network management, achieves scalability as the user base grows, and improves co-ordination between the control and data planes.
• Support for standard architectural components: In order to support optimal QoS performance for various Internet applications, the architecture is built upon key functions
such as traffic engineering, inter-domain routing, resource management, and Service Level Agreements. An integrated framework is therefore presented, without which end-to-end QoS would be difficult to achieve in the Internet. The resource management mechanism is implemented through proper co-ordination of service parameters both within an AS and between neighboring ASs in a hierarchical manner. The architecture also ensures that there are sufficient resources available on both intra- and inter-domain links to carry a QoS aware application before admitting its flow into the network. This strategy controls factors such as maximum loss rate, delay, and jitter, and decides which QoS flows to admit and which to deny, yielding a better QoS model for the Internet. We define the policy and trust issues between AS domains based on their connectivity in the Internet, and we investigate the effect of such policies on the other components of the architecture. Policies are central wherever two or more entities offer various levels of service to each other, based on the Service Level Agreements (SLAs) between them. The current Internet comprises groups of ASs (ISPs) placed in different tiers, with connectivity between these tiers provided through Internet Exchange Points (IXPs) (Huston, n.d.). A key concern about connectivity among the tiers is the kind of relationship each AS holds with its neighbors and peers. The architecture therefore considers these relationships between ASs and investigates their effect on its various components. The following section presents AS relationships along with the AS hierarchy in the Internet.
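The chapter does not fix a concrete encoding for the community values that announce resource availability, so the sketch below shows one way such values could be packed and read back. The 100-Mbps granularity, the AS number, and the helper names are illustrative assumptions only, not part of the proposed architecture.

```python
# Sketch: encoding coarse resource availability in a BGP community value.
# The 100-Mbps granularity, the AS number, and the helper names are
# illustrative assumptions; the chapter does not prescribe an encoding.

def encode_availability(local_asn: int, available_mbps: int) -> str:
    """Map spare capacity to a community 'ASN:value' in 100-Mbps steps."""
    value = min(available_mbps // 100, 65535)   # fit the 16-bit value field
    return f"{local_asn}:{value}"

def decode_availability(community: str) -> int:
    """Recover the advertised spare capacity in Mbps."""
    _, value = community.split(":")
    return int(value) * 100

if __name__ == "__main__":
    tag = encode_availability(local_asn=64512, available_mbps=2300)
    print(tag)                        # 64512:23
    print(decode_availability(tag))   # 2300
```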
AUTONOMOUS SYSTEM (AS) RELATIONSHIPS AND NETWORK POLICIES
The current Internet, with more than 16,000 Autonomous Systems (ASs), reflects tremendous growth in both size and complexity since its commercialization. These ASs may be classified into different types, such as Internet Service Providers (ISPs), universities, or other enterprises with their own administrative domains; an administrative domain may sometimes contain several ASs. Based on the properties of each AS and its connections to others, it is important to develop intelligent routing policies for transporting Internet traffic and achieving the desired performance objectives of various applications. We first describe the properties of individual ASs and then relate them to the principles of the proposed architecture. Gao (2001) classified the types of routes that can appear in BGP routing tables on the basis of the relationships between ASs and presented a heuristic algorithm, based on the degree of connectivity, for inferring AS relationships from BGP routing tables.
Internet Connectivity
Based on network properties, type of connectivity, and traffic transport principles, ASs may be classified into the following three categories. While most stub networks are fairly generic and mainly limited to customer networks, multi-homed and transit ASs are widely used within the Internet hierarchy because of the working relationships through which traffic is transported across them.
1. Stub AS: A stub AS usually refers to an end user customer's internal network, typically a LAN (Local Area Network). One of the most important properties of a stub network is that its hosts do not carry traffic for other networks (i.e., no transit service).
2. Multi-homed AS: Many organizations with their own AS depend upon Internet connectivity to support critical applications. A popular approach for improving Internet connectivity is multi-homing: connecting to more than one Internet service provider (ISP). Multi-homing can be very effective in ensuring continuous connectivity, eliminating the ISP as a single point of failure, and it can be cost effective as well. However, an AS must plan its multi-homing strategy carefully to ensure that the scheme actually improves connectivity instead of degrading service availability, and the number of providers an AS can subscribe to is always limited by economic considerations. In most cases, an AS uses only one of its ISP connections for normal traffic, while the second is reserved as a back-up link in case of failure. From a traffic engineering point of view, such a scheme improves traffic throughput across multiple links.
3. Transit AS: Transit ASs are multi-homed, with multiple connections to other service providers, and carry both local and transit traffic. Such networks are generally the ISPs located within the Internet hierarchy (tier-1, tier-2, …, customer network) as shown and described below. The figure does not show tiers 3 to 1; connectivity between tiers is through exchange points in the Internet, which carry the transit traffic between the tiers connected to them. (Figure 1)
Figure 1. Transit AS: Multiple connections in the Internet hierarchy
Connectivity among the different ISPs in the Internet depends on the tier in which they are placed, the size of each ISP, and the number of subscribers. There are mainly four ISP tier levels:
• Tier-1: These ISPs are the transit providers in each country and carry the core traffic of the Internet.
• Tier-2: These are nationwide backbone networks with over a million subscribers. Such networks are connected to the transit ISPs in each country.
• Tier-3: Tier-3 ISPs are regional backbone networks which may have over 50,000 subscribers and connect to Tier-1 ISPs through peering relationships.
• Tier-4: These ISPs are local service providers and consist mostly of small ISPs in each country. Tier-4 ISPs support fewer than 50,000 users, offering local services to their customers.
Apart from the properties of ASs mentioned above, ASs can also be categorized by the contractual relationships and agreements between them. These agreements play an important role in shaping the structure of the Internet as well as its end-to-end performance characteristics. The relationships between ASs are fundamental to the architecture and are discussed in the following:
1. Customer-provider relationship: In a customer-provider relationship, a customer buys services (network connectivity and service support) from its provider, typically an ISP. Similarly, ISPs such as tier-4 providers buy the services they offer to their customers from their upstream service providers. In other words, a provider carries transit traffic for its customers, whereas a customer does not carry transit traffic between any two of its providers, even if it is multi-homed (a sketch of this export rule follows the list). A network architecture supporting the customer-provider relationship must address the Service Level Agreements (SLAs) enforced between the customer and its providers.
2. Peer-to-peer relationship: Two ASs that offer connectivity between their respective customers without exchanging any payment have a peer-to-peer relationship: they agree to exchange traffic between their respective customers free of charge. Such a relationship is enforced through routing policies between ASs at the same level of the Internet hierarchy; for example, a tier-4 service provider peers with another tier-4 provider. The relationship benefits both parties, typically because roughly equal amounts of traffic flow between them.
3. Sibling relationship: A sibling relationship may be established between two or more ASs that are closely placed to each other. In this situation the individual domains provide connectivity to the rest of the Internet for each other. Sometimes called a mutual transit relationship, it may be used to provide backup connectivity to the Internet when the connection of one of the ASs fails. A sibling relationship may also be used for load balancing and for using bandwidth efficiently among various services, provided the ASs involved agree to such an arrangement.
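The export behaviour implied by these relationships can be written down as a small decision rule. The sketch below is a common formulation of that rule (often attributed to Gao's work on AS relationships); the function and constant names are illustrative assumptions, and sibling handling is simplified to full mutual transit.

```python
# Sketch of relationship-driven route export, following the rules described
# above: providers transit for customers, peers exchange only customer routes,
# siblings provide mutual transit. Names are illustrative, not from the chapter.

CUSTOMER, PROVIDER, PEER, SIBLING = "customer", "provider", "peer", "sibling"

def should_export(learned_from: str, advertise_to: str) -> bool:
    """Decide whether a route learned over one relationship may be
    announced over another (valley-free style policy)."""
    if learned_from in (CUSTOMER, SIBLING):
        return True                       # customer/sibling routes go everywhere
    # Routes learned from peers or providers are only passed down to customers
    # (and to siblings, which act as mutual transit).
    return advertise_to in (CUSTOMER, SIBLING)

if __name__ == "__main__":
    print(should_export(PEER, PROVIDER))    # False: no transit between peers and providers
    print(should_export(CUSTOMER, PEER))    # True: customer routes are advertised to peers
```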
In order to design a new Internet architecture based on the AS properties and relationships above (stub, multi-homed, or transit, and customer-provider, peer, or sibling), the architecture must first of all support them and then derive the related policies that enforce them across the Internet. Such an architecture can then address the following issues when deployed across the Internet:
• Resource management with multiple service support
• End-to-end QoS for individual services
• Load balancing
• Fault management
• Facilitating overlay routing
• Security related to information sharing between ASs
One mechanism with the potential to address the above issues is BGP, which is dynamic and also supports network policies. The next section of this chapter describes how BGP can be used to support network policies in our proposed architecture.
Border Gateway Protocol (BGP) and AS Relationships
The Border Gateway Protocol (BGP) is currently deployed across the Internet as the standard routing protocol between AS domains, and AS relationships are enforced by configuring certain BGP parameters. BGP carries nearly 90% of Internet route announcements thanks to its rich support for network policies, and it contributes significantly to Internet load balancing, traffic engineering, and fall-back procedures in the event of network bottlenecks. As the inter-domain routing protocol standard of the current Internet, BGP allows each AS domain to apply its own administrative policy by choosing the best route and by announcing routes to, and accepting routes from, the neighboring AS domains it is connected to. Although this approach works reasonably well for individual ASs in satisfying their own objectives and maximizing their profits, it does not address the impact of such behavior on a global scale. Before presenting the architecture in detail, we review a few of the BGP mechanisms for policy enforcement between ASs and how BGP is currently configured to reflect the AS relationships mentioned above. BGP-4 (Rekhter & Li, 2002) was a simple path vector protocol when first developed; its main purpose was to communicate and control path level information between ASs so as to control the route selection process between them. Using the path level announcements of its neighbors, an AS decides which path to use to reach specific prefixes. One of the main reasons ASs use BGP for inter-domain routing is so that their own policies can be communicated to their neighbors and subsequently across the whole Internet. Many modifications to the original BGP have been made over time, and today BGP is a protocol weighed down with a large number of enhancements that overlap and conflict in unpredictable ways. In this chapter we do not try to analyze those complex issues; instead, our aim is to use BGP as a transport vehicle across ASs and to implement network-wide policies between them. It is sensible at this point to regard the ASs as ISPs, so that we can be more specific in exploring the policies of those ISPs and work toward better management of Internet-wide traffic mapped to the relationships mentioned before. Henceforth, this chapter uses the terms AS and ISP interchangeably.
Figure 2. Connectivity among autonomous systems
Figure 2 shows a scenario connecting different ASs and representing their relationships with each other. One of the key features of BGP is the decision process through which each BGP router determines the path to a destination prefix. The rules are given in Table 1 (a sketch of this tie-breaking order is given after the attribute list below):
Table 1. BGP routing decision process
1. Find the path with the highest Local Preference
2. Compare AS path lengths and choose the shortest
3. Look for the path with the lowest MED attribute
4. Prefer eBGP-learned routes over iBGP routes
5. Choose the path with the lowest IGP metric to the next hop
As shown in Table 1, the relationships between individual ASs are realized through BGP attributes. To determine the actions to be performed for traffic engineering between ASs, the following must be considered:
• Use of Local Preference to influence path selection: In a customer-provider relationship, providers prefer routes learned from their customers over routes learned from peers and providers when routes to the same prefix are available from all of them. Hence, in Figure 2, ISP A would certainly prefer to reach the prefixes within customer Y via customer X rather than via ISP B: by doing so, ISP A generates revenue by sending traffic through its own customer. If ISP A instead sends the traffic through its own provider (not shown), it costs ISP A money, and sending it through ISP B (a peer) degrades ISP A's standing with that peer. To implement such a policy (prefer customer route advertisements) with the local preference attribute, ISPs in general assign a higher local preference value to the path for a given prefix learned from a customer. Caesar and Rexford (2005) describe assigning a non-overlapping range of Local Preference values to each type of peering relationship between AS domains and varying Local Preference within each range to perform traffic engineering. The Local Preference attribute can therefore be used to perform traffic engineering, especially for controlling outgoing traffic between ASs, while also encoding the policy relationships between them.
• Use of AS path pre-pending and the Multi-Exit Discriminator (MED) attribute to influence transit and peering relationships: ISPs may influence the balance of incoming traffic across the different links to their neighbors. One way is to export routes selectively: for example, an ISP may announce only some of its learned paths to a peer, so that the peer only learns about specific routes, and transit ISPs can control their incoming traffic by selectively announcing their learned routes to their peers. In addition, BGP supports the AS path pre-pending technique, in which an ISP adds its own AS number multiple times before announcing the path to a neighbor. Because the BGP decision process prefers the shortest AS path (rule 2 in Table 1), such a technique pushes the neighbor to choose another path, if one is available, instead of the pre-pended path. To see the policy mechanism behind AS path pre-pending and load balancing between peers, consider Figure 2 again: if ISP A decides to pre-pend its AS number three times in the path announcement to ISP B, ISP A announces the path (ISP A, ISP A, ISP A, Customer A) to ISP B, and ISP B will then choose a different path, if available, to reach the prefixes of Customer A. Such a scheme is often tuned manually, on a trial and error basis, simply to divert traffic from other domains. In our architecture, AS path pre-pending is used only when the relationship between peers is based upon a strict exchange of traffic without monetary involvement. In another scheme, ISPs use the Multi-Exit Discriminator (MED) attribute to control incoming traffic from their neighbors: an ISP with multiple links to another ISP can use the MED attribute to influence which link the other ISP should use to send traffic toward a specific destination. Use of the MED attribute, however, must be negotiated beforehand between the two peering ISPs. In the architecture, the MED attribute is used only between transit ISPs having multiple links to other ISPs.
• Use of the community attribute for route export to neighbors: ISPs have been using BGP community attributes for traffic engineering and for giving their customers finer control over the redistribution of their routes (Quoitin, Uhlig, Pelsser, Swinnen & Bonaventure, 2003).
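The decision order of Table 1 can be read as a strict sequence of tie-breakers. The sketch below walks candidate paths through that order; the path fields, default values, and example numbers are assumptions for illustration, and a real BGP implementation applies many additional steps.

```python
# Sketch of the Table 1 decision order for one destination prefix.
# The Path fields and the example attribute values are illustrative assumptions.

from dataclasses import dataclass
from typing import List

@dataclass
class Path:
    local_pref: int            # rule 1: highest wins
    as_path: List[int]         # rule 2: shortest wins
    med: int                   # rule 3: lowest wins
    ebgp_learned: bool         # rule 4: eBGP preferred over iBGP
    igp_metric: int            # rule 5: lowest metric to the next hop wins

def best_path(candidates: List[Path]) -> Path:
    # Sort key encodes the five rules in order; min() applies them lexicographically.
    return min(
        candidates,
        key=lambda p: (-p.local_pref, len(p.as_path), p.med,
                       0 if p.ebgp_learned else 1, p.igp_metric),
    )

if __name__ == "__main__":
    via_customer = Path(200, [64501], 0, True, 30)
    via_peer     = Path(100, [64510], 0, True, 10)
    print(best_path([via_customer, via_peer]) is via_customer)   # True: Local Preference decides
```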
The Internet Assigned Numbers Authority (IANA) typically assigns a block of 65,536 community values to each AS, though only a few of them are used for community based traffic engineering. By tagging these values onto a path announcement, an ISP may ask its neighbor or customer to perform a set of actions on the path when redistributing it (Li & Mohapatra, 2004; Uhlig, Bonaventure & Quoitin, 2003). For example, in Figure 2, ISP A may want its customer (customer X) to pre-pend customer X's AS path three times before announcing it further upstream; similarly, ISP B may ask ISP A not to announce any path to customer X (e.g., NO_EXPORT). In this way ISPs are able to better control their incoming traffic. However, because community based signaling is not yet standardized, a uniform structure for these attributes is needed before they can be applied Internet-wide. Moreover, because each supported community value requires a corresponding filter in the BGP router, the process adds complexity to an already fragile BGP and increases BGP message processing time (Yavatkar, Pendarakis & Guerin, 2000). In summary, AS relationships play an important role in Internet connectivity between ISPs and contribute significantly to the design of a new architecture for the Internet. BGP based traffic engineering can be made more scalable by carefully selecting and configuring these attributes according to the business relationships between ISPs. The next section presents the architecture, which builds on this analysis of ISP relationships and the corresponding traffic engineering attributes supported by BGP.
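One way to make the attribute tuning described above systematic is to derive the BGP knobs applied to a session from the peering relationship. The concrete Local Preference ranges, prepend count, and community string below are illustrative assumptions, in the spirit of the non-overlapping ranges mentioned earlier; they are not values prescribed by the chapter.

```python
# Sketch: deriving per-session BGP policy knobs from the AS relationship.
# Ranges, the prepend count, and community values are assumptions only.

LOCAL_PREF_RANGE = {          # non-overlapping ranges per relationship type
    "customer": (300, 399),   # prefer customer-learned routes
    "peer":     (200, 299),
    "provider": (100, 199),
}

def session_policy(relationship: str, busy_link: bool = False) -> dict:
    low, _high = LOCAL_PREF_RANGE[relationship]
    policy = {"local_pref": low, "prepend": 0, "communities": []}
    if relationship == "peer" and busy_link:
        # Discourage inbound traffic on a loaded peering link by prepending.
        policy["prepend"] = 3
    if relationship == "customer":
        # Example community asking the customer not to re-export the route upstream.
        policy["communities"].append("no-export-to-upstream")
    return policy

if __name__ == "__main__":
    print(session_policy("peer", busy_link=True))
    # {'local_pref': 200, 'prepend': 3, 'communities': []}
```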
THREE-LAYER POLICY ARCHITECTURE FOR THE INTERNET
In this section we present the architecture, which supports scalability and end-to-end performance while accomplishing the following major tasks:
• Traffic flow management and resource monitoring: Flows are aggregated based on desired characteristics such as delay, loss, jitter, bandwidth requirements, and assigned priority levels, which determine the end-to-end performance of various applications. Based on these aggregated flow characteristics, the Bandwidth Broker in each AS domain decides whether to accept or reject a flow (a sketch of this aggregation step follows the list). Such flow management activities are performed at layer-2 of the architecture, delivering network layer QoS across the multiple domains on the end-to-end path.
• QoS area identification: Each domain in the Internet is engineered and provisioned to support a particular set of services to its customers. In the worst case, an AS connected to the Internet supports at least Best Effort as a default service; many ASs are additionally engineered to support VoIP, video conferencing, and other time critical applications. Identifying these AS domains and routing traffic through them improves overall QoS for various applications. This function is supported at layer-3 of the architecture, which essentially performs inter-domain QoS routing by identifying the QoS areas that support a specified service. This layer is different from TCP based session establishment because our architecture assumes multiple AS domains offering multiple levels of QoS; using QoS routing, the architecture selects among different QoS networks, based on AS relationships, for QoS sensitive applications.
• Traffic engineering and load balancing: Since traffic engineering is important for improving end-to-end QoS, the architecture tries to balance traffic flows between domains through a policy co-ordination mechanism. The mechanism uses an approximation technique to resolve any traffic parameter conflict between neighboring domains and to improve the overall QoS of services.
• Policy based routing: Applications requiring strict QoS must adhere to certain policies; our architecture uses BGP based policy decisions and applies various BGP attributes to compute optimized routing paths. A route server in each domain relieves the routers of these complex policy decisions and processes the information quickly.
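As a small illustration of the aggregation step referenced in the first task above, the sketch below groups individual flow requests by service class and destination domain and sums their bandwidth demand, so that acceptance can be decided per aggregate rather than per flow. The class names and the flow tuple are illustrative assumptions.

```python
# Sketch: aggregating individual flow requests per service class so that a
# Bandwidth Broker can admit or reject at the aggregate level.
# Class names and the flow fields are illustrative assumptions.

from collections import defaultdict

flows = [
    {"class": "voice", "bw_kbps": 64,   "dst_as": 64501},
    {"class": "voice", "bw_kbps": 64,   "dst_as": 64501},
    {"class": "video", "bw_kbps": 2000, "dst_as": 64501},
]

def aggregate(flow_list):
    """Sum requested bandwidth per (service class, destination AS)."""
    totals = defaultdict(int)
    for f in flow_list:
        totals[(f["class"], f["dst_as"])] += f["bw_kbps"]
    return dict(totals)

print(aggregate(flows))   # {('voice', 64501): 128, ('video', 64501): 2000}
```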
The above mentioned functions are integrated and operate at different levels within our proposed architecture. Resource management within and between AS domains is supported through the hierarchical grouping of the architectural components, as described below:
1. The architecture is hierarchical and operates at different levels to support both high-level (business focused) and low-level (device focused) resource availability for maintaining end-to-end QoS.
2. The control plane of the architecture is separated from the data plane in order to maintain scalability across the Internet with a wide variety of service classes.
3. Each level in the hierarchy is controlled by a manager which receives resource availability information only from components within that level and reports its output to the next higher level in the hierarchy. Hence the approach is bottom up.
4. Communication within the same level is allowed through peer-to-peer session establishment, without adding overhead to manage the Internet. Any conflict over end-to-end QoS is resolved through proper policy co-ordination mechanisms.
5. Apart from resource management, the architecture also includes routing and traffic engineering, and is hence an integrated approach to managing various services in the Internet.
The logical view of the architecture is presented in Figure 3. Each level in the hierarchy is associated with a number of functions, independent of any particular technology choice. A key feature of the architecture is the separation of the control plane from the data forwarding plane through the hierarchical grouping of network management functions. It is also important to note that layer-2 and layer-3 together correspond to the inter-network layer of the TCP/IP architecture. Essentially, layer-3 determines the AS domains supporting a specified QoS through which a flow can be set up, but does not go beyond that to address resource and flow management, which are performed by layer-2 only. The architecture also considers flow and resource management functions between domains only, since individual domains need to guarantee QoS based on the device capabilities within their own domain. A detailed description of the individual layers is presented below.
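To make the bottom-up reporting between the three levels concrete, the sketch below passes only aggregated availability information from a device manager to its QoS manager and from there to the domain coordinator. The class and method names are illustrative assumptions; they do not correspond to interfaces defined in the chapter.

```python
# Sketch of bottom-up reporting between the three levels: a device manager
# summarises device capacity for its QoS manager, which reports an AS-level
# view to the domain coordinator. All names are illustrative.

class DeviceManager:
    def __init__(self, devices):
        self.devices = devices          # {device_id: free bandwidth in Mbps}

    def report(self):
        # Only aggregated information travels upward, never per-flow state.
        return {"free_mbps": sum(self.devices.values())}

class QoSManager:
    def __init__(self, device_managers):
        self.device_managers = device_managers

    def report(self):
        return {"as_free_mbps": sum(dm.report()["free_mbps"]
                                    for dm in self.device_managers)}

class DomainCoordinator:
    def __init__(self, qos_manager):
        self.qos_manager = qos_manager

    def resource_view(self):
        return self.qos_manager.report()

dm = DeviceManager({"edge-r1": 400, "core-r1": 900})
print(DomainCoordinator(QoSManager([dm])).resource_view())   # {'as_free_mbps': 1300}
```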
Figure 3. Scalable QoS architecture, logical view
Layer-1: Device Layer QoS
Network devices, including routers, switches, and servers, are the key components for maintaining and managing traffic flows across the networks of an AS domain. These devices can be configured with low-level policy information managed by one or more device managers, depending on the size of the network. QoS support for applications depends heavily on identifying, configuring, maintaining, and accounting for these QoS aware devices within an AS. In order to support device level QoS in our architecture, the following policies may be applied:
1. Each device registers its functionality in a policy repository, indicating the kind of service support it offers to different applications (a sketch of such a record follows this list).
2. The repository has a direct interface with a network management tool such as SNMPv3, which monitors the devices periodically to maintain a fairly accurate picture of the physical connectivity of the devices in the network.
3. Information about the queuing strategies, scheduling mechanisms, and prioritized traffic handling of the different devices may also be obtained from the repository. Such information is useful for determining the kind of QoS architecture supported (Intserv, Diffserv, MPLS) within a network domain.
4. The decision to admit a traffic flow and offer it a particular level of QoS depends on the capabilities of the devices falling on the path of that flow. This admission control decision, however, is managed at the next level of the architecture by inspecting the policy repository along with any routing decisions.
5. The overall management of network devices within an AS is performed by a device manager, which needs a direct interface to the management tool and the policy repository. Using a separate management component reduces the load on the SNMP tool and carries out the next level of communication within the architecture. The device manager therefore handles device configuration decisions and communicates with higher level managers to indicate the device level QoS support available in the network. Note that device level QoS is responsible only for obtaining device resource information and for helping to prepare the network topology for QoS support of flows within the AS domain.
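The chapter does not specify a schema for the policy repository, so the record below is only one plausible shape for a device registration; every field name and value is an illustrative assumption.

```python
# Sketch of a device registration record in the policy repository described in
# point 1 above. Field names and values are illustrative assumptions only.

device_record = {
    "device_id": "edge-r1",
    "role": "edge router",
    "service_classes": ["EF", "AF41", "BE"],      # supported traffic classes
    "queuing": "weighted fair queuing",
    "capacity_mbps": 1000,
    "reserved_mbps": {"EF": 100, "AF41": 300},
    "snmp_monitored": True,
}

def free_capacity(record: dict) -> int:
    """Spare capacity the admission decision at the next layer can draw on."""
    return record["capacity_mbps"] - sum(record["reserved_mbps"].values())

print(free_capacity(device_record))   # 600
```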
Figure 4. Device layer QoS components
The logical view of device level QoS is shown in Figure 4. It is assumed that resources within an AS are computed from the aggregated bandwidth requirements of the different service classes. For scalability, the architecture handles an aggregated reservation of resources within each device on the path for the flows that share the same link and request the same traffic class. If any one of the devices cannot support the required resources and QoS parameters, one of the following actions may be taken, based on service priority and any further economic incentives:
• The request for the flow may be denied.
• Alternative devices may be selected and reported to the next layer, the network layer, where a separate path may be created by applying network QoS flow management using the BB resource management strategy.
• The device manager may negotiate with the network layer and further up the hierarchy, and the final decision to offer a lower QoS level may be communicated between the entities involved in this process.
The network layer decision on QoS flow management across AS domains is nonetheless important within the architecture for determining how traffic can best be handled with the desired QoS support from the devices below. The architectural support at the next level in the hierarchy (network layer QoS) is based on admission control policy, inter-domain flow management, and signaling for the various flows. The layer-2 functions of the architecture are presented below.
Layer-2: Network Layer QoS
Network layer service guarantees for QoS sensitive applications are supported using the information from the device managers and their associated management tools, which sit at the bottom layer of the architecture. Once this information is obtained, the network layer QoS determines and supports QoS guarantees between network boundaries within an AS domain. This layer therefore performs flow and resource management functions for both the intra-domain and the inter-domain case. The intra-domain functions are presented first, followed by the inter-domain functions. For the intra-domain case, the following procedures apply:
1. Identifying paths between edge routers by applying intra-domain routing, and ranking them (high to low) according to their QoS support, based on parameters such as bandwidth, delay, jitter, loss, and cost.
2. Allocating resources to each device on every edge-to-edge path using an aggregated reservation strategy, as part of the intra-domain resource management framework. Such an aggregated reservation is made for a group of flows that share the same link and request the same traffic class.
3. Storing the completed resource allocations in a repository that is updated at regular intervals.
4. Performing admission control for individual flows at the edge (ingress) routers by checking SLAs and any policy related information.
5. A QoS manager within each AS ensures support for network layer QoS to the various applications, and communicates with the device manager and the other higher level components of the architecture.
The network layer QoS is also responsible for topology discovery within an AS, in which connectivity information between the various devices is obtained in order to know the exact path between end points within the AS domain. Such topology discovery information is required in two different forms. The first is the physical topology of the network, describing the physical connectivity between devices, which in most cases is static unless physical changes are made to the network. The second is the routing topology, which changes more frequently, since the routes taken between any pair of devices are likely to change relatively often, for example as a result of traffic engineering to balance load among different links. Physical topology information is obtained by interacting with the device manager, while the routing topology information depends on factors such as the type of routing protocol, the routing scheme (overlay routing versus standard next-hop routing), and any routing policies applied to support service guarantees for the traffic flows within the AS domain. We describe only the architecture and its components, without further details of any specific deployment mechanism; for the sake of example we consider Diffserv, but it is entirely up to the network managers/designers to decide which technology to use for QoS support of their applications.
Figure 5. Network level QoS (intra-domain): logical view
The logical view of the network layer QoS for intra-domain flow management is shown in Figure 5. The QoS manager plays the central role in managing network-wide QoS delivery by interacting with the other components of the architecture. Since the device managers manage the devices and interact with the device repository and the management tool to monitor, configure, and record device level statistics in support of QoS, this information is crucial for the QoS manager at the network layer to apply between network edges for intra-domain QoS guarantees. An accurate view of both the device support at the lower layers and of resource management therefore contributes significantly to building a good architecture for the Internet. One of the important tasks at the network layer is to make sure that sufficient resources are available for the QoS sensitive flows originating both inside and outside the network. While flows originating within the network are guaranteed resources for a specific QoS based on the SLAs and policies between the network and its customers, flows entering from outside the network are admitted only if prior contracts and relationships have been established with the other network domains; otherwise, the network treats these flows as Best Effort, without further QoS guarantees. Another interesting point within the architecture is the interaction between routing topology and physical topology information. While intra-domain routing protocols determine the network level path within an AS, that path may not be an optimal one for supporting the application's QoS. The QoS manager therefore consults the QoS path repository to determine whether a better path with the desired resources is available for that application at that instant. If an alternative path is discovered between the same edge points, the QoS manager may ask the device manager to configure the devices falling on that path. The physical topology of the network, describing the connectivity between devices, may be used for forwarding traffic in situations where the routing protocols are not helpful for QoS support within the AS. Such considerations are taken within the architecture in order to support "better than best-effort" QoS, particularly for the controlled load services defined in Diffserv. The QoS manager is responsible for providing service level guarantees within a single AS only; end-to-end QoS in the Internet must be supported by the multiple domains through which Internet-wide connectivity is established, and various factors beyond individual network level QoS are important in this regard. Within the architecture, the third layer of the hierarchy, inter-domain QoS, is designed to manage end-to-end QoS for various applications. Issues related to trust management, policy co-ordination, inter-domain routing, traffic engineering, and competitive pricing structures are some of the key factors considered at this next level of the architecture, described below.
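Before moving to layer-3, the sketch below illustrates the aggregated admission check described above: a reservation for a traffic class is granted only if every link on the chosen edge-to-edge path still has room within that class's budget. The data structures, link names, and per-class limits are illustrative assumptions rather than parts of the prototype.

```python
# Sketch of the aggregated admission check performed by the QoS manager /
# Bandwidth Broker for an edge-to-edge path. Data structures and limits are
# illustrative assumptions.

path_links = ["ingress->core1", "core1->egress"]

# Aggregated reservations already granted per (link, traffic class), in Mbps.
reserved = {("ingress->core1", "EF"): 40, ("core1->egress", "EF"): 40}
class_limit_mbps = {"EF": 100, "AF41": 300}       # per-link budget per class

def admit(links, traffic_class, demand_mbps):
    """Admit the aggregate only if every link on the path has room left."""
    for link in links:
        used = reserved.get((link, traffic_class), 0)
        if used + demand_mbps > class_limit_mbps[traffic_class]:
            return False                           # deny, or try an alternative path
    for link in links:                             # commit the reservation
        reserved[(link, traffic_class)] = reserved.get((link, traffic_class), 0) + demand_mbps
    return True

print(admit(path_links, "EF", 50))   # True
print(admit(path_links, "EF", 20))   # False: the 100-Mbps EF budget would be exceeded
```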
Layer-3: End-to-End QoS
The End-to-End QoS layer of the architecture is responsible for managing the higher level policies of the networks in order to guarantee end-to-end QoS across the Internet. One of its most important functions is selecting QoS areas between end nodes and routing traffic flows according to their QoS parameters. Inter-domain policy routing, traffic engineering for load balancing, and supporting user QoS requirements through SLAs are the key functions of this layer. It extends single-AS QoS (offered at layer-2 of the architecture) to multi-AS QoS by adding the following functions to the architecture:
1. Application level QoS is supported through the SLAs between the network service provider and its customers. Identifying parameters from the SLAs, such as customer identification, service category, resource support, service schedule, traffic descriptor, QoS parameters, and out-of-profile traffic treatment, is therefore important at this layer of the architecture.
2. Admission control policies determine user authentication through a central repository within each AS and establish the level of QoS support for the flows.
3. Administrative and pricing policies are considered part of the admission control process and of the resource allocation strategy for applications; the architecture, however, does not cover pricing issues.
4. AS relationships and trust issues are central to determining whether QoS paths exist between end nodes spanning multiple domains in the Internet. This approach investigates a number of QoS paths rather than simply choosing the lowest metric routing path between the end points (a sketch of this selection follows the list). AS relationships are determined by inferring various Internet characteristics as well as by using policy based routing information exchanged between domains.
5. Inter-domain routing decisions based on these policies are given preference, allowing the set of service requirements to be supported optimally within the network's aggregate resource capability.
6. A central domain coordinator within each AS is responsible for the above activities and interacts with the domain coordinators of other domains in the Internet. Identifying QoS domains and investigating their service offerings are therefore key to the architecture.
7. Any conflict in resource and traffic management, including simple pricing parameters, is resolved by applying a resource co-ordination (or similar) algorithm between the domains on the QoS path.
8. Once the QoS discovery process is complete, the technical parameters extracted from the SLAs, referred to as service level specifications (Goderis, et al., 2001), are mapped onto the network layer QoS components and finally onto the device managers of the individual ASs.
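The QoS-area selection in points 4-6 above can be pictured as filtering AS-level paths so that every domain on the path advertises support for the requested service, then preferring the shortest surviving path. The topology, the advertised services, and the function names below are illustrative assumptions, not data or interfaces from the chapter.

```python
# Sketch of QoS-area selection at layer-3: keep only AS-level paths in which
# every domain supports the requested service class, then prefer the shortest.
# Topology, service sets, and names are illustrative assumptions.

as_graph = {65001: [65002, 65010], 65002: [65003], 65010: [65003], 65003: []}
supports = {65001: {"voice"}, 65002: {"voice"}, 65010: set(), 65003: {"voice"}}

def qos_paths(src, dst, service, path=None):
    path = (path or []) + [src]
    if service not in supports[src]:
        return []                                  # this domain is not a QoS area
    if src == dst:
        return [path]
    routes = []
    for nxt in as_graph[src]:
        if nxt not in path:
            routes.extend(qos_paths(nxt, dst, service, path))
    return routes

candidates = qos_paths(65001, 65003, "voice")
print(min(candidates, key=len))    # [65001, 65002, 65003]
```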
The architectural support for the above functions at the End-to-End QoS layer is achieved largely through a series of negotiation and policy co-ordination activities, after which the exact QoS parameters are determined by the domain coordinator and applied across the domains in the QoS path. This approach then guarantees the SLAs between the service provider and its customers, supporting the end-to-end QoS objectives of the applications. The logical view of the End-to-End QoS layer is presented in Figure 6, together with the interactions among its components. The function of the domain coordinator can be compared with that of a Bandwidth Broker (Terzis, Wang, Ogawa & Zhang, 1999; Zhang, Duan, Gao & Hou, 2000) or a Policy Server (Yavatkar, Pendarakis & Guerin, 2000), which performs policy based admission control by comparing SLSs with user flow parameters. Resource control similar to the work in (Yavatkar, Pendarakis & Guerin, 2000) and (Terzis, Wang, Ogawa & Zhang, 1999) is also performed by the domain coordinator.
Figure 6. End-to-End QoS: Logical view
The domain coordinator primarily manages two different sets of policies, as shown in the logical diagram, and exchanges them with other domains. The customer specific policy controls access to the available services within a service provider's domain by comparing parameters such as priorities, usable services, resource availability, and valid time against the SLA rules agreed between the service provider and its customer; the decision to accept or deny the customer's flow is finally conveyed through the admission control module. The right side of the domain coordinator, as shown in Figure 6, is responsible for service and resource specific policies. Service parameters related to QoS values, valid time, and cost are compared against the policy rules found in the respective SLAs. To determine optimized values for these service parameters, the domain coordinator needs to consider traffic engineering policies as well as routing policies involving the peering domains. Finally, the architecture deals with resource specific policies, such as the bandwidth, delay, and jitter available at the network level, by communicating with the QoS manager. In case of policy conflicts (e.g., the available resources are not sufficient), the domain coordinator initiates a policy coordination algorithm between the domains on the end-to-end QoS path. To convey the overall architecture on an end-to-end basis, the flow diagram in Figure 7 shows the activities and their sequence of operation between end systems in the Internet. While the objective is to support QoS between domains under different AS administrative control, individual domains should support both network level and device level QoS throughout the life of the flow.
Figure 7. Functional descriptions and interactions among different layers
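The policy coordination step invoked on a conflict can be thought of as searching for a parameter value that both neighbouring domains can honour. The "meet in the middle within both SLA bounds" rule below is an illustrative stand-in for the chapter's co-ordination algorithm, not its actual definition; all names and numbers are assumptions.

```python
# Sketch of resolving a parameter conflict between neighbouring domain
# coordinators. The averaging rule is an illustrative stand-in for the
# co-ordination algorithm referenced in the chapter.

def coordinate(param, offer_a, offer_b, bound_a, bound_b):
    """Pick a value both domains can honour, or report an unresolvable conflict."""
    proposal = (offer_a + offer_b) / 2
    lo, hi = max(bound_a[0], bound_b[0]), min(bound_a[1], bound_b[1])
    if lo > hi:
        return None                       # no overlap: flow falls back to best effort
    return {param: min(max(proposal, lo), hi)}

# Domain A offers a 30 ms delay budget, domain B offers 50 ms; both SLAs allow 20-60 ms.
print(coordinate("delay_ms", 30, 50, (20, 60), (20, 60)))   # {'delay_ms': 40.0}
```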
CONCLUSION
We discussed a policy architecture that handles resource management for both intra- and inter-domain resources for QoS specific (high-priority) applications. One of its strengths is the separation of the control and management plane from the data plane, which facilitates better end-to-end QoS control. The architecture is hierarchical and operates at different levels to support both high-level (business focused) and low-level (device focused) resource availability for maintaining end-to-end QoS. Each level in the hierarchy is controlled by a manager that receives resource availability information only from components within that level and reports its output to the next higher level; the approach is therefore bottom up. Communication within the same level is allowed through peer-to-peer session establishment using the necessary signaling protocols, and any conflict over end-to-end QoS is resolved through a policy co-ordination mechanism. Apart from resource management, the architecture also includes policy based routing and traffic engineering through fine tuning of BGP routing policy structures. The architecture is thus scalable and integrated, and aims to improve end-to-end QoS for various services in the Internet. The functionality of the three layers was validated in three different environments. The layer-1 functionality was demonstrated on a Diffserv network by creating a test bed with a Diffserv capable domain and measuring end-to-end QoS parameters for a VoIP application in the presence of background traffic; these experiments motivated the consideration of various QoS parameters and the use of a resource management strategy between AS domains. Layer-2 of our architecture mainly deals with resource management between neighboring AS domains on an end-to-end basis; for this we designed a prototype based on a Bandwidth Broker, using our own signaling scheme to manage traffic flows of different QoS classes. Finally, layer-3 is designed to select QoS domains and forward traffic using an inter-domain routing protocol such as BGP, enforcing routing policies dynamically; to demonstrate these functions we used several simulation experiments based on the OPNET simulator. The simulation environment also used parameters taken from real routers and demonstrated the efficiency of the community based attribute and of the policy co-ordination algorithm in case of policy conflict. A series of experiments investigated the effect of BGP based policy enforcement, load balancing between AS domains, and traffic engineering on scalability and QoS management in the Internet. In summary, we presented a policy based architecture designed to support end-to-end QoS for multiple service classes in an integrated way. Our design and the performance evaluation results presented in (Nanda, 2008) indicate that such end-to-end QoS can be achieved with the help of service mapping, policy based routing and traffic engineering, resource management using BB functionality, and device level QoS support across the Internet. The main strengths of our design are its scalability, its ability to handle heterogeneous policies, and its support for distributed resource management.
This chapter also established a foundation for further research on policy routing involving security, policy based billing and charging in the Internet, and application level resource management in the Internet.
REFERENCES Agarwal, S., Chuah. C. N., & Katz, R. H. (2003). OPCA: Robust Inter-domain Policy Routing and Traffic Control, OPENARCH. Awduche, D.O., Chiu, A., Elqalid, A., Widjaja, I., & Xiao, X. (2002). A Framework for Internet Traffic Engineering [draft 2]. Retrieved from IETF draft database. Caesar, M., & Rexford, J. (2005, March). BGP routing policies in ISP networks, (Tech. Rep. UCB/CSD05-1377). U. C. Berkeley, Berkeley, CA. Gao, L. (2001). On inferring autonomous system relationships in the Internet. IEEE/ACM Transactions on networking, 9(6), December. Goderis, D. et al. (2001, July). Service Level Specification Semantics and parameters: draft-tequilasls-01.txt [Internet Draft]. Huston, G. (n.d.). Peering and settlements Part-1. The Internet protocol journal. San Jose, CA: CISCO Systems. Li, Z. Zhang, Duan, Z., Gao, L.& Hou, Y.T.(2000). Decoupling QoS control from Core routers: A Novel bandwidth broker architecture for scalable support of guaranteed services. Proc. Of SIGCOMM’00, Stockholm, Sweden, (pp. 71-83). Li, Z., & Mohapatra, P. (2004, January). QoS Aware routing in Overlay networks (QRON). IEEE Journal on Selected Areas in Communications, 22(1). Nanda, P. (2008, January). A three layer policy based architecture supporting Internet QoS. Ph.D. thesis, University of Technology, Sydney, Australia. Quoitin, B., & Bonaventure, O. (2005). A Co-operative approach to Inter-domain traffic engineering. 1st Conference on Next Generation Internet Networks Traffic Engineering (NGI 2005), Rome, Italy, April 18-20th. Quoitin, B., Uhlig, S., Pelsser, C., Swinnen, L., & Bonaventure, O. (2003). Internet traffic engineering with BGP: Quality of Future Internet Services. Berlin: Springer Rekhter, Y. & Li, T. (2002, January). A border gateway protocol 4 (BGP-4): draft-ietf-idr-bgp4-17.txt [Internet draft, work in progress]. Salsano, S. (2001 October). COPS usage for Diffserv resource allocation (COPS-DRA) [Internet Draft]. Simmonds, A., & Nanda, P. (2002). Resource Management in Differentiated Services Networks. In C McDonald (Ed.), Proceedings of ‘Converged Networking: Data and Real-time Communications over IP,’ IFIP Interworking 2002, Perth, Australia, October 14 - 16, (pp. 313 – 323). Amsterdam: Kluwer Academic Publishers. Terzis, A., Wang, L., Ogawa, J. & Zhang, L. (1999, December). A two tier resource management model for the Internet, Global Internet, (pp. 1808 – 1817).
Uhlig, S., Bonaventure, O., & Quoitin, B. (2003). Internet traffic engineering with minimal BGP configuration. 18th International Teletraffic Congress. Yavatkar, R., Pendarakis, D., & Guerin, R. (2000, January). A framework for policy based admission control, (RFC 2753).
KEY TERMS AND DEFINITIONS Autonomous System (AS): An autonomous system is an independent routing domain connecting multiple networks under the control of one or more network operators that presents a common, clearly defined routing policy to the Internet and has been assigned an Autonomous System Number (ASN). Bandwidth Broker(BB): Bandwidth Broker (BB) is a logical entity used to act as a resource manager both within a network and between networks so as to guarantee performance. Border Gateway Protocol (BGP): BGP is a routing protocol which allows networks to tell other networks about destinations that they are “responsible” by exchanging routing information in different autonomous systems. Differentiated Services (Diffserv): Diffserv supports QoS guarantee by aggregating traffic flows on a per class basis. Integrated services (Intserv): Intserv supports end-to-end QoS guarantee on a per flow basis. Policy Based Networking (PBN): Policy based networking is defined as the management of a network so that various kinds of traffic get certain priority of availability and bandwidth needed to serve the network’s users effectively. Quality of Service (QoS): Quality of Service (QoS) is defined as supporting and guaranteeing network resources to various users, applications and services in the Internet. Traffic Engineering (TE): Traffic Engineering (TE) is concerned with performance optimization of operational IP networks and can be used to reduce congestion and improve resource utilization by careful distribution of traffic in the network.
Chapter 33
Scalable Fault Tolerance for Large-Scale Parallel and Distributed Computing Zizhong Chen Colorado School of Mines, USA
ABSTRACT Today’s long running scientific applications typically tolerate failures by checkpoint/restart in which all process states of an application are saved into stable storage periodically. However, as the number of processors in a system increases, the amount of data that need to be saved into stable storage also increases linearly. Therefore, the classical checkpoint/restart approach has a potential scalability problem for large parallel systems. In this chapter, we introduce some scalable techniques to tolerate a small number of process failures in large parallel and distributed computing. We present several encoding strategies for diskless checkpointing to improve the scalability of the technique. We introduce the algorithm-based checkpoint-free fault tolerance technique to tolerate fail-stop failures without checkpoint or rollback recovery. Coding approaches and floating-point erasure correcting codes are also introduced to help applications to survive multiple simultaneous process failures. The introduced techniques are scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. Experimental results demonstrate that the introduced techniques are highly scalable.
INTRODUCTION The unquenchable desire of scientists to run ever larger simulations and analyze ever larger data sets is fueling a relentless escalation in the size of supercomputing clusters from hundreds, to thousands, and even tens of thousands of processors (Dongarra, Meuer & Strohmaier, 2004). Unfortunately, the struggle to design systems that can scale up in this way also exposes the current limits of our understanding
of how to efficiently translate such increases in computing resources into corresponding increases in scientific productivity. One increasingly urgent part of this knowledge gap lies in the critical area of reliability and fault tolerance. Even making generous assumptions on the reliability of a single processor, it is clear that as the processor count in high end clusters grows into the tens of thousands, the mean time to failure (MTTF) will drop from hundreds of days to a few hours, or less. The type of 100,000-processor (Adiga, et al., 2002) machines projected in the next few years can expect to experience a processor failure almost daily, perhaps hourly. Although today’s architectures are robust enough to incur process failures without suffering complete system failure, at this scale and failure rate, the only technique available to application developers for providing fault tolerance within the current parallel programming model, checkpoint/restart, has performance and conceptual limitations that make it inadequate to the future needs of the communities that will use these systems. Alternative fault tolerance techniques need to be investigated. In this chapter, we present some scalable techniques to tolerate a small number of process failures in large scale parallel and distributed computing. The introduced techniques are scalable in the sense that the overhead to survive k failures in p processes does not increase as the total number of application processes p increases. We introduce several encoding strategies into diskless checkpointing to improve the scalability of the technique. We present an algorithm-based checkpoint-free fault tolerance approach, in which, instead of taking checkpoints periodically, a coded global consistent state of the critical application data is maintained in memory by modifying applications to operate on encoded data. Because no periodical checkpoint or rollback-recovery is involved in this approach, process failures can often be tolerated with a surprisingly low overhead. We explore a class of numerically stable floating-point number erasure codes based on random matrices which can be used in the algorithm-based checkpoint-free fault tolerance technique to tolerate multiple simultaneous process failures. Experimental results demonstrate that the introduced fault tolerance techniques can survive a small number of simultaneous processor failures with a very low performance overhead.
BACKGROUND Current parallel programming paradigms for high-performance distributed computing systems are typically based on the Message-Passing Interface (MPI) specification (Message Passing Interface Forum, 1994). However, the current MPI specification does not specify the behavior of an MPI implementation when one or more process failures occur during runtime. MPI gives the user the choice between two possibilities on how to handle failures. The first one, which is the default mode of MPI, is to immediately abort all surviving processes of the application. The second possibility is just slightly more flexible, handing control back to the user application without guaranteeing that any further communication can occur.
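The second possibility above is selected through MPI's error-handler mechanism. The short sketch below is illustrative only — it assumes the mpi4py Python binding, which is not discussed in this chapter — and simply switches a communicator from the default abort-on-error behavior to returning errors to the application:

```python
# Illustrative sketch (not from the chapter): asking MPI to return errors
# instead of aborting, so the application regains control after a failure.
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Default is MPI.ERRORS_ARE_FATAL: any failure aborts every process.
comm.Set_errhandler(MPI.ERRORS_RETURN)

try:
    data = comm.bcast("payload", root=0)   # may raise if a peer has failed
except MPI.Exception as err:
    # The standard MPI specification guarantees nothing beyond this point;
    # a fault tolerant MPI such as FT-MPI is needed to continue meaningfully.
    print("communication failed, error class:", err.Get_error_class())
```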
FT-MPI Overview FT-MPI (Fagg, Gabriel, Bosilca, Angskun, Chen, Pjesivac-Grbovic, et al., 2004) is a fault tolerant version of MPI that is able to provide basic system services to support fault survivable applications. FT-MPI implements the complete MPI-1.2 specification and parts of the MPI-2 functionality, and extends some of the semantics of MPI to support self-healing applications. FT-MPI is able to survive the failure of n − 1 processes in an n-process job, and, if required, can re-spawn the failed processes. However, fault
tolerant applications have to be implemented in a self-healing way so that they can survive failures. Although FT-MPI provides basic system services to support self-healing applications, prevailing benchmarks show that the performance of FT-MPI is comparable (Fagg, Gabriel, Bosilca, Angskun, Chen, Pjesivac-Grbovic, et al., 2005) to the current state-of-the-art non-fault-tolerant MPI implementations.
FT-MPI Semantics FT-MPI provides semantics that answer the following questions:

1. What is the status of an MPI communicator after recovery?
2. What is the status of the ongoing communication and messages during and after recovery?
When running an FT-MPI application, there are two parameters used to specify the modes in which the application is running. The first parameter, the communicator mode, indicates the status of an MPI object after recovery. FT-MPI provides four different communicator modes, which can be specified when starting the application:

• ABORT: like any other MPI implementation, in this FT-MPI mode the application aborts itself after a failure.
• BLANK: failed processes are not replaced; all surviving processes have the same rank as before the crash and MPI COMM WORLD has the same size as before.
• SHRINK: failed processes are not replaced; however, the new communicator after the crash has no ’holes’ in its list of processes. Thus, processes might have a new rank after recovery and the size of MPI COMM WORLD will change.
• REBUILD: failed processes are re-spawned; surviving processes have the same rank as before. The REBUILD mode is the default, and the most used mode of FT-MPI.
The second parameter, the communication mode, indicates how messages, which are sent but not received while a failure occurs, are treated. FT-MPI provides two different communication modes, which can be specified while starting the application:

• CONT/CONTINUE: all operations which returned the error code MPI SUCCESS will finish properly, even if a process failure occurs during the operation (unless the communication partner has failed).
• NOOP/RESET: all pending messages are dropped. The assumption behind this mode is that on error the application returns to its last consistent state, and all currently pending operations are not of any further interest.
FT-MPI Usage It usually takes three steps to tolerate a failure: 1) failure detection, 2) failure notification, and 3) recovery. The only assumption the FT-MPI specification makes about the first two points is that the run-time environment discovers failures and all remaining processes in the parallel job are notified about these events. The recovery procedure consists of two steps: recovering the MPI run-time environment, and
recovering the application data. The latter is considered to be the responsibility of the application developer. In the FT-MPI specification, the communicator mode determines the status of MPI objects after recovery, and the message mode determines the status of ongoing messages during and after recovery. FT-MPI offers several possibilities for each of these modes. This allows application developers to take the specific characteristics of their application into account and use the best-suited method to tolerate failures.
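The division of labor described above — the run-time rebuilds the MPI environment while the application restores its own data — typically leads to a recovery loop of the following shape. This is only a hedged sketch of the pattern, not FT-MPI's actual interface: the helper names (do_iteration, do_checkpoint_encoding) are hypothetical application hooks, and the re-spawning of failed processes is assumed to be handled by the run-time (e.g., FT-MPI's REBUILD mode) before control returns to the application.

```python
# Hedged sketch of the application-level recovery pattern (assumed helpers).
import copy

def run_with_recovery(state, n_iters, do_iteration, do_checkpoint_encoding):
    saved, saved_it = copy.deepcopy(state), 0      # last consistent in-memory checkpoint
    it = 0
    while it < n_iters:
        try:
            do_iteration(state, it)
            if (it + 1) % 100 == 0:                # checkpoint interval: example value
                do_checkpoint_encoding(state)      # e.g., checksum onto spare processes
                saved, saved_it = copy.deepcopy(state), it + 1
            it += 1
        except RuntimeError:
            # Failure has been detected and notified; the run-time has rebuilt MPI.
            # The application restores its own data and resumes from the checkpoint.
            state = copy.deepcopy(saved)
            it = saved_it
    return state
```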
SCALABLE DISKLESS CHECKPOINTING FOR LARGE SCALE SCIENTIFIC COMPUTING In this section, we introduce some techniques to improve the scalability of classical diskless checkpointing technique.
Diskless Checkpointing: From an Application Point of View Diskless checkpointing (Plank, Li & Puening, 1998) is a technique to save the state of a long running computation on a distributed system without relying on stable storage. With diskless checkpointing, each processor involved in the computation stores a copy of its state locally, either in memory or on local disk. Additionally, encodings of these checkpoints are stored in the local memory or on the local disk of some processors which may or may not be involved in the computation. When a failure occurs, each live processor may roll its state back to its last local checkpoint, and the failed processor’s state may be calculated from the local checkpoints of the surviving processors and the checkpoint encodings. By eliminating stable storage from checkpointing and replacing it with memory and processor redundancy, diskless checkpointing removes the main source of overhead in checkpointing on distributed systems (Plank, Li & Puening, 1998). Figure 1 is an example of how diskless checkpointing works.

Figure 1. Fault tolerance by diskless checkpointing

To make diskless checkpointing as efficient as possible, it can be implemented at the application level rather than at the system level (Plank, Kim & Dongarra, 1997). In typical long running scientific applications, when diskless checkpointing is taken at the application level, what needs to be checkpointed is often some numerical data (Kim, 1996). These numerical data can either be treated as bit-streams or as floating-point numbers. If the data are treated as bit-streams, then bit-stream operations such as parity
can be used to encode the checkpoint. Otherwise, floating-point arithmetic such as addition can be used to encode the data. In this research, we treat the checkpoint data as floating-point numbers rather than bit-streams. However, the corresponding bit-stream version schemes could also be used if the application programmer thinks they are more appropriate. In the rest of this chapter, we discuss how local checkpoints can be encoded efficiently so that applications can survive process failures.
Checksum-Based Checkpointing The checksum-based checkpointing is a floating-point version of the parity-based checkpointing scheme proposed in (Plank, Li, & Puening, 1998). In the checksum-based checkpointing, instead of using parity, floating-point number addition is used to encode the local checkpoint data. By encoding the local checkpoint data of the computation processors and sending the encoding to some dedicated checkpoint processors, the checksum-based checkpointing introduces a much lower memory overhead into the checkpoint system than neighbor-based checkpointing. However, due to the calculating and sending of the encoding, the performance overhead of the checksum-based checkpointing is usually higher than neighbor-based checkpoint schemes. The basic checksum scheme works as follows. If the program is executing on N processors, then there is an (N + 1)-st processor called the checksum processor. At all points in time a consistent checkpoint is held in the N processors in memory. Moreover, a checksum of the N local checkpoints is held in the checksum processor. Assume Pi is the local checkpoint data in the memory of the i-th computation processor and C is the checksum of the local checkpoints in the checkpoint processor. If we look at the checkpoint data as an array of real numbers, then the checkpoint encoding actually establishes an identity (1) between the checkpoint data Pi on the computation processors and the checksum data C on the checksum processor:

P1 + … + Pn = C   (1)

If any processor fails, then the identity (1) becomes an equation with one unknown. Therefore, the data in the failed processor can be reconstructed through solving this equation. Due to the floating-point arithmetic used in the checkpoint and recovery, there will be round-off errors in the checkpoint and recovery. However, the checkpoint involves only additions and the recovery involves additions and only one subtraction. In practice, the increased possibility of overflows, underflows, and cancellations due to round-off errors in the checkpoint and recovery algorithm is negligible.
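As a concrete illustration of identity (1), the sketch below — an example using mpi4py and numpy, with a process layout and data sizes that are assumptions rather than the chapter's implementation — lets every computation processor contribute its local checkpoint array to a floating-point sum accumulated on a dedicated checksum processor; recovering a single failed processor then amounts to one subtraction per array element.

```python
# Sketch of checksum-based checkpoint encoding (assumes mpi4py + numpy).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
CKPT = size - 1                       # last rank plays the checksum processor
m = 1 << 20                           # local checkpoint length (illustrative)

# Computation processors hold data; the checksum processor contributes zeros.
local = np.zeros(m) if rank == CKPT else np.random.rand(m)

# Encoding: C = P1 + ... + Pn accumulated on the checksum processor (identity (1)).
C = np.empty(m) if rank == CKPT else None
comm.Reduce(local, C, op=MPI.SUM, root=CKPT)

# Recovery sketch: if processor f fails, the survivors re-reduce their local
# checkpoints to the checksum processor and P_f is obtained as C minus that sum,
# i.e., identity (1) solved for its single unknown.
```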
Overhead and Scalability Analysis Assume diskless checkpointing is performed in a parallel system with p processors and the size of the checkpoint on each processor is m bytes. It takes α + βx to transfer a message of size x bytes between two processors regardless of which two processors are involved. α is often called the latency of the network, and 1/β is called the bandwidth of the network. Assume the rate to calculate the sum of two arrays is γ seconds per byte. We also assume that it takes α + βx to write x bytes of data into the stable storage. Our default network model is the duplex model where a processor is able to concurrently send a message to one partner and receive a message from a possibly different partner. The more restrictive simplex model permits only one communication direction per processor. We also assume that disjoint pairs of processors can communicate with each other without interfering with each other. In classical diskless checkpointing, a binary-tree based encoding algorithm is often used to perform the checkpoint encoding (Chiueh & Deng, 1996), (Kim, 1996), (Plank, 1997), (Plank, Li & Puening, 1998), (Silva & Silva, 1998). By organizing all processors as a binary tree and sending local checkpoints along
the tree to the checkpoint processor (see Figure 2 (Plank, Li & Puening, 1998)), the time to perform one checkpoint for a binary-tree based encoding algorithm, Tdiskless−binary, can be represented as
Tdiskless-binary = 2⌈log p⌉ (α + (β + γ)m).

In high performance scientific computing, the local checkpoint is often a relatively large message (megabyte level), so (β + γ)m is usually much larger than α. Therefore,

Tdiskless-binary ≈ 2⌈log p⌉ (β + γ)m.

Note that, in a typical checkpoint/restart approach where βm is usually also much larger than α, the time to perform one checkpoint, Tcheckpoint/restart, is

Tcheckpoint/restart = p(α + βm) ≈ pβm.

Therefore, by eliminating stable storage from checkpointing and replacing it with memory and processor redundancy, diskless checkpointing improves the scalability of checkpointing greatly on parallel and distributed systems.
A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing Although the classical diskless checkpointing technique improves the scalability of checkpointing dramatically on parallel and distributed systems, the overhead to perform one checkpoint still increases logarithmically as the number of processors increases. In this section, we propose a new style of encoding algorithm which improves the scalability of diskless checkpointing significantly. The new encoding algorithm is based on the pipeline idea. When the number of processors is one or two, there is not much that we can improve. Therefore, in what follows, we assume the number of processors is at least three.

Figure 2. Encoding local checkpoints using the binary tree algorithm
Pipelining The key idea of pipelining is (1) the segmenting of messages and (2) the simultaneous non-blocking transmission and receipt of data. By breaking up a large message into smaller segments and sending these smaller messages through the network, pipelining allows the receiver to begin forwarding a segment while receiving another segment. Data pipelining can produce several significant improvements in the process of checkpoint encoding. First, pipelining masks the processor and network latencies that are known to be an important factor in high-bandwidth local area networks. Second, it allows the simultaneous sending and receiving of data, and hence exploits the full duplex nature of the interconnect links in the parallel system. Third, it allows different segments of a large message to be transmitted over different interconnect links in parallel after a pipeline is established, hence fully utilizing the multiple interconnects of a parallel and distributed system.
Chain-Pipelined Encoding for Diskless Checkpointing Let m[i] denote the data on the ith processor. The task of checkpoint encoding is to calculate the encoding, which is m[0] + m[1] + ... + m[p − 1], and deliver the encoding to the checkpoint processor. The chain-pipelined encoding algorithm works as follows. First, organize all computational processors and the checkpoint processor as a chain. Second, segment the data on each processor into small pieces. Assume the data on each processor are segmented into t segments of size s. The jth segment of m[i] is denoted as m[i][j]. Third, m[0] + m[1] + ... + m[p − 1] is calculated by calculating m[0][j] + m[1][j] + ... + m[p − 1][j] for each 0 ≤ j ≤ t − 1 in a pipelined way. Fourth, when the jth segment of the encoding m[0][j] + m[1][j] + ... + m[p − 1][j] is available, start to send it to the checkpoint processor. Figure 3 demonstrates an example of calculating a chain-pipelined checkpoint encoding for three processors (processor 0, processor 1, and processor 2) and delivering it to the checkpoint processor (processor 3). In step 0, processor 0 sends its m[0][0] to processor 1. Processor 1 receives m[0][0] from processor 0 and calculates m[0][0] + m[1][0]. In step 1, processor 0 sends its m[0][1] to processor 1. Processor 1 first concurrently receives m[0][1] from processor 0 and sends m[0][0] + m[1][0] to processor 2, and then calculates m[0][1] + m[1][1]. Processor 2 first receives m[0][0] + m[1][0] from processor 1 and then calculates m[0][0] + m[1][0] + m[2][0]. As the procedure continues, at the end of step 2, the checkpoint processor will be able to get its first segment of encoding m[0][0] + m[1][0] + m[2][0]. From now on, the checkpoint processor will be able to receive a segment of the encoding at the end of each step. After the checkpoint processor receives the last checkpoint encoding, the checkpoint is finished.
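The steps above translate almost directly into message-passing code. The sketch below assumes mpi4py and numpy, with arbitrary segment counts and sizes; it forms the chain with rank p − 1 as the checkpoint processor. For clarity it uses blocking calls, whereas a real implementation would overlap the receive of segment j + 1 with the send of segment j using non-blocking operations.

```python
# Sketch of chain-pipelined encoding along ranks 0, 1, ..., p-1 (assumptions: mpi4py, numpy).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()
t, s = 64, 4096                                   # t segments of s doubles each
m_local = np.random.rand(t, s) if rank < p - 1 else None
encoding = np.empty((t, s)) if rank == p - 1 else None

for j in range(t):                                # one segment flows down the chain per step
    if rank == 0:
        comm.Send(m_local[j], dest=1, tag=j)
    elif rank < p - 1:
        partial = np.empty(s)
        comm.Recv(partial, source=rank - 1, tag=j)         # m[0][j] + ... + m[rank-1][j]
        comm.Send(partial + m_local[j], dest=rank + 1, tag=j)
    else:
        comm.Recv(encoding[j], source=rank - 1, tag=j)     # finished segment of the encoding
```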
Overhead and Scalability Analysis In the chain-pipelined checkpoint encoding, the time for each step is Teach-step = α + βs + γs, where s is the size of the segment. The number of steps to encode and deliver t segments in a p processor system is t + p − 2. If we assume the size of data on each processor is m (= ts), then the total time for encoding and delivery is
Ttotal(s) = (t + p − 2)(α + βs + γs)
        = (m/s + p − 2)(α + βs + γs)
        = αm/s + (p − 2)(β + γ)s + (p − 2)α + (β + γ)m
        ≥ 2√((p − 2)α(β + γ)m) + (p − 2)α + (β + γ)m
        = (β + γ)m · (1 + Ο(√(p/m))).

The minimum is achieved when

s = √(αm / ((p − 2)(β + γ))).   (2)

Therefore, by choosing an optimal segment size, the chain-pipelined encoding algorithm is able to reduce the checkpoint overhead to tolerate a single failure from the binary-tree algorithm’s 2⌈log p⌉((β + γ)m + α) to (β + γ)m · (1 + Ο(√(p/m))).
In diskless checkpointing, the size of checkpoint m is often large (megabytes level). The latency α is often a very small number compared with the time to send a large message. If p is not too large, then Ttotal ≈ (β + γ)m. Therefore, in practice, the number of processors often has very little impact on the time to perform one checkpoint unless p is very large. If p does become very large, strategies in one dimensional weighted checksum scheme can be used to guarantee small latency related terms.
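To get a feel for these two models, the following small script evaluates them for a 25 MB per-processor checkpoint; the latency, per-byte transfer and summation rates are assumed example values, not measurements from this chapter.

```python
# Quick numeric comparison of the binary-tree and chain-pipelined encoding models.
import math

alpha, beta, gamma = 1e-5, 1e-8, 1e-9       # latency (s), s/byte transfer, s/byte summation (assumptions)
m = 25 * 2**20                              # 25 MB local checkpoint

for p in (64, 256, 1024, 4096):
    t_binary = 2 * math.ceil(math.log2(p)) * (alpha + (beta + gamma) * m)
    s_opt = math.sqrt(alpha * m / ((p - 2) * (beta + gamma)))   # optimal segment size (2)
    t_pipe = (m / s_opt + p - 2) * (alpha + (beta + gamma) * s_opt)
    print(f"p={p:5d}  binary-tree: {t_binary:.3f} s   chain-pipelined: {t_pipe:.3f} s")
```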
Coding to Tolerate Multiple Simultaneous Process Failures To tolerate multiple simultaneous process failures of arbitrary patterns with minimum process redundancy, a weighted checksum scheme can be used. A weighted checksum scheme can be viewed as a version of the Reed-Solomon erasure coding scheme (Plank, 1997) in the real number field. The basic idea of this scheme works as follows: each processor takes a local in-memory checkpoint, and M equalities are established by saving weighted checksums of the local checkpoints into M checksum processors. When f failures happen, where f ≤ M, the M equalities become M equations with f unknowns. By appropriately choosing the weights of the weighted checksums, the lost data on the f failed processors can be recovered by solving these M equations.
Figure 3. Chain-pipelined encoding for diskless checkpointing
The Basic Weighted Checksum Scheme Suppose there are n processors used for computation. Assume the checkpoint data on the ith computation processor is Pi. In order to be able to reconstruct the lost data on failed processors, another M processors are dedicated to hold M encodings (weighted checksums) of the checkpoint data (see Figure 4). The weighted checksum Cj on the jth checksum processor can be calculated from
a11 P1 + … + a1n Pn = C1
⋮
aM1 P1 + … + aMn Pn = CM   (3)
where aij, i = 1, 2, ..., M, j = 1, 2, ..., n, are the weights we need to choose. Let A = (aij)M×n. We call A the checkpoint matrix for the weighted checksum scheme. Suppose that k computation processors and M − h checkpoint processors have failed. Then there are n − k computation processors and h checkpoint processors that have survived. If we look at the data on the failed processors as unknowns, then (3) becomes M equations with M − (h − k) unknowns. If k > h, then there are fewer equations than unknowns. There is no unique solution for (3), and the lost data on the failed processors cannot be recovered. However, if k < h, then there are more equations than unknowns. By appropriately choosing A, a unique solution for (3) can be guaranteed, and the lost data on the failed processors can be recovered by solving (3). Let Ar denote the coefficient matrix of the linear system that needs to be solved to recover the lost data. Whether we can recover the lost data on the failed processes or not directly depends on whether Ar has full column rank or not. However, Ar can be any sub-matrix (including minor) of A depending on the distribution of the failed processors. If any square sub-matrix (including minor) of A is non-singular and no more than M processes have failed, then Ar can be guaranteed to have full column rank. Therefore, to be able to recover from no more than any M failures, the checkpoint matrix A has to satisfy the condition that any square sub-matrix (including minor) of A is non-singular. How can we find such matrices? It is well known that some structured matrices such as the Vandermonde matrix and the Cauchy matrix satisfy this condition (Golub & Van Loan, 1989). Let Tdiskless_pipeline(k, p) denote the encoding time to tolerate k simultaneous failures in a p-processor system using the chain-pipelined encoding algorithm and Tdiskless_binary(k, p) denote the corresponding encoding time using the binary-tree encoding algorithm.
Figure 4. Basic weighted checksum scheme for diskless checkpointing
When tolerating k simultaneous failures, k basic encodings have to be performed. Note that, in addition to the summation operation, there is an additional multiplication operation involved in (3). Therefore, the computation time for each number will increase from γ to 2γ. Hence, when the binary-tree encoding algorithm is used to perform the weighted checksum encoding, the time for one basic encoding is 2⌈log p⌉((β + 2γ)m + α). Therefore, the time for k basic encodings is
Tdiskless-binary(k, p) = k · 2⌈log p⌉(α + (β + 2γ)m) ≈ 2⌈log p⌉ · k(β + 2γ)m.

When the chain-pipelined encoding algorithm is used to perform the checkpoint encoding, the overhead to tolerate k simultaneous failures becomes
Tdiskless-pipeline(k, p) = k · (1 + Ο(√(p/m))) · (β + 2γ)m = (1 + Ο(√(p/m))) · k(β + 2γ)m.

When the number of processors p is not too large, the overhead for the basic weighted checksum scheme is Tdiskless-pipeline(k, p) ≈ k(β + 2γ)m. However, in today’s large computing systems, the number of processors p may become very large. If we do have a large number of processors in the computing systems, either the one dimensional weighted checksum scheme or the localized weighted checksum scheme discussed in the following can be used.
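The encoding and the recovery described by (3) can be prototyped in a few lines of numpy. The following is a serial stand-in for the distributed scheme; the sizes, the simulated failures and the random data are assumptions for illustration only.

```python
# Sketch of the basic weighted checksum scheme: encode with a random weight
# matrix, then recover failed rows by solving the remaining linear system.
import numpy as np

n, M, m = 8, 3, 1000                     # computation procs, checksum procs, checkpoint size
P = np.random.rand(n, m)                 # row i = local checkpoint of processor i
A = np.random.randn(M, n)                # checkpoint (encoding) matrix
C = A @ P                                # M weighted checksums, one per checksum processor

failed = [2, 5]                          # simulate two failed computation processors
survivors = [i for i in range(n) if i not in failed]

# (3) with the failed rows as unknowns:  A[:, failed] @ X = C - A[:, survivors] @ P_surv
rhs = C - A[:, survivors] @ P[survivors]
X, *_ = np.linalg.lstsq(A[:, failed], rhs, rcond=None)
print("max recovery error:", np.abs(X - P[failed]).max())
```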
One Dimensional Weighted Checksum Scheme The one dimensional weighted checksum scheme works as follows. Assume the program is running on p = g×s processors. Partition the g×s processors into s groups with g processors in each group. Dedicate another M checksum processors to each group. In each group, the checkpoints are done using the basic weighted checksum scheme (see Figure 5). This scheme can survive M processor failures in each group. The advantage of this scheme is that the checkpoints are localized to a subgroup of processors, so the checkpoint encoding in each sub-group can be performed in parallel. Therefore, compared with the basic weighted checksum scheme, the performance of the one dimensional weighted checksum scheme is usually better. By using a pipelined encoding algorithm in each subgroup, the time to tolerate k simultaneous failures in a p-processor system is now reduced to
Tdiskless-pipeline(k, p) = Tdiskless-pipeline(k, g) = (1 + Ο(√(g/m))) · k(β + 2γ)m,

which is independent of the total number of processors p in the computing system. Therefore, in this fault tolerance scheme, the overhead to survive k failures in a p-processor system does not increase as the total number of processors p increases. It is in this sense that the sub-group based chain-pipelined checkpoint encoding algorithm is a scalable recovery algorithm.
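In message-passing terms, the grouping amounts to splitting the global communicator and running the basic encoding independently inside each sub-communicator, as in the following mpi4py sketch (group size, data sizes, and the use of a single checksum processor per group are arbitrary simplifying assumptions):

```python
# Sketch of group-local encoding: each group encodes concurrently, so the
# encoding time depends on the group size g rather than on the total p.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
g = 4                                           # computation processors per group
group_id = rank // (g + 1)                      # each group also owns one checksum rank
sub = comm.Split(color=group_id, key=rank)      # independent communicator per group

is_ckpt = (sub.Get_rank() == sub.Get_size() - 1)
local = np.zeros(1 << 16) if is_ckpt else np.random.rand(1 << 16)
enc = np.empty(1 << 16) if is_ckpt else None
sub.Reduce(local, enc, op=MPI.SUM, root=sub.Get_size() - 1)
```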
CHECKPOINT-FREE FAULT TOLERANCE FOR MATRIX MULTIPLICATION It has been proved in previous research (Huang & Abraham, 1984) that, for some matrix operations, the checksum relationship in input checksum matrices is preserved in the final computation results at the end of the operation. Based on this checksum relationship in the final computation results, Huang and Abraham have developed the famous algorithm-based fault tolerance (ABFT) (Huang & Abraham, 1984) technique to detect, locate, and correct certain processor miscalculations in matrix computations with low overhead. The algorithm-based fault tolerance proposed in (Huang & Abraham, 1984) was later extended by many researchers (Anfinson & Luk, 1988), (Banerjee, Rahmeh, Stunkel, Nair, Roy, Balasubramanian & Abraham, 1990), (Balasubramanian & Banerjee, 1990), (Boley, Brent, Golub & Luk, 1992), (Luk & Park, 1986). In order to be able to recover from a fail-stop process failure in the middle of the computation, a global consistent state of the application is often required when a process failure occurs. Checkpointing and message logging are typical approaches to maintain or construct such a global consistent state in a distributed environment. But if there exists a checksum relationship between application data on different processes, such a checksum relationship can actually be treated as a global consistent state. However, it is still an open problem whether the checksum relationship in the input checksum matrices in ABFT can be maintained during computation or not. Therefore, whether ABFT can be extended to tolerate fail-stop process failures in a distributed environment or not remains open. In this section, we first demonstrate that, for many matrix matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not preserved during computation. We then prove, however, that for the outer product version matrix matrix multiplication algorithm, it is possible to maintain the checksum relationship in the input checksum matrices during computation. Based on this checksum relationship maintained during computation, we demonstrate that it is possible to tolerate fail-stop process failures (which are typically tolerated by checkpointing or message logging) in distributed outer product version matrix multiplication without checkpointing. Because no periodical checkpoint or rollback-recovery is involved in this approach, process failures can often be tolerated with a surprisingly low overhead.
Figure 5. One dimensional weighted checksum scheme for diskless checkpointing
Maintaining Checksum at the End of Computation For any general m-by-n matrix A defined by
A = (aij), 0 ≤ i ≤ m − 1, 0 ≤ j ≤ n − 1.

The column checksum matrix Ac of the matrix A is the (m + 1)-by-n matrix obtained from A by appending a checksum row whose jth element is the column sum a0j + a1j + … + am−1,j.

The row checksum matrix Ar of the matrix A is the m-by-(n + 1) matrix obtained from A by appending a checksum column whose ith element is the row sum ai0 + ai1 + … + ai,n−1.

The full checksum matrix Af of the matrix A is the (m + 1)-by-(n + 1) matrix obtained from A by appending both the checksum row and the checksum column, whose corner element is the sum of all elements of A.
Theorem 1: Assume A, B, and C are three matrices. If AB = C, then Ac Br = Cf.

Proof: Assume A is m-by-n and B is n-by-k; then C is m-by-k. Let e denote a column vector of all ones of the appropriate dimension (so that the products eT A, Be, and Ce below are well defined); then
Ac = [A; eT A],   Br = [B, Be],   Cf = [C, Ce; eT C, eT Ce],

where [X; Y] denotes stacking X on top of Y and [X, Y] denotes placing X and Y side by side. Therefore

Ac Br = [A; eT A] [B, Be] = [AB, ABe; eT AB, eT ABe] = [C, Ce; eT C, eT Ce] = Cf.
Theorem 1 was first proved by Huang and Abraham in (Huang & Abraham, 1984). We repeat the proof here to point out that it is independent of the algorithm used for the matrix matrix multiplication operation. Therefore, no matter which algorithm is used to perform the matrix matrix multiplication, the checksum relationship of the input matrices will always be preserved in the final computation results at the end of the computation. Based on this checksum relationship in the final computation results, the low-overhead algorithm-based fault tolerance technique has been developed in (Huang & Abraham, 1984) to detect, locate, and correct certain processor miscalculations in matrix computations.
Is the Checksum Maintained During Computation? Algorithm-based fault tolerance usually detects, locates, and corrects errors at the end of the computation. But in today’s high performance computing environments such as MPI, after a fail-stop process failure occurs in the middle of the computation, it is often required to recover from the failure first before continuing the rest of the computation. In order to be able to recover from fail-stop failures that occur in the middle of the computation, a global consistent state of an application is often required in the middle of the computation. The checksum relationship, if it exists, can actually be treated as such a global consistent state. However, from Theorem 1, it is still uncertain whether the checksum relationship is preserved in the middle of the computation or not. In what follows, we demonstrate that, for both Cannon’s algorithm and Fox’s algorithm for matrix matrix multiplication, the checksum relationship in the input checksum matrices is generally not preserved in the middle of the computation. Assume A is an (n − 1)-by-n matrix and B is an n-by-(n − 1) matrix. Then Ac = (aij)n×n, Br = (bij)n×n, and Cf = Ac * Br are all n-by-n matrices. For convenience of description, but without loss of generality, assume there are n² processors, with each processor storing one element from Ac, Br, and Cf, respectively. The n² processors are organized into an n-by-n processor grid.
Consider using the cannon’s algorithm (Cannon, 1969) in Figure 6 to perform Ac * Br in parallel on an n-by-n processor grid. We can prove the following Theorem 2. Theorem 2: If the cannon’s algorithm in Figure 6 is used to perform parallel matrix matrix multiplication, then there exist matrices A and B such that, at the end of each step s, where s = 0, 1, 2, · · ·, n − 2, the matrix C = (cij) is not a full checksum matrix. When the cannon’s algorithm in Figure 6 is used to perform Ac * Br in parallel for general matrix A and B, it can be proved that at the end of the sth step s
cij = ∑ ai , (i + j + k ) mod n ∗ b(i + j + k ) mod n, j k =0
It can be verified that C = (cij)n×n is not a full checksum matrix unless s = n − 1, which is the end of the computation. Therefore the checksum relationship in the matrix C is generally not preserved during computation in Cannon’s algorithm for matrix multiplication. It can also be demonstrated that the checksum relationship in the matrix C is not preserved during computation in many other parallel algorithms for matrix matrix multiplication, such as Fox’s algorithm.
Figure 6. Matrix matrix multiplication by Cannon’s algorithm with checksum matrices
Figure 7. Matrix-matrix multiplication by outer product algorithm with input checksum matrices
An Algorithm That Maintains the Checksum during Computation Although the checksum relationship of the input matrices is preserved in the final results at the end of the computation no matter which algorithm is used, the last subsection shows that the checksum relationship is not necessarily preserved during computation. However, it is interesting to ask: is there any algorithm that preserves the checksum relationship during computation? Consider using the outer product version algorithm (Golub & Van Loan, 1989) in Figure 7 to perform Ac * Br in parallel. Assume the matrices Ac, Br, and Cf have the same data distribution scheme as before.

Theorem 3: If the algorithm in Figure 7 is used to perform the parallel matrix matrix multiplication, then the matrix C = (cij)n×n is a full checksum matrix at the end of each step s, where s = 0, 1, 2, · · ·, n − 1.

Proof: It is trivial to show that, at the end of the sth step in Figure 7, the cij in the algorithm satisfies

cij = ∑_{k=0}^{s} aik ∗ bkj,
where i, j = 0, 1, 2, · · ·, n − 1. Note that Ac is a column checksum matrix, therefore

∑_{t=0}^{n−2} atj = an−1, j

for j = 0, 1, 2, · · ·, n − 1. Since Br is a row checksum matrix, we have

∑_{t=0}^{n−2} bit = bi, n−1

for i = 0, 1, 2, · · ·, n − 1. Therefore, for all j = 0, 1, 2, · · ·, n − 1, we have

∑_{t=0}^{n−2} ctj = ∑_{t=0}^{n−2} ∑_{k=0}^{s} atk ∗ bkj = ∑_{k=0}^{s} (∑_{t=0}^{n−2} atk) ∗ bkj = ∑_{k=0}^{s} an−1,k ∗ bkj = cn−1, j.

Similarly, for all i = 0, 1, 2, · · ·, n − 1, we have

∑_{t=0}^{n−2} cit = ∑_{t=0}^{n−2} ∑_{k=0}^{s} aik ∗ bkt = ∑_{k=0}^{s} aik ∗ (∑_{t=0}^{n−2} bkt) = ∑_{k=0}^{s} aik ∗ bk,n−1 = ci, n−1.

Therefore, we can conclude that C is a full checksum matrix. Hence, at the end of each step s, where s = 0, 1, 2, · · ·, n − 1, C = (cij)n×n is a full checksum matrix.

Theorem 3 implies that a coded global consistent state of the critical application data (i.e. the checksum relationship in Ac, Br, and Cf) can be maintained in memory at the end of each iteration in the outer product version matrix matrix multiplication if we perform the computation with the checksum input matrices. However, in a high performance distributed environment, different processes may update their data in local memory asynchronously. Therefore, if a failure happens at a time when some processes have updated their local matrix in memory and other processes are still in the communication stage, then the checksum relationship in the distributed matrix will be damaged and the data on all processes will not form a global consistent state. But this problem can be solved by simply performing a synchronization before performing the local memory update. Therefore, it is possible to maintain a coded global consistent state (i.e. the checksum relationship) of the matrices Ac, Br and Cf in the distributed memory at any time during computation. Hence, a single fail-stop process failure in the middle of the computation can be recovered from the checksum relationship. Note that it is also the outer product version algorithm that is often used in today’s high performance computing practice. The outer product version algorithm is more popular due to both its simplicity and its efficiency on modern high performance computer architectures. In the widely used parallel numerical
linear algebra library ScaLAPACK (Blackford, Choi, Cleary, Petitet, Whaley, Demmel, et al., 1996), it is also the outer product version algorithm that is chosen to perform the matrix matrix multiplication. More importantly, it can also be proved that a similar checksum relationship exists for the outer product version of many other matrix operations (such as Cholesky and LU factorization).
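The following numpy sketch illustrates Theorem 3 and the resulting checkpoint-free recovery on a single machine. The matrix sizes and the injected loss are arbitrary assumptions; a distributed implementation would instead operate on the block-distributed Ac, Br and Cf.

```python
# Outer-product multiplication on checksum matrices: the full checksum property
# of C holds after every step, so lost data can be rebuilt mid-computation.
import numpy as np

n = 5
A = np.random.rand(n - 1, n)
B = np.random.rand(n, n - 1)
Ac = np.vstack([A, A.sum(axis=0)])                  # column checksum matrix
Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row checksum matrix
C = np.zeros((n, n))

def is_full_checksum(M):
    return (np.allclose(M[:-1].sum(axis=0), M[-1]) and
            np.allclose(M[:, :-1].sum(axis=1), M[:, -1]))

for s in range(n):                                  # outer product algorithm
    C += np.outer(Ac[:, s], Br[s, :])
    assert is_full_checksum(C)                      # Theorem 3: holds at every step

# Recovery example: if row r of C is lost, it equals (checksum row) - (other rows).
r = 2
lost = C[r].copy()
C[r] = 0
C[r] = C[-1] - C[:-1].sum(axis=0)
assert np.allclose(C[r], lost)
```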
PRACTICAL NUMERICAL ISSUES Both the encoding schemes introduced in the scalable checkpointing and the algorithm-based fault tolerance presented before involve solving systems of linear equations to recover from multiple simultaneous process failures. Therefore, the practical numerical issues involved in recovering from multiple simultaneous process failures have to be addressed.
Numerical Stability of Real Number Codes From the previous section, it has been derived that, to be able to recover from any no more than M failures, the encoding matrix A has to satisfy: any square sub-matrix (including minor) of A is non-singular. This requirement for the encoding matrix coincides with the properties of the generator matrices of real number Reed-Solomon style erasure correcting codes. In fact, our weighted checksum encoding discussed before can be viewed as a version of the Reed-Solomon erasure coding scheme (Plank, 1997) in the real number field. Therefore any generator matrix from real number Reed-Solomon style erasure codes can actually be used as the encoding matrix of algorithm-based checkpoint-free fault tolerance. In the existing real number or complex-number Reed-Solomon style erasure codes in the literature, the generator matrices mainly include: the Vandermonde matrix (Vander), the Vandermonde-like matrix for the Chebyshev polynomials (Chebvand), the Cauchy matrix (Cauchy), the Discrete Cosine Transform matrix (DCT), and the Discrete Fourier Transform matrix (DFT). Theoretically, these generator matrices can all be used as the encoding matrix of the algorithm-based checkpoint-free fault tolerance scheme. However, in computer floating point arithmetic where no computation is exact due to round-off errors, it is well known (Golub & Van Loan, 1989) that, in solving a linear system of equations, a condition number of 10^k for the coefficient matrix leads to a loss of accuracy of about k decimal digits in the solution. Therefore, in order to get a reasonably accurate recovery, the encoding matrix A actually has to satisfy the stronger condition that any square sub-matrix (including minor) of A is well-conditioned. The generator matrices from the above real number or complex-number Reed-Solomon style erasure codes all contain ill-conditioned sub-matrices. Therefore, in these codes, when certain error patterns occur, an ill-conditioned linear system has to be solved to reconstruct an approximation of the original information, which can cause the loss of precision of possibly all digits in the recovered numbers.
Numerically Good Real Number Codes Based on Random Matrices In this section, we will introduce a class of new codes that are able to reconstruct a very good approximation of the lost data with high probability regardless of the failure pattern of the processes. Our new codes are based on random matrices over the real number field. It is well-known (Edelman, 1988) that Gaussian random matrices are well-conditioned. To estimate how well conditioned Gaussian random matrices are, we have proved the following theorem:
Theorem 4: Let Gm×n be an m×n real random matrix whose elements are independent and identically distributed standard normal random variables, and let κ2(Gm×n) be the 2-norm condition number of Gm×n. Then, for any m ≥ 2, n ≥ 2 and x ≥ |n − m| + 1, κ2(Gm×n) satisfies

(1/√(2π)) (c/x)^(|n−m|+1) < P{ κ2(Gm×n) / (n/(|n − m| + 1)) > x } < (1/√(2π)) (C/x)^(|n−m|+1),

and

E(ln κ2(Gm×n)) < ln (n/(|n − m| + 1)) + 2.258,
where 0.245 ≤ c ≤ 2.000 and 5.013 ≤ C ≤ 6.414 are universal positive constants independent of m, n and x. Due to the length of the proof for Theorem 4, we omit the proof here and refer interested readers to (Chen & Dongarra, 2005) for the complete proof. Note that any sub-matrix of a Gaussian random matrix is still a Gaussian random matrix. Therefore, a Gaussian random matrix satisfies, with high probability, the condition that any sub-matrix of the matrix is well-conditioned. Theorem 4 can be used to estimate the accuracy of recovery in the weighted checksum scheme. For example, if an application uses 100,000 processes to perform computation and 20 processes to hold encodings, then the encoding matrix is a 20 by 100,000 Gaussian random matrix. If 10 processors fail concurrently, then the coefficient matrix in the recovery algorithm is a 20 by 10 Gaussian random matrix. From Theorem 4, we can get E(log10 κ2(Ar)) < 1.25 and P{κ2(Ar) > 100} < 3.1 × 10−11. Therefore, on average, we will lose about one decimal digit in the recovered data, and the probability of losing 2 digits is less than 3.1 × 10−11.
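A small Monte Carlo experiment makes the practical meaning of Theorem 4 concrete. The matrix shape below mirrors the 20-by-10 recovery matrix of the example in the text, but the number of trials and the random seed are arbitrary assumptions.

```python
# Empirical condition numbers of 20-by-10 Gaussian random recovery matrices.
import numpy as np

rng = np.random.default_rng(0)
M, failures, trials = 20, 10, 10000
conds = np.array([np.linalg.cond(rng.standard_normal((M, failures)))
                  for _ in range(trials)])

# Roughly the number of decimal digits lost when solving the recovery system:
print("mean log10 condition number:", np.log10(conds).mean())
# Per Theorem 4 this probability is below 3.1e-11, so no trials should exceed it:
print("fraction with cond > 100   :", (conds > 100).mean())
```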
EXPERIMENTAL EVALUATION In this section, we evaluate the performance of the introduced fault tolerance schemes experimentally.
Performance of the Chain-Pipelined Checkpoint Encoding In this section, we evaluate the scalability of the proposed chain-pipelined checkpoint encoding algorithm using a preconditioned conjugate gradient (PCG) equation solver (Barrett, Berry, Chan, Demmel, Donato, Dongarra, Eijkhout, et al., 1994). The basic weighted checksum scheme is incorporated into our PCG code. The checkpoint encoding matrix we used is a pseudo random matrix. The programming environment we used is FT-MPI (Fagg & Dongarra, 2000), (Fagg, et al., 2004), (Fagg, et al., 2005). A process failure is simulated by killing one process in the middle of the computation. The lost data on the failed process is recovered by solving the checksum equation.
We fix the number of simultaneous processor failures and increase the total number of processors for computing. But the problems to solve are chosen very carefully such that the size of the checkpoint on each processor is always the same (about 25 Megabytes) in every experiment. By keeping the size of the checkpoint per processor fixed, we are able to observe the impact of the total number of computing processors on the performance of the checkpointing. In all experiments, we performed a checkpoint every 100 iterations and ran PCG for 2000 iterations. In practice, there is an optimal checkpoint interval which depends on the failure rate, the time cost of each checkpoint and the time cost of each recovery. Much literature about the optimal checkpoint interval (Gelenbe, 1979), (Plank & Thomason, 2001), (Young, 1974) is available. We will not address this issue further here. Figure 8 reports both the checkpoint overhead (for one checkpoint) and the recovery overhead (for one recovery) for tolerating 4 simultaneous process failures on an IBM RS/6000 with 176 Winterhawk II thin nodes (each with four 375 MHz Power3-II processors). The number of checkpoint processors in the experiment is four. We simulate a failure of four simultaneous processors by killing four processes during the execution. Figure 8 demonstrates that both the checkpoint overhead and the recovery overhead are very stable as the total number of computing processes increases from 60 to 480. This is consistent with our theoretical result in the previous section.
Performance of the Algorithm-Based Checkpoint-Free Fault Tolerance In this section, we experimentally evaluate the performance overhead of applying this checkpoint-free fault tolerance technique to tolerate a single fail-stop process failure in the widely used ScaLAPACK matrix-matrix multiplication kernel. The size of the problems and the number of computation processes used in our experiments are listed in Figure 9. All experiments were performed on a cluster of 64 dual-processor nodes with AMD Opteron(tm) Processor 240. Each node of the cluster has 2 GB of memory and runs a Linux operating system. The nodes are connected with Myrinet. The timer we used in all measurements is MPI Wtime. The programming environment we used is FT-MPI (Fagg, et al., 2005). When no failure occurs, the total overhead equals the overhead for calculating the encoding

Figure 8. Scalability of the checkpoint encoding and recovery decoding
Figure 9. Experiment configurations
Figure 10. The total overhead (time) for fault tolerance
at the beginning plus the overhead of performing the computation with larger (checksum) matrices. If failures occur, then the total performance overhead equals the overhead without failures plus the overhead for recovering the FT-MPI environment and the overhead for recovering the application data from the checksum relationship. Figure 10 reports the execution times of the original matrix-matrix multiplication, the fault tolerant version matrix-matrix multiplication without failures, and the fault tolerant version matrix-matrix multiplication with a single fail-stop process failure. Figure 11 reports the total fault tolerance overhead (%).
Figure 11 demonstrates that, as the number of processors increases, the total overhead (%) decreases. This is because, as the number of processors increases, the time overhead is fairly stable but the total amount of time to solve a problem increases. The percentage overhead equals the time overhead divided by the total amount of time to solve the problem.
CONCLUSION AND FUTURE WORK In this chapter, we presented two scalable fault tolerance techniques for large-scale high performance computing. The introduced techniques are scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. Experimental results demonstrate that the introduced techniques scale well as the total number of processors increases.
REFERENCES Adiga, N. R., et al. (2002). An overview of the BlueGene/L supercomputer. In Proceedings of the Supercomputing Conference (SC’2002), Baltimore MD, USA, (pp. 1–22). Anfinson, J., & Luk, F. T. (1988, December). A Linear Algebraic Model of Algorithm-Based Fault Tolerance. IEEE Transactions on Computers, 37(12), 1599–1604. doi:10.1109/12.9736 Balasubramanian, V., & Banerjee, P. (1990). Compiler-Assisted Synthesis of Algorithm-Based Checking in Multiprocessors. IEEE Transactions on Computers, C-39, 436–446. doi:10.1109/12.54837
Figure 11. The total overhead (%) for fault tolerance
Banerjee, P., Rahmeh, J. T., Stunkel, C. B., Nair, V. S. S., Roy, K., & Balasubramanian, V. (1990). Algorithmbased fault tolerance on a hypercube multiprocessor. IEEE Transactions on Computers, C-39, 1132–1145. doi:10.1109/12.57055 Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., et al. (1994). Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition., Philadelphia, PA: SIAM. Blackford, L. S., Choi, J., Cleary, A., Petitet, A., & Whaley, R. C. Demmel, et al. (1996). ScaLAPACK: a portable linear algebra library for distributed memory computers - design issues and performance. In Supercomputing ’96: Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM), (p. 5). Boley, D. L., Brent, R. P., Golub, G. H., & Luk, F. T. (1992). Algorithmic fault tolerance using the lanczos method. SIAM Journal on Matrix Analysis and Applications, 13, 312–332. doi:10.1137/0613023 Cannon, L. E. (1969). A cellular computer to implement the kalman filter algorithm. Ph.D. thesis, Montana State University, Bozeman, MT. Chen, Z., & Dongarra, J. (2005). Condition numbers of gaussian random matrices. SIAM Journal on Matrix Analysis and Applications, 27(3), 603–620. doi:10.1137/040616413 Chiueh, T., & Deng, P. (1996). Evaluation of checkpoint mechanisms for massively parallel machines. In FTCS, (pp. 370–379). Dongarra, J., Meuer, H., & Strohmaier, E. (2004). TOP500 Supercomputer Sites, 24th edition. In Proceedings of the Supercomputing Conference (SC’2004), Pittsburgh PA. New York: ACM. Edelman, A. (1988). Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix Analysis and Applications, 9(4), 543–560. doi:10.1137/0609045 Fagg, G. E., & Dongarra, J. (2000). FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In PVM/MPI 2000, (pp. 346–353). Fagg, G. E., Gabriel, E., Bosilca, G., Angskun, T., Chen, Z., Pjesivac-Grbovic, J., et al. (2004). Extending the MPI specification for process fault tolerance on high performance computing systems. In Proceedings of the International Supercomputer Conference, Heidelberg, Germany. Fagg, G. E., Gabriel, E., Chen, Z., Angskun, T., Bosilca, G., & Pjesivac-Grbovic, J. (2005). Process faulttolerance: Semantics, design and applications for high performance computing. [Winter.]. International Journal of High Performance Computing Applications, 19(4), 465–477. doi:10.1177/1094342005056137 Fox, G. C., Johnson, M., Lyzenga, G., Otto, S. W., Salmon, J., & Walker, D. (1988). Solving Problems on Concurrent Processors: Vol. 1. Englewood Cliffs, NJ: Prentice-Hall. Gelenbe, E. (1979). On the optimum checkpoint interval. Journal of the ACM, 26(2), 259–270. doi:10.1145/322123.322131 Golub, G. H., & Van Loan, C. F. (1989). Matrix Computations. Baltimore, MD: The John Hopkins University Press.
Huang, K.-H., & Abraham, J. A. (1984). Algorithm-based fault tolerance for matrix operations. EEE Transactions on Computers, C-33, 518–528. doi:10.1109/TC.1984.1676475 Kim, Y. (1996, June). Fault Tolerant Matrix Operations for Parallel and Distributed Systems. Ph.D. dissertation, University of Tennessee, Knoxville. Luk, F. T., & Park, H. (1986). An analysis of algorithm-based fault tolerance techniques. SPIE Adv. Alg. and Arch. for Signal Proc., 696, 222–228. Message Passing Interface Forum. (1994). MPI: A Message Passing Interface Standard. (Technical Report ut-cs-94-230), University of Tennessee, Knoxville, TN. Plank, J. S. (1997, September). A tutorial on Reed-Solomon coding for fault-tolerance in RAIDlike systems. Software, Practice & Experience, 27(9), 995–1012. doi:10.1002/(SICI)1097024X(199709)27:9<995::AID-SPE111>3.0.CO;2-6 Plank, J. S., Kim, Y., & Dongarra, J. (1997). Fault-tolerant matrix operations for networks of workstations using diskless checkpointing. Journal of Parallel and Distributed Computing, 43(2), 125–138. doi:10.1006/jpdc.1997.1336 Plank, J. S., & Li, K. (1994). Faster checkpointing with n+1 parity. In FTCS, (pp. 288–297). Plank, J. S., Li, K., & Puening, M. A. (1998). Diskless checkpointing. IEEE Transactions on Parallel and Distributed Systems, 9(10), 972–986. doi:10.1109/71.730527 Plank, J. S., & Thomason, M. G. (2001, November). Processor allocation and checkpoint interval selection in cluster computing systems. Journal of Parallel and Distributed Computing, 61(11), 1570–1590. doi:10.1006/jpdc.2001.1757 Silva, L. M., & Silva, J. G. (1998). An experimental study about diskless checkpointing. In EUROMICRO’98, (pp. 395–402). Young, J. W. (1974). A first order approximation to the optimal checkpoint interval. Communications of the ACM, 17(9), 530–531. doi:10.1145/361147.361115
KEY TERMS AND DEFINITIONS Checkpointing: Checkpointing is a class of techniques to incorporate fault tolerance into a system. Erasure Correction Codes: An erasure correction code transforms a message of n blocks into a message with more than n blocks, such that the original message can be recovered from a subset of those blocks. Fail-Stop Failure: A fail-stop failure is a type of failure that causes the component of a system experiencing it to stop operating. Fault Tolerance: Fault tolerance is the property of a system that enables it to continue operating properly after a failure has occurred in the system. Message Passing Interface: Message Passing Interface is a specification for an API that allows different processes to communicate with one another.
Parallel and Distributed Computing: Parallel and distributed computing is a sub-field of computer science that handles computing involving more than one processing unit. Pipeline: A pipeline is a set of data processing elements connected in series so that the output of one element is the input of the next one.
Section 10
Applications
Chapter 34
Efficient Update Control of Bloom Filter Replicas in Large Scale Distributed Systems Yifeng Zhu University of Maine, USA Hong Jiang University of Nebraska – Lincoln, USA
ABSTRACT This chapter discusses the false rates of Bloom filters in a distributed environment. A Bloom filter (BF) is a space-efficient data structure to support probabilistic membership query. In distributed systems, a Bloom filter is often used to summarize local services or objects and this Bloom filter is replicated to remote hosts. This allows remote hosts to perform fast membership query without contacting the original host. However, when the services or objects are changed, the remote Bloom replica may become stale. This chapter analyzes the impact of staleness on the false positive and false negative for membership queries on a Bloom filter replica. An efficient update control mechanism is then proposed based on the analytical results to minimize the updating overhead. This chapter validates the analytical models and the update control mechanism through simulation experiments.
INTRODUCTION TO BLOOM FILTERS A standard Bloom filter (BF) (Bloom, 1970) is a lossy but space-efficient data structure to support membership queries within a constant delay. As shown in Figure 1, a BF includes k independent random hash functions and a vector B of a length of m bits. It is assumed that the BF represents a finite set S = {x1, x2,…,xn} of n elements from a universe U . The hash functions hi(x), 1 ≤ i ≤ k, map the universe U to the bit address space [1,m], shown as follows, H(x) = {hi(x) | 1 ≤ hi(x) ≤ m for 1 ≤ i ≤ k}
DOI: 10.4018/978-1-60566-661-7.ch034
Copyright © 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Figure 1. A Bloom filter with a bit vector of m bits, and k independent hash functions. When an element x is added into the set represented, all bits indexed by those hash functions are set to 1.
Definition 1. For all x ∈ U, B[H(x)] ≡ {B[hi(x)] | 1 ≤ i ≤ k}. This notation facilitates the description of operations on the subset of B addressed by the hash functions. For example, B[H(x)] = 1 represents the condition in which all the bits in B at the positions h1(x),…, and hk(x) are 1. "Setting B[H(x)]" means that the bits at these positions in B are set to 1. Representing the set S using a BF B is fast and simple. Initially, all the bits in B are set to 0. Then for each x ∈ S, an operation of setting B[H(x)] is performed. Given an element x, to check whether x is in S, one only needs to test whether B[H(x)] = 1. If not, then x is not a member of S; if yes, x is conjectured to be in S. Figure 1 shows the result after the element x is inserted into the Bloom filter. A standard BF has two well-known properties, described by the following two theorems.
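Before stating those properties, the construction just described can be summarized in a short, illustrative Python sketch (the double-hashing scheme, the class name and all parameters below are our own assumptions, not code from the chapter):

```python
import hashlib

class StandardBloomFilter:
    """Minimal standard Bloom filter: an m-bit vector B and k hash functions."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m                      # the bit vector B, initially all zeros

    def _positions(self, x):
        # H(x) = {h_1(x), ..., h_k(x)}: k indexes derived from two base hashes
        # (double hashing) -- an assumed, illustrative choice of hash functions.
        d = hashlib.sha256(str(x).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") or 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, x):
        for p in self._positions(x):             # "setting B[H(x)]"
            self.bits[p] = 1

    def query(self, x):
        # answer "yes" only if B[H(x)] = 1, i.e. every indexed bit is one
        return all(self.bits[p] for p in self._positions(x))
```

After bf.add("obj42"), bf.query("obj42") is guaranteed to return True (no false negatives), while bf.query("other") may occasionally return True, which is exactly the false positive event analyzed below.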
Theorem 1. Zero False Negative
For ∀x ∈ U, if ∃i such that B[hi(x)] ≠ 1, then x ∉ S.
For a static set S whose elements are not dynamically deleted, the bit vector indexed by those hash functions never returns a false negative. The proof is easy and is not given in this chapter.
Theorem 2. Possible False Positive
For ∀x ∈ U, if B[H(x)] = 1, then there is a small probability f^+ that x ∉ S. This probability is called the false positive rate, and f^+ ≈ (1 − e^{−kn/m})^k. Given a specific ratio of m/n, f^+ is minimized when k = (m/n) ln 2, and f^+_min ≈ (0.6185)^{m/n}.
Proof: The proof is based on the mathematical model proposed in (James, 1983; McIlroy, 1982). Detailed proof can be found in (Li et al., 2000; Michael, 2002). For the convenience of the reader, the proof is briefly presented here. After inserting n elements into a BF, the probability that a bit is zero is given by:
Figure 2. Expected false positive rate in a standard Bloom filter. A false positive is due to the collision of hash functions, where all indexed bits happen to be set by other elements.
P_0(n) = (1 − 1/m)^{kn} ≈ e^{−kn/m}.   (2)
Thus the probability that k bits are all set to 1 is

P(k bits set) = (1 − (1 − 1/m)^{kn})^k ≈ (1 − e^{−kn/m})^k.   (3)
Assuming each element is equally likely to be accessed and |S| ≪ |U|, then the false positive rate is

f^+ = (1 − |S|/|U|) · P(k bits set) ≈ (1 − e^{−kn/m})^k.   (4)

Given a specific ratio of m/n, i.e., the number of bits per element, it can be proved that the false positive rate f^+ is minimized when k = (m/n) ln 2, and the minimal false positive rate is, as has been shown in (Michael, 2002),

f^+ ≈ 0.5^k = (0.6185)^{m/n}.   (5)
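For a quick numerical check of Equations 4 and 5, the rate and the optimal number of hash functions can be computed directly (hypothetical helper functions; the names are ours):

```python
import math

def false_positive_rate(m, n, k):
    """Equation 4: f+ ~= (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Equation 5: k = (m/n) ln 2 minimizes f+."""
    return (m / n) * math.log(2)

# With 8 bits per element and k = 6 hash functions, f+ is about 0.0216.
print(false_positive_rate(m=8, n=1, k=6))   # ~0.0216
print(optimal_k(m=8, n=1))                  # ~5.55, so k = 5 or 6 in practice
```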
The key advantage of a Bloom filter is that its storage requirement falls several orders of magnitude below the lower bounds of error-free encoding structures. This space efficiency is achieved at the cost of allowing a certain (typically non-zero) probability of false positives, that is, it may incorrectly return a "yes" although x is actually not in S. Appropriately adjusting m and k can reduce this probability of false positives to a sufficiently small value so that the benefits from the space and time efficiency far outweigh the penalty incurred by false positives in many applications. For example, when the bit-element ratio is 8 and the number of hash functions is 6, the expected false positive rate is only 0.0216. Figure 2 shows the false positive rate under different configurations. In order to represent a dynamic set that is changing over time, (Li et al., 2000) proposes a variant named the counting BF. A counting BF includes an array in which each entry is not a bit but rather a counter consisting of several bits. Counting Bloom filters can support element deletion operations. Let C = {cj | 1 ≤ j ≤ m} denote the counter vector, where the counter cj represents the difference between the number of setting and unsetting operations made to the bit B[j]. All counters cj for 1 ≤ j ≤ m are initialized to zero. When an element x is inserted or deleted, the counters C[H(x)] are incremented or decremented by one, respectively. If cj changes its value from one to zero, B[j] is reset to zero. While this counter array consumes some memory space, (Li et al., 2000) shows that 4 bits per counter keep the probability of overflow minuscule even with several hundred million elements in a BF.
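A counting BF of this kind can be sketched as follows (an illustrative sketch only; the class layout, saturating 4-bit counters and hash construction are our own assumptions, not the implementation of Li et al., 2000):

```python
import hashlib

class CountingBloomFilter:
    """Counting BF: each of the m positions keeps a small counter (e.g. 4 bits),
    so that elements can be deleted as well as inserted."""
    def __init__(self, m, k, counter_bits=4):
        self.m, self.k = m, k
        self.max_count = (1 << counter_bits) - 1      # 15 for 4-bit counters
        self.counts = [0] * m                         # counter vector C

    def _positions(self, x):
        d = hashlib.sha256(str(x).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") or 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, x):
        for p in self._positions(x):
            self.counts[p] = min(self.counts[p] + 1, self.max_count)  # saturate, never overflow

    def delete(self, x):
        for p in self._positions(x):
            if self.counts[p] > 0:
                self.counts[p] -= 1                   # B[j] is 0 exactly when c_j is 0

    def query(self, x):
        return all(self.counts[p] > 0 for p in self._positions(x))
```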
APPLICATIONS OF BLOOM FILTERS IN DISTRIBUTED SYSTEMS Bloom filters have been extensively used in many distributed systems where information dispersed across the entire system needs to be shared. For example, to reduce the message traffic, (Li et al, 2000) propose a web cache sharing protocol that employs a BF to represent the content of a web proxy cache and then periodically propagates that filter to other proxies. If a cache miss occurs on a proxy, that proxy checks the BFs replicated from other proxies to see whether they have the desired web objects in their caches. (Hong & Tao, 2003; Hua et al., 2008; Ledlie et al., 2002; Matei & Ian, 2002; Zhu et al., 2004; Zhu et al., 2008) use BFs to implement the function of mapping logical data identities to their physical locations in distributed storage systems. In these schemes, each storage node constructs a Bloom filter that summarizes the identities of data stored locally and broadcasts the Bloom filter to other nodes. By checking all filters collected locally, a node can locate the requested data without sending massive query messages to other nodes. Similar deployments of BFs have been found in geographic routing in wireless mobile systems (Pai-Hsiang, 2001), peer-to-peer systems (Hailong & Jun, 2004; John et al., 2000; Mohan & Kalogeraki, 2003; Rhea & Kubiatowicz, 2002), naming services (Little et al., 2002), and wireless sensor networks (Ghose et al. 2003; Luk et al. 2007). A common characteristic of distributed applications of BFs, including all those described above, is that a BF at a local host is replicated to other remote hosts to efficiently support distributed queries. In such dynamical distributed applications, the information that a BF represents evolves over time. However, the updating processes are usually delayed due to the network latency or the delay necessary in aggregating
small changes into single updating message in order to reduce the updating overhead. Accordingly the contents of the remote replicas may become partially outdated. This possible staleness in the remote replicas not only changes the probability of false positive answers to membership queries on the remote hosts, but also brings forth the possibility of false negatives. A false negative occurs when a BF replica answers “no” to the membership query for an element while that element actually exists in its host. It is generated when a new element is added to a host while the changes of the BF of this host, including the addition of this new element, have not been propagated to its replicas on other hosts. In addition, this staleness also changes the probability of false positives, an event in which an element is incorrectly identified as a member. Throughout the rest of this chapter, the probabilities of false negatives and false positives are referred to as the false negative rate and false positive rate, respectively. While the false negative and false positive rates for a BF at a local host have been well studied in the context of non-replicated BF (Bloom, 1970; Broder & Mitzenmacher, 2003; James, 1983; Li et al., 2000; Michael, 2002), very little attention has been paid to the false rates in the Bloom filter replicas in a distributed environment. In the distributed systems considered in this chapter, the false rates of the replicas are more important since most membership queries are performed on these replicas. A good understanding of the impact of the false negatives and false positives can provide the system designers with important and useful insights into the development and deployment of distributed BFs in such important applications as distributed file, database, and web server management systems in super-scales. Therefore, the first objective of this chapter is to analyze the false rates by developing analytical models and considering the staleness. Since different application may desire a different tradeoff between false rate (e.g, miss/fault penalty) and update overhead (e.g., network traffic and processing due to broadcasting of updates), it is very important and significant for the systems overall performance to be able to control such a tradeoff for a given application adaptively and efficiently. The second objective is to develop an adaptive control algorithm that can accurately and efficiently maintain a desirable level of false rate for any given application by dynamically and judiciously adjusting the update frequency. The primary contribution of this chapter is its developments of accurate closed-form expressions for the false negative and false positive rates in BF replicas, and the development of an adaptive replica-update control, based on our analytical model, that accurately and efficiently maintains a desirable level of false rate for any given application. To the best of our knowledge, this study is the first of its kind that has considered the impact of staleness of replicated BF contents in a distributed environment, and developed a mechanism to adaptively minimize such an impact so as to optimize systems performance. The rest of the chapter is organized as follows. Section 3 presents our analytical models that theoretically derive false negative and false positive rates of a BF replica, as well as the overall false rates in distributed systems. Section 4 validates our theoretical results by comparing them against results obtained from extensive experiments. 
The adaptive updating protocols based on our theoretical analysis models are presented in Section 5. Section 6 gives related work and Section 7 concludes the chapter. The chapter is extended from our previous publication (Zhu & Jiang, 2006).
FALSE RATES IN THEORY In many distributed systems, the information about what data objects can be accessed through a host or where data objects are located usually needs to be shared to facilitate the lookup. To provide high scal-
Figure 3. An example application of Bloom filters in a distributed system with 3 hosts.
ability, this information sharing usually takes a decentralized approach, to avoid potential performance bottleneck and vulnerability of a centralized architecture such as a dedicated server. While BFs were initially used in non-distributed systems to save the memory space in the 1980’s when memory was considered a precious resource (Lee, 1982; McIlroy, 1982), they have recently been extensively used in many distributed systems as a scalable and efficient scheme for information sharing, due to their low network traffic overhead. The inherent nature of such information sharing in almost all these distributed systems, if not all, can be abstracted as a location identification, or mapping problem, which is described next. Without loss of generality, the distributed system considered throughout this chapter is assumed to consist of a collection of γ autonomous data-storing host computers dispersed across a communication network. These hosts partition a universe U of data objects into γ subsets S1, S2,…,Sγ, with each subset stored on one of these hosts. Given an arbitrary object x in U, the problem is how to efficiently identify the host that stores x from any one of the hosts. BFs are useful to solve this kind of problems. In a typical approach, each host constructs a BF representing the subset of objects stored in it, and then broadcasts that filter to all the other hosts. Thus each host keeps γ − 1 additional BFs, one for every other host. Figure 3 shows an example of a system with three hosts. Note that a filter Bˆ i is a replica of Bi from Host i and Bˆ i may become outdated if the changes to Bi are not propagated instantaneously. While the solution to the above information sharing problem can implemented somewhat differently giving rise to a number of solution variants (Hua et al., 2008; Ledlie et al., 2002; Zhu et al., 2004), the analysis of false rates presented in this chapter can be easily applied to these variants. The detailed procedures of the operations of insertion, deletion and query of data objects are shown in Figure 4. When an object x is deleted from or inserted into Host i, the values of the counting filters Ci[H(x)] and bits Bi[H(x)] are adjusted accordingly. When the fraction of modified bits in Bi exceeds some threshold, Bi is broadcast to all the other hosts to update Bˆ i . To look up x, Host i performs the membership tests on all the BFs kept locally. If a test on Bi is positive, then x can potentially be accessed locally. If a test in the filter Bˆ j for any j ≠ i is positive, then x is conjectured to be on Host j with high probability. Finally, if none of the tests is positive, x is considered nonexistent in the system.
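The insertion, deletion and lookup procedures just described (and summarized in Figure 4) can be paraphrased in the following hedged Python sketch; the class and method names, the dirty-bit bookkeeping and the example threshold are our own illustrative choices, not the chapter's code:

```python
class Host:
    """One of the gamma hosts: a local counting filter plus gamma-1 replicated filters."""
    def __init__(self, host_id, local_filter, replicas):
        self.id = host_id
        self.local = local_filter        # counting filter C_i / B_i for local objects
        self.replicas = replicas         # {j: replica of B_j} for every other host j
        self.dirty_bits = set()          # positions changed since the last broadcast

    def insert(self, x):
        self.local.add(x)
        self.dirty_bits.update(self.local._positions(x))
        self._maybe_broadcast()

    def delete(self, x):
        self.local.delete(x)
        self.dirty_bits.update(self.local._positions(x))
        self._maybe_broadcast()

    def _maybe_broadcast(self, threshold=0.1):
        # When the fraction of modified bits exceeds a threshold, push the
        # current B_i to all other hosts (the network layer is omitted here).
        if len(self.dirty_bits) / self.local.m > threshold:
            self.dirty_bits.clear()      # placeholder for the actual broadcast

    def lookup(self, x):
        if self.local.query(x):
            return self.id               # x can potentially be accessed locally
        for j, replica in self.replicas.items():
            if replica.query(x):
                return j                 # x is conjectured to be on Host j
        return None                      # x is considered nonexistent in the system
```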
Figure 4. Procedures of adding, deleting and querying object x at host i
In the following, we begin the analysis by examining the false negative and false positive rate of a single BF replica and then present the analysis of the overall false rates of all BFs kept locally on a host. The experimental validations of the analytical models are presented in the next section.
False Rates of Bloom Filter Replicas
Let B be a BF with m bits and B̂ a replica of B. Let n and n̂ be the number of objects in the set represented by B and by B̂, respectively. We denote Δ1 (Δ0) as the set of all one (zero) bits in B that are different from (i.e., the complement of) the corresponding bits in B̂. More specifically,

Δ1 = {B[i] | B[i] = 1, B̂[i] = 0, ∀i ∈ [1, m]}
Figure 5. An example of a BF B and its replica Bˆ where bits are reordered such that bits in ∆1 and ∆0 are placed together.
Δ0 = {B[i] | B[i] = 0, B̂[i] = 1, ∀i ∈ [1, m]}.

Thus, Δ0 + Δ1 represents the set of changed bits in B that have not been propagated to B̂. The number of bits in this set is affected by the update threshold and update latency. Furthermore, if a nonempty Δ1 is hit by at least one hash function of a membership test on B̂ while all other hash functions of the same test hit bits in B̂ − Δ1 − Δ0 with a value of one, then a false negative occurs in B̂. Similarly, a false positive occurs if the nonempty Δ1 is replaced by a nonempty Δ0 in the exact membership test scenario on B̂ described above.

Lemma 1. Suppose that the numbers of bits in Δ1 and in Δ0 are mδ1 and mδ0, respectively. Then n̂ is a random variable following a normal distribution with an extremely small variance (i.e., extremely highly concentrated around its mean), that is,

E(n̂) = −(m/k) ln(e^{−kn/m} + δ1 − δ0).   (6)
Proof: In a given BF representing a set of n objects, each bit is zero with probability P0(n), given in Equation 2, or one with probability P1(n) = 1 − P0(n). Thus the average fractions of zero and one bits are P0(n) and P1(n), respectively. Ref. (Michael, 2002) shows formally that the fractions of zero and one bits are random variables that are highly concentrated on P0(n) and P1(n) respectively.
Figure 6. Expected false negative rate of a Bloom filter replica when the configuration of its original Bloom filter is optimal.
P0(n) − δ0 = E(P0(n̂)) − δ1   (7)

Substituting Equation 2 into the above equation, we have

e^{−kn/m} − δ0 = e^{−kE(n̂)/m} − δ1.   (8)
After solving Equation 8, we obtain Equation 6. Pragmatically, in any given BF with n objects, the values of δ1 and δ0, which represent the probabilities of a bit falling in ∆1 and ∆0 respectively, are relatively small. Theoretically, the number of bits in ∆1 is less than the total number of one bits in B, thus we have δ1 ≤ 1 − e−kn/m. In a similar way, we can conclude that δ0 ≤ e−kn/m.
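For reference, Equation 6 can be evaluated directly (a hypothetical helper; the name is ours, and the bounds δ1 ≤ 1 − e^{−kn/m} and δ0 ≤ e^{−kn/m} noted above keep the logarithm's argument positive):

```python
import math

def expected_replica_count(m, n, k, delta0, delta1):
    """Equation 6: expected number of elements represented by the replica."""
    return -(m / k) * math.log(math.exp(-k * n / m) + delta1 - delta0)
```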
Theorem 3. False Negative Rate
The expected false negative rate f̂^− in the BF replica B̂ is P1(n)^k − (P1(n) − δ1)^k, where P1(n) = 1 − e^{−kn/m}.

Figure 7. Expected false positive rate of a Bloom filter replica when the configuration of its original Bloom filter is optimal.

Proof: As mentioned earlier, a false negative in B̂ occurs when at least one hash function hits the bits in Δ1 in B̂ while the others hit the bits in B̂ − Δ1 − Δ0 with a value of one. Hence, the false negative rate is

f̂^− = Σ_{i=1}^{k} C(k,i) δ1^i (P1(n̂) − δ0)^{k−i} = (P1(n̂) − δ0 + δ1)^k − (P1(n̂) − δ0)^k
Since P0(n) = 1 − P1(n) and P0(n̂) = 1 − P1(n̂), Equation 7 can be rewritten as:

E(P1(n̂)) = P1(n) + δ0 − δ1   (9)

Hence

E(f̂^−) = (E(P1(n̂)) − δ0 + δ1)^k − (E(P1(n̂)) − δ0)^k = P1(n)^k − (P1(n) − δ1)^k   (10)
Figure 6 shows the expected false negative rate when the false positive of the original BF is minimized. The minimal false positive rate is 0.0214, 0.0031 and 0.00046 when the bit-element ratio is 8, 12 and 16
respectively. Figure 6 shows that the false negative rates of a BF replica are more than 50% of the false positive rates of the original BF when δ1 is 5%, and more than 75% when δ1 is 10%. This proves that the false negative may be significant and should not be neglected in distributed applications.
Theorem 4. False Positive Rate
The expected false positive rate f̂^+ for the Bloom filter replica B̂ is (P1(n) + δ0 − δ1)^k, where P1(n) = 1 − e^{−kn/m}.
Proof: If B̂ confirms positively the membership of an object while this object actually does not belong to B, then a false positive occurs. More specifically, a false positive occurs in B̂ if, for any x ∉ B, all bits hit by the hash functions of the membership test for x are ones in B̂ − Δ1 − Δ0, or, for any x ∈ U, all hit bits are ones in B̂ but at least one hit bit is in Δ0. Thus, we find that

f̂^+ = (1 − n/|U|) (P1(n̂) − δ0)^k + Σ_{i=1}^{k} C(k,i) δ0^i (P1(n̂) − δ0)^{k−i} = P1(n̂)^k − (n/|U|)(P1(n̂) − δ0)^k   (11)

Considering n ≪ |U| and Equation 9, we have

E(f̂^+) = (E(P1(n̂)))^k − (n/|U|)(E(P1(n̂)) − δ0)^k = (P1(n) + δ0 − δ1)^k − (n/|U|)(P1(n) − δ1)^k ≈ (P1(n) + δ0 − δ1)^k   (12)
Overall False Rates
In the distributed system considered in this study, there are a total of γ hosts and each host has γ BFs, with γ−1 of them replicated from the other hosts. To look up an object, a host performs the membership tests in all the BFs kept locally. This section analyzes the overall false rates on each BF replica and on each host. Given any BF replica B̂, the events of a false positive and a false negative are exclusive. Thus it is easy to find that the overall false rate of B̂ is

E(f_overall) = E(f̂^−) + E(f̂^+)   (13)

where E(f̂^−) and E(f̂^+) are given in Equations 10 and 12 respectively.
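Equations 10, 12 and 13 translate directly into code (an illustrative sketch under the chapter's notation; the function names are ours):

```python
import math

def replica_false_rates(m, n, k, delta0, delta1):
    """Theorems 3 and 4: expected false negative/positive rates of a replica,
    given the fractions delta1 (dirty one-bits) and delta0 (dirty zero-bits)."""
    p1 = 1.0 - math.exp(-k * n / m)                    # P1(n)
    f_neg = p1 ** k - (p1 - delta1) ** k               # Equation 10
    f_pos = (p1 + delta0 - delta1) ** k                # Equation 12
    return f_neg, f_pos

def overall_replica_false_rate(m, n, k, delta0, delta1):
    """Equation 13: the two events are exclusive, so the rates simply add."""
    f_neg, f_pos = replica_false_rates(m, n, k, delta0, delta1)
    return f_neg + f_pos
```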
Figure 8. Comparisons of estimated and experimental f̂^− of B̂ when k is 6, 8 and 11 respectively. The initial object number in both B and B̂ is 25, 75, 150 and 300 (m = 1200).
Table 1. False positive rate comparisons when k is 6 and 8 respectively (m = 1200). The last two columns give the estimated and experimental f̂^+ in percentage.

k | n̂ | δ0 | δ1 | f̂^+ estimated (%) | f̂^+ experimental (%)
6 | 25 | 0.0942 | 0.2042 | 0.0002 | 0
6 | 25 | 0.0800 | 0.3650 | 0.0002 | 0
6 | 25 | 0.0600 | 0.4875 | 0.0001 | 0
6 | 75 | 0.0800 | 0.1608 | 0.0934 | 0.1090
6 | 75 | 0.0600 | 0.2833 | 0.0794 | 0.1090
6 | 75 | 0.0483 | 0.3758 | 0.0799 | 0.1090
6 | 150 | 0.0533 | 0.1042 | 2.2749 | 2.6510
6 | 150 | 0.0400 | 0.1800 | 2.3540 | 2.6510
6 | 150 | 0.0325 | 0.2508 | 2.1872 | 2.6530
6 | 300 | 0.0250 | 0.0417 | 23.6555 | 25.4790
6 | 300 | 0.0183 | 0.0692 | 25.4016 | 25.4710
6 | 300 | 0.0117 | 0.1000 | 24.7241 | 25.4750
8 | 25 | 0.1083 | 0.2425 | 0.00002 | 0
8 | 25 | 0.0792 | 0.4192 | 0.00002 | 0
8 | 25 | 0.0550 | 0.5425 | 0.00002 | 0
8 | 75 | 0.0792 | 0.1767 | 0.0525 | 0.0540
8 | 75 | 0.0550 | 0.3000 | 0.0504 | 0.0540
8 | 75 | 0.0425 | 0.3917 | 0.0506 | 0.0540
8 | 150 | 0.0475 | 0.1050 | 2.5163 | 2.5770
8 | 150 | 0.0350 | 0.1758 | 2.6783 | 2.5780
8 | 150 | 0.0283 | 0.2367 | 2.5384 | 2.5790
8 | 300 | 0.0192 | 0.0333 | 33.2078 | 33.2580
8 | 300 | 0.0133 | 0.0558 | 34.4915 | 33.2550
8 | 300 | 0.0083 | 0.0817 | 32.1779 | 33.2550
On Host i, BF Bi represents all the objects stored locally. While only false positives occur in Bi, both false positives and false negatives can occur in the replicas B̂j for any j ≠ i. Since a failed membership test in any BF leads to a lookup failure, the overall false positive and false negative rates on Host i are therefore

E(f_host^+) = 1 − (1 − f_i^+) ∏_{j=1, j≠i}^{γ} (1 − f̂_j^+)   (14)

and
Figure 9. Comparisons of estimated and experimental foverall in a distributed system with 5 hosts when k is 6, 8, and 11 respectively. The initial object number n on each host is 25, 75, 150 and 300 respectively. Then each host adds a set of new objects. The number of new objects on each host increases from 50 to 300 with a step size of 50. (m = 1200)
Table 2. Overall false rate comparisons under the optimum initial operation state when k is 6 and 8 respectively. 100 new objects are added on each host and then a set of existing objects is deleted from each host. The number of deleted objects increases from 10 to 100 with a step size of 10 (m = 1200). In the first group, initially n = 150 and m/n = 8; in the second group, initially n = 100 and m/n = 12.

k | δ0 | δ1 | f_overall estimated (%) | f_overall experimental (%)
6 | 0.0100 | 0.1705 | 46.2259 | 45.2200
6 | 0.0227 | 0.1657 | 42.4850 | 40.6880
6 | 0.0347 | 0.1627 | 38.7101 | 37.2420
6 | 0.0458 | 0.1582 | 34.9268 | 33.8460
6 | 0.0593 | 0.1545 | 31.3748 | 30.4540
6 | 0.0715 | 0.1497 | 27.8831 | 27.3700
6 | 0.0837 | 0.1445 | 24.5657 | 24.8000
6 | 0.0938 | 0.1392 | 21.2719 | 22.5560
6 | 0.1045 | 0.1340 | 18.2490 | 20.4520
6 | 0.1165 | 0.1300 | 15.5103 | 18.7540
8 | 0.0123 | 0.2375 | 30.9531 | 29.6280
8 | 0.0255 | 0.2275 | 25.7946 | 23.6280
8 | 0.0413 | 0.2180 | 21.0943 | 18.0000
8 | 0.0552 | 0.2123 | 16.7982 | 14.6720
8 | 0.0658 | 0.2043 | 12.9800 | 12.0040
8 | 0.0772 | 0.1965 | 9.7307 | 9.7320
8 | 0.0920 | 0.1900 | 7.1016 | 7.7520
8 | 0.1075 | 0.1848 | 4.9936 | 6.1280
8 | 0.1237 | 0.1788 | 3.4031 | 4.8400
8 | 0.1377 | 0.1732 | 2.2034 | 3.8160
E(f_host^−) = 1 − ∏_{j=1, j≠i}^{γ} (1 − f̂_j^−)   (15)

where f_i^+, f̂_j^− and f̂_j^+ are given in Theorems 2, 3 and 4 respectively. The probability that Host i fails a membership lookup can be expressed as follows:

E(f_host) = E(f_host^+ + f_host^− − f_host^+ · f_host^−)   (16)
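The host-level combination in Equations 14–16 can likewise be sketched as follows (argument and function names are our own; the per-replica rates would come from Theorems 3 and 4):

```python
def host_false_rates(f_pos_local, replica_rates):
    """Equations 14-16. f_pos_local is f_i^+ of the local BF (Theorem 2);
    replica_rates is a list of (f_neg, f_pos) pairs for the gamma-1 replicas."""
    p_no_fp = 1.0 - f_pos_local
    p_no_fn = 1.0
    for f_neg, f_pos in replica_rates:
        p_no_fp *= (1.0 - f_pos)
        p_no_fn *= (1.0 - f_neg)
    f_pos_host = 1.0 - p_no_fp                          # Equation 14
    f_neg_host = 1.0 - p_no_fn                          # Equation 15
    # Equation 16: probability that Host i fails a membership lookup
    f_host = f_pos_host + f_neg_host - f_pos_host * f_neg_host
    return f_pos_host, f_neg_host, f_host
```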
In practice, we can use the overall false rate of a BF replica to trigger the updating process and use the overall false rate of all BFs on a host to evaluate the whole system. In a typical distributed environment with many nodes, the updating of a Bloom filter replica B̂i stored on node j can be triggered by either the home node i or the node j. Since many nodes hold the replica of Bi, it is more efficient to let the home node i initiate the updating process of all replicas of Bi. Otherwise, the procedure of checking
whether an updating is needed would be performed by all other nodes, wasting both network and CPU resources. Accordingly, we can only use the overall false rate of a BF replica E(foverall) as the updating criteria. On the other hand, E(fhost) can be used to evaluate the overall efficiency of all BFs stored on the same host.
EXPERIMENTAL VALIDATION
This section validates the theoretical framework developed in this chapter by comparing the analytical results produced by our models with results obtained through real experiments. We begin by examining a single BF replica. Initially the Bloom filter replica B̂ is exactly the same as B. Then we artificially change B by randomly inserting new objects into B or randomly deleting existing objects from B repeatedly. For each specific modification made to B, we calculate the corresponding δ1 and δ0 and use 100,000 randomly generated objects to test the memberships against B̂. Since the actual objects represented in B are known in the experiments, the false negative and positive rates can be easily measured. Figure 8 compares analytical and real false negative rates, obtained from the theoretic models and from experiments respectively, by plotting the false negative rate in B̂ as a function of δ1, a measure of the update threshold, for different numbers of hash functions (k = 6 and k = 8) when the initial number of objects in B is 25, 75, 150 and 300 respectively. Since the false negative rates are independent of δ0, only object deletions are performed in B. Table 1 compares the analytical and real false positive rates of B̂ when k is 6 and 8 respectively. In these experiments, both object deletions and additions are performed in B while B̂ remains unaltered. It is interesting that the false positive rates of B̂ stay around a constant value for a specific n̂ although the objects in B change in the real experiments. It is true that if the number of objects in B increases or decreases, the false positive rate in B̂ should decrease or increase accordingly before the changes of B are propagated to B̂. However, due to the fact that n is far less than the total object number in the universe U, the change of the false positive rate in B̂ is too small to be perceptible. These tests are consistent with the real scenarios of BF applications in distributed systems. In such real applications, the number of possible objects is usually very large and thus BFs are deployed to efficiently reduce the network communication requirements. Hence, in these experiments the number of objects used to test B̂ is much larger than the number of objects in B or B̂ (100,000 random objects are tested). With such a large number of testing samples, the influence of the modification in B on the false positive rate of B̂ is difficult to observe. We also simulated the lookup problem in a distributed system with 5 hosts. Figure 9 shows the comparisons of the analytical and experimental average overall false rates on each host. In these experiments, we only added new objects without deleting any existing items so that δ0 is kept zero. The experiments presented in Table 2 consider both the deletion and addition of objects on each host when the initial state of the BF on each host is optimized, that is, the number of hash functions is optimal for the ratio between m and the initial number of objects n. This specific setting aims to emulate the real application where m/n and k are usually optimally or sub-optimally matched by dynamically adjusting the BF length m (Hong & Tao, 2003) or designing the BF length according to the average number of objects (Ledlie et al., 2002; Li et al., 2000; Little et al., 2002; Matei & Ian, 2002; Zhu et al., 2004). All the analytical
results have been very closely matched by their real (experimental) counterparts consistently, strongly validating our theoretical models.
Figure 10. In an environment of two servers, the figures show the overall false rate on one server when the initial number of elements in one server are 25 and 150 respectively. The ratio of bits per element is 8 and 6 hash functions are used. The rate for element addition and deletion are respectively 5 and 2 per time unit on each server.
REPLICA UPDATE PROTOCOL
To reduce the false rate caused by staleness, the remote Bloom filter replica needs to be periodically updated. An update process is typically triggered if the percentage of dirty bits in a local BF exceeds some threshold. While a small threshold causes large network traffic and a large threshold increases the false rate, this tradeoff is usually reached by a trial-and-error approach that runs numerous trials in real experiments or simulations. For example, in the summary cache study (Li et al., 2000), it is recommended that if 10 percent of the bits in a BF are dirty, then the BF propagates its changes to all replicas. However, this approach has the following disadvantages.
1. It cannot directly control the false rate. To keep the false rate under some target value, complicated simulations or experiments have to be conducted to adjust the threshold for dirty bits. If the target false rate changes, this tedious process has to be repeated to find a "golden" threshold.
2. It treats all dirty bits equally and does not distinguish the zero-dirty bits from the one-dirty bits. In fact, as shown in the previous sections, the dirty one bits and the dirty zero bits exert different impacts on the false rates.
3. It does not allow flexible update control. In many applications, the penalties of a false positive and a false negative are significantly different. For example, in summary cache (Li et al., 2000), a false positive occurs if a request is not a cache hit on some web proxy when the corresponding Bloom filter replica confirms so. The penalty of a false positive is a wasted query message to this local web proxy. A false negative happens if a request can be hit in a local web proxy but the Bloom filter replica mistakenly indicates otherwise. The penalty of a false negative is a round-trip delay in retrieving information from a remote web server through the Internet. Thus, the penalty of a false negative is much larger than that of a false positive. The updating protocols based on the percentage of dirty bits do not allow one to place more weight on the false negative rate, thus limiting the flexibility and efficiency of the updating process.
Based on the theoretic models presented in the previous sections, an updating protocol that directly controls the false rate is designed in this chapter. In a distributed system with γ nodes where each node has a local BF to represent all local elements, each node is responsible for automatically updating its BF replicas. Each node estimates the false rate of its remote BF replicas and, if the false rate exceeds some desired false rate, as opposed to a predetermined threshold on the percentage of dirty bits in the conventional updating approaches, an updating process is triggered. To estimate the false rate of the remote BF replica B̂, each node has to record the number of elements stored locally (n), in addition to a copy of the remote BF replica B̂. This copy is essentially the local BF B when the last update was made. It is used to calculate the percentage of dirty one bits (δ1) and dirty zero bits (δ0). Compared with the conventional updating protocols based on the total percentage of dirty bits, this protocol only needs to record one more variable (n), so it does not significantly increase the maintenance overhead. This protocol allows more flexible updating policies that consider the penalty difference between a false positive and a false negative. The overall false rate can be a weighted sum of the false positive rate and the false negative rate, shown as follows:

E(f_overall) = w^+ E(f̂^+) + w^− E(f̂^−)   (17)
where w^+ and w^− are the weights. The values of w^+ and w^− depend on the applications and also on the application environments. We prove the effectiveness of this update protocol through event-driven simulations. In this simulation, we made the following assumptions.
1. Each item is randomly accessed. This assumption may not be realistic in some real workloads, in which an item has a higher-than-uniform chance of being accessed again once it has been accessed. Though all previous theoretic studies on Bloom filters assume a workload with a uniform access spectrum, further studies are needed to investigate the impact of this assumption.
2. Each local node deletes or adds items at a constant rate. In fact, the deletion and addition rates change dynamically throughout the lifetime of applications. This simplifying assumption is employed just to prove our concept while keeping our experiments manageable in the absence of a real trace or benchmark.
3. The values of w^+ and w^− are 1. Their optimal values depend on the nature of the applications and environments.
We simulate a distributed system with two nodes where each node keeps a BF replica of the other. We assume the addition and deletion rates are 5 and 2 per time unit respectively and our desired false rate is 10%. Figure 10 shows the estimated false rate and the measured false rate of node 1 throughout the deletion, addition and updating processes. Due to space limitations, the false rate on node 2, which is similar to that of node 1, is not shown in this chapter. In addition, we have also varied the addition and deletion rates. Simulation results consistently indicate that our protocol is accurate and effective in controlling the false rate.
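A minimal form of this false-rate-driven trigger is sketched below (an illustration only; the 10% target mirrors the desired false rate used in the simulation above, and all names are ours):

```python
import math

def weighted_false_rate(m, n, k, delta0, delta1, w_pos=1.0, w_neg=1.0):
    """Equation 17: weighted overall false rate of one remote replica."""
    p1 = 1.0 - math.exp(-k * n / m)
    f_neg = p1 ** k - (p1 - delta1) ** k          # Equation 10
    f_pos = (p1 + delta0 - delta1) ** k           # Equation 12
    return w_pos * f_pos + w_neg * f_neg

def should_update(m, n, k, delta0, delta1, target=0.10):
    """The home node broadcasts its BF once the estimated false rate of the
    remote replicas exceeds the desired target, instead of using a dirty-bit
    threshold."""
    return weighted_false_rate(m, n, k, delta0, delta1) > target
```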
RELATED WORK Standard Bloom filters (Bloom, 1970) have inspired many extensions and variants, such as the Counting Bloom filters (Li et al., 2000), compressed Bloom filters (Michael, 2002), the space-code Bloom filters (Kumar et al., 2005), the spectral Bloom filters (Saar & Yossi, 2003), time-decaying Bloom filters (Cheng et al., 2005), and the Bloom filter state machine (Bonomi et al., 2006). The counting Bloom filters are used to support the deletion operation and handle a set that is changing over time (Li et al., 2000). Time-decaying Bloom filters maintains the frequency count for each item stored in the Bloom filters and the values of these frequency count decay with time (Cheng et al., 2005). Multi-Dimension Dynamic Bloom Filters (MDDBF) supports representation and membership queries based on the multiattribute dimension (Guo et al., 2006). Its basic idea is to represent a dynamic set A with a dynamic s × m bit matrix, in which there are s standard Bloom filters and each Bloom filter has a length of m bits. A novel Parallel Bloom Filters (PBF) and an additional hash table has been developed to maintain multiple attributes of items and verify the dependency of multiple attributes, thereby significantly decreasing false positives (Hua & Xiao, 2006). Bloom filters have significant advantages in space saving and fast query operations and thus have been widely applied in many distributed computer applications, such as aiding longest prefix matching (Dharmapurikar et al., 2006), and packet classification (Baboescu & Varghese, 2005). Extended Bloom filter provides better throughput performance for router applications based on hash tables by using a
small amount of multi-port on-chip memory (Song et al., 2005). Whenever space is a concern, a Bloom filter can be an excellent alternative to storing a complete explicit list. In many distributed applications, BFs are often replicated to multiple hosts to support membership query without contacting other hosts. However, these replicas might become stale since the changes of BFs usually cannot be propagated instantly to all replicas in order to reduce the update overhead. As a result, the BF replicas may return false negatives. This observation motivates the research presented in this chapter.
CONCLUSION
Although false negatives do not occur in a standard BF, this chapter shows that the staleness in a BF replica can produce false negatives. We present a theoretical analysis of the impact of the staleness existing in many distributed BF applications on the false negative and false positive rates, and develop an adaptive update control mechanism that accurately and efficiently maintains a desirable level of false rate for a given application. To the best of our knowledge, we are the first to derive accurate closed-form expressions that incorporate the staleness into the analysis of the false negative and positive rates of a single BF replica, to develop analytical models of the overall false rates of the BF arrays that have been widely used in many distributed systems, and to develop an adaptively controlled update process that accurately maintains a desirable level of false rate for a given application. We have validated our analysis by conducting extensive experiments. The theoretical analysis presented not only provides system designers with significant theoretical insights into the development and deployment of BFs in distributed systems, but is also useful in practice for accurately determining when to trigger the processes of updating BF replicas in order to keep the false rates under some desired values, or, equivalently, to minimize the frequency of updates to reduce update overhead.
ACKNOWLEDGMENT This work was partially supported by a faculty startup grant of University of Maine, and National Science Foundation Research Grants (CCF #0621493, CCF #0754951, CNS #0723093, DRL #0737583, CNS #0619430, CCF #0621526).
REFERENCES Baboescu, F., & Varghese, G. (2005). Scalable packet classification. IEEE/ACM Trans. Netw., 13(1), 2–14. Bloom, H. B. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422–426. doi:10.1145/362686.362692
Bonomi, F., Mitzenmacher, M., Panigrah, R., Singh, S., & Varghese, G. (2006). Beyond bloom filters: from approximate membership checks to approximate state machines. Paper presented at the Proceedings of the 2006 conference on Applications, technologies, architectures, and protocols for computer communications. Broder, A., & Mitzenmacher, M. (2003). Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4), 485–509. Cheng, K., Xiang, L., Iwaihara, M., Xu, H., & Mohania, M. M. (2005). Time-Decaying Bloom Filters for Data Streams with Skewed Distributions. Paper presented at the Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications. Dharmapurikar, S., Krishnamurthy, P., & Taylor, D. E. (2006). Longest prefix matching using bloom filters. IEEE/ACM Trans. Netw., 14(2), 397–409. Ghose, F., Grossklags, J., & Chuang, J. (2003). Resilient Data-Centric Storage in Wireless Ad-Hoc Sensor Networks. Proceedings the 4th International Conference on Mobile Data Management (MDM’03), (pp. 45-62). Guo, D., Wu, J., Chen, H., & Luo, X. (2006). Theory and Network Applications of Dynamic Bloom Filters. Paper presented at the INFOCOM 2006. 25th IEEE International Conference on Computer Communications. Hailong, C., & Jun, W. (2004). Foreseer: a novel, locality-aware peer-to-peer system architecture for keyword searches. Paper presented at the Proceedings of the 5th ACM/IFIP/USENIX International Conference on Middleware. Hong, T., & Tao, Y. (2003). An Efficient Data Location Protocol for Self.organizing Storage Clusters. Paper presented at the Proceedings of the 2003 ACM/IEEE conference on Supercomputing. Hua, Y., & Xiao, B. (2006). A Multi-attribute Data Structure with Parallel Bloom Filters for Network Services. Proceedings of 13th International Conference of High Performance Computing (HiPC),(pp. 277-288). Hua, Y., Zhu, Y., Jiang, H., Feng, D., & Tian, L. (2008). Scalable and Adaptive Metadata Management in Ultra Large-Scale File Systems. Proceedings of the 28th International Conference on Distributed Computing Systems (ICDCS 2008). James, K. M. (1983). A second look at bloom filters. Communications of the ACM, 26(8), 570–571. doi:10.1145/358161.358167 John, K., David, B., Yan, C., Steven, C., Patrick, E., & Dennis, G. (2000). OceanStore: an architecture for global-scale persistent storage. SIGPLAN Not., 35(11), 190–201. doi:10.1145/356989.357007 Kumar, A., Xu, J., & Zegura, E. W. (2005). Efficient and scalable query routing for unstructured peerto-peer networks. Paper presented at the Proceedings INFOCOM 2005, 24th Annual Joint Conference of the IEEE Computer and Communications Societies.
Ledlie, J., Serban, L., & Toncheva, D. (2002). Scaling Filename Queries in a Large-Scale Distributed File System. Harvard University, Cambridge, MA. Lee, L. G. (1982). Designing a Bloom filter for differential file access. Communications of the ACM, 25(9), 600–604. doi:10.1145/358628.358632 Li, F., Pei, C., Jussara, A., & Andrei, Z. B. (2000). Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw., 8(3), 281–293. Little, M. C., Shrivastava, S. K., & Speirs, N. A. (2002)... The Computer Journal, 45(6), 645–652. doi:10.1093/comjnl/45.6.645 Luk, M., Mezzour, G., Perrig, A., & Gligor, V. (2007). MiniSec: A Secure Sensor Network Communication Architecture. Proceedings of IEEE International Conference on Information Processing in Sensor Networks (IPSN), (pp. 479-488). Matei, R., & Ian, F. (2002). A Decentralized, Adaptive Replica Location Mechanism. Paper presented at the Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing. McIlroy, M. (1982). Development of a Spelling List. Communications, IEEE Transactions on [legacy, pre - 1988], 30(1), 91-99. Michael, M. (2002). Compressed bloom filters. IEEE/ACM Trans. Netw., 10(5), 604–612. Mohan, A., & Kalogeraki, V. (2003). Speculative routing and update propagation: a kundali centric approach. Paper presented at the IEEE International Conference on Communications, 2003. Pai-Hsiang, H. (2001). Geographical region summary service for geographical routing. Paper presented at the Proceedings of the 2nd ACM international symposium on Mobile ad hoc networking and computing. Rhea, S. C., & Kubiatowicz, J. (2002). Probabilistic location and routing. Paper presented at the IEEE INFOCOM 2002, Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies Proceedings. Saar, C., & Yossi, M. (2003). Spectral bloom filters. Paper presented at the Proceedings of the 2003 ACM SIGMOD international conference on Management of data. Song, H., Dharmapurikar, S., Turner, J., & Lockwood, J. (2005). Fast hash table lookup using extended bloom filter: an aid to network processing. Paper presented at the Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications. Zhu, Y., & Jiang, H. (2006). False Rate Analysis of Bloom Filter Replicas in Distributed Systems. Paper presented at the Proceedings of the 2006 International Conference on Parallel Processing. Zhu, Y., Jiang, H., & Wang, J. (2004). Hierarchical Bloom filter arrays (HBA): a novel, scalable metadata management system for large cluster-based storage. Paper presented at the Proceedings of the 2004 IEEE International Conference on Cluster Computing.
Zhu, Y., Jiang, H., Wang, J., & Xian, F. (2008). HBA: Distributed Metadata Management for Large Cluster-Based Storage Systems. IEEE Transactions on Parallel and Distributed Systems, 19(6), 750–763. doi:10.1109/TPDS.2007.70788
KEY TERMS AND DEFINITIONS
Bloom Filter: A Bloom filter is a space-efficient data structure that supports membership queries. It consists of a bit array in which all bits are initially set to 0, and it uses a fixed number of predefined independent hash functions. For each element, all hashed bits are set to 1. To check whether an element belongs to the set represented by a Bloom filter, one simply checks whether all bits pointed to by the hash functions are 1. If not, the element is not in the set. If yes, the element is considered a member.
Bloom Filter Array: A Bloom filter array, consisting of multiple Bloom filters, represents multiple sets. It is a space-efficient data structure to evaluate whether an element is within these sets and, if so, which set this element belongs to.
Bloom Filter Replica: A Bloom filter replica is a replication of a Bloom filter. In a distributed environment, the original and replicated Bloom filters are typically stored on different servers for improved performance and fault tolerance. A Bloom filter replica can generate both false positives and false negatives.
Bloom Filter Update Protocol: When the set that a Bloom filter represents changes over time, the corresponding Bloom filter replica becomes outdated. In order to reduce the probability that the Bloom filter replica reports membership incorrectly, the replica needs to be updated frequently. The Bloom filter update protocol determines when a Bloom filter replica needs to be updated.
Distributed Membership Query: A membership query is a fundamental function that reports where the target data, resource, or service is located. The membership query can be performed by a centralized server or by a group of distributed servers. The latter approach has stronger scalability and is referred to as distributed membership query.
False Negative: A false negative happens when an element is a member of the set that a Bloom filter represents but the Bloom filter mistakenly reports that it is not. A standard Bloom filter has no false negatives. However, in a distributed system, a Bloom filter replica can generate false negatives when the replica is not updated in a timely manner.
False Positive: A false positive happens when an element is not a member of the set that a Bloom filter represents but the Bloom filter mistakenly reports that it is. The probability of false positives can be very low when the Bloom filter is appropriately designed.
Chapter 35
Image Partitioning on Spiral Architecture Qiang Wu University of Technology, Australia Xiangjian He University of Technology, Australia
ABSTRACT Spiral Architecture is a relatively new and powerful approach to image processing. It contains very useful geometric and algebraic properties. Based on the abundant research achievements in the past decades, it is shown that Spiral Architecture will play an increasingly important role in image processing and computer vision. This chapter presents a significant application of Spiral Architecture for distributed image processing. It demonstrates the impressive characteristics of spiral architecture for high performance image processing. The proposed method tackles several challenging practical problems during the implementation. The proposed method reduces the data communication between the processing nodes and is configurable. Moreover, the proposed partitioning scheme has a consistent approach: after image partitioning each sub-image should be a representative of the original one without changing the basic object, which is important to the related image processing operations.
INTRODUCTION Image processing is a traditional area in computing science which has been used widely in many applications including the film industry, medical imaging, industrial manufacturing, weather forecasting etc. With the development of new algorithms and the rapid growth of application areas, a key issue emerges and attracts more and more challenging research in digital image processing. That issue is the dramatically increasing computation workload in image processing. The reasons can be classified into three groups: relatively low-power computing platform, huge image data to be processed and the nature of image-processing algorithms. DOI: 10.4018/978-1-60566-661-7.ch035
Copyright © 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Inefficient computing is a relative concept. The microcomputer has been powerful enough in the last decade to make personal image processing practically feasible to the individual researcher for inexpensive image processing (Miller, 1993; Schowengerdt & Mehldau, 1993). In recent years, although such systems still functionally satisfy the requirements of most general purpose image-processing needs, the limited computing capacity in a standalone processing node has become inadequate to keep up with the faster growth of image-processing applications in such practical areas as real-time image processing and 3D image rendering. The huge amount of image data is another issue which has been faced by many image-processing applications today. Many applications such as computer graphics, rendering photo realistic images and computer-animated films consume the aggregate power of whole farms of workstations (Oberhuber, 1998). Although the common sense of what is “large” to the image data being processed has changed over time, expression in Megabytes or Gigabytes is observed from the application point of view (Goller, 1999). Over the past few decades, the image to be processed has become larger and larger. Consequently, the issue of how to decrease the processing time despite the growth of image data becomes an urgent point in digital image processing. Moreover, the nature of the traditional image-processing algorithms is another issue which reduces the processing speed. In digital image processing, the elementary image operators can be differentiated between point image operators, local imageoperators and global image operators (Braunl, Feyrer, Rapf, & Reinhardt, 2001). The main characteristic of point operator is that a pixel in the output image depends only on the corresponding pixel in the input image. Point operators are used to copy an image from one memory location to another, in arithmetic and logic operations, table lookup and image composition (Nicolescu & Jonker, 2002). Local operators create a destination pixel based on criteria that depend on the source pixel and the values of the pixels in some “neighbourhood” surrounding it. They are used widely in low-level image processing such as image enhancement by sharpening, blurring and noise removal. Global operators create a destination pixel based on the entire image information. A representative example of an operator within this class is the Discrete Fourier Transform (DFT). Compared with point operators, local operators and global operator are more computationally intensive. As a consequence of the above, image processing-related tasks involve the execution of a large number of operations on large sets of structured data. The processing power of the typical desktop workstation can therefore become a severe bottleneck in many image-processing applications. Thus, it may make sense to perform image processing on multiple workstations or on a parallel processing system. Actually, many image-processing tasks exhibit a high degree of data locality and parallelism, and map quite readily to specialized massively parallel computing hardware (Chen, Lee, & Cho, 1990; Siegel, Armstrong, & Watson, 1992; Stevenson, Adams, Jamieson, & Delp, April, 1993). For several image-processing applications, a number of existing programs have been optimized for execution on parallel computer architectures (Pitas, 1993). 
The parallel approach, as an alternative to replace the original sequential processing, promises many benefits to the development of image processing. Gloller and Leberl (Goller & Leberl, 2000) implemented shape-from-shading, stereo matching, re-sampling, gridding and visualization of terrain models, which are all compute-intensive algorithms in radar image processing, in such a manner that they execute either on a parallel machine or on a cluster of workstations which connects many computing nodes together via a local area network. Other typical applications of image processing on parallel computing platform can be seen in the field of remote image processing such as the 3D object mediator (Kok, Pabst, & Afsarmanseh, April, 1997), internetbased distributed 3D rendering and animation (Lee, Lee, Lu, & Chen, 1997), remote image-processing
systems design using an IBM PC as front-end and a transputer network as back-end (Wu & Guan, 1995), telemedicine projects (Marsh, 1997), satellite image processing (Hawick et al., 1997) and the general approach for remote execution of software (Niederl & Goller, Jan, 1998). Many parallel image-processing tasks map quite readily to specialized massively parallel computing hardware. However, specific parallel machines require a significant investment but may only be needed for a short time. Accessing such systems is difficult and requires in-depth knowledge of the particular system. Alternatively, the users must turn to supercomputers, which may be unacceptable for many customers. These three aspects are the main reasons why parallel computing has not been widely adopted for computer vision and image processing. Clusters of workstations have been proposed as a cheap alternative to parallel machines. Driven by advances in network technology, cluster management systems are becoming a viable and economical parallel computing platform for the implementation of parallel processing algorithms. Moreover, the utilization of workstation clusters can yield many potential benefits, such as performance and reliability. It can be expected that workstation clusters can take over computing intensive tasks from supercomputers. Offsetting the many advantages mentioned above, the main disadvantages of clusters of workstation are high communication latency and irregular load patterns on the computing nodes. The system performance mainly depends on the amount and structure of communication between processing nodes. Thus, many coarse-grained parallel algorithms perform well, while fine-grained data decomposition methods like the ones in the Parallel Image-Processing Toolkit (PIPT) (Squyres, Lumsdaine, & Stevenson, 1995) require such a high communication bandwidth that execution on the cluster may even be slower than on a single workstation. Moreover, the coexistence of parallel and sequential jobs that is typical when interactive users work on the cluster makes scheduling and mapping a hard problem (Arpaci et al., May, 1995). Thus, taking care of the intercommunication required for processing is an important issue for distributed processing. For instance, if a particular processor is processing a set of rows, it needs information about the rows above and below its first and last rows, when row partitioning is effected (Bharadwaj, Li., & Ko, 2000; Siegel, Siegel, & Feather, 1982). The additional information must be exchanged between the corresponding nodes. It can be done by two approaches in general. In the first approach, explicit communication is built up on-demand between the processors (Siegel et al., 1992) and is carried out concurrently with the main processes. In another approach, the required data is over-supplied to the respective processor at the distribution phase (Siegel et al., 1982). In many cases, the second approach is a natural choice for the architecture that is considered, although it introduces additional data transfer. The facts revealed above is a key problem related to information partitioning. In term of the applications of image processing, it is defined as the problem of image data partitioning. Most image partitioning techniques can be classified into two groups: fine-grained decomposition and coarse-grained decomposition (Squyres et al., 1995). 
A fine-grained decomposition-based image-processing operation will assign an output pixel per processor and assign the required windowed data for each output pixel to the processor. Thus, each processor will perform the necessary processing for its output pixel. A coarse-grained decomposition will assign large contiguous regions of the output image to each of a small number of processors. Each processor will perform the appropriate window based operations to its own region of the image. Appropriate overlapping regions of the image will be assigned in order to properly accommodate the processing at the image boundaries. There are some difficulties as a consequence of the general data partitioning. The first one is the extra
communication required between the processors, which has been mentioned above. This is inevitable when a processor participating in the parallel computation needs some additional information pertaining to the data residing in other processors (Bertsekas & Tsitsiklis, 1989; Siegel et al., 1992) for processing its share of the data. Another difficulty is that the number of processors available and the size of the input image may vary in the different applications, so the sizes of sub-images for distribution and the number of processors for a specific operation cannot be arbitrarily determined in the early stages. This chapter presents a highly efficient image partitioning method which is based on a special image architecture, Spiral Architecture. Using Spiral Architecture on a cluster of workstations, a new uniform image partitioning scheme is derived in order to reduce many overhead components that otherwise penalize time performance. With such a scheme, uniform sub-images can be produced, which are near copies rather than different portions of the original image. Each sub-image can then be processed by the different processing nodes individually and independently. Moreover, this image-partitioning method provides a possible way to deal with many traditional image-processing tasks simultaneously. Because each partitioned sub-image contains the main features of the original image, i.e. a representation of the original image, the different tasks can execute on the different processing nodes in parallel without interfering with each other. This method is a closed-form solution. In each application, the number of partitions can be decided based on the practical requirements and the practical system conditions. A formula is derived to build the relation between the number of partitions and the multiplier in Spiral Multiplication which is used to achieve image partitioning. The organization of this chapter is as follows. Spiral Architecture and its special mathematical operations are introduced in the Related Work section, which is followed by the detailed explanation of image partitioning on Spiral Architecture. In this section, several problems and their solutions are discussed regarding the implementation of image segmentation on the new architecture. Finally, the experimental results and conclusion are presented.
RELATED WORK Spiral Architecture Traditionally, almost all image processing and image analysis is based on the rectangular architecture, which is a collection of rectangular pixels in a column-row arrangement. However, rectangular architecture is not historically the only one used in image-processing research. Another architecture used often is the Spiral Architecture. Spiral Architecture is inspired by anatomical considerations of primate vision (Schwartz, 1980). The cones on the retina possess the hexagonal distribution feature shown in Figure 1. The cones, with the shape of hexagons, are arranged in a spiral cluster. Each unit is a set of seven hexagons (Sheridan, Hintz, & Alexander, 2000). That is, each pixel has six neighbouring pixels. This arrangement is different from the 3×3 rectangular vision unit in Rectangular Architecture, where each pixel has eight neighbouring pixels. A collection of hexagonal pixels represented using Spiral Architecture is shown in Figure 2. The origin point is normally located at the centre of the Spiral Architecture. In Spiral Architecture any pixel has only six neighbouring pixels, which all have the same distance to the central hexagon of the seven-hexagon unit of vision. From research on the geometry of the cones in the pri-
Figure 1. Distribution of Cones on the Retina (from (He, 1998))
Figure 2. A collection of hexagonal cells
Figure 3. A labelled cluster of seven hexagons
mate's retina, it can be concluded that the cones' distribution is distinguished by its potentially powerful computational abilities.
Spiral Addressing It is obvious that the hexagonal pixels in Figure 2 cannot be labelled in column-row order as in rectangular architecture. Instead of labelling the pixel with a pair of numbers (x, y), each pixel is labelled with a unique number. Addressing proceeds in a recursive manner. Initially, a collection of seven hexagons is labelled as shown in Figure 3. Such a cluster of seven hexagons dilates so that six more clusters of seven hexagons are placed around the original cluster. The addresses of the centres of the additional six clusters are obtained by multiplying the adjacent address in Figure 3 by 10 (see Figure 4). In each new cluster, the other pixels are labelled consecutively from the centre as shown in Figure 3. Dilation can then repeat to grow the architecture in powers of seven with unique assigned addresses. The hexagons thus tile the plane in a recursive modular manner along a spiral direction (Alexander, 1995). It eventuates that a spiral address is in fact a base-seven number. A cluster of size 7^3 with the corresponding addresses is shown in Figure 5.
Figure 4. Dilation of the cluster of seven hexagons
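To make the base-seven nature of spiral addresses concrete, the short Python sketch below (an illustration, not part of the original chapter; all function names are hypothetical) lists the addresses of a cluster of size 7^n and converts between an address and its ordinal position.

```python
# Illustrative sketch: spiral addresses viewed as base-seven numerals.
# Addresses are written with the digits 0-6, e.g. 0..6, 10..16, ..., 660..666.

def format_base7(k):
    """Ordinal position k -> spiral address written as a base-seven numeral."""
    digits = ""
    while True:
        digits = str(k % 7) + digits
        k //= 7
        if k == 0:
            return digits

def cluster_addresses(n):
    """All spiral addresses of a cluster containing 7**n hexagons."""
    return [int(format_base7(k)) for k in range(7 ** n)]

def address_to_ordinal(address):
    """Spiral address (base-seven numeral) -> ordinal position 0 .. 7**n - 1."""
    return int(str(address), 7)

# Example: the 49 addresses of a two-level cluster are 0..6, 10..16, ..., 60..66.
print(cluster_addresses(2)[:8])   # [0, 1, 2, 3, 4, 5, 6, 10]
print(address_to_ordinal(66))     # 48
```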
Mathematical Operations on Spiral Architecture Spiral Architecture contains very useful geometric and algebraic properties, which can be interpreted in terms of a mathematical object, the Euclidean ring. Two algebraic operations have been defined on Spiral Architecture: Spiral Addition and Spiral Multiplication. The neighbouring relation among the pixels on Spiral Architecture can be expressed uniquely by these two operations. Spiral Addition and Spiral Multiplication will be used together to achieve uniform and complete image partitioning which is very important to distributed image processing.
Figure 5. Hexagons with labelled addresses on Spiral Architecture (He, 1998)
Spiral Addition Spiral Addition is an arithmetic operation with closure properties defined on the spiral address space so that the result of Spiral Addition will be an address in the same finite set on which the operation is performed (Sheridan, 1996). In addition, Spiral Addition incorporates a special form of modularity. To develop Spiral Addition, a scalar form of Spiral Addition is defined first as shown in Table 1. A procedure for Spiral Addition based on the Spiral Counting principle (Sheridan, 1996) is defined. For the convenience of our explanation, a common naming convention is followed. Any number X = (X_n X_{n-1} ... X_1), ∀X_i ∈ {0, 1, ..., 6}, where X_i is a digit of number X. Let a = (a_n a_{n-1} ... a_1) and b = (b_n b_{n-1} ... b_1) be two spiral addresses. Then the result of Spiral Addition of them is worked out as follows.
1. scale = 1; result = 0;
2. OP1 = (OP1_n OP1_{n-1} ... OP1_1) = (a_n a_{n-1} ... a_1); OP2 = (OP2_n OP2_{n-1} ... OP2_1) = (b_n b_{n-1} ... b_1);
3. C = OP1 + OP2_1 = (C_n C_{n-1} ... C_1) (Spiral Addition). Here, the carry rule is applied. For Spiral Addition between two single-digit addresses, it follows the rules as shown in Table 1;
4. result = result + scale × C_1; scale = scale × 10 (here, "+" and "×" mean normal mathematical addition and multiplication respectively);
5. CA = OP1; CB = OP2;
6. OP1 = (CB_n CB_{n-1} ... CB_2); OP2 = (C_n C_{n-1} ... C_2);
7. Repeatedly apply steps 3 through 6 until OP1 = 0;
8. result = result + scale × OP2 (here, "+" and "×" again mean normal mathematical addition and multiplication);
9. Return result.
For example, for Spiral Addition 26+14, the procedure is shown below. In the demonstration, numbering "a.b" like "3.2" means "Step 3, 2nd time" as mentioned above:
1. scale = 1; result = 0;
2. OP1 = (2 6); OP2 = (1 4);
Table 1. Scalar Spiral Addition* (Sheridan, 1996)

    |  0    1    2    3    4    5    6
 0  |  0    1    2    3    4    5    6
 1  |  1    63   15   2    0    6    64
 2  |  2    15   14   26   3    0    1
 3  |  3    2    26   25   31   4    0
 4  |  4    0    3    31   36   42   5
 5  |  5    6    0    4    42   41   53
 6  |  6    64   1    0    5    53   52

* The first row and the first column list the scalar spiral addresses; the remaining entries are the results of Spiral Addition between the corresponding spiral addresses in the first row and the first column respectively.
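As a reading aid for Table 1, the following sketch (not from the original text) encodes the scalar Spiral Addition table as a Python dictionary and looks up the sum of two single-digit spiral addresses.

```python
# Scalar Spiral Addition (Table 1): keys/indices are the digits 0-6,
# entries are the (possibly two-digit) spiral sums.
SCALAR_SPIRAL_ADD = {
    0: [0, 1, 2, 3, 4, 5, 6],
    1: [1, 63, 15, 2, 0, 6, 64],
    2: [2, 15, 14, 26, 3, 0, 1],
    3: [3, 2, 26, 25, 31, 4, 0],
    4: [4, 0, 3, 31, 36, 42, 5],
    5: [5, 6, 0, 4, 42, 41, 53],
    6: [6, 64, 1, 0, 5, 53, 52],
}

def scalar_spiral_add(a, b):
    """Spiral Addition of two single-digit spiral addresses via Table 1."""
    return SCALAR_SPIRAL_ADD[a][b]

print(scalar_spiral_add(6, 4))   # 5   (used in the 26 + 14 example below)
print(scalar_spiral_add(1, 2))   # 15
```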
3.1 C = 26 + 4 = (2 5);
4.1 result = 0 + 1 × 5 = 5; scale = 1 × 10 = 10;
5.1 CA = 26; CB = 14;
6.1 OP1 = 1; OP2 = 2;
3.2 C = 1 + 2 = (1 5);
4.2 result = 5 + 10 × 5 = 55; scale = 10 × 10 = 100;
5.2 CA = 1; CB = 2;
6.2 OP1 = 0; OP2 = 1;
7. OP1 = 0;
8. result = 55 + 100 × 1 = 155;
9. Return 155.
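The nine-step procedure above can be transcribed almost literally into code. The sketch below is illustrative only; it assumes a helper spiral_add_digit(x, d) that performs Spiral Addition of a multi-digit address x and a single digit d using Table 1 and the carry rule (that helper is not shown). Addresses are held as ordinary integers whose decimal digits are the base-seven spiral digits, so % 10 and // 10 extract and drop the lowest digit.

```python
def spiral_add(a, b, spiral_add_digit):
    """Multi-digit Spiral Addition following steps 1-9 of the procedure.

    spiral_add_digit(x, d): assumed helper adding the single digit d to the
    multi-digit spiral address x (Table 1 plus the carry rule).
    """
    scale, result = 1, 0                        # step 1
    op1, op2 = a, b                             # step 2
    while op1 != 0:                             # step 7: repeat steps 3-6
        c = spiral_add_digit(op1, op2 % 10)     # step 3: C = OP1 + OP2_1
        result += scale * (c % 10)              # step 4 (ordinary + and *)
        scale *= 10                             # step 4
        op1, op2 = op2 // 10, c // 10           # steps 5-6: drop lowest digits
    return result + scale * op2                 # steps 8-9

# Reproducing the worked example 26 + 14 = 155, given that the helper
# returns spiral_add_digit(26, 4) == 25 and spiral_add_digit(1, 2) == 15:
demo = {(26, 4): 25, (1, 2): 15}
print(spiral_add(26, 14, lambda x, d: demo[(x, d)]))   # 155
```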
To guarantee that all the pixels are still located within the original image area after Spiral Addition (Sheridan et al., 2000), a modulus operation is defined on the spiral address space. From Figure 5, it is shown that the spiral address is a base-seven number, so modular operations based on such a number system must execute accordingly. Suppose spiral_address_max stands for the maximum spiral address in the given Spiral Architecture area; then the modulus number is

modulus = spiral_address_max + 1    (1)
where + is Spiral Addition. Then, the modular operation on the spiral addressing system can be performed as follows. First, the address number and the corresponding modulus number are converted to their decimal formats, and the result of the modulus operation is worked out in the decimal number system. Then, the result in decimal format is converted back to its corresponding base-seven spiral address. In addition, an Inverse Spiral Addition exists on the spiral address space. That means for any given spiral address x there is a unique spiral address x̄ in the same image area which satisfies the condition x + x̄ = 0, where the sign "+" stands for Spiral Addition. The procedure for computing the inverse value of a spiral address can be summarized briefly as follows. According to Table 1, the inverse values of the seven basic spiral addresses 0, 1, 2, 3, 4, 5 and 6 are 0, 4, 5, 6, 1, 2 and 3 respectively. So the inverse value p̄ of any spiral address p = (p_n p_{n-1} ... p_1) can be computed as:

p̄ = (p̄_n p̄_{n-1} ... p̄_1)    (2)
Furthermore, Spiral Addition meets the requirement of a bijective mapping. That is, each pixel in the original image maps one-to-one to each pixel in the output image after Spiral Addition.
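Both the modular operation and the Inverse Spiral Addition just described can be sketched directly. The illustrative code below (not from the original chapter) performs the modulus by converting the base-seven spiral address to decimal and back, as the text describes, and builds the inverse address digit by digit using the map 0→0, 1→4, 2→5, 3→6, 4→1, 5→2, 6→3 from Equation (2).

```python
def to_decimal(address):
    """Base-seven spiral address (written with digits 0-6) -> decimal value."""
    return int(str(address), 7)

def to_spiral(value):
    """Decimal value -> base-seven spiral address."""
    digits = ""
    while True:
        digits = str(value % 7) + digits
        value //= 7
        if value == 0:
            return int(digits)

def spiral_mod(address, modulus):
    """Modular operation on the spiral address space: convert, mod, convert back."""
    return to_spiral(to_decimal(address) % to_decimal(modulus))

# Inverse Spiral Addition, Equation (2): apply the digit-wise inverse map.
INVERSE_DIGIT = {0: 0, 1: 4, 2: 5, 3: 6, 4: 1, 5: 2, 6: 3}

def inverse_address(address):
    return int("".join(str(INVERSE_DIGIT[int(d)]) for d in str(address)))

print(inverse_address(26))     # 53
print(spiral_mod(155, 100))    # 55  (155 base-seven is 89 decimal; 100 base-seven is 49)
```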
Spiral Multiplication Spiral Multiplication is also an arithmetic operation with closure properties defined on the spiral addressing system so that the resulting product will be a spiral address in the same finite set on which the
operation is performed. In addition, like Spiral Addition, Spiral Multiplication incorporates a special form of modularity. For basic Spiral Multiplication, a scalar form is defined as shown in Table 2. The same naming convention is followed as in the Spiral Addition explanation in the last section. Multiplication of an address a by the scalar α (α ∈ {0, 1, ..., 6}) is obtained by applying scalar multiplication to each digit of a according to the scalar form, and is denoted by:

α × (a) = (αa_n αa_{n-1} ... αa_1), where a = (a_n a_{n-1} ... a_1), ∀a_i ∈ {0, 1, ..., 6}    (3)
If the address in Spiral Multiplication is a common address like

b = (b_n b_{n-1} ... b_1), ∀b_i ∈ {0, 1, ..., 6}    (4)
then

$a \times b = \sum_{i=1}^{n} (a \times b_i) \times_{nml} 10^{\,i-1}$    (5)
where ∑ denotes Spiral Addition, × denotes Spiral Multiplication and ×nml denotes normal mathematical multiplication. A carry rule is required in Spiral Addition to handle the addition of numbers composed of more than one digit. For example, to compute the Spiral Multiplication of 26×14, the procedure is shown below:
Table 2. Scalar Spiral Multiplication* (Sheridan, 1996)

    |  0    1    2    3    4    5    6
 0  |  0    0    0    0    0    0    0
 1  |  0    1    2    3    4    5    6
 2  |  0    2    3    4    5    6    1
 3  |  0    3    4    5    6    1    2
 4  |  0    4    5    6    1    2    3
 5  |  0    5    6    1    2    3    4
 6  |  0    6    1    2    3    4    5

* The first row and the first column list the scalar spiral addresses; the remaining entries are the results of Spiral Multiplication between the corresponding spiral addresses in the first row and the first column respectively.
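The sketch below (illustrative, not from the original text) encodes Table 2 and applies Equation (3), multiplying every digit of an address by a scalar.

```python
# Scalar Spiral Multiplication (Table 2): keys/indices are the digits 0-6.
SCALAR_SPIRAL_MUL = {
    0: [0, 0, 0, 0, 0, 0, 0],
    1: [0, 1, 2, 3, 4, 5, 6],
    2: [0, 2, 3, 4, 5, 6, 1],
    3: [0, 3, 4, 5, 6, 1, 2],
    4: [0, 4, 5, 6, 1, 2, 3],
    5: [0, 5, 6, 1, 2, 3, 4],
    6: [0, 6, 1, 2, 3, 4, 5],
}

def scalar_multiply(alpha, address):
    """Equation (3): multiply each digit of the spiral address by the scalar alpha."""
    return int("".join(str(SCALAR_SPIRAL_MUL[alpha][int(d)]) for d in str(address)))

# Used in the 26 x 14 example that follows: 26 x 4 gives 53 and 26 x 1 gives 26.
print(scalar_multiply(4, 26))   # 53
print(scalar_multiply(1, 26))   # 26
```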
26 × 14 = 26 × 4 ×nml 1+ 26 × 1 ×nml 10 = (2 × 4 6 × 4) ×nml 1 + (2 × 1 6 × 1) ×nml 10 = 53 ×nml 1 + 26 ×nml 10 = 33
In the above demonstration, the Spiral Addition procedure is omitted. Similarly to Spiral Addition, a modulus operation on the spiral address space is defined in order to guarantee that all the pixels are still located within the original Spiral area after Spiral Multiplication. Furthermore, the transformation through Spiral Multiplication defined on the spiral address space is a bijective mapping. That is, each pixel in the original image maps one-to-one to each pixel in the output image after Spiral Multiplication. Modulus Multiplication is shown as follows. Let p be the product of two elements a and b. That is,

p = a × b    (6)

where a and b are two spiral addresses. If p ≥ modulus, then, if a is a multiple of 10,

p = (p + (p ÷ modulus)) mod modulus    (7)

otherwise,

p = p mod modulus    (8)

where

modulus = spiral_address_max + 1    (9)

and + is Spiral Addition.
Finally, another point related to Spiral Multiplication is the existence of a multiplicative inverse. Given a spiral address a, there should be another address b such that a × b = 1 (Spiral Multiplication), denoted by a⁻¹, i.e., b = a⁻¹. Two cases must be considered to find out the inverse value for a spiral address. Here, it is assumed
that spiral address 0 has no valid inverse value.
Case 1: a is Not a Multiple of 10 Let us assume a = (a_n a_{n-1} ... a_1), ∀a_i ∈ {0, 1, ..., 6}, and the inverse value b = (b_n b_{n-1} ... b_1), ∀b_i ∈ {0, 1, ..., 6}. In general, it is easy to get the inverse values for the basic spiral addresses 1, 2, 3, 4, 5 and 6. They are 1, 6, 5, 4, 3 and 2 respectively. So the inverse value b can be constructed successfully by the following formula:

$b_1 = a_1^{-1}$
$b_2 = -(a_2 \times b_1) \times b_1$
$\dots$
$b_n = -\left( \sum_{i=0}^{n-2} a_{n-i} \times b_{i+1} \right) \times b_1$    (10)
Case 2: a is a Multiple of 10

a = k × 10^m (m < n),  modulus = 10^n = spiral_address_max + 1    (11)

k⁻¹ can be obtained by Equation (10). Then, the inverse value of a is

a⁻¹ = k⁻¹ × 10^{n−m} (Spiral Multiplication)    (12)
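As a quick illustration of the multiplicative inverse, the self-contained sketch below (not from the original chapter) confirms that the basic addresses 1–6 and their stated inverses 1, 6, 5, 4, 3, 2 multiply to 1 under Table 2.

```python
# Scalar Spiral Multiplication (Table 2), repeated here for self-containment.
SCALAR_SPIRAL_MUL = {
    0: [0, 0, 0, 0, 0, 0, 0], 1: [0, 1, 2, 3, 4, 5, 6], 2: [0, 2, 3, 4, 5, 6, 1],
    3: [0, 3, 4, 5, 6, 1, 2], 4: [0, 4, 5, 6, 1, 2, 3], 5: [0, 5, 6, 1, 2, 3, 4],
    6: [0, 6, 1, 2, 3, 4, 5],
}

# Inverses of the basic spiral addresses 1..6, as stated in Case 1 above.
BASIC_INVERSE = {1: 1, 2: 6, 3: 5, 4: 4, 5: 3, 6: 2}

for a, a_inv in BASIC_INVERSE.items():
    assert SCALAR_SPIRAL_MUL[a][a_inv] == 1, (a, a_inv)   # a x a^-1 = 1
print("all basic inverses multiply to 1")
```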
Mimicking Spiral Architecture In order to implement the idea of Spiral Architecture in image-processing applications, it is inevitable to use a mimic Spiral Architecture built on the existing rectangular image architecture, because of the lack of mature devices for capturing and displaying images based on a hexagonal image architecture. Mimic Spiral Architecture plays an important role in image-processing applications on Spiral Architecture. It forwards the image data between the image-processing algorithms on Spiral Architecture and the rectangular image architecture used for display purposes (see Figure 6). Such a mimic Spiral Architecture must retain the symmetrical properties of the hexagonal grid system. In addition, mimic Spiral Architecture does not degrade the resolution of the original image. For a given picture represented on rectangular architecture, if it is re-represented on a Spiral Architecture in which each hexagonal grid has the same area as a square grid of the rectangular architecture, the image resolution is retained. In order to work out the size of the hexagonal grid, the length of the side of a square grid is defined as
Figure 6. Image processing on mimic Spiral Architecture
Figure 7. A square grid and a hexagonal grid which have the same size of area
Figure 8. Relation between mimic hexagonal grid and the connected square grid. si is the size of overlap area
1 unit length. Namely, the area of a square grid is 1 unit area. Then, for a hexagonal grid which has the same area as the square grid, the distance from the centre to a side of the hexagonal grid is 0.537 (see Figure 7). In order to work out the grey value of a hexagonal grid, the relations between the hexagonal grid and its connected square grids must be investigated. The purpose is to find out the different contribution of each connected square grid's grey value to the referenced hexagonal grid (see Figure 8). Let N denote the number of square grids which are connected to a particular hexagonal grid. s_i denotes the size of the overlap area between square grid i, one of the connected square grids, and the hexagonal grid. Because the size of a grid is 1 unit area (see Figure 7), the percentage of overlap area in a referenced
hexagonal grid is

p_i = s_i / 1 × 100% = s_i    (13)

Let g_h denote the grey value of the hexagonal grid, and g_s denote the grey value of a square grid. Thus, the grey value of the hexagonal grid is calculated as the weighted average of the grey values of the connected square grids as

$g_h = \sum_{i=1}^{N} p_i\, g_{s_i}$    (14)

On the other hand, the reverse operation must be considered in order to map the images from virtual Spiral Architecture to rectangular architecture after image processing on Spiral Architecture (see Figure 6). After image processing on Spiral Architecture, the grey values of the virtual hexagonal grids have been changed. Thus, the aim is to calculate the grey values of a square grid from the connected hexagonal grids (see Figure 8.b). The same way as Equation (14) is used to calculate the grey value of the square grid. However, p_i now stands for the percentage of overlap area in a referenced square grid (see Figure 8.b). Supposing there are M virtual hexagonal grids connected to a particular square grid, the square grid's grey value is

$g_s = \sum_{i=1}^{M} p_i\, g_{h_i}$    (15)
Using Equations (14) and (15), the grey values of the grids can be calculated easily as long as p_i can be calculated. Wu et al. (Wu, He, & Hintz, 2004) proposed a practically achievable method for easily calculating the relation between the mimic Spiral Architecture and the connected square grids on digital images.
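A minimal sketch of the grey-value conversion in Equations (14) and (15) is given below; it simply forms the weighted average of the connected grids' grey values, with the overlap fractions p_i supplied by a routine such as the one proposed by Wu et al. (2004). All names and values are illustrative.

```python
def converted_grey_value(overlap_fractions, grey_values):
    """Equations (14)/(15): weighted average of the connected grids' grey values.

    overlap_fractions: p_i, the fraction of the referenced grid covered by grid i
    grey_values:       the grey value of each connected grid
    """
    assert len(overlap_fractions) == len(grey_values)
    return sum(p * g for p, g in zip(overlap_fractions, grey_values))

# Example: a hexagonal pixel overlapped by three square pixels.
print(converted_grey_value([0.5, 0.3, 0.2], [100, 120, 80]))   # 102.0
```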
IMAGE PARTITIONING ON SPIRAL ARCHITECTURE A novel image partitioning method is proposed for distributed image processing based on Spiral Architecture. Using this method, each processing node will be assigned a uniform partitioned sub-image that contains all the representative information, so each processing node can deal with the assigned information independently without data exchanges between the processing nodes. The first requirement for such a partitioning scheme is that it should be configurable according to the number of partitions required. Second, the partitioning should follow a consistent approach: after image partitioning, each sub-image should be a representative of the original one without changing the basic object features. Finally, the partitioning should be fast, without introducing extra cost to the system.
General Image Partitioning on Spiral Architecture Under the traditional rectangular image architecture there are three basic image partitioning schemes: row partitioning, column partitioning and block partitioning (Bharadwaj et al., 2000; Koelbel, Loveman, Schreiber,
Jr., & Zosel, 1994). Compared with rectangular image architecture, Spiral Architecture does not arrange the pixels row-wise, column-wise or in normal rectangular blocks. Instead, each pixel is positioned by a unique Spiral Address along the spiral rotation direction shown in Figure 9. The traditional partitioning methods are infeasible except for block partitioning. For example, the image in Figure 9 can be partitioned evenly into seven parts with seven sub-data sets like [0, 1, …, 6], [10, 11, …, 16], [20, 21, …, 26], [30, 31, …, 36], [40, 41, …, 46], [50, 51, …, 56], [60, 61, …, 66], where the numbers in the brackets are the spiral addresses of the pixels of Figure 9. That is, Spiral Architecture can split the original picture into M (M = 7^n, n = 1, 2, ...) parts. Based on the Spiral addressing scheme, the consecutive hexagonal pixels are grouped together. Inside each part, the total number of pixels is also a power of seven. The index of the partitioned sub-area is consistent with the spiral addressing system. Thus the pixels in the different sub-areas can be identified immediately. A real example of image segmentation based on the partition scheme above is shown in Figure 10. From Figure 10, it is seen that such a partition scheme simply splits the original image area into equal-size pieces but does not consider the image contents inside. For a global image-processing operation such as global Gaussian processing using a distributed processing system, each node may process one segment of the original image. During processing, the nodes have to exchange necessary information between them. Such local communication is a disadvantage, and it becomes greater as the number of partitions increases.
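The simple equal-size partitioning just described can be sketched as grouping addresses by their leading base-seven digit. The code below is illustrative only and reproduces the seven sub-data sets listed above for a 49-pixel image.

```python
def to_base7(k):
    """Ordinal position -> base-seven numeral (as a string)."""
    out = ""
    while True:
        out = str(k % 7) + out
        k //= 7
        if k == 0:
            return out

def simple_partition(n_levels):
    """Split the addresses of a 7**n_levels image into seven consecutive blocks."""
    blocks = {d: [] for d in range(7)}
    for k in range(7 ** n_levels):
        digits = to_base7(k).zfill(n_levels)        # pad to n_levels digits
        blocks[int(digits[0])].append(int(digits))  # group by most significant digit
    return blocks

parts = simple_partition(2)
print(parts[0])   # [0, 1, 2, 3, 4, 5, 6]
print(parts[3])   # [30, 31, 32, 33, 34, 35, 36]
```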
Uniform Image Partitioning on Spiral Architecture In Spiral Architecture, two algebraic operations have been defined, Spiral Addition and Spiral Multiplication. After an image is projected onto Spiral Architecture, each pixel on the image is associated with a particular hexagon and its Spiral address. The two operations mentioned above can then be used to define
Figure 9. Pixel arrangement on Spiral Architecture
Figure 10. Simple equal size image partitioning on Spiral Architecture
two transformations on the spiral address space: image translation and rotating image partitioning. In our research, Spiral Multiplication is applied to achieve uniform image partitioning which is capable of balancing workload among the processing nodes and achieves zero data exchange between the nodes. From Figure 10, it is seen that simple image segmentation will result in much network overhead for node synchronization and algorithm programming. In Spiral Architecture, after an image is multiplied by a specific Spiral Address, the original image is partitioned into several parts. Each part is a near copy of the original image. Each copy results from a unique sampling of the input image. Each sample is mutually exclusive and the collection of all such samples represents a partitioning of the input image. As the scaling in effect represents the viewing of the image at a lower resolution, each copy has less information. However, as none of the individual light intensities have been altered in any way, the scaled images in total still hold all of the information contained in the original one (Sheridan, 1996). Consequently, the sub-images can be processed independently by the corresponding processing nodes without requiring data exchange between them. Figure 11 shows an example of image partitioning with Spiral Multiplication. The original image has 16807 hexagon pixels. The multiplier used in Spiral Multiplication is 100001. With the novel uniform image partitioning on Spiral Architecture, task parallelism can be achieved. An application containing complicated image processing often requires processing results in the different aspects such as histogram, edge map and spectrum distribution. Under the proposed image partitioning, all these tasks can be dealt with independently on the assigned sub-images. Such a parallel processing scheme increases the system efficiency. Moreover, because each node possesses less information than the original image, processing time will be shortened dramatically. Detailed demonstration will be shown in the experiment section. There are two points still existing in the above partitioning method that must be resolved before it is utilized in practical applications. First, it is known that the uniform image partitioning is simply
achieved by Spiral Multiplication. However, the relation between the multiplier and the number of partitions has not been described yet. This is an important point in practical systems, since it must be able to determine the number of partitions according to the image to be processed and the practical system performance. Second, a complete sub-image may not be obtained when the multiplier used in Spiral Multiplication is a general spiral address. For example, when the spiral address is 55555, the original image is partitioned into several parts, but only the middle part holds the complete information of the original image. Other sub-images are scattered on different areas (see Figure 12). It would be necessary to collect the corresponding scattered parts together to form a complete sub-image before it is delivered to the corresponding node for distributed processing. In the following sections, solutions are proposed to deal with the two points mentioned above.
Computing the Number of Partitions It is necessary to determine the relation between the multiplier and the number of partitions. Further, the relation should be static, so that for any given multiplier the number of partitions is determined uniquely when the corresponding Spiral Multiplication is executed. From the perspective of a distributed processing application, the number of partitions often needs to be decided according to the image to be processed and the performance of the system platform before the processing procedure commences. With the help of a static relation between the multiplier and the partitioning number, it can be known what multiplier to use in Spiral Multiplication in order to partition the image into the specified number of parts. In this work, it was found that such a relation between the Spiral Address and the partitioning number cannot be found directly. In order to achieve this goal, the Spiral Architecture is refined with the help of the structure proposed in (He, 1998). This redefined architecture was originally used to find the spiral address for image rotation on Spiral Architecture. In this chapter, it will be used to construct the relation between the multiplier (spiral address) which is used in the Spiral Multiplication for a particular
Figure 11. Seven part (near copies) image partitioning on Spiral Architecture
Figure 12. Spiral Multiplication by a common Spiral Address 55555
image partitioning and the number of partitions. The newly refined architecture contains three parameters to identify each of its hexagonal pixels. Then, every spiral address will be mapped into a set of three parameters. The refined Spiral Architecture is shown in Figure 13. The original Spiral Architecture is divided into six regions, which are denoted by r = 1, 2,...,6. In each region, the pixels are grouped into the different levels denoted by l = 0, 1,... along the radial direction. On each level, each pixel is regarded as an item denoted by i, where i = 0, 1,...,l clockwise, as shown in Figure 13. Each pixel can then be located uniquely by the three parameters, (r, l, i), in addition to the Spiral Address within Spiral Architecture. Based on the theory of Spiral Multiplication, every spiral address value x has a unique inverse value y for an image of a given size. They should meet the condition,
Figure 13. Redefined Spiral architecture
(x × y) mod N = 1
(16)
where N is determined by the size of the image. Suppose the maximum spiral address of an image is a_max; then N = a_max + 1 (Spiral Addition). In order to find the relation between the multiplier and the number of partitions, the second step is to work out the inverse value of the multiplier with Equation (16), which is also a spiral address. This can be done instantly using the principles of Spiral Multiplication. Then, the parameter (r, l, i) corresponding to the inverse value of the multiplier can be found in the refined Spiral Architecture as shown in Figure 13. In a practical application, a table is made to map each spiral address to its corresponding parameter (r, l, i) beforehand. Naturally, there are no mathematical modules which yield the relation between multiplier and number of partitions. In this research, an inductive method is followed to exploit the principle. The number of partitions is counted manually after the image is transformed by Spiral Multiplication with a particular multiplier. For example, the number of partitions is counted manually when the inverse values of the multiplier are 0, 1, 2, 14, 15, 63, whose corresponding parameters in the refined Spiral Architecture are (0, 0, 0), (1, 1, 0), (1, 1, 1), (1, 2, 2), (1, 2, 1), (1, 2, 0). The numbers of partitions are 0, 1, 1, 4, 3, 4 respectively. In the work, more similar tests were made manually in order to reveal the relationship between the multiplier and the number of partitions. Based on the inductive method, the following formula is derived:

$P_{Number}(r, l, i) = l^2 - i(l - 1) + i(i - 1)$    (17)

where r = 1, 2, ..., 6; l = 0, 1, 2, ...; and i = 0, 1, ..., l.
Then, the above formula is tested by a special image partitioning whose multiplier for Spiral Multiplication and numbers of partitions are known. Initially, it is known that, for an image of 49 hexagonal pixels, it will be partitioned into seven near copies (See Figure 11) through Spiral Multiplication with the multiplier 10. In this case, the inverse value of 10 is also 10 according to the principles of Spiral Multiplication. From the refined Spiral Architecture, it is known that the corresponding parameters are (1, 3, 2). They are substituted in Equation (17). The number of partitions is calculated, and is seven, as expected. It is found that the number of partitions is only determined by the parameters l and i . That means the partitioning number is only related to the level number and the item number of the inverse value of the multiplier. The rotation angle is the only difference among the images transformed by Spiral Multiplication with different multipliers that correspond to the same parameters l and i, but different values of r. The angle difference is a multiple of 60 degrees. This point will be analysed in detail in the next section. In addition, every point on the border of two adjoining regions on the refined Spiral Architecture (See Figure 13) has two different sets of parameters, because it strides over two regions with different region numbers and different item numbers. Its corresponding number of partitions is identical, however, regardless of which set of parameters are substituted into Equation (17). For example, address 14 has
the parameters (1, 2, 2) and (2, 2, 0), but the corresponding number of partitions is 4 if the inverse value of the multiplier used by the Spiral Multiplication for image partitioning is 14. Using the formula derived above, an image can be partitioned into as many parts as required, which are subsampled copies of the original image in Spiral Architecture. This image partitioning method is thus controllable and manageable according to the required precision and the capacity of processing nodes on the network for distributed image processing. Unfortunately, using Spiral Multiplication cannot partition the original image into any number of sub-images. For example, it is impossible to find a multiplier which can partition an image into two parts by Spiral Multiplication. Thus, in practical applications, the approximate number of partitions must be found to meet the requirements. The reason is that uniform image partitioning on Spiral Architecture is the result of Spiral Multiplication. The new positions of the pixels are determined uniquely by the principle of Spiral Multiplication. The relation of the pixel positions before and after Spiral Multiplication is a one-to-one mapping. Ordinary mathematical multiplication is defined on a continuous domain. However, Spiral Multiplication is actually a kind of address counting operation which is a procedure for pixel repositioning. Consequently, it cannot be guaranteed that a multiplier (spiral address) can be found to partition the input image into any number of parts. From the mathematical view, an integral solution of the multivariate formula

$l^2 - i(l - 1) + i(i - 1) - P_{Number} = 0$, where l = 0, 1, 2, ...; and i = 0, 1, ..., l,    (18)

cannot always be guaranteed.
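As a quick numerical check of Equation (17), the illustrative sketch below evaluates P_Number for the parameter sets quoted above and reproduces the partition counts reported in the text (including the seven-part case used for Figure 11).

```python
def p_number(l, i):
    """Equation (17): number of partitions for level l and item i of the
    multiplier's inverse value (the region number r does not enter)."""
    return l * l - i * (l - 1) + i * (i - 1)

# Inverse values 1, 2, 14, 15, 63 have parameters (l, i) = (1,0), (1,1), (2,2), (2,1), (2,0);
# inverse value 10 has parameters (l, i) = (3, 2).
checks = {(1, 0): 1, (1, 1): 1, (2, 2): 4, (2, 1): 3, (2, 0): 4, (3, 2): 7}
for (l, i), expected in checks.items():
    assert p_number(l, i) == expected
print("Equation (17) reproduces the counts quoted in the text")
```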
Complete Image Partitioning in Spiral Architecture With the formula developed in the previous section, a Spiral Multiplier can be decided to partition the original image into the required number of near copies. The number of partitions can be found to match practical requirements by an adaptive method such as Divisible Load Theory (DLT) (Bharadwaj et al., 2000). However, it is found that when the number of partitions is not a power of seven such as 7, 49 or 343 only one sub-image in the middle of the image area is a complete sub-image, while the other subimages are segmented into several fragments scattered to different positions in the original image area. For example, an image multiplied by a common spiral address 55555 gives the results shown in Figure 12. With the exception of the middle sub-image, the other three sub-images are each split into two fragments. This is unacceptable for distributed processing. Obviously, two problems must be resolved before distributing the data to the processing nodes. One is that the corresponding fragments that belong to the same sub-image must be identified. Another problem is that all the corresponding fragments must be moved together to become a complete sub-image. In this research, it is found that the boundaries of the different sub-image areas could be detected by investigating the neighbouring relation of the spiral addresses between the reference point and its six adjacent points: the neighbouring relation of spiral addresses along the boundary is different from the neighbouring relation within the sub-image area. All the points belonging to the same sub-image area
Figure 14. Seven hexagon cluster with six addends of Spiral Addition
have a consistent relation. Consistency is destroyed only across a boundary between two different sub-image areas. Moreover, it is shown that the consistency can be expressed by Spiral Addition. Figure 14 shows a seven-hexagon cluster. The six numbers n_1, n_2, ..., n_6 shown are addends for Spiral Addition, to be used later. The values of these addends are different under different Spiral Multiplications with the appropriate multipliers for the required image partitioning. The details of the method to calculate the addends will be explained later. Here, it is assumed that these addends have already been given. Then, after image partitioning, all the points of the original image will move to new positions in the new image. In the output partitioned image, if a point's original spiral address on the input image before partitioning is given, its six neighbouring points' original spiral addresses will be determined by Spiral Addition with the addends as shown in Figure 14. For example, suppose a point's spiral address on the original image is x and the original address of its neighbouring point below is y, corresponding to the position labelled n_1 in Figure 14. If y = x + n_1, these two points are in the same sub-image. Otherwise, these two points are in different sub-images and they both stay on the boundaries of the sub-images. Here, "+" stands for Spiral Addition including the modular operation if necessary.

Definition 4.1. A point is defined as an inside point, i.e. a point within a sub-image area, if the relation between the point's address x and its six neighbouring points' addresses y_i for i ∈ {1, 2, ..., 6} satisfies Equation (19); otherwise it is defined as an adjoining point, i.e. a point on the boundary between two sub-image areas.

y_i = x + n_i, i ∈ {1, 2, ..., 6}    (19)
Addition in Equation (19) is Spiral Addition including any necessary modular operation (See Section 1.1) rather than normal mathematical addition. Now, the remaining question is how to compute the addends ni, for i ∈ {1, 2, 3, 4, 5, 6}. During image partitioning, the values of addends are determined by Spiral Multiplication, which achieves the corresponding image partitioning. In other words, once the number of image partitions is determined, the multiplier used in Spiral Multiplication is determined as explained in the previous section. The values of addends as shown in Figure 14 are then fixed. Whether the point is an inside point or an adjoining point is determined by the condition mentioned above. In fact, the values of addends in Figure 14 are the original spiral addresses of the six points surrounding the centre of the image. An example is given below. Figure 15 shows the computation results of the Spiral Multiplication with multiplier “23” on an image of 49 points. As shown in the figure, all the points move to unique new positions. Based on the above explanation, the addends ni, i ∈ {1, 2, 3, 4, 5, 6}, are 15, 26, 31, 42, 53 and 64 respectively. The point
with address "15" is an inside point because the relation between its address and its six neighbouring points' addresses meets the condition shown in Equation (19). The point with address "25" is an adjoining point because some of its neighbouring points cannot meet the address relation of Equation (19). For example, its upper neighbouring point's original address is "24". The corresponding addend used for Spiral Addition in Equation (19) is n_4 = 42. According to Equation (19), if the point of address "25" were an inside point, the original address of the neighbouring point above it should be "30", i.e., 25 + 42 = 30 (Spiral Addition), rather than "24". So the point of address "25" is an adjoining point. This checking procedure proceeds on each point as follows:
1. Initialize sub-image number sn = 1;
2. Choose any unchecked point on the image as the next point to be checked;
3. Label this point as sn;
4. Label all the unchecked neighbouring points which meet the condition in Equation (19) as sn;
5. Store the neighbouring points just labelled in step 4 temporarily in a buffer;
6. Choose any one of the neighbouring points which was just labelled in step 4 as the next point to be checked;
7. Repeat steps 3 to 6 until no unchecked neighbouring points can be found in step 4;
8. Choose any one of the unchecked points stored in the buffer as the next point to be checked;
9. Repeat steps 3 to 8 until no unchecked point can be found in the buffer;
10. Clear the buffer and set sn = sn + 1;
Figure 15. Relocation of points after Spiral Multiplication with multiplier “23”
Figure 16. Three labelled sub-image areas after image partitioning
11. Repeat steps 2 to 10 until no unchecked point can be found on the image. Then, all the points will be labelled by an area number. The fragments corresponding to the same sub-image are found as shown in Figure 16. The last requirement is to collect the corresponding fragments together to form a complete sub-image. Suppose the number of partitions is not a power of seven. After image partitioning in Spiral Architecture, all the sub-images are incomplete partitioned images except the middle one. It is known that Spiral Addition with a common addend will move each point to a new position and guarantee a one-to-one mapping between the input image and the output image without changing the object shape, so it is a good technique for collecting the fragments of a sub-image. Moreover, from Figure 16 it is observed that all the sub-images have similar sizes and the sub-image in the middle area is always a complete sub-image. There is a special case: when the number of partitions is a power of seven, all the sub-images have exactly the same size. This fact confirms that if the pixels in an incomplete sub-image can be moved into the middle sub-image area properly, this sub-image will be restored successfully. Since Spiral Addition is a consistent operation, if the point that was closest to the point with spiral address "0" on the original image is moved, other points will be automatically located at corresponding positions without changing the object shape in the image. Such movement is achieved using Spiral Addition as mentioned above. This operation is performed on each sub-image that has been given an area number in the previous step, and then all the incomplete sub-images will be restored one by one. Let us call the point which was closest to the point with spiral address "0" before image partitioning the relative centre of the sub-image. The addend of Spiral Addition for restoring the incomplete sub-image is computed as follows.
Figure 17. Four-part complete image partitioning in Spiral Architecture
Suppose the spiral address of the relative centre of the sub-image is x after image partitioning. Then the addend of Spiral Addition for collecting the fragments of the sub-image is the inverse value of x, x̄, which is computed according to the principles of Spiral Addition. As a result, the relative centre is moved to the point of spiral address "0" and the other points in the fragments are moved to corresponding positions to produce a complete sub-image. Figure 17 gives an example showing the procedure mentioned above. The original images contain 16807 points. They are partitioned into four parts with multiplier 55555 and three parts with multiplier 56123 respectively. The separated sub-image areas are shown using different illuminations and labelled using different area numbers. Finally, the fragments of the incomplete sub-images were collected together to produce complete partitioned sub-images. The addends used in Spiral Addition for fragment collection are also shown on each sub-image. The complete sub-images so obtained can be distributed to different nodes for further processing. Figure 18 gives another example which shows image partitioning into 3 parts on Spiral Architecture. The relevant addends are shown on the pictures.
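The boundary test of Definition 4.1 and the labelling steps 1-11 amount to a flood fill over the partitioned image. The sketch below is purely illustrative: it assumes a spiral_add(x, y) routine implementing the (modular) Spiral Addition described earlier, a table `original` mapping each point's new position to its original spiral address, and a `neighbour` function giving the six adjacent positions; none of these helpers are shown here. It assigns an area number to each connected region under the Equation (19) relation.

```python
from collections import deque

def label_subimage_areas(positions, original, neighbour, addends, spiral_add):
    """Assign an area number to every point, mirroring steps 1-11.

    positions:  iterable of all point positions after partitioning
    original:   dict position -> original spiral address before partitioning
    neighbour:  function (position, i) -> adjacent position in direction i (1..6), or None
    addends:    dict i -> n_i, the six addends of Figure 14
    spiral_add: assumed Spiral Addition routine (including the modular operation)
    """
    label = {}
    area = 0
    for start in positions:
        if start in label:
            continue
        area += 1                               # start a new area number
        label[start] = area
        queue = deque([start])
        while queue:                            # grow the current area
            x = queue.popleft()
            for i, n_i in addends.items():
                y = neighbour(x, i)
                if y is None or y not in original or y in label:
                    continue
                # Equation (19): same sub-image area iff the original addresses
                # differ by exactly the addend n_i.
                if original[y] == spiral_add(original[x], n_i):
                    label[y] = area
                    queue.append(y)
    return label
```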
EXPERIMENTS In order to demonstrate the advanced performance provided by distributed processing based on special image partitioning on Spiral Architecture, global Gaussian processing for image blurring is chosen as a testing algorithm. Gaussian processing is an algorithm used widely in image processing for several applications such as edge detection, image denoising, and image blurring. It can be mathematically explained as,
Figure 18. Three-part complete image partitioning on Spiral Architecture
Figure 19. Prototype of distributed system topology
$L(x, y; t) = g(x, y; t) * f(x, y) = \iint_{\sigma} \frac{1}{2\pi t}\, f(u, v)\, e^{-\frac{(x-u)^2 + (y-v)^2}{2t}}\, du\, dv.$    (20)
where f maps the coordinates of the pixel (x, y) to a value representing the light intensity, i.e. f : ℝ² → ℝ. g(·) is the Gaussian kernel. L(·) represents the image after Gaussian processing. t is called the 'coarseness scale' and t > 0. σ stands for a set of points on the image area which participate in the Gaussian convolution. For global Gaussian processing, σ stands for the whole area of the original image. As t increases, the signal L becomes gradually smoother.

In our work, the partitioning approach is implemented on a cluster of workstations shown in Figure 19. One of the computers acts as a master computer (master node) and the remaining seven computers are used as slave computers (slave nodes). In the early phase of the processing, the master node is responsible for initial data partitioning and data delivery to the slave nodes. The data is then processed on the slave nodes. Depending on the image partition scheme, the slave nodes may or may not need to exchange data (denoted by the dashed lines in Figure 19). For the scheme of simple image partitioning shown in Figure 10, data exchange between slave nodes is inevitable since each part does not represent the information of the whole image. During the procedure of global Gaussian processing, each slave node must obtain the necessary pixel information located in other parts to complete the computation of Equation (20). On the other hand, if the uniform image partitioning scheme based on Spiral Architecture is chosen (see Figure 11), each slave node can carry out the necessary processing independently without data exchange between nodes because each node possesses a near copy of the original image. The individual processing results are sent to the master node, where a relatively simple process is carried out to combine the individual results into the final result of global Gaussian processing. Thus, the dashed lines presented in Figure 19 can be removed.

A three-level algorithm is designed, which consists of a parent process, seven child processes and seven remote processes. The parent process and the seven child processes reside on the master computer. Each slave node executes one of the remote processes. The parent process is mainly responsible for data management and process management, including data communication, command delivery, data manipulation, child process creation, process monitoring and process synchronization. Each remote process completes all the detailed work on the data block assigned by the master node. Three techniques are applied for data communication between the processes: Shared Memory, Message Queues and Sockets. The former two are used for the data exchanges between the parent process and the child processes. The latter is used for client-server communication between the child processes and the remote processes, and between the remote processes if required.

In the experiments, two approaches are used to achieve distributed processing. One used a single CPU with multiple processes. The other used multiple computers in a network, where each of them had one process to deal with the assigned sub-image. The test bed consists of eight computers (Ultra Sun workstations, each with a SPARC-based CPU with a clock rate of approximately 333.6 MHz). The experimental results based on simple data partitioning (see Figure 10) are shown in Figure 20, where data communication between slave nodes is necessary.
In the figures, “1 Process/1 Node” actually is sequential processing, where only one computer and one process deal with the task. “7 Processes/1 Node” uses one CPU with multiple processes, as mentioned
above. Finally, "7 Processes/7 Nodes" means that seven computers on a network are used to achieve distributed processing and each of them has only one process to process the assigned sub-image, the second approach mentioned above. As shown in the figure, distributed processing speeds up the data processing for the case shown, but this is not always true. For an image of 2401 pixels, processing based on a single CPU with multiple processes will take more time than sequential processing, because the CPU requires extra time to deal with process management, so the time cost exceeds the time saved by distributed processing. This situation becomes more serious when the pixel number decreases to 343. Besides the extra cost for process or node management, data communication becomes a significant issue during the procedure. The total processing time is divided into data-processing time and non-data-processing time, the latter including the time for data exchange, process management and sub-task synchronization. The statistical results for processing times are shown in Figure 21, which presents the components of processing time under the different situations based on simple image partitioning. It shows that the fraction of time for data processing decreases as the size of the image decreases. The reason is that when the size of the image decreases, the system requires less time for data processing, and this part of the time decreases dramatically. However, the non-data-processing time decreases relatively more slowly. The reason is that the time for process management does not change when the number of child processes stays fixed. In addition, the throughput of data I/O is determined by the system I/O performance. The response of a high-speed hard disk to an image of 1 MByte and an image of 100 KBytes is almost the same, so when the size of the image decreases, the time cost for data I/O through the hard disk does not change much. The situation is the same for data communication on a high-speed Local Area Network (LAN). Moreover, Figure 10 shows that after equal-size image partitioning, processing nodes do not receive equal effective object information. Some nodes contain much more object information, while some nodes do not contain any effective object information. Consequently, some nodes finish their assigned tasks earlier than other nodes. The processing times on the nodes may range from one second to several minutes. The nodes with less object information must therefore wait for the nodes with more object information before they can receive new commands and update information for the next sub-task from the master node. That is another reason why sequential processing is sometimes faster than distributed processing. As discussed above, if the uniform image partitioning scheme (see Figure 11) is chosen, system overheads and the complexities of program design will be greatly reduced because there will be no data communication between slave nodes. The same task, global Gaussian processing as in the previous section, is now carried out again based on the new partitioning scheme. Some statistics for processing time are shown in Figure 22 and Figure 23. Obviously, the computing complexity has been both reduced and nicely partitioned without discarding any information for distributed processing. In addition, as shown in Figure 23, most of the processing time is the cost of data processing. This processing system is clearly highly efficient.
If the percentage of data-processing time in the total processing time is used as the index of system efficiency, the new partitioning scheme improves the system efficiency by about 2%, from 96.94% to 98.73%, compared to the same processing approach, "7 Processes/7 Nodes", using the simple partitioning scheme.
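For reference, the "7 Processes/1 Node" configuration can be mimicked on a single machine with a process pool, as in the hedged sketch below. The function process_subimage and the sample sub-images are placeholders; the actual system described above used a parent process, child processes and socket-connected remote processes rather than this simplified pattern.

```python
from multiprocessing import Pool

def process_subimage(subimage):
    # Placeholder for the per-node work, e.g. global Gaussian processing of
    # one near-copy produced by the uniform partitioning scheme.
    return sum(subimage) / len(subimage)

if __name__ == "__main__":
    # Seven near-copy sub-images (placeholder data); with uniform partitioning
    # no data needs to be exchanged between the workers.
    subimages = [[10, 20, 30], [11, 21, 31], [12, 22, 32], [13, 23, 33],
                 [14, 24, 34], [15, 25, 35], [16, 26, 36]]
    with Pool(processes=7) as pool:
        partial_results = pool.map(process_subimage, subimages)
    # The master then combines the independent results into the final output.
    print(partial_results)
```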
Figure 20. Image processing time based on simple image partitioning
Figure 21. The components of processing time under the different situations based on simple image partitioning
CONCLUSION This chapter presents an application of Spiral Architecture to image partitioning which is important for distributed image processing. Based on the principle of Spiral Multiplication, a new image partitioning scheme is proposed. Using Spiral Multiplication an image can be partitioned into a few parts. Each part is an exclusive sampling of the original image and contains representative information from all areas of the original image. Consequently, each sub-image can be viewed as a near copy of the original im-
Figure 22. Processing time comparison based on the normal equal-size partitioning and uniform partitioning on Spiral Architecture (Image of 16807 points)
Figure 23. The components of the processing time under the different partitioning scheme
age. In distributed processing based on such an image partitioning scheme, each node will process the assigned sub-image independently without the data exchange normally required. This should speed up processing very much. In a practical system, the number of partitions is determined by the application requirement, the image to be processed and the system performance. However, the relation between the partitioning number and the multiplier (spiral address) used in Spiral Multiplication was not known. In this chapter, an equation was built up to describe this relationship, so the number of partitions can be worked out for the given multiplier and vice versa as required. Unfortunately, complete sub-images can be obtained by Spiral Multiplication only when the partitioning number is a power of seven. In other words, when the number of image partitions is some other value like 4 and 5, all the sub-images except one are split into a few fragments and scattered to different positions. It was impossible to tell which fragments belonged to which sub-image, an unacceptable flaw for parallel image processing. In this chapter, the neighbouring relation of the points is found out and explicitly expressed after Spiral Multiplication using Spiral Addition. The different sub-image areas are identified. Then, the points on the different sub-image areas are labelled. Finally, the fragments corresponding to the same sub-images are collected together to produce the complete sub-images. Such complete sub-images can be distributed to the different nodes for further processing.
REFERENCES Alexander, D. (1995). Recursively Modular Artificial Neural Network. Doctoral Thesis, Macquarie University, Sydney, Australia. Arpaci, R. H., Dusseau, A. C., Vahdat, A. M., Liu, L. T., Anderson, T. E., & Patterson, D. A. (May, 1995). The interaction of parallel and sequential workloads on a network of workstations. Paper presented at the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, NJ: Prentice Hall. Bharadwaj, V., Li, X., & Ko, C. C. (2000). Efficient partitioning and scheduling of computer vision and image processing data on bus networks using divisible load analysis. Image and Vision Computing, 18, 919–938. doi:10.1016/S0262-8856(99)00085-2 Braunl, T., Feyrer, S., Rapf, W., & Reinhardt, M. (2001). Parallel Image Processing. Berlin: Springer-Verlag. Chen, C. M., Lee, S. Y., & Cho, Z. H. (1990). A Parallel Implementation of 3D CT Image Reconstruction on HyperCube Multiprocessor. IEEE Transactions on Nuclear Science, 37(3), 1333–1346. doi:10.1109/23.57385 Goller, A. (1999). Parallel and Distributed Processing of Large Image Data Sets. Doctoral Thesis, Graz University of Technology, Graz, Austria. Goller, A., & Leberl, F. (2000). Radar Image Processing with Clusters of Computers. Paper presented at the IEEE Conference on Aerospace.
Hawick, K. A., James, H. A., Maciunas, K. J., Vaughan, F. A., Wendelborn, A. L., Buchhorn, M., et al. (1997). Geostationary-satellite Imagery Application on Distributed, High-Performance Computing. Paper presented at the High Performance Computing on the Information Superhighway: HPC Asia'97. He, X. (1998). 2D-Object Recognition With Spiral Architecture. Doctoral Thesis, University of Technology, Sydney, Sydney, Australia. Koelbel, C. H., Loveman, D. B., Schreiber, R. S., Steele, G. L., Jr., & Zosel, M. E. (1994). The High Performance Fortran Handbook. Cambridge, MA: MIT Press. Kok, A. J. F., Pabst, J. L. v., & Afsarmanseh, H. (April, 1997). The 3D Object Mediator: Handling 3D Models on Internet. Paper presented at the High-Performance Computing and Networking, Vienna, Austria. Lee, C., Lee, T.-y., Lu, T.-c., & Chen, Y.-t. (1997). A World-wide Web Based Distributed Animation Environment. Computer Networks and ISDN Systems, 29, 1635–1644. doi:10.1016/S0169-7552(97)00078-0 Marsh, A. (1997). EUROMED - Combining WWW and HPCN to Support Advanced Medical Imaging. Paper presented at the High-Performance Computing and Networking, Vienna, Austria. Miller, R. L. (1993). High Resolution Image Processing on Low-cost Microcomputer. International Journal of Remote Sensing, 14(4), 655–667. doi:10.1080/01431169308904366 Nicolescu, C., & Jonker, P. (2002). A Data and Task Parallel Image Processing Environment. Parallel Computing, 28, 945–965. doi:10.1016/S0167-8191(02)00105-9 Niederl, F., & Goller, A. (Jan, 1998). Method Execution On A Distributed Image Processing Backend. Paper presented at the 6th EUROMICRO Workshop on Parallel and Distributed Processing, Madrid, Spain. Oberhuber, M. (1998). Distributed High-Performance Image Processing on the Internet. Doctoral Thesis, Graz University of Technology, Austria. Pitas, I. (1993). Parallel Algorithm for Digital Image Processing, Computer Vision and Neural Network. Chichester, UK: John Wiley & Sons. Schowengerdt, R. A., & Mehldau, G. (1993). Engineering a Scientific Image Processing Toolbox for the Macintosh II. International Journal of Remote Sensing, 14(4), 669–683. doi:10.1080/01431169308904367 Schwartz, E. (1980). Computational Anatomy and Functional Architecture of Striate Cortex: A Spatial Mapping Approach to Perceptual Coding. Vision Research, 20, 645–669. doi:10.1016/0042-6989(80)90090-5 Sheridan, P. (1996). Spiral Architecture for Machine Vision. Doctoral Thesis, University of Technology, Sydney. Sheridan, P., Hintz, T., & Alexander, D. (2000). Pseudo-invariant Image Transformations on a Hexagonal Lattice. Image and Vision Computing, 18(11), 907–917. doi:10.1016/S0262-8856(00)00036-6 Siegel, H. J., Armstrong, J. B., & Watson, D. W. (1992). Mapping Computer-Vision-Related Tasks onto Reconfigurable Parallel-Processing Systems. IEEE Computer, 25(2), 54–63.
Siegel, L. J., Siegel, H. J., & Feather, A. E. (1982). Parallel Processing Approaches to Image Correlation. IEEE Transactions on Computers, 31(3), 208–218. doi:10.1109/TC.1982.1675976 Squyres, J. M., Lumsdaine, A., & Stevenson, R. L. (1995). A Cluster-based Parallel Image Processing Toolkit. Paper presented at the IS&T Conference on Image and Video Processing, San Jose, CA. Stevenson, R. L., Adams, G. B., Jamieson, L. H., & Delp, E. J. (1993, April). Parallel Implementation for Iterative Image Restoration Algorithms on a Parallel DSP Machine. The Journal of VLSI Signal Processing, 5, 261–272. doi:10.1007/BF01581300 Wu, D. M., & Guan, L. (1995). A Distributed Real-Time Image Processing System. Real-Time Imaging, 1(6), 427–435. doi:10.1006/rtim.1995.1044 Wu, Q., He, X., & Hintz, T. (2004, June 21-24). Virtual Spiral Architecture. Paper presented at the International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, USA.
KEY TERMS AND DEFINITIONS
Distributed Processing: Distributed processing refers to a computer system that is capable of running a program simultaneously on multiple nodes such as computers and processors. These nodes are connected to each other and managed by sophisticated software that detects idle nodes and parcels out programs to utilize them.
Image Partitioning: For distributed processing purposes, image partitioning efficiently segments an image into multiple parts. Each part is sent to a computing node and processed simultaneously.
Spiral Architecture: Spiral Architecture is a special image architecture in which the image is represented by a set of hexagonal pixels. These hexagon-shaped pixels are arranged in a spiral cluster, where each unit is a set of seven hexagons; that is, each pixel has six neighbouring pixels.
Spiral Addressing: Spiral Addressing is a special addressing scheme used to uniquely identify each pixel on Spiral Architecture. The address is in fact a base-seven number. This addressing labels all pixels in a recursively modular manner along a spiral direction.
Spiral Addition: Spiral Addition is an arithmetic operation with closure properties defined on the spiral address space. Applying spiral addition to an image labelled by spiral addresses achieves image translation on Spiral Architecture.
Spiral Multiplication: Spiral Multiplication is an arithmetic operation with closure properties defined on the spiral address space. Applying spiral multiplication to an image labelled by spiral addresses achieves image rotation on Spiral Architecture.
ENDNOTE 1
This is a base 7 number. Unless specified otherwise, spiral addresses, addends used in Spiral Addition and multipliers used in Spiral Multiplication are base 7 numbers in the following sections.
Chapter 36
Scheduling Large-Scale DNA Sequencing Applications
Sudha Gunturu, Oklahoma State University, USA
Xiaolin Li, Oklahoma State University, USA
Laurence Tianruo Yang, St. Francis Xavier University, Canada
ABSTRACT This chapter studies a load scheduling strategy with near-optimal processing time that is designed to explore the computational characteristics of DNA sequence alignment algorithms, specifically the Needleman-Wunsch Algorithm. Following divisible load scheduling theory, an efficient load scheduling strategy is designed for large-scale networks so that the overall processing time of the sequencing tasks is minimized. In this study, the load distribution depends on the length of the sequences and the number of processors in the network, and the total processing time is also affected by the communication link speed. Several cases have been considered by varying the sequences, the communication and computation speeds, and the number of processors. Through simulation and numerical analysis, this study demonstrates that, for a constant sequence length, the processing time for the job decreases as the number of processors in the network increases, and a minimum overall processing time is achieved.
INTRODUCTION Large-scale network-based computing has attracted tremendous effort from both academia and industry because it is scalable, flexible, extendable, and economical, with widespread applications across many disciplines in science and engineering. To address scalability issues for an important class of applications, researchers proposed divisible load scheduling theory (DLT). These applications are structured as large numbers of independent tasks with low granularity (Bharadwaj, Ghose, & Robertazzi, 2003). They are thus amenable to embarrassingly parallel execution, typically in master-slave fashion.
Such applications are called divisible load applications because a scheduler may divide the computation among worker processes arbitrarily, both in the number of tasks and in task sizes. Scheduling the tasks of a parallel application efficiently on the resources of a distributed computing platform is critical for achieving optimal performance (Bharadwaj, Ghose, & Mani, 1995). The load distribution problem in distributed computing networks, consisting of a number of processors interconnected through communication links, has attracted a great deal of attention (Bataineh, Hsiung, & Robertazzi, 1994). Divisible Load Theory (DLT) is a methodology for the linear and continuous modeling of partitioning computation and communication loads for parallel processing (Robertazzi, 2003). DLT is primarily used for handling large-scale processing on network-based systems. The DLT paradigm has demonstrated numerous applications such as edge detection in image processing, file compression, join operations in relational databases, graph coloring, and genetic searches (Min & Veeravalli, 2005). More examples of real divisible applications include searching for patterns in text, audio, and graphic files, database and measurement processing, data retrieval systems, some linear algebra algorithms, and simulations (Drozdowski & Lawenda, 2005). Over the past few decades, research in the field of molecular biology has advanced in step with advances in genomic technologies. This has led to an explosive growth in the biological information generated, which in turn has led to the requirement for computerized databases to store, organize, and index the data and for specialized tools to view and analyze the data. In this chapter a parallel strategy is designed to explore the computational characteristics of the Needleman-Wunsch algorithm, which is used for biological sequence comparison in the literature. In designing the strategy, the load is partitioned among the processors of the network using the DLT paradigm (Bharadwaj, Ghose, & Mani, 1995). Two commonly used algorithms for sequence alignment are the Needleman-Wunsch Algorithm and the Smith-Waterman Algorithm, where the former is employed for global alignment and the latter for local alignment. The complexity of the Needleman-Wunsch and Smith-Waterman Algorithms to align sequences of length x is O(x²) (Min & Veeravalli, 2005). The algorithm used in this study is the Needleman-Wunsch Algorithm. The way adopted in this study for parallelizing the Needleman-Wunsch Algorithm is to compute the matrix elements in diagonal fashion on a Multiple Instruction Multiple Data system. Divisible Load Theory is employed for handling the sequence alignment. The objective is to minimize the total processing time for sequence alignment. The partition of the load depends primarily on the matrix that is generated by the Needleman-Wunsch Algorithm. The network has been studied for both variable link speed and constant link speed.
RELATED WORK The merging of the two rapidly advancing technologies of molecular biology and computer science resulted in a new informatics science, namely bioinformatics (Min & Veeravalli, 2005). Over the past few years, interest and research in the area of biotechnology have increased drastically. This area of study deals primarily with methodologies for operating on molecular biological information. Present-day molecular biology is characterized by the collection of large volumes of data.
Information science, when applied to biology, produced a field called bioinformatics. The areas of bioinformatics and computational biology involve the use of techniques and concepts from applied mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and biochemistry to solve biological problems, usually at the molecular level. The terms bioinformatics and computational biology are often used interchangeably. Research in computational biology often overlaps with systems biology. Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution. The area of bioinformatics more specifically refers to the creation and advancement of algorithms, computational and statistical techniques, and the theory needed to solve formal and practical problems arising from the management and analysis of biological data. Computational biology refers to hypothesis-driven investigation of a specific biological problem using computers, carried out with experimental or simulated data, with the primary goal of discovery and the advancement of biological knowledge. In other words, bioinformatics is concerned with the information while computational biology is concerned with the hypotheses. The most common operations on biological data include sequence analysis, protein structure prediction, genome sequence alignment, phylogeny tree construction, pathway research, and sequence database placement. One of the most basic and important bioinformatics tasks is to find a set of homologies for a given sequence, because similar sequences are often related in function (Autenrieth, Isralewitz, Luthey-Schulten, Sethi, & Pogorelov, 2000; Jones & Pevzner, 2004). Bioinformatics applications such as sequence analysis, protein structure prediction, genome sequence alignment, and phylogeny tree construction are distributed across different individual projects, and they require high-performance computational environments. Biologists use a tool called BLAST for performing research (Altschul, Gish, Miller, Myers, & Lipman, 1990). This tool performs database search; in other words, it can be described as a Google for biological sequences. It provides a method for searching nucleotide and protein databases and is designed so that it can detect both local and global alignments. Sequence alignment is often used in biological analysis. Any two newly discovered biological sequences can be aligned with the algorithms present in the literature and their similarity can be determined. This alignment can be useful in understanding the function, structure, and origin of a new gene. In sequence alignment, the residues of one sequence are compared with the residues of the other while taking the positions of the residues into account. Residues in a sequence can be inserted, deleted, or substituted to achieve maximum similarity or optimal alignment. For example, GenBank is growing at an exponential rate, up to over 100 million sequences¹ (Min & Veeravalli, 2005; Benson, Karsch-Mizrachi, Lipman, Ostell, Rapp, & Wheeler, 2000).
To meet these growing needs, a wide variety of heuristic methods have been proposed for aligning sequences, such as FASTP, FASTA, BLAST, and FLASH (Yap, Frieder, & Martino, 1998). The NIH Biomedical Information Science and Technology Initiative Consortium, held on July 17, 2000, agreed on formal definitions for bioinformatics and computational biology. The consortium also recognized that no definition could completely eliminate the overlap or the variations in interpretation by different individuals and organizations. The definitions proposed are as follows:
• Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data (Huerta, Haseltine, & Liu, 2000).
• Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems (Huerta, Haseltine, & Liu, 2000).
The areas of bioinformatics and computational biology use mathematical tools to extract useful information from data produced by high-throughput biological techniques such as genome sequencing. One of the most common representative problems in bioinformatics is the assembly of high-quality genome sequences from fragmentary “shotgun” DNA sequencing. Other common problems include the study of gene regulation using data from microarrays or mass spectrometry (Cristianini & Hahn, 2006). “Sequence analysis” in biology means subjecting a DNA or peptide sequence to sequence alignment, sequence database searches, repeated sequence searches, or other bioinformatics methods on a computer (Autenrieth, Isralewitz, Luthey-Schulten, Sethi, & Pogorelov, 2000). Sequence analysis in molecular biology and bioinformatics is an automated, computer-based examination of characteristic fragments, for example a DNA strand. It basically covers five biologically relevant topics: (1) comparison of sequences in order to find similar sequences (sequence alignment); (2) identification of gene structures, reading frames, distributions of introns and exons, and regulatory elements; (3) prediction of protein structures; (4) genome mapping; and (5) comparison of homologous sequences to construct a molecular phylogeny. Similarity detection is often used in biological analysis. Comparing a new gene sequence against known sequences can give significant understanding of the function, structure, and origin of the new gene. When two gene sequences are compared, which is also known as aligning two sequences, the residues of one sequence are compared with the residues of the other, and the position of the residues is taken into consideration. The operations that can be performed are insertion, deletion, and substitution of residues in either sequence. Many algorithms have been proposed in the literature for comparing two biological sequences for similarities. The most popular algorithm for aligning DNA is the Needleman-Wunsch algorithm; for protein alignment it is the Smith-Waterman algorithm. In the sequence comparison, a combination of the DLT approach and these algorithms is used in order to align the sequences accurately. In this chapter a load scheduling strategy is designed for large-scale networks for sequence alignment, and it has been observed that increasing the number of processors in the network results in a lower computation time.
PROBLEM FORMULATION The Needleman-Wunsch algorithm is one of the algorithms that perform a global alignment on two sequences (called X and Y here). This algorithm finds its application in bioinformatics to align protein or nucleotide sequences (Likic, 2000). The algorithm was first proposed by Saul Needleman and Christian Wunsch in 1970 (Goad, 1987). The Needleman-Wunsch algorithm is an example
Figure 1. Needleman-Wunsch algorithm after the generation of S matrix
of dynamic programming, and was the first application of dynamic programming to biological sequence comparison. The algorithm can be explained in the following steps:
1. Initialize the matrix S = 0.
2. Fill in the matrix S with 1 if it is a match and 0 if it is a mismatch.
3. Compute the scores from the bottom-right corner based on the formula M[i,j] = S[i,j] + Max{M[i+1:x], M[j+1:y]}.
4. Trace back from the top-left corner, selecting the maximum value from the adjacent column and row, and so on.
For example, let us consider the two sequences GTCAGTC and GCCTC. In order to align these sequences, we first need to construct the matrices as shown in Figure 1 and Figure 2.
The Needleman-Wunsch Algorithm, as well as some of the characteristics of the S and M matrices that it generates, is explained as follows. In aligning two biological sequences denoted Seq X and Seq Y, of length x and y respectively, the algorithm generates two matrices represented by S and M, as shown in Figure 1 and Figure 2. The matrices S and M are related to each other through the equation M[p,q] = S[p,q] + Max{M[p+1:x], M[q+1:y]} for the range 1 <= p <= x, 1 <= q <= y, where S[p,q] and M[p,q] denote the element in the pth row and qth column of the matrices S and M, respectively.
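To make the relationship between S and M concrete, the following Python sketch computes both matrices for the bottom-up scoring step described above. It is a minimal illustration only: the 0/1 match score and the fill order follow the description in this section, while the function and variable names are our own and the trace-back step is omitted.

def build_matrices(seq_x, seq_y):
    """Build the match matrix S and the accumulated score matrix M
    for the bottom-up Needleman-Wunsch scoring step described above."""
    x, y = len(seq_x), len(seq_y)
    # S[i][j] = 1 on a match, 0 on a mismatch
    S = [[1 if seq_x[i] == seq_y[j] else 0 for j in range(y)] for i in range(x)]
    M = [[0] * y for _ in range(x)]
    # Fill M from the bottom-right corner towards the top-left:
    # M[i][j] = S[i][j] + max over row i+1 (columns > j) and column j+1 (rows > i)
    for i in range(x - 1, -1, -1):
        for j in range(y - 1, -1, -1):
            below_row = M[i + 1][j + 1:] if i + 1 < x else []
            right_col = [M[k][j + 1] for k in range(i + 1, x)] if j + 1 < y else []
            M[i][j] = S[i][j] + max(below_row + right_col, default=0)
    return S, M

if __name__ == "__main__":
    S, M = build_matrices("GTCAGTC", "GCCTC")   # the example sequences used above
    for row in M:
        print(row)

Running the sketch on the example sequences reproduces the kind of M matrix shown in Figure 2, from which the trace-back described in step 4 would read off the alignment.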
Figure 2. Needleman-Wunsch algorithm after the generation of M matrix
Figure 3. Single level tree network
The network under consideration is a simple single level tree network (SLTN), shown in Figure 3, in which the root node can communicate with only one child at a time. The approach to the problem can be described in a series of steps. The first step is to create a simple SLTN with a fixed number of nodes and apply divisible load theory on this network. Then the number of nodes in the system is increased and the DLT technique is applied again. The two biological sequences are given to the network and the Needleman-Wunsch algorithm produces the alignment. The final aim of this chapter is to determine the computation time involved in processing the job. From the results it can be observed that by applying the DLT technique the computation time decreases drastically. The objective is to design a strategy such that the processing time, or computation time, for the alignment of the two biological sequences is minimized. The two biological sequences are considered to be of length x and y. These sequences may vary from one character to thousands of characters; in the results section, however, the sequence lengths are varied from 100 to 1000. We assume that all the processors in the network, P1, P2, ..., Pm, already have Sequence x and Sequence y in their local memories, or that they can be initialized in this way. To carry out the process of sequence alignment in a multiprocessor environment, one way is to keep a copy of the sequences in the local memory of each processor.
LOAD SCHEDULING STRATEGIES AND ANALYSIS The S matrix consists only of 0’s and 1’s, so it does not require any special kind of distribution. The M matrix is partitioned into sub-matrices M[p,q], where p = 1, 2, ..., m and q = 1, 2, ..., z, and each portion of Seq x and Seq y is contained in one particular cell of the matrix M. This assignment is illustrated in Figure 4, and the distribution pattern is shown in Figures 4 and 5. According to the Needleman-Wunsch Algorithm the last row is calculated first, so the last row is given to the first processor, the root node of the system. In accordance with the Needleman-Wunsch Algorithm, the timing diagram is as shown in Figure 6. The generalized equations are given below. The two sequences can be divided into a number of smaller parts, as the following example shows. Let us consider two sequences Seq x and Seq y, where Seq x = GCCTC and Seq y = GCTAC.
Figure 4. Illustration of the computational dependency of the element (p,q) in the M matrix
The length of Seq x is 5 and the length of Seq y is 5; therefore, the parts of each sequence should sum to a total length of 5. From the above example we can write the generalized equations shown in equation (1).

x_1 + x_2 + x_3 + ... + x_{n-1} + x_n = x
y_1 + y_2 + y_3 + ... + y_{n-1} + y_n = y    (1)
From the timing diagram we can derive the generalized equation for the load on each processor, given in equation (2):

x_n = (x_{n-1} y_{n-1} E_{n-1} - 2 C_{n-1} x_{n-1}) / (y_n E_n),   n = 2, ..., m    (2)
The load that is given to the first processor, or root node, is given by equation (3):

x_1 = x / [1 + Σ_{n=2}^{m} (y_1 E_1 - 2 C_1) / (y_n E_n)]    (3)
The total completion time for the alignment of the two sequences can be given by
Figure 5. Illustration of Load Distribution
T(m) = x_1 y_1 E_1 + Σ_{i=2}^{m} x_i y_i E_i + 2 C (m - 1)    (4)
To enhance the understanding of the performance of the Needleman-Wunsch Algorithm and the divisible load strategy, a single machine has been used as the baseline (Min & Veeravalli, 2005). Therefore the speedup can be given by

Speedup = T(1) / T(m)    (5)

where T(m) is the processing time of our strategy on a system using m processors and T(1) is the processing time using a single processor, which is given by

T(1) = x y E_1    (6)
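As a rough numerical companion to equations (5) and (6), the Python sketch below splits the x*y units of computation across m processors in proportion to their computation rates so that all finish together, adds a simple per-link communication overhead, and reports the resulting speedup. The rate and delay values are hypothetical, and the overhead model is a simplifying assumption rather than the exact closed form derived above.

def dlt_speedup(x, y, E, C):
    """Illustrative divisible-load estimate: split the x*y work so that every
    processor finishes its computation at the same time, then compare against
    a single processor.  E[i] is the time per unit of work on processor i,
    C is a fixed communication delay per child link (a simplifying assumption)."""
    m = len(E)
    work = x * y
    inv = [1.0 / e for e in E]                   # faster processors get larger shares
    shares = [work * v / sum(inv) for v in inv]
    compute_time = shares[0] * E[0]              # identical for every processor by construction
    comm_time = 2 * C * (m - 1)                  # one send and one return per child link
    t_m = compute_time + comm_time
    t_1 = work * E[0]                            # single-processor time, as in equation (6)
    return t_1 / t_m                             # speedup, as in equation (5)

if __name__ == "__main__":
    # Hypothetical values: a 1000 x 1000 alignment on ten equally fast processors
    print(dlt_speedup(1000, 1000, [1e-6] * 10, C=5e-9))

With negligible communication delay the sketch approaches a speedup of m, which is the trend the following results section reports as more processors are added.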
RESULTS AND DISCUSSION This section presents the evaluation of the load scheduling technique for sequence alignment. The results have been tabulated for the single level tree network with constant link speed and with variable link speed. In the experiments the sequence lengths have been varied from 100 to 1000 characters.
Figure 6. Timing diagram
Figure 7. Variable link speed
Figure 8. Variable link speed 3-D graph
Table 1. Processing time variations for variable link speed

Number of processors    Processing Time (Sec)
3                       166915.49
5                       62671.53
7                       33498.78
10                      20163.56
20                      11250.11
30                      6804.34
40                      4493.82
50                      3186.41
60                      2387.78
70                      1868.71
80                      1514.16
90                      1262.08
100                     1076.88
Figure 9. Graph for Variable Link Speed
Variable Link Speed This section briefly discusses how the processing time changes when the link speed is varied. The graphs have been plotted for two ranges of link speed variation: experiments have been conducted with the link speed varied from 1-10 nanoseconds and from 1-100 nanoseconds. From the graphs it can be observed that the processing time depends on the communication link speed C; in other words, the higher the link speed of the network, the faster the job is processed. The link speed has been varied using the random number generator in Java. The results and the tabulated values are shown in Figure 7, Figure 8, and Table 1. Figure 7 plots the number of processors versus the processing time, with the X-axis as the number of processors and the Y-axis as the processing time, for a constant sequence length of 1000. From Figure 7 it can be observed that, for a constant sequence length, the computation time decreases as the number of processors increases. This also reemphasizes the principle of DLT that as more processors are added into the network the processing time decreases. Figure 8 gives a 3-D representation of how the processing time varies with respect to the length of the sequence and the number of processors. From the 3-D graph of the single level tree network it can be observed that, keeping the length of the sequence constant, the processing time decreases as the number of processors increases.
Figure 10. Graph for constant link speed
On the other hand, it can also be observed that when the number of processors is kept constant and the length of the sequences increases, the computation time increases. As discussed in the Problem Formulation section, the speedup has been calculated and the values are tabulated for a constant sequence length of 1000.
Figure 11. 3-D graph for constant link speed
Table 2. Processing time variations for constant link speed

Number of processors    Processing Time (Sec)
3                       165000
5                       62500
7                       33400
10                      20100
20                      11200
30                      6790
40                      4480
50                      3170
60                      2380
70                      1864
80                      1511
90                      1250
100                     1074
Figure 9 represents the graph of length of sequence versus processing time for a constant number of processors (m = 100), in which the X-axis represents the length of the sequence and the Y-axis represents the processing time. According to divisible load theory, for a constant number of processors the processing time increases as the length of the sequence increases, because each processor in the system carries more load. From the graph, however, the processing time does not increase uniformly. This can be attributed to the communication link, as the processing time is dependent on the communication link speed. From the results it can be concluded that the greater the communication link speed, the smaller the processing time for the sequences.
Constant Link Speed This section examines the processing time when the link speed is held constant. In the graphs shown in Figure 10 and Figure 11, the link speed (C) has been taken as 5 nanoseconds. The results are discussed below. Figure 10 represents the graph of the number of processors versus the processing time, with the X-axis as the number of processors and the Y-axis as the processing time, plotted for a constant sequence length of 1000. From the graph it can be observed that, for a constant sequence length, the computation time decreases as the number of processors increases. This adds strength to the principle of DLT that as more processors are added into the network the processing time decreases. Figure 11 gives the 3-D representation of how the processing time varies with respect to the length of the sequence and the number of processors. From the 3-D graph of the single level tree network it can be observed that, keeping the length of the sequence constant, the processing time decreases as the number of processors increases. On the other hand, it can also be observed that when the number of processors is kept constant and the length of the sequences increases, the computation time increases. As discussed earlier, the speedup has been calculated and the values are tabulated for a constant sequence length of 1000.
Figure 12. Number of processors vs. processing time for a constant length of sequence
Figure 12 represents the graph of the number of processors versus the processing time for a constant length of sequence, in which the X-axis represents the number of processors and the Y-axis represents the processing time. According to divisible load theory, for a constant length of sequence the processing time should decrease as the number of processors increases, since as more processors are added to the system the load is distributed among all of them. From the graph, however, it can be observed that the processing time does not decrease uniformly. This can be attributed to the communication link speed, as the processing time is dependent on it. The greater the communication link speed, the smaller the processing time.
CONCLUSION This chapter presented a method for the alignment of two biological sequences following the divisible load scheduling (DLT) paradigm. A parallel solution on a single level tree network has been proposed, and the communication delays are assumed to be non-zero. We adopted the Needleman-Wunsch algorithm for aligning two biological sequences. Following Divisible Load Theory (DLT), we can determine the number of residues that should be assigned to each processor in the network. The approach presented in this chapter is as follows. First, we construct the matrix S, a matrix of order x*y, where x is the length of the first sequence and y is the length of the second sequence.
Then we derived the M matrix, which gives the final values from which the sequences can be aligned. We derived the equations that determine the size of the sub-matrices according to the processor speeds; here it is assumed that all processors have equal speeds while the communication speeds are varied. With these constraints the equations have been derived and the graphs have been plotted. We evaluated the performance by varying the communication link speed from 10-100 nanoseconds and also for a constant link speed of 5 nanoseconds. First, we considered the performance of our strategy when the communication link speed was maintained at a constant value of 5 nanoseconds. The results clearly demonstrated that, for a constant length of sequence, the processing time decreased steadily as the number of processors increased. Then the communication link speed was varied from 10-100 nanoseconds and the performance was observed. From the graph it can be observed that, for a variable link speed, the computation time again decreases for a constant length of sequence. In certain cases the behavior of the graph was not uniform; this shows that the communication link plays a major role in the processing time of the sequence alignment. Extensions to this work can include deriving solutions that further decrease the computation time. This can be achieved by applying a multi-installment strategy and performing the analysis using the Needleman-Wunsch Algorithm. The same problem of aligning biological sequences can also be applied to various types of networks. The alignment of biological sequences can further be solved using the Sellers algorithm (Huerta, Haseltine, & Liu, 2000) and the load distribution strategy. Further work can also be carried out on aligning multiple sequences with various types of clustering strategies. The same strategy of aligning sequences can be extended to aligning multiple sequences using algorithms like the Berger-Munson algorithm (Berger & Munson, 1991).
REFERENCES
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410.
Autenrieth, F., Isralewitz, B., Luthey-Schulten, Z., Sethi, A., & Pogorelov, T. Bioinformatics and Sequence Alignment.
Bataineh, S., Hsiung, T.-Y., & Robertazzi, T. (1994). Closed form solutions for bus and tree networks of processors load sharing a divisible job. Institute of Electrical and Electronic Engineers, 43(10), 1184–119.
Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A., & Wheeler, D. L. (2000, October). GenBank. Nucleic Acids Research, 28(1), 15–18. doi:10.1093/nar/28.1.15
Berger, M. P., & Munson, P. J. (1991). A novel randomized iteration strategy for aligning multiple protein sequences. Computer Applications in the Biosciences, 7, 479–484.
Bharadwaj, V., Ghose, D., & Mani, V. (1995, April). Multi-installment load distribution in tree network with delays. Institute of Electrical and Electronic Engineers, 31(2), 555–567.
Bharadwaj, V., Ghose, D., & Robertazzi, T. G. (2003, January). Divisible load theory: A new paradigm for load scheduling in distributed systems. Cluster Computing, 6(1), 7–17. doi:10.1023/A:1020958815308
Cristianini, N., & Hahn, M. (2006). Introduction to Computational Genomics. Cambridge, UK: Cambridge University Press.
Drozdowski, M., Lawenda, M., & Guinand, F. (2006). Scheduling multiple divisible loads. International Journal of High Performance Computing Applications, 20(1), 19–30. doi:10.1177/1094342006061879
Drozdowski, M., & Lawenda, M. (2005). On Optimum Multi-installment Divisible Load Processing in Heterogeneous Distributed Systems (LNCS 3648, pp. 231–240). Berlin: Springer.
Fourment, M., & Gillings, M. R. (2008, February). A comparison of common programming languages used in bioinformatics. Bioinformatics (Oxford, England), 9.
Goad, W. B. (1987). Sequence analysis. Los Alamos Science, (Special Issue), 288–291.
Huerta, M., Haseltine, F., & Liu, Y. (2004, July). NIH working definition of bioinformatics and computational biology.
Jones, N. C., & Pevzner, P. A. (2004, August). An Introduction to Bioinformatics Algorithms.
Likic, V. (2000). The Needleman-Wunsch algorithm for sequence alignment. The University of Melbourne, Australia.
Min, W. H., & Veeravalli, B. (2005, December). Aligning biological sequences on distributed bus networks: A divisible load scheduling approach. Institute of Electrical and Electronic Engineering, 9(4), 489–501.
Robertazzi, T. (2003). Ten reasons to use divisible load theory. Institute of Electrical and Electronic Engineering, 36(5), 63–68.
Trelles, Andrade, Valencia, Zapata, & Carazo. (1998, June). Computational space reduction and parallelization of a new clustering approach for large groups of sequences. Bioinformatics (Oxford, England), 14(5), 439–451. doi:10.1093/bioinformatics/14.5.439
Yap, T., Frieder, O., & Martino, R. (1998, March). Parallel computation in biological sequence analysis. Institute of Electrical and Electronic Engineers, 9(3), 283–294.
KEY TERMS AND DEFINITIONS
Bioinformatics: Bioinformatics is the application of information technology to the field of molecular biology. Bioinformatics entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.
Cluster Computing: A computer cluster is a group of linked computers, working together closely so that in many respects they form a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks. Clusters are usually deployed to improve performance and/or availability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.
Computational Biology: Computational biology refers to hypothesis-driven investigation of a biological problem using computers, carried out with experimental or simulated data, with the primary goal of discovery and the advancement of biological knowledge.
Computer Networks: A computer network is a group of interconnected computers. Networks may be classified according to a wide variety of characteristics.
Divisible Load Theory: Divisible load theory is a methodology involving the linear and continuous modeling of partitionable computation and communication loads for parallel processing.
Parallel Computing: Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently.
Sequence Alignment: In bioinformatics, a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
ENDNOTE 1
http://www.ncbi.nlm.nih.gov/Genbank/
Chapter 37
Multi-Core Supported Deep Packet Inspection
Yang Xiang, Central Queensland University, Australia
Daxin Tian, Tianjin University, China
ABSTRACT Network security applications such as intrusion detection systems (IDSs), firewalls, anti-virus/spyware systems, anti-spam systems, and security visualisation applications are all computing-intensive applications. These applications all rely heavily on deep packet inspection, which examines the content of each network packet’s payload. Today these security applications cannot cope with the speed of the broadband Internet that has already been deployed; that is, processor power lags far behind bandwidth growth. Recently the development of multi-core processors has brought more processing power. Multi-core processors represent a major evolution in computing hardware technology. While two years ago most network processors and personal computer microprocessors had a single-core configuration, the majority of current microprocessors contain dual or quad cores, and the number of cores on a die is expected to grow exponentially over time. The purpose of this chapter is to discuss research on using multi-core technologies to parallelize deep packet inspection algorithms, and how such an approach improves the performance of deep packet inspection applications. This will eventually provide a security system with the capability of real-time packet inspection and thus significantly improve the overall state of security on the current Internet infrastructure.
1. INTRODUCTION The current Internet is facing many serious attacks such as financial fraud, viruses and worms, distributed denial of service attacks, spyware, and spam. Although many network security applications such as intrusion detection systems (IDS), anti-virus/spam systems, and firewalls have been proposed to control
the attacks, securing distributed systems and networks is still extremely challenging. There are unknown threats and zero-day attacks (exploits released before the vendor patch is released to the public) appearing every day, which place an impractical burden on network security systems. The key question here is: can we have real-time solutions that identify and eliminate attacks without excessive security and management overhead overburdening the networks and computer systems? To deal with the rapidly evolving threats of today and the more intelligent and automated threats of the future, we urgently need new methods that support network security applications, at all times and in real time, without causing a performance penalty to normal network and system operations. A multi-core processor combines two or more independent cores into a single package composed of a single integrated circuit (called a die), or more dies packaged together (Intel, 2007). Multi-core processors represent a major evolution in computing hardware technology. While two years ago most network processors and personal computer microprocessors had a single-core configuration, the majority of current microprocessors contain dual or quad cores and the number of cores on a die is expected to grow exponentially over time (Johnson & Welser, 2005). As the price of multi-core processors keeps falling, multi-core will eventually provide affordable processing power to support the real-time requirement of network security applications. Multi-core provides a network security application with more processing power from the hardware perspective. However, there are still significant software design challenges that must be overcome. Today the difficulty is not in building multi-core hardware, but in programming it in a way that lets applications benefit from the continued growth in CPU performance (Sutter & Larus, 2005). On the server or router side, if the network security software is not fast enough, it cannot process every incoming packet and will slow down the traffic. On the client side, it can also be very difficult to run network security applications without any interruption to normal applications, because those computing-intensive applications significantly slow down other simultaneously running applications. Taking advantage of the full power of a multi-core processor requires an in-depth approach to realize the speedups by parallelizing the traditional deep packet inspection applications. In this chapter we discuss the research direction of using multi-core processors to support real-time deep packet inspection applications. Section 2 introduces the related work in parallel approaches to enhance the performance of deep packet inspection applications. Section 3 presents our new system architecture of using multi-core to support deep packet inspection applications. Section 4 presents the basic packet-level parallelization and flow-level parallelization. Section 5 presents a new parallel string matching algorithm. Benefits of using multi-core are discussed in Section 6. Section 7 concludes this chapter.
2. RELATED WORK 2.1 Development of Multi-Core Processors In 1965, Gordon Moore observed an exponential growth in the number of transistors per integrated circuit and predicted that this trend would continue - a prediction today known as Moore’s Law (Moore, 1965). In reality, the doubling of transistors every couple of years has been maintained for almost 40 years. However, scaling up processor frequency has been more difficult because of several constraints revealed recently. First, memory speeds are not increasing as quickly as processors’ logic speeds, so the processor now takes more clock cycles to access memory than before.
The wasted clock cycles can nullify the benefits of frequency increases in the processor. Second, manufacturing experience has shown that smaller and denser transistors on chips need to be threaded together with ever-increasing lengths of wire interconnects. As these interconnects stretch from hundreds to thousands of meters in length on a single processor, path delays can offset the speed increases of the transistors. Finally, power density has become an unsustainable problem. The number of transistors per chip has increased significantly in recent years, and the power consumption and generated heat have become a serious problem for processors. Instead of developing chips that run faster, processor designers are adding more cores and more cache to provide comparable or better performance at lower power. Additional transistors are being leveraged to create more diverse capabilities, such as virtualization technology or security features, as opposed to driving to higher clock speeds. Multi-core processors are clocked at slower speeds and supplied with lower voltage to yield greater performance per watt. The development of multi-core processors has a significant impact on software applications. To take advantage of multi-core, software requires migration to a multi-threaded software model and necessitates incremental validation and performance tuning. Although kernel or system threads managed by the operating system can enhance application performance, it is essential to have multiple user threads maintained by programmers to improve the performance of traditional applications.
2.2 Parallel Network Security Applications on Multi-Core Processors As Internet traffic volumes and rates continue to race forward, it has become difficult for network security applications to process network packets in real time. Many network security applications nowadays can process network packets at the Mbps level. However, most network backbones and many local network interfaces operate at the Gbps level. To improve the performance of network security applications, most previous research focuses on parallelism with hardware approaches such as ASICs and FPGAs (Dharmapurikar, Krishnamurthy, Sproull, & Lockwood, 2004; Hayes & Luo, 2007; Liu, Zheng, Liu, Zhang, & Liu, 2006; Piyachon & Luo, 2006; Villa, Scarpazza, & Petrini, 2008). These require highly deliberate and customized programming, which is directly at odds with the pressing need to perform diverse, increasingly sophisticated forms of analysis. In (Paxson et al., 2006) the authors argued that it is time to fundamentally rethink the nature of using hardware to support network security applications. Previously, efforts in multi-core software design have been primarily on simultaneous multithreading (SMT) (Eggers et al., 1997; Tullsen, Lo, Eggers, & Levy, 1999) at a low level, which permits multiple independent threads of execution to better utilize the resources provided by microprocessor architectures. Most current research is still focused on automatically mapping general-purpose applications onto multi-core systems with instruction, data, or thread level parallelization techniques (Sohi, Breach, & Vijaykumar, 1995; Taylor et al., 2004; Yan & Zhang, 2007), or on relying on virtualization technologies such as VMware (WMware, 2008). Most of these are essentially extensions of utilizing shared-memory multiprocessors and can only execute coarse-grained threads. Network security applications have their own unique behavioral characteristics such as frequent memory or disk access, complex data structures, and high bandwidth and high speed requirements. There is a distinct mismatch between current multi-core hardware development and the high performance demand from network security applications. There has been very little preliminary research done in this area (Paxson, Sommer, & Weaver, 2007; Qi et al., 2007). In short, there is an imperative need for specifically re-designing network security applications from a software perspective based on multi-core hardware architecture.
2.3 Parallel Deep Packet Inspection Deep packet inspection refers to the process of checking both the packet payload and the header in a network device. The applications of deep packet inspection include, for example, network security applications that filter out packets containing certain malicious Internet worms or computer viruses; content-based billing systems that analyze media files and bill the receiver based on the material transferred over the network; and content forwarding applications that look at the hypertext transport protocol headers and distribute the requests among the servers for load balancing (Dharmapurikar et al., 2004). In contrast, shallow packet inspection refers to the process of checking only the packet header in a network device. Deep packet inspection requires much more processing power than shallow packet inspection. Most deep packet inspection applications have a common requirement for string matching. For example, the presence of certain byte sequences in packet payloads can identify the presence of a virus, such as the well-known Internet worms Nimda, Code Red, and Slammer. One requirement of deep packet inspection applications is that they must be able to detect strings of different lengths starting at arbitrary locations in the packet payload, because the location of such strings in the packet payload and their length are normally unknown. The other requirement is that they must be able to process network packets at line speed; otherwise they will cause delays in the network traffic, or the deep packet inspection will be incomplete. Dharmapurikar (Dharmapurikar et al., 2004) described a technique based on Bloom filters for detecting predefined signatures (strings of bytes) in the packet payload. A Bloom filter is a data structure for representing a set of strings in order to support membership queries. The technique uses parallel hardware Bloom filters to isolate all packets that potentially contain predefined signatures. Another independent process eliminates the false positives produced by the Bloom filters. The authors implemented a prototype system in a Xilinx XCV2000E FPGA, using the Field-Programmable Port Extender (FPX) platform. The finite state machine approach is another popular method in deep packet inspection. Tripp (Tripp, 2006) described a finite state machine approach to string matching for an intrusion detection system. By splitting the search strings into multiple interleaved substrings and by combining the outputs from the individual finite state machines in an appropriate way, it can perform string matching in parallel across multiple finite state machines. A VHDL model of a string matching engine based on the above ideas has been developed on a Xilinx XC2V250-6 FPGA and tested via simulation. This implementation is capable of matching up to 27 search strings in parallel, depending on the length of the strings.
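Since the Bloom filter is central to the technique just described, the following Python sketch shows a minimal software Bloom filter supporting insertion and membership queries. The bit-array size, the number of hashes, and the salted SHA-1 construction are illustrative assumptions; the cited work implements the filters in FPGA hardware rather than software.

import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted SHA-1 hashes over an m-bit array.
    Membership queries may return false positives but never false negatives."""
    def __init__(self, m_bits=8192, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        for salt in range(self.k):
            digest = hashlib.sha1(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

if __name__ == "__main__":
    signatures = BloomFilter()
    signatures.add(b"malicious-byte-sequence")     # hypothetical worm signature
    print(b"malicious-byte-sequence" in signatures)  # True (or a rare false positive)
    print(b"benign data" in signatures)              # almost certainly False

Packets whose payloads hit the filter would then be passed to an exact-matching stage to eliminate the false positives, as the technique described above does.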
3. SYSTEM DESIGN The idea of using multi-core processors to enhance the performance of network security applications is promising. However, the research in this area is just emerging and thus requires intensive exploration. It faces many challenges such as:
• How can we actually use multi-core to continue running the network security applications while keeping the overall system performance?
• How can we efficiently partition and distribute the workload of network security applications between the different cores in the multi-core processor?
• How can we split network data and solve the data dependency problem?
Figure 1. System architecture of using multi-core processors to support deep packet inspection
• As multi-core uses shared off-chip memory, how can we utilize the memory smartly so that it incurs fewer memory access latencies?
• How can we synchronize and coordinate the different threads of an application when it is parallelized on multi-core?
To best solve these challenges, we propose the new system architecture shown in Figure 1. The essential ability of this architecture is that it can process network packets in parallel and thus meet the real-time requirement. As shown in the figure, the multi-processing scheduler coordinates and distributes the workload to the different cores. The information from packets, events, flows, and messages is processed in the multi-core processor in parallel. The processor has spare cores to run other applications. As illustrated in Figure 1, the proposed architecture must be able to process network packets at line speed. In other words, it must keep its performance in normal applications, such as forwarding packets in a router. To fully utilize the potential of multi-core, this system architecture will use different levels of parallelization such as instruction-level parallelization, memory parallelization, loop-level parallelization, and fine-grained thread-level parallelization. High performance can be achieved through interaction between algorithms, strategies, and architectural design, from high-level decisions on data allocation and task partitioning to low-level micro-architectural decisions on instruction selection and scheduling. For each network security application, we also need to identify what the potential bottlenecks are and how to possibly avoid them. In such a parallel computing environment, the potential bottlenecks could be packet processing, data normalization, data correlation, pattern generation, and pattern matching.
4. PACKET-LEVEL AND FLOW-LEVEL PARALLELIZATIONS As an instance of the aforementioned multi-core based deep packet inspection architecture, we test two levels of parallelization on multi-core: packet-level and flow-level parallelization.
Figure 2. Experiment environment
As current network traffic speeds and volumes are increasing at an exponential rate, the processing speed required by a deep packet inspection application is high. In this experiment, we evaluate the performance of multi-core parallel deep packet inspection methods based on Snort (Roesch, 1999), an open source intrusion detection system. The test environment is shown in Figure 2. To simulate large volumes of network traffic, a network traffic generator is used to test the capability of the intrusion detection system Snort to handle continuous high traffic loads. We use TG2.0 (McKenney, Lee, & Denny, 2008) to generate high volumes of network traffic. A dual-core computer with a 2.26GHz Intel Pentium processor and 512MB RAM is used in the experiment. The network adaptors used are 10/100M PCI Ethernet adaptors. We first use packet-level parallelization to evaluate the multi-core supported intrusion detection system’s performance on deep packet inspection. The TG client is set to open a UDP socket to send packets to the TG server waiting at 192.168.10.1 and port 4322, with the packet data length being 576 and the packet number being 2000. On the multi-core system, the odd-numbered packets are processed by one core and the even-numbered packets are processed by another core. The detailed procedure is specified in the basic algorithm shown in Figure 3 and sketched below. The testing results under different inter-packet transmission times (0.02 seconds equals 1/0.02 = 50 packets/sec) are shown in Figure 4, and the dropping rate (the percentage of dropped packets among all packets) is shown in Figure 5.
Figure 3. Algorithm of packet-level parallelization
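Since the content of Figure 3 is not reproduced in the text, the short Python sketch below only illustrates the odd/even packet dispatch described above: a reader hands alternate packets to two worker processes, each standing in for one core running its own inspection engine. The function names, the queue-based hand-off, and the dictionary packet format are illustrative assumptions.

import multiprocessing as mp

def inspect_worker(queue, worker_id):
    """Hypothetical per-core worker: pull packets and inspect their payloads."""
    while True:
        packet = queue.get()
        if packet is None:          # sentinel: no more packets
            break
        # placeholder for the real deep packet inspection (e.g. signature matching)
        print(f"worker {worker_id} inspected packet {packet['seq']}")

def dispatch_by_parity(packets):
    """Odd-numbered packets go to one core, even-numbered packets to the other."""
    queues = [mp.Queue(), mp.Queue()]
    workers = [mp.Process(target=inspect_worker, args=(queues[i], i)) for i in range(2)]
    for w in workers:
        w.start()
    for packet in packets:
        queues[packet["seq"] % 2].put(packet)
    for q in queues:
        q.put(None)
    for w in workers:
        w.join()

if __name__ == "__main__":
    dispatch_by_parity([{"seq": n, "payload": b"..."} for n in range(10)])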
Figure 4. The number of packet analyzed on packet-level parallelization
The system needs to capture the incoming packets from the network adapter and analyze these packets for possible attacks; depending on the packet capturing and analyzing speed (processing speed) and the speed of the incoming packets (network speed), the system may be able to process all incoming packets or may have to drop some packets. If the processing speed is slower than the network speed (this may happen when the system is under heavy load), the system may drop some packets and thus lose some useful information, which will increase the false positive and false negative rates. The figures show the trend that if the system is parallelized at packet level by using more than one core, the dropping rate can be slightly decreased. The number of analyzed packets can thus also be slightly increased. This shows that multi-core can increase the processing speed of the whole system. For packet-level parallelization to be practical, no resource should actually share common information across separate packets. Although deep packet inspection applications like Snort receive input packet-by-packet, they must aggregate distinct packets into flows such as TCP streams to prevent an attacker from disguising malicious communications by breaking the data up across several packets.
Figure 5. Dropping rate on packet-level parallelization
Figure 6. Algorithm of flow-level parallelization
Since packets from one flow will not affect the states of another flow, different flows can be processed by independent processing threads with no constraints on ordering. To test the flow-level parallelization performance on multi-core, we conduct another experiment based on flow-level parallelization. One TG client is set to open a UDP socket to send packets to the TG server waiting at 192.168.10.1 and port 4322, with the packet data length being 576 and the packet number being 1000; another TG client is set to open a UDP socket to send packets to the TG server waiting at 10.10.10.1 and port 5322, with the packet data length being 576 and the packet number being 1000. The two Snorts running on the multi-core processor are set so that one analyzes the packets belonging to the 192.168.10.0 class C network and the other analyzes the 10.10.10.0 class C network. The detailed procedure is specified in Algorithm 2 in Figure 6, and a simplified flow-routing sketch is given below. The testing results are shown in Figure 7 and Figure 8. From the experiments we find that flow-level parallelization based on multi-core analyzes packets faster than a single-core system. However, the dual-core system does not achieve a twofold speedup. The experiment shows that the parallel strategies can improve deep packet inspection performance. From the comparison between packet-level parallelization and flow-level parallelization, we also find that flow-based parallelization has similar performance speedups to packet-level parallelization. However, it is a practical solution for meaningfully examining each packet in real deep packet inspection applications.
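Algorithm 2 itself appears only in Figure 6; as a hedged illustration of the flow-level idea, the Python sketch below decides which core should inspect a packet based on its destination network, so that all packets of one flow always reach the same inspection engine. The two pinned subnets mirror the experiment described above, while the table and the hash fallback are illustrative assumptions.

import ipaddress

# Each class C network is pinned to one worker/core, mirroring the experiment above.
FLOW_TABLE = {
    ipaddress.ip_network("192.168.10.0/24"): 0,
    ipaddress.ip_network("10.10.10.0/24"): 1,
}

def worker_for_packet(dst_ip, num_workers=2):
    """Pick the inspecting core for a packet: pinned subnets first,
    otherwise a stable hash of the destination address (an illustrative fallback)."""
    addr = ipaddress.ip_address(dst_ip)
    for net, worker in FLOW_TABLE.items():
        if addr in net:
            return worker
    return int(addr) % num_workers

if __name__ == "__main__":
    print(worker_for_packet("192.168.10.1"))   # -> 0
    print(worker_for_packet("10.10.10.1"))     # -> 1
    print(worker_for_packet("172.16.0.5"))     # hashed fallback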
5. PARALLEL STRING-MATCHING In the above experiments we find that the parallel algorithms at packet level and flow level running on multi-core can minimize the number of dropped packets and thus potentially increase the detection rate. In this chapter we do not discuss the detection rate measurement, because it depends not just on the parallel algorithms employed but also on the intelligent algorithms used, such as neural networks, finite state machines, and so on. The improvement in detection rate brought by multi-core can be found in our other papers (Chonka, Zhou, Knapp, & Xiang, 2008; Tian & Xiang, 2008).
Figure 7. The number of packet analyzed on flow-level parallelization
not negatively impact the application's ability to perform packet inspection. Deep packet inspection applications must be able to check the incoming payload against a rule set which contains the threat signatures. However, given the growing number of rules, it is becoming difficult for these applications to perform inspection in real time without fast string matching mechanisms. In this section, we evaluate another parallel mechanism that focuses on fast multiple string matching. String matching algorithms are widely used in different applications: in data filtering and data mining to find selected patterns, for example from a stream of newsfeeds; in security applications to detect suspicious keywords, for example in Snort; and in DNA searching, by translating an approximate search into a search for a large number of exact patterns. Among them, the Aho-Corasick algorithm (Aho & Corasick, 1975) is a linear-time algorithm for this problem, based on an automaton approach. The drawback of this automaton approach is the large amount of memory space it requires. The Aho-Corasick algorithm and its extensions deal well with regular
Figure 8. Dropping rate on flow-level parallelization
Figure 9. Algorithm of parallel string matching
expression matching, but when the keyword set grows large they become impractical because of the large number of states. The Aho-Corasick algorithm constructs a trie with a suffix-tree-like set of links from each node representing a string (e.g., abc) to the node corresponding to its longest proper suffix (e.g., bc if it exists, else c if that exists, else the root). It also contains links from each node to the longest suffix node that corresponds to a dictionary entry; thus all of the matches can be enumerated by following the resulting linked list. At runtime, the algorithm walks the trie along the input, keeping the longest match and using the suffix links to ensure that the computation is linear. For every node that is in the dictionary and every link along the dictionary suffix linked list, an output is generated. Parallel string matching offers a scalable method for inspecting packets in a high-speed network environment. However, these parallel methods typically distribute the arriving packets evenly across an array of processors, each holding a copy of the complete policy. Given the rising number of rules (Snort, for example, contains thousands of rules), many more signature strings must be checked against the payload of each packet. A multiple string matching algorithm such as Aho-Corasick then not only requires a significant amount of memory for the state machine and preprocessing, but also suffers increased content matching time, and thus cannot keep up with high-speed networks. In this experiment we first partition the pattern string set into small sets, then build an Aho-Corasick automaton on each small set, and finally run the matching algorithm in parallel. The detailed procedure is specified in algorithm 3 in Figure 9 and Figure 10. We evaluate the performance of the parallel string matching algorithm and compare it with the original Aho-Corasick algorithm. In this test, the size of the pattern set is 1000, the length of each pattern ranges from 3 to 33 characters, and the size of the test data is 3953KB. The original and parallel Aho-Corasick algorithms run on Linux. We test the performance under different set numbers: if the set number is k, we partition the pattern set into k groups, each containing 1000/k strings. The results are shown in Table 1, Table 2, Table 3, and Table 4. In each table, the state number reflects the amount of memory consumed. As multi-core supports thread-level parallelization, using multiple threads can significantly reduce the time consumed by string matching. Parallel mechanisms can also reduce the time for automaton building. The total state numbers of the parallel algorithms are shown in Figure 11. The speed of a parallel run is determined by the pattern set whose search finishes last. The total time comparison is shown
Figure 10. The progress of parallel Aho-Corasick algorithm
Table 1. One thread algorithm

            Automaton Building   String Matching   State Number   Total Time
Thread 1    404.86ms             12756.67ms        3519           13161.53ms

Table 2. Two parallel threads algorithm

            Automaton Building   String Matching   State Number   Total Time
Thread 1    79.49ms              8013.18ms         3060           8092.67ms
Thread 2    102.46ms             7405.39ms         3509           7507.85ms

Table 3. Four parallel threads algorithm

            Automaton Building   String Matching   State Number   Total Time
Thread 1    41.24ms              2843.80ms         2216           2885.04ms
Thread 2    72.14ms              2811.93ms         2978           2884.07ms
Thread 3    59.34ms              2942.18ms         2224           3001.52ms
Thread 4    57.83ms              2568.61ms         3211           2626.44ms
Table 4. Eight parallel threads algorithm

            Automaton Building   String Matching   State Number   Total Time
Thread 1    18.13ms              1538.62ms         1073           1556.75ms
Thread 2    16.04ms              1415.35ms         1347           1431.39ms
Thread 3    18.56ms              1294.76ms         1448           1313.32ms
Thread 4    39.85ms              1624.30ms         1749           1664.15ms
Thread 5    18.26ms              1454.73ms         1175           1472.99ms
Thread 6    33.98ms              1611.35ms         1390           1645.33ms
Thread 7    18.37ms              1439.90ms         1402           1458.27ms
Thread 8    21.98ms              1276.07ms         2004           1298.05ms
Figure 11. The total state numbers
Figure 12. The total time
in Figure 12. From the results we find that this multi-core supported parallel mechanism can speed up Aho-Corasick string matching algorithm significantly.
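The partition-and-parallelize strategy evaluated above can be sketched as follows. The ac_build(), ac_search(), and ac_free() helpers stand for any Aho-Corasick implementation and are assumptions for illustration, not a specific library API.

```c
/* Sketch of the pattern-set partitioning strategy: split the pattern set
 * into k groups, build a smaller automaton per group, and scan the payload
 * with all automatons in parallel.  The ac_* helpers are assumed. */
#include <pthread.h>
#include <stddef.h>

typedef struct ac_automaton ac_automaton;            /* opaque AC automaton */
ac_automaton *ac_build(const char **patterns, size_t n);
size_t ac_search(const ac_automaton *ac, const char *text, size_t len);
void ac_free(ac_automaton *ac);

struct job {
    const char **patterns; size_t npatterns;         /* this thread's subset */
    const char *text; size_t textlen;
    size_t matches;
};

static void *worker(void *arg) {
    struct job *j = arg;
    ac_automaton *ac = ac_build(j->patterns, j->npatterns);  /* smaller automaton */
    j->matches = ac_search(ac, j->text, j->textlen);         /* scan whole payload */
    ac_free(ac);
    return NULL;
}

/* Split 'patterns' into k groups of roughly n/k strings and search in parallel. */
size_t parallel_match(const char **patterns, size_t n, size_t k,
                      const char *text, size_t textlen) {
    pthread_t tid[64];
    struct job jobs[64];
    if (k > 64) k = 64;                               /* sketch-level bound */
    size_t per = n / k, total = 0;
    for (size_t i = 0; i < k; i++) {
        jobs[i] = (struct job){ patterns + i * per,
                                (i == k - 1) ? n - i * per : per,
                                text, textlen, 0 };
        pthread_create(&tid[i], NULL, worker, &jobs[i]);
    }
    for (size_t i = 0; i < k; i++) {
        pthread_join(tid[i], NULL);
        total += jobs[i].matches;
    }
    return total;
}
```

Each thread builds and searches only its own sub-automaton, so both the per-thread state count and the matching time shrink, which is the effect reported in Tables 1 through 4.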
6. DEEPER THINKING OF USING MULTI-CORE From the above evaluation results we find there are many benefits to using multi-core to support deep packet inspection applications. We summarize the benefits of a multi-core supported system architecture for deep packet inspection as high performance, comprehensiveness, intelligence, and scalability. Firstly, traditional deep packet inspection applications are based on serial or very limited parallel execution of packet processing (Dharmapurikar et al., 2004). For example, a traditional single-threaded network-based intrusion detection system logs the activities it finds to a safeguarded database and detects whether the events match any malicious event recorded in the knowledge base. It must read packet-level information and process it on the processor serially. Traditional anti-virus systems and network visualizers also rely heavily on serially reading packets, files, or logged data. Serial execution performance largely depends on the clock frequency of a single CPU, which has not improved much in recent years. Therefore, these applications cannot process large volumes of packets in real time. If we can parallelize them at the application level instead of relying on operating system level parallelization, the workload of network security applications can be distributed to different cores to achieve high performance in terms of latency, throughput, and CPU utilization. Secondly, multi-core supported deep packet inspection applications can provide comprehensive protection against different threats. Currently, if a router performs deep packet inspection, for example to check for a certain virus signature in a packet's payload, its forwarding capability will be significantly affected. Most network providers cannot afford to slow down traffic to perform such security operations. Another fact is that current computer systems can run only a single network security application at a time, because these computing-intensive network security applications exclusively occupy CPU time. With support from multi-core and application-level parallelization, these computing tasks can be divided into many threads and distributed to different cores for processing. Thus, if we have enough cores, the network security applications become virtually invisible to users because other applications still have free cores on which to run. This enables comprehensive protection, as the system can integrate as many modules (such as an intrusion detection module, an anti-virus module, and an anti-spam module) as necessary. Thirdly, with support from multi-core, deep packet inspection applications will have greatly improved intelligence compared to traditional applications, because we can employ many computing-intensive methods to perform packet inspection, classification, and anomaly detection. In (Xiang & Zhou, 2006) we tested the performance of using a neural network to detect attack packets with the aid of packet marking schemes. It has advantages such as a high detection rate and a low false positive rate because it relies on a more intelligent method than signature matching. However, it also has the limitation of a long training time, and thus cannot provide real-time protection. How to improve the performance of the intrusion detection system by utilizing the power of multi-core therefore becomes critical for building a highly intelligent deep packet inspection application. Lastly, multi-core supported deep packet inspection applications are scalable.
They can be used not only on network-level devices but also on end-host devices. As we know, pure network-based security applications cannot fully capture the profile of each end host. Therefore, in order to achieve the
best protection, security checks must be performed on both network processing devices and end hosts. Many tasks that previously had to be done on infrastructure-level computing nodes can now be moved out to end-user personal computers, which not only alleviates the load on the information infrastructure but also makes the security checks more meaningful. Conversely, many tasks that previously had to be done at the end-host level, such as checking virus signatures, can now be performed at the infrastructure level, which can effectively prevent malicious code from propagating to the end hosts. These applications can also be customized for different requirements thanks to their scalability, because switching individual parallel modules on or off becomes easy with support from multi-core.
7. CONCLUSION In this chapter we present a new multi-core supported deep packet inspection architecture and an instance of that architecture. Leveraging the power of multi-core processors can be the answer to many as-yet-unsolved but crucial challenges in deep packet inspection applications, such as an isolated security environment, real-time attack detection and attack packet filtering, and real-time visualization of network monitoring. It enables sophisticated and stateful network processing, rich in semantics and context, as a routine capability provided by a network's routers. The use of multi-core will support flexible recompilation of security software rather than requiring a redesign of hardware. It will provide significant benefits to the security of future distributed networks and systems.
REFERENCES Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6), 333–340. doi:10.1145/360825.360855 Amarasinghe, S. (2007). Multicore programming primer and programming competition. A course at MIT, Cambridge, MA. Chonka, A., Zhou, W., Knapp, K., & Xiang, Y. (2008). Protecting information systems from ddos attack using multicore methodology. Proceedings of IEEE 8th International Conference on Computer and Information Technology. Dharmapurikar, S., Krishnamurthy, P., Sproull, T. S., & Lockwood, J. W. (2004). Deep packet inspection using parallel bloom filters. IEEE Micro, 24(1), 52–61. doi:10.1109/MM.2004.1268997 Eggers, S., Emer, J., Levy, H., Lo, J., Stamm, R., & Tullsen, D. (1997). Simultaneous multithreading: A platform for next-generation processors. IEEE Micro, 17(5), 12–19. doi:10.1109/40.621209 Hayes, C. L., & Luo, Y. (2007). Dpico: A high speed deep packet inspection engine using compact finite automata. Proceedings of ACM/IEEE ANCS, (pp. 195-203). Intel (2007). Intel® multi-core: An overview. Johnson, C., & Welser, J. (2005). Future processors: Flexible and modular. Proceedings of 3rd IEEE/ACM/ IFIP International Conference on Hardware/Software Codesign and System Synthesis, (pp. 4-6).
Liu, H., Zheng, K., Liu, B., Zhang, X., & Liu, Y. (2006). A memory-efficient parallel string matching architecture for high-speed intrusion detection. IEEE Journal on Selected Areas in Communications, 24(10), 1793–1804. doi:10.1109/JSAC.2006.877221 McKenney, P. E., Lee, D. Y., & Denny, B. A. (2008). Traffic generator software release notes. Moore, G. (1965). Cramming more components onto integrated circuits. Electronics Magazine, 38(8). Paxson, V., Asanović, K., Dharmapurikar, S., Lockwood, J., Pang, R., Sommer, R., et al. (2006). Rethinking hardware support for network analysis and intrusion prevention. Proceedings of the 1st conference on USENIX Workshop on Hot Topics in Security. Paxson, V., Sommer, R., & Weaver, N. (2007). An architecture for exploiting multi-core processors to parallelize network intrusion prevention. Proceedings of IEEE Sarnoff Symposium. Piyachon, P., & Luo, Y. (2006). Efficient memory utilization on network processors for deep packet inspection. Proceedings of ACM/IEEE ANCS, (pp. 71-80). Qi, Y., Xu, B., He, F., Yang, B., Yu, J., & Li, J. (2007). Towards high-performance flow-level packet processing on multi-core network processors. Proceedings of 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems, (pp. 17-26). Roesch, M. (1999). Snort - lightweight intrusion detection for networks. Proceedings of 13th USENIX LISA Conference, (pp. 229-238). Sohi, G. S., Breach, S. E., & Vijaykumar, T. N. (1995). Multiscalar processors. Proceedings of 22nd Annual International Symposium on Computer Architecture, (pp. 414-425). Sutter, H., & Larus, J. (2005). Software and the concurrency revolution. ACM Queue; Tomorrow’s Computing Today, 3(7), 54–62. doi:10.1145/1095408.1095421 Taylor, M. B., Lee, W., Miller, J., Wentzlaff, D., Bratt, I., Greenwald, B., et al. (2004). Evaluation of the raw microprocessor: An exposed-wire-delay architecture for ilp and streams. Proceedings of 31st Annual International Symposium on Computer Architecture, (pp. 2-13). Tian, D., & Xiang, Y. (2008). A multi-core supported intrusion detection system. Proceedings of IFIP International Conference on Network and Parallel Computing. Tripp, G. (2006). A parallel “string matching engine” for use in high speed network intrusion detection systems. Journal in Computer Virology, 2(1), 21–34. doi:10.1007/s11416-006-0010-4 Tullsen, D., Lo, J., Eggers, S., & Levy, H. (1999). Supporting fine-grain synchronization on a simultaneous multithreaded processor. Proceedings of the 5th International Symposium on High Performance Computer Architecture, (p. 54). Villa, O., Scarpazza, D. P., & Petrini, F. (2008). Accelerating real-time string searching with multicore processors. IEEE Computer, 41(4), 42–50. WMware. (2008).
Xiang, Y., & Zhou, W. (2006). Protecting information infrastructure from ddos attacks by mark-aided distributed filtering (madf). International Journal of High Performance Computing and Networking, 4(5/6), 357–367. doi:10.1504/IJHPCN.2006.013491 Yan, J., & Zhang, W. (2007). Hybrid multi-core architecture for boosting single-threaded performance. ACM SIGARCH Computer Architecture News, 35(1), 141–148. doi:10.1145/1241601.1241603
KEY TERMS AND DEFINITIONS
Deep Packet Inspection: Deep Packet Inspection (DPI) is a form of computer network packet filtering that examines the data and/or header part of a packet as it passes an inspection point, searching for protocol non-compliance, viruses, spam, intrusions, or other predefined criteria to decide whether the packet can pass or needs to be routed to a different destination, or for the purpose of collecting statistical information. This is in contrast to shallow packet inspection, which checks only the header portion of a packet.
High-Performance Security Systems: High-performance security systems refer to software or hardware systems that perform security functions at high performance in terms of processing speed, data volume, or throughput.
Intrusion Detection: Intrusion detection is the act of detecting actions that attempt to compromise the confidentiality, integrity, or availability of a resource.
Multi-Core: Multi-core represents a major evolution in processor development. A multi-core processor (or chip-level multiprocessor, CMP) combines two or more independent cores (normally CPUs) into a single package composed of a single integrated circuit (IC), called a die, or of several dies packaged together.
Network Security: Network security consists of the provisions made in an underlying computer network infrastructure, the policies adopted by the network administrator to protect the network and the network-accessible resources from unauthorized access, and the consistent and continuous monitoring and measurement of their effectiveness (or lack thereof).
Parallel Algorithms: Parallel algorithms are algorithms that can be executed a piece at a time on many different processing devices and then put back together again at the end to get the correct result.
Router: A router is a networking device whose software and hardware are usually tailored to the tasks of routing and forwarding information.
Chapter 38
State-Carrying Code for Computation Mobility Hai Jiang Arkansas State University, USA Yanqing Ji Gonzaga University, USA
ABSTRACT Computation mobility enables running programs to move around among machines and is essential for performance gains, fault tolerance, and increased system throughput. State-carrying code (SCC) is a software mechanism that achieves such computation mobility by saving and retrieving computation states during normal program execution in heterogeneous multi-core/many-core clusters. This chapter analyzes different kinds of state saving/retrieving mechanisms and their pros and cons. To achieve a portable, flexible and scalable solution, SCC adopts the application-level thread migration approach. Major deployment features are explained, and one example system, MigThread, is used to illustrate implementation details. Future trends are given to point out how SCC can evolve into a complete lightweight virtual machine. New high-productivity languages might step in to raise SCC to the language level. With SCC, thorough resource utilization is expected.
INTRODUCTION The way in which scientific and engineering research is conducted has radically changed in the past two decades. Computers have been used widely for data processing, application simulation and performance analysis. As application programs' complexity increases dramatically, powerful supercomputers are in demand. Due to the cost/performance ratio, computer clusters are commonly utilized and treated as virtual supercomputers. Such computing environments can be easily acquired for scientific and engineering applications. For each individual computer node, multi-core/many-core architecture is becoming popular in the computer industry. In the near future, hundreds or thousands of cores might be placed inside the computer
nodes on server-clusters. Multi-core clusters are promising high performance computing platforms where multiple processes can be generated and distributed across participating machines, and the multithreading technique can be applied to take advantage of the multi-core architecture on each node. This hybrid distributed/shared memory infrastructure fits the natural layout of computer clusters. Since computer clusters for high performance computing can change their configurations dynamically, i.e., computing nodes can join or leave the systems at runtime, the ability to re-arrange running jobs is in demand to exploit otherwise wasted resources. Such dynamic rescheduling can optimize the execution of applications, utilize system resources effectively, and improve the overall system throughput. Since computation mobility, i.e., the ability to move computations around, is essential to this dynamic scheduling, it has become indispensable to scalable computing for the following outstanding features:
• Load Balancing: Evenly distributing workloads over multiple cores/processors can improve the whole computation's performance. For scientific applications, computations are partitioned into multiple tasks running on different processors/computers. In addition to variations in computing power, multiple users and programs share the computation resources in non-dedicated computing environments, where load imbalance occurs frequently even though the workload was initially distributed evenly. Therefore, dynamically and periodically adjusting the workload distribution is required to make sure that all running tasks at different locations finish their execution at the same time in order to minimize total idle time. Such load reconfiguration needs to transfer tasks from one location to another.
• Load Sharing: From the system's point of view, load sharing typically increases the throughput of computer clusters. Studies have indicated that a large fraction of workstations could be unused for a large fraction of time. Scalable computing systems seek to exploit otherwise idle cores/processors and improve the overall system efficiency.
• Data Locality: Sharing resources includes two approaches: moving data to computation or moving computation to data. Current applications favor data migration, as in FTP, web, and Distributed Shared Memory (DSM) systems. However, when computation sizes are much smaller than data sizes, code or computation migration might be more efficient. Communication frequency and volume are minimized by converting remote data accesses into local ones. In data-intensive computing, when client-server and RPC (Remote Procedure Call) infrastructures are not available, computation migration is an effective approach to accessing massive remote data.
• Fault Tolerance: Before a computer system crashes, local computations/jobs should be transferred to other machines without losing most existing computing results. Computation migration and checkpointing are effective approaches.
The computation migration feature has existed in some batch schedulers and task brokers, such as Condor (Bricker, Litzkow, & Livny, 1992), LSF (Zhou, Zheng, Wang, & Delisle, 1993) and LoadLeveler (IBM Corporation, 1993). However, they can only work at a coarse (process-level) granularity and in homogeneous environments. So far, there is no effective task/computation migration solution for heterogeneous environments. This has become the major bottleneck in dynamic schedulers and an obstacle for scalable computing to achieve high performance and effective resource utilization. State-Carrying Code (SCC) is a software mechanism that intends to provide the ability to move computations around. It allows programs to contain both normal statements for application execution
and special primitives inserted by the SCC precompiler for portable computation state construction. Such a program can stop on one machine, generate its computation state, and then restart on another machine. SCC brings mobility to applications and addresses some critical issues, including computation granularity, virtualization level, computation/data locality, distributed synchronization, distributed data sharing, and highly efficient data conversion. The objective of this chapter is to introduce the relevant background, design strategies and limitations of SCC, as well as its use for computation mobility in heterogeneous environments. The next section discusses the related research and its limitations. Then the strategies and design of SCC are introduced to show how to support heterogeneous computation migration that takes advantage of multi-core architecture. The future work and conclusion are given at the end.
Computation Mobility Computation mobility concerns the movement of a computation that starts at one node and continues execution at another network node in a distributed system. Such computation relocation requires the movement of both programs and execution contexts. How to construct and reload the execution contexts at runtime is the essence of computation mobility. Sometimes data mobility might imply computation mobility when data is closely bound to computations and both of them have to be moved around together. In fact, such a scenario reflects the data-computes rule: a computation only works on its own data. Computation mobility enabled systems can be classified by their computation granularities and implementation levels. Candidates for computation units could be processes, threads, and user-defined objects. The engine of computation mobility can be placed at the data, language, application, library, virtual machine, kernel, or platform level. A variety of research activities have been reported in the literature.
Process Migration Process migration concerns the construction of process states. A process is an operating system abstraction representing an instance of a running computer program. Each process contains its own address space, program counter, stack, and heaps. All these dynamic factors form the process state which is normally buried in operating systems. Kernel-level process migration is supported by some distributed or networked operating systems. Some operating systems such as MOSIX (Barak & La’adan, 1998), Sprite (Douglis & Ousterhout, 1991), Mach (Accetta, et al., 1986), Locus (Walker, Popek, English, Kline, & Thiel, 1992), and OSF/1 (Zajcew, Roy, Black, & Peak, 1993) migrate the whole process images whereas others like Accent (Rashid & Robertson, 1981) and V Kernel (Theimer, Lantz, & Cheriton, 1985) apply “Copy-On-Reference” and “precopying” techniques to shorten the process freeze time. These operating systems can access process states efficiently and support preemptive migration at virtually any point. Thus, they provide good transparency and flexibility to end users. However, this approach brings much complexity into kernels and process images cannot be shared between different operating systems. Inability to provide a heterogeneous solution presents a severe drawback of this approach. User-level process migration as in Condor (Bricker, Litzkow, & Livny, 1992) and Libckpt (Plank, Beck, Kinsley, & Li, 1995) achieves similar results without the kernel modification. Normally user
libraries are used and linked to the application at compile-time. Process state is constructed through library calls. Since system calls are invoked in library calls to fetch process images from the kernel, this approach only works on homogeneous platforms. Application-level approach supports process migration in a heterogeneous environment since the process state can be replicated in user programs. The Tui system (Smith & Hutchinson, 1998), MigThread (Jiang & Chaudhary, 2004), SNOW (Chanchio & Sun, 2001), Porch (Ramkumar & Strumpen, 1997), PREACHES (Ssu, Yao, & Fuchs, 1999), and Process Introspection (Ferrari, Chapin, & Grimshaw, 1997) apply source-to-source transformation to convert programs into semantically equivalent source programs for saving and recovering process states across binary incompatible machines. Such an approach deliberately sacrifices transparency and reusability although pre-compilers/preprocessors are usually available to improve transparency to a certain degree. Since the original process states are buried in operating systems and no proper system calls are provided to fetch them easily, most process migration systems are not being widely used in open systems. One solution is to duplicate process states in user space as in application or language approaches. The state replicas are used instead for migration in heterogeneous environments.
Thread Migration Threads are flows of control in running programs. One process might contain multiple threads which share the same address space including text and data segments. However, each thread has its own stack and heap. Thread migration enables fine-grained computation adjustment in parallel computing. As multithreading becomes a popular programming practice, thread migration is increasingly important in finetuning high-end computing to fit dynamic and non-dedicated environments. Different threads can be migrated to utilize different resources for load balancing and load sharing. The core of thread migration is about transferring thread state and necessary data from the local heap to the destination. Current thread migration research focuses on updating internal self-referential pointers in stacks and heaps. Three approaches exist in the literature. The first approach uses language and compiler support to maintain enough type information and identify pointers as in MigThread (Jiang & Chaudhary, 2004) and Arachne (Dimitrov & Rego, 1998). The second approach requires scanning the stacks at run-time to detect and translate the possible pointers dynamically. The representative implementation of this is Ariadne (Mascarenhas & Rego, 1995). Since some pointers in stack cannot possibly be detected (Itzkovitz, Schuster, & Shalev, 1998), the resumed execution can be incorrect. The third approach is the most popular one. It necessitates the partitioning of the address space and reservation of unique virtual addresses for the stack of each thread so that the internal pointers maintain the same values. A common solution is to preallocate memory space for threads on all machines and restrict each thread to migrate to its corresponding location on other machines. This “iso-address” solution requires a large address space and is not scalable since there are limitations on stacks and heaps (Itzkovitz, Schuster, & Shalev, 1998). Such systems include Millipede (Itzkovitz, Schuster, & Shalev, 1998), Amber (Chase, Amador, Lazowska, Levy, & Littlefield, 1996), UPVM system (Casa, Konuru, Prouty, Walpole, & Otto, 1994), PM2 (Antoniu & Boung, 2001), Nomad system (Milton, 1998), and the one proposed by Cronk et al (Cronk & Mehrotra, 1997). Based on the location, threads can be classified as kernel-, user-, and language-level threads. Kernel-level threads exist in operating systems and can be scheduled onto processors directly. User-level
threads are defined and scheduled by libraries in user space. Language-level threads are defined in a programming language. For example, Java threads are defined in the Java language and implemented in the Java Virtual Machine (JVM). According to thread types, migration systems have to fetch thread states from different places and port them to different platforms. To our knowledge, only MigThread (Jiang & Chaudhary, 2004) and Jessica2 (Zhu, Wang, & Lan, 2002) can support heterogeneous thread migration. MigThread achieves this by defining its own data conversion scheme whereas Jessica2 relies on modified JVMs.
Checkpointing Checkpointing is the saving of computation state, usually in stable storage, so that it may be reconstructed later. Therefore, the major difference between migration and checkpointing is the medium: memory-to-memory vs. memory-to-file transfer. Checkpointing may use most migration strategies. Libckpt (Plank, Beck, Kinsley, & Li, 1995), PREACHES (Ssu, Yao, & Fuchs, 1999), Porch (Ramkumar & Strumpen, 1997), CosMic (Chung, 1997), and other user-directed checkpointing systems save process states in stable storage, such as magnetic disks. The memory exclusion technique has been employed effectively in incremental checkpointing, where pages are not checkpointed when they are clean. The "compiler-assisted checkpointing" technique uses a compiler/preprocessor to ensure correct memory exclusion calls for better performance. For message passing and shared address space parallel computing applications, CoCheck (Stellner, 1996) and C3 (Bronevetsky, Marques, Schulz, Pingali, & Stodghill, 2004) manage to obtain clear-cut checkpoints which can be treated as computation states for migration. Hence, from the computation state's point of view, migration and checkpointing systems are equivalent.
Virtual Machines To enable code portability in heterogeneous environments, Virtual Machine (VM) techniques are widely used, for example in JVM (Lindholm & Yellin, 1999), VMware (VMware Inc., 1999), and Xen (Barham, et al., 2003). They present the image of a dedicated raw machine to each user. Virtual machines allow the configuration of an entire operating system to be independent from that of the physical resource; it is possible to completely represent a VM “guest” machine by its virtual state and instantiate it in any VM “host.” Therefore, VMs provide stable computing environments to programs. Some migration systems, such as Jessica2 (Zhu, Wang, & Lan, 2002), can work on top of process VMs to enable the migration in heterogeneous environments. Process VMs are used to interpret computation states to hide low-level architecture variety. Since such process VMs play a role as data converter and provide uniform platforms, states can be fetched in a unique way and heterogeneity issue is resolved smoothly (Zhu, Wang, & Lan, 2002). However, VMs do not support efficient computation mobility since they always have difficulties in distinguishing useful resources. A safe way is to wrap up the whole abstract view of the underlying physical machine. Obviously, the efficiency drops dramatically and VMs themselves are not portable across different physical hosts.
Mobile Agents A mobile agent is a software object, representing an autonomous computation that can travel in a network to perform tasks on behalf of its creator. It has the ability to interact with its execution environment, and to act asynchronously and autonomously upon it. The code is their object-oriented context, and most existing mobile agent systems, including Charm++ (Kale & Krishnan, 1998), Emerald (Jul, Levy, Hutchinson, & Blad, 1998), Telescript (White, 1996) and IBM Aglets (Lange & Oshima, 1998), implement their agents in object-oriented languages such as Java and Telescript. Mobile agents demand a different coding environment, including new language constructs, programming styles, compilers, and execution platforms. Although current mobile agent systems are intended for general applications and have demonstrated some progress in Internet/mobile computing, it is still not clear how they will perform for computation-intensive and high performance computing applications with object-oriented technology.
Deployment of SCC State-Carrying Code (SCC) takes advantage of multi-core SMP architecture, virtualizes computations, and achieves better application performance in heterogeneous environments. Computation mobility is the essential tool. All existing packages have indicated the fact that computation migration performance is mainly affected by computing units and the location of computation states. SCC provides a flexible, portable, practical, and efficient solution to computation mobility.
Granularity The term granularity refers to the size of computation units which can move around individually. Normally it indicates the flexibility that mobility systems can provide. Since Virtual Machines can only dump the whole system images, they support coarse-grained mobility. All computations on the VMs will be transferred together and they cannot be distinguished explicitly. In this case, sequential and parallel jobs are treated as the same. This extreme migration case is efficient only when all VM’s local jobs need to leave the current machine. VMs provide the virtually stable platform for applications. However, VMs themselves are not portable over various physical machines. In VM migration, applications are not aware of the movement, but obviously such coarse-grained migration incurs high overhead and inflexibility. From the traditional operating systems’ point of view, processes are the basic computational abstraction. All sequential computations are executed on just a single process. In parallel computing, the overall jobs need to be decomposed into multiple tasks. In multi-process parallel applications such as those using MPI (Message Passing Interface), tasks are assigned to processes for parallel execution. In such cases, processes are treated as computation units. Compared to VM migration, process migration has its advantages in reducing overheads significantly without worrying about the execution environment. It can also manipulate individual or partial applications. Parallel jobs with multiple processes can be reconfigured dynamically. Different implementations of process migration exhibit various degrees of transparency to programmers. To reduce the heavy inter-process context switch overhead, many modern parallel computing applications adopt multi-threading techniques. Each process may contain multiple threads to share the same
address space. Former multi-process applications can be replaced by multi-threaded counterparts; alternatively, former processes are further partitioned into multiple local threads to achieve finer task decomposition. Synchronization overhead among computation units is further reduced. Sequential applications can be viewed as single-threaded instances whereas parallel ones consist of multiple threads. Once threads are used as computation units, thread migration can move sequential jobs and entire or partial parallel computations. Since process migration treats a multi-threaded application as a whole, it will either move the whole computation or perform no migration at all. Such "all-or-nothing" scenarios do not exist in thread migration. However, since some thread libraries are invisible to operating systems, the difficulty of implementing thread migration is higher than that of process migration. Charm++ (Kale & Krishnan, 1998), Emerald (Jul, Levy, Hutchinson, & Blad, 1998), and other mobile agent systems provide mechanisms at the language level to migrate user-defined objects/agents. However, new languages and compilers expose everything to programmers; therefore, many legacy systems have to be re-deployed. The lack of transparency is the major drawback, so most systems have given up this approach. Multi-core architecture has been widely adopted. To take advantage of the extra computing cores, applications need to be implemented with multiple processes or threads. Normally they need to exchange or share data with each other, so multi-threading is the preferred choice. Also, all modern operating systems are multi-threaded, and threads in applications can be mapped onto kernel threads naturally. SCC follows this multi-threading trend, and threads will be the basic units of computation mobility although processes can be handled as well.
Positioning Computation States The main issue in SCC and all other migration systems is how to retrieve computation states quickly and precisely. Then the current computation can be stopped on the current machine and resumed on a new machine based on its state. Computation states are buried or recreated at data, language, application, library, thread virtual machine, kernel, and platform virtual machine levels, as shown in Figure 1. However, not all of them are suitable to user threads whose states consist of the execution contexts in kernel and the thread contexts in thread libraries. Platform Virtual Machine approach provides execution platforms such as VMware (VMware Inc., 1999), and Xen (Barham, et al., 2003). These VMs can only deal with the whole environment, not the fine-grained tasks such as threads. Kernel-level approach is the most effective and transparent method for process migration. However, it cannot deal with user threads and does not work in heterogeneous environments. Thread Virtual Machine approach can only provide execution platform for running threads, such as Jessica2 (Zhu, Wang, & Lan, 2002) in JVM (Lindholm & Yellin, 1999) which needs to be modified to support Java thread migration. However, the requirement of distributing and installing the modified JVMs makes applications only work in closed computer clusters. User-level approach provides computation mobility function in user libraries (Bricker, Litzkow, & Livny, 1992) (Plank, Beck, Kinsley, & Li, 1995). Unique library calls will be translated into different system calls to fetch execution contexts inside operating systems. Normally it is not straightforward to convert computation states from one system to another. Application-level approach constructs computation states inside the source code for better portability
Figure 1. Implementation levels of computation mobility.
and heterogeneity (Smith & Hutchinson, 1998) (Jiang & Chaudhary, 2004) (Chanchio & Sun, 2001). Normally pre-compilers/preprocessors are provided for code transformation without programmers’ involvement. Computation units, such as processes or threads, are defined according to applications. Flexibility is another major advantage of this approach. Language-level approach requires new languages and compilers (Kale & Krishnan, 1998) (Jul, Levy, Hutchinson, & Blad, 1998). It is hard to persuade programmers to learn new languages. Sometimes data-level approach is applicable in regular scientific applications where computation states can be represented by pure variable sets. Due to the required characteristics in computer clusters, SCC selects the application-level approach for portability, flexibility, heterogeneity and scalability.
Issues in Application-level Approach There are two major issues in application-level thread migration:
• Availability of Source Code
The major restriction of the application-level approach is the requirement of source code. Since thread states need to be replicated in the programs, applications available only as executable code cannot be transformed for mobility. However, for high performance computing, users are normally also the programmers, and most of the time the source code is available. To remove this restriction, the migration engine would have to be buried at lower levels, such as the user, kernel, or platform virtual machine level. However, as discussed before, approaches at those levels are not able to utilize multi-core architecture effectively, and the performance improvement will be limited. Therefore, the application-level approach is the better choice if the source code is available.
• Associated Overheads
In some fields, such as Internet Computing and Mobile Agents, prompt state construction is the key
Figure 2. The Infrastructure of SCC.
issue so that computations (agents) can start and stop quickly. However, in the application-level approach, the overhead of constructing computation states is relatively high. Therefore, it is not suitable for Internet/Agent computing. SCC aims at high-end computing, where most of the overhead lies in the scientific computing or business data processing itself (mainly floating-point operations or data searching). The cost of setting up a state replica in the program is negligible. Many Grand Challenge problems in Computational Physics, Computational Chemistry, and Bioinformatics exhibit this characteristic. The work on MigThread (Jiang & Chaudhary, 2004) has demonstrated this phenomenon.
Infrastructure SCC suggests supporting both process and thread migration at the application level. Among many existing application-level process migration systems, MigThread (Jiang & Chaudhary, 2004) covers most of the features required by SCC and supports thread migration to take advantage of multi-core architecture. In this chapter, MigThread is used as an example to demonstrate the major features of SCC. SCC consists of two parts: a preprocessor (pre-compiler) and a run-time support module. The preprocessor is designed to transform the user's source code into a format from which the run-time support module can construct the computation state precisely and efficiently. The run-time support module constructs, transfers, and restores computation states dynamically, and provides other run-time safety checks, as shown in Figure 2. Most of the time, user assistance is not required unless the preprocessor encounters unsolvable third-party library calls; manual support is a necessity in that case. The preprocessor of MigThread is similar to the ones in other SCC systems and conducts the following tasks:
• Information Collection: Collect related stack and heap data for future state construction. The stack data includes globally shared variables, local variables, function parameters, and program counters, whereas the heap data contains dynamically allocated memory segments.
• Tag Definition: Create tags for heterogeneous data blocks.
• Position Labeling: Detect and label potential migration points.
• Control Dispatching: Insert switch statements to orchestrate execution flows.
• Safety Protection: Detect and overcome unsafe cases; seek human assistance/instruction for third-party library calls; and leave other unresolved cases to the run-time support module.
Figure 3. The original function.
Its run-time support module consists of a thread record list (including globally shared data), a stack management module, a memory block management module, and a pointer-casting closure. Since activation frames in the stack are arranged in last-in-first-out (LIFO) order, stacks are maintained in linked lists. Meanwhile, heaps and PC Closures are maintained in red-black trees for random accesses. The run-time support module is activated through primitives inserted by the preprocessor at compile time. It is required to link this run-time support library with user's applications in the final compilation. During the execution, its task list includes:
• Stack Maintenance: Keep a user-level stack of activation frames for each thread.
• Tag Generation: Fill out tag contents which are platform-dependent.
• Heap Maintenance: Keep a user-level memory management subsystem for dynamically allocated memory blocks.
• Migration: Construct, transfer, and restore computation state.
• Data Conversion: Translate computation states for destination platforms.
• Safety Protection: Detect and recover remaining unsafe cases.
• Pointer Updating: Identify and update pointers after migration or checkpointing.
State Construction The state data typically consists of the process data segment, stack, heap and register contents. In SCC, the computation state is in a platform-independent format to reduce migration restrictions. Therefore,
SCC does not rely on any type of thread library or operating system. State construction is done by both the preprocessor and the run-time support module. The preprocessor collects globally shared variables, stack variables, function parameters, program counters, and dynamically allocated memory regions into certain pre-defined data structures. Since the virtual address spaces might be different, pointers are marked at compile time and updated at runtime. In many SCC systems such as the Tui system (Smith & Hutchinson, 1998) and SNOW (Chanchio & Sun, 2001), related variables are collected at the migration points to reduce the size of the actual computation states. However, due to pointer arithmetic operations, related variables are not always detected correctly.
Figure 4. The transformed function in MigThread.
Figure 5. Tag definition and generation in MigThread.
MigThread puts variables into two predefined structures to speed up the state construction process. This fast approach can also capture hidden variables, although it might be over-conservative. Figures 3 and 4 show this process in MigThread. A simple function foo() is defined in Figure 3. In MigThread, the preprocessor transforms the function and generates a corresponding MTh_foo(), shown in Figure 4. Non-pointer variables are collected in MThV whereas pointers are gathered in MThP (as shown in area 1). In thread stacks, each function's activation frame contains MThV and MThP to record the current function's computation state. The program counter (PC) is a register that contains the memory address of the current execution point within a program. Its content is represented as a series of integer values. In MigThread, it is declared as MThV.stepno in each affected function. Since all possible positions for migration have been detected at compile time (as shown in area 4), different integer values of MThV.stepno correspond to different adaptation points. In the transformed code, after the function initialization in area 2, a switch statement is inserted to dispatch execution to each labeled point according to the value of MThV.stepno, as shown in area 3. The switch and goto statements allow control to jump to resumption points quickly. SCC also supports user-level memory management for heaps. Eventually, all computation-state-related contents, including stacks and heaps, are moved out to user space and handled by SCC directly for portability.
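As a rough illustration of this transformation (not the actual output of MigThread's preprocessor; MTh_push_frame(), MTh_pop_frame(), and MTh_adapt_point() are invented stand-ins for the run-time primitives), the original foo() and its transformed counterpart might look like this:

```c
/* Illustrative sketch only, approximating the transformation described
 * above.  The helper functions are hypothetical stubs, not MigThread's API. */
#include <stddef.h>

void MTh_push_frame(void *v, size_t vs, void *p, size_t ps) { (void)v; (void)vs; (void)p; (void)ps; }
void MTh_pop_frame(void) { }
void MTh_adapt_point(void) { /* may save state and migrate here */ }

/* Original function (cf. Figure 3). */
int foo(int n, double *buf) {
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += (int)buf[i];
    return sum;
}

/* Transformed counterpart (cf. Figure 4). */
int MTh_foo(int n, double *buf) {
    /* Area 1: non-pointer variables collected in MThV, pointers in MThP. */
    struct { int stepno, n, i, sum; } MThV = { 0, n, 0, 0 };
    struct { double *buf; } MThP = { buf };

    /* Register the frame; on a restarted run the run-time module would
     * refill MThV/MThP from the saved state before dispatching below. */
    MTh_push_frame(&MThV, sizeof MThV, &MThP, sizeof MThP);

    /* Area 3: dispatch to the resumption point recorded in MThV.stepno. */
    switch (MThV.stepno) {
    case 1: goto adapt_1;
    default: break;                   /* 0 means a fresh invocation */
    }

    for (MThV.i = 0; MThV.i < MThV.n; MThV.i++) {
        MThV.sum += (int)MThP.buf[MThV.i];
adapt_1:
        MThV.stepno = 1;              /* Area 4: potential adaptation point */
        MTh_adapt_point();
    }

    MTh_pop_frame();
    return MThV.sum;
}
```

Keeping every live variable inside MThV/MThP is what allows the run-time module to capture the frame wholesale; the switch/goto pair is what lets execution resume at the recorded adaptation point on the destination machine.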
Data Conversion Schemes Computation states can be transformed into pure data. If different platforms use different data formats, the computation states constructed on one platform need to be interpreted by another. Thus, data conversion is unavoidable. Some application-level migration systems only work in homogeneous systems without the data conversion issue. Many SCC systems adopt symmetric data conversion approach which can
be easily implemented. Both the sender and receiver need to convert data to and from an intermediate (universal) data format. Special compilers (Smith & Hutchinson, 1998) or data representation libraries such as XDR (Srinivasan, 1995) are employed. In open and homogeneous systems, data conversion still has to be conducted twice even though it is not necessary at all (in closed systems, these conversions can be eliminated). MigThread (Jiang & Chaudhary, 2004) adopts an asymmetric data conversion method that performs data conversion only on the receiver side. This approach is more flexible in open systems: data conversion is conducted only when necessary, i.e., when senders and receivers are on different platforms. The module in MigThread is called Coarse-Grain Tagged "receiver makes it right" (CGT-RMR). This tagged RMR scheme can tackle data alignment and padding physically, convert data structures as a whole, and eventually generate a lighter workload compared to existing standards. It accepts ASCII character sets, handles byte ordering, and adopts the IEEE 754 floating-point standard because of its dominance in the market. Since CGT-RMR in MigThread converts variables as a whole, in most cases it is faster than the schemes in other SCC systems. In MigThread, programmers do not need to worry about data formats. The preprocessor parses the source code, sets up type systems, transforms the source code, and communicates with the run-time support module through inserted primitives. With help from the type system, CGT-RMR can analyze data types, flatten aggregate types recursively, detect padding patterns, and define tags as in Figure 5. However, the actual tag contents can be set only at run-time, and they may not be the same on different platforms. Since all of the tedious tag definition work has been performed by the preprocessor, the programming style becomes extremely simple. Also, with global control, low-level issues such as the data conversion status can be conveyed to upper-level scheduling modules. Therefore, the easy coding style and performance gains come from the preprocessor. CGT-RMR is very efficient in handling large data chunks, which are common in migration and checkpointing (Jiang & Chaudhary, 2004). Tags in CGT-RMR are used to describe data types and their padding so that data conversion routines can handle aggregate types as well as common scalar types. Tags are defined and generated for these structures as well as for dynamically allocated memory blocks in the heap. At compile time, it is still too early to determine the content of the tags. The preprocessor defines rules to calculate structure members' sizes and the varying padding patterns, and inserts sprintf() calls to glue partial results together. The actual tag generation takes place at run-time when the sprintf() statement is executed. Only one statement is issued for each data type regardless of whether it is a scalar or an aggregate type. The flattening procedure is accomplished by MigThread's preprocessor during tag definition instead of by the encoding/decoding process at run-time. Hence, programmers are freed from this responsibility. In MigThread, all memory segments for predefined data structures are represented in a "tag-block" format. The process/thread stack becomes a sequence of these structures and their tags. Memory blocks in heaps are also associated with such tags to express the actual layout in memory space.
Therefore, the computation state physically consists of a group of memory segments associated with their own tags in a “tag-segment” pair format.
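The following sketch shows how a single sprintf() call can glue a structure's sizes and offsets into a run-time tag. The format string here is invented for illustration and is not MigThread's actual tag grammar.

```c
/* Illustrative sketch, not MigThread's actual tag format: one way a
 * preprocessor-emitted statement could describe a struct's scalar fields
 * (and, implicitly, the padding gaps between them) at run time. */
#include <stddef.h>
#include <stdio.h>

struct sample {
    char   flag;
    double value;      /* padding is typically inserted before this field */
    int    count;
};

int main(void) {
    char tag[128];
    /* Each "name(size@offset)" entry describes one scalar slot; gaps
     * between consecutive offsets reveal the padding to be skipped. */
    sprintf(tag, "sample[%zu]:flag(%zu@%zu)value(%zu@%zu)count(%zu@%zu)",
            sizeof(struct sample),
            sizeof(char),   offsetof(struct sample, flag),
            sizeof(double), offsetof(struct sample, value),
            sizeof(int),    offsetof(struct sample, count));
    puts(tag);   /* prints a platform-dependent layout description */
    return 0;
}
```

Because sizeof and offsetof are resolved on the machine that runs the code, the same generated statement yields the correct, platform-specific tag on every architecture, which is why the tag contents can only be filled in at run time.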
State Restoration To support open systems, if symmetric data conversion scheme is adopted, both sides need to convert data. But with the asymmetric one, the senders do not need to perform data conversion. Only the receivers have to convert the computation state, i.e., data, as required. Normally variables are converted one by one. MigThread can do this block by block. Since activation frames in stacks are re-run and heaps
are recreated, a new set of segments in "tag-block" format is available on the new platform. MigThread first compares the architecture tags with strcmp(). If they are identical and the blocks have the same sizes, the platform has not changed and the old segment contents are simply copied over with memcpy() to the new architecture. This enables prompt processing between homogeneous platforms, while symmetric conversion approaches still suffer data conversion overhead on both ends. If the platform has changed, conversion routines are applied to all memory segments. For each segment, a "walk-through" process is conducted against its corresponding old segment from the previous platform. In these segments, according to their tags, memory blocks are viewed as alternating scalar type data and padding slots. The high-level conversion unit is the data slot rather than the byte, in order to achieve portability. The "walk-through" process maintains two index pointers pointing to a pair of matching scalar data slots in the two blocks. The contents of the old data slots are converted and copied to the new data slots if the byte ordering changes, and then the index pointers are moved down to the next slots. Meanwhile, padding slots are skipped over. In MigThread, data items are expressed in this "scalar type data - padding slots" pattern to support heterogeneity.
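A minimal sketch of this restoration path, assuming simple segment and slot descriptors rather than MigThread's actual data structures, is given below.

```c
/* Illustrative sketch of the restoration path described above (assumed
 * data structures; not MigThread's actual API).  Each segment carries a
 * tag string plus an array of scalar-slot descriptors. */
#include <stddef.h>
#include <string.h>

typedef struct { size_t offset, size; } slot_t;      /* one scalar data slot */
typedef struct {
    const char   *tag;       /* architecture/layout tag for this segment    */
    const slot_t *slots;     /* scalar slots; padding gaps are implicit     */
    size_t        nslots;
    unsigned char *data;
    size_t        size;
} segment_t;

static void byteswap(unsigned char *p, size_t n) {
    for (size_t i = 0; i < n / 2; i++) {
        unsigned char t = p[i]; p[i] = p[n - 1 - i]; p[n - 1 - i] = t;
    }
}

/* Restore 'old' (from the source platform) into 'new_' (rebuilt locally). */
void restore_segment(const segment_t *old, segment_t *new_, int endian_differs) {
    if (strcmp(old->tag, new_->tag) == 0 && old->size == new_->size) {
        memcpy(new_->data, old->data, old->size);     /* same layout: copy verbatim */
        return;
    }
    /* Walk matching scalar slots, converting one slot at a time and
     * silently skipping the padding between them. */
    for (size_t i = 0; i < old->nslots && i < new_->nslots; i++) {
        size_t n = old->slots[i].size < new_->slots[i].size
                 ? old->slots[i].size : new_->slots[i].size;
        memcpy(new_->data + new_->slots[i].offset,
               old->data  + old->slots[i].offset, n);
        if (endian_differs)
            byteswap(new_->data + new_->slots[i].offset, n);
    }
}
```

The homogeneous fast path (a single strcmp plus memcpy) is what gives the asymmetric scheme its advantage over symmetric conversion, which would pay the encoding cost even when sender and receiver share a platform.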
Safety Issues Some SCC systems such as the Tui system (Smith & Hutchinson, 1998) and SNOW (Chanchio & Sun, 2001) declare that they only work with programs written in type safe languages or substrates whereas others assume so. Most migration safety issues can be eliminated by the “type safety” requirement, but not all of them. MigThread can detect and handle more migration “unsafe” features, including pointer casting, pointers in unions, library calls, and incompatible data conversion (Jiang & Chaudhary, 2004). Then computation states will be precisely constructed to make those programs eligible for migration. Programmers are free to code in any programming style. Pointer casting does not mean the cast between different pointer types, but the cast to/from integral types, such as integer, long, or double. The problem is that pointers might hide in integral type variables. The central issue is to detect those integral variables containing pointer values (or memory addresses) so that they could be updated during state restoration. Casting could be direct or indirect. Pointer arithmetic and operations may cause harmful pointer casting which is the most difficult safety issue. In MigThread, an intra-procedural, flow-insensitive, and context-insensitive pointer inference algorithm was proposed to detect hidden pointers created by unsafe pointer casting, regardless of whether it is applied in pointer assignments or memcpy() library calls. The static analysis at compile time and dynamic checks at run-time work together to trace and recover unsafe pointer uses. Library calls bring difficulties to all migration schemes since it is hard to determine what is going on inside the library code. It is even harder for application-level schemes because they work on the source code and “memory leakage” might happen in the libraries. Without the source code of libraries, it is difficult to intercept all memory allocations because of the “blackbox” effect. The current version of MigThread provides a user interface to specify the syntax of certain library calls so that the preprocessor can know how to insert proper primitives for memory management and pointer tracing. Another unsafe factor is the incompatible data conversion. Between incompatible platforms, if data items are converted from higher precision formats to lower precision formats, precision loss may occur. Detecting incompatible data formats and conveying this low-level information up to the scheduling module can help move computations back or to proper nodes. In MigThread, the pointer inference algorithm, a friendly user interface, and the data conversion scheme
CGT-RMR work together to eliminate unsafe factors and qualify almost all programs for migration.
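As a contrived illustration of the pointer-casting hazard just described (not taken from MigThread itself), the following C fragment hides a pointer value in an integral variable and in a raw buffer; a naive migration scheme would restore both unchanged and leave them pointing at memory on the old machine.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int *data = malloc(sizeof *data);
    *data = 42;

    /* Direct cast: the address now hides in a plain long
     * (assumes a platform where long can hold a pointer). */
    long hidden = (long)data;

    /* Indirect cast: memcpy() smuggles the address into a byte buffer. */
    unsigned char buf[sizeof(int *)];
    memcpy(buf, &data, sizeof data);

    /* If migration happened here, 'hidden' and 'buf' would still hold the
     * OLD address.  Only an analysis that flags them as pointer-carrying,
     * as MigThread's pointer inference does, can patch them on restore. */
    int *recovered = (int *)hidden;
    printf("%d\n", *recovered);

    free(data);
    return 0;
}
```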
Adaptation Point Analysis

An adaptation point is a location in a program where a thread/process can be migrated or checkpointed correctly. The placement of adaptation points is critical, since the distance between two consecutive adaptation points determines a migration scheme's sensitivity and overhead. If two adaptation points are too far apart, applications may be insensitive to the dynamic situation; if they are too close, the overheads of constructing, saving, and retrieving computation states slow down the actual computation. Several methods have been proposed in the literature for inserting adaptation points. In the first approach, adaptation points are inserted by users or initiated at a barrier (Abdel-Shafi, Speight, & Bennet, 1999). This method is straightforward, but it places an undue burden on inexperienced programmers who do not know the structure and workload of their applications, and for large, complex applications involving many developers it is very difficult for users to insert adaptation points manually. Automatic adaptation point placement methods were proposed to overcome this disadvantage. SNOW handles migration points by counting floating point operations; that is, it inserts a migration point after a certain number of floating point operations. Since the upper bound of a loop is not always known at compile time, this scheme cannot determine the operation count inside a loop. Furthermore, it is not applicable to non-scientific applications where most operations may not be floating point operations. In fact, any such quantitative method is hard to apply accurately because of pipelining, caches, and compiler optimizations, so this approach can be inaccurate under many circumstances. The adaptation point placement approach in (Li, Stewart, & Fuchs, 1994) inserts potential adaptation points inside loops and uses a counter to determine when checkpointing actually occurs. The counter is initialized to a value called the “reduction factor” and decremented by one on each loop iteration; when it reaches zero, the program performs an actual checkpoint. This scheme can only insert potential adaptation points at certain sparse locations, and with many small loops or loops with unknown upper bounds, checkpointing might not take place for a long time. The method is therefore adequate for checkpointing, but if applied to migration it is insensitive to the dynamic environment, since migration is only allowed at sparse points. MigThread aggressively inserts many potential adaptation points into users' programs through its preprocessor: at least one potential adaptation point is placed in each nested loop, subroutine, or branch. Whether a potential adaptation point is actually activated is decided by a scheduler (or server), which determines the actual adaptation intervals according to the dynamic environment or by using any existing optimal adaptation interval estimation method, e.g., Young's formula (Young, 1974). When an actual migration or checkpoint is needed, the scheduler sends a signal to the user's program to set a flag that controls every potential adaptation point. This approach can tolerate more adaptation points, and thus applications can be more sensitive to their dynamic situations.
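A minimal sketch of this flag-controlled mechanism is shown below. The primitive and flag names are hypothetical and MigThread's actual preprocessor output differs, but the control flow is the one described above: a cheap flag test at every potential adaptation point, and expensive state construction only when the scheduler's signal sets the flag.

```c
#include <signal.h>

static volatile sig_atomic_t adapt_flag = 0;   /* set by the scheduler's signal */

void save_state_and_maybe_migrate(void);       /* assumed runtime primitive */

static void adapt_handler(int sig)
{
    (void)sig;
    adapt_flag = 1;                            /* request migration/checkpoint */
}

/* A potential adaptation point: nearly free when inactive (one flag test),
 * costly only when the scheduler has activated it. */
#define POTENTIAL_ADAPT_POINT()                         \
    do {                                                \
        if (adapt_flag) {                               \
            adapt_flag = 0;                             \
            save_state_and_maybe_migrate();             \
        }                                               \
    } while (0)

void compute(double *a, long n)
{
    signal(SIGUSR1, adapt_handler);            /* registration would normally
                                                  happen once at startup      */
    for (long i = 0; i < n; i++) {
        POTENTIAL_ADAPT_POINT();               /* the preprocessor would insert
                                                  one such point per nested
                                                  loop, subroutine, or branch */
        a[i] = a[i] * a[i];
    }
}
```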
Library and system calls cause problems for all application-level migration approaches, since no source code is available for inserting potential adaptation points at the higher level. However, to preserve the portability of the application-level approach, it is reasonable to give up sensitivity during third-party library calls; fortunately, the execution time of most library calls is relatively short.
For parallel applications that use relaxed memory consistency models, MigThread first inserts a pseudo-barrier primitive to synchronize the computation progress of multiple threads/processes across different machines. If an actual migration is scheduled, a real barrier operation is activated to synchronize both computation progress and data copies, so that migration can take place with consistent states.
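The pseudo-barrier can be realized cheaply, for example as in the sketch below (assumed names, not MigThread's actual primitive): in the common case threads only test a flag, and only when a migration has been scheduled do they fall through to a real barrier.

```c
#include <pthread.h>

extern volatile int migration_scheduled;   /* set when the scheduler requests migration  */
extern pthread_barrier_t real_barrier;     /* initialized elsewhere for all participants */

/* Pseudo-barrier: no synchronization cost on the fast path, a full barrier
 * only when consistent computation progress and data copies are needed. */
static inline void pseudo_barrier(void)
{
    if (migration_scheduled)
        pthread_barrier_wait(&real_barrier);
}
```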
Communication States

Networking applications set up communication channels using protocols such as TCP/IP. During and after migration, both in-flight messages and the channels themselves need to be handled properly. If migration happens in a closed system running a distributed or networked operating system, the OS can re-establish communication seamlessly; this kernel-level approach is efficient and transparent, but it does not provide portability. Some SCC systems, such as SNOW (Chanchio & Sun, 2001), proposed a new communication protocol above the regular TCP/IP and UDP/IP layers to deal with message channels in user space. This “connection-aware” protocol can reconstruct communication channels after migration. However, it only works within a closed PVM (Parallel Virtual Machine) system, since the modified PVM library must be installed on all participating computers. This “closed-system” restriction is too strict for generic systems that open ports to communicate with outside applications. In open systems, resetting the communication layout is difficult because we do not have full control of the whole system. To achieve portability, system call wrappers can again be used to forward data to shadow threads. The performance may not match kernel-level and VM-level approaches, but the portability gain is very attractive in heterogeneous distributed systems.
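The wrapper idea can be sketched as follows. This is only an illustration: the migrated flag, the forward_to_shadow() helper, and the wrapper name are assumptions, and a real implementation would also have to handle receives, connection teardown, and re-establishment.

```c
#include <sys/types.h>
#include <sys/socket.h>

extern int migrated;                                              /* set by the runtime */
ssize_t forward_to_shadow(int fd, const void *buf, size_t len);   /* assumed helper     */

/* Application-level send() wrapper: before migration it is a thin
 * pass-through; afterwards it relays the payload to a shadow thread
 * left on the home node, which still owns the live connection. */
ssize_t mig_send(int sockfd, const void *buf, size_t len, int flags)
{
    if (!migrated)
        return send(sockfd, buf, len, flags);          /* normal fast path */
    return forward_to_shadow(sockfd, buf, len);
}
```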
Future Trends

SCC adopts the application-level thread/process migration approach. To achieve a complete high-level virtual machine abstraction, many features need to be enhanced or added; for example, communication state handling needs to be improved to support open systems properly. Some future trends are listed below to indicate possible research directions.
Resource Access Transparency

Most SCC systems focus on code and execution state migration. In fact, the most difficult task is to access the original system's resources seamlessly. Such resources include data in memory, files in secondary storage, signals, communication connections, and even databases. Communication states are one example: modified communication primitives can tear down and set up connections transparently, but this requires a new communication library. One possible solution is to install a proxy server at the original node before any migration. To access old resources such as printers and databases, requests are sent back so that the proxy server can perform the access locally and return the results. However, this “leftover residue” method may slow down performance because of the extra communication channels it introduces. Data in memory can be shared by multiple threads; after thread migration, such data will be shared across threads on multiple machines. To achieve such distributed sharing, Distributed Shared Memory
(DSM) might be applied, and a distributed lock mechanism needs to be implemented; this is similar to cache coherence or to page/object-based DSM. Resources in secondary storage, such as files, can be accessed by migrated computations through global references or copies. If the resources are supported by global naming, they can be accessed from anywhere; otherwise, copying them to the new destination enables local access and hides remote access after migration. Different resources require different strategies, and SCC should be able to set up the corresponding support.
Migration Policy

More and more researchers have realized the benefit of moving computations to data rather than the traditional way of moving data to computations. SCC needs to analyze data and computation locality and then activate data/computation migration to minimize communication frequency and volume. Tasks and their required data should be placed close together, since local accesses are much faster than remote ones. Such a task-data relationship needs to be identified during the task mapping period and, more importantly, adjusted dynamically. An SCC scheduler should be implemented to detect communication patterns and orchestrate data/computation migration according to a predefined migration policy. The migration policy is set up based on the sizes of the data and the computations; its details need to be explored further.
New Languages

SCC adopts the application-level approach, which exhibits several disadvantages. First, it needs to handle different kinds of thread libraries. Second, the source-to-source translation might not be able to handle all programming styles, i.e., migration safety issues will appear. Finally, SCC has to deal with different kinds of libraries that might handle stacks or heaps directly or indirectly. To support all migration features smoothly, powerful new programming languages might be the future direction. As part of DARPA's High Productivity Computing Systems (HPCS) program, several programming languages are being developed; representative ones include IBM's X10 (Charles, et al., 2005), Sun Microsystems' Fortress (Allen, et al., 2008), and Cray's Chapel (Cray Inc., 2005). They all support parallel and distributed computing over multi-core clusters. However, computation mobility is not a significant feature of these ongoing language efforts. If these languages took it as one of their main goals, programmers could take advantage of all the features together without facing the difficulty of combining multiple modules from different vendors. With the support of corresponding compilers on different platforms, these languages could deploy new portable features easily. With such a new language, SCC could be raised to the language level, and an integrated package of computation mobility and other new features might attract more programmers. So far, however, no new language has been planned to tackle SCC issues, because of the difficulties involved.
CONCLUSION

This chapter points out the benefits of computation mobility for performance improvement, fault tolerance, and throughput increase. To enable computation mobility, the key issue is to construct movable
computation states during application execution. Their granularity and position in the software determine the solution's portability and flexibility. The eventual goal is to support computation mobility in heterogeneous environments and take advantage of multi-core/many-core clusters. State-carrying code (SCC) is a software mechanism to achieve such computation mobility: applications replicate their states at a high level during normal execution, so that whenever the computation needs to be moved between different platforms, its computation state can be constructed easily. Since multithreading is widely used on multi-core architectures, threads are chosen as the basic computing units; therefore, SCC adopts application-level thread migration to achieve computation mobility in heterogeneous systems. The deployment and major features of SCC have been explained, and MigThread and several other existing systems were used to demonstrate the implementation strategies. Some future trends indicate how SCC can encapsulate computation states further and evolve into a lightweight virtual machine, and new high-productivity languages could even raise SCC to the language level. Once the much richer system resources in clusters, such as the growing number of cores, are exploited thoroughly, performance gains and high availability can be expected.
REFERENCES

Abdel-Shafi, H., Speight, E., & Bennet, J. K. (1999). Efficient user-level thread migration and checkpointing on Windows NT clusters. Proceedings of the 3rd USENIX Windows NT Research Symposium, (pp. 1-10).

Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A., et al. (1986). Mach: A new kernel foundation for UNIX development. Proceedings of the Summer USENIX Conference, (pp. 93-112).

Allen, E., Chase, D., Hallett, J., Luchangco, V., Maessen, J., Ryu, S., et al. (2008). The Fortress language specification, version 1.0. Santa Clara, CA: Sun Microsystems.

Antoniu, G., & Bougé, L. (2001). DSM-PM2: A portable implementation platform for multi-threaded DSM consistency protocols. Proceedings of the 6th International Workshop on High-Level Parallel Programming Models and Supportive Environments.

Barak, A., & La'adan, O. (1998). The MOSIX multicomputer operating system for high performance cluster computing. Future Generation Computer Systems, 13(4-5), 361–372. doi:10.1016/S0167-739X(97)00037-X

Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., et al. (2003). Xen and the art of virtualization. Proceedings of the ACM Symposium on Operating Systems Principles.

Bricker, A., Litzkow, M., & Livny, M. (1992). Condor technical summary, version 4.1b. Madison, WI: University of Wisconsin - Madison.

Bronevetsky, G., Marques, D., Schulz, M., Pingali, K., & Stodghill, P. (2004). Application-level checkpointing for shared memory programs. Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems.
Casa, J., Konuru, R., Prouty, R., Walpole, J., & Otto, S. (1994). Adaptive load migration systems for PVM. Proceedings of Supercomputing, (pp. 390-399). Washington, D.C.

Chanchio, K., & Sun, X. H. (2001). Communication state transfer for the mobility of concurrent heterogeneous computing. Proceedings of the 2001 International Conference on Parallel Processing.

Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., et al. (2005). X10: An object-oriented approach to non-uniform cluster computing. Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, (pp. 519-538).

Chase, J. S., Amador, F. F., Lazowska, E. D., Levy, H. M., & Littlefield, R. J. (1996). The Amber system: Parallel programming on a network of multiprocessors. Proceedings of the ACM Symposium on Operating Systems Principles.

Chung, P. E. (1997). Checkpointing in Cosmic: A user-level process migration environment. Proceedings of the Pacific Rim International Symposium on Fault-Tolerant Systems.

IBM Corporation. (1993). IBM LoadLeveler: User's guide.

Cray Inc. (2005). The Chapel language specification, version 0.4.

Cronk, M. H., & Mehrotra, P. (1997). Thread migration in the presence of pointers. Proceedings of the Mini-Track on Multithreaded Systems, 30th Hawaii International Conference on System Sciences.

Dimitrov, B., & Rego, V. (1998). Arachne: A portable threads system supporting migrant threads on heterogeneous network farms. IEEE Transactions on Parallel and Distributed Systems, 9(5), 459–469. doi:10.1109/71.679216

Douglis, F., & Ousterhout, J. K. (1991). Transparent process migration: Design alternatives and the Sprite implementation. Software, Practice & Experience, 21(8), 757–785. doi:10.1002/spe.4380210802

Ferrari, A. J., Chapin, S. J., & Grimshaw, A. S. (1997). Process introspection: A heterogeneous checkpoint/restart mechanism based on automatic code modification (Technical Report CS-97-05). University of Virginia, Charlottesville, VA.

Itzkovitz, A., Schuster, A., & Shalev, L. (1998). Thread migration and its applications in distributed shared memory systems. Journal of Systems and Software, 42(1), 71–87. doi:10.1016/S0164-1212(98)00008-9

Jiang, H., & Chaudhary, V. (2004). Process/thread migration and checkpointing in heterogeneous distributed systems. Proceedings of the 37th Hawaii International Conference on System Sciences, Hawaii, USA.

Jul, E., Levy, H., Hutchinson, N., & Black, A. (1988). Fine-grained mobility in the Emerald system. ACM Transactions on Computer Systems, 6(1), 109–133. doi:10.1145/35037.42182

Kale, L. V., & Krishnan, S. (1998). Charm++: Parallel programming with message-driven objects. In G. V. Wilson & P. Lu (Eds.), Parallel programming using C++ (pp. 175-213). Cambridge, MA: MIT Press.

Lange, D., & Oshima, M. (1998). Mobile agents with Java: The Aglet API. World Wide Web (Bussum), 1(3). doi:10.1023/A:1019267832048
Li, C.-C. J., Stewart, E. M., & Fuchs, W. K. (1994). Compiler-assisted full checkpointing. Software, Practice & Experience, 24, 871–886. doi:10.1002/spe.4380241002

Lindholm, T., & Yellin, F. (1999). The Java virtual machine specification (2nd ed.). New York: Addison Wesley.

Mascarenhas, E., & Rego, V. (1995). Ariadne: Architecture of a portable threads system supporting mobile processes (Tech. Rep. No. CSD-TR 95-017). Dept. of Computer Sciences, Purdue University.

Milton, S. (1998). Thread migration in distributed memory multicomputers (Tech. Rep. No. TR-CS-98-01). Dept. of Computer Science & Computer Sciences Lab, Australian National University, Acton, Australia.

Plank, J. S., Beck, M., Kingsley, G., & Li, K. (1995). Libckpt: Transparent checkpointing under Unix. USENIX Winter Technical Conference, (pp. 213-223).

Ramkumar, B., & Strumpen, V. (1997). Portable checkpointing for heterogeneous architectures. Symposium on Fault-Tolerant Computing, (pp. 58-67).

Rashid, R. F., & Robertson, G. (1981). Accent: A communication oriented network operating system kernel. Proceedings of the Eighth ACM Symposium on Operating Systems Principles, (pp. 64-75).

Smith, P., & Hutchinson, N. C. (1998). Heterogeneous process migration: The Tui system. Software, Practice & Experience, 28(6), 611–639. doi:10.1002/(SICI)1097-024X(199805)28:6<611::AID-SPE169>3.0.CO;2-F

Srinivasan, R. (1995). XDR: External data representation standard (RFC 1832).

Ssu, K., Yao, B., & Fuchs, W. K. (1999). An adaptive checkpointing protocol to bound recovery time with message logging. Symposium on Reliable Distributed Systems, (pp. 244-252).

Stellner, G. (1996). CoCheck: Checkpointing and process migration for MPI. Proceedings of the 10th International Parallel Processing Symposium.

Theimer, M. M., Lantz, K. A., & Cheriton, D. R. (1985). Preemptable remote execution facilities for the V-System. SIGOPS Operating Systems Review, 19(5), 2–12. doi:10.1145/323627.323629

VMware Inc. (1999). VMware virtual platform.

Walker, B., Popek, G., English, R., Kline, C., & Thiel, G. (1992). The LOCUS distributed operating system. Distributed Computing Systems: Concepts and Structures, 17(5).

White, J. E. (1996). Telescript technology: Mobile agents. Journal of Software Agents.

Young, J. W. (1974). A first order approximation to the optimum checkpoint interval. Communications of the ACM, 17, 530–531. doi:10.1145/361147.361115

Zajcew, R., Roy, P., Black, D., & Peak, C. (1993). An OSF/1 UNIX for massively parallel multicomputers. Proceedings of the Winter 1993 Conference, (pp. 449-468).
Zhou, S., Zheng, X., Wang, J., & Delisle, P. (1993). Utopia: A load sharing facility for large, heterogeneous distributed computer systems. Software, Practice & Experience, 23(12), 1305–1336. doi:10.1002/spe.4380231203

Zhu, W., Wang, C.-L., & Lau, F. C. M. (2002). JESSICA2: A distributed Java virtual machine with transparent thread migration support. Proceedings of the IEEE International Conference on Cluster Computing.
KEY TERMS AND DEFINITIONS

Computation Mobility: The ability to move a running program from one machine to another.

Computation States: The information required to indicate execution progress, including register contents, stacks, heaps, etc.

Data Conversion: The function of translating data from one format to another.

Migration Safety: The features a program must have to enable its mobility.

State-Carrying Code: Transformed programs that can acquire their running state in order to stop and restart execution.

Thread/Process Migration: The ability to move a running thread/process from one machine to another.

Virtualization: The abstraction of system resources on which computations can be executed, for portability.
Compilation of References
3GPP TS 23.234 V7.5.0 (2007). 3GPP system to WLAN interworking, 3GPP Specification. Retrieved May 1, 2008, from http://www.3gpp.org, 2007. A Blueprint for the Open Science Grids. (2004, December). Snapshot v0.9. Abawajy, J. (2004). Placement of file replicas in data grid environments. In Proceedings of international conference on computational science (Vol. 3038, pp. 66-73). Abdel-Shafi, H., Speight, E., & Bennet, J. K. (1999). Efficient user-level thread migration and checkpointing on Windows NT clusters. Proceedings of the 3rd USENIX Windows NT Research Symposium, (pp. 1-10). Abdennadher, N., & Boesch, R. (2005). Towards a peerto-peer platform for high performance computing. In HPCASIA’05 Proceedings of the Eighth International Conference in High-Performance Computing in AsiaPacific Region, (pp. 354-361). Los Alamitos, CA: IEEE Computer Society. Retrieved from http://doi.ieeecomputersociety.org/10.1109/HPCASIA.2005.98
Abdennadher, N., & Boesch, R. (2006, August). A scheduling algorithm for high performance peer-to-peer platform. In W. Lehner, N. Meyer, A. Streit, & C. Stewart (Eds.), Coregrid Workshop, Euro-Par 2006 (pp. 126-137). Dresden, Germany: Springer. Aberer, K., Cudré-Mauroux, P., Datta, A., Despotovic, Z., Hauswirth, M., & Punceva, M. (2003). P-Grid: A self-organizing structured p2p system. SIGMOD Record, 32(3), 29–33. doi:10.1145/945721.945729 Abramson, D., Buyya, R., & Giddy, J. (2002). A computational economy for grid computing and its implementation in the Nimrod-G resource broker. Future Generation Computer Systems, 18(8), 1061–1074. doi:10.1016/S0167-739X(02)00085-7 Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A., et al. (1986). Mach: A new kernel foundation for UNIX development. Proceedings of the Summer USENIX Conference, (pp. 93-112). ACR-NEMA. (2005). DICOM (Digital Imaging and Communications in Medicine). Retrieved June 15, 2008, from http://medical.nema.org/
Adabala, S., Chadha, V., Chawla, P., Figueiredo, R., Fortes, J., & Krsul, I. (2005, June). From virtualized resources to virtual computing Grids: the In-VIGO system. Future Generation Computer Systems, 21(6), 896–909. doi:10.1016/j.future.2003.12.021 Adamy, U., & Erlebach, T. (2004). Online coloring of intervals with bandwidth (LNCS Vol. 2909, pp. 1–12). Berlin: Springer. Adiga, N. R., et al. (2002). An overview of the BlueGene/L supercomputer. In Proceedings of the Supercomputing Conference (SC’2002), Baltimore MD, USA, (pp. 1–22). Adjie-Winoto, W., Schwartz, E., Blakrishnan, H., & Lilley, J. (1999). The design and implementation of an intentional naming system. Operating Systems Review, 34(5), 186–201. doi:10.1145/319344.319164 Adler, M., Halperin, E., Karp, R. M., & Vazirani, V. (2003, June). A stochastic process on the hypercube with applications to peer-to-peer networks. In Proc. of STOC. Adve, S. V., & Hill, M. D. (1993). A Unified Formalization of Four Shared-Memory Models. IEEE Transactions on Parallel and Distributed Systems, 4(6), 613–624. doi:10.1109/71.242161 Afgan, E., Velusamy, V., & Bangalore, P. (2005). Grid resource broker using application benchmarking. European Grid Conference, (LNCS 3470, pp. 691-701). Amsterdam: Springer Verlag. Agarwal, A., Bianchini, R., Chaiken, D., Johnson, K. L., Kranz, D., Kubiatowicz, J., et al. (1995). The MIT Alewife machine: architecture and performance. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA’95), S. Margherita Ligure, Italy, (pp. 2-13). New York: ACM Press. Agarwal, A., Lim, B.-H., Kranz, D., & Kubiatowicz, J. (1990). April: a processor architecture for multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA’90), (pp. 104-114), Seattle, WA: ACM Press.
Agarwal, S., Chuah. C. N., & Katz, R. H. (2003). OPCA: Robust Inter-domain Policy Routing and Traffic Control, OPENARCH. Agrawal, D. P., & Zeng, Q.-A. (2006). Introduction to wireless and mobile systems (2nd Ed.). Florence, KY: Thomson. Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6), 333–340. doi:10.1145/360825.360855 Akenine-Moller, T., & Haines, E., (2002, July). Realtime rendering (2nd Ed.). Wellesley, MA: A. K. Peters Publishing Company. Akyildiz, I., Mohanty, S., & Xie, J. (2005). A ubiquitous mobile communication architecture for nextgeneration heterogeneous wireless systems. IEEE Radio Communications, 43(6), 29–36. doi:10.1109/ MCOM.2005.1452832 Alam, S. R., Meredith, J. S., & Vetter, J. S. (2007, Sept.) Balancing productivity and performance on the cell broandband engine. IEEE Annual International Conference on Cluster Computing. Aldinucci, M., Danelutto, M., & Teti, P. (2003). An advanced environment supporting structured parallel programming in Java. Future Generation Computer Systems, 19(5), 611–626. doi:10.1016/S0167-739X(02)00172-3 Aldwairi, M., Conte, T., & Franzon, P., (2005) Configurable string matching hardware for speeding up intrusion detection. ACM SIGARCH Computer Architecture News, 33(1). Alexander, D. (1995). Recursively Modular Artificial Neural Network. Doctoral Thesis, Macquire University, Australia, Sydney, Australia. Alexandrov, A. D., Ibel, M., Schauser, K. E., & Scheiman, C. (1997, April). SuperWeb: Towards a global web-based parallel computing infrastructure. In Proceedings of the 11th IEEE International Parallel Processing Symposium (IPPS).
Alexandrov, A. Ionescu, M. F. Schauser, K. E. & Scheiman, C. (1995). LogGP: Incorporating long messages into the LogP model. In Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures, (pp. 95–105). New York: ACM Press. ALF for Cell BE Programmer’s Guide and API Reference. Retrieved from http://www01.ibm.com/chips/ techlib/techlib.nsf/techdocs/41838EDB5A15CCCD002 573530063D465 ALF for Hybrid-x86 Programmer’s Guide and API Reference. Retrieved from http://www01.ibm.com/chips/ techlib/techlib.nsf/techdocs/389BBE99638335B80025 735300624044
Ahlswede, R., Cai, N., Li, S.-Y. R., & Yeung, R. W. (2000). Network information flow. IEEE Transactions on Information Theory, 46(4), 1204–1216. Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludascher, B., & Mock, S. (2004). Kepler: An extensible system for design and execution of scientific workflows. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM), Santorini Island, Greece. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410.
Ali, A., McClatchey, R., Anjum, A., Habib, I., Soomro, K., Asif, M., et al. (2006). From grid middleware to a grid operating system. In Proceedings of the Fifth International Conference on Grid and Cooperative Computing, (pp. 9-16). China: IEEE Computer Society.
Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., & Smith, B. (1990) The Tera computer system. In Proceedings of the 4th International Conference on Supercomputing (ICS’90), (pp. 1-6). Amsterdam: ACM Press.
Alima, L. O., El-Ansary, S., Brand, P., & Haridi, S. (2003). DKS (N, k, f): A Family of Low Communication, Scalable and Fault-tolerant Infrastructures for P2P Applications. In Proceedings of the 3rd IEEE Intl. Symp. on Cluster Computing and the Grid (pp. 344-350). New York: IEEE Computer Society Press.
Amarasinghe, S. (2007). Multicore programming primer and programming competition. A course at MIT, Cambridge, MA.
Allcock, W. (2003, Mar). GridFTP protocol specification. Global Grid Forum Recommendation GFD.20.
Amazon Elastic Compute Cloud. (2008, November). Retrieved from http://www.amazon.com/ec2
Allen, E., Cadthase, D., Hallett, J., Luchangco, V., Maessen, J., Ryu, S., et al. (2008). The fortress language specification, version 1.0. Santa Clara, CA: Sun Microsystems. Allen, E., Chase, D., Hallett, J., Luchangco, V., Maessen, J.-W., Ryo, S., et al. (2008). The Fortress language specification, Version 1.0. Santa Clara, CA: Sun Microsystems, Inc. Allen, G., Davis, K., Dolkas, K. N., Doulamis, N. D., Goodale, T., Kielmann, T., et al. (2003). Enabling applications on the grid: A Gridlab overview. International Journal of High Performance Computing Applications: Special issue on Grid Computing: Infrastructure and Applications.
Amazon Elastic Compute Cloud (2007). Retrieved from www.amazon.com/ec2
Ambastha, N., Beak, I., Gokhale, S., & Mohr, A. (2003). A cache-based resource location approach for unstructured P2P network architectures. Graduate Research Conference, Department of Computer Science, Stony Brook University, NY. Ammann, P., Jajodia, S., & Ray, I. (1997). Applying formal methods to semantic-based decomposition of transactions. [TODS]. ACM Transactions on Database Systems, 22(2), 215–254. doi:10.1145/249978.249981 Ancilotti, P., Lazzerini, B., & Prete, C. A. (1990). A distributed commit protocol for a multicomputer system. IEEE Transactions on Computers, 39(5), 718–724. doi:10.1109/12.53589
Anderson, D. P. (2004). BOINC: A system for publicresource computing and storage. In Grid’04 Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing, (pp. 4-10). Los Alamitos, CA: IEEE Computer Society. Retrieved from http://dx.doi.org/10.1109/ GRID.2004.14 Anderson, D. P., Cobb, J., Korpela, E., Lebofsky, M., & Werthimer, D. (2002). SETI@home: An experiment in public-resource computing. Communications of the ACM, 45(11), 56-61. New York: ACM Press. Retrieved from http://doi.acm.org/10.1145/581571.581573 Anderson, D., & Fedak, G. (2006). The computational and storage potential of volunteer computing. In Proceedings of The IEEE International Symposium on Cluster Computing and The Grid (CCGRID’06). Andrade, N., Brasileiro, F., Cirne, W., & Mowbray, M. (2007). Automatic Grid assembly by promoting collaboration in peer-to-peer Grids. Journal of Parallel and Distributed Computing, 67(8), 957–966. doi:10.1016/j. jpdc.2007.04.011 Andrade, N., Cirne, W., Brasileiro, F., & Roisenberg, R. (2003, October). OurGrid: An approach to easily assemble grids with equitable resource sharing. In JSSPP’03 Proceedings of the 9th Workshop on Job Scheduling Strategies for Parallel Processing (LNCS). Berlin/Heidelberg, Germany: Springer. doi: 10.1007/10968987 Androutsellis-Theotokis, S., & Spinellis, D. (2004). A survey of peer-to-peer content distribution technologies. ACM Computing Surveys, 36(4), 335–371. doi:10.1145/1041680.1041681 Andrzejak, A., Domingues, P., & Silva, L. (2006). Predicting Machine Availabilities in Desktop Pools. In IEEE/ IFIP Network Operations and Management Symposium (pp. 225–234). Andrzejak, A., Kondo, D., & Anderson, D. P. (2008). Ensuring collective availability in volatile resource pools via forecasting. In 19th Ifip/Ieee Distributed Systems: Operations And Management (DSOM 2008). Samos Island, Greece.
Anfinson, J., & Luk, F. T. (1988, December). A Linear Algebraic Model of Algorithm-Based Fault Tolerance. IEEE Transactions on Computers, 37(12), 1599–1604. doi:10.1109/12.9736 Antoniu, G., & Boung, L. (2001). Dsm-pm2: A portable implementation platform for multi-threaded dsm consistency protocols. Proceedings of the 6th international workshop on high-level parallel programming models and supportive environments. Antoniu, G., Bougé, L., Hatcher, P., MacBeth, M., McGuigan, K., & Namyst, R. (2001). The Hyperion system: Compiling multithreaded Java bytecode for distributed execution. Parallel Computing, 27(10), 1279–1297. doi:10.1016/S0167-8191(01)00093-X Araujo, F., Domingues, P., Kondo, D., & Silva, L. M. (2008, April). Using cliques of nodes to store desktop grid checkpoints. In Coregrid Integration Workshop, Crete, Greece. Aridor, Y., Factor, M., & Teperman, A. (1999). cJVM: A Single System Image of a JVM on a Cluster. Paper presented at the Proceedings of the 1999 International Conference on Parallel Processing. Aridor, Y., Factor, M., Teperman, A., Eilam, T., & Schuster, A. (2000). Transparently obtaining scalability for Java applications on a cluster. Journal of Parallel and Distributed Computing, 60(10), 1159–1193. doi:10.1006/ jpdc.2000.1649 ARM. (2008). ARM Achieves 10 Billion Processor Milestone. Retrieved March 10, 2008, from http://www.arm. com/news/19720.html Arpaci, R. H., Dusseau, A. C., Vahdat, A. M., Liu, L. T., Anderson, T. E., & Patterson, D. A. (May, 1995). The interaction of parallel and sequential workloads on a network of workstations. Paper presented at the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. Arpaci-Dusseau, R. H., Arpaci-Dusseau, A. C., Vahdat, A., Liu, L. T., Anderson, T. E., & Patterson, D. A. (1995). The interaction of parallel and sequential workloads on a network of workstations. SIGMETRICS, (pp. 267-278).
Asanovic, K., Bodik, R., Catanzaro, B. C., Gebis, J. J., Husbands, R., Keutzer, K., et al. (2006, Dec). The Landscape of Parallel Computing Research: A View from Berkeley (Tech. Rep. No. UCB/EECS-2006-183). EECS Department, University of California, Berkeley. ASF. (2002). The Apache Tomcat Connector. Retrieved June 18, 2008, from http://tomcat.apache.org/connectorsdoc/ Audsley, N. C., Burns, A., Richardson, M. F., Tindall, K., & Wellings, A. (1993). Applying New Scheduling Theory to Static Priority Pre-emptive Scheduling. Software Engineering Journal, 8(5). Australian Partnership for Advanced Computing (APAC) Grid. (2005). Retrieved from http://www.apac.edu.au/ programs/GRID/index.html. Autenrieth., F., Isralewitz, B., Luthey-Schulten., Z., Sethi, A. & Pogorelov, T. Bioinformatics and Sequence Alignment. Awduche, D.O., Chiu, A., Elqalid, A., Widjaja, I., & Xiao, X. (2002). A Framework for Internet Traffic Engineering [draft 2]. Retrieved from IETF draft database. Azar, Y. Broder, A., et al. (1994). Balanced allocations. In Proc. of STOC (pp. 593–602). Baboescu, F., & Varghese, G. (2005). Scalable packet classification. IEEE/ACM Trans. Netw., 13(1), 2–14. Bader, D. A., & Agarwal, V. (2007, Dec). FFTC: Fastest fourier transform on the ibm cell broadband engine. In 14th IEEE international conference on high performance computing (hipc 2007) Goa, India, (pp. 18–21). Badia, R. M., Labarta, J. S., Sirvent, R. L., Perez, J. M., Cela, J. M., & Grima, R. (2003). Programming grid applications with GRID superscalar. Journal of Grid Computing, 1, 151–170. doi:10.1023/B:GRID.0000024072.93701. f3 Bailey, D., Barszcz, E., Barton, J. T., Browning, D. S., Carter, R. L., & Dagum, L. (1991). The NAS parallel benchmarks. The International Journal of Supercomputer Applications, 5(3), 63–73.
Baker, M., Buyya, R., & Laforenza, D. (2002). Grids and grid technologies for wide-area distributed computing. Software [SPE]. Practice and Experience, 32, 1437–1466. doi:10.1002/spe.488 Baker, S. (2007). Google and the wisdom of clouds. Business Week, Dec. 13. Retrieved from www.businessweek. com/magazine/content/07_52/b4064048925836.htm Bal, H. E., & Haines, M. (1998). Approaches for integrating task and data parallelism. IEEE Concurrency, 6(3), 74–84. doi:10.1109/4434.708258 Balasubramanian, V., & Banerjee, P. (1990). CompilerAssisted Synthesis of Algorithm-Based Checking in Multiprocessors. IEEE Transactions on Computers, C-39, 436–446. doi:10.1109/12.54837 Balaton, Z., Gombas, G., Kacsuk, P., Kornafeld, A., Kovacs, J., & Marosi, A. C. (2007, March 26-30). Sztaki desktop grid: a modular and scalable way of building large computing grids. In Proceedings of the 21st International Parallel And Distributed Processing Symposium, Long Beach, CA. Balazinska, M., Balakrishnan, H., & Stonebraker, M. (2004, March). Contract-based load management in federated distributed systems. In 1st Symposium on Networked Systems Design and Implementation (NSDI) (pp. 197-210). San Francisco: USENIX Association. Balazinska, M., Blakrishnan, H., & Karger, D. (2002). INS/Twine: a scalable peer-to-peer architecture for intentional resource discovery. In Pervasive 2002, Zurich, Switzerland, August. Berlin: Springer Verlag. Baldassari, J., Finkel, D., & Toth, D. (2006, November 13-15). Slinc: A framework for volunteer computing. In Proceedings of the 18th Iasted International Conference On Parallel And Distributed Computing And Systems (PDCS 2006). Dallas, TX. Baldridge, K., Biros, G., Chaturvedi, A., Douglas, C. C., Parashar, M., How, J., et al. (2006, January). National Science Foundation DDDAS Workshop Report. Retrieved from http://www.dddas.org/nsf-workshop-2006/wkshp report.pdf.
Ban, B. (1997). JGroups - A Toolkit for Reliable Multicast Communication. Retrieved June 18, 2008, from http:// www.jgroups.org/javagroupsnew/docs/index.html Banerjee, P., Rahmeh, J. T., Stunkel, C. B., Nair, V. S. S., Roy, K., & Balasubramanian, V. (1990). Algorithmbased fault tolerance on a hypercube multiprocessor. IEEE Transactions on Computers, C-39, 1132–1145. doi:10.1109/12.57055 Bangerth, W., Matossian, V., Parashar, M., Klie, H., & Wheeler, M. (2005). An autonomic reservoir framework for the stochastic optimization of well placement. Cluster Computing, 8(4), 255–269. doi:10.1007/s10586005-4093-3 Banks, T. (2006). Web services resource framework (WSRF). Organization for the Advancement of Structured Information Standards (OASIS). Barak, A., & La’adan, O. (1998). The mosix multicomputer operating system for high per-formance cluster computing. Journal of Future Generation Computer System, 13(4-5), 361–372. doi:10.1016/S0167-739X(97)00037-X Barak, A., Guday, S., & R., W. (1993). The MOSIX Distributed Operating System, Load Balancing for UNIX (Vol. 672). Berlin: Springer-Verlag. Baratloo, A., Karaul, M., Kedem, Z., & Wyckoff, P. (1996). Charlotte: Metacomputing on the Web. In Proceeidngs of the 9th International Conference On Parallel And Distributed Computing Systems (PDCS-96). Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., et al. (2003). Xen and the art of virtualization. In 19th ACM Symposium on Operating Systems Principles (SOSP ’03) (pp. 164–177). New York: ACM Press. Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., et al. (1994). Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition., Philadelphia, PA: SIAM. Barua, S., Thulasiram, R. K., & Thulasiraman, P. (2005, Aug.). High performance computing for a financial application using fast Fourier transform. In Euro-par parallel processing (p. 1246-1253). Lisbon, Portugal.
Bassi, A., Beck, M., Fagg, G., Moore, T., Plank, J. S., & Swany, M. (2002). The Internet BackPlane Protocol: A Study in Resource Sharing. In Second ieee/acm international symposium on cluster computing and the grid, Berlin, Germany. Bataineh, S., Hsiung, T.-Y., & Robertazzi, T. (1994). Closed form solutions for bus and tree networks of processors load sharing a divisible job. Institute of Electrical and Electronic Engineers, 43(10), 1184–119. Baude, F., Caromel, D., Huet, F., & Vayssiere, J. (2000, May). Communicating mobile active objects in java. In R. W. Marian Bubak Hamideh Afsarmanesh & B. Hetrzberger (Eds.), Proceedings of HPCN Europe 2000 (Vol. 1823, p. 633-643). Berlin: Springer. Retrieved from http://www-sop.inria.fr/oasis/Julien.Vayssiere/publications/18230633.pdf BBC. (2005). SMEF- Standard Media Exchange Framework. Retrieved June 15th, 2008, from http://www.bbc. co.uk/guidelines/smef/.15th June, 2008. BEinGRID. (2008). Business experiments in grids. Retrieved from www.beingrid.com Belalem, G., & Slimani, Y. (2006). A hybrid approach for consistency management in large scale systems. In Proceedings of the international conference on networking and services (pp. 71–76). Belalem, G., & Slimani, Y. (2007). Consistency management for data grid in optorsim simulator. In Proceedings of the international conference on multimedia and ubiquitous engineering (pp. 554–560). Bell, W. H., Cameron, D. G., Carvajal-Schiaffino, R., Millar, A. P., Stockinger, K., & Zini, F. (2003). Evaluation of an economy-based file replication strategy for a data grid. In Proceedings of the 3rdIEEE/ACM international symposium on cluster computing and the grid. Bell, W., Cameron, D., Capozza, L., Millar, P., Stockinger, K., & Zini, F. (2003). Optorsim - A grid simulator for studying dynamic data replication strategies. International Journal of High Performance Computing Applications, 17, 403–416. doi:10.1177/10943420030174005
Bellavista, P., & Corradi, A. (2007). The Handbook of Mobile Middleware. New York: Auerbach publications. Beltrame, F., Maggi, P., Melato, M., Molinari, E., Sisto, R., & Torterolo, L. (2006, February 2-3). SRB Data grid and compute grid integration via the enginframe grid portal. In Proceedings of the 1st SRB Workshop, San Diego, CA. Retrieved from www.sdsc.edu/srb/Workshop/ SRB-handout-v2.pdf Ben Hassen, S., Bal, H. E., & Jacobs, C. J. H. (1998). A task- and data-parallel programming language based on shared objects. [TOPLAS]. ACM Transactions on Programming Languages and Systems, 20(6), 1131–1170. doi:10.1145/295656.295658 Bender, T. (1982). Community and Social Change in America. Baltimore, MD: The Johns Hopkins University Press. Benjelloun, O., Sarma, A. D., Halevy, A. Y., Theobald, M., & Widom, J. (2008). Databases with uncertainty and lineage. The VLDB Journal, 17(2), 243–264. doi:10.1007/ s00778-007-0080-z Benkert, K. Gabriel, E. & Resch, M. M. (2008). Outlier Detection in Performance Data of Parallel Applications. In the 9th IEEE International Workshop on Parallel Distributed Scientific and Engineering Computing (PDESC), Miami, Florida, USA. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A., & Wheller, D. L. (2000, October). Genbank. Nucleic Acids Research, 28(1), 15–18. doi:10.1093/ nar/28.1.15 Berezin, Y. A., & Vshivkov, V. A. (1980). The method of particles in rarefied plasma dynamic. Novosibirsk, Russia: Nauka (Science). Berger, M. P., & Munson, P. J. (1991). A novel randomized iteration strategy for aligning multiple protein sequences. Computer Applications in the Biosciences, 7, 479–484.
Berman, F., Casanova, H., Chien, A. A., Cooper, K. D., Dail, H., & Dasgupta, A. (2005). New Grid scheduling and rescheduling methods in the GrADS project. International Journal of Parallel Programming, 33(2-3), 209–229. doi:10.1007/s10766-005-3584-4 Berman, F., Chien, A., Cooper, K., Dongarra, J., Foster, I., & Gannon, D. (2001). The grads project: Software support for high-level grid application development. International Journal of High Performance Computing Applications, 15(4), 327–344. doi:10.1177/109434200101500401 Berman, F., Fox, G., & Hey, T. (Eds.). (2003). Grid computing making the global infrastructure a reality. New York: Wiley Series in Communication Networking & Distributed Systems. Berman, F., Wolski, R., Casanova, H., Cirne, W., Dail, H., & Faerman, M. (2003). Adaptive computing on the Grid using AppLeS. IEEE Transactions on Parallel and Distributed Systems, 14(4), 369–382. doi:10.1109/ TPDS.2003.1195409 Berman, F., Wolski, R., Figueira, S., Schopf, J., & Shao, G. (1996). Application-Level Scheduling on Distributed Heterogeneous Networks. In Proc. of supercomputing’96, Pittsburgh, PA. Berriman, G. B., Good, J. C., & Laity, A. C. (2003). Montage: A grid enabled image mosaic service for the national virtual observatory. In F. Ochsenbein (Ed.), Astronomical Data Analysis Software and Systems XIII, (pp. 145-167). Livermore, CA: ASP press. Bertossi, A. A., Pinotti, M. C., Rizzi, R., & Gupta, P. (2004). Allocating servers in infostations for bounded simultaneous requests. Journal of Parallel and Distributed Computing, 64, 1113–1126. doi:10.1016/S07437315(03)00118-7 Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, NJ: Prentice Hall. Bharadwaj, V., Ghose, D., & Mani, V. (1995, April). Multi-installment load distribution in tree network with delays. Institute of Electrical and Electronic Engineers, 31(2), 555–567.
Bharadwaj, V., Ghose, D., & Robertazzi, T. G. (2003, January). Divisible load theory: A new paradigm for load scheduling in distributed systems. Cluster Computing, 6(1), 7–17. doi:10.1023/A:1020958815308 Bharadwaj, V., Li, X., & Ko, C. C. (2000). Efficient partitioning and scheduling of computer vision and image processing data on bus networks using divisible load analysis. Image and Vision Computing, 18, 919–938. doi:10.1016/S0262-8856(99)00085-2 Bhatt, S. N., Chung, F. R. K., Leighton, F. T., & Rosenberg, A. L. (1997). An optimal strategies for cycle-stealing in networks of workstations. IEEE Transactions on Computers, 46(5), 545–557. doi:10.1109/12.589220 Bhowmick, S. Eijkhout, V. Freund, Y. Fuentes, E. & Keyes, D. (in press). Application of Machine Learning in Selecting Sparse Linear Solver. Submitted for publication to the International Journal on High Performance Computing Applications. Bienkowski, M., Korzeniowski, M., & auf der Heide, F. M. (2005). Dynamic load balancing in distributed hash tables. In Proc. of IPTPS. BIRN. (2008). Biomedical informatics research network. Retrieved from www.nbirn.net/index.shtm Bishop, P., & Warren, N. (2002). JavaSpaces in Practice. New York: Addison Wesley.
Bloom, H. B. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422–426. doi:10.1145/362686.362692 Bluetooth (2008). Retrieved November 2008 from www. bluetooth.com Boghosian, B., Coveney, P., Dong, S., Finn, L., Jha, S., Karniadakis, G. E., et al. (2006, June). Nektar, SPICE and vortonics: Using federated Grids for large scale scientific applications. In IEEE Workshop on Challenges of Large Applications in Distributed Environments (CLADE). Paris: IEEE Computing Society. BOINC. (2008). Berkeley Open Infrastructure for Network Computing. Retrieved March 10, 2008 from http:// boinc.berkeley.edu BOINCstats. (2008). Seti@home Project Statistics. Retrieved March 10, 2008, from http://boincstats.com/ stats/project_graph.php?pr=sah Bojadziew, G., & Bojadziew, M. (1997). Fuzzy Logic for Business, Finance, and Management Modeling, (2nd Ed.). Singapore: World Scientific Press. Boley, D. L., Brent, R. P., Golub, G. H., & Luk, F. T. (1992). Algorithmic fault tolerance using the lanczos method. SIAM Journal on Matrix Analysis and Applications, 13, 312–332. doi:10.1137/0613023
Black, F., & Scholes, M. (1973). The Pricing of Options and Corporate Liabilities. The Journal of Political Economy, 81(3). doi:10.1086/260062
Bolosky, W., Douceur, J., Ely, D., & Theimer, M. (2000). Feasibility of a Serverless Distributed file System Deployed on an Existing Set of Desktop PCs. In Proceedings of sigmetrics.
Blackford, L. S., Choi, J., Cleary, A., Petitet, A., & Whaley, R. C. Demmel, et al. (1996). ScaLAPACK: a portable linear algebra library for distributed memory computers - design issues and performance. In Supercomputing ’96: Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM), (p. 5).
Bonomi, F., Mitzenmacher, M., Panigrah, R., Singh, S., & Varghese, G. (2006). Beyond bloom filters: from approximate membership checks to approximate state machines. Paper presented at the Proceedings of the 2006 conference on Applications, technologies, architectures, and protocols for computer communications.
Blair, G. S., Coulson, G., & Blair, L. DuranLimon, H., Grace, P., Moreira, R., & Parlavantzas, N. (2002). Reflection, self-awareness and self-healing in OpenORB. In WOSS ‘02 Proceedings of the First Workshop on Self-Healing Systems, (pp. 9-14).
Boyd, C. (2008, March/April). Data-parallel computing. ACM Queue; Tomorrow’s Computing Today, 6(2). doi:10.1145/1365490.1365499
Boyer, R., & Moore, J. (1977). A fast string searching algorithm. Communications of the ACM, 20(10), 762–777. doi:10.1145/359842.359859
Bulhões, P. T., Byun, C., Castrapel, R., & Hassaine, O. (2004, May). N1 Grid Engine 6 Features and Capabilities [White Paper]. Phoenix, AZ: Sun Microsystems.
Boyle, P. P. (1986). Option Valuing Using a Three Jump Process. International Options Journal, 3(2).
Burnett, I. (2006). MPEG-21: Digital Item Adaptation Coding Format Independence, Chichester, UK. Retrieved 15th June, 2008, from http://www.ipsi.fraunhofer.de/delite/projects/mpeg7/Documents/mpeg21-Overview4318. htm#_Toc523031446.
Brandes, T. (1999). Exploiting advanced task parallelism in high performance Fortran via a task library. In EuroPar ‘99: Proceedings of the 5th International Euro-Par Conference on Parallel Processing (pp. 833–844). London: Springer-Verlag. Braun, T., Feyrer, S., Rapf, W., & Reinhardt, M. (2001). Parallel Image Processing. Berlin: Springer-Verlag. Brecht, T., Sandhu, H., Shan, M., & Talbot, J. (1996). Paraweb: towards world-wide supercomputing. In Ew 7: Proceedings of the 7th workshop on acm sigops european workshop (pp. 181–188). New York: ACM. Bricker, A., Litzkow, M., & Livny, M. (1992). Condor Technical Summary, Version 4.1b. Madison, WI: University of Wisconsin - Madison. Brighten Godfrey, P., & Stoica, I. (2005). Heterogeneity and load balance in distributed hash tables. In Proc. of IEEE INFOCOM. Broder, A., & Mitzenmacher, M. (2003). Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4), 485–509. Bronevetsky, G., Marques, D., Schulz, M., Pingali, K., & Stodghill, P. (2004). Application-level checkpointing for shared memory programs. Proceedings of 11th international conference on architectural support for programming languages and operating systems. Brune, M., Gehring, J., Keller, A., & Reinefeld, A. (1999). Managing clusters of geographically distributed high-performance computers. Concurrency (Chichester, England), 11(15), 887–911. doi:10.1002/(SICI)10969128(19991225)11:15<887::AID-CPE459>3.0.CO;2-J Bruneo, D., Scarpa, M., Zaia, A., & Puliafito, A. (2003). Communication paradigms for mobile grid users. In CCGRID 03 (p. 669).
Burns, J., & Gaudiot, J.-L. (2002). SMT layout overhead and scalability. IEEE Transactions on Parallel and Distributed Systems, 13(2), 142–155. doi:10.1109/71.983942 Butt, A. R., Johnson, T. A., Zheng, Y., & Hu, Y. C. (2004). Kosha: A Peer-to-Peer Enhancement for the Network File System. In Proceeding of International Symposium On Supercomputing SC’04. Butt, A. R., Zhang, R., & Hu, Y. C. (2003). A selforganizing flock of condors. In SC ’03 Proceedings of the ACM/IEEE Conference on Supercomputing, (p. 42). Los Alamitos, CA: IEEE Computer Society. Retrieved from http://doi.ieeecomputersociety.org/10.1109/ SC.2003.10031 Buyya, R., Abramson, D., & Giddy, J. (2000). Nimrod/G: An architecture for a resource management and scheduling system in a global computational grid. In Proceedings of the 4th International Conference on High Performance Computing in the Asia-Pacific Region. Retrieved from www.csse.monash.edu.au/~davida/nimrod/nimrodg. htm Buyya, R., Abramson, D., & Giddy, J. (2000, June). An economy driven resource management architecture for global computational power grids. In 7th International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2000). Las Vegas, AZ: CSREA Press. Buyya, R., Abramson, D., & Venugopal, S. (2005). The Grid Economy. IEEE Journal.
Buyya, R., Giddy, J., & Abramson, D. (2000). An evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications. Proceedings of the 2nd Workshop on Active Middleware Services, Pittsburgh, PA.
Cappello, F., Djilali, S., Fedak, G., Herault, T., Magniette, F., & Néri, V. (2004). Computing on large scale distributed systems: Xtremweb architecture, programming models, security, tests and convergence with grid. Future Generation Computer Science (FGCS).
Buyya, R., Yeo, C. S., & Venugopal, S. (2008, September). Market-oriented cloud computing: vision, hype, and reality for delivering it services as computing utilities. In HPCC’08 Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications. Los Alamitos, CA: IEEE CS Press.
Cappello, P., Christiansen, B., Ionescu, M., Neary, M., Schauser, K., & Wu, D. (1997). Javelin: Internet-Based Parallel Computing Using Java. In Proceedings of the sixth acm sigplan symposium on principles and practice of parallel programming.
Byers, J. Considine, J., & Mitzenmacher, M. (2003, Feb.). Simple load balancing for distributed hash tables. In Proc. of IPTPS. Cabrera, F., Copel, G., & Coxetal, B. (2002). Web Services Transaction (WS- Transaction). Retrieved from http:// www.ibm.com/developerworks/library/ws-transpec. Caesar, M., & Rexford, J. (2005, March). BGP routing policies in ISP networks, (Tech. Rep. UCB/CSD-05-1377). U. C. Berkeley, Berkeley, CA. Camiel, N., London, S., Nisan, N., & Regev, O. (1997, April). The PopCorn Project: Distributed computation over the Internet in Java. In Proceedings of the 6th international world wide web conference. Cannataro, M., & Talia, D. (2003). Towards the nextgeneration grid: A pervasive environment for knowledgebased computing. In Proceedings of the International Conference on Information Technology: Computers and Communications (pp.437-441), Italy. Cannon, L. E. (1969). A cellular computer to implement the kalman filter algorithm. Ph.D. thesis, Montana State University, Bozeman, MT. Cao, J., Jarvis, S. A., Saini, S., Kerbyson, D. J., & Nudd, G. R. (2002). ARMS: An agent-based resource management system for grid computing. Science Progress, 10(2), 135–148.
Carlsson C. & Fullér, R. (2003). A Fuzzy Approach to Real Option Valuation. Journal of Fuzzy Sets and Systems, 39. Caromel, D., di Costanzo, A., & Mathieu, C. (2007). Peer-to-peer for computational Grids: Mixing clusters and desktop machines. Parallel Computing, 33(4–5), 275–288. doi:10.1016/j.parco.2007.02.011 Casa, J., Konuru, R., Prouty, R., Walpole, J., & Otto, S. (1994). Adaptive Load Migration Systems for PVM. Proceedings of supercomputing, (pp. 390-399). Washington D.C. Casanova, H., Legrand, A., & Quinson, M. SimGrid: a Generic Framework for Large-Scale Distributed Experimentations. In Proceedings of the 10th ieee international conference on computer modelling and simulation (uksim/ eurosim’08). Casanova, H., Legrand, A., Zagorodnov, D., & Berman, F. (2000, May). Heuristics for Scheduling Parameter Sweep Applications in Grid Environments. In Proceedings of the 9th heterogeneous computing workshop (hcw’00) (pp. 349–363). Casanova, H., Obertelli, G., Berman, F., & Wolski, R. (2000, Nov.). The AppLeS Parameter Sweep Template: User-Level Middleware for the Grid. In Proceedings of supercomputing 2000 (sc’00). Castro, M., Costa, M., & Rowstron, A. (2004). Performance and Dependability of Structured Peer-to-Peer Overlays. In Proceedings of the 2004 Intl. Conf. on Dependable Systems and Networks (pp. 9-18). New York: IEEE Computer Society Press.
Castro, M., Druschel, P., Hu, Y. C., & Rowstron, A. (2002). Topology-aware routing in structured peer-to-peer overlay networks. In Future Directions in Distributed Computing.
Chang, R., & Chang, J. (2006). Adaptable replica consistency service for data grids. In Proceedings of the third international conference on information technology: New generations (ITNG’06) (pp. 646–651).
Catlett, C., Beckman, P., Skow, D., & Foster, I. (2006, May). Creating and operating national-scale cyberinfrastructure services. Cyberinfrastructure Technology Watch Quarterly, 2(2), 2–10.
Chang, R., & Chen, P. (2007). Complete and fragmented replica selection and retrieval in data grids. Future Generation Computer Systems, 23(4), 536–546. doi:10.1016/j. future.2006.09.006
Cazorla, F. J., Ramirez, A., Valero, M., & Fernandez, E. (2004). Dcache Warn: an I-fetch policy to increase SMT efficiency. In Proceedings of the 18th International Parallel & Distributed Processing Symposium (IPDPS’04), (pp. 74-83). Santa Fe, NM: IEEE Computer Society Press.
Chang, R., Chang, J., & Lin, S. (2007). Job scheduling and data replication on data grids. Future Generation Computer Systems, 23(7), 846–860. doi:10.1016/j.future.2007.02.008
Cazorla, F. J., Ramirez, A., Valero, M., & Fernandez, E. (2004). Dynamically controlled resource allocation in SMT processors. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’04), (pp. 171-182). Portland, OR: IEEE Computer Society Press. CDO2. (2008). CDOSheet for pricing and risk analysis. Retrieved from www.cdo2.com
Chang, R., Wang, C., & Chen, P. (2005). Replica selection on co-allocation data grids. In Proceedings of the second international symposium on parallel and distributed processing and applications (Vol. 3358, pp. 584–593).
Chao, C.-H. (2006, April). An Interest-based architecture for peer-to-peer network systems. In Proceedings of the International Conference AINA.
Chaarawi, M., Squyres, J., Gabriel, E., & Feki, S. (2008). A Tool for Optimizing Runtime Parameters of Open MPI. Accepted for publication in EuroPVM/MPI, September 7-10, Dublin, Ireland.
Chapman, B. M., Mehrotra, P., van Rosendale, J., & Zima, H. P. (1994). A software architecture for multidisciplinary applications: integrating task and data parallelism. In CONPAR 94 - VAPP VI: Proceedings of the Third Joint International Conference on Vector and Parallel Processing (pp. 664–676). London: Springer-Verlag.
Chamberlain, B. L., Callahan, D., & Zima, H. P. (2007). Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications, 21(3), 291–312. doi:10.1177/1094342007078442
Chapman, B., Haines, M., Mehrotra, P., Zima, H., & van Rosendale, J. (1997). Opus: A coordination language for multidisciplinary applications. Scientific Programming, 6(4), 345–362.
Chanchio, K., & Sun, X. H. (2001). Communication state transfer for the mobility of concurrent heterogeneous computing. In Proceedings of the 2001 International Conference on Parallel Processing.
Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., et al. (2005). X10: An object-oriented approach to non-uniform cluster computing. In OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN Conference on Object Oriented Programming, Systems, Languages, and Applications (pp. 519–538). New York: ACM.
Chandy, M., Foster, I., Kennedy, K., Koelbel, C., & Tseng, C.-W. (1994). Integrated support for task and data parallelism. The International Journal of Supercomputer Applications, 8(2), 80–98.
Chase, J. S., Amador, F. F., Lazowska, E. D., Levy, H. M., & Littlefield, R. J. (1996). The Amber system: Parallel programming on a network of multiprocessors. In Proceedings of the ACM Symposium on Operating Systems Principles.
Chen, W., Liu, J., & Huang, H. (2004). An adaptive scheme for vertical handoff in wireless overlay networks. IEEE International Conference on Parallel and Distributed Systems (ICPADS) (pp. 541-548). Washington, DC: IEEE.
Chase, J. S., Irwin, D. E., Grit, L. E., Moore, J. D., & Sprenkle, S. E. (2003). Dynamic virtual clusters in a Grid site manager. In 12th IEEE International Symposium on High Performance Distributed Computing (HPDC 2003) (p. 90). Washington, DC: IEEE Computer Society.
Chen, Z., & Dongarra, J. (2005). Condition numbers of Gaussian random matrices. SIAM Journal on Matrix Analysis and Applications, 27(3), 603–620. doi:10.1137/040616413
Chaubal, Ch. (2003). Sun grid engine, enterprise edition—Software configuration guidelines and use cases. Sun Blueprints. Retrieved from www.sun.com/blueprints/0703/817-3179.pdf
Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., & Shenker, S. (2003). Making gnutella-like p2p systems scalable. In Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (pp. 407-418).
Chen, C. M., Lee, S. Y., & Cho, Z. H. (1990). A Parallel Implementation of 3D CT Image Reconstruction on HyperCube Multiprocessor. IEEE Transactions on Nuclear Science, 37(3), 1333–1346. doi:10.1109/23.57385
Chen, C.-H., & Lee, C.-Y. (1999). A cost effective lighting processor for 3D graphics application. Proceedings of International Conference on Image Processing, 2, 792–796.
Chen, D. J., & Huang, T. H. (1992). Reliability analysis of distributed systems based on a fast reliability algorithm. IEEE Transactions on Parallel and Distributed Systems, 3(2), 139–154. doi:10.1109/71.127256
Chen, D. J., Chen, R. S., & Huang, T. H. (1997). A heuristic approach to generating file spanning trees for reliability analysis of distributed computing systems. Computers and Mathematics with Applications, 34(10), 115–131. doi:10.1016/S0898-1221(97)00210-1
Chen, T., Raghavan, R., Dale, J. N., & Iwata, E. (2007, Sept.). Cell Broadband Engine Architecture and its first implementation—A performance view. IBM Journal of Research and Development, 51(5), 559–572.
Cheng, A. H., & Joung, Y. J. (2006). Probabilistic file indexing and searching in unstructured peer-to-peer networks. Computer Networks, 50(1), 106–127. doi:10.1016/j.comnet.2005.12.008
Cheng, K., Xiang, L., Iwaihara, M., Xu, H., & Mohania, M. M. (2005). Time-Decaying Bloom Filters for Data Streams with Skewed Distributions. Paper presented at the Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications.
Chervenak, A. (2002). Giggle: A framework for constructing scalable replica location services. In Proceedings of the IEEE supercomputing (pp. 1–17).
Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2000). The data grid: Towards an architecture for the distributed management and analysis of large scientific data sets. Journal of Network and Computer Applications, 23(3), 187–200. doi:10.1006/jnca.2000.0110
Chien, A., Calder, B., Elbert, S., & Bhatia, K. (2003). Entropia: Architecture and performance of an enterprise desktop grid system. Journal of Parallel and Distributed Computing, 63, 597–610. doi:10.1016/S0743-7315(03)00006-6
Chinese National Grid (CNGrid) Project Web Site. (2007). Retrieved from http://www.cngrid.org/
Chiueh, T., & Deng, P. (1996). Evaluation of checkpoint mechanisms for massively parallel machines. In FTCS, (pp. 370–379).
Choi, S., & Yeung, D. (2006). Learning-based SMT processor resource distribution via hill-climbing. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA'06), (pp. 239-251). Boston: IEEE Computer Society Press.
Chonka, A., Zhou, W., Knapp, K., & Xiang, Y. (2008). Protecting information systems from DDoS attack using multicore methodology. Proceedings of IEEE 8th International Conference on Computer and Information Technology.
Chow, A. C., Fossum, G. C., & Brokenshire, D. A. (2005). A programming example: Large FFT on the Cell Broadband Engine. In GSPx Tech. Conf. Proc. of the Global Signal Processing Expo.
Chrysanthis, P. K., & Ramamritham, K. (1994). Synthesis of extended transaction models using ACTA. ACM Transactions on Database Systems, 19(3), 450–491. doi:10.1145/185827.185843
Chrysanthis, P., & Ramamritham, K. (Eds.). (1992). ACTA: The SAGA continues. Transaction Models for Advanced Database Applications. San Francisco: Morgan Kaufmann.
Chu, D., & Humphrey, M. (2004, November 8). Mobile OGSI.NET: Grid computing on mobile devices. In Grid Computing Workshop (associated with Supercomputing 2004), Pittsburgh, PA.
Chu, E., & George, A. (2000). Inside the FFT black box: Serial and parallel fast Fourier transform algorithms. Boca Raton, FL: CRC Press LLC.
Chu, X., Nadiminti, K., Jin, C., Venugopal, S., & Buyya, R. (2007, December). Aneka: Next-generation enterprise grid platform for e-Science and e-Business applications. In Proceedings of the 3rd IEEE International Conference on e-Science and Grid Computing (e-Science'07), Bangalore, India (pp. 151-159). Los Alamitos, CA: IEEE Computer Society Press. For more information, see http://doi.ieeecomputersociety.org/10.1109/ESCIENCE.2007.12
Chung, P. E. (1997). Checkpointing in Cosmic: A user-level process migration environment. In Proceedings of the Pacific Rim International Symposium on Fault-Tolerant Systems.
Ciancarini, P. (1996). Coordination Models and Languages as Software Integrators. ACM Computing Surveys, 28(2), 300–302. doi:10.1145/234528.234732
Ciarpaglini, S., Folchi, L., Orlando, S., Pelagatti, S., & Perego, R. (2000). Integrating task and data parallelism with taskHPF. In H. R. Arabnia (Ed.), Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 2000. Las Vegas, NV: CSREA Press.
Cirne, W., Brasileiro, F., Andrade, N., Costa, L., Andrade, A., & Novaes, R. (2006, September). Labs of the world, unite!!! Journal of Grid Computing, 4(3), 225–246. doi:10.1007/s10723-006-9040-x
Clarke, B., & Humphrey, M. (2002, April 19). Beyond the "device as portal": Meeting the requirements of wireless and mobile devices in the Legion grid computing system. In 2nd International Workshop on Parallel and Distributed Computing Issues in Wireless Networks and Mobile Computing (associated with IPDPS 2002), Ft. Lauderdale, FL.
CloudCamp. (2008). Retrieved from http://www.cloudcamp.com/
CNGrid GOS Project Web site. (2007). Retrieved from http://vega.ict.ac.cn
Coffman, E. G., Galambos, G., Martello, S., & Vigo, D. (1999). Bin-packing approximation algorithms: Combinatorial analysis. In D. Z. Du & P. M. Pardalos (Eds.), Handbook of Combinatorial Optimization (pp. 151–207). Dordrecht, the Netherlands: Kluwer.
CoG Toolkit (n.d.). Retrieved from http://www.cogkit.org/
Cohen, B. (2002). BitTorrent Protocol 1.0. Retrieved from BitTorrent.org.
Cohen, B. (2003). Incentives build robustness in BitTorrent. In Workshop on Economics of Peer-to-Peer Systems, Berkeley, CA.
Condor Team. (2006). Condor Version 6.4.7 Manual. Retrieved October 18, 2006, from www.cs.wisc.edu/condor/manual/v6.4
Cray Inc. (2005). The Chapel language specification, version 0.4.
Cristianini, N., & Hahn, M. (2006). Introduction to Computational Genomics. Cambridge, UK: Cambridge University Press.
Corbató, F. J., & Vyssotsky, V. A. (1965). Introduction and overview of the Multics system. In Proceedings of the AFIPS Fall Joint Computer Conference (FJCC), 27(1), 185–196.
Cronk, M. H., & Mehrotra, P. (1997). Thread migration in the presence of pointers. In Proceedings of the Mini-Track on Multithreaded Systems, 30th Hawaii International Conference on System Sciences.
Coronato, A., & Pietro, G. D. (2007). MiPeG: A middleware infrastructure for pervasive grids. Future Generation Computer Systems.
Culler, D. E., Singh, J. P., & Gupta, A. (1998) Parallel computer architecture: a hardware/software approach, (1st edition). San Francisco: Morgan Kaufmann.
IBM Corporation. (1993). IBM Load Leveler: User's Guide.
Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K. E., Santos, E., Subramonian, R., & von Eicken, T. (1993). LogP: Towards a realistic model of parallel computation. In Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, (pp. 1–12). New York: ACM Press.
Corradi, A., Leonardi, L., & Zambonelli, F. (1997). Performance comparison of load balancing policies based on a diffusion scheme. In Proc. of Euro-Par'97 (LNCS Vol. 1300). Berlin: Springer.
Costa, F., Silva, L., Fedak, G., & Kelley, I. (2008, in press). Optimizing the Data Distribution Layer of BOINC with BitTorrent. In 2nd Workshop on Desktop Grids and Volunteer Computing Systems (PCGrid 2008), Miami, FL.
Cotroneo, D., Migliaccio, A., & Russo, S. (2007). The Esperanto Broker: a communication platform for nomadic computing systems. Software, Practice & Experience, 37(10), 1017–1046. doi:10.1002/spe.794
Cotton, W., Pielke, R., Walko, R., Liston, G., Tremback, C., & Jiang, H. (2003). RAMS 2001: Current status and future directions. Meteorology and Atmospheric Physics, 82(1-4), 5–29. doi:10.1007/s00703-001-0584-9
Coulson, G., Grace, P., Blair, G., Duce, D., Cooper, C., & Sagar, M. (2005, April). A middleware approach for pervasive grid environments. In UK-UbiNet / UK e-Science Programme Workshop on Ubiquitous Computing and e-Research.
Cox, J. C., Ross, S., & Rubinstein, M. (1979). Option Pricing: A Simplified Approach. Journal of Financial Economics, 3(7).
Czajkowski, K., Fitzgerald, S., Foster, I., & Kesselman, C. (2001). Grid information services for distributed resource sharing. 10th International Symposium on High Performance Distributed Computing (pp. 181-194). San Francisco: IEEE Computer Society Press.
Dabek, F., Kaashoek, M. F., Karger, D., Morris, R., & Stoica, I. (2001). Wide-Area Cooperative Storage with CFS. In Proceedings of the 11th ACM Symp. on Operating Systems Principles (pp. 202-215). New York: ACM Press.
Dabek, F., Zhao, B., Druschel, P., Kubiatowicz, J., & Stoica, I. (2003). Towards a common API for structured peer-to-peer overlays. In IPTPS03 Proceedings of the 2nd International Workshop on Peer-to-Peer Systems, (pp. 33-44). Heidelberg, Germany: SpringerLink. doi:10.1007/b11823
Dai, Y. S., & Levitin, G. (2006). Reliability and performance of tree-structured grid services. IEEE Transactions on Reliability, 55(2), 337–349. doi:10.1109/TR.2006.874940
Dai, Y. S., Pan, Y., & Zou, X. K. (2006). A hierarchical modelling and analysis for grid service reliability. IEEE Transactions on Computers.
Dai, Y. S., Xie, M., & Poh, K. L. (2002). Reliability analysis of grid computing systems. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC2002), (pp. 97-104). New York: IEEE Computer Press.
Dai, Y. S., Xie, M., & Poh, K. L. (2005). Markov renewal models for correlated software failures of multiple types. IEEE Transactions on Reliability, 54(1), 100–106. doi:10.1109/TR.2004.841709
Dai, Y. S., Xie, M., & Poh, K. L. (2006). Availability modeling and cost optimization for the grid resource management system. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 38(1), 170.
Dai, Y. S., Xie, M., Poh, K. L., & Liu, G. Q. (2003). A study of service reliability and availability for distributed systems. Reliability Engineering & System Safety, 79(1), 103–112. doi:10.1016/S0951-8320(02)00200-4
Dai, Y. S., Xie, M., Poh, K. L., & Ng, S. H. (2004). A model for correlated failures in N-version programming. IIE Transactions, 36(12), 1183–1192. doi:10.1080/07408170490507729
Dalal, S., Temel, S., & Little, M. (2003). Coordinating business transactions on the Web. IEEE Internet Computing, 7(1), 30–39. doi:10.1109/MIC.2003.1167337
Dang, N. N., & Lim, S. B. (2007). Combination of replication and scheduling in data grids. International Journal of Computer Science and Network Security, 7(3).
Dang, V. D. (2004). Coalition Formation and Operation in Virtual Organisations. PhD thesis, Faculty of Engineering, Science and Mathematics, School of Electronics and Computer Science, University of Southampton, Southampton, UK.
Das, S. K., Harvey, D. J., & Biswas, R. (2001). Parallel processing of adaptive meshes with load balancing. IEEE Transactions on Parallel and Distributed Systems, 12(12), 1269–1280. doi:10.1109/71.970562
Davies, N., Friday, A., & Storz, O. (2004). Exploring the grid's potential for ubiquitous computing. IEEE Pervasive Computing, 3(2), 74–75. doi:10.1109/MPRV.2004.1316823
Davis, C. (2007). Could Android open door for cellphone Grid computing? Retrieved March 10, 2008, from http://www.google-phone.com/could-android-open-door-for-cellphone-grid-computing-12217.php
de Assunção, M. D., & Buyya, R. (2008, December). Performance analysis of multiple site resource provisioning: Effects of the precision of availability information [Technical Report]. In International Conference on High Performance Computing (HiPC 2008) (Vol. 5374, pp. 157–168). Berlin/Heidelberg: Springer.
de Assunção, M. D., Buyya, R., & Venugopal, S. (2008, June). InterGrid: A case for internetworking islands of Grids. Concurrency and Computation (CCPE), 20(8), 997–1024. doi:10.1002/cpe.1249
De Roure, D., Jennings, N., & Shadbolt, N. (2005, March). The semantic grid: Past, present, and future. Proceedings of the IEEE, 93(3), 669–681. doi:10.1109/JPROC.2004.842781
De Roure, D., & Surridge, M. (2003). Interoperability challenges in Grid for industrial applications. GGF9 Semantic Grid Workshop, Chicago.
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. In OSDI'04: Sixth Symposium on Operating System Design and Implementation, (pp. 137–150). San Francisco, CA.
DECI. (2008). DEISA extreme computing initiative. Retrieved from www.deisa.eu/science/deci
Deelman, E., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Patil, S., et al. (2004). Pegasus: Mapping scientific workflows onto the grid. In M. Dikaiakos (Ed.), AxGrids 2004, (LNCS 3165, pp. 11-20). Berlin: Springer Verlag.
Diaz, M., Rubio, B., Soler, E., & Troya, J. M. (2002). A border-based coordination language for integrating task and data parallelism. Journal of Parallel and Distributed Computing, 62(4), 715–740. doi:10.1006/jpdc.2001.1814
DEISA. (2008). Distributed European infrastructure for supercomputing applications. Retrieved from www.deisa.eu
Diaz, M., Rubio, B., Soler, E., & Troya, J. M. (2003). Domain interaction patterns to coordinate HPF tasks. Parallel Computing, 29(7), 925–951. doi:10.1016/S0167-8191(03)00064-4
Demers, A., Keshav, S., & Shenker, S. (1989). Analysis and Simulation of a Fair Queuing Algorithm. Proceedings of ACM SIGCOMM.
Deng, Z., Liu, J. W. S., Zhang, L., Mouna, S., & Frei, A. (1999). An Open Environment for Real-Time Applications. Real-Time Systems Journal, 16(2/3).
DESHL. (2008). DEISA services for heterogeneous management layer. Retrieved from http://forge.nesc.ac.uk/projects/deisa-jra7/
Desprez, F., & Vernois, A. (2006). Simultaneous scheduling of replication and computation for data-intensive applications on the grid. Journal of Grid Computing, 4(1), 66–74. doi:10.1007/s10723-005-9016-2
D-Grid (2008). Retrieved from www.d-grid.de/index.php?id=1&L=1
Diaz, M., Rubio, B., Soler, E., & Troya, J. M. (2004). SBASCO: Skeleton-based scientific components. In Proceedings of the 12th Euromicro Workshop on Parallel, Distributed and Network-Based Processing (PDP 2004) (pp. 318–325). Washington, DC: IEEE Computer Society. Dikaiakos, M. D. (2007). Grid benchmarking: vision, challenges, and current status. [New York: Wiley InterScience.]. Concurrency and Computation, 19, 89–105. doi:10.1002/cpe.1086 Dimitrakos, T., Golby, D., & Kearley, P. (2004, October). Towards a trust and contract management framework for dynamic virtual organisations. In eChallenges. Vienna, Austria.
Dharmapurikar, S., & Lockwood, J. (2006, October). Fast and Scalable Pattern Matching for Network Intrusion Detection Systems. IEEE Journal on Selected Areas in Communications, 24(10).
Dimitrov, B., & Rego, V. (1998). Arachne: A portable threads system supporting migrant threads on heterogeneous network farms. IEEE Transactions on Parallel and Distributed Systems, 9(5), 459–469. doi:10.1109/71.679216
Dharmapurikar, S., Krishnamurthy, P., & Taylor, D. E. (2006). Longest prefix matching using bloom filters. IEEE/ACM Trans. Netw., 14(2), 397–409.
Ding, Q., Chen, G. L., & Gu, J. (2002). A unified resource mapping strategy in computational grid environments. Journal of Software, 13(7), 1303–1308.
Dharmapurikar, S., Krishnamurthy, P., Sproull, T. S., & Lockwood, J. W. (2004). Deep packet inspection using parallel bloom filters. IEEE Micro, 24(1), 52–61. doi:10.1109/MM.2004.1268997
Dixit, K. M. (1991). The SPEC benchmarks. Parallel Computing, 17(10-11), 1195–1209. doi:10.1016/S0167-8191(05)80033-X
Dheepak, R., Ali, S., Sengupta, S., & Chakrabarti, A. (2005). Study of scheduling strategies in a dynamic data grid environment. In Distributed Computing - IWDC 2004 (Vol. 3326). Berlin: Springer.
Dixit, S., & Wu, T. (2004). Content Networking in the Mobile Internet. New York: John Wiley & Sons.
Dixon, C., Bragin, T., Krishnamurthy, A., & Anderson, T. (2006, September). Tit-for-Tat Distributed Resource Allocation [Poster]. The ACM SIGCOMM 2006 Conference.
Domenici, A., Donno, F., Pucciani, G., & Stockinger, H. (2006). Relaxed data consistency with CONStanza. In Proceedings of the sixth IEEE international symposium on cluster computing and the grid (pp. 425–429).
Doolan, D. C., Tabirca, S., & Yang, L. T. (2006). Mobile Parallel Computing. In Proceedings of the Fifth International Symposium on Parallel and Distributed Computing (ISPDC 06), (pp. 161-167).
Domenici, A., Donno, F., Pucciani, G., Stockinger, H., & Stockinger, K. (2004, Nov). Replica consistency in a Data Grid. Nuclear Instruments and Methods in Physics Research, 534, 24–28. doi:10.1016/j.nima.2004.07.052
Dorigo, M. (1992). Optimization, learning and natural algorithms (Tech. Rep.). Ph.D. Thesis, Politecnico di Milano, Milan, Italy.
Domingues, P., Araujo, F., & Silva, L. M. (2006, December). A DHT-based infrastructure for sharing checkpoints in desktop grid computing. In Conference on e-Science and Grid Computing (e-Science '06), Amsterdam, The Netherlands.
Donegan, B., Doolan, D. C., & Tabirca, S. (2008). Mobile Message Passing using a Scatternet Framework. International Journal of Computers, Communications & Control, 3(1), 51–59.
Dong, X., Halevy, A. Y., & Yu, C. (2007). Data integration with uncertainty. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases (pp. 687–698). VLDB Endowment.
Dongarra, J. J., & Eijkhout, V. (2003). Self-Adapting Numerical Software for Next-Generation Applications. International Journal of High Performance Computing Applications, 17(2), 125–131. doi:10.1177/1094342003017002002
Dongarra, J., Foster, I., Fox, G., Gropp, W., Kennedy, K., Torczon, L., & White, A. (2003). Sourcebook of parallel computing. San Francisco: Morgan Kaufmann Publishers.
Dongarra, J., Luszczek, P., & Petitet, A. (2003, August). The LINPACK Benchmark: past, present and future. Concurrency and Computation, 15(9), 803–820. doi:10.1002/cpe.728
Dongarra, J., Meuer, H., & Strohmaier, E. (2004). TOP500 Supercomputer Sites, 24th edition. In Proceedings of the Supercomputing Conference (SC'2004), Pittsburgh, PA. New York: ACM.
Dorta, A. J., González, J. A., Rodriguez, C., & de Sande, F. (2003). LLC: A parallel skeletal language. Parallel Processing Letters, 13(3), 437–448. doi:10.1142/S0129626403001409
Dorta, A. J., López, P., & de Sande, F. (2006). Basic skeletons in LLC. Parallel Computing, 32(7-8), 491–506. doi:10.1016/j.parco.2006.07.001
Douglis, F., & Ousterhout, J. K. (1991). Transparent process migration: Design alternatives and the Sprite implementation. Software, Practice & Experience, 21(8), 757–785. doi:10.1002/spe.4380210802
Draves, S. (2005, March). The electric sheep screen-saver: A case study in aesthetic evolution. In 3rd European Workshop on Evolutionary Music and Art.
Drozdowski, M., Lawenda, M., & Guinand, F. (2006). Scheduling multiple divisible loads. International Journal of High Performance Computing Applications, 20(1), 19–30. doi:10.1177/1094342006061879
Drozdowski, M., & Lawenda, M. (2005). On Optimum Multi-installment Divisible Load Processing in Heterogeneous Distributed Systems (LNCS 3648, pp. 231–240). Berlin: Springer.
Duan, R., Prodan, R., & Fahringer, T. (2006). Run-time optimization for Grid workflow applications. International Conference on Grid Computing. Barcelona, Spain: IEEE Computer Society Press.
Dumitrescu, C., & Foster, I. (2004). Usage policy-based CPU sharing in virtual organizations. In Proceedings of the fifth IEEE/ACM international workshop on grid computing (pp. 53–60).
Dumitrescu, C., & Foster, I. (2004). Usage policy-based CPU sharing in virtual organizations. In 5th IEEE/ACM International Workshop on Grid Computing (Grid 2004) (pp. 53–60). Washington, DC: IEEE Computer Society.
Dümmler, J., Rauber, T., & Rünger, G. (2008). Mapping algorithms for multiprocessor tasks on multi-core clusters. In Proceedings of the 37th International Conference on Parallel Processing (ICPP08). New York: IEEE Computer Society.
Dumitrescu, C., & Foster, I. (2005, August). GRUBER: A Grid resource usage SLA broker. In J. C. Cunha & P. D. Medeiros (Eds.), Euro-Par 2005 (Vol. 3648, pp. 465–474). Berlin/Heidelberg: Springer.
Duvvuri, V., Shenoy, P., & Tewari, R. (2000). Adaptive leases: A strong consistency mechanism for the World Wide Web. In Proceedings of IEEE INFOCOM (pp. 834–843).
Dumitrescu, C., Raicu, I., & Foster, I. (2005). DIGRUBER: A distributed approach to Grid resource brokering. In 2005 ACM/IEEE Conference on Supercomputing (SC 2005) (p. 38). Washington, DC: IEEE Computer Society.
Edelman, A. (1988). Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix Analysis and Applications, 9(4), 543–560. doi:10.1137/0609045
Dumitrescu, C., Wilde, M., & Foster, I. (2005, June). A model for usage policy-based resource allocation in Grids. In 6th IEEE International Workshop on Policies for Distributed Systems and Networks (pp. 191–200). Washington, DC: IEEE Computer Society.
Dümmler, J., Kunis, R., & Rünger, G. (2007). A scheduling toolkit for multiprocessor-task programming with dependencies. In Proceedings of the 13th International Euro-Par Conference (pp. 23–32). Berlin: Springer.
Dümmler, J., Kunis, R., & Rünger, G. (2007). A comparison of scheduling algorithms for multiprocessor-tasks with precedence constraints. In Proceedings of the 2007 High Performance Computing & Simulation (HPCS'07) Conference (pp. 663–669). ECMS.
Dümmler, J., Rauber, T., & Rünger, G. (2007). Communicating multiprocessor-tasks. In Proceedings of the 20th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2007). Berlin: Springer.
Dümmler, J., Rauber, T., & Rünger, G. (2008). A transformation framework for communicating multiprocessor-tasks. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2008) (pp. 64–71). New York: IEEE Computer Society.
Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., Stamm, R. L., & Tullsen, D. M. (1997). Simultaneous multithreading: a platform for next-generation processors. IEEE Micro, 17(5), 12–19. doi:10.1109/40.621209
Elias, J. A., & Moldes, L. N. (2002). Behaviour of the fast consistency algorithm in the set of replicas with multiple zones with high demand. In Proceedings of Symposium in Informatics and Telecommunications.
Elias, J. A., & Moldes, L. N. (2002). A demand based algorithm for rapid updating of replicas. In Proceedings of IEEE Workshop on Resource Sharing in Massively Distributed Systems (pp. 686–691).
Elias, J. A., & Moldes, L. N. (2003). Generalization of the fast consistency algorithm to a grid with multiple high demand zones. In Proceedings of the International Conference on Computational Science (ICCS 2003) (pp. 275–284).
El-Moursy, A., & Albonesi, D. H. (2003). Front-end policies for improved issue efficiency in SMT processors. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA'03), (pp. 31-40). Anaheim, CA: IEEE Computer Society Press.
Elmroth, E., & Gardfjäll, P. (2005, December). Design and evaluation of a decentralized system for Grid-wide fairshare scheduling. In 1st IEEE International Conference on e-Science and Grid Computing (pp. 221–229). Melbourne, Australia: IEEE Computer Society Press.
Enabling Grids for E-sciencE (EGEE) project. (2005). Retrieved from http://public.eu-egee.org
EnginFrame. (2008). Grid and cloud portal. Retrieved from www.nice-italy.com
Epema, D. H. J., Livny, M., van Dantzig, R., Evers, X., & Pruyne, J. (1996). A worldwide flock of condors: Load sharing among workstation clusters. Future Generation Computer Systems, 12(1), 53–65. doi:10.1016/0167-739X(95)00035-Q
ERCIM. (2005). Multimedia Informatics. ERCIM News, 62.
Erl, T. (2005). Service-Oriented Architecture (SOA): Concepts, Technology, and Design. Upper Saddle River, NJ: Prentice Hall.
Evans, J. J., Hood, C. S., & Gropp, W. D. (2003). Exploring the Relationship Between Parallel Application Run-Time Variability and Network Performance. In Proceedings of the Workshop on High-Speed Local Networks (HSLN), IEEE Conference on Local Computer Networks (LCN), (pp. 538-547).
Factor, M., Schuster, A., & Shagin, K. (2003). JavaSplit: a runtime for execution of monolithic Java programs on heterogeneous collections of commodity workstations. Paper presented at the Proceedings of the IEEE International Conference on Cluster Computing.
Fagg, G. E., & Dongarra, J. (2000). FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In PVM/MPI 2000, (pp. 346–353).
Fagg, G. E., Gabriel, E., Bosilca, G., Angskun, T., Chen, Z., Pjesivac-Grbovic, J., et al. (2004). Extending the MPI specification for process fault tolerance on high performance computing systems. In Proceedings of the International Supercomputer Conference, Heidelberg, Germany.
Fagg, G. E., Gabriel, E., Chen, Z., Angskun, T., Bosilca, G., & Pjesivac-Grbovic, J. (2005). Process fault-tolerance: Semantics, design and applications for high performance computing. International Journal of High Performance Computing Applications, 19(4), 465–477. doi:10.1177/1094342005056137
Fahringer, T., Prodan, R., Duan, R., Hofer, J., Nadeem, F., Nerieri, F., et al. (2006). ASKALON: A development and grid computing environment for scientific workflows. In I. J. Taylor, E. Deelman, D. Gannon, & M. Shields (Eds.), Workflows for e-Science (p. 530). Berlin: Springer Verlag.
Faraj, A., Patarasuk, P., & Yuan, X. (2007). A Study of Process Arrival Patterns for MPI Collective Operations. International Conference on Supercomputing, (pp. 168-179).
Faraj, A., Yuan, X., & Lowenthal, D. (2006). STAR-MPI: self tuned adaptive routines for MPI collective operations. In ICS '06: Proceedings of the 20th Annual International Conference on Supercomputing, (pp. 199-208). New York: ACM Press.
Fasttrack product description. (2001). Retrieved from http://www.fasttrack.nu/index.html
Fedak, G., Germain, C., N'eri, V., & Cappello, F. (2001, May). XtremWeb: A Generic Global Computing System. In Proceedings of the IEEE International Symposium on Cluster Computing and the Grid (CCGrid'01).
Fedak, G., Germain, C., Neri, V., & Cappello, F. (2002, May). XtremWeb: A generic global computing system. In CCGRID'01: Proceedings of the First IEEE Conference on Cluster and Grid Computing, Workshop on Global Computing on Personal Devices, Brisbane, (pp. 582-587). Los Alamitos, CA: IEEE Computer Society. Retrieved from http://doi.ieeecomputersociety.org/10.1109/CCGRID.2001.923246
Fedak, G., He, H., & Cappello, F. (2008, November). BitDew: A Programmable Environment for Large-Scale Data Management and Distribution. In Proceedings of the ACM/IEEE Supercomputing Conference (SC'08), Austin, TX.
Medical Data Federation: The Biomedical Informatics Research Network. (2003). In I. Foster & C. Kesselman (Eds.), The grid, blueprint for a new computing infrastructure (2nd ed.). San Francisco: Morgan Kaufmann.
Foster, I. (2002). What is the Grid? A three point checklist. Retrieved from http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf
Fernando, R., Harris, M., Wloka, M., & Zeller, C. (2004). Programming graphics hardware. In Tutorial on EUROGRAPHICS. NVIDIA Corporation.
Foster, I. (2006). Globus toolkit version 4: Software for service-oriented systems. In Proceedings of the international conference on network and parallel computing (pp. 2–13).
Fernandess, Y., & Malkhi, D. (2006). On Collaborative Content Distribution using Multi-Message Gossip. In Proceedings of the international parallel and distributed processing symposium. Rhodes Island, Greece: IEEE.
Foster, I., Kesselman, C., & Tuecke, S. (2002). The anatomy of the Grid: Enabling scalable virtual organizations. Retrieved from www.globus.org/alliance/publications/papers/anatomy.pdf
Ferrari, A. J., Chapin, S. J., & Grimshaw, A. S. (1997). Process introspection: A heterogeneous checkpoint/restart mechanism based on automatic code modification, (Technical Report: CS-97-05). University of Virginia, Charlottesville, VA.
Foster, I. T., & Chandy, K. M. (1995). Fortran M: A language for modular parallel programming. Journal of Parallel and Distributed Computing, 26(1), 24–35. doi:10.1006/jpdc.1995.1044
Fink, S. J. (1998). A programming model for block-structured scientific calculations on SMP clusters. Doctoral thesis, University of California, San Diego, CA.
Fischer, L. (Ed.). (2004). Workflow Handbook 2004. Lighthouse Point, FL: Future Strategies Inc.
Fisk, A. (2003). Gnutella dynamic query protocol v. 0.1. Retrieved from http://www9.limewire.com/developer/dynamic query.html
Folding@home. (2008). Client statistics by OS. Retrieved March 10, 2008, from http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats
Fontán, J., Vázquez, T., Gonzalez, L., Montero, R. S., & Llorente, I. M. (2008, May). OpenNEbula: The open source virtual machine manager for cluster computing. In Open Source Grid and Cluster Software Conference – Book of Abstracts. San Francisco.
Foster, I. (2000). Internet computing and the emerging grid. Nature. Retrieved from www.nature.com/nature/webmatters/grid/grid.html
Foster, I. (2002). The grid: A new infrastructure for 21st century science. Physics Today, 55, 42–47. doi:10.1063/1.1461327
Foster, I. T., & Iamnitchi, A. (2003). On death, taxes, and the convergence of peer-to-peer and grid computing. (LNCS 2735, pp. 118-128).
Foster, I., & Kesselman, C. (1997). Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications and High Performance Computing, 11(2), 115–128. doi:10.1177/109434209701100205
Foster, I., & Kesselman, C. (1999). The Grid: Blueprint for a New Computing Infrastructure. San Francisco: Morgan Kaufmann Publishers, Inc.
Foster, I., & Kesselman, C. (2003). The Grid 2: Blueprint for a new computing infrastructure. San Francisco: Morgan-Kaufmann.
Foster, I., & Kesselman, C. (2004). The Grid: Blueprint for a future computing infrastructure (2nd ed.). San Francisco: Morgan Kaufmann.
Foster, I., Freeman, T., Keahey, K., Scheftner, D., Sotomayor, B., & Zhang, X. (2006, May). Virtual clusters for Grid communities. In 6th IEEE International Symposium on Cluster Computing and the Grid (CCGRID 2006) (pp. 513–520). Washington, DC: IEEE Computer Society.
Foster, I., Kesselman, C., & Nick, J. (2002). Grid services for distributed system integration. IEEE Computer, 35(6), 37–46.
Fox, G., Hiranandani, S., Kennedy, K., Koelbel, C., Kremer, U., Tseng, C.-W., et al. (1990). Fortran D Language Specification (No. CRPC-TR90079), Houston, TX.
Foster, I., Kesselman, C., & Tuecke, S. (2001). The anatomy of the grid: Enabling scalable virtual organization. The International Journal of Supercomputer Applications, 15(3), 200–222.
Fox, G., Williams, R., & Messina, P. (1994). Parallel computing works! San Francisco: Morgan Kaufmann Publishers.
Foster, I., Kesselman, C., Nick, J. M., & Tuecke, S. (2002). Grid services for distributed system integration. Computer, 35(6), 37–46. doi:10.1109/MC.2002.1009167
Foster, I., Kesselman, C., Nick, J., & Tuecke, S. (2002). The physiology of the grid: An open grid services architecture for distributed systems integration. Retrieved from citeseer.nj.nec.com/foster02physiology.html
Foster, I., Kesselman, C., Tsudik, G., & Tuecke, S. (1998). A security architecture for computational Grids. ACM Conference on Computer and Communications Security.
Foster, I., Kohr, D. R., Krishnaiyer, R., & Choudhary, A. (1996). Double standards: Bringing task parallelism to HPF via the message passing interface. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (pp. 36-36). New York: IEEE Computer Society.
FreePastry. (2008, November). Retrieved from http://freepastry.rice.edu/FreePastry
Frey, J., Mori, T., Nick, J., Smith, C., Snelling, D., Srinivasan, L., & Unger, J. (2005). The open grid services architecture, Version 1.0. Retrieved from www.ggf.org/ggf_areas_architecture.htm
Frey, J., Tannenbaum, T., Livny, M., Foster, I. T., & Tuecke, S. (2001, August). Condor-G: A computation management agent for multi-institutional Grids. In 10th IEEE International Symposium on High Performance Distributed Computing (HPDC 2001) (pp. 55–63). San Francisco: IEEE Computer Society.
Frigo, M., & Johnson, S. (2005). The Design and Implementation of FFTW3. Proceedings of the IEEE, 93(2), 216–231. doi:10.1109/JPROC.2004.840301
Fourment, M., & Gillings, M. R. (2008, February). A comparison of common programming languages used in bioinformatics. Bioinformatics (Oxford, England), 9.
Fritsch, D., Klinec, D., & Volz, S. (2000). NEXUS positioning and data management concepts for location aware applications. In the 2nd International Symposium on Telegeoprocessing (Nice-Sophia-Antipolis, France), (pp. 171-184).
Fowler, M. (2008, November). Inversion of control containers and the dependency injection pattern. Retrieved from http://www.martinfowler.com/articles/injection.html
Fu, S., Xu, C. Z., & Shen, H. (April 2008). Random choices for Churn resilient load balancing in peer-topeer networks. Proc. of IEEE International Parallel and Distributed Processing Symposium.
Fox, G., & Gannon, D. (2001). Computational grids. Computing in Science & Engineering, 3(4), 74–77. doi:10.1109/5992.931906
Fu, Y., Chase, J., Chun, B., Schwab, S., & Vahdat, A. (2003). SHARP: An architecture for secure resource peering. In 19th ACM Symposium on Operating Systems Principles (SOSP 2003) (pp. 133–148). New York: ACM Press.
Fox, G. C., Johnson, M., Lyzenga, G., Otto, S. W., Salmon, J., & Walker, D. (1988). Solving Problems on Concurrent Processors: Vol. 1. Englewood Cliffs, NJ: Prentice-Hall.
Furmento, N., Hau, J., Lee, W., Newhouse, S., & Darlington, J. (2003). Implementations of a service-oriented architecture on top of Jini, JXTA and OGSA. In Proceedings of the UK e-Science All Hands Meeting.
Gabriel, E., Fagg, G., Bosilca, G., Angskun, T., Dongarra, J. J., Squyres, J. M., et al. (2004). Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In D. Kranzlmueller, P. Kacsuk, & J. J. Dongarra (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface (LNCS, Vol. 3241, pp. 97-104). Berlin: Springer.
Gabriel, E., & Huang, S. (2007). Runtime optimization of application level communication patterns. In Proceedings of the 2007 International Parallel and Distributed Processing Symposium, 12th International Workshop on High-Level Parallel Programming Models and Supportive Environments, (p. 185).
Gaddah, A., & Kunz, T. (2003). A survey of middleware paradigms for mobile computing. Carleton University, Systems and Computer Engineering [Research Report]. Retrieved June 15th, 2008, from http://www.sce.carleton.ca/wmc/middleware/middleware.pdf
Gao, L. (2001, December). On inferring autonomous system relationships in the Internet. IEEE/ACM Transactions on Networking, 9(6).
Garbacki, P., Biskupski, B., & Bal, H. (2005). Transparent fault tolerance for grid applications. In P. M. Sloot (Ed.), Advances in Grid Computing - EGC 2005, (pp. 671-680). Berlin: Springer Verlag.
Garcés-Erice, L., Biersack, E. W., Felber, P. A., Ross, K. W., & Urvoy-Keller, G. (2003). Hierarchical Peer-to-Peer Systems. In Proceedings of the 9th Intl. Euro-Par Conf. (pp. 1230-1239). Berlin: Springer-Verlag.
Garcia-Molina, H., & Salem, K. (1987). SAGAS. In Proceedings of ACM SIGMOD'87, International Conference on Management of Data, 16(3), 249-259.
Garey, M. R., & Johnson, D. S. (1979). Computers and Intractability. San Francisco: Freeman.
GAT. (2005). Grid application toolkit. Retrieved from www.gridlab.org/WorkPackages/wp-1/
Gelenbe, E. (1979). On the optimum checkpoint interval. Journal of the ACM, 26(2), 259–270. doi:10.1145/322123.322131
Gentzsch, W. (2008). Top 10 rules for building a sustainable Grid. In Grid Thought Leadership Series. Retrieved from www.ogf.org/TLS/?id=1
Gentzsch, W. (2002). Response to Ian Foster's "What is the Grid?" GRIDtoday, August 5. Retrieved from www.gridtoday.com/02/0805/100191.html
Gentzsch, W. (2004). Enterprise resource management: Applications in research and industry. In I. Foster & C. Kesselman (Eds.), The Grid 2: Blueprint for a new computing infrastructure (pp. 157–166). San Francisco: Morgan Kaufmann Publishers.
Gentzsch, W. (2004). Grid computing adoption in research and industry. In A. Abbas (Ed.), Grid computing: A practical guide to technology and applications (pp. 309–340). Florence, KY: Charles River Media Publishers.
Gentzsch, W. (2007). Grid initiatives: Lessons learned and recommendations. RENCI Report. Retrieved from www.renci.org/publications/reports.php
Gentzsch, W. (Ed.). (2007). A sustainable Grid infrastructure for Europe. Executive Summary of the e-IRG Open Workshop on e-Infrastructures, Heidelberg, Germany. Retrieved from www.e-irg.org/meetings/2007-DE/workshop.html
GEONgrid. (2008). Retrieved from www.geongrid.org
Georgakopoulos, D., Hornick, M., & Sheth, A. (1995). An overview of workflow management: From process modeling to workflow automation infrastructure. Distributed and Parallel Databases, 3(2), 119–153. doi:10.1007/BF01277643
Ghare, G., & Leutenegger, L. (2004, June). Improving Speedup and Response Times by Replicating Parallel Programs on a SNOW. In Proceedings of the 10th Workshop on Job Scheduling Strategies for Parallel Processing.
Ghinita, G., & Teo, Y. M. (2006). An adaptive stabilization framework for distributed hash tables. In Proceedings of the 20th IEEE Intl. Parallel and Distributed Processing Symp. New York: IEEE Computer Society Press.
Ghodsi, A., Alima, L. O., & Haridi, S. (2005). Low-bandwidth topology maintenance for robustness in structured overlay networks. In Proceedings of the 38th Hawaii Intl. Conf. on System Sciences (p. 302). New York: IEEE Computer Society Press.
Ghodsi, A., Alima, L. O., & Haridi, S. (2005). Symmetric replication for structured peer-to-peer systems. In Proceedings of the 3rd Intl. Workshop on Databases, Information Systems and Peer-to-Peer Computing (p. 12). Berlin: Springer-Verlag.
Ghormley, D., Petrou, D., Rodrigues, S., Vahdat, A., & Anderson, T. (1998, July). GLUnix: A global layer unix for a network of workstations. Software, Practice & Experience, 28(9), 929. doi:10.1002/(SICI)1097-024X(19980725)28:9<929::AID-SPE183>3.0.CO;2-C
Ghose, F., Grossklags, J., & Chuang, J. (2003). Resilient Data-Centric Storage in Wireless Ad-Hoc Sensor Networks. Proceedings of the 4th International Conference on Mobile Data Management (MDM'03), (pp. 45-62).
Gill, P. E., Murray, W., & Wright, M. H. (1993). Practical Optimization. London: Academic Press Ltd.
Gilmont, T., Legat, J.-D., & Quisquater, J.-J. (1999). Enhancing the security in the memory management unit. In Proceedings of the 25th EuroMicro Conference (EUROMICRO'99), 1, 449-456. Milan, Italy: IEEE Computer Society Press.
Gkantsidis, C., & Rodriguez, P. (2005, March). Network Coding for Large Scale Content Distribution. In Proceedings of IEEE INFOCOM 2005, Miami, USA.
gLite - Lightweight Middleware for Grid Computing. (2005). Retrieved from http://glite.web.cern.ch/glite
Globus: Grid security infrastructure (GSI) (n.d.). Retrieved from http://www.globus.org/security/
Globus: The grid resource allocation and management (GRAM) (n.d.). Retrieved from http://www.globus.org/toolkit/docs/3.2/gram/
Goderis, D., et al. (2001, July). Service Level Specification Semantics and parameters: draft-tequila-sls-01.txt [Internet Draft].
Godfrey, B., Lakshminarayanan, K., Surana, S., Karp, R., & Stoica, I. (2004). Load balancing in dynamic structured p2p systems. In Proceedings of INFOCOM (pp. 2253-2262). New York: IEEE Press.
Godfrey, B., Lakshminarayanan, K., Surana, S., Karp, R., & Stoica, I. (2006). Load balancing in dynamic structured P2P systems. Performance Evaluation, 63(3).
Godfrey, P. B., & Stoica, I. (2005). Heterogeneity and load balance in distributed hash tables. In Proceedings of INFOCOM (pp. 596-606). New York: IEEE Press.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. New York: Addison-Wesley.
Golding, R. A. (1992, Dec). Weak-consistency group communication and membership (Tech. Rep.). Ph.D. Thesis, Computer and Information Sciences, University of California.
Goller, A. (1999). Parallel and Distributed Processing of Large Image Data Sets. Doctoral Thesis, Graz University of Technology, Graz, Austria.
Goller, A., & Leberl, F. (2000). Radar Image Processing with Clusters of Computers. Paper presented at the IEEE Conference on Aerospace.
Golub, G. H., & Van Loan, C. F. (1989). Matrix Computations. Baltimore, MD: The Johns Hopkins University Press.
Golumbic, M. C. (1980). Algorithmic Graph Theory and Perfect Graphs. New York: Academic Press.
Gong, L. (2001, June). JXTA: A network programming environment. IEEE Internet Computing, 5(3), 88-95. Los Alamitos, CA: IEEE Computer Society. Retrieved from http://doi.ieeecomputersociety.org/10.1109/4236.93518
Goad, W. B. (1987). Sequence analysis. Los Alamos Science, (Special Issue), 288–291.
Gontmakher, A., Mendelson, A., Schuster, A., & Shklover, G. (2006). Speculative synchronization and thread management for fine granularity threads. In Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA'06), (pp. 278-287). Austin, TX: IEEE Computer Society Press.
Gonzalez-Castano, F. J., Vales-Alonso, J., Livny, M., Costa-Montenegro, E., & Anido-Rifo, L. (2003). Condor grid computing from mobile handheld devices. SIGMOBILE Mobile Computing and Communications Review, 7(1), 117–126. doi:10.1145/881978.882005
Gonzalo, C., & García-Martín, M.-A. (2006). The 3G IP Multimedia Subsystem (IMS): Merging the Internet and the Cellular Worlds. New York: Wiley.
Goodale, T., Jha, S., Kaiser, H., Kielmann, T., Kleijer, P., Merzky, A., et al. (2008). A simple API for Grid applications (SAGA). Grid Forum Document GFD.90. Open Grid Forum. Retrieved from www.ogf.org/documents/GFD.90.pdf
Goodman, D. J., Borras, J., Mandayam, N. B., & Yates, R. D. (1997). INFOSTATIONS: A new system model for data and messaging services. In Proceedings of the 47th IEEE Vehicular Technology Conference (VTC), Phoenix, AZ, (Vol. 2, pp. 969–973).
Google. (2008). Google App Engine. Retrieved from http://code.google.com/appengine/
Google App Engine. (2008, November). Retrieved from http://appengine.google.com
Google Groups. (2008). Cloud computing. Retrieved from http://groups.google.ca/group/cloud-computing
Govindaraju, M., Krishnan, S., Chiu, K., Slominski, A., Gannon, D., & Bramley, R. (2002, June). XCAT 2.0: A component-based programming model for grid web services (Tech. Rep. TR562). Dept. of C.S., Indiana Univ., South Bend, IN.
Grabowski, P., Lewandowski, B., & Russell, M. (2004). Access from J2ME-enabled mobile devices to grid services. In Proceedings of Mobility Conference 2004, Singapore.
Grama, A., Gupta, A., Kumar, V., & Karypis, G. (2003). Introduction to parallel computing. Upper Saddle River, NJ: Pearson Education Limited.
Grassi, V., Donatiello, L., & Iazeolla, G. (1988). Performability evaluation of multicomponent fault tolerant systems. IEEE Transactions on Reliability, 37(2), 216–222. doi:10.1109/24.3744
Graupner, S., Kotov, V., Andrzejak, A., & Trinks, H. (2002, August). Control Architecture for Service Grids in a Federation of Utility Data Centers (Technical Report No. HPL-2002-235). Palo Alto, CA: HP Laboratories Palo Alto.
Gray, A. A., Arabshahi, P., Lamassoure, E., Okino, C., & Andringa, J. (2004). A Real Option Framework for Space Mission Design. Technical report, National Aeronautics and Space Administration (NASA).
Gray, J. (1981). The transaction concept: Virtues and limitations. In Proceedings of the 7th International Conference on VLDB, (pp. 144-154).
Grelck, C., Scholz, S.-B., & Shafarenko, A. V. (2007). Coordinating data parallel SAC programs with S-Net. In Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007) (pp. 1–8). New York: IEEE.
Grid Computing, I. B. M. (n.d.). Retrieved from http://www-1.ibm.com/grid/
Grid Engine. (2001). Open source project. Retrieved from http://gridengine.sunsource.net/
Grid Interoperability Now Community Group (GIN-CG). (2006). Retrieved from http://forge.ogf.org/sf/projects/gin
GridFTP (n.d.). Retrieved from http://www.globus.org/toolkit/docs/4.0/data/gridftp/
GridSphere. (2008). Retrieved from www.gridsphere.org/gridsphere/gridsphere
GridWay. (2008). Metascheduling technologies for the grid. Retrieved from www.gridway.org/
Grigg, A. (2002). Reservation-Based Timing Analysis – A Partitioned Timing Analysis Model for Distributed Real-Time Systems (YCST-2002-10). York, UK: University of York, Dept. of Computer Science.
Grigoras, D. (2005). Service-oriented Naming Scheme for Wireless Ad Hoc Networks. In Proceedings of the NATO ARW "Concurrent Information Processing and Computing", July 3-10, 2003, Sinaia, Romania, (pp. 60-73). Amsterdam: IOS Press.
Grigoras, D., & Riordan, M. (2007). Cost-effective mobile ad hoc networks management. Future Generation Computer Systems, 23(8), 990–996. doi:10.1016/j.future.2007.04.001
Grigoras, D., & Zhao, Y. (2007). Simple Self-management of Mobile Ad Hoc Networks. In Proc. of the 9th IFIP/IEEE International Conference on Mobile and Wireless Communication Networks, 19-21 September 2007, Cork, Ireland.
Grimme, C., Lepping, J., & Papaspyrou, A. (2008, April). Prospects of collaboration between compute providers by means of job interchange. In Job Scheduling Strategies for Parallel Processing (Vol. 4942, pp. 132-151). Berlin/Heidelberg: Springer.
Grimshaw, A. S., & Wulf, W. A. (1997). The legion vision of a worldwide virtual computer. Communications of the ACM, 40(1), 39–45. doi:10.1145/242857.242867
Grit, L. E. (2005, October). Broker Architectures for Service-Oriented Systems [Technical Report]. Durham, NC: Department of Computer Science, Duke University.
Grit, L. E. (2007). Extensible Resource Management for Networked Virtual Computing. PhD thesis, Department of Computer Science, Duke University, Durham, NC. (Adviser: Jeffrey S. Chase)
Gschwind, M., Hofstee, H. P., Flachs, B., Hopkins, M., Watanabe, Y., & Yamazaki, T. (2006). Synergistic Processing in Cell's Multicore Architecture. IEEE Computer Society, 0272-1732/06.
Guiffaut, C., & Mahdjoubi, K. (2001, April). A Parallel FDTD Algorithm Using the MPI Library. IEEE Antennas and Propagation Magazine, 43(2), 94–103.
Gummadi, K., Gummadi, R., Gribble, S., Ratnasamy, S., Shenker, S., & Stoica, I. (2003). The impact of DHT routing geometry on resilience and proximity. In Proceedings of ACM SIGCOMM (pp. 381-394). New York: ACM Press.
Guo, D., Wu, J., Chen, H., & Luo, X. (2006). Theory and Network Applications of Dynamic Bloom Filters. Paper presented at INFOCOM 2006, the 25th IEEE International Conference on Computer Communications.
Guo, S.-F., Zhang, W., Ma, D., & Zhang, W.-L. (2004, Aug.). Grid mobile service: using mobile software agents in grid mobile service. In Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, 1, 178-182.
Gupta, A., Sahin, O. D., Agarwal, D., & El Abbadi, A. (2004). Meghdoot: Content-based publish/subscribe over peer-to-peer networks. In Middleware'04: Proceedings of the 5th ACM/IFIP/USENIX International Conference on Middleware, (pp. 254-273). Heidelberg, Germany: SpringerLink. doi:10.1007/b101561
Gupta, I., Birman, K., Linga, P., Demers, A., & Renesse, R. V. (2003). Kelips: Building an efficient and stable P2P DHT through increased memory and background overhead. In Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 160-169). Berlin: Springer-Verlag.
Gustafson, J. (1988). Reevaluating Amdahl's law. Communications of the ACM, 31, 532–533. doi:10.1145/42411.42415
Haahr, M., Cunningham, R., & Cahill, V. (1999). Supporting CORBA applications in a mobile environment. In MobiCom '99: Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking, (pp. 36-47).
GSI (Globus Security Infrastructure). Retrieved from http://www.globus.org/Security/
Hailong, C., & Jun, W. (2004). Foreseer: a novel, locality-aware peer-to-peer system architecture for keyword searches. Paper presented at the Proceedings of the 5th ACM/IFIP/USENIX International Conference on Middleware.
Haji, M. H., Gourlay, I., Djemame, K., & Dew, P. M. (2005). A SNAP-based community resource broker using a three-phase commit protocol: A performance study. The Computer Journal, 48(3), 333–346. doi:10.1093/comjnl/bxh088
Hakimi, S. (1964). Optimum location of switching centers and the absolute centers and medians of a graph. Operations Research, 12, 450–459. doi:10.1287/opre.12.3.450
Halsall, F. (2000). Multimedia Communications: Applications, Networks, Protocols and Standards. New York: Addison Wesley.
Hammond, L., Hubbert, B. A., Siu, M., Prabhu, M. K., Chen, M., & Olukotun, K. (2000). The Stanford Hydra CMP. IEEE Micro, 20(2), 71–84. doi:10.1109/40.848474
Hammond, L., Nayfeh, B. A., & Olukotun, K. (1997). A single-chip multiprocessor. IEEE Computer, 30(9), 79–85.
Hammond, L., Wong, V., Chen, M., Carlstrom, B. D., Davis, J. D., & Hertzberg, B. (2004). Transactional Memory Coherence and Consistency. SIGARCH Computer Architecture News, 32(2), 102. doi:10.1145/1028176.1006711
Harte, L., Wiblitzhouser, A., & Pazderka, T. (2006). Introduction to MPEG: MPEG-1, MPEG-2 and MPEG-4. Fuquay Varina, NC: Althos Publishing.
Harvey, N. J., Jones, M. B., Saroiu, S., Theimer, M., & Wolman, A. (2003). SkipNet: A scalable overlay network with practical locality properties. In Proceedings of the 4th USENIX Symp. on Internet Technologies and Systems (pp. 113-126). USENIX Association.
Hawick, K. A., James, H. A., Maciunas, K. J., Vaughan, F. A., Wendelborn, A. L., Buchhorn, M., et al. (1997). Geostationary-satellite Imagery Application on Distributed, High-Performance Computing. Paper presented at High Performance Computing on the Information Superhighway: HPC Asia '97.
Hayes, B. (2007). Computing in a parallel universe. American Scientist, 95.
Hayes, C. L., & Luo, Y. (2007). DPICO: A high speed deep packet inspection engine using compact finite automata. Proceedings of ACM/IEEE ANCS, (pp. 195-203).
He, X. (1998). 2D-Object Recognition With Spiral Architecture. Doctoral Thesis, University of Technology, Sydney, Australia.
He, X., & Sun, X. (2005). Incorporating data movement into grid task scheduling. In Proceedings of Grid and Cooperative Computing (pp. 394–405).
He, X., Sun, X., & Laszewski, G. (2003). QoS guided Min-Min heuristic for grid task scheduling. Journal of Computer Science and Technology, Special Issue on Grid Computing, 18(4).
Heien, E., Fujimoto, N., & Hagihara, K. (2008). Computing low latency batches with unreliable workers in volunteer computing environments. In PCGrid.
Heine, F., Hovestadt, M., Kao, O., & Keller, A. (2005). Provision of fault tolerance with grid-enabled and SLA-aware resource management systems. In G. R. Joubert (Ed.), Parallel Computing: Current and Future Issues of High End Computing, (pp. 105-112). NIC-Directors.
Heine, F., Hovestadt, M., Kao, O., & Keller, A. (2005). SLA-aware job migration in grid environments. In L. Grandinetti (Ed.), Grid Computing: New Frontiers of High Performance Computing (pp. 345-367). Amsterdam, The Netherlands: Elsevier Press.
Hennessy, J., & Patterson, D. (2006). Computer architecture: a quantitative approach (4th Ed.). San Francisco: Morgan Kaufmann.
Hey, T., & Trefethen, A. E. (2002). The UK e-science core programme and the Grid. Future Generation Computer Systems, 18(8), 1017–1031. doi:10.1016/S0167-739X(02)00082-1
Hoefler, T., Lichei, A., & Rehm, W. (2007). Low-Overhead LogGP Parameter Assessment for Modern Interconnect Networks. In Proceedings of the IPDPS, Long Beach, CA, March 26-30. New York: IEEE.
Heymann, E., Fernandez, A., Senar, M. A., & Salt, J. (2003). The EU-Crossgrid approach for grid application scheduling. European Grid Conference, (LNCS 2970, pp. 17-24). Amsterdam: Springer Verlag.
Hong, T., & Tao, Y. (2003). An Efficient Data Location Protocol for Self-organizing Storage Clusters. Paper presented at the Proceedings of the 2003 ACM/IEEE conference on Supercomputing.
High Performance Fortran Forum. (1993). High performance Fortran language specification, version 1.0 (No. CRPC-TR92225). Center for Research on Parallel Computation, Rice University, Houston, TX.
Hopper, R. (2002). P/Meta - metadata exchange scheme. Retrieved June 15th, 2008, from http://www.ebu.ch/trev_290-hopper.pdf
High Performance Fortran Forum. (1997). High performance Fortran language specification 2.0. Center for Research on Parallel Computation, Rice University, Houston, TX.
Hill, M. D., & Marty, M. R. (2008, July). Amdahl's Law in the Multicore Era. HPCA 2008, IEEE 14th International Symposium (p. 187).
Hingne, V., Joshi, A., Finin, T., Kargupta, H., & Houstis, E. (2003). Towards a pervasive grid. In International Parallel and Distributed Processing Symposium (IPDPS'03) (p. 207).
Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., & Roussel, P. (2001). The microarchitecture of the Pentium 4 processor. Intel® Technology Journal, 5(1), 1-13.
Ho, T., Medard, M., Koetter, R., Karger, D. R., Effros, M., Shi, J., & Leong, B. (2006, October). A random linear network coding approach to multicast. IEEE Transactions on Information Theory, 52(10). doi:10.1109/TIT.2006.881746
Hockney, R., & Berry, M. (1994). PARKBENCH report: public international benchmarks for parallel computers. Scientific Programming, 3(2), 101–146.
Hockney, R., & Eastwood, J. (1981). Computer simulation using particles. London: McGraw-Hill, Inc.
Hoschek, W., Jaen-Martinez, J., Samar, A., Stockinger, H., & Stockinger, K. (2000). Data management in an international data grid project. grid computing - GRID 2000 (pp.333-361). UK. Hovestadt, M. (2003). Scheduling in HPC resource management systems: Queuing vs. planning. In D. Feitelson (Ed.), Job Scheduling Strategies for Parallel Processing, (pp.1-20). Berlin: Springer Verlag. Hsiao, H.-C., & King, C.-T. (2003). A tree model for structured peer-to-peer protocols. In Proceedings of the 3rd IEEE Intl. Symp. on Cluster Computing and the Grid (pp. 336-343). New York: IEEE Computer Society Press. Hua, Y., & Xiao, B. (2006). A Multi-attribute Data Structure with Parallel Bloom Filters for Network Services. Proceedings of 13th International Conference of High Performance Computing (HiPC),(pp. 277-288). Hua, Y., Zhu, Y., Jiang, H., Feng, D., & Tian, L. (2008). Scalable and Adaptive Metadata Management in Ultra Large-Scale File Systems. Proceedings of the 28th International Conference on Distributed Computing Systems (ICDCS 2008). Huang, A. (2003) Hacking the Xbox: an introduction to reverse engineering, (1st Ed.). San Francisco: No Starch Press.
Huang, J., & Lilja, D. J. (1999). Exploiting basic block value locality with block reuse. Proceedings of 5th International Symposium on High-Performance Computer Architecture (HPCA’99), (pp. 106-114). Orlando, FL: IEEE Computer Society Press.
Huang, K.-H., & Abraham, J. A. (1984). Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, C-33, 518–528. doi:10.1109/TC.1984.1676475
Huang, R., Casanova, H., & Chien, A. A. (2006, April). Using virtual Grids to simplify application scheduling. In 20th International Parallel and Distributed Processing Symposium (IPDPS 2006). Rhodes Island, Greece: IEEE.
Huang, S. (2007). Applying Adaptive Software Technologies for Scientific Applications. Master’s Thesis, Department of Computer Science, University of Houston, Houston, TX.
Huedo, E., Montero, R. S., & Llorente, I. M. (2004). A framework for adaptive execution in Grids. Software, Practice & Experience, 34(7), 631–651. doi:10.1002/spe.584
Huerta, M., Haseltine, F., & Liu, Y. (2004, July). NIH working definition of bioinformatics and computational biology.
Hull, J. C. (2006). Options, Futures, and Other Derivatives (6th Edition). Upper Saddle River, NJ: Prentice Hall.
Hunold, S., Rauber, T., & Rünger, G. (2004). Multilevel hierarchical matrix-matrix multiplication on clusters. In Proceedings of the 18th International Conference of Supercomputing (ICS’04) (pp. 136–145). New York: ACM.
Hunold, S., Rauber, T., & Rünger, G. (2008). Combining building blocks for parallel multi-level matrix multiplication. Parallel Computing, 34(6-8), 411–426. doi:10.1016/j.parco.2008.03.003
Huston, G. (n.d.). Peering and settlements Part 1. The Internet Protocol Journal. San Jose, CA: CISCO Systems.
Hwang, J., & Aravamudham, P. (2004). Middleware services for P2P computing in wireless grid networks. IEEE Internet Computing, 8(4), 40–46. doi:10.1109/MIC.2004.19
Hwang, S., & Kesselman, C. (2003). GridWorkflow: A flexible failure handling framework for the Grid. In B. Lowekamp (Ed.), 12th IEEE International Symposium on High Performance Distributed Computing, (pp. 126-131). New York: IEEE Press.
Iamnitchi, A., Doraimani, S., & Garzoglio, G. (2006). Filecules in High-Energy Physics: Characteristics and Impact on Resource Management. In Proceedings of the 15th IEEE International Symposium on High Performance Distributed Computing (HPDC 15), Paris.
Iamnitchi, A., Foster, I. T., & Nurmi, D. (2002). A peer-to-peer approach to resource location in grid environments. In HPDC (p. 419).
IBM. (2007). Blue Gene. Retrieved March 10, 2008, from http://domino.research.ibm.com/comm/research_projects.nsf/pages/bluegene.index.html
Information Services. (n.d.). Retrieved from http://www.globus.org/toolkit/mds/
Intel (2007). Intel® multi-core: An overview.
Intel News Release. (2006). New dual-core Intel® Itanium® 2 processor doubles performance, reduces power consumption. Santa Clara, CA: Author.
Iosevich, V., & Schuster, A. (2005). Software Distributed Shared Memory: a VIA-based implementation and comparison of sequential consistency with home-based lazy release consistency: Research Articles. Software, Practice & Experience, 35(8), 755–786. doi:10.1002/spe.656
Iosup, A., & Epema, D. H. (2006). GRENCHMARK: A framework for analyzing, testing, and comparing grids. International Conference on Cluster Computing and the Grid (pp. 313-320). Singapore: IEEE Computer Society Press.
Jacob, B., Ferreira, L., Bieberstein, N., Gilzean, C., Girard, J.-Y., Strachowski, R., & Yu, S. (2003). Enabling applications for Grid computing with Globus. IBM Redbook. Retrieved from www.redbooks.ibm.com/abstracts/sg246936.html?Open
Iosup, A., & Epema, D. H. (2007). Build-and-test workloads for Grid middleware: Problem, analysis, and applications. International Conference on Cluster Computing and the Grid (pp. 205-213). Rio de Janeiro, Brazil: IEEE Computer Society Press.
Jain, R. K. (1991). The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. New York: Wiley.
Iosup, A., Epema, D. H. J., Tannenbaum, T., Farrellee, M., & Livny, M. (2007, November). Inter-operating Grids through delegated matchmaking. In 2007 ACM/IEEE Conference on Supercomputing (SC 2007) (pp. 1–12). New York: ACM Press. Irwin, D., Chase, J., Grit, L., Yumerefendi, A., Becker, D., & Yocum, K. G. (2006, June). Sharing networked resources with brokered leases. In USENIX Annual Technical Conference (pp. 199–212). Berkeley, CA: USENIX Association. Ishikawa, Y., Matsuda, M., Kudoh, T., Tezuka, H., & Sekiguchi, S. (2003). The design of a latency-aware mpi communication library. In Proceedings of swopp03. ISO/IEC. (1995). SMDL (Standard Music Description Language) Overview. Retrieved June 15th, 2008, from http://xml.coverpages.org/gen-apps.html#smdl
James, K. M. (1983). A second look at Bloom filters. Communications of the ACM, 26(8), 570–571. doi:10.1145/358161.358167
Jamin, S., Jin, C., Jin, Y., Raz, D., Shavitt, Y., & Zhang, L. (2000). On the placement of Internet instrumentation. In Proc. of INFOCOM.
Jarvis, S. A., & Nudd, G. R. (2005, February). Performance-based middleware for Grid computing. Concurrency and Computation: Practice and Experience, 17(2-4), 215–234. doi:10.1002/cpe.925
Jayram, T. S., Kimbrel, T., Krauthgamer, R., Schieber, B., & Sviridenko, M. (2001). Online server allocation in server farm via benefit task systems. Proceedings of the ACM Symposium on Theory of Computing (STOC’01), Crete, Greece, (pp. 540–549).
ISO/IEC. (2003). MPEG-7 Overview. Retrieved June 15th, 2008, from http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm.
Jean, K., Galis, A., & Tan, A. (2004). Context-aware grid services: Issues and approaches. In Computational Science – ICCS 2004: 4th International Conference, Kraków, Poland, June 6–9, 2004, Proceedings, Part III (LNCS Vol. 3038, p. 1296). Berlin: Springer.
Itzkovitz, A., Schuster, A., & Shalev, L. (1998). Thread migration and its applications in distributed shared memory systems. Journal of Systems and Software, 42(1), 71–87. doi:10.1016/S0164-1212(98)00008-9
Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., & Schulzrinne, H. (2008). REsource LOcation And Discovery (RELOAD). Retrieved June 15th, 2008, from http://tools.ietf.org/id/draft-bryan-p2psip-reload-04.txt.
Iwata, T., & Kurosawa, K. (2003). OMAC: One-Key CBC MAC. In 10th International Workshop on Fast Software Encryption (FSE’03), (LNCS Vol. 2887/2003, pp. 129-153), Lund, Sweden. Berlin/Heidelberg: Springer.
Jha, S., Kaiser, H., El Khamra, Y., & Weidner, O. (2007, Dec. 10-13). Design and implementation of network performance aware applications using SAGA and Cactus. 3rd IEEE Conference on eScience and Grid Computing, (pp. 143-150). Bangalore, India.
Jiang, H., & Chaudhary, V. (2004). Process/thread migration and checkpointing in heterogeneous distributed systems. Proceedings of the 37th Hawaii International Conference on System Sciences, Hawaii, USA.
JSR166. (2004). Java concurrent utility package in J2SE 5.0 (JDK1.5). Retrieved June 24, 2008, from http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html
Jiang, J. L., Yang, G. W., & Shi, M. L. (2006). Transaction Model for Service Grid Environment and Implementation Considerations. In Proceedings of IEEE International Conference on Web Services (pp. 949 – 950).
Jul, E., Levy, H., Hutchinson, N., & Black, A. (1988). Fine-grained mobility in the Emerald system. ACM Transactions on Computer Systems, 6(1), 109–133. doi:10.1145/35037.42182
Jiang, S., O’Hanlon, P., & Kirstein, P. (2004). Moving grid systems into the ipv6 era. In Proceedings of Grid And Cooperative Computing 2003 (LNCS 3033, pp. 490–499). Heidelberg, Germany: Springer-Verlag.
Jung, E. B., Choi, S.-J., Baik, M.-S., Hwang, C.-S., Park, C.-Y., & Young, S. (2005). Scheduling scheme based on dedication rate in volunteer computing environment. In Third international symposium on parallel and distributed computing (ispdc 2005), Lille, France.
Jiang, X.-F., Zheng, H.-W., Macian, C., & Pascual, V. (2008). Service Extensible P2P Peer Protocol. Retrieved June 15th, 2008, from http://tools.ietf.org/id/draft-jiang-p2psip-sep-01.txt
Jin, H., Xiong, M., Wu, S., & Zou, D. (2006). Replica Based Distributed Metadata Management in Grid Environment. Computational Science (LNCS 3944, pp. 1055-1062). Berlin: Springer-Verlag.
John, K., David, B., Yan, C., Steven, C., Patrick, E., & Dennis, G. (2000). OceanStore: an architecture for global-scale persistent storage. SIGPLAN Not., 35(11), 190–201. doi:10.1145/356989.357007
Johnson, C., & Welser, J. (2005). Future processors: Flexible and modular. Proceedings of 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, (pp. 4-6).
Johnson, R. (2002). Spring Framework - a full-stack Java/JEE application framework. Retrieved June 18, 2008, from http://www.springframework.org/
Joisha, P. G., & Banerjee, P. (1999). PARADIGM (version 2.0): A new HPF compilation system. In IPPS ’99/SPDP ’99: Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing (pp. 609–615). Washington, DC: IEEE Computer Society.
Jones, N. C., & Pevzner, P. A. (2004, August). An Introduction to Bioinformatics Algorithms.
Kaashoek, M. F., & Karger, D. R. (2003). Koorde: A simple degree-optimal distributed hash table. In Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 98-107). Berlin: Springer-Verlag. Kale, L. V., & Krishnan, S. (1998). Charm++: Parallel Programming with Message-Driven Objects. In G. V. Wilson, & P. Lu, Parallel programming using c++ (pp. 175-213). Cambridge, MA: MIT Press. Kalogeraki, V., Gunopulos, D., & Zeinalipour-Yazti, D. (2002). A local search mechanism for peer-to-peer networks. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (pp. 300-307). Kang, D.-S. (2004) Speculation-aware thread scheduling for simultaneous multithreading. Doctoral Dissertation, University of Southern California, Los Angeles, CA. Kang, D.-S., Liu, C., & Gaudiot, J.-L. (2008). The impact of speculative execution on SMT processors. [IJPP]. International Journal of Parallel Programming, 36(4), 361–385. doi:10.1007/s10766-007-0052-3 Kangasharju, J. (2002). Implementing the Wireless CORBA Specification. PhD Disertation, Computer Science Department, University of Helsinki, Helsinki, Finland. Retrieved June 15th, 2008, from http://www. cs.helsinki.fi/u/jkangash/laudatur-jjk.pdf
Karger, D. R., & Ruhl, M. (2004). Diminished chord: A protocol for heterogeneous subgroup. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 288-297). Berlin: Springer-Verlag.
Karger, D. R., & Ruhl, M. (2004). Simple, efficient load balancing algorithms for peer-to-peer systems. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 131-140). Berlin: Springer-Verlag.
Karger, D., Lehman, E., Leighton, T., Levine, M., et al. (1997). Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proc. of STOC (pp. 654–663).
Karonis, N. T., Toonen, B., & Foster, I. (2003). MPICH-G2: A Grid-enabled implementation of the message passing interface. Journal of Parallel and Distributed Computing (JPDC), 63, 551–563. doi:10.1016/S0743-7315(03)00002-9
Karp, R. M. (1992). Online algorithms versus offline algorithms: How much is it worth to know the future? In J. van Leeuwen (Ed.), Proceedings of the 12th IFIP World Computer Congress. Volume 1: Algorithms, Software, Architecture, (pp. 416–429). Amsterdam: Elsevier.
Katzy, B., Zhang, C., & Löh, H. (2005). Virtual organizations: Systems and practices. In L. M. Camarinha-Matos, H. Afsarmanesh, & M. Ollus (Eds.), (pp. 45-58). New York: Springer Science+Business Media, Inc.
Keahey, K., Foster, I., Freeman, T., & Zhang, X. (2006). Virtual workspaces: Achieving quality of service and quality of life in the Grids. Scientific Programming, 13(4), 265–275.
Kedrinskii, V. K., Vshivkov, V. A., Dudnikova, G. I., Shokin, Yu. I., & Lazareva, G. G. (2004). Focusing of an oscillating shock wave emitted by a toroidal bubble cloud. Journal of Experimental and Theoretical Physics, 98(6), 1138–1145. doi:10.1134/1.1777626
Keleher, P., Cox, A. L., & Zwaenepoel, W. (1992). Lazy release consistency for software distributed shared memory. Paper presented at the Proceedings of the 19th Annual International Symposium on Computer Architecture.
Keleher, P., Cox, A. L., Dwarkadas, S., & Zwaenepoel, W. (1994). TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. Paper presented at the Proceedings of the Winter 1995 USENIX Conference.
Kelly, W., Roe, P., & Sumitomo, J. (2002). G2: A grid middleware for cycle donation using .NET. In Proceedings of the 2002 International Conference on Parallel and Distributed Processing Techniques and Applications.
Kephart, J. O., & Chess, D. M. (2003). The vision of autonomic computing. Computer, 36(1), 41–50.
Kertész, A., Farkas, Z., Kacsuk, P., & Kiss, T. (2008, April). Grid enabled remote instrumentation. In F. Davoli, N. Meyer, R. Pugliese, & S. Zappatore (Eds.), 2nd International Workshop on Distributed Cooperative Laboratories: Instrumenting the Grid (INGRID 2007) (pp. 303–312). New York: Springer US.
Kesselman, C., & Foster, I. (1998). The Grid: Blueprint for a new computing infrastructure. San Francisco: Morgan Kaufmann Publishers.
Kessler, C. W., & Löwe, W. (2007). A framework for performance-aware composition of explicitly parallel components. In Proceedings of the International Conference ParCo 2007 (pp. 227–234). Jülich/Aachen, Germany: IOS Press.
Khanna, G., Vydyanathan, N., Catalyurek, U., Kurc, T., Krishnamoorthy, S., Sadayappan, P., et al. (2006). Task scheduling and file replication for data-intensive jobs with batch-shared I/O. In Proceedings of High-Performance Distributed Computing (HPDC) (pp. 241–252).
Kielmann, T., Hofman, R. F. H., Bal, H. E., Plaat, A., & Bhoedjang, R. A. F. (1999). MagPIe: MPI’s collective communication operations for clustered wide area systems. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’99), 34(8), 131-140.
Kim, J.-S., Nam, B., Keleher, P. J., Marsh, M. A., Bhattacharjee, B., & Sussman, A. (2006). Resource discovery techniques in distributed desktop grid environments. In Grid (pp. 9-16).
Kim, K. H., & Buyya, R. (2007, September). Fair resource sharing in hierarchical virtual organizations for global Grids. In 8th IEEE/ACM International Conference on Grid Computing (Grid 2007) (pp. 50–57). Austin, TX: IEEE.
Kim, S., & Weissman, J. B. (2004). A genetic algorithm based approach for scheduling decomposable data grid applications. In Proceedings of the International Conference on Parallel Processing (Vol. 1, pp. 405–413).
Kim, Y. (1996, June). Fault Tolerant Matrix Operations for Parallel and Distributed Systems. Ph.D. dissertation, University of Tennessee, Knoxville.
Knuth, D. E. (1975). The art of computer programming. Volume 1: Fundamental Algorithms. Reading, MA: Addison-Wesley.
Koelbel, C. H., Loveman, D. B., Schreiber, R. S., Steele, G. L., Jr., & Zosel, M. E. (1994). The High Performance Fortran Handbook. Cambridge, MA: MIT Press.
Koetter, R., & Medard, M. (2003, October). An algebraic approach to network coding. IEEE/ACM Transactions on Networking (TON), 11(5), 782–795.
Kok, A. J. F., Pabst, J. L. v., & Afsarmanesh, H. (April, 1997). The 3D Object Mediator: Handling 3D Models on Internet. Paper presented at the High-Performance Computing and Networking, Vienna, Austria.
Kondo, D., Araujo, F., Malecot, P., Domingues, P., Silva, L. M., & Fedak, G. (2006). Characterizing result errors in internet desktop grids (Tech. Rep. No. INRIA-HAL Tech Report 00102840), INRIA, France.
Kondo, D., Chien, A. A., & Casanova, H. (2007). Scheduling task parallel applications for rapid turnaround on enterprise desktop grids. Journal of Grid Computing, 5(4), 379–405. doi:10.1007/s10723-007-9063-y
Kondo, D., Chien, A., & Casanova, H. (2004, November). Rapid Application Turnaround on Enterprise Desktop Grids. In ACM Conference on High Performance Computing and Networking (SC2004).
Kondo, D., Fedak, G., Cappello, F., Chien, A. A., & Casanova, H. (2006, December). On Resource Volatility in Enterprise Desktop Grids. In Proceedings of the 2nd IEEE International Conference On E-Science And Grid Computing (eScience’06) (pp. 78–86). Amsterdam, Netherlands. Kondo, D., Taufer, M., Brooks, C., Casanova, H., & Chien, A. (2004, April). Characterizing and evaluating desktop grids: An empirical study. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’04). Koskela, T., Kassinen, O., Korhonen, J., Ou, Z., & Ylianttila, M. (2008). Peer-to-Peer Community Management using Structured Overlay Networks. In the Proc. of International Conference on Mobile Technology, Applications and Systems, September 10-12, Yilan, Taiwan. Koufaty, D., & Marr, D. (2003). Hyperthreading technology in the Netburst microarchitecture. IEEE Micro, 23(2), 56–65. doi:10.1109/MM.2003.1196115 Kowaliski, C. (2008). NVIDIA CEO talks down CPUGPU hybrids, Larrabee. The Tech Report, April 11th. Retrieved from http://techreport.com/discussions.x/14538 Kraeva, M. A., & Malyshkin, V. E. (1997). Implementation of PIC method on MIMD multicomputers with assembly technology. In Proc. of the High Performance Computing and Networking Europe 1997 Int. Conference. (LNCS, Vol.1255), (pp. 541-549). Berlin: Springer Verlag. Kraeva, M. A., & Malyshkin, V. E. (1999). Algorithms of parallel realization of PIC method with assembly technology. In Proceedings of 7th High Performance Computing and Networking Europe, (LNCS Vol. 1593), (pp. 329-338). Berlin: Springer Verlag. Kraeva, M. A., & Malyshkin, V. E. (2001). Assembly technology for parallel realization of numerical models on MIMD-multicomputers. International Journal on Future Generation Computer Systems, Elsevier Science, 17(6), 755–765. doi:10.1016/S0167-739X(00)00058-3
Krafzig, D., Banke, K., & Slama, D. (2005). Enterprise SOA: Service-Oriented Architecture Best Practices. Upper Saddle River, NJ: Prentice Hall. Krauter, K., Buyya, R., & Maheswaran, M. (2002). A taxonomy and survey of grid resource management systems for distributed computing. Software, Practice & Experience, 32(2), 135–164. doi:10.1002/spe.432 Krishna, V. & Perry, M. (2007). Efficient mechanism Design. Krishnan, S., & Gannon, D. (2004). Xcat3: A framework for cca components as ogsa services. In Proceedings of Hips 2004, 9th International Workshop on High-Level Parallel Programming Models and Supportive Environments. Kubiatowicz, J., Bindel, D., Chen, Y., Eaton, P., Geels, D., Gummadi, R., et al. (2000). OceanStore: An Architecture for Global-Scale Persistent Storage. In Proceedings of the 9th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (pp. 190-201). New York: ACM Press. Kühnemann, M., Rauber, T., & Rünger, G. (2004). A source code analyzer for performance prediction. In Proceedings of IPDPS’04 Workshop on Massively Parallel Processing (WMPP’04. New York: IEEE. Kuksheva, E. A., Malyshkin, V. E., Nikitin, S. A., Snytnikov, A. V., Snytnikov, V. N., & Vshivkov, V. A. (2005). Supercomputer simulation of self-gravitating media. International Journal on Future Generation Computer Systems, 21(5), 749–758. doi:10.1016/j.future.2004.05.019 Kumar, A. (2000). An efficient SuperGrid protocol for high availability and load balancing. IEEE Transactions on Computers, 49(10), 1126–1133. doi:10.1109/12.888048 Kumar, A., Xu, J., & Zegura, E. W. (2005). Efficient and scalable query routing for unstructured peer-to-peer networks. Paper presented at the Proceedings INFOCOM 2005, 24th Annual Joint Conference of the IEEE Computer and Communications Societies.
Kumar, R., Tullsen, D. M., & Jouppi, N. P. (2006). Core Architecture Optimization for Heterogeneous Chip Multiprocessors. In Proceedings of the 15th International Conference on Parallel Architecture and Compilation Techniques (PACT 2006) (pp. 23-32).
Kumar, V. K. P., Hariri, S., & Raghavendra, C. S. (1986). Distributed program reliability analysis. IEEE Transactions on Software Engineering, SE-12, 42–50.
Kumar, R., Tullsen, D. M., Ranganathan, P., Jouppi, N. P., & Farkas, K. I. (2004). Single-ISA Heterogeneous Multi-Core Architecture for Multithreaded Workload Performance. In Proceedings of the 31st International Symposium on Computer Architecture (ISCA’04), June, 2004.
Kurkovsky, S., Bhagyavati, Ray, A., & Yang, M. (2004). Modeling a grid-based problem solving environment for mobile devices. In ITCC (2) (p. 135). New York: IEEE Computer Society.
Kwok, T. T.-O., & Kwok, Y.-K. (2007). Design and Evaluation of Parallel String Matching Algorithms for Network Intrusion Detection Systems (NPC 2007), (LNCS 4672, pp. 344-353). Berlin: Springer.
Lamehamedi, H., Szymanski, B., Shentu, Z., & Deelman, E. (2002). Data replication strategies in grid environments. In Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing (pp. 378–383).
Lamehamedi, H., Szymanski, B., Shentu, Z., & Deelman, E. (2003). Simulation of dynamic data replication strategies in data grids. In Proceedings of the International Parallel and Distributed Processing Symposium (pp. 10–20).
Landers, M., Zhang, H., & Tan, K.-L. (2004). PeerStore: Better performance by relaxing in peer-to-peer backup. In Proceedings of the 4th Intl. Conf. on Peer-to-Peer Computing (pp. 72-79). New York: IEEE Computer Society Press.
Lange, D., & Oshima, M. (1998). Mobile agents with Java: The Aglet API. World Wide Web (Bussum), 1(3). doi:10.1023/A:1019267832048
Laszewski, G. v., Foster, I., & Gawor, J. (2000). CoG Kits: A bridge between commodity distributed computing and high-performance grids. In ACM 2000 Conference on Java Grande (pp. 97-106). San Francisco, CA: ACM Press.
Lee, Y. C., & Zomaya, A. Y. (2006). Data sharing pattern aware scheduling on grids. In Proceedings of International Conference on Parallel Processing, (pp. 365–372).
Laure, E. (2001). OpusJava: A Java framework for distributed high performance computing. Future Generation Computer Systems, 18(2), 235–251. doi:10.1016/S0167-739X(00)00094-7
Legrand, I., Newman, H., Voicu, R., Cirstoiu, C., Grigoras, C., Toarta, M., et al. (2004, September-October). MonALISA: An agent based, dynamic service system to monitor, control and optimize Grid based applications. In Computing in High Energy and Nuclear Physics (CHEP), Interlaken, Switzerland.
Laure, E., Mehrotra, P., & Zima, H. P. (1999). Opus: Heterogeneous computing with data parallel tasks. Parallel Processing Letters, 9(2). doi:10.1142/ S0129626499000256 Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., & Shmoys, H. (1993). Sequencing and Scheduling: Algorithms and Complexity. Amsterdam: North-Holland. Ledlie, J., Serban, L., & Toncheva, D. (2002). Scaling Filename Queries in a Large-Scale Distributed File System. Harvard University, Cambridge, MA.
Lei, M., & Vrbsky, S. V. (2006). A data replication strategy to increase data availability in data grids. In Proceedings of the international conference on grid computing and applications (pp. 221–227). Leslie, M., Davies, J., & Huffman, T. (2006). replication strategies for reliable decentralised storage. In Proceedings of the 1st Workshop on Dependable and Sustainable Peer-to-Peer Systems (pp. 740-747). New York: IEEE Computer Society Press.
Lee, C. (2003). Grid programming models: Current tools, issues and directions. In G. F. Fran Berman, T. Hey, (Eds.), Grid computing (pp. 555–578). New York: Wiley Press.
Leutenegger, S., & Sun, X. (1993). Distributed computing feasibility in a non-dedicated homogeneous distributed system. In Proceedings of SC’93, Portland, OR.
Lee, C., Lee, T.-y., Lu, T.-c., & Chen, Y.-t. (1997). A Worldwide Web Based Distributed Animation Environment. Computer Networks and ISDN Systems, 29, 1635–1644. doi:10.1016/S0169-7552(97)00078-0
Levitin, G., Dai, Y. S., & Ben-Haim, H. (2006). Reliability and performance of star topology grid service with precedence constraints on subtask execution. IEEE Transactions on Reliability, 55(3), 507–515. doi:10.1109/TR.2006.879651
Lee, L. G. (1982). Designing a Bloom filter for differential file access. Communications of the ACM, 25(9), 600–604. doi:10.1145/358628.358632 Lee, S., Ren, X., & Eigenmann, R. (2008). Efficient content search in ishare, a p2p based internet-sharing system. In PCGRID. Lee, S.-W., & Gaudiot, J.-L. (2003). Clustered microarchitecture simultaneous multithreading. In 9th International Euro-Par Conference on Parallel Processing (Euro-Par’03), (LNCS Vol. 2790/2004, pp. 576-585), Klagenfurt, Austria. Berlin/Heidelberg: Springer.
Levitin, G., Dai, Y. S., Xie, M., & Poh, K. L. (2003). Optimizing survivability of multi-state systems with multi-level protection by multi-processor genetic algorithm. Reliability Engineering & System Safety, 82, 93–104. doi:10.1016/S0951-8320(03)00136-4 Li, C.-C. J., Stewart, E. M., & Fuchs, W. K. (1994). Compiler assisted full checkpointing. Software, Practice & Experience, 24, 871–886. doi:10.1002/spe.4380241002 Li, F., Pei, C., Jussara, A., & Andrei, Z. B. (2000). Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw., 8(3), 281–293.
Li, J., Stribling, J., Gil, T. M., Morris, R., & Kaashoek, M. F. (2004). Comparing the performance of distributed hash tables under churn. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 87-99). Berlin: Springer-Verlag. Li, J., Stribling, J., Morris, R., & Kaashoek, M. F. (2005). Bandwidth-efficient management of dht routing tables. In Proceedings of 2nd Symp. on Networked Systems Design and Implementation (pp. 99-114). USENIX Association. Li, S.-Y. R., Yeung, R. W., & Cai, N. (2003, Feb.). Linear network coding. IEEE Transactions on Information Theory, 49(2), 371–381. doi:10.1109/TIT.2002.807285 Li, X., & Gaudiot, J.-L. (2006). Design trade-offs and deadlock prevention in transient fault-tolerant SMT processors. In Proceedings of 12th Pacific Rim International Symposium on Dependable Computing (PRDC’06), (pp. 315-322). Riverside, CA: IEEE Computer Society Press. Li, Z. Zhang, Duan, Z., Gao, L.& Hou, Y.T.(2000). Decoupling QoS control from Core routers: A Novel bandwidth broker architecture for scalable support of guaranteed services. Proc. Of SIGCOMM’00, Stockholm, Sweden, (pp. 71-83). Li, Z., & Mohapatra, P. (2004, January). QoS Aware routing in Overlay networks (QRON). IEEE Journal on Selected Areas in Communications, 22(1). Li, Z., Sun, L., & Ifeachor, E. (2005). Challenges of mobile ad-hoc grids and their applications in e-healthcare. In Proceedings of Second International Conference on Computational Intelligence in Medicine And Healthcare (cimed’ 2005). Li, Z., Xu, X., Hu, W., & Tang, Z. (2006). Microarchitecture and performance analysis of Godson-2 SMT processor. In Proceedings of the 24th International Conference on Computer Design (ICCD’06), (pp. 485-490). San Jose, CA: IEEE Computer Society Press.
Liang, D., & Tripathi, S. (1996). Performance analysis of long-lived transaction processing systems with rollbacks and aborts. IEEE Transactions on Knowledge and Data Engineering, 8(5), 802–815. doi:10.1109/69.542031
Likic, V. (2000). The Needleman-Wunsch algorithm for sequence alignment. The University of Melbourne, Australia.
Lin, M. S., Chang, M. S., Chen, D. J., & Ku, K. L. (2001). The distributed program reliability analysis on ring-type topologies. Computers & Operations Research, 28, 625–635. doi:10.1016/S0305-0548(99)00151-3
Lin, Y., Liu, P., & Wu, J. (2006). Optimal placement of replicas in data grid environments with locality assurance. In Proceedings of the 12th International Conference on Parallel and Distributed Systems (ICPADS’06), 01, 465–474.
Lindholm, T., & Yellin, F. (1999). The Java(TM) Virtual Machine Specification (2nd Ed.). New York: Addison Wesley.
Litchfield, S. (2008). A detailed comparison of Series 60 (S60) Symbian smartphones. Retrieved March 10, 2008, from http://3lib.ukonline.co.uk/s60history.htm
Litke, A., Skoutas, D., & Varvarigou, T. (2004). Mobile grid computing: Changes and challenges of resource management in a mobile grid environment. In Proceedings of Practical Aspects of Knowledge Management (PAKM 2004), Austria.
Little, M. C., Shrivastava, S. K., & Speirs, N. A. (2002)... The Computer Journal, 45(6), 645–652. doi:10.1093/comjnl/45.6.645
Litzkow, M. J., Livny, M., & Mutka, M. W. (1988, June). Condor – a hunter of idle workstations. In 8th International Conference of Distributed Computing Systems (pp. 104–111). San Jose, CA: Computer Society.
Liu, C., & Gaudiot, J.-L. (2008). Resource sharing control in simultaneous multithreading microarchitectures. In Proceedings of the 13th IEEE Asia-Pacific Computer Systems Conference (ACSAC’08), (pp. 1-8). Hsinchu, Taiwan: IEEE Computer Society Press.
Liu, C., Qian, D., Liu, Y., Li, Y., & Wang, C. (2006). RSVP Context Extraction in IP Mobility Environments. Vehicular Technology Conference, 2006, VTC 2006-Spring, IEEE 63rd, (Vol. 2, pp. 756-760). Liu, G. Q., Xie, M., Dai, Y. S., & Poh, K. L. (2004). On program and file assignment for distributed systems. Computer Systems Science and Engineering, 19(1), 39–48. Liu, H., Zheng, K., Liu, B., Zhang, X., & Liu, Y. (2006). A memory-efficient parallel string matching architecture for high-speed intrusion detection. IEEE Journal on Selected Areas in Communications, 24(10), 1793–1804. doi:10.1109/JSAC.2006.877221 Liu, L.-L., Liu, Q., Natsev, A., Ross, K. A., Smith, J. R., & Varbanescu, A. L. (2007, July). Digital media indexing on the cell processor. In 16th international conference on parallel architecture and compilation techniques, Beijing, China (pp. 425–425). Liu, S., & Gaudiot, J.-L. (2007). Synchronization mechanisms on modern multi-core architectures. In Proceedings of the 12th Asia-Pacific Computer Systems Architecture Conference (ACSAC’07), (LNCS Vol. 4697/2007), (pp. 290-303), Seoul, Korea. Berlin/Heidelberg: Springer. Liu, S., & Gaudiot, J.-L. (2008). The potential of finegrained value prediction in enhancing the performance of modern parallel machines. In Proceedings of the 13th IEEE Asia-Pacific Computer Systems Conference (ACSAC’08), (pp. 1-8). Hsinchu, Taiwan: IEEE Computer Society Press.
Locke, C. D., Vogel, D. R., & Mesler, T. J. (1991). Building A Predictable Avionics Platform in Ada. In Proceedings of IEEE Real-Time Systems Symposium. Lodygensky, O., Fedak, G., Cappello, F., Neri, V., Livny, M., & Thain, D. (2003). XtremWeb & Condor: Sharing resources between Internet connected condor pools. In Proceedings of CCGRID’2003, Third International Workshop On Global And Peer-To-Peer Computing (GP2PC’03) (pp. 382–389). Tokyo, Japan. Loo, B. T., Huebsch, R., Stoica, I., & Hellerstein, J. M. (2004). The case for a hybrid p2p search infrastructure. In Proceedings of the 3rd Intl. Workshop on Peer-to-Peer Systems (pp. 141-150). Berlin: Springer-Verlag. Lopez, J., Aeschlimann, M., Dinda, P., Kallivokas, L., Lowekamp, B., & O’Hallaron, D. (1999, June). Preliminary report on the design of a framework for distributed visualization. In Proceedings of the international conference on parallel and distributed processing techniques and applications (PDPTA’99) (pp. 1833–1839). Las Vegas, NV. Lovas, R., Dózsa, G., Kacsuk, P., Podhorszki, N., & Drótos, D. (2004). Workflow support for complex Grid applications: Integrated and portal solutions. In M. Dikaiakos (Ed.): AxGrids 2004, (LNCS 3165, pp. 129-138). Berlin: Springer Verlag. Ludtke, S., Baldwin, P., & Chiu, W. (1999). EMAN: Semiautomated software for high-resolution singleparticle reconstruction. Journal of Structural Biology, 128, 146–157. doi:10.1006/jsbi.1999.4174
Liu, X., Li, V., & Zhang, P. (2006). Joint radio resource management through vertical handoffs in 4G networks IEEE GLOBECOM (pp. 1-5). Washington, DC: IEEE.
Luk, F. T., & Park, H. (1986). An analysis of algorithm-based fault tolerance techniques. SPIE Adv. Alg. and Arch. for Signal Proc., 696, 222–228.
Livny, M., & Raman, R. (1998). High-throughput resource management. In The Grid: Blueprint for a new computing infrastructure (pp. 311-338). San Francisco: Morgan-Kaufmann
Luk, M., Mezzour, G., Perrig, A., & Gligor, V. (2007). MiniSec: A Secure Sensor Network Communication Architecture. Proceedings of IEEE International Conference on Information Processing in Sensor Networks (IPSN), (pp. 479-488).
Loan, C. V. (1992). Computational frameworks for the fast Fourier transform. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Luther, A., Buyya, R., Ranjan, R., & Venugopal, S. (2005). Peer-to-peer grid computing and a .NET-based Alchemi framework. In M. Guo (Ed.), High Performance Computing: Paradigm and Infrastructure. New York: Wiley Press. Retrieved from www.alchemi.net
Malyshkin, V. E. (1995). Functionality in ASSY system and language of functional programming. In Proceedings of the First Aizu International Symposium on Parallel Algorithms/Architecture Synthesis. (pp. 92-97). AizuWakamatsu, Japan: IEEE Comp. Soc. Press.
Luther, A., Buyya, R., Ranjan, R., & Venugopal, S. (2005, June). Alchemi: A .NET-based enterprise grid computing system. In ICOMP’05, Proceedings of the 6th International Conference on Internet Computing, Las Vegas, USA.
Mandelbrot Set. (2008, November). Retrieved from http://mathworld.wolfram.com/MandelbrotSet.html.
Lv, C., Cao, P., Cohen, E., Li, K., & Shenker, S. (2002). Search and replication in unstructured peer-to-peer networks. In Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems (pp.258-259). Ma, M. J. M., Wang, C. L., & Lau, F. C. M. (2000). JESSICA: Java-enabled single-system-image computing architecture. Journal of Parallel and Distributed Computing, 60(10), 1194–1222. doi:10.1006/jpdc.2000.1650 Mahadevan, U., & Ramakrishnan, S. (1994) Instruction scheduling over regions: A framework for scheduling across basic blocks. In Proceedings of the 5th International Conference on Compiler Construction (CC’94), Edinburgh, (LNCS Vol. 786/1994, pp. 419-434). Berlin/ Heidelberg: Springer. Malécot, P., Kondo, D., & Fedak, G. (2006, June). Xtremlab: A system for characterizing internet desktop grids. In Poster in the 15th ieee international symposium on high performance distributed computing hpdc’06. Paris, France. Malyshkin V.E., Sorokin S.B., & K.G.Chauk (2008, May). Fragmented numerical algorithms for the library parallel standard subroutines. Accepted to publication in Siberian Journal of Numerical Mathematics, Novosibirsk, Russia. Malyshkin, V. (2006). How to create the magic wand? Currently implementable formulation of the problem. In New Trends in Software Methodologies, Tools and Techniques, Proceedings of the Fifth SoMeT_06, 147, 127-132.
Manku, G. (2004). Balanced binary trees for ID management and load balance in distributed hash tables. In Proc. of PODC. March, V., Teo, Y. M., & Wang, X. (2007). DGRID: A DHT-based resource indexing and discovery scheme for computational grids. In Proceedings of the 5th Australasian Symp. on Grid Computing and e-Research (pp. 41-48). Australian Computer Society, Inc. March, V., Teo, Y. M., Lim, H. B., Eriksson, P., & Ayani, R. (2005). Collision detection and resolution in hierarchical peer-to-peer systems. In Proceedings of the 30th IEEE Conf. on Local Computer Networks (pp. 2-9). New York: IEEE Computer Society Press. Marcuello, P., & Gonzalez, A. (1999) Exploiting speculative thread-level parallelism on a SMT processor. In Proceedings of the 7th International Conference on High-Performance Computing and Networking (HPCN Europe’99), Amsterdam, the Netherlands, (LNCS Vol. 1593/1999, pp. 754-763) Berlin/Heidelberg: Springer. Marr, D.T., Binns, F., Hill, D.L., Hinton, G., Koufaty, D.A, Miller, J.A., & Upton, M. (2002). Hyper-threading technology architecture and microarchitecture. Intel® Technology Journal, 6(1), 4-15. Marsh, A. (1997). EUROMED - Combining WWW and HPCN to Support Advanced Medical Imaging. Paper presented at the High-Performance Computing and Networking, Vienna, Austria. Mascarenhas, E., & Rego, V. (1995). Ariadne: Architecture of a portable threads system supporting mobile process, (Tech. Rep. No. CSD-TR 95-017). Dept. of Computer Sciences, Purdue University, Southbend, IN.
Mason, R., & Kelly, W. (2005). G2-P2P: A fully decentralized fault-tolerant cycle-stealing framework. In R. Buyya, P. Coddington, & A. Wendelborn (Eds.), AusGrid’05 Australasian Workshop on Grid Computing and e-Research, Newcastle, Australia (Vol. 44 of CRPIT, pp. 33-39).
Matei, R., & Ian, F. (2002). A Decentralized, Adaptive Replica Location Mechanism. Paper presented at the Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing.
Mathe, J., Kuntner, K., Pota, S., & Juhasz, Z. (2003). The use of Jini technology in distributed and grid multimedia systems. In MIPRO 2003, Hypermedia and Grid Systems (pp. 148-151). Opatija, Croatia.
Matjaz, B. J. (2008). BPEL and Java. Retrieved June 15th, 2008, from http://www.theserverside.com/tt/articles/article.tss?l=BPELJava
Matossian, V., Bhat, V., Parashar, M., Peszynska, M., Sen, M., & Stoffa, P. (2005). Autonomic oil reservoir optimization on the grid. Concurrency and Computation: Practice and Experience, 17(1), 1–26. John Wiley and Sons. doi:10.1002/cpe.871
Mattson, T., Sanders, B., & Massingill, B. (2004). Patterns for parallel programming. New York: Addison-Wesley.
Maymounkov, P., & Mazières, D. (2002). Kademlia: A Peer-to-peer Information System Based on the XOR Metric. In Proceedings of the 1st International Workshop on Peer-to-Peer Systems (IPTPS’02) (pp. 53–65).
McGinnis, L., Wallom, D., & Gentzsch, W. (Eds.). (2007). 2nd International Workshop on Campus and Community Grids. Retrieved from http://forge.gridforum.org/sf/go/doc14617?nav=1
McIlroy, M. (1982). Development of a Spelling List. IEEE Transactions on Communications, 30(1), 91-99.
McKenney, P. E., Lee, D. Y., & Denny, B. A. (2008). Traffic generator software release notes.
McKnight, L., Howison, J., & Bradner, S. (2004, July). Wireless grids, distributed resource sharing by mobile, nomadic and fixed devices. IEEE Internet Computing, 8(4), 24–31. doi:10.1109/MIC.2004.14
McNair, J., & Fang, Z. (2004). Vertical handoffs in fourth-generation multinetwork environments. IEEE Wireless Communications, 11(3), 8–15. doi:10.1109/MWC.2004.1308935
Merlin, J. H., Baden, S. B., Fink, S., & Chapman, B. M. (1999). Multiple data parallelism with HPF and KeLP. Future Generation Computer Systems, 15(3), 393–405. doi:10.1016/S0167-739X(98)00083-1
Merton, R. C. (1973). Theory of Rational Option Pricing. The Bell Journal of Economics and Management Science, 4(1). doi:10.2307/3003143
Message Passing Interface Forum. (1994). MPI: A Message Passing Interface Standard. (Technical Report UT-CS-94-230), University of Tennessee, Knoxville, TN.
Messig, M., & Goscinski, A. (2007). Autonomic system management in mobile grid environments. In Proceedings of the Fifth Australasian Symposium on ACSW Frontiers (ACSW’07), (pp. 49–58). Darlinghurst, Australia: Australian Computer Society, Inc.
Metz, C. (2001). Interconnecting ISP networks. IEEE Internet Computing, 5(2), 74–80. doi:10.1109/4236.914650
Meyer, J. (1980). On evaluating the performability of degradable computing systems. IEEE Transactions on Computers, 29, 720–731. doi:10.1109/TC.1980.1675654
Michael, M. (2002). Compressed bloom filters. IEEE/ACM Trans. Netw., 10(5), 604–612.
Microsoft Live Mesh. (2008, November). Retrieved from http://www.mesh.com.
Migliaccio, A. (2006). The Design and Development of a Nomadic Computing Middleware: the Esperanto Broker. PhD Dissertation, Department of Computer and System Engineering, Federico II, University of Naples, Naples, Italy.
Migliardi, M., & Sunderam, V. (1999). The harness metacomputing framework. In Proceedings of Ninth Siam Conference on Parallel Processing for Scientific Computing. San Antonio, TX: SIAM. Miller, R. L. (1993). High Resolution Image Processing on Low-cost Microcomputer. International Journal of Remote Sensing, 14(4), 655–667. doi:10.1080/01431169308904366 Milton, S. (1998). Thread migration in distributed memory multicomputers, (Tech. Rep. No. TR-CS-98-01). Dept. of Comp Sci & Comp Sciences Lab, Australia National University, Acton, Australia. Min, W. H., & Veeravalli, B. (2005, December). Aligning biological sequences on distributed bus networks: a divisible load scheduling approach. Institute of Electrical and Electronic Engineering, 9(4), 489–501. Mislove, A., & Druschel, P. (2004). Providing administrative control and autonomy in structured peer-to-peer overlays. Proceedings of the 3rd Intl. Workshop on Peerto-Peer Systems (pp. 162-172). Berlin: Springer-Verlag. Mitzenmacher, M. (1997). On the analysis of randomized load balancing schemes. In Proc. of SPAA. Mohamed, H. H., & Epema, D. H. (2005). Experiences with the KOALA co-allocating scheduler in multiclusters. International Conference of Cluster Computing and the Grid (pp. 784-791). Cardiff, UK: IEEE Computer Society Press. Mohamed, H., & Epema, D. (in press). KOALA: A co-allocating Grid scheduler. Concurrency and Computation. Mohan, A., & Kalogeraki, V. (2003). Speculative routing and update propagation: a kundali centric approach. Paper presented at the IEEE International Conference on Communications, 2003. Mondal, A., Goda, K., & Kitsuregawa, M. (2003). Effective load-balancing of peer-to-peer systems. In Proc. of IEICE DEWS DBSJ Annual Conference.
Montero, R. S., Huedo, E., & Llorente, I. M. (2008, September/October). Dynamic deployment of custom execution environments in Grids. In 2nd International Conference on Advanced Engineering Computing and Applications in Sciences (ADVCOMP ’08) (pp. 33–38). Valencia, Spain: IEEE Computer Society. Montgomery, D. C. (2004). Design and analysis of experiments (6 ed.). New York: Wiley. Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics Magazine, 38(8). Motwani, R., & Raghavan, P. (1995). Randomized Algorithms. New York: Cambridge University Press. MRML. (2003). MRML- Multimedia Retrieval Markup Language. Retrieved June 15th, 2008, from http://www. mrml.net/ Murphy, A. L., Picco, G. P., & Roman, G. (2001). LIME: a middleware for physical and logical mobility. 21st International Conference on Distributed Computing Systems, (pp. 524-533). Mutka, M. W., & Livny, M. (1987). Profiling workstations’ available capacity for remote execution. In Proceedings of performance-87, the 12th ifip w.g. 7.3 international symposium on computer performance modeling, measurement and evaluation. Brussels, Belgium. Mutka, M., & Livny, M. (1991, July). The available capacity of a privately owned workstation environment. Performance Evaluation, 4(12). Mutz, A., Wolski, R., & Brevik, J. (2007). Eliciting honest value information in a batch-queue environment. In The 8th IEEE/ACM Int’ Conference on Grid Computing (Grid 2007) Austin, Texas, USA. Myers, D. S., Bazinet, A. L., & Cummings, M. P. (2008). Expanding the reach of grid computing: combining globus- and boinc-based systems. In Grids for Bioinformatics and Computational Biology. New York: Wiley. MyGrid. (2008). Retrieved from www.mygrid.org.uk
N’takpé, T., & Suter, F. (2006). Critical path and area based scheduling of parallel task graphs on heterogeneous platforms. In Proceedings of the Twelfth International Conference on Parallel and Distributed Systems (ICPADS) (pp. 3–10), Minneapolis, MN. N’takpé, T., Suter, F., & Casanova, H. (2007). A comparison of scheduling approaches for mixed-parallel applications on heterogeneous platforms. In 6th International Symposium on Parallel and Distributed Computing (pp. 35–42). Hagenberg, Austria: IEEE Computer Press.
Nesargi, S., & Prakash, R. (2002). MANETconf: Configuration of Hosts in a Mobile Ad Hoc Network. In Proceedings of the IEEE Infocom 2002, New York, June 2002. Neuroth, H., Kerzel, M., & Gentzsch, W. (Eds.). (2007). German Grid Initiative D-Grid. Göttingen, Germany: Universitätsverlag Göttingen Publishers. Retrieved from www.d-grid.de/index.php?id=4&L=1
Nabrzyski, J., Schopf, J. M., & Weglarz, J. (2003). Grid Resource Management. Amsterdam: Kluwer Publishing.
Ni, J., Lin, C., Chen, Z., & Ungsunan, P. (2007, September). A Fast Multi-pattern Matching Algorithm for Deep Packet Inspection on a Network Processor. In Proceedings of the International Conference on Parallel Processing (ICPP 2007) (p. 16).
Nakada, H., Matsuoka, S., Seymour, K., Dongarra, J., Lee, C., & Casanova, H. (2003). GridRPC: A remote procedure call API for grid computing.
Nickolls, J., Buck, I., & Garland, M. (2008). Scalable Parallel Programming with CUDA. ACM Queue, March/April, 6(2), 40-53.
Nam, M., Choi, N., Seok, Y., & Choi, Y. (2004). WISE: Energy-efficient interface selection on vertical handoff between 3G networks and WLANs. IEEE PIMRC 2004, 1, (pp. 692-698). Washington, DC: IEEE.
Nicolescu, C., & Jonker, P. (2002). A Data and Task Parallel Image Processing Environment. Parallel Computing, 28, 945–965. doi:10.1016/S0167-8191(02)00105-9
Nanda, P. (2008, January). A three layer policy based architecture supporting Internet QoS. Ph.D. thesis, University of Technology, Sydney, Australia. Naor, M., & Wieder, U. (June 2003). Novel Architectures for P2P applications: The continuous-discrete approach. In Proc. SPAA. National e-Science Centre. (2005). Retrieved from http:// www.nesc.ac.uk. NEESgrid. (2008). Retrieved from www.nees.org/ Nelson, B. J. (1981). Remote Procedure Call. Palo Alto, CA: Xerox - Palo Alto Research Center. Nemirovsky, M. D., Brewer, F., & Wood, R. C. (1991). DISC: dynamic instruction stream computer. In Proceedings of the 24th Annual International Symposium on Microarchitecture (MICRO’91), Albuquerque, NM (pp. 163-171). New York: ACM Press.
Niederl, F., & Goller, A. (Jan, 1998). Method Execution On A Distributed Image Processing Backend. Paper presented at the 6th EUROMICRO Workshop on Parallel and Distributed Processing, Madrid, Spain. Nieuwpoort, R. V. v., Maassen, J., Wrzesinska, G., Hofman, R., Jacobs, C., & Kielmann, T. (2005). Ibis: a flexible and efficient Java-based Grid programming environment. Concurrency and Computation, 17(7/8), 1079-1108. Nieuwpoort, R. V. v., Maassen, J., Wrzesinska, G., Kielmann, T., & Bal, H. E. (2004). Satin: Simple and efficient Java-based grid programming. Journal of Parallel and Distributed Computing Practices. Nisan, N., London, S., Regev, O., & Camiel, N. (1998). Globally distributed computation over the internet - the popcorn project. In International conference on distributed computing systems 1998 (p. 592). New York: IEEE Computer Society.
Norman, T. J., Preece, A., Chalmers, S., Jennings, N. R., Luck, M., & Dang, V. D. (2004). Agent-based formation of virtual organisations. Knowledge-Based Systems, 17, 103–111. doi:10.1016/j.knosys.2004.03.005 Oaks, S., Traversat, B., & Gong, L. (2002). JXTA in a Nutshell. Sebastopol, CA: O’Reilly Media, Inc. Oberhuber, M. (1998). Distributed High-Performance Image Processing on the Internet. Doctoral Thesis, Graz University of Technology, Austria. ObjectWeb. (2004). RUBBoS: Bulletin Board Benchmark. Retrieved June 19, 2008, from http://jmob.objectweb. org/rubbos.html ObjectWeb. (2005). TPC-W Benchmark (Java Servlets version). Retrieved June 19, 2008, from http://jmob. objectweb.org/tpcw.html OGF. (2008). Open Grid Forum. Retrieved from www. ogf.org Oh, J., Lee, S., & Lee, E. (2006). An adaptive mobile system using mobile grid computing in wireless network. In Computational Science And Its Applications - ICCSA 2006 (LNCS Vol. 3984, pp. 49-57). Berlin: Springer. Olukotun, K., & Hammond, L., (September 2005). The Future of Microprocessors. ACM Queue, September, 3(7), 26-29 OMG. (2002). Wireless Access and Terminal Mobility in CORBA Specification. Retrieved June 15th, 2008, from http://www.info.fundp.ac.be/~ven/CIS/OMG/new%20 documents%20from%20OMG%20on%20CORBA/ corba%20wireless.pdf Open Science Grid. (2005). Retrieved from http://www. opensciencegrid.org Open Source Metascheduling for Virtual Organizations with the Community Scheduler Framework (CSF) (Tech. Rep.) (2003, August). Ontario, Canada: Platform Computing. OpenPBS. The portable batch system software. (2005). Veridian Systems, Inc., Mountain View, CA. Retrieved from http://www.openpbs.org/scheduler.html
Oram, A. (2001). Peer-to-Peer: Harnessing the power of disruptive technologies. O’Reilly. Orlando, S., & Perego, R. (1999). COLTHPF, A run-time support for the high-level co-ordination of HPF tasks. Concurrency (Chichester, England), 11(8), 407–434. doi:10.1002/(SICI)1096-9128(199907)11:8<407::AIDCPE435>3.0.CO;2-0 Orlando, S., Palmerini, P., & Perego, R. (2000). Coordinating HPF programs to mix task and data parallelism. In Proceedings of the 2000 ACM Symposium on Applied Computing (SAC’00) (pp. 240–247). New York: ACM Press. Otebolaku, A., Adigun, M., Iyilade, J., & Ekabua, O. (2007). On modeling adaptation in context-aware mobile grid systems. In Icas ’07: Proceedings of the Third International Conference on Autonomic And Autonomous Systems (p. 52). Washington, DC: IEEE Computer Society. Otoo, E., Rotem, D., & Romosan, A. (2004). Optimal File-Bundle Caching Algorithms for Data-Grids. In Sc ’04: Proceedings of the 2004 acm/ieee conference on supercomputing (p. 6). Washington, DC: IEEE Computer Society. p2psip working group. (2008). Peer-to-Peer Session Initiation Protocol Specification. Retrieved June 15th, 2008, from http://www.ietf.org/html.charters/p2psipcharter.html Padala, P., & Wilson, J. N. (2003). GridOS: Operating system services for grid architectures. In High Performance Computing (pp. 353-362). Berlin: Springer. Padala, P., Shin, K. G., Zhu, X., Uysal, M., Wang, Z., Singhal, S., et al. (2007, March). Adaptive control of virtualized resources in utility computing environments. In 2007 Conference on EuroSys (EuroSys 2007) (pp. 289-302). Lisbon, Portugal: ACM Press. Pai-Hsiang, H. (2001). Geographical region summary service for geographical routing. Paper presented at the Proceedings of the 2nd ACM international symposium on Mobile ad hoc networking and computing.
Pairot, C., Garcia, P., Rallo, R., Blat, J., & Gomez Skarmeta, A. F. (2005). The Planet Project: collaborative educational content repositories on structured peer-to-peer grids. CCGrid 2005, IEEE International Symposium on Cluster Computing and the Grid, (Vol. 1, pp. 35-42).
Palankar, M., Onibokun, A., Iamnitchi, A., & Ripeanu, M. (2007). Amazon S3 for Science Grids: a Viable Solution? Poster: 4th USENIX Symposium on Networked Systems Design and Implementation (NSDI’07).
Pantry, S., & Griffiths, P. (1997). The Complete Guide to Preparing and Implementing Service Level Agreements (1st Ed.). London: Library Association Publishing.
Parashar, M., & Browne, J. (2005, Mar). Conceptual and implementation models for the grid. Proceedings of the IEEE, 93(3), 653–668. doi:10.1109/JPROC.2004.842780
Parashar, M., & Hariri, S. (Eds.). (2006). Autonomic computing: Concepts, infrastructure and applications. Boca Raton, FL: CRC Press.
Parashar, M., & Lee, C. A. (2005, March). Scanning the issue: Special issue on grid computing. Proceedings of the IEEE, 93(3), 479-484. Retrieved from http://www.caip.rutgers.edu/TASSL/Papers/proc-ieee-intro-04.pdf
Parashar, M., Matossian, V., Klie, H., Thomas, S. G., Wheeler, M. F., Kurc, T., et al. (2006). Towards dynamic data-driven management of the Ruby Gulch waste repository. In V. N. Alexandrov et al. (Eds.), Proceedings of the Workshop on Distributed Data Driven Applications and Systems, International Conference on Computational Science 2006 (ICCS 2006) (Vol. 3993, pp. 384–392). Berlin: Springer Verlag.
Parekh, A. K., & Gallager, R. G. (1994). A Generalised Processor Sharing Approach to Flow Control in Integrated Services Networks. IEEE/ACM Transactions on Networking, 2(2).
Park, H.-S., Yoon, S.-H., Kim, T.-Y., Park, J.-S., Do, M., & Lee, J.-Y. (2003). Vertical handoff procedure and algorithm between IEEE 802.11 WLAN and CDMA cellular network (LNCS, pp. 103-112). Berlin: Springer.
Park, S., Kim, J., Ko, Y., & Yoon, W. (2003). Dynamic data grid replication strategy based on Internet hierarchy. In Proceedings of the Second International Workshop on Grid and Cooperative Computing (GCC’2003).
Park, S.-M., Ko, Y.-B., & Kim, J.-H. (2003, December). Disconnected operation service in mobile grid computing. In First International Conference on Service Oriented Computing (ICSOC’2003), Trento, Italy.
Pascual, V., Matuszewski, M., Shim, E., Zheng, H., & Song, Y. (2008). P2PSIP Clients. Retrieved June 15th, 2008, from http://tools.ietf.org/id/draft-pascual-p2psip-clients-01.txt
Patel, J., Teacy, L. W. T., Jennings, N. R., Luck, M., Chalmers, S., & Oren, N. (2005). Agent-based virtual organisations for the Grids. International Journal of Multi-Agent and Grid Systems, 1(4), 237–249.
Patterson, D. A., & Hennessy, J. L. Computer Organization and Design (3rd Ed.).
Pavlidou, F. N. (1994). Two-dimensional traffic models for cellular mobile systems. IEEE Transactions on Communications, 42(234), 1505–1511. doi:10.1109/TCOMM.1994.582831
Paxson, V., & Sommer, R. (2007). An Architecture Exploiting Multi-Core Processors to Parallelize Network Intrusion Prevention. In Proceedings of the IEEE Sarnoff Symposium.
Paxson, V., Asanović, K., Dharmapurikar, S., Lockwood, J., Pang, R., Sommer, R., et al. (2006). Rethinking hardware support for network analysis and intrusion prevention. Proceedings of the 1st Conference on USENIX Workshop on Hot Topics in Security.
Pedroso, J., Silva, L., & Silva, J. (1997, June). Web-based metacomputing with JET. In Proc. of the ACM PPoPP Workshop on Java for Science and Engineering Computation.
Pelagatti, S. (2003). Task and Data Parallelism in P3L. In F. A. Rabhi & S. Gorlatch (Eds.), Patterns and skeletons for parallel and distributed computing (pp. 155–186). London: Springer-Verlag.
Pelagatti, S., & Skillicorn, D. B. (2001). Coordinating programs in the network of tasks model. Journal of Systems Integration, 10(2), 107–126. doi:10.1023/A:1011228808844 Pennebaker, W. B., & Mitchell, J. L. (1992). JPEG: Still Image Data Compression Standard (Digital Multimedia Standards). Berlin: Springer. Perez, C. E. (2003). Open Source Distributed Cache Solutions Written in Java. Retrieved June 24, 2008, from http://www.manageability.org/blog/stuff/distributedcache-java Perez, J.M., Bellens, P., Badia, R.M., & Labarta, J. (2007, August). CellSs: Programming the Cell/ B.E. made easier. IBM Journal of R&D, 51(5). Pericas, M., Cristal, A., Cazorla, F. J., Gonzalez, R., Jimenez, D. A., & Valero, M. (2007). A Flexible Heterogeneous Multi-Core Architecture. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, (pp. 13 -24). Perkins, C. (2003). RTP: Audio and Video for the Internet. New York: Addison-Wesley. Persistence of Vision Raytracer. (2008, November). Retrieved from http://www.povray.org Peterson, L., Muir, S., Roscoe, T., & Klingaman, A. (2006, May). PlanetLab Architecture: An Overview (Tech. Rep. No. PDN-06-031). Princeton, NJ: PlanetLab Consortium. Petrini, F. Kerbyson, D. J. & Pakin, S. (2003). The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q. Proceedings of the 2003 ACM/IEEE Conference on Supercomputing. P-GRADE. (2003). Parallel grid run-time and application development environment. Retrieved from www. lpds.sztaki.hu/pgrade/ Pham, H. (2000). Software reliability. Singapore: Springer-Verlag.
Phan, T., Huang, L., & Dulan, C. (2002). Challenge: integrating mobile wireless devices into the computational grid. In Mobicom ’02: Proceedings of the 8th annual international conference on mobile computing and networking (pp. 271–278). New York: ACM Press. Pierson, J.-M. (2006, June). A pervasive grid, from the data side (Tech. Rep. No. RR-LIRIS-2006-015). LIRIS UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon. Retrieved from http://liris.cnrs.fr/publis/?id=2436 Pitas, I. (1993). Parallel Algorithms for Digital Image Processing, Computer Vision and Neural Networks. Chichester, UK: John Wiley & Sons. Piyachon, P., & Luo, Y. (2006). Efficient memory utilization on network processors for deep packet inspection. Proceedings of ACM/IEEE ANCS, (pp. 71-80). Pjesivac-Grbovic, J., Bosilca, G., Fagg, G. E., Angskun, T., & Dongarra, J. J. (2007). MPI Collective Algorithm Selection and Quadtree Encoding. Parallel Computing, 33(9), 613–623. doi:10.1016/j.parco.2007.06.005 PlanetLab Europe. (2008). Retrieved from http://www.planet-lab.eu/. Plank, J. S. (1997, September). A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software, Practice & Experience, 27(9), 995–1012. doi:10.1002/(SICI)1097-024X(199709)27:9<995::AID-SPE111>3.0.CO;2-6 Plank, J. S., & Li, K. (1994). Faster checkpointing with n+1 parity. In FTCS, (pp. 288–297). Plank, J. S., & Thomason, M. G. (2001, November). Processor allocation and checkpoint interval selection in cluster computing systems. Journal of Parallel and Distributed Computing, 61(11), 1570–1590. doi:10.1006/jpdc.2001.1757 Plank, J. S., Beck, M., Kingsley, G., & Li, K. (1995). Libckpt: Transparent checkpointing under Unix. Usenix Winter Technical Conference, (pp. 213-223).
Plank, J. S., Kim, Y., & Dongarra, J. (1997). Fault-tolerant matrix operations for networks of workstations using diskless checkpointing. Journal of Parallel and Distributed Computing, 43(2), 125–138. doi:10.1006/jpdc.1997.1336 Plank, J. S., Li, K., & Puening, M. A. (1998). Diskless checkpointing. IEEE Transactions on Parallel and Distributed Systems, 9(10), 972–986. doi:10.1109/71.730527 Polak, S., Slota, R., Kitowski, J., & Otfinowski, J. (2001). XML-based Tools for Multimedia Course Preparation. Archiwum Informatyki Teoretycznej i Stosowanej, 13, 3–21. Portal, CHRONOS. (2004). Retrieved from http://portal.chronos.org/gridsphere/gridsphere Portio Research. (2008). Slicing Up the Mobile Services Revenue Pie. Retrieved March 10, 2008, from http://www.portioresearch.com/slicing_pie_press.html PPDG. (2006). From fabric to physics (Tech. Rep.). The Particle Physics Data Grid. PRACE. (2008). Partnership for advanced computing in Europe. Retrieved from www.prace-project.eu/ Preston, R. P., Badeau, R. W., Bailey, D. W., Bell, S. L., Biro, L. L., Bowhill, W. J., et al. (2002). Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading. In Digest of Technical Papers of the 2002 IEEE International Solid-State Circuits Conference (ISSCC’02), San Francisco, CA (Vol. 1, pp. 334-472). New York: IEEE Press. Proactive. (2005). ProActive manual REVISED 2.2. ProActive, INRIA. Retrieved from http://www-sop.inria.fr/oasis/Proactive/ Prodan, R., & Fahringer, T. (2008, March). Overhead analysis of scientific workflows in grid environments. IEEE Transactions on Parallel and Distributed Systems, 19(3), 378–393. doi:10.1109/TPDS.2007.70734 Pro-MPEG. (2005). Material eXchange Format (MXF). Retrieved 15th June, 2008, from http://www.pro-mpeg.org.
Pruyne, J., & Livny, M. (1996). A Worldwide Flock of Condors: Load Sharing among Workstation Clusters. Journal on Future Generations of Computer Systems, 12. Qi, Y., Xu, B., He, F., Yang, B., Yu, J., & Li, J. (2007). Towards high-performance flow-level packet processing on multi-core network processors. Proceedings of 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems, (pp. 17-26). Qiu, D., & Srikant, R. (2004). Modeling and performance analysis of BitTorrent-like peer-to-peer networks. Computer Communication Review, 34(4), 367–378. doi:10.1145/1030194.1015508 Quan, D. M. (Ed.). (2008). A Framework for SLA-aware execution of Grid-based workflows. Saarbrücken, Germany: VDM Verlag. Quan, D. M., & Altmann, J. (2007). Business model and the policy of mapping light communication grid-based workflow within the SLA Context. In Proceedings of the International Conference of High Performance Computing and Communication (HPCC07), (pp. 285-295). Berlin: Springer Verlag. Quan, D. M., & Altmann, J. (2007). Mapping a group of jobs in the error recovery of the Grid-based workflow within SLA context. In L. T. Yang (Ed.), Proceedings of the 21st International Conference on Advanced Information Networking and Applications (AINA 2007), (pp. 986-993). New York: IEEE Press. Quasy, H. M. (2004). Middleware for Communications. Chichester, UK: John Wiley & Sons Ltd. Quoitin, B., & Bonaventure, O. (2005). A Co-operative approach to Inter-domain traffic engineering. 1st Conference on Next Generation Internet Networks Traffic Engineering (NGI 2005), Rome, Italy, April 18-20th. Quoitin, B., Uhlig, S., Pelsser, C., Swinnen, L., & Bonaventure, O. (2003). Internet traffic engineering with BGP: Quality of Future Internet Services. Berlin: Springer.
Raasch, S. E., & Reinhardt, S. K. (1999). Applications of thread prioritization in SMT processors. In Proceedings of the 3rd Workshop on Multithreaded Execution and Compilation (MTEAC’99), Orlando, FL. Raasch, S. E., & Reinhardt, S. K. (2003). The impact of resource partitioning on SMT processors. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques (PACT’03), (pp. 15–25). New Orleans, LA: IEEE Computer Society. Radulescu, A., & van Gemund, A. J. C. (2001). A low-cost approach towards mixed task and data parallel scheduling. In Proceedings of the International Conference on Parallel Processing (ICPP’01)(pp. 69–76). New York: IEEE Computer Society. Radulescu, A., Nicolescu, C., van Gemund, A. J. C., & Jonker, P. (2001). CPR: Mixed task and data parallel scheduling for distributed systems. In Proceedings of the 15th International Parallel and Distributed Processing Symposium (IPDPS’01) (pp. 39-46). New York: IEEE Computer Society. Rahman, R. M., Barker, K., & Alhajj, R. (2005). Replica selection in grid environment: A data-mining approach. In Proceedings of the ACM symposium on applied computing (pp. 695–700). Rahman, R. M., Barker, K., & Alhajj, R. (2005). Replica placement in data grid: A multi-objective approach. In Proceedings of the international conference on grid and cooperative computing (pp. 645–656). Rahman, R. M., Barker, K., & Alhajj, R. (2005). Replica placement in data grid: Considering utility and risk. In Proceedings of the international conference on information technology: Coding and computing (ITCC’05) (Vol. 1, pp. 354–359). Raicu, I., Zhao, Y., Dumitrescu, C., Foster, I., & Wilde, M. (2007). Falkon: a fast and light-weight task execution framework. In Ieee/acm supercomputing.
Ramakrishnan, L., Irwin, D., Grit, L., Yumerefendi, A., Iamnitchi, A., & Chase, J. (2006). Toward a doctrine of containment: Grid hosting with adaptive resource control. In 2006 ACM/IEEE Conference on Supercomputing (SC 2006) (p. 101). New York: ACM Press. Raman, R., Livny, M., & Solomon, M. H. (1998). Matchmaking: Distributed resource management for high throughput computing. In HPDC (p. 140). Raman, R., Livny, M., & Solomon, M. H. (1999). Matchmaking: An extensible framework for distributed resource management. Cluster Computing, 2(2), 129–138. doi:10.1023/A:1019022624119 Ramaswamy, S. (1996). Simultaneous exploitation of task and data parallelism in regular scientific computations. Doctoral thesis, University of Illinois at Urbana-Champaign. Ramaswamy, S., Sapatnekar, S., & Banerjee, P. (1997). A framework for exploiting task and data parallelism on distributed memory multicomputers. IEEE Transactions on Parallel and Distributed Systems, 8(11), 1098–1116. doi:10.1109/71.642945 Ramaswamy, S., Simons, B., & Banerjee, P. (1996). Optimizations for efficient array redistribution on distributed memory multicomputers. Journal of Parallel and Distributed Computing, 38(2), 217–228. doi:10.1006/jpdc.1996.0142 Ramjee, R., Li, L., La Porta, T., & Kasera, S. (2002). IP paging service for mobile hosts. Wireless Networks, 8, 427–441. doi:10.1023/A:1016534027402 Ramkumar, B., & Strumpen, V. (1997). Portable checkpointing for heterogeneous architectures. Symposium on fault-tolerant computing, (pp. 58-67). Ranganathan, K., & Foster, I. (2001). Design and evaluation of dynamic replication strategies for a high performance data grid. In Proceedings of the international conference on computing in high energy and nuclear physics (pp. 260-263).
Ranganathan, K., & Foster, I. (2002). Decoupling computation and data scheduling in distributed data intensive applications. In Proceedings of the 11th international symposium for high performance distributed computing (HPDC) (pp. 352–358).
Ranjan, R., Harwood, A., & Buyya, R. (2008, July). Peer-to-peer resource discovery in global grids: A tutorial. IEEE Communication Surveys and Tutorials (COMST), 10(2), 6-33. New York: IEEE Communications Society Press. doi:10.1109/COMST.2008.4564477
Ranganathan, K., & Foster, I. (2003). Simulation studies of computation and data scheduling algorithms for data grids. Journal of Grid Computing, 1(1), 53–62. doi:10.1023/A:1024035627870
Ranjan, R., Rahman, M., & Buyya, R. (2008, May). A decentralized and cooperative workflow scheduling algorithm. In 8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID 2008). Lyon, France: IEEE Computer Society.
Ranganathan, K., & Foster, I. T. (2001). Identifying dynamic replication strategies for a high-performance data grid. In Proceedings of the International Workshop on Grid Computing (GRID’2001) (pp. 75–86). Ranganathan, K., Iamnitchi, A., & Foster, I. (2002). Improving data availability through dynamic modeldriven replication in large peer-to-peer communities. In Proceedings of the 2nd IEEE/ACM international symposium on cluster computing and the grid (CCGRID’02) (pp. 376–381). Ranjan, R. (2007, July). Coordinated resource provisioning in federated grids. Doctoral thesis, The University of Melbourne, Australia. Ranjan, R., Buyya, R., & Harwood, A. (2005, September). A case for cooperative and incentive-based coupling of distributed clusters. In 7th IEEE International Conference on Cluster Computing. Boston, MA: IEEE CS Press. Ranjan, R., Harwood, A., & Buyya, R. (2006, September). SLA-based coordinated superscheduling scheme for computational Grids. In IEEE International Conference on Cluster Computing (Cluster 2006) (pp. 1–8). Barcelona, Spain: IEEE. Ranjan, R., Harwood, A., & Buyya, R. (2008). Coordinated load management in peer-to-peer coupled federated grid systems. (Technical Report GRIDS-TR-2008-2). Grid Computing and Distributed Systems Laboratory, The University of Melbourne, Australia. doi: http://www. gridbus.org/reports/CoordinatedGrid2007.pdf
Rao, A., Lakshminarayanan, K., Surana, S., Karp, R., & Stoica, I. (2003). Load Balancing in structured P2P systems. Proceedings of the 2nd Intl. Workshop on Peerto-Peer Systems (pp. 68-79). Berlin: Springer-Verlag. Rashid, R. F., & Robertson, G. (1981). Accent: A communication oriented network operating system kernel. Proceedings of the eighth acm symposium on operating systems principles, (pp. 64-75). Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Schenker, S. (2001). A scalable content-addressable network. In SIGCOMM’01 Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, (pp. 161172). New York: ACM Press. Retrieved from http://doi. acm.org/10.1145/ 383059.383072 Ratnasamy, S., Handley, M., Karp, R., & Shenker, S. (2002). Topologically aware overlay construction and server selection. In Proc. of INFOCOM. Ratnasamy, S., Stoica, I., & Shenker, S. (2002). Routing algorithms for DHTs: Some open questions. Proceedings the 1st Intl. Workshop on Peer-to-Peer Systems (pp. 4552). Berlin: Springer-Verlag. Rauber, T., & Rünger, G. (1996). The compiler TwoL for the design of parallel implementations. In Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques(PACT’96)(pp. 292-301). Washington, DC: IEEE Computer Society.
Rauber, T., & Rünger, G. (1999). Compiler support for task scheduling in hierarchical execution models. Journal of Systems Architecture, 45(6-7), 483–503. doi:10.1016/S1383-7621(98)00019-8 Rauber, T., & Rünger, G. (1999). Parallel execution of embedded and iterated Runge-Kutta methods. Concurrency (Chichester, England), 11(7), 367–385. doi:10.1002/(SICI)1096-9128(199906)11:7<367::AID-CPE430>3.0.CO;2-G Rauber, T., & Rünger, G. (1999). Scheduling of data parallel modules for scientific computing. In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing (PPSC), SIAM (CD-ROM), San Antonio, TX. Rauber, T., & Rünger, G. (2000). A transformation approach to derive efficient parallel implementations. IEEE Transactions on Software Engineering, 26(4), 315–339. doi:10.1109/32.844492 Rauber, T., & Rünger, G. (2005). TLib - A library to support programming with hierarchical multi-processor tasks. Journal of Parallel and Distributed Computing, 65(3), 347–360. Rauber, T., & Rünger, G. (2006). A data re-distribution library for multi-processor task programming. International Journal of Foundations of Computer Science, 17(2), 251–270. doi:10.1142/S0129054106003814 Rauber, T., & Rünger, G. (2007). Mixed task and data parallel executions in general linear methods. Scientific Programming, 15(3), 137–155. Rauber, T., Reilein-Ruß, R., & Rünger, G. (2004). Group-SPMD programming with orthogonal processor groups. Concurrency and Computation: Practice and Experience, Special Issue on Compilers for Parallel Computers, 16(2-3), 173–195.
Rauber, T., Reilein-Ruß, R., & Rünger, G. (2004). On compiler support for mixed task and data parallelism. In G. R. Joubert, W. E. Nagel, F. J. Peter, & W. V. Walter (Eds.), Parallel Computing: Software Technology, Algorithms, Architectures & Applications. Proceedings of 12th International Conference on Parallel Computing (ParCo’03) (pp. 23–30). New York: Elsevier. Rauber, T., Rünger, G., & Wilhelm, R. (1995). Deriving optimal data distributions for group parallel numerical algorithms. In Proceedings of the Conference on Programming Models for Massively Parallel Computers (PMMP’94) (pp. 33–41). Washington, DC: IEEE Computer Society. Ray, E. (2003). Learning XML. Sebastopol, CA: O’Reilly Media, Inc. Reed, D. A. (2003). Grids: The TeraGrid, and beyond. IEEE Computer, 36(1), 62–68. Reilein-Ruß, R. (2005). Eine komponentenbasierte Realisierung der TwoL Spracharchitektur. PhD Thesis, TU Chemnitz, Fakultät für Informatik, Chemnitz, Germany. Reinhardt, S., & Mukherjee, S. (2000). Transient fault detection via simultaneous multithreading. In ACM SIGARCH Computer Architecture News: Special Issue: Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA’00), (pp. 25-36). Vancouver, Canada: ACM Press. Rekhter, Y., & Li, T. (2002, January). A border gateway protocol 4 (BGP-4): draft-ietf-idr-bgp4-17.txt [Internet draft, work in progress]. Replica Location Service (RLS) (n.d.). Retrieved from http://www.globus.org/toolkit/docs/4.0/data/rls/ Reuters (2007). Global cellphone penetration reaches 50 pct. Retrieved March 10, 2008, from http://investing.reuters.co.uk/news/articleinvesting.aspx?type=media&storyID=nL29172095 Reeves, C. (1993). Modern heuristic techniques for combinatorial problems. Oxford, UK: Oxford Blackwell Scientific Publication.
Rhea, S. C., & Kubiatowicz, J. (2002). Probabilistic location and routing. Paper presented at the IEEE INFOCOM 2002, Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies Proceedings. Rhea, S. C., Eaton, P. R., Geels, D., Weatherspoon, H., Zhao, B. Y., & Kubiatowicz, J. (2003). Pond: The OceanStore prototype. In FAST. Rodriguez, A., Gonzalez, A., & Malumbres, M. P. (2004). Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations. International Conference on Parallel Computing in Electrical Engineering (PARELEC’04), 354-357. Rhea, S., Geels, D., Roscoe, T., & Kubiatowicz, J. (2004). Handling Churn in a DHT. Proceedings of the USENIX Annual Technical Conference (pp. 127-140). USENIX Association. Rhea, S., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S., et al. (2005). OpenDHT: A public DHT service and its uses. In Proceedings of ACM SIGCOMM (pp. 73-84). New York: ACM Press. Ricci, R., Oppenheimer, D., Lepreau, J., & Vahdat, A. (2006, January). Lessons from resource allocators for large-scale multiuser testbeds. SIGOPS Operating Systems Review, 40(1), 25–32. doi:10.1145/1113361.1113369 Richardson, I., & Richardson, I. E. G. (2003). H.264 and MPEG-4 Video Compression: Video Coding for Next Generation Multimedia. Chichester, UK: Wiley. Ripeanu, M., Foster, I., & Iamnitchi, A. (2002). Mapping the Gnutella network: properties of large-scale peer-to-peer systems and implications for system design. IEEE Internet Computing, 6(1), 50-57. Robertazzi, T. (2003). Ten reasons to use divisible load theory. IEEE Computer, 36(5), 63–68. Rodriguez, B. (2002). EDLXML serialization. Retrieved 15th June, 2008, from download.sybase.com/pdfdocs/prg0390e/prsver39edl.pdf Roesch, M. (1999). Snort - lightweight intrusion detection for networks. Proceedings of 13th USENIX LISA Conference, (pp. 229-238).
Roman, M., Kon, F., & Campbell, R. (2001). Reflective Middleware: From your Desk to your Hand. IEEE Communications Surveys, 2(5). Roure, D. D. (2003). Semantic grid and pervasive computing. http://www.semanticgrid.org/GGF/ggf9/gpc/ Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In Middleware’01 Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms, (pp. 329-350). Heidelberg, Germany: SpringerLink. doi: 10.1007/3-540-45518-3 Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of IFIP/ACM Intl. Conf. on Distributed Systems Platforms (pp. 329-350). Berlin: Springer-Verlag. Rowstron, A., & Druschel, P. (2001, November). Pastry: Scalable, distributed object location and routing for largescale peer-to-peer systems. In Proceedings of the 18th ifip/acm international conference on distributed systems platforms (middleware 2001), Heidelberg, Germany. Rowstron, A., & Druschel, P. Pastry. (2001). Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Proc. of the 18th IFIP/ACM Int’l Conf. on Distributed Systems Platforms (Middleware). Rubio-Montero, A., Huedo, E., Montero, R., & Llorente, I. (2007, March). Management of virtual machines on globus Grids using GridWay. In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007) (pp. 1–7). Long Beach, USA: IEEE Computer Society. Ruth, P., Jiang, X., Xu, D., & Goasguen, S. (2005, May). Virtual distributed environments in a shared infrastructure. IEEE Computer, 38(5), 63–69. Ruth, P., McGachey, P., & Xu, D. (2005, September). VioCluster: Virtualization for dynamic computational domain. In IEEE International on Cluster Computing (Cluster 2005) (pp. 1–10). Burlington, MA: IEEE.
Ruth, P., Rhee, J., Xu, D., Kennell, R., & Goasguen, S. (2006, June). Autonomic live adaptation of virtual computational environments in a multi-domain infrastructure. In 3rd IEEE International Conference on Autonomic Computing (ICAC 2006) (pp. 5-14). Dublin, Ireland: IEEE. Saar, C., & Yossi, M. (2003). Spectral bloom filters. Paper presented at the Proceedings of the 2003 ACM SIGMOD international conference on Management of data. Saara Väärtö, S. (Ed.). (2008). Advancing science in Europe. DEISA – Distributed European Infrastructure for Supercomputing Applications. EU FP6 Project. Retrieved from www.deisa.eu/press/DEISA-AdvancingScienceInEurope.pdf SAGA. (2006). SAGA implementation home page Retrieved from http://fortytwo.cct.lsu.edu:8000/SAGA Sahai, A., Graupner, S., Machiraju, V., & Moorsel, A. (2003). Specifying and monitoring guarantees in commercial grids through SLA. In F. Tisworth (Ed.), Proceeding of the 3rd IEEE/ACM CCGrid2003, (pp.292—300). New York: IEEE press. Sairamesh, J., Stanbridge, P., Ausio, J., Keser, C., & Karabulut, Y. (2005, March). Business Models for Virtual Organization Management and Interoperability (Deliverable A - WP8&15 WP - Business & Economic Models No. V.1.5). Deliverable document 01945 prepared for TrustCom and the European Commission. Saito, Y., & Levy, H. M. (2000). Optimistic replication for internet data services. In Proceedings of international symposium on distributed computing (pp. 297–314). Saito, Y., & Shapiro, M. (2005). Optimistic replication. ACM Computing Surveys, 37(1), 42–81. doi:10.1145/1057977.1057980 Saleh, O., & Hefeeda, M. (2006). Modeling and caching of peer-to-peer traffic. In Proc. of 14th IEEE International Conference on Network Protocols (ICNP’06), (pp. 249-258).
Salkintzis, A. K. (2004). Interworking techniques and architectures for WLAN-3G integration toward 4G mobile data networks. IEEE Wireless Communications, 11(3), 50–61. doi:10.1109/MWC.2004.1308950 Salkintzis, A. K., Fords, C., & Pazhyannur, R. (2002). WLAN-GPRS integration for next generation mobile data networks. IEEE Wireless Communications, 9(5), 112–124. doi:10.1109/MWC.2002.1043861 Salsano, S. (2001 October). COPS usage for Diffserv resource allocation (COPS-DRA) [Internet Draft]. Samet, H. (2008, November). The design and analysis of spatial data structures. New York: Addison-Wesley Publishing Company. Sanders, R. (2008). SETI@home looking for more volunteers. Retrieved 10 March, 2008, from http://www. berkeley.edu/news/media/releases/2008/01/02_setiahome.shtml Santos-Neto, E., Cirne, W., Brasileiro, F., & Lima, A. (2004). Exploiting Replication and Data Reuse to Efficiently Schedule Data-intensive Applications on Grids. In Proceedings of the 10th workshop on job scheduling strategies for parallel processing. Santos-Neto, E., Cirne, W., Brasileiro, F., & Lima, A. (2004). Exploiting replication and data reuse to efficiently schedule data-intensive applications on grids. In Proceedings of 10th workshop on job scheduling strategies for parallel processing (Vol. 3277, pp. 210–232). Sarmenta, L. F. G. (2002). Sabotage-tolerance mechanisms for volunteer computing systems. Future Generation Computer Systems, 18(4), 561–572. doi:10.1016/ S0167-739X(01)00077-2 Sarmenta, L. F. G., & Hirano, S. (1999). Bayanihan: Building and studying volunteer computing systems using Java. Future Generation Computer Systems, 15(5/6), 675-686. Saroiu, S., et al. (2002). A Measurement Study of Peerto- Peer File Sharing Systems. In Proc. of MMCN.
Scales, D. J., & Gharachorloo, K. (1997). Towards transparent and efficient software distributed shared memory. Paper presented at the Proceedings of the sixteenth ACM symposium on Operating systems principles. Schiffmann, W., Sulistio, A., & Buyya, R. (2007). Using Revenue Management to Determine Pricing of Reservations. Proc. 3rd International Conference on e-Science and Grid Computing (eScience 2007), Bangalore, India, December 10-13. Schilit, B., Adams, N., & Want, R. (1994). Context-aware computing applications. In Proceedings of Mobile Computing Systems and Applications, (pp. 85-90). Schintke, F., & Reinefeld, A. (2003). Modeling replica availability in large data grids. Journal of Grid Computing, 1(2), 219–227. doi:10.1023/B:GRID.0000024086.50333.0d Schirrmeister, F. (2007). Multi-core Processors: Fundamentals, Trends, and Challenges, Embedded Systems Conference, (pp. 6-15). Schowengerdt, R. A., & Mehldau, G. (1993). Engineering a Scientific Image Processing Toolbox for the Macintosh II. International Journal of Remote Sensing, 14(4), 669–683. doi:10.1080/01431169308904367
Schwiegelshohn, U., & Yahyapour, R. (1999). Resource allocation and scheduling in metasystems. In 7th International Conference on High-Performance Computing and Networking (HPCN Europe ’99) (pp. 851–860). London, UK: Springer-Verlag. Schwiegelshohn, U., & Yahyapour, R. (2000). Fairness in parallel job scheduling. Journal of Scheduling, 3(5), 297–320. doi:10.1002/1099-1425(200009/10)3:5<297::AID-JOS50>3.0.CO;2-D Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., & Dubey, P. (2008). Larrabee: a many-core x86 architecture for visual computing. [TOG]. ACM Transactions on Graphics, 27(3). doi:10.1145/1360612.1360617 Seltzer, M. I., Krinsky, D., & Smith, K. A. (1999). The case for application-specific benchmarking. Workshop on Hot Topics in Operating Systems (pp. 102-109). Rio Rico, AZ: IEEE Computer Society Press. Sensor Networks. Retrieved from http://www.sensornetworks.net.au/network.html SETIstats. (2008). SETI@home Project Statistics. Retrieved March 10, 2008, from http://boincstats.com/stats/project_graph.php?pr=bo
Schüller, F., Qin, J., Nadeem, F., Prodan, R., Fahringer, T., & Mayr, G. (2006). Performance, scalability and quality of the meteorological grid workflow MeteoAG. In Austrian Grid Symposium. Innsbruck, Austria: OCG Verlag.
Seymour, K., Nakada, H., Matsuoka, S., Dongarra, J., Lee, C., & Casanova, H. (2002). Overview of GridRPC: A remote procedure call API for Grid computing. In Proceedings of the Third International Workshop on Grid Computing, Baltimore, MD (LNCS 2536, pp. 274–278). Berlin: Springer.
Schwartz, E. (1980). Computational Anatomy and Functional Architecture of Striate Cortex: A Spatial Mapping Approach to Perceptual Coding. Vision Research, 20, 645–669. doi:10.1016/0042-6989(80)90090-5
Sfiligoi, K. O., Venekamp, G., Yocum, D., Groep, D., & Petravick, D. (2007). Addressing the Pilot security problem with gLExec (Tech. Rep. No. FERMILAB-PUB07-483-CD). Fermi National Laboratory, Batavia, IL.
Schwarz, K., Blaha, P., & Madsen, G. K. (2002). Electronic structure calculations of solids using the WIEN2k package for material sciences. Computer Physics Communications, 147(71).
Shankland, S. (2007). Sun starts bidding adieu to mobilespecific Java. Retrieved March 10, 2008, from http:// www.news.com/8301-13580_3-9800679-39.html?part= rss&subj=news&tag=2547-1_3-0-20 SHARCNET. (2008). Shared Hierarchical Academic Research Computing Network (SHARCNET).
ShareGrid Project. (2008, November). Retrieved from http://dcs.di.unipmn.it/sharegrid. Shen, H., & Xu, C. (2006, April). Hash-based proximity clustering for load balancing in heterogeneous DHT networks. In Proc. of IPDPS. Shen, H., & Xu, C.-Z. (2007). Locality-aware and Churn-resilient load balancing algorithms in structured peer-to-peer networks. [TPDS]. IEEE Transactions on Parallel and Distributed Systems, 18(6), 849–862. doi:10.1109/TPDS.2007.1040 Shen, H., Xu, C., & Chen, G. (2006). Cycloid: A scalable constant-degree P2P overlay network. Performance Evaluation, 63(3), 195–216. doi:10.1016/j.peva.2005.01.004 Shen, J. P., & Lipasti, M. (2004). Modern Processor Design: Fundamentals of Superscalar Processors (1st Ed.). Shen, W., & Zeng, Q.-A. (2007). Cost-function-based network selection strategy in heterogeneous wireless networks. IEEE International Symposium on Ubiquitous Computing and Intelligence (UCI-07). Washington, DC: IEEE. Shen, W., & Zeng, Q.-A. (2008). Cost-function-based network selection strategy in integrated heterogeneous wireless and mobile networks. To appear in IEEE Transactions on Vehicular Technology. Sheridan, P. (1996). Spiral Architecture for Machine Vision. Doctoral Thesis, University of Technology, Sydney. Sheridan, P., Hintz, T., & Alexander, D. (2000). Pseudo-invariant Image Transformations on a Hexagonal Lattice. Image and Vision Computing, 18(11), 907–917. doi:10.1016/S0262-8856(00)00036-6 Shi, W., Lee, H.-H., Ghosh, M., & Lu, C. (2004). Architectural support for high speed protection of memory integrity and confidentiality in multiprocessor systems. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT’04), Antibes Juan-les-Pins, France (pp. 123-134). New York: IEEE Computer Society.
Shin, C.-H., & Gaudiot, J.-L. (2006). Adaptive dynamic thread scheduling for simultaneous multithreaded architectures with a detector thread. Journal of Parallel and Distributed Computing, 66(10), 1304–1321. doi:10.1016/j.jpdc.2006.06.003 Shin, C.-H., Lee, S.-W., & Gaudiot, J.-L. (2003). Dynamic scheduling issues in SMT architectures. In Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS’03), Nice, France, (p. 77b). New York: IEEE Computer Society. Shirts, M., & Pande, V. (2000). Screen savers of the world, unite! Science, 290, 1903–1904. doi:10.1126/science.290.5498.1903 Shoch, J. F., & Hupp, J. A. (1982, March). The “worm” programs - early experience with a distributed computation. Communications of the ACM, 25(3). Shoykhet, A., Lange, J., & Dinda, P. (2004, July). Virtuoso: A System For Virtual Machine Marketplaces [Technical Report No. NWU-CS-04-39]. Evanston/Chicago: Electrical Engineering and Computer Science Department, Northwestern University. Siagri, R. (2007). Pervasive computers and the GRID: The birth of a computational exoskeleton for augmented reality. In 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The foundations of software engineering (pp. 1-4), Croatia. Siddiqui, M., Villazón, A., & Fahringer, T. (2006). Grid capacity planning with negotiation-based advance reservation for optimized QoS. In 2006 ACM/IEEE Conference on Supercomputing (SC 2006) (pp. 21–21). New York: ACM. Siddiqui, M., Villazon, A., Hoffer, J., & Fahringer, T. (2005). GLARE: A Grid activity registration, deployment, and provisioning framework. Supercomputing Conference. Seattle, WA: IEEE Computer Society Press.
Siegel, H. J., Armstrong, J. B., & Watson, D. W. (1992). Mapping Computer-Vision-Related Tasks onto Reconfigurable Parallel-Processing Systems. IEEE Computer, 25(2), 54–63. Siegel, L. J., Siegel, H. J., & Feather, A. E. (1982). Parallel Processing Approaches to Image Correlation. IEEE Transactions on Computers, 31(3), 208–218. doi:10.1109/ TC.1982.1675976 Silagadze, Z. (1997). Citations and the Mandelbrot-Zipf’s law. Complex Systems, 11, 487–499. Silva, L. M., & Silva, J. G. (1998). An experimental study about diskless checkpointing. In EUROMICRO’98, (pp. 395–402). SIMDAT. (2008). Grids for industrial product development. Retrieved from www.scai.fraunhofer.de/ about_simdat.html Simmonds, A., & Nanda, P. (2002). Resource Management in Differentiated Services Networks. In C McDonald (Ed.), Proceedings of ‘Converged Networking: Data and Real-time Communications over IP,’ IFIP Interworking 2002, Perth, Australia, October 14 - 16, (pp. 313 – 323). Amsterdam: Kluwer Academic Publishers. Singh, M. P., & Vouk, M. A. (1997). Scientific workflows: Scientific computing meets transactional workflows. Retrieved January 13, 2006 from http://www.csc.ncsu. edu/faculty/mpsingh/papers/databases/workf lows / sciworkflows.html Sinharoy, B., Kalla, R. N., Tendler, J. M., Eickemeyer, R. J., & Joyner, J. B. (2005). Power5 system microarchitecture. IBM Journal of Research and Development, 49(4/5), 505–521. Sips, H. J., & van Reeuwijk, C. (2004). An integrated annotation and compilation framework for task and data parallel programming in Java. In Parallel Computing (PARCO): Software Technology, Algorithms, Architectures and Applications (pp. 111–118). New York: Elsevier.
Skillicorn, D. B. (1999). The network of tasks model, (TR1999-427). Queen’s University, Kingston, Canada. Smallen, S., Casanova, H., & Berman, F. (2001, Nov.). Tunable on-line parallel tomography. In Proceedings of Supercomputing’01, Denver, CO. Smarr, L., & Catlett, C. E. (1992, June). Metacomputing. Communications of the ACM, 35(6), 44–52. doi:10.1145/129888.129890 SMIL/W3C. (2005). SMIL - Synchronized Multimedia Integration Language. Retrieved June 15th, 2008 from http://www.w3.org/AudioVideo/ Smith, B. J. (1981). Architecture and applications of the HEP multiprocessor computer system. In SPIE Proceedings of Real Time Signal Processing IV, 298, 241-248. Smith, P., & Hutchinson, N. C. (1998). Heterogeneous process migration: The Tui system. Software, Practice & Experience, 28(6), 611–639. doi:10.1002/(SICI)1097-024X(199805)28:6<611::AID-SPE169>3.0.CO;2-F SMPTE. (2004). Metadata dictionary registry of metadata element descriptions. Retrieved June 15th, 2008, from http://www.smpte-ra.org/mdd/rp210-8.pdf Snavely, A., & Weinberg, J. (2006). Symbiotic space-sharing on SDSC’s DataStar system. Job Scheduling Strategies for Parallel Processing. (LNCS 4376, pp. 192-209). St. Malo, France: Springer Verlag. Snytnikov, V. N., Vshivkov, V. A., Kuksheva, E. A., Neupokoev, E. V., Nikitin, S. A., & Snytnikov, A. V. (2004). Three-dimensional numerical simulation of a nonstationary gravitating n-body system with gas. Astronomy Letters, 30(2), 124–138. doi:10.1134/1.1646697 SOAP/W3C. (2003). SOAP Version 1.2 Part 1: Messaging Framework. Retrieved June 15th, 2008, from http://www.w3.org/TR/2003/REC-soap12-part1-20030624/ Soh, H., Shazia Haque, S., Liao, W., & Buyya, R. (2006). Grid programming models and environments. In Yuan-Shun Dai, et al. (Eds.) Advanced parallel and distributed computing (pp. 141–173). Hauppauge, NY: Nova Science Publishers.
Sohi, G. S., Breach, S. E., & Vijaykumar, T. N. (1995). Multiscalar processors. Proceedings of 22nd Annual International Symposium on Computer Architecture, (pp. 414-425).
Srinivasan, S. H. (2005). Pervasive wireless grid architecture. In Proceedings of The Second Annual Conference on Wireless On-demand Network Systems and Services (pp.83-88), Switzerland.
Song, H., Dharmapurikar, S., Turner, J., & Lockwood, J. (2005). Fast hash table lookup using extended bloom filter: an aid to network processing. Paper presented at the Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications.
Ssu, K., Yao, B., & Fuchs, W. K. (1999). An adaptive checkpointing protocol to bound recovery time with message logging. Symposium on reliable distributed systems, (pp. 244-252).
Song, Q., & Jamalipour, A. (2005). Network selection in an integrated wireless LAN and UMTS environment using mathematical modeling and computing techniques. IEEE Wireless Communications, 12(3), 42–48. doi:10.1109/ MWC.2005.1452853 Song, Y., Jiang, X., Zheng, H., & Deng, H. (2008). P2PSIP Client Protocol. Retrieved June 15th, 2008, from http:// tools.ietf.org/id/draft-jiang-p2psip-sep-01.txt. Sonnek, J. D., Nathan, M., Chandra, A., & Weissman, J. B. (2006). Reputation-based scheduling on unreliable distributed infrastructures. In ICDCS (p. 30). Spooner, D. P., Jarvis, S. A., Cao, J., Saini, S., & Nudd, G. R. (2003). Local grid scheduling techniques using performance prediction. In S. Govan (Ed.), IEEE Proceedings - Computers and Digital Techniques Vol 150, (pp. 87-96). New York: IEEE Press. Spring.NET. (2008, November). Retrieved from http:// www.springframework.net. Squyres, J. M., Lumsdaine, A., & Stevenson, R. L. (1995). A Cluster-based Parallel Image Processing Toolkit. Paper presented at the IS&T Conference on Image and Video Processing, San Jose, CA. SRB (Storage Resource Broker) (n.d.). Retrieved from http://www.sdsc.edu/srb/index.php/Main_Page Srinivasan, R. (1995). XDR: External Data Representation Standard (Tech. Rep. No. RFC 1832).
Steen, van M., Homburg, P., & Tanenbaum, A. S. (1999). Globe: a wide area distributed system. Concurrency, IEEE [See also IEEE Parallel & Distributed Technology], 7, 70-78. Stellner, G. (1996). Cocheck: Checkpointing and process migration for mpi. Proceedings of 10th international parallel processing symposium. Stemm, M., & Katz, R. H. (1998). Vertical handoffs in wireless overlay networks. ACM Mobile Networking (MONET) [New York: ACM.]. Special Issue on Mobile Networking in the Internet, 3(4), 335–350. Stets, R., Dwarkadas, S., Hardavellas, N., Hunt, G., Kontothanassis, L., & Parthasarathy, S. (1997). Cashmere2L: software coherent shared memory on a clustered remote-write network. SIGOPS Oper. Syst. Rev., 31(5), 170–183. doi:10.1145/269005.266675 Stevenson, R. L., Adams, G. B., Jamieson, L. H., & Delp, E. J. (1993, April). Parallel Implementation for Iterative Image Restoration Algorithms on a Parallel DSP Machine. The Journal of VLSI Signal Processing, 5, 261–272. doi:10.1007/BF01581300 Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of ACM SIGCOMM (pp. 149-160). New York: ACM Press. Stone, N. (2004). GWD-I: An architecture for grid checkpoint recovery services and a GridCPR API. Retrieved October 15, 2006 from http://gridcpr.psc.edu/GGF/docs/ draft-ggf-gridcpr-Architecture-2.0.pdf
Storz, O., Friday, A., & Davies, N. (2003, October). Towards ‘ubiquitous’ ubiquitous computing: an alliance with ‘the grid’. In Proceedings of the First Workshop On System Support For Ubiquitous Computing Workshop (UBISYS 2003) in association with Fifth International Conference On Ubiquitous Computing, Seattle, WA. Retrieved from http://ciae.cs.uiuc.edu/ubisys/papers/ alliance-w-grid.pdf Stuart, W., & Koch, T. (2000). The Dublin Core Metadata Initiative: Mission, Current Activities, and Future Directions, (Vol. 6). Retrieved June 15th, 2008, from http:/ www/dlib.org/dlib/december00/weibel/12weibel.html Su, M. EI-kady, I., Bader, D. A., & Lin, S. (2004, August). A Novel FDTD Application Featuring OpenMP-MPI Hybrid Parallelization. In 33rd international conference on parallel processing(icpp) Montreal, Canada, (pp. pp. 373–379). Subhlok, J., & Vondran, G. (1995). Optimal mapping of sequences of data parallel tasks. ACM SIGPLAN Notices, 30(8), 134–143. doi:10.1145/209937.209951 Subhlok, J., & Yang, B. (1997). A new model for integrated nested task and data parallel programming. In Proceedings of the 6th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming (pp. 1–12). New York: ACM Press. Sun, M., Sun, J., Lu, E., & Yu, C. (2005). Ant algorithm for file replica selection in data grid. In Proceedings of the first international conference on semantics, knowledge, and grid (SKG 2005) (pp. 64–66). Sun, Y., & Xu, Z. (2004). Grid replication coherence protocol. In Proceedings of the 18th international parallel and distributed processing symposium (pp. 232–239). SunGrid. (2005). Sun utility computing. Retrieved from www.sun.com/service/sungrid/ SURA Southeastern Universities Research Association. (2007). The Grid technology cookbook: Programming concepts and challenges. Retrieved from www.sura.org/ cookbook/gtcb/
948
Suter, F., Desprez, F., & Casanova, H. (2004). From heterogeneous task scheduling to heterogeneous mixed parallel scheduling. In Proceedings of the 10th International Euro-Par Conference (Euro-Par’04), (LNCS: Vol. 3149, pp. 230–237). Pisa, Italy: Springer. Sutter, H., & Larus, J. (2005). Software and the concurrency revolution. ACM Queue; Tomorrow’s Computing Today, 3(7), 54–62. doi:10.1145/1095408.1095421 Svirskas, A., Arevas, A., Wilson, M., & Matthews, B. (2005, October). Secure and trusted virtual organization management. ERCIM News (63). Taesombut, N., & Chien, A. (2004). Distributed virtual computer (DVC): Simplifying the development of high performance grid applications. In Workshop on Grids and Advanced Networks (GAN ’04), IEEE Cluster Computing and the Grid (CCGrid 2004) Conference, Chicago. Taflove, A., & Hagness, S. (2000). Computational Electrodynamics: The Finite-Difference Time-Domain Method, second edition. Boston: Artech House. Tai, A., Meyer, J., & Avizienis, A. (1993). Performability enhancement of fault-tolerant software. IEEE Transactions on Reliability, 42(2), 227–237. doi:10.1109/24.229492 Tanenbaum, A. S., & Steen, M. V. (2008). Distributed Systems: Principles and Paradigms. Upper Saddle River, NJ: Prentice Hall. Tang, F. L., Li, M. L., & Huang, Z. X. (2004). Real-time transaction processing for autonomic Grid applications. Engineering Applications of Artificial Intelligence, 17(7), 799–807. doi:10.1016/S0952-1976(04)00122-8 Tang, M., Lee, B., Tang, X., & Yeo, C. K. (2005). Combining data replication algorithms and job scheduling heuristics in the data grid. In Proceedings of European conference on parallel computing (pp. 381–390). Tang, M., Lee, B., Yeo, C., & Tang, X. (2005). Dynamic replication algorithms for the multi-tier data grid. Future Generation Computer Systems, 21(5), 775–790. doi:10.1016/j.future.2004.08.001
Tang, M., Lee, B., Yeo, C., & Tang, X. (2006). The impact of data replication on job scheduling performance in the data grid. Future Generation Computer Systems, 22(3), 254–268. doi:10.1016/j.future.2005.08.004
Teo, Y. M., & Mihailescu, M. (2008). Collision avoidance in hierarchical peer-to-peer systems. In Proceedings of 7th Intl. Conf. on Networking (pp. 336-341). New York: IEEE Computer Society Press.
Tang, X., & Xu, J. (2005). QoS-aware replica placement for content distribution. IEEE Transactions on Parallel and Distributed Systems, 16(10), 921–932. doi:10.1109/TPDS.2005.126
Terzis, A., Wang, L., Ogawa, J. & Zhang, L. (1999, December). A two tier resource management model for the Internet, Global Internet, (pp. 1808 – 1817).
Tanin, E., Harwood, A., & Samet, H. (2007). Using a distributed quadtree index in peer-to-peer networks. [Heidelberg, Germany: SpringerLink.]. The VLDB Journal, 16(2), 165–178. doi:10.1007/s00778-005-0001-y Taubman, D., & Marcellin, M. (2001). JPEG2000: Image Compression Fundamentals, Standards and Practice. Berlin: Springer. Taufer, M., Anderson, D., Cicotti, P., & Brooks, C. L., III. (2005). Homogeneous redundancy: a technique to ensure integrity of molecular simulation results using public computing. In Proceedings of The International Heterogeneity In Computing Workshop. TAVERNA. (2008). The Taverna Workbench 1.7. Retrieved from http://taverna.sourceforge.net/ Taylor, I., Shields, M., Wang, I., & Philp, R. (2003). Distributed P2P computing within Triana: A galaxy visualization test case. In International Parallel and Distributed Processing Symposium (IPDPS’03). Nice, France: IEEE Computer Society Press.
Thain, D., & Livny, M. (2004). Building reliable clients and services. In The grid2 (pp. 285–318). San Francisco: Morgan Kaufman. Thain, D., Tannenbaum, T., & Livny, M. (2002). Condor and the grid. John Wiley & Sons Inc. Thatte, S. (2003). BPEL4WS, business process execution language for web services. Retrieved June 15th, 2008, from http://xml.coverpages.org/ni2003-04-16-a.html The EU Data Grid Project (n.d.). Retrieved from http:// www.eu-datagrid.org/. The Globus Alliance (n.d.). Retrieved from http://www. globus.org/ The seti@home project. Retrieved from http://setiathome. ssl.berkeley.edu/ The TrustCoM Project. (2005). Retrieved from http:// www.eu-trustcom.com. Theimer, M. M., Lantz, K. A., & Cheriton, D. R. (1985). Preemptable remote execution facilities for the v-system. SIGOPS Oper. Syst. Rev., 19(5), 2–12. doi:10.1145/323627.323629
Taylor, M. B., Lee, W., Miller, J., Wentzlaff, D., Bratt, I., Greenwald, B., et al. (2004). Evaluation of the raw microprocessor: An exposed-wire-delay architecture for ilp and streams. Proceedings of 31st Annual International Symposium on Computer Architecture, (pp. 2-13).
Theiner, D., & Rutschmann, P. (2005). An inverse modelling approach for the estimation of hydrological model parameters. (I. Publishing, Ed.) Journal of Hydroinformatics.
Tendler, J. M., Dodson, J. S. Jr, Fields, J. S., Le, H., & Sinharoy, B. (2002). Power4 system microarchitecture. IBM Journal of Research and Development, 46(1), 5–25.
Thistle, M. R., & Smith, B. J. (1988). A processor architecture for Horizon. In Proceedings of the 1988 ACM/IEEE conference on Supercomputing (SC’88), Orlando, FL, (pp. 35-41). New York: IEEE Computer Society Press.
949
Compilation of References
Thomasian, A. (1997). A performance comparison of locking methods with limited wait depth. IEEE Transactions on Knowledge and Data Engineering, 9(3), 421–434. doi:10.1109/69.599931 Thornton, J. E. (1970). Design of a computer - the Control Data 6600. Upper Saddle River, NJ: Scott Foresman & Co. Thulasiraman, P., Khokhar, A., Heber, G., & Gao, G. (2004, Jan.). A fine-grain load adaptive algorithm of the 2D discrete wavelet transform for multithreaded architectures. [JPDC]. Journal of Parallel and Distributed Computing, 64(1), 68–78. doi:10.1016/j.jpdc.2003.06.003 Thulasiraman, P., Theobald, K. B., Khokhar, A. A., & Gao, G. R. (2000, July). Multithreaded algorithms for the fast Fourier transform. In ACM Symposium on Parallel Algorithms and Architectures, Winnipeg, Canada, (pp. 176-185). Tian, D., & Xiang, Y. (2008). A multi-core supported intrusion detection system. Proceedings of IFIP International Conference on Network and Parallel Computing. Tian, R., Xiong, Y., Zhang, Q., Li, B., Zhao, B. Y., & Li, X. (2005). Hybrid Overlay Structure Based on Random Walks. In Proceedings of the 4th Intl. Workshop on Peer-to-Peer Systems (pp. 152-162). Berlin: Springer-Verlag.
Trelles, Andrade, Valencia, Zapata, & Carazo. (1998, June). Computational space reduction and parallelization of a new clustering approach for large groups of sequences. Bioinformatics (Oxford, England), 14(5), 439–451. doi:10.1093/bioinformatics/14.5.439 TRIANA. (2003). The Triana Project. Retrieved from www.trianacode.org/ Tripp, G. (2006). A parallel “string matching engine” for use in high speed network intrusion detection systems. Journal in Computer Virology, 2(1), 21–34. doi:10.1007/s11416-006-0010-4 Tsaregorodtsev, A., Garonne, V., & Stokes-Rees, I. (2004). DIRAC: A scalable lightweight architecture for high throughput computing. In Fifth IEEE/ACM International Workshop On Grid Computing (Grid’04). Tseng, Y.-C., Shen, C.-C., & Chen, W.-T. (2003). Integrating Mobile IP with ad hoc networks. IEEE Computer, May, 48-55. Tsouloupas, G., & Dikaiakos, M. D. (2007). GridBench: A tool for the interactive performance exploration of Grid infrastructures. Journal of Parallel and Distributed Computing, 67(9), 1029–1045. doi:10.1016/j.jpdc.2007.04.009 Tsoumakos, D., & Roussopoulos, N. (2006). Analysis and comparison of P2P search methods. In Proceedings of the 1st International Conference on Scalable Information Systems (INFOSCALE 2006), No. 25.
Tirado-Ramos, A., Tsouloupas, G., Dikaiakos, M. D., & Sloot, P. M. (2005). Grid resource selection by application benchmarking: A computational haemodynamics case study. International Conference on Computational Science. (LNCS 3514, pp. 534-543). Atlanta, GA: Springer Verlag.
Tuck, N., & Tullsen, D. M. (2005). Multithreaded value prediction. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA’05), (pp. 5-15), San Francisco: IEEE Computer Society.
TOP500. (2007). TOP 500 Supercomputer Sites, Performance Development, November 2007. Retrieved March 10, 2008 from http://www.top500.org/lists/2007/11/performance_development
Tullsen, D. M., & Brown, J. A. (2001). Handling longlatency loads in a simultaneous multithreading processor. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO’01), (pp. 318–327). Austin, TX: IEEE Computer Society.
950
Compilation of References
Tullsen, D. M., Eggers, S. J., & Levy, H. M. (1995). Simultaneous multithreading: maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA’95), Santa Margherita Ligure, Italy (pp. 392-403). New York: ACM Press. Tullsen, D. M., Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., & Stamm, R. L. (1996). Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture (ISCA’96), Philadelphia, (pp. 191–202). New York: ACM Press. Tullsen, D. M., Lo, J. L., Eggers, S. J., & Levy, H. M. (1999). Supporting fine-grained synchronization on a simultaneous multithreading processor. In Proceedings of the 5th International Symposium on High Performance Computer Architecture (HPCA’99), Orlando, FL (pp. 54-58). New York: IEEE Computer Society. Turner, D., & Chen, X. (2002). Protocol-dependent message-passing performance on linux clusters. Proceedings of the 2002 IEEE International Conference on Linux Clusters, pp. 187-194. New York: IEEE Computer Society. TV-Anytime. (2005). TV-Anytime. Retrieved June 15th, 2008, from http://www.tv-anytime.org UDDI. (2004). UDDI Version 3.0.2. Retrieved June 15th, 2008, from http://www.Oasis-Open.org/committees/ uddi-spec/doc/spec/v3/uddi-v3.0.2-20041019.Htm Uhlig, S., Bonaventure, O., & Quoitin, B. (2003). Internet traffic engineering with minimal BGP configuration. 18th International Teletraffic Congress. Unicore (n.d.). Retrieved from http://unicore.sourceforge.net UNICORE. (2008). UNiform Interface to COmputing Resources. Retrieved from www.unicore.eu/ Ururahy, C., & Rodriguez, N. (2004). Programming and coordinating grid environments and applications. In Concurrency and computation: Practice and experience.
Vakali, A., & Pallis, G. (2003). Content Delivery Networks: Status and Trends. IEEE Internet Computing, 7(6), 68–74. doi:10.1109/MIC.2003.1250586 Valkovskii, V. A., & Malyshkin, V. E. (1988). Synthesis of parallel programs and systems on the basis of computational models. Novosibirsk, Russia: Nauka. van der Houwen, P. J., & Messina, E. (1999). Parallel Adams methods. Journal of Computational and Applied Mathematics, 101(1-2), 153–165. doi:10.1016/S0377-0427(98)00214-3 van der Houwen, P. J., & Sommeijer, B. P. (1991). Iterated Runge-Kutta methods on parallel computers. SIAM Journal on Scientific and Statistical Computing, 12(5), 1000–1028. doi:10.1137/0912054 Van der Wijngaart, R. F., & Frumkin, M. A. (2004). Evaluating the Information Power Grid using the NAS Grid benchmarks. International Parallel and Distributed Processing Symposium. Santa Fe, NM: IEEE Computer Society Press. van der Wijngaart, R. F., & Jin, H. (2003). The NAS parallel benchmarks, multi-zone versions (No. NAS-03-010). NASA Ames Research Center, Moffett Field, CA. van Reeuwijk, C., Kuijlman, F., & Sips, H. J. (2003). Spar: A set of extensions to Java for scientific computation. Concurrency and Computation, 15, 277–299. doi:10.1002/cpe.659 Vanneschi, M. (2002). The programming model of ASSIST, an environment for parallel and distributed portable applications. Parallel Computing, 28(12), 1709–1732. doi:10.1016/S0167-8191(02)00188-6 Vanneschi, M., & Veraldi, L. (2007). Dynamicity in distributed applications: Issues, problems and the ASSIST approach. Parallel Computing, 33(12), 822–845. doi:10.1016/j.parco.2007.08.001 Vazhkudai, S. (2003, Nov). Enabling the co-allocation of grid data transfers. In Proceedings of the fourth international workshop on grid computing (pp. 41–51).
951
Compilation of References
Vazhkudai, S., Ma, X., Freeh, V. W., Strickland, J., Tammineedi, N., & Scott, S. (2005). FreeLoader: Scavenging desktop storage resources for scientific data. In Proceedings of Supercomputing 2005 (SC’05), Seattle, WA.
Venugopal, S., Nadiminti, K., Gibbins, H., & Buyya, R. (2008). Designing a resource broker for heterogeneous Grids. Software, Practice & Experience, 38(8), 793–825. doi:10.1002/spe.849
Vazhkudai, S., & Syed, J., & Maginnis T. (2002). PODOS The design and implementation of a performance oriented Linux cluster. Future Generation Computer Systems, 18(3), 335–352. doi:10.1016/S0167-739X(01)00055-3
Villa, O., Scarpazza, D. P., & Petrini, F. (2008). Accelerating real-time string searching with multicore processors. IEEE Computer, 41(4), 42–50.
Vazhkudai, S., Tuecke, S., & Foster, I. (2001). Replica selection in the Globus data grid. In Proceedings of the first IEEE/ACM international conference on cluster computing and the grid (CCGRID 2001) (pp. 106–113). Vázquez-Poletti, J. L., Huedo, E., Montero, R. S., & Llorente, I. M. (2007). A comparison between two grid scheduling philosophies: EGEE WMS and GridWay. Multiagent and Grid Systems, 3(4), 429–439. Vecchiola, C., & Chu, X. (2008). Aneka tutorial series on developing task model applications. (Technical Report). Grid Computing and Distributed Systems Laboratory, The University of Melbourne, Australia. Veldema, R., Hofman, R. F. H., Bhoedjang, R., & Bal, H. E. (2001). Runtime optimizations for a Java DSM implementation. Paper presented at the Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande. Venugopal, S., & Buyya, R. (2005, Oct). A deadline and budget constrained scheduling algorithm for e-science applications on data grids. In Proceedings of the 6th international conference on algorithms and architectures for parallel processing (ICA3PP-2005) (pp. 60–72). Venugopal, S., Buyya, R., & Ramamohanarao, K. (2006). A taxonomy of data grids for distributed data sharing, management, and processing. ACM Computing Surveys, 1, 1–53. Venugopal, S., Buyya, R., & Winton, L. (2004). A grid service broker for scheduling distributed data-oriented applications on global grids. Proceedings of the 2nd workshop on Middleware for grid computing, Toronto, Canada, (pp. 75–80). Retrieved from www.gridbus.org/broker
952
VMware Inc. (1999). VMware virtual platform. Voss, M. J., & Eigenmann, R. (2000). ADAPT: Automated De-coupled Adaptive Program Transformation. International Conference on Parallel Processing, Toronto, Canada, (p. 163). Vshivkov, V. A., Nikitin, S. A., & Snytnikov, V. N. (2003). Studying instability of collisionless systems on stochastic trajectories. JETP Letters, 78(6), 358–362. doi:10.1134/1.1630127 Vuduc, R., Demmel, J., & Bilmes, J. A. (2004). Statistical Models for Empirical Search-Based Performance Tuning. International Journal of High Performance Computing Applications, 18(1), 65–94. doi:10.1177/1094342004041293 Vydyanathan, N., Krishnamoorthy, S., Sabin, G., Çatalyürek, Ü. V., Kurç, T. M., Sadayappan, P., et al. (2006). An integrated approach for processor allocation and scheduling of mixed-parallel applications. In Proceedings of the 2006 International Conference on Parallel Processing (ICPP’06) (pp. 443–450). New York: IEEE. Vydyanathan, N., Krishnamoorthy, S., Sabin, G., Çatalyürek, Ü. V., Kurç, T. M., Sadayappan, P., et al. (2006). Locality conscious processor allocation and scheduling for mixed parallel applications. In Proceedings of the 2006 IEEE International Conference on Cluster Computing, September 25-28, 2006, Barcelona, Spain. New York: IEEE. Wachter, H., & Reuter, A. (Eds.). (1992). Contracts: A means for Extending Control Beyond Transaction Boundaries. Advanced Transaction Models for New Applications. San Francisco: Morgan Kaufmann.
Waldburger, M., & Stiller, B. (2006). Toward the mobile grid: service provisioning in a mobile dynamic virtual organization. In Proceedings of the IEEE International Conference on Computer Systems and Applications, 2006, (pp. 579–583).
Wang, J., Zeng, Q.-A., & Agrawal, D. P. (2003). Performance analysis of a preemptive and priority reservation handoff scheme for integrated service-based wireless mobile networks. IEEE Transactions on Mobile Computing, 2(1), 65–75. doi:10.1109/TMC.2003.1195152
Waldvogel, M., & Rinaldi, R. (2002). Efficient topologyaware overlay network. In Proc. of HotNets-I.
Wang, T., Vonk, J., Kratz, B., & Grefen, P. (2008). A survey on the history of transaction management: from flat to grid transactions. Distributed and Parallel Databases, 23(3), 235–270. doi:10.1007/s10619-008-7028-1
Walker, B., Popek, G., English, R., Kline, C., & Thiel, G. (1992). The LOCUS distributed operating system. Distributed Computing Systems: Concepts and Structures, 17(5). Walker, D. W. (1990). Characterising the parallel performance of a large-scale, particle-in-cell plasma simulation code. International Journal on Concurrency: Practice and Experience, 2(4), 257–288. doi:10.1002/cpe.4330020402 Wall, D. W. (1991). Limits of instruction-level parallelism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA (ASPLOS-IV), (pp. 176-188). New York: ACM Press. Wang, C., Hsu, C., Chen, H., & Wu, J. (2006). Efficient multi-source data transfer in data grids. In Proceedings of the sixth IEEE international symposium on cluster computing and the grid (CCGRID’06) (pp. 421–424). Wang, C., Xiao, L., Liu, Y., & Zheng, P. (2004). Distributed caching and adaptive search in multilayer P2P networks. In International Conference on Distributed Computing Systems (ICDCS’04) (pp. 219-226). Wang, H., Katz, R., & Giese, J. (1999). Policy-enabled handoffs across heterogeneous wireless networks. Mobile Computing Systems and Applications (PWMCSA), (pp. 51-60). Wang, H., Liu, P., & Wu, J. (2006). A QoS-aware heuristic algorithm for replica placement. Journal of Grid Computing, 96–103.
Wang, Y., Scardaci, D., Yan, B., & Huang, Y. (2007). Interconnect EGEE and CNGRID e-infrastructures through interoperability between gLite and GOS middlewares. In International Grid Interoperability and Interoperation Workshop (IGIIW 2007) with e-Science 2007 (pp. 553–560). Bangalore, India: IEEE Computer Society. Wang, Z., Yu, B., Chen, Q., & Gao, C. (2005). Wireless grid computing over mobile ad-hoc networks with mobile agent. In Skg ’05: Proceedings of the first international conference on semantics, knowledge and grid (p. 113). Washington, DC: IEEE Computer Society. Wasson, G., & Humphrey, M. (2003). Policy and enforcement in virtual organizations. In 4th International Workshop on Grid Computing (pp. 125–132). Washington, DC: IEEE Computer Society. Watt, A. Lilley Chris, & J., Daniel. (2003). SVG Unleashed. Indianapolis, IN: SAMS. Wei, B., Fedak, G., & Cappello, F. (2005). scheduling independent tasks sharing large data distributed with BitTorrent. In The 6th IEEE/ACM International Workshop On Grid Computing, 2005, Seattle, WA. Weiser, M. (1991, February). The computer for the 21st century. Scientific American, 265(3), 66–75. Wesner, S., Dimitrakos, T., & Jeffrey, K. (2004, October). Akogrimo - the Grid goes mobile. ERCIM News, (59), 32-33.
953
Compilation of References
West, E. A., & Grimshaw, A. S. (1995). Braid: Integrating task and data parallelism. In Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers’95) (p. 211). New York: IEEE Computer Society. Whaley, R. C., & Petite, A. (2005). Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software, Practice & Experience, 35(2), 101–121. doi:10.1002/spe.626 White Paper, A. M. D. (2008). The industry-changing impact of accelerated computing. White, J. E. (1996). Telescript technology: Mobile agents. Journal of Software Agents. Wieczorek, M., Prodan, R., & Fahringer, T. (2005). Scheduling of scientific workflows in the ASKALON Grid environment. SIGMOD Record, 09. WiFi (2008). Retrieved November 2008 from http:// www.ieee802.org/11/ Wikipedia, Gauss-Jordan elimination. Retrieved from http://en.wikipedia.org/wiki/Gauss-Jordan_elimination Wikipedia, Max-flow min-cut theorem. Retrieved from http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem Wilkinson, T. (1998). Kaffe - a clean room implementation of the Java virtual machine. Retrieved 2002, from http://www.kaffe.org/ Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P., & Yelick, K. (2006, May). The Potential of the Cell Processor for Scientific Computing. In Computing frontiers (cf’06) Ischia, Italy (pp. 9–20). Winkler, P., & Zhang, L. (2003). Wavelength assignment and generalized interval graph coloring. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA’03), Baltimore, MD, (pp. 830–831). Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd Ed.). San Francisco: Morgan Kaufmann.
954
Woeginger, G. J. (1997). There is no asymptotic PTAS for two-dimensional vector packing. Information Processing Letters, 64, 293–297. doi:10.1016/S00200190(97)00179-8 Wolski, R. (2003). Experiences with predicting resource performance on-line in computational grid settings. ACM SIGMETRICS Performance Evaluation Review, 30(4), 41–49. doi:10.1145/773056.773064 Wong, S.-W., & Ng, K.-W. (2006). Security support for mobile grid services framework. In Nwesp’06: Proceedings of the international conference on next generation web services practices (pp.75–82). Washington, DC: IEEE Computer Society. World Wide Web Consortium (W3C). (n.d.). Web services activity. Retrieved from http://www.w3.org/2002/ws/ WSDL/W3C. (2005). WSDL: Web Services Description Language (WSDL) 1.1. Retrieved June 15th, 2008, from http://www.w3.org/TR/wsdl. Wu, D. M., & Guan, L. (1995). A Distributed Real-Time Image Processing System. Real-Time Imaging, 1(6), 427–435. doi:10.1006/rtim.1995.1044 Wu, G., Chu, C. W., Wine, K., Evans, J., & Frenkiel, R. (1999). WINMAC: A novel transmission protocol for infostations. Proceedings of the 49th IEEE Vehicular Technology Conference (VTC), Houston, TX, (Vol. 2, pp. 1340–1344). Wu, Q., He, X., & Hintz, T. (2004, June 21-24). Virtual Spiral Architecture. Paper presented at the International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, USA. Wyckoff, P., McLaughry, S. W., Lehman, T. J., & Ford, D. A. (1998). T Spaces. IBM Systems Journal, 37, 454–474. Xiang, Y., & Zhou, W. (2006). Protecting information infrastructure from ddos attacks by mark-aided distributed filtering (madf). International Journal of High Performance Computing and Networking, 4(5/6), 357–367. doi:10.1504/IJHPCN.2006.013491
Compilation of References
Xie, M. (1991). Software reliability modeling. Hackensack, NJ: World Scientific Publishing Company. Xie, M., Dai, Y. S., & Poh, K. L. (2004). Computing systems reliability: Models and analysis. New York: Kluwer Academic Publishers. Xu, C. (2005). Scalable and Secure Internet Services and Architecture. Boca Raton, FL: Chapman & Hall/ CRC Press. Xu, J. (2003). On the fundamental tradeoffs between routing table size and network diameter in peer-to-peer networks. In Proceedings of INFOCOM (pp. 2177-2187). New York: IEEE Press. Xu, M., Sabouni, A., Thulasiraman, P., Noghanian, S., & Pistorius, S. (2007, Sept.). Image Reconstruction using Microwave Tomography for Breast Cancer Detection on Distributed Memory Machine. In International conference on parallel processing (icpp) XiAn, China (p. 1-8). Xu, Y., Liu, H., & Zeng, Q.-A. (2005). Resource management and Qos control in multiple traffic wireless and mobile Internet systems. [WCMC]. Wiley’s Journal of Wireless Communications and Mobile Computing, 2(1), 971–982. doi:10.1002/wcm.360 Xu, Z., Mahalingam, M., & Karlsson, M. (2003). Turning heterogeneity into an advantage in overlay routing. In Proc. of INFOCOM. Xu, Z., Min, R., & Hu, Y. (2003). HIERAS: A DHT based hierarchical p2p routing algorithm. In Proceedings of the 2003 Intl. Conf. on Parallel Processing (pp. 187-194). New York: IEEE Computer Society Press. Xu, Z., Tang, C., & Zhang, Z. (2003). Building topology-aware overlays using global soft-state. In Proc. of ICDCS. Yamamoto, W., & Nemirovsky, M. (1995). Increasing superscalar performance through multistreaming. In Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques (PACT’95), (pp. 49-58). Limassol, Cyprus: IFIP Working Group on Algol.
Yamin, A., Augustin, I., Barbosa, J., da Silva, L., Real, R., & Cavalheiro, G. (2003). Towards merging contextaware, mobile and grid computing. International Journal of High Performance Computing Applications, 17(2), 191–203. doi:10.1177/1094342003017002008 Yan, C., Rogers, B., Englender, D., Solihin, Y., & Prvulovic, M. (2006). Improving cost, performance, and security of memory encryption and authentication. In Proceedings of 33rd Annual International Symposium on Computer Architecture (ISCA’06), (pp. 179-190). Boston: IEEE Computer Society Press. Yan, J., & Zhang, W. (2007). Hybrid multi-core architecture for boosting single-threaded performance. ACM SIGARCH Computer Architecture News, 35(1), 141–148. doi:10.1145/1241601.1241603 Yang, B., & Garcia-Molina, H. (2002). Improving search in peer-to-peer networks. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS’02) (pp. 5). Yang, B., & Xie, M. (2000). A study of operational and testing reliability in software reliability analysis. Reliability Engineering & System Safety, 70, 323–329. doi:10.1016/S0951-8320(00)00069-7 Yang, C., Yang, I., Chen, C., & Wang, S. (2006). Implementation of a dynamic adjustment mechanism with efficient replica selection in data grid environments. In Proceedings of the ACM symposium on applied computing (pp. 797–804). Yang, H. T., Wang, Z. H., & Deng, Q. H. (2008). Scheduling optimization in coupling independent services as a Grid transaction. Journal of Parallel and Distributed Computing, 68(6), 840–854. doi:10.1016/j. jpdc.2008.01.004 Yang, Y. G., Jin, H., & Li, M. L. (2004). Grid computing in China. Journal of Grid Computing, 2(2), 193–206. doi:10.1007/s10723-004-4201-2 Yap, T., Frieder, O., & Martino, R. (1998, March). Parallel computation in biological sequence analysis. Institute of Electrical and Electronic Engineers, 9(3), 283–294.
955
Compilation of References
Yau, D. K. Y., & Lam, S. S. (1996). Adaptive RateControlled Scheduling for Multimedia Applications. In Proceedings of ACM Multimedia Conference. Yavatkar, R., Pendarakis, D., & Guerin, R. (2000, January). A framework for policy based admission control, (RFC 2753). Yeager, K. C. (1996). The MIPS R10000 superscalar microprocessor. IEEE Micro, 16(2), 28–40. doi:10.1109/40.491460 Yee, K. (1966, May). Numerial solution of initial boundary value problems involving maxwell’s equations in isotropic media. IEEE Transactions on Antennas and Propagation, AP-14(8), 302–307. Yeh, C.-H., Parhami, B., Varvarigos, E. A., & Lee, H. (2002, July). VLSI layout and packaging of butterfly networks. In Acm symposium on parallel algorithms and architectures Winnipeg, Canada (pp. 196–205). Yeo, C. S., & Buyya, R. (2007). Integrated Risk Analysis for a Commercial Computing Service. Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007, IEEE CS Press, Los Alamitos, CA, USA). Yinglian, X. O’Hallaron D. (2002). Locality in search engine queries and its implications for caching. In Proceedings of the IEEE Infocom (pp. 1238-1247). Young, J. W. (1974). A first order approximation to the optimal checkpoint interval. Communications of the ACM, 17(9), 530–531. doi:10.1145/361147.361115 Yu, W., & Cox, A. (1997). Java/DSM: A Platform for Heterogeneous Computing. Concurrency (Chichester, England), 9(11), 1213–1224. doi:10.1002/(SICI)10969128(199711)9:11<1213::AID-CPE333>3.0.CO;2-J Yu, W., Mittra, R., Su, T., Liu, Y., & Yang, X. (2006). Parallel Finite-Difference Time-Domain Method. Boston: Artech House publishers. Zajcew, R., Roy, P., Black, D., & Peak, C. (1993). An osf/l unix for massively parallel multi-computers. Proceedings of the winter 1993 conference, (pp. 449-468).
956
Zander, J. (2000). Trends and challenges in resource management future wireless networks. In Proceedings of the IEEE Wireless Communications and Networks Conference (WCNC), Chicago, (Vol. 1, pp. 159–163). Zegura, E. Calvert, K. et al. (1996). How to model an Internetwork. In Proc. of INFOCOM. Zhang, G., & Parashar, M. (2003). Dynamic context-aware access control for grid applications. In 4th international workshop on grid computing (grid 2003), (pp. 101 – 108). Phoenix, AZ: IEEE Computer Society Press. Retrieved from citeseer.ist.psu.edu/zhang03dynamic.html Zhang, Q., Guo, C., Guo, Z., & Zhu, W. (2003). Efficient mobility management for vertical handoff between WWAN and WLAN. IEEE Communications Magazine, 41(11), 102–108. doi:10.1109/MCOM.2003.1244929 Zhang, X. Y., Zhang, Q., Zhang, Z., Song, G., & Zhu, W. (2004). A construction of locality-aware overlay network: mOverlay and its performance. IEEE Journal on Selected Areas in Communications, 22(1), 18–28. doi:10.1109/JSAC.2003.818780 Zhang, X., Freschl, J. L., & Schopf, J. M. (2003, June). A performance study of monitoring and information services for distributed systems. In HPDC’03: Proceedings of the Twelfth International Symposium on High Performance Distributed Computing, (pp. 270-281). Los Alamitos, CA: IEEE Computer Society Press. Zhao, B. Y., Duan, Y., Huang, L., Joseph, A., & Kubiatowicz, J. (2003). Brocade: landmark routing on overlay networks. In Proceedings of the 2nd Intl. Workshop on Peer-to-Peer Systems (pp. 34-44). Berlin: SpringerVerlag. Zhao, B. Y., Kubiatowicz, J., & Oseph, A. D. (2001). Tapestry: An infrastructure for fault-tolerant wide-area location and routing (Tech. Rep. UCB/CSD-01-1141). University of California at Berkeley, Berkeley, CA. Zhao, S., & Lo, V. (2001, May). Result Verification and Trust-based Scheduling in Open Peer-to-Peer Cycle Sharing Systems. In Proceedings of Ieee Fifth International Conference on Peer-To-Peer Systems.
Compilation of References
Zhou, D., & Lo, V. M. (2006). Wavegrid: A scalable fast-turnaround heterogeneous peer-based desktop grid system. In IPDPS. Zhou, J., Ou, Z., Rautiainen, M., & Ylianttila, M. (2008b). P2P SCCM: Service-oriented Community Coordinated Multimedia over P2P. In Proceedings of 2008 IEEE International Conference on Web Services, Beijing, China, September 23-26, (pp. 34-40). Zhou, J., Rautiainen, M., & Ylianttila, M. (2008a). Community coordinated multimedia: Converging contentdriven and service-driven models. In proceedings of 2008 IEEE International Conference on Multimedia & Expo, June 23-26, 2008, Hannover, Germany. Zhou, S., Zheng, X., Wang, J., & Delisle, P. (1993). Utopia: a load sharing facility for large, heterogeneous distributed computer systems. Software, Practice & Experience, 23(12), 1305–1336. doi:10.1002/spe.4380231203 Zhou, X., Kim, E., Kim, J. W., & Yeom, H. Y. (2006). ReCon: A fast and reliable replica retrieval service for the data grid. In Proceedings of IEEE international symposium on cluster computing and the grid (pp. 446–453). Zhou, Y., Iftode, L., & Li, K. (1996). Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems. SIGOPS Oper. Syst. Rev., 30(SI), 75-88.
Zhu, W., Wang, C. L., & Lau, F. C. M. (2002). JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support. Paper presented at the Proceedings of the IEEE International Conference on Cluster Computing. Zhu, Y., & Hu, Y. (2005). Efficient, proximity-aware load balancing for DHT-based P2P systems. Proc. of IEEE TPDS, 16(4). Zhu, Y., & Jiang, H. (2006). False Rate Analysis of Bloom Filter Replicas in Distributed Systems. Paper presented at the Proceedings of the 2006 International Conference on Parallel Processing. Zhu, Y., Jiang, H., & Wang, J. (2004). Hierarchical Bloom filter arrays (HBA): a novel, scalable metadata management system for large cluster-based storage. Paper presented at the Proceedings of the 2004 IEEE International Conference on Cluster Computing. Zhu, Y., Jiang, H., Wang, J., & Xian, F. (2008). HBA: Distributed Metadata Management for Large ClusterBased Storage Systems. IEEE Transactions on Parallel and Distributed Systems, 19(6), 750–763. doi:10.1109/ TPDS.2007.70788 Zilka, A. (2006). Terracotta - JVM Clustering, Scalability and Reliability for Java. Retrieved June 19, 2008, from http://www.terracotta.org
Zhu, F., & McNair, J. (2004). Optimizations for vertical handoff decision algorithms. IEEE Wireless Communications and Network Conference (WCNC), (pp. 867-872).
957
About the Contributors
Kuan-Ching Li received the PhD and MS degrees in Electrical Engineering and the Licenciatura in Mathematics from the University of São Paulo, Brazil. After receiving his PhD, he was a postdoctoral scholar at the University of California, Irvine (UCI) and the University of Southern California (USC). His main research interests include cluster and grid computing, parallel software design, and life science applications. He has authored over 60 research papers and book chapters, and is co-editor of the book entitled "Handbook of Research on Scalable Computing Technologies" published by IGI Global, as well as of volumes of LNCS and LNAI published by Springer. He has served as Guest Editor of a number of journal special issues, including The Journal of Supercomputing (TJS), the International Journal of Ad Hoc and Ubiquitous Computing (IJAHUC), and the International Journal of Computer Applications in Technology (IJCAT). In addition, he has served on the steering, organizing, and program committees of several conferences and workshops, including as Conference co-chair of CSE'2008 (Sao Paulo, Brazil) and Program co-chair of APSCC'2008 (Yilan, Taiwan) and AINA'2008 (Okinawa, Japan). He is a senior member of the IEEE.

Ching-Hsien Hsu received the B.S. and Ph.D. degrees in Computer Science from Tung Hai University and Feng Chia University, Taiwan, in 1995 and 1999, respectively. He is currently an associate professor in the Department of Computer Science and Information Engineering at Chung Hua University, Taiwan. Dr. Hsu's research interest is primarily in parallel and distributed computing, grid computing, P2P computing, RFID, and services computing. Dr. Hsu has published more than 80 academic papers in journals, books, and conference proceedings. He was named an annual outstanding researcher by Chung Hua University in 2005, 2006, and 2007, and received its excellent research award in 2008. He serves on a number of journal editorial boards, including the International Journal of Communication Systems, the International Journal of Computer Science, the International Journal of Grid and High Performance Computing, the International Journal of Smart Home, and the International Journal of Multimedia and Ubiquitous Engineering.

Laurence T. Yang is a professor in the Department of Computer Science at St. Francis Xavier University, Canada. His research includes high performance computing and networking, embedded systems, ubiquitous/pervasive computing and intelligence. He has published around 300 papers (including more than 80 international journal papers, such as IEEE and ACM Transactions) in refereed journals, conference proceedings, and book chapters in these areas. He has been involved in more than 100 conferences and workshops as a program/general/steering conference chair and in more than 300 conferences and workshops as a program committee member. He served as the vice-chair of the IEEE Technical Committee of Supercomputing Applications (TCSA) until 2004, and is currently the chair of the IEEE Technical Committee
of Scalable Computing (TCSC) and the chair of the IEEE Task Force on Ubiquitous Computing and Intelligence. In addition, he is editor-in-chief of several international journals and a few book series, and serves as an editor for numerous international journals. He has been an author/co-author or an editor/co-editor of 25 books from Kluwer, Springer, IGI, Nova Science, American Scientific Publishers, and John Wiley & Sons. He has won five Best Paper Awards (including at the IEEE 20th International Conference on Advanced Information Networking and Applications (AINA-06)) and one Best Paper Nomination in 2007, as well as a Distinguished Achievement Award (2005) and a Canada Foundation for Innovation Award (2003).

Jack Dongarra holds an appointment at the University of Tennessee and holds the titles of Distinguished Research Staff at Oak Ridge National Laboratory (ORNL) and Turing Fellow at the University of Manchester. He was awarded the IEEE Sid Fernbach Award in 2004 for his contributions in the application of high performance computers using innovative approaches, and in 2008 he was the recipient of the IEEE Medal of Excellence in Scalable Computing. He is a Fellow of the AAAS, ACM, and the IEEE and a member of the National Academy of Engineering.

Hans P. Zima is a Principal Scientist at the Jet Propulsion Laboratory, California Institute of Technology, and a Professor Emeritus of the University of Vienna, Austria. He received his Ph.D. degree in Mathematics and Astronomy from the University of Vienna in 1964. His major research interests have been in the fields of high-level programming languages, compilers, and advanced software tools. In the early 1970s, while working in industry, he designed and implemented one of the first high-level real-time languages for the German Air Traffic Control Agency. During his tenure as a Professor of Computer Science at the University of Bonn, Germany, he contributed to the German supercomputer project "SUPRENUM", leading the design of the first Fortran-based compilation system for distributed-memory architectures (1989). After his move to the University of Vienna, he became the chief designer of the Vienna Fortran language (1992), which provided a major input for the High Performance Fortran de-facto standard. From 1997 to 2007, Dr. Zima headed the Priority Research Program "Aurora", a ten-year program funded by the Austrian Science Foundation. His research over the past years focused on the design of the "Chapel" programming language in the framework of the DARPA-sponsored HPCS project "Cascade". More recently, Dr. Zima has become involved in the design of space-borne fault-tolerant high capability computing systems. Dr. Zima is the author or co-author of about 200 publications, including 4 books.

***

David Allenotor holds a B.Sc. and an M.Sc. in Computer Science from the University of Benin (1996 and 2000), and an M.Sc. in Computer Science from the University of Manitoba, Canada (2005). At present he is a Ph.D. candidate in the Department of Computer Science, University of Manitoba, Canada, and a member of the Computational Finance Derivatives Lab (CFD). His field of interest covers Grid Computing, Cloud Computing, applications of fuzzy logic to Computational Finance Derivatives, and the modeling of Financial Engineering problems.

Jörn Altmann is an Associate Professor at the International University of Bruchsal, Germany, where he heads the group of Computer Networks and Distributed Systems. Dr. Altmann received his B.Sc. degree,
his M.Sc. degree (1993), and his Ph.D. (1996) from the University of Erlangen-Nürnberg, Germany. Dr. Altmann's current research centers on the economics of Internet services and Internet infrastructures, integrating economic models into distributed systems. In particular, he focuses on capacity planning, network topologies, and resource allocation.

Carlos Eduardo Rodrigues Alves is an Associate Professor in the Computer Science Department of São Judas Tadeu University. He obtained his Ph.D. in Computer Science at the Institute of Mathematics and Statistics of the University of São Paulo in 2002. He is a graduate of the Instituto Tecnológico de Aeronáutica, where he finished both his undergraduate course in Electronics Engineering and his M.Sc. degree in Electrical and Computer Engineering. His research interests are the design of efficient sequential and parallel algorithms.

Marcos Dias de Assunção is a PhD candidate at the University of Melbourne, Australia. His PhD thesis is on peering and resource allocation across Grids. He previously obtained a master's degree on network management at the Federal University of Santa Catarina, Brazil. The current topics of his interest include Grid scheduling, virtual machines, and network virtualisation.

Alan A. Bertossi got the Laurea Degree in Computer Science from the University of Pisa (Italy) in 1979. Currently, he is a Professor of Computer Science at the Department of Computer Science of the University of Bologna (Italy). His main research interests are the design and analysis of algorithms for high-performance, parallel, distributed, wireless, fault-tolerant, and real-time systems. He has published 45 refereed papers in international archival journals, as well as several other papers in conference proceedings, book chapters, and encyclopedias. He has served as a guest coeditor for special issues of international journals, mainly on algorithms for wireless networks. Since 2000, he has been on the editorial board of Information Processing Letters.

Rajkumar Buyya is an Associate Professor and Reader of Computer Science and Software Engineering, and Director of the Grid Computing and Distributed Systems (GRIDS) Laboratory at the University of Melbourne, Australia. Dr. Buyya has authored/co-authored over 250 publications. He has co-authored three books: Microprocessor x86 Programming (BPB Press, New Delhi, 1995), Mastering C++ (Tata McGraw Hill Press, New Delhi, 1997), and Design of PARAS Microkernel. The books on emerging topics that he edited include High Performance Cluster Computing (Prentice Hall, USA, 1999), High Performance Mass Storage and Parallel I/O (IEEE and Wiley Press, USA, 2001), Content Delivery Networks (Springer, Germany, 2008), and Market Oriented Grid and Utility Computing (Wiley Press, USA, 2009).

Edson Norberto Caceres is a Professor in the Department of Computer Science and Statistics of the Federal University of Mato Grosso do Sul, where he is a former Pro-Rector of Undergraduate Studies. He holds a PhD in Computer Science obtained at the Federal University of Rio de Janeiro in 1992. His research interests include the design of parallel algorithms, especially graph algorithms. He is the Director of Education of the Brazilian Computer Society. In addition to having been with the Federal University of Mato Grosso do Sul since the early eighties, he is currently also with the Brazilian Ministry of Education as the General Coordinator of Student Relations.

Franck Cappello holds a Senior Researcher position at INRIA. He leads the Grand-Large project at INRIA, focusing on high performance issues in large scale distributed systems. He has initiated the XtremWeb (Desktop Grid) and MPICH-V (Fault tolerant MPI) projects. He was the director of the Grid5000 project from its beginning until 2008 and is the scientific director of ALADDIN/Grid5000, the new four-year INRIA project aiming to sustain the Grid5000 infrastructure. He has contributed to more than 50 program committees. He is an editorial board member of the international Journal on Grid Computing, the Journal of Grid and Utility Computing, and the Journal of Cluster Computing. He is a steering committee member of IEEE HPDC and IEEE/ACM CCGRID. He is the Program co-chair of IEEE CCGRID'2009 and System Software area co-chair of SC'2009, and was the General Chair of IEEE HPDC'2006.

Jih-Sheng Chang received his B.E. degree from the Department of Computer Science and Information Engineering, I-Shou University, Kaohsiung, Taiwan in 2002 and his M.S. degree from the Department of Computer Science and Information Engineering, National Dong Hwa University, Hualien, Taiwan in 2004. He is currently a Ph.D. candidate at the Department of Computer Science and Information Engineering at National Dong Hwa University. His academic research interests focus on wireless network technology and grid computing.

Ruay-Shiung Chang received his B.S.E.E. degree from National Taiwan University in 1980 and his Ph.D. degree in Computer Science from National Tsing Hua University in 1988. He is now a professor in the Department of Computer Science and Information Engineering, National Dong Hwa University. His research interests include the Internet, wireless networks, and grid computing. Dr. Chang is a member of ACM, a senior member of IEEE, and a founding member of the ROC Institute of Information and Computing Machinery. Dr. Chang also served on the advisory council for the Public Interest Registry (www.pir.org) from 2004/5 to 2007/4.

Jinjun Chen received his Ph.D. degree in Computer Science and Software Engineering from Swinburne University of Technology, Melbourne, Australia, in 2007. He is currently a Lecturer in the Centre for Complex Systems and Services in the Faculty of Information and Communication Technologies at Swinburne University of Technology, Melbourne, Australia. His research interests include: Scientific Workflow Management and Applications; Workflow Management and Applications in Web Service or SOC Environments; Workflow Management and Applications in Grid (Service)/Cloud Computing Environments; Software Verification and Validation in Workflow Systems; QoS and Resource Scheduling in Distributed Computing Systems such as Cloud Computing; Service Oriented Computing (SLA, Negotiation, Engineering, Composition); Semantics and Knowledge Management; and Cloud Computing.

Zizhong Chen received a B.S. degree in mathematics from Beijing Normal University, P. R. China, in 1997, and M.S. and Ph.D. degrees in computer science from the University of Tennessee, Knoxville, in 2003 and 2006, respectively. He is currently an assistant professor of computer science at the Colorado School of Mines. His research interests include high performance computing; parallel, distributed, and grid computing; fault tolerance and reliability; numerical algorithms and software; and computational science and engineering.
The goal of his research is to develop techniques, design algorithms, and build software tools for computational science applications to achieve both high performance and high reliability on a wide range of computational platforms.

Shang-Feng Chiang was a graduate student at the Department of Electrical Engineering, National Taiwan University. Kuo Chiang and Ruo-Jian Yu were research assistants at the Department of Electrical Engineering, National Taiwan University.

Kenneth Chiu is an assistant professor at SUNY Binghamton. His interests are in the areas of scientific data management, web services, and grid computing. He has served as program co-chair for IEEE e-Science 2007 and as workshops chair for e-Science 2008. He is involved in a number of multidisciplinary projects with domain scientists, and he is PI or co-PI on five research/education awards from the NSF or DOE, all of which are still active. He received his A.B. from Princeton University and his Ph.D. from Indiana University, both in computer science.

Yuanshun Dai is currently an assistant professor with the Department of Electrical Engineering and Computer Science and the Department of Industrial and Information Engineering at the University of Tennessee, Knoxville. He was the program chair of the 12th IEEE Pacific Rim Symposium on Dependable Computing (PRDC 06). He was also the general chair of the IEEE Symposium on Dependable Autonomic and Secure Computing (DASC) in 2005, 2006, and 2007, and an Associate Editor of IEEE Transactions on Reliability. His research interests are dependability, security, grid computing, and autonomic computing. He has published over 60 papers and 5 books.

F. Dehne received an M.C.S. degree (Dipl. Inform.) from RWTH Aachen University, Germany, in 1983 and a Ph.D. (Dr. rer. nat.) from the University of Würzburg, Germany, in 1986. In 1986 he joined the School of Computer Science at Carleton University in Ottawa, Canada, as an Assistant Professor. He was appointed Associate Professor and Professor of Computer Science in 1990 and 1997, respectively. From 2000 to 2003 and 2006 to 2008 he served as Director of the School of Computer Science. His current research interests are in the areas of Parallel Computing, Coarse Grained Parallel Algorithms, Parallel Computational Geometry, Parallel Data Warehousing & OLAP, and Parallel Bioinformatics. He is a Senior Member of the IEEE, a member of the ACM Symposium on Parallel Algorithms and Architectures Steering Committee, and a former Vice-Chair of the IEEE Technical Committee on Parallel Processing. He is an editorial board member for IEEE Transactions on Computers, Information Processing Letters, the Journal of Bioinformatics Research and Applications, and the International Journal of Data Warehousing and Mining.

Evgueni Dodonov is currently finishing his PhD research at the University of São Paulo, Brazil. He completed his master's degree at the Federal University of São Carlos in 2004 and worked in the computer industry for several years. His research interests include autonomic computing, file systems, process behavior evaluation, and distributed programming.

Daniel C. Doolan is a lecturer in the School of Computing, Robert Gordon University, Scotland. His main research interest is in Mobile and Parallel Computing. He has published over 40 articles in the areas of mobile multimedia and parallel computation.

Dou Wanchun received his Ph.D. degree in Mechanical and Electronic Engineering from Nanjing University of Science and Technology, Nanjing, P.R. China, in 2001. He is currently a full professor in the Department of Computer Science and Technology at Nanjing University, Nanjing, P.R. China. His research interests include: Scientific Workflow Management and Applications, Workflow Management and Applications in Web Service, and QoS and Resource Scheduling in Distributed Computing Systems.

Jörg Dümmler received his Master's degree in Computer Science from the Chemnitz University of Technology in 2004 and has been pursuing doctoral research since then. His research interests include scheduling and mapping of mixed parallel applications, parallel programming models for distributed memory platforms, and transformation tools for the development of parallel applications.

M. Rasit Eskicioglu received his B.Sc. in Chemical Engineering from Istanbul Technical University, Turkey, his M.Sc. in Computer Engineering from Middle East Technical University, Turkey, and his Ph.D. in Computing Science from the University of Alberta, Canada. His research interests are mainly in the systems area, including operating systems, cluster and grid computing, high-speed network interconnects, and mobile networks. He has investigated ways to make software DSM systems more efficient and scalable using high-speed, programmable interconnects. Currently he is looking at wireless sensor networks and their applications to real world problems, such as environmental monitoring. Dr. Eskicioglu is currently an associate professor in the Computer Science Department at the University of Manitoba, Canada. He is a member of ACM and a senior member of IEEE.

Thomas Fahringer received his Ph.D. in 1993 from the Vienna University of Technology. Between 1990 and 1998, Fahringer worked as an Assistant Professor at the University of Vienna, where he was promoted to Associate Professor in 1998. Since 2003, Fahringer has been a Full Professor in Computer Science at the Institute of Computer Science, University of Innsbruck, where he is leading a research group developing the ASKALON Grid application development and computing environment. Fahringer's main research interests include software architectures, programming paradigms, compiler technology, performance analysis, and prediction for parallel and distributed Grid systems. Fahringer is currently coordinating the IST-034601 edutain@grid project and is involved in numerous Austrian (SFB Aurora, Austrian Grid) and European Grid (EGEE, CoreGrid, K-Wf Grid, ASG) projects. He is the author of over 100 papers, including two books and 20 journal articles, and has received two best paper awards (ACM and IEEE).

Tore Ferm received his Bachelor of Computer Science degree from Sydney University in 2004. He is currently working in the telecommunications industry in Sydney, Australia.

Gilles Fedak received his PhD degree from the University Paris-XI in 2003. He is currently a junior INRIA researcher at the LIP Laboratory. He is mainly interested in research around Desktop Grids. He has designed several Desktop Grid middleware systems, most notably XtremWeb (Desktop Grid) and BitDew (Data Management).

Edgar Gabriel is an Assistant Professor in the Department of Computer Science at the University of Houston, Texas, USA. He received his PhD and Dipl.-Ing. in mechanical engineering from the University of Stuttgart. His research interests are Message Passing Systems, High Performance Computing, Parallel Computing on Distributed Memory Machines, and Grid Computing.

Jean-Luc Gaudiot received his Diplôme d’Ingénieur from the École Supérieure d’Ingénieurs en Electrotechnique et Electronique, Paris, France, in 1976 and the M.S. and Ph.D. degrees in Computer Science from the University of California, Los Angeles, in 1977 and 1982, respectively. He is currently a Professor and Chair of the Electrical Engineering and Computer Science Department at the University of California, Irvine. His research interests include multithreaded architectures, fault-tolerant multiprocessors, and implementation of reconfigurable architectures. He has published over 170 journal and conference papers. Dr. Gaudiot is a Fellow of IEEE and AAAS.

Wolfgang Gentzsch is dissemination advisor for the DEISA Distributed European Initiative for Supercomputing Applications and a member of the Board of Directors of the Open Grid Forum. Before that, he was Chairman of the German D-Grid Initiative; managing director of MCNC Grid and Data Center Services in Durham; adjunct professor of computer science at Duke University; and visiting scientist at RENCI Renaissance Computing Institute at UNC Chapel Hill in North Carolina. At the same time, he was a member of the US President’s Council of Advisors on Science and Technology. Before he joined Sun in Menlo Park, CA, in 2000 as the senior director of Grid Computing, he was the President, CEO, and CTO of the start-up companies Genias and Gridware, and a professor of mathematics and computer science at the University of Applied Sciences in Regensburg, Germany. Gentzsch studied mathematics and physics at the Technical Universities in Aachen and Darmstadt, Germany.

Lin Guan is a Lecturer in the Department of Computer Science at Loughborough University, UK. Her research interests include performance modeling/evaluation of computer networks, Quality of Service (QoS) analysis and enhancement (such as congestion control mechanisms with QoS constraints), mobile computing, and wireless networks.

Sudha Gunturu received the B.Tech degree in Computer Science and Engineering from Jawaharlal Nehru Technological University, Hyderabad, India, in 2005. Currently, she is pursuing her MS degree in the Computer Science Department of Oklahoma State University, Stillwater, OK. Her research interests include bioinformatics, scheduling computational loads in parallel and distributed systems, and grid computing.

Minyi Guo received his Ph.D. degree in computer science from the University of Tsukuba, Japan. Before 2000, Dr. Guo was a research scientist at NEC Corp., Japan. He is now a full professor in the Department of Computer Science and Engineering, Shanghai Jiao Tong University, China. His research interests include pervasive computing, parallel and distributed processing, parallelizing compilers, and software engineering. He is a member of the ACM, IEEE, IEEE Computer Society, and IEICE.

Phalguni Gupta received the Doctoral degree from the Indian Institute of Technology Kharagpur, India, in 1986. He works in the field of data structures, sequential algorithms, parallel algorithms, and on-line algorithms. From 1983 to 1987, he was in the Image Processing and Data Product Group of the Space Applications Centre (ISRO), Ahmedabad, India, and was responsible for software for correcting image data received from the Indian Remote Sensing Satellite. In 1987, he joined the Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, India. Currently he is a Professor in the department.
He is responsible for several research projects in the area of image processing, graph theory and network flow. Dr. Gupta is a member of the Association for Computing Machinery (ACM).

Peter Graham is an Associate Professor in the Computer Science Department and Associate Dean (Research) for the Faculty of Science at the University of Manitoba. He is also an adjunct scientist at TRLabs, Winnipeg. His current research interests include large-scale parallel and distributed systems, pervasive computing, and mobile computing. He is a member of the ACM, IEEE, and USENIX.

Alan Grigg is a senior researcher in the Systems Engineering Research Centre at Loughborough University, UK, a BAE Systems-funded position. His research interests include real-time embedded systems design, analysis, and implementation issues around scheduling, communication, and reconfiguration.

Dan Grigoras is a Senior Lecturer in the Department of Computer Science of the National University of Ireland, Cork, where he leads the Mobile and Cluster Computing Group. His main research interests are in Mobile Networking and Parallel Computing, especially MANET management, middleware services, mobile application design, load balancing, and load sharing. He has published one book and 44 papers in journals and conference proceedings, and has co-edited seven other books. He is also involved in many conferences and workshops.

Xiangjian He is an Associate Professor at the Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia. In the past few years, he has received many research grants, including four Australian National Grants, for research in the fields of computing and telecommunication. His current research interests include multi-scale computing, computer vision, and network security and QoS. More information can be found at http://www-staff.it.uts.edu.au/~sean/

Yong J. Jang received the B.S. degree from the School of Electrical and Electronic Engineering at Yonsei University in March 2008. He is now in the master's degree program in the School of Electrical and Electronic Engineering at Yonsei University. His research interests include multi-processor systems and SOC design.

Hai Jiang is an Assistant Professor in the Department of Computer Science at Arkansas State University. He received his Ph.D. in Computer Science from Wayne State University, Detroit, Michigan, in December 2003. His current research interests include Parallel Computing, Distributed Systems, High Performance Computing and Communication, Modeling and Simulation, and System Security. He is a member of the IEEE, the IEEE Computer Society, and the ACM. His personal web page is at http://www.csm.astate.edu/~hjiang.

Hong Jiang received his B.Sc. in Computer Engineering in 1982 from Huazhong University of Science and Technology, his M.A.Sc. in Computer Engineering in 1987 from the University of Toronto, and his PhD in Computer Science in 1991 from Texas A&M University. Since 1991 he has been at the University of Nebraska-Lincoln, where he is a Professor in Computer Science and Engineering. His present research interests are computer architecture, parallel I/O, parallel/distributed computing, cluster and Grid computing, performance evaluation, real-time systems, middleware, and distributed systems for distance education. He has over 150 publications in major journals and international conferences in these areas, and his research has been supported by NSF, DOD, and the State of Nebraska.

Yanqing Ji received his Ph.D. in Computer Engineering from Wayne State University, Detroit, Michigan, in 2007. He is currently an Assistant Professor in the Department of Electrical and Computer Engineering at Gonzaga University, Spokane, Washington. His research interests include parallel and distributed systems, application-level thread migration/checkpointing, multi-agent systems, and their biomedical applications. His website is at http://barney.gonzaga.edu/~ece/yji.html.

Derrick Kondo received his PhD in Computer Science from the University of California at San Diego, and his BS from Stanford University. Currently, he is a research scientist at INRIA Rhône-Alpes in the MESCAL team. His interests lie in the area of volunteer computing and desktop grids. In particular, he leads research on the measurement and characterization of Internet distributed systems, their simulation and modelling, and resource management. He founded and continues to serve as co-chair of the Workshop on Volunteer Computing and Desktop Grids (PCGrid), and also co-chaired the BOINC 2008 workshop on volunteer computing and distributed thinking. He is serving as guest co-editor of a 2009 special issue of the Journal of Grid Computing on volunteer computing and desktop grids.

King Tin Lam received his B.Eng. degree in Electrical and Electronic Engineering and M.Sc. degree in Computer Science, both from the University of Hong Kong, in 2001 and 2006 respectively. He worked in the IT Department of the Hongkong and Shanghai Banking Corporation for five years between the two degrees. Mr. Lam is currently a full-time Ph.D. candidate in the Department of Computer Science at the University of Hong Kong. His research interests include distributed Java virtual machines for cluster computing, software transactional memory, and server clustering technologies.

Xiaobin Li received the B.S. degree in electrical engineering from Chongqing University, China, in 1990 and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of California, Irvine, in 2001 and 2005, respectively. He is now a senior engineer in the Enterprise Microprocessor Group at Intel Corporation, where he is developing XEON microprocessors. His research interests are in fault-tolerant computing and power and thermal management.

Xiaolin Li is currently an assistant professor in the Computer Science Department at Oklahoma State University (OSU), USA, and director of the Scalable Software Systems Laboratory (S3Lab, http://s3lab.cs.okstate.edu/). He received the Ph.D. degree in Computer Engineering from Rutgers University, USA. His research interests include distributed systems, sensor networks, network security, and bioinformatics. He is on the executive committee of the IEEE Technical Committee of Scalable Computing (TCSC) and is the coordinator of Sensor Networks. He has been a TPC chair for several international conferences and workshops and is on the editorial board of three international journals. He regularly reviews NSF grant proposals as a panelist. He is a member of IEEE and ACM.

Chen Liu received his B.E. degree in Electrical Engineering from the University of Science and Technology of China in 2000. He received the M.S. degree in Electrical Engineering from the University of California, Riverside in 2002 and the Ph.D. degree in Electrical and Computer Engineering from the University of California, Irvine in 2008.
He currently works as an Assistant Professor in the Department of Electrical and Computer Engineering at Florida International University. His current research interests are high-performance microprocessor design and multi-thread multi/many-core architecture.

Shaoshan Liu is currently a Ph.D. candidate in Computer Architecture at the University of California, Irvine. He received the B.S. degree in Computer Engineering, the M.S. in Computer Engineering, and the M.S. in Biomedical Engineering in 2005, 2006, and 2007 respectively, all from the University of California, Irvine. His research interests include high performance parallel computer systems, runtime systems, and biomedical engineering. He has been with Intel Research as a member of the Managed Runtime Optimization (MRO) lab, and with Broadcom Corporation as a Device Verification and Test (DVT) engineer.

Paul Malécot is a Ph.D. candidate in computer science at LRI, Paris South University (France), under the direction of Franck Cappello and Gilles Fedak. He is a member of the INRIA Grand-Large project. His research interests include the characterization of volunteer desktop computing platforms.

Victor Malyshkin received his M.S. degree in Mathematics from the State University of Tomsk (1970), his Ph.D. degree in Computer Science from the Computing Center of the Russian Academy of Sciences (1984), and a Doctor of Sciences (Dr.h.) degree from the State University of Novosibirsk (1993). From 1970 he worked in the software industry. In 1979 he joined the Computing Center RAS, where he is presently the head of the Supercomputer Software Department. He also founded the Chair of Parallel Computing Technologies at the State Technical University of Novosibirsk and the Chair of Parallel Computing at the Novosibirsk State (Classical) University. He is one of the organizers of the PaCT (Parallel Computing Technologies) series of international conferences, held every odd year. He has published over 100 scientific papers on parallel and distributed computing, parallel program synthesis, supercomputer software and applications, and parallel implementation of large scale numerical models. His current research interests include parallel computing technologies; parallel programming languages and systems; methods of parallel implementation of large scale numerical models; dynamic load balancing; and methods, algorithms, and tools for parallel program synthesis.

Verdi March is a Research Fellow at the Department of Computer Science, National University of Singapore, and a Research Scientist at the Asia-Pacific Science & Technology Center (APSTC), Sun Microsystems Inc. He completed his PhD at the Department of Computer Science, NUS, in 2007, and received his BSc in Computer Science from the University of Indonesia in 2000. Verdi is currently leading the HPC research projects in APSTC. His main research interests include the performance analysis of HPC systems and of distributed systems such as grid computing and peer-to-peer computing.

Rodrigo Fernandes de Mello is currently a faculty member at the Institute of Mathematics and Computer Sciences, Department of Computer Science, University of São Paulo, São Carlos, Brazil. He completed his PhD degree at the University of São Paulo, São Carlos, in 2003. His research interests include autonomic computing, load balancing, scheduling, and bio-inspired computing.

Marian Mihailescu is pursuing a PhD in Computer Science at the Department of Computer Science, National University of Singapore. He received his BSc in Computer Science in 2005 from the Polytechnic University of Bucharest, Romania. His main research interests include grid computing, peer-to-peer systems, and resource allocation and game theory, with a focus on pricing mechanisms.

Farrukh Nadeem received his Master's degree in Computer Science from the Punjab University College of Information Technology, Lahore, Pakistan, in 2002. Currently he is employed as a Ph.D. student at the Institute of Computer Science, University of Innsbruck, Austria, where he is working in the area of performance modeling and prediction for high performance Grid computing. Nadeem is the author of over 10 scientific papers and co-author of two book chapters.

Priyadarsi Nanda is a Lecturer at the School of Computing and Communications in the Faculty of Engineering and IT at the University of Technology Sydney (UTS), Australia. He has a wide-ranging career in teaching, research, industry, and consultancy. He received the B.Eng. degree in Computer Engineering from Shivaji University, India, the M.Eng. degree in Computer and Telecommunication from the University of Wollongong, Australia, and the PhD in Computing Science from the University of Technology, Sydney, Australia, in 1990, 1996, and 2008 respectively. Details of his research and teaching are available at http://www-staff.it.uts.edu.au/~pnanda/

Doohwan Oh received the B.S. degree from Kyung Hee University in 2007. He is in the master's degree program at the School of Electrical and Electronic Engineering, Yonsei University. His research interests include ASIC designs and SOC (system on a chip) development.

Zhonghong Ou received his M.Sc degree in electronic engineering from Beijing University of Posts and Telecommunications, Beijing, in 2005. He is now pursuing his PhD degree at both Beijing University of Posts and Telecommunications, China, and the University of Oulu, Finland. His current research interests span the fields of P2PSIP systems, hierarchical P2P networks, routing algorithms, and protocols.

Manish Parashar is Professor of Electrical and Computer Engineering at Rutgers University, where he is also Director of the NSF Center for Autonomic Computing and of The Applied Software Systems Laboratory (TASSL). He received a BE degree in Electronics and Telecommunications from Bombay University, India, and MS and Ph.D. degrees in Computer Engineering from Syracuse University. His research interests include autonomic computing, parallel & distributed computing (including peer-to-peer and Grid computing), scientific computing, and software engineering. Manish has received the IBM Faculty Award (2008), the Rutgers University Board of Trustees Award for Excellence in Research (2004-2005), the NSF CAREER Award (1999), the TICAM, University of Texas at Austin, Distinguished Fellowship (1999-2001), and the Enrico Fermi Scholarship, Argonne National Laboratory (1996). He is a senior member of IEEE/IEEE Computer Society and ACM. For more information please visit http://www.ece.rutgers.edu/~parashar/.

Jean-Marc Pierson has served since September 2006 as a University Professor in Computer Science at the University Paul Sabatier, Toulouse 3 (France). He received his PhD from the ENS-Lyon, France, in 1996. He was an Associate Professor at the University of Littoral Côte d'Opale (1997-2001) in Calais, then at INSA-Lyon (2001-2006). He is a member of the IRIT Laboratory. His main interests are related to large-scale distributed systems, funded by several projects in Grid and Pervasive environments, with applications in biomedical informatics. He serves on several PCs in the Grid and Pervasive computing area.
His research focuses on security, cache and replica management, monitoring, and, more recently, energy-aware distributed systems. For more information, please visit http://www.irit.fr/~Jean-Marc.Pierson/

M. Cristina Pinotti received the Laurea degree in Computer Science from the University of Pisa (Italy) in 1986. Currently, she is a Professor of Computer Science at the University of Perugia. She has spent visiting periods at the University of North Texas and at Old Dominion University (USA). Her research interests are the design and analysis of algorithms for wireless networks, sensor networks, parallel and distributed systems, and special purpose architectures. She has published about 50 refereed papers in international journals, conferences, and workshops. She has been a guest coeditor for special issues of international journals. She is on the editorial board of the International Journal of Parallel, Emergent and Distributed Systems.

Radu Prodan received his Master's degree in Computer Science from the Technical University of Cluj-Napoca, Romania, in 1997. Between 1998 and 2001 he served as a Research Assistant in Switzerland at ETH Zurich, the University of Basel, and the Swiss Centre for Scientific Computing. In 2001 he joined the Institute for Software Science, University of Vienna, and earned his Ph.D. in 2004 from the Vienna University of Technology. Prodan is currently an assistant professor at the Institute of Computer Science, University of Innsbruck. He is interested in distributed software architectures, compiler technology, performance analysis, and scheduling for parallel and Grid computing. Prodan has participated in several national and European projects and is currently workpackage leader in the IST-034601 edutain@grid project. He is the author of over 50 papers, including one book and over 10 journal articles, and has received one IEEE best paper award.

Dang Minh Quan is a senior researcher at the School of Information Technology at the International University in Bruchsal. Dr. Quan received his Eng. degree (2001) and his M.Sc. degree (2003) from Hanoi University of Technology, Vietnam, and his Ph.D. (2006) from the University of Paderborn, Germany. Dr. Quan's current research focuses on High Performance Computing and Grid computing. In particular, he puts special focus on supporting the management of SLA-based workflows in the Grid.

Rajiv Ranjan is a postdoctoral research fellow in the Grids laboratory, Department of Computer Science and Software Engineering, the University of Melbourne. Dr. Ranjan has authored/co-authored more than 15 papers, published in well-reputed international conferences, journals, and edited books. His current research interests include the design, development, and implementation of algorithms, software frameworks, and middleware services for realizing autonomic Grid and Cloud computing systems. In particular, he researches next-generation decentralized protocols, data indexing algorithms, and fault-tolerant scheduling heuristics for the autonomic management of applications in large scale Grid and Cloud computing environments.

Thomas Rauber received his Master's degree, his PhD degree, and the Habilitation in Computer Science from the Universität des Saarlandes (Saarbrücken) in 1986, 1990, and 1996, respectively. From 1996 to 2002, he was a professor of computer science at the Martin-Luther-University Halle-Wittenberg. He joined the University of Bayreuth in 2002, where he holds the chair for parallel and distributed systems. His research interests include parallel and distributed algorithms, programming environments for parallel and distributed systems, compiler optimizations, and performance prediction.

Mika Rautiainen is currently working as a post-doctoral researcher at the University of Oulu. He received his M.Sc. (Eng.) and Dr. Tech. degrees from the Department of Electrical and Information
12
About the Contributors
gineering, University of Oulu, Finland, in 2001 and 2006, respectively. His research interests include content-based multimedia management and retrieval systems, pattern recognition, and digital image and video processing and understanding. Ala Rezmerita is a Ph.D. student in the Cluster and Grid group of the LRI laboratory at Paris-South University and is a member of the Grand-Large team of INRIA. She has obtained a Master in computer science in 2005 from the French University of Paris 7 – Denis Diderot. Her research interests include parallel and distributed computing, grid middleware and Desktop Grid. Romeo Rizzi was born in 1967. He received the Laurea degree in Electronic Engineering from the Politecnico di Milano in 1991, and in 1997 he received a Ph.D. in Computational Mathematics and Informatics from the University of Padova, Italy. Afterwards, he held Post-Doc and other temporary positions at research centers like CWI (Amsterdam, Holland), BRICS (Aarhus, Denmark) and IRST (Trento, Italy). In March 2001, he became an Assistant Professor at the University of Trento. Since 2005, he is with the University of Udine, as an Associated Professor. He is fond of combinatorial optimization and algorithms and has a background in operations research. Won W. Ro received the B.S. degree in Electrical Engineering from Yonsei University, Seoul, Korea, in 1996. He received the M.S. and Ph.D. degrees in Electrical Engineering from the University of Southern California in 1999 and 2004, respectively. He also worked as a research scientist in Electrical Engineering and Computer Science Department in University of California, Irvine. Dr. Ro has worked as an Assistant Professor in the Department of Electrical and Computer Engineering of the California State University, Northridge. He also worked as a college intern in Apple Computer Inc. and as a contract software engineer in ARM Inc. His current research interest includes high-performance microprocessor design, compiler optimization, and embedded system designs. (http://escal.yonsei.ac.kr) Gudula Rünger received her Master degree and her PhD degree in mathematics from the University of Cologne in 1985 and 1989, respectively, and the Habilitation in Computer Science from the University des Saarlandes (Saarbrücken) in 1996. From 1997 to 2000, she has been professor for computer science at the University Leipzig. Since 2000 she is full professor at the Technical University of Chemnitz. Her research interest include parallel applications, parallel programming languages and libraries, scientific computing, software tools for mixed programming models, as well as algorithmic and parallel adaptivity. Haiying Shen received the BS degree in Computer Science and Engineering from Tongji University, China in 2000, and the MS and Ph.D. degrees in Computer Engineering from Wayne State University in 2004 and 2006, respectively. She is currently an Assistant Professor in the Department of Computer Science and Computer Engineering of the University of Arkansas. Her research interests include distributed and parallel computer systems and networks, with an emphasis on peer-to-peer networks, wireless networks, resource management in cluster and grid computing, and data processing. She has been a PC member of many conferences, and a member of IEEE and ACM. Wei Shen is currently a Ph.D. candidate at the University of Cincinnati, USA. He received his B.E. degree from Anhui Normal University, China, in 1997, and an M. E. degree from Nanjing University
Wei Shen is currently a Ph.D. candidate at the University of Cincinnati, USA. He received his B.E. degree from Anhui Normal University, China, in 1997, and an M.E. degree from Nanjing University of Posts and Telecommunications in 2001, both in electrical engineering. His current research interests include resource and mobility management of wireless and mobile networks, QoS provision, and next-generation heterogeneous wireless networks.
Mohammad Shorfuzzaman is a PhD student in the Department of Computer Science, University of Manitoba (UofM), Canada. He received his B.Sc.Engg. (Bachelor of Science and Engineering) in Computer Science and Engineering from Bangladesh University of Engineering and Technology, Bangladesh, in 2001 and his M.Sc. degree in Computer Science from the University of Manitoba in 2005. Prior to his current studies at UofM, he worked as a Lecturer at the Asian University of Bangladesh for one year. His research interests include distributed systems and, in particular, Grid computing.
Siang Wun Song is a Professor in the Department of Computer Science, University of Sao Paulo, Brazil, where he was formerly dean of the Institute of Mathematics and Statistics. He holds a PhD in Computer Science obtained at Carnegie Mellon University in 1981. He was on the editorial boards of Parallel Computing and Parallel and Distributed Computing Practices. He is currently on the editorial boards of Parallel Processing Letters, Scalable Computing: Practice and Experience, and the Journal of the Brazilian Computer Society. His area of interest is the design of parallel algorithms.
Jun-Zhao Sun received his Dr. Eng. degree in computer science in 1999 from the Harbin Institute of Technology in China. He has been a senior researcher at the Department of Electrical and Information Engineering, University of Oulu, Finland, since 2000. Since 2006 he has worked as an academy research fellow for the Academy of Finland. His research interests are in mobile and pervasive computing, wireless sensor networks, context awareness, middleware, and mobility management.
Sabin Tabirca is a lecturer in the Department of Computer Science at the National University of Ireland, Cork. His main research interest is in Mobile Multimedia, with an emphasis on visualisation and graphics. He has published more than 130 articles in the areas of HPC computing and Mobile Multimedia.
Feilong Tang received his Ph.D. degree in Computer Science and Technology from Shanghai Jiao Tong University (SJTU), China, in 2005. He now works in the Department of Computer Science and Engineering, Shanghai Jiao Tong University. His research interests focus on grid and pervasive computing, distributed transaction processing, wireless sensor networks, and distributed computing.
Yong Meng Teo is an Associate Professor with the Department of Computer Science at the National University of Singapore, and an Associate Senior Scientist at the Asia-Pacific Science & Technology Center, Sun Microsystems Inc. He heads the Computer Systems Research Laboratory and the Information Technology Unit. He was a Fellow of the Singapore-Massachusetts Institute of Technology Alliance from 2002 to 2006. He received his MSc and PhD in Computer Science from the University of Manchester, UK, in 1987 and 1989, respectively. His main research interest is in parallel and distributed systems, covering the organization, programming models, networking, and performance of multi-core, grid, and peer-to-peer systems. Current projects include peer-to-peer networks, performance analysis of large systems, fault-tolerant consensus in distributed systems, and component-based modeling and simulation.
Parimala Thulasiraman received B.Eng. (Honors) and M.A.Sc. degrees in Computer Engineering from Concordia University, Montreal, Canada, and obtained her Ph.D. from the University of Delaware, Newark, DE, USA, after completing most of her degree requirements at McGill University, Montreal, Canada. She is now an Associate Professor with the Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada. The focus of her research is on parallel algorithms for applications such as computational biology, computational finance, medical imaging, and computational medicine on advanced architectures. Over the past few years, she has been working on distributed algorithms for mobile networks using nature-inspired algorithms such as Ant Colony Optimization techniques. She has published several papers in these areas in leading journals and conferences and has graduated many students. Parimala has organized conferences as local chair, program chair, and tutorial chair. She has been serving as a reviewer and program committee member for many conferences, and has also been a reviewer for many leading journals. She is a member of the ACM and IEEE societies.
Ruppa K. Thulasiram (Tulsi) is an Associate Professor with the Department of Computer Science, University of Manitoba, Winnipeg, Manitoba. He received his Ph.D. from the Indian Institute of Science, Bangalore, India, and spent years at Concordia University, Montreal, Canada; Georgia Institute of Technology, Atlanta; and the University of Delaware as Post-doc, Research Staff, and Research Faculty before taking up his position at the University of Manitoba. Tulsi has been trained in Mathematics, Applied Science, Aerospace Engineering, Computer Science, and Finance during various stages of his schooling and postdoctoral positions. Tulsi's current primary research interest is in the emerging area of Computational Finance. He has developed a curriculum for a cross-disciplinary computational finance course at the University of Manitoba and currently teaches it at both the graduate and undergraduate levels, and he has trained and graduated many students in this area. His research interests include Scientific and Grid Computing, Bio-inspired Algorithms for Finance, M-Commerce Applications, and Mathematical Finance, where he has been training many graduate students. He has published a number of papers in the areas of High Temperature Physics, Gas Dynamics, Scientific Computing, and Computational Finance in leading journals and conferences and has won best and distinguished paper awards at prominent conferences. Tulsi has been serving on many conference technical committees related to parallel and distributed computing, Neural Networks, and Computational Finance as program chair, general chair, etc., and has been a reviewer for many conferences and journals. He is a member of the ACM and IEEE societies.
Daxin Tian received the B.S., M.S., and Ph.D. (Hons) degrees in computer science from Jilin University, Changchun, China, in July 2002, July 2005, and December 2007, respectively. His research interests include network security, intrusion detection systems, neural networks, and machine learning.
Sameer Tilak is an assistant research scientist at the University of California, San Diego. He is involved in the design and development of the cyberinfrastructure for a number of large-scale sensor-based environmental observing system initiatives, including the Global Lake Ecological Observatory Network (GLEON) and the Coral Reef Environmental Observatory Network (CREON). He received his Ph.D.
and M.S. in computer science from SUNY Binghamton in 2005 (degree conferred: January 2006) and 2002, respectively. He received his M.S. in computer science from the University of Rochester in 2003. His research interests include wireless networks (specifically ad-hoc and sensor networks), grid computing, stream data management, and parallel discrete-event simulation. He has served as a TPC
member for numerous conferences and workshops, including IEEE PerCom 2009, DCOSS (2008-2009), IEEE LCN (2007-2009), IEEE SECON 2009, ACM-IEEE MSWiM 2008, IEEE SenseApp (2007-2009), and IEEE e-Science 2007.
Cho-Li Wang received his Ph.D. degree in Computer Engineering from the University of Southern California in 1995. He is currently an associate professor in the Department of Computer Science at the University of Hong Kong. Dr. Wang's research interests mainly focus on distributed Java virtual machines on clusters, Grid middleware, and software systems for pervasive/mobile computing. Dr. Wang serves on a number of editorial boards, including IEEE Transactions on Computers (TC), Multiagent and Grid Systems (MAGS), and the International Journal of Pervasive Computing and Communications (JPCC). He is the regional coordinator (Hong Kong) of the IEEE Technical Committee on Scalable Computing (TCSC).
Sheng-De Wang was born in Taiwan in 1957. He received the B.S. degree from National Tsing Hua University, Hsinchu, Taiwan, in 1980, and the M.S. and Ph.D. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1982 and 1986, respectively. Since 1986 he has been on the faculty of the Department of Electrical Engineering at National Taiwan University, Taipei, Taiwan, where he is currently a professor. From 1995 to 2001, he also served as the director of the computer operating group of the Computer and Information Network Center, National Taiwan University. He was a visiting scholar in the Department of Electrical Engineering, University of Washington, Seattle, during the academic year 1998-1999. From 2001 to 2003, he served a two-year appointment as Chair of the Department of Electrical Engineering, National Chi Nan University, Puli, Taiwan. His research interests include parallel and distributed computing, embedded systems, and intelligent systems. Dr. Wang is a member of the Association for Computing Machinery and the IEEE Computer Society. He is also a member of the Phi Tau Phi Honor Society.
Yang Xiang is currently with the School of Management and Information Systems, Central Queensland University. His research interests include network and system security, and wireless systems. He has served or is serving as PC Chair for the 11th IEEE International Conference on High Performance Computing and Communications (HPCC 09), the 3rd IEEE International Conference on Network and System Security (NSS 09), and the 14th IEEE International Conference on Parallel and Distributed Systems (ICPADS 08). He has served or is serving as guest editor for ACM Transactions on Autonomous and Adaptive Systems, Journal of Network and Computer Applications, and Concurrency and Computation: Practice and Experience.
Meilian Xu is a PhD student at the University of Manitoba. She received her M.Sc. degree in Computer Science from Peking University and her B.E. degree in Computer Science from East China Normal University in China. Her research interest is high performance computing and parallel algorithm design for applications such as medical imaging on parallel systems, focusing on multi-core architectures such as the Cell Broadband Engine architecture. She has published several papers in leading conferences in this direction. She is a member of the ACM and IEEE societies.
Jaeyoung Yi received the B.S. degree in Mathematics and Computer Science from Yonsei University in March 2008. Currently, she is in the master's degree program of the School of Electrical and Electronic Engineering, Yonsei University. Jaeyoung's research interests include multi-processor system-on-a-chip architectures.
Mika Ylianttila is a professor and adjunct professor in computer science and information networks at the Information Processing Laboratory and a Research Manager of the MediaTeam Oulu research group at the University of Oulu, Finland. His research interests include mobile applications and services, protocol design and performance, and communication and middleware architectures. He is a senior member of the IEEE.
Jiehan Zhou is currently working as a research scientist at MediaTeam, Information Processing Laboratory, University of Oulu. He obtained his PhD in manufacturing and automation from the Huazhong University of Science and Technology, Wuhan, China, in 2000. He carried out two years of postdoctoral research in CIMS, Department of Automation, Tsinghua University, Beijing, China, and worked at VTT Oulu, Finland, and at INRIA Sophia Antipolis, France, during an 18-month ERCIM fellowship. His current research interests include middleware, community coordinated multimedia, service-oriented computing, ontology engineering, the semantic Web, and protocol engineering.
Yifeng Zhu received the BSc degree in electrical engineering in 1998 from the Huazhong University of Science and Technology, China, and the MS and PhD degrees in computer science from the University of Nebraska, Lincoln, in 2002 and 2005, respectively. He is currently an assistant professor in Electrical and Computer Engineering at the University of Maine. His research interests include parallel I/O storage systems, supercomputing, energy-aware memory systems, and wireless sensor networks. He served as the program chair of IEEE NAS'09 and IEEE SNAPI'07, and as the guest editor of a special issue of the International Journal of High Performance Computing and Networking. He received the Best Paper Award at IEEE CLUSTER'07.
Albert Y. Zomaya currently holds the Chair of High Performance Computing and Networking in the School of Information Technologies at Sydney University. He is the author or co-author of seven books and more than 300 papers, and the editor of eight books and eight conference proceedings. He serves as an associate editor for 16 leading journals. Professor Zomaya is the recipient of the Meritorious Service Award (in 2000) and the Golden Core Recognition (in 2006), both from the IEEE Computer Society. He is a Chartered Engineer (CEng), a Fellow of the American Association for the Advancement of Science, the IEEE, and the Institution of Engineering and Technology (U.K.), and a Distinguished Engineer of the ACM.
Index
A α-approximation algorithm 645, 649 Abstract Data and Communication Library (ADCL) 583, 585, 587, 588, 589, 590, 591, 592, 593, 594, 595, 598, 600, 601, 602, 603 access points (APs) 719, 721, 725, 726 adaptation point 888 ALF programming system 297, 309 Amazon Cloud 63, 79, 83, 86 AMD 266, 277, 279, 291, 313, 314, 315, 317, 319, 322, 323, 332, 336 American option 472 analytical hierarchy process (AHP) 723 Aneka Coordinator 195, 196, 197, 198, 199, 200, 201, 208, 209 Aneka-Federation 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 202, 205, 207, 208, 209, 212, 213, 214, 217 Apache Tomcat 658, 659, 660, 661, 663, 664, 665, 669, 671, 672, 673, 674, 677, 678, 679 application-level approach 881, 882, 888, 890 application programmer's interface (API) 43, 66, 67, 71, 84, 86 application scalability 73 applications, data-intensive 1, 3, 8, 42, 51, 52, 74, 116 arithmetic operation 816, 840, assembly technology (AT) 295, 296, 297, 299, 300, 301, 303, 305, 309, 310 atomicity, consistency, isolation, & durability (ACID) 422 atomic transaction 422, 425, 429, 441 authentication 573, 574, 576, 582
Automatically Tuned Collective Communications (ATCC) 586 Automatically Tuned Linear Algebra Software (ATLAS) 586 autonomic computing 22, 26, 28 autonomous system (AS) 740, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 759
B back-end pipeline 555, 559, 562 base stations (BSs) 719, 721, 725, 726 Basic Linear Algebra Software (BLAS) library 586, 605 basic similarity algorithm 383, 384, 385, 387 behavior classification 339, 342, 344, 345, 346 behavior extraction 338, 339, 341, 346, 347, 348 behavior prediction 339, 342, 343, 345, 347 benchmarks 92, 95, 96, 102, 103, 105, 109, 112, 113, 114, 115 Berkeley Open Infrastructure for Network Computing (BOINC) 32, 33, 37, 38, 39, 40, 41, 42, 43, 44, 48, 49, 50, 51, 52, 53, 54, 56 best effort (BE) model 739, 754 bin-packing 645 bioinformatics 843, 844, 856, 857 BitTorrent 124, 128, 138 Bloom filter 787, 793, 794, 799, 802, 803, 804, 806, 807, 861 Bloom filter array 807 Bloom filter replica 799, 802, 803, 807 Bloom filter update protocol 807 Bluetooth 705, 706, 707, 712, 713, 714, 715, 716, 717
Volume I: pages 1-485; Volume II: pages 486-894
branch prediction 568 branch prediction units (BPU) 567
C cache coherence communication 559 call option 472, 477, 481 call tracing 340 Cell Broadband Engine (Cell/B.E.) architecture 312, 314, 315, 320, 321, 322, 323, 324, 325, 327, 328, 329, 330, 331, 332, 333, 335, 336 checkpointing 875, 878, 883, 886, 888, 891, 892, 893 chip multi-processing (CMP) 556, 557, 558, 559, 578 chords 143, 145, 148, 149, 160 churn 128, 144, 146, 147, 154, 158, 160, 164, 165, 168, 169, 174, 176, 179, 180, 181, 182, 183, 184, 186 clients 33, 34, 36, 37, 38, 39, 44, 49, 50, 60, 65, 79 client schedule coordinator 33, 34 client schedule coordinator coordinator 44 close to the metal (CTM) parallel programming tool 291 Cloud computing 53, 54, 62, 76, 79, 80, 83, 84 Clouds, Aneka 210, 211, 212 Clouds, enterprise 191, 192, 193, 194, 195, 198, 200, 208, 212, 217 Cloud services, decentralized 195, 196, 197, 198, 203, 207 cluster 234, 250, 262, 263, 264, 265, 266, 267, 268, 270, 275, 312, 322, 323, 324, 328, 330, 331, 332 cluster computing 856, 891, 892, 894 coarse-grained SMP designs 563 coarse-grained synchronization 564 coarse-grain multithreading 556 collaboration disciplines 414 communicating multiprocessor tasks (CMtasks) 260, 261, 262, 265, 274 communication reliability 220 communication round 378, 379, 380, 393 communication, synchronization and 552, 564 community coordinated multimedia (CCM) 682, 683, 684, 685, 686, 687, 688, 689,
690, 691, 692, 693, 694, 695, 696, 697, 698, 703 compensating transaction 422, 424, 426, 433, 434, 435, 436, 437, 438 compiler tool 246, 248, 249, 250, 251, 254, 255, 256, 261, 262, 265, 269, 273, 277, 294, 321, 322, 323, 324, 330, 333 computational biology 843, 844, 856 computational grid 470, 471, 472 computation mobility 874, 875, 876, 878, 879, 880, 881, 890, 891 compute unified device architecture (CUDA) 288, 291 concurrent-read exclusive-write (CREW) 670, 678 condition number 776, 777 controlled flooding mechanisms 123 cross-organizational collaboration 397, 400, 402, 404, 413, 414 cross-organizational service invocation, QoS of 416 cycle stealing 32
D data conversion 876, 878, 885, 886, 887 data grids 2, 12, 63, 512, 513, 514, 515 data packets 280, 282, 283, 284, 285, 287 data parallelism 246, 247, 248, 249, 251, 268, 269, 270, 272, 273, 274, 275, 319 data replication 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515 data separation 69 deadlock monitoring, dynamic 571, 572 deadlock situations 563, 564, 567, 569, 570, 571 decoding process 285 deep packet inspection (DPI) 873 DEISA Extreme Computing Initiative (DECI) 62, 77, 78, 79, 80, 83, 86 desktop grid 41, 42, 48, 54, 55, 56, 57, 60 Differentiated Services (Diffserv) 739, 742, 751, 753, 754, 757, 758, 759 digital image processing 808, 809
diskless checkpointing 760, 761, 763, 764, 765, 767, 768, 770, 782 distributed computing 1, 2, 7, 13, 14, 21, 27, 32, 36, 39, 57, 58, 60, 86, 87, 88 Distributed European Initiative for Supercomputing Applications (DEISA) project 62, 70, 73, 76, 77, 78, 79, 80, 83, 86, 87 distributed hash tables (DHTs) 124, 143, 160, 161, 163, 164, 165, 166, 169, 174, 180, 186, 190, 193, 201, 207, 217 distributed Java virtual machine (DJVM) 658, 659, 660, 661, 662, 663, 664, 665, 666, 670, 671, 673, 674, 675, 677, 678, 679, 681 Distributed Membership Query 807 distributed processing 810, 822, 824, 827, 831, 833, 834, 838, 840 distributed transaction processing (DTP) 423 divisible load theory (DLT) 827, 841, 842, 844, 846, 851, 853, 854 domain-specific service 397, 398 dynamic evaluation 339 dynamic host configuration protocol (DHCP) server 705, 707, 708 dynamic load balancing 296, 299, 300, 305, 306, 307, 308 dynamic queries 130, 131, 133, 134, 135, 136, 138 dynamic tunability 309 dynamism 163, 164, 165, 169, 171, 186
E
ECC-like mechanisms 568, 569 encoding algorithm 282, 283 encoding/decoding speed 281 encoding process 284, 285 EnginFrame (portal technology) 75, 76, 83, 87 error correcting code (ECC) 567, 568, 569 error recovery 442, 444, 446, 447, 448, 450, 451, 452, 456, 465, 466, 467, 468, 469 errors, large-scale 445, 468 errors, small-scale 450, 467, 468 European option 472 evolving systems 608, 609, 611, 643 execution traces 342 experimental design 91, 95, 97, 102, 105, 109, 116, 118 experimental results 385, 387, 392, 393
F
false negative 786, 789, 791, 792, 793, 794, 795, 797, 799, 800, 802, 803, 804, 805, 807, 864 false positive 786, 787, 788, 789, 791, 792, 794, 795, 797, 799, 800, 802, 804, 807, 864, 870 fast Fourier transform (FFT) 312, 314, 324, 325, 326, 327, 328, 329, 330, 331, 332, 335 fault tolerance 22, 39, 44, 45, 52, 447, 469, 486, 487, 490, 552, 566, 567, 570, 579 fetch policy 552, 559, 560, 561, 562, 578 financial options 472 fine-grained parallelism 564 fine-grained synchronization 565, 581 fine-grain multithreading 556 finger 143, 145, 147, 156, 157, 162 finite difference time domain (FDTD) 312, 314, 315, 316, 317, 318, 319, 320, 321, 323, 324, 325, 331, 332, 334, 335 Foster, Ian 2, 12, 15, 21, 22, 25, 26, 27, 33, 40, 41, 46, 56, 57, 59, 63, 70, 83, 84, 85, 89, 90, 92, 96, 119 front-end pipeline 555, 559, 560, 562 functional unit (FU) 566
G
Gaussian processing 822, 831, 833, 834 Gaussian random matrices 781 general purpose computation on GPUs (GPGPUs) 289 general timing constraint model 402 global object space (GOS) 659, 663, 665, 666, 667, 668, 672, 673, 674, 675, 676, 679, 681 Globus grid middleware 2, 4, 5, 6, 8, 12, 13, 21, 23, 24, 65, 67, 71, 75, 85, 87, 92, 96, 119 Google app engine 84 granularity 841, 875, 876, 879, 891 graphics data 279, 288, 289, 290, 291, 292
graphics processing unit (GPU) 278, 279, 288, 289, 290, 291, 293, 314 Greedy join algorithm 705, 709 grey relational analysis (GRA) 723 grid application toolkit (GAT) 21, 66, 84 grid-based workflow 469 grid compute commodities (gccs) 472, 473, 474 grid computing 2, 4, 6, 12, 13, 25, 26, 27, 28, 29, 56, 58, 66, 81, 82, 86, 87, 119, 220, 221, 222, 223, 242, 243, 421, 422, 243, 472, 473, 478, 512, 514, 515 grid-enabled operating system (GridOS) 3, 12 grid engine 83 grid environment 346, 401, 404 GridFTP 5, 6, 10, 12, 13, 65, 66 grid index information service (GIIS) 65 grid infrastructure 396 grid middleware 1, 2, 4, 6, 12, 24, 26, 66 grid performance 220 grid performance measure 220 grid, pervasive 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 17, 25, 26, 28, 29, 30 grid portals 83 grid reliability 219, 220, 241 grid resource allocation and management (GRAM) 5, 6, 12, 13 grid resource information service (GRIS) 65 grid resource pricing 472 grid resources utilization and pricing (GRUP) matrix 478 grid security 13 grid security infrastructure (GSI) 5, 6, 8, 12, 13 grid service modeling 223 grid service performance 223 grid service reliability 223, 227, 241, 242 grid system performance 221 grid systems reliability 221 grid transaction service (GridTS) 421, 422, 425, 426, 427, 428, 429, 433, 434, 435, 436, 437, 438, 439 groups, collision of 141
H handoff, horizontal 721, 722, 725, 728, 729, 730, 731, 732, 735
handoff, vertical 721, 722, 723, 724, 725, 728, 730, 734, 735, 736 handoff, vertical, downward (DVH) 722, 730 handoff, vertical, upward (UVH) 722, 730 heterogeneity 163, 164, 165, 166, 168, 171, 180, 181, 185, 186, 187, 189 heterogeneous multi-core processors 278, 314, 335 high performance computing (HPC) 583, 585, 587, 710, 711 home-based lazy release consistency 668, 669, 670, 677, 678 homogeneous multi-core processors 277, 291, 313 hyper-threading (HT) 559
I IBM Blue Gene 583, 596, 599 ICOUNT policy 560, 561, 562, 563, 569, 571, 572 IEEE 802.11 (standard) 719, 736 IEEE 802.11x (standard) 705, 707, 717 ILP wall 313 image partitioning 808, 810, 811, 814, 821, 823, 824, 825, 826, 827, 828, 830, 831, 832, 833, 834, 835, 836, 838, 840 image processing 808, 809, 810, 811, 814, 819, 821, 822, 823, 827, 831, 836, 838, 842 improved similarity algorithm 387, 388, 389, 392, 393 index caching 123, 124, 127, 128, 134, 136 indirect swap network (ISN) 312, 314, 324, 326, 327, 329, 335 InfiniBand interconnect 584 infostation 645, 646, 647 instruction fetch queue (IFQ) 563, 568, 569, 571, 572 instruction per cycle (IPC) 555 integrated heterogeneous wireless and mobile network (IHWMN) 718, 719, 720, 721, 722, 723, 724, 728, 734, 737 Integrated Services (Intserv) 739, 742, 751, 759 inter-domain 742, 743, 752, 754 interest groups 123, 125
Internet service provider (ISP) 744, 745, 747, 748, 758 Internet volunteer desktop grids, (IVDG) 36 inter-operation 517, 518, 519, 526, 534, 535 interval graph coloring 645, 647, 655 intra-domain 740, 741, 742, 752, 753, 754 intrusion detection systems (IDSs) 277, 286, 287, 858 I/O redirection, transparent 663, 674 IP-based networks 707 issue queue (IssueQ) 567, 568, 569
J Java 2 Platform, Enterprise Edition (J2EE) 658, 660 Java Bytecode 661, 662, 667, 678, 679 job deadlines 41 job descriptions 34, 96 job management 4, 13, 23, 39, 64, 94 job parameters 34 job scheduling 33, 56, 59, 120, 346, 349, 350, 352 jobs, execution of 37, 39, 46 jobs, high-throughput 36, 41 jobs, homogeneous 38 jobs, low latency 41, 57 jobs, pilot 46 jobs, stand-alone 37 job submission 6, 39
K Kesselman, Carl 12, 15, 23, 25, 26, 41, 56, 63, 70, 84, 89, 90, 92, 96, 119 key-value pair 141, 142
L language extension 249 leading thread (LT) 568, 569, 570, 571, 572 load balancing method 163, 164, 167, 186, 187 load distribution problem 842 load queue (LQ) 567, 570, 571 load value queue (LVQ) 568, 569, 570, 571 logical file names (LFN) 5, 10, 13 long-latency instructions 552, 559, 560, 561 long-lived transaction 421, 422, 424, 425, 431, 435, 439, 441
M MANETs, IP-based 707, 708, 715 mapping 243, 246, 248, 249, 250, 251, 253, 258, 259, 262, 263, 264, 265, 267, 268, 269, 274, 302, 312, 314, 318, 324, 326 Master-Worker computing paradigm 32, 33, 60 matrix matrix multiplication 770, 772, 773, 774, 775 memory hierarchy 277, 279, 331 memory management unit (MMU) 574, 576 Memory Wall problem 313, 553, 560 message passing 21, 22, 85 message passing interface (MPI) 71, 73, 761, 762, 763, 772, 777, 778, 779, 781, 782 microarchitecture (µarch) 552, 557, 559, 573, 576, 577, 579, 580, 581 middleware 659, 663, 677, 679, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 694, 695, 697, 698, 700, 701, 703 mixed parallelism 248 mobile ad hoc networks (MANETs) 705, 706, 707, 708, 709, 710, 715, 717 mobile agent 879, 880 mobile computing 689, 690, 691, 700 mobile message passing interface (MMPI) 713, 714, 715, 717 mobile middleware 706 monitoring and discovery service (MDS) 6, 10, 65 Moore’s law 553, 582 multi-core architecture 278, 292, 313, 314, 335 multi-core processing 858, 859, 860, 861, 862, 863, 864, 865, 867, 868, 870, 871, 872, 873, 874, 875, 876, 879, 881, 882, 890, 891 multi-core processors 248, 263, 266, 268, 269, 271, 275, 276, 277, 278, 279, 280, 281, 282, 285, 286, 287, 288, 291, 292, 294, 312, 313, 314, 315, 319, 333, 335, 336, 557, 559, 576, 577, 579 multi-dimensional queries 217 multi-mode interfaces 719, 721 multiprocessors 313, 314 multiprocessor scheduling 645, 647
multiprocessor task (M-task) 246, 247, 248, 255, 257, 258, 260, 262, 263, 264, 265, 267, 268, 274, 275 multi-programming workload 555, 557 multi protocol label switching (MPLS) 739, 741, 742, 751
N near copy 823, 833, 836 NEC Earth Simulator 583 Needleman-Wunsch Algorithm 841, 842, 845, 846, 848, 855 net_id 705, 709, 710, 715 network bandwidth 51, 60, 67, 68, 69 network coding 277, 280, 281, 285, 293 network policy 739 nodes 124, 125, 139, 140, 141, 142, 144, 145, 146, 147, 149, 151, 152, 153, 154, 155, 157, 158, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 190, 191, 192, 194, 195, 197, 200, 202, 206, 207, 208, 210, 212, 213, 214, 217 nodes, predecessor 142, 143, 145, 146, 149, 151, 152, 155, 156, 157, 171, 173 nodes, successor 142, 143, 145, 146, 147, 148, 149, 150, 151, 152, 155, 156, 157, 162, 171, 173 non-singular sub-matrix 768, 776 NVIDIA 279, 288, 291, 292, 293
O object transaction service (OTS) 423 occupancy counter (OC) 562, 563 on-line algorithm 645, 648, 649, 652 Open Grid Forum (OGF) 66, 67, 80, 85, 87 open grid services architecture (OGSA) 23, 71, 87 overlay networking 191
P parallel computing 809, 810, 862, 877, 878, 879
parallelism 246, 247, 248, 249, 250, 251, 254, 255, 256, 265, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 280, 281, 287, 288, 292, 313, 315, 319, 320, 321, 336 parallelism, instruction level (ILP) 277, 294, 313, 552, 555, 556, 557, 559, 565, 566, 573, 577 parallelism, thread level (TLP) 294, 552, 556, 557, 558, 559, 561, 562, 573, 577 parallel processing system 809 parallel program, fragmented 295 parallel programming 246, 247, 248, 250, 269, 271, 273, 274, 288, 291, 295, 319 parallel-programming workload 555, 557 parallel programs 246, 250, 275, 295, 300, 301, 308 parallel shaders 290 parallel task 246, 251, 253, 272, 275 parameter jobs 72 particle in cell (PIC) method 295, 296, 297, 298, 299, 302, 303, 304, 305, 306, 307, 308, 310 partitionable analysis 607 partitioning, functional 607 partitioning, physical 607 patrolling thread (PT) 573, 574, 576 pattern matching 287, 288 peering 534, 535, 543, 546 peer-to-peer networks 123, 124, 125, 126, 128, 130, 138, 140, 161, 187, 188, 189, 215, 216 peer-to-peer networks, Gnutella-like 125 peer-to-peer networks, structured 124, 126, 189 peer-to-peer networks, unstructured 123, 124, 126, 138 peer to peer (P2P) 2, 4, 12, 18, 23, 24, 35, 37, 39, 40, 42, 43, 44, 52, 53 performance prediction 92, 95, 97, 99, 118 physical file names (PFN) 5, 13 Piconet 712, 713, 714 pipelining 766 pixel shader 290, 291 policy based management 742 policy based network architecture 740, 741 policy based networking (PBN) 739, 759
policy compliance 741 policy management 740 policy negotiation 742 policy rules 741, 756 policy statements 739 portal 6, 10, 25, 63, 72, 74, 75, 81, 83, 85, 87 power processor element (PPE) 314, 315, 320, 321, 322, 328, 329, 330, 335 Power Wall problem 313, 557 price variant factor (pvf) 473, 475 process migration 876, 877, 879, 880, 882, 889, 892, 893 process scheduling 338, 347 program composition 70 program counter (PC) 557, 573, 583, 584, 595 proximity 144, 159, 163, 164, 165, 168, 169, 175, 176, 179, 180, 186, 188, 189, 190 proxy-based clustered architecture 4 proxy-based wireless grid architecture 4 proxy server, dedicated 4 put option 472, 477
Q QoS architecture, scalable internet 739, 740, 741, 742, 743, 745, 750 QoS, end-to-end 739, 740, 741, 742, 743, 749, 754, 755, 756, 757, 759 QoS evaluation, time-related 396, 398, 401, 402, 404, 412, 416 QoS routing 740, 749 QoS, workflow 402 quality of service (QoS) 3, 4, 8, 15, 19, 20, 396, 398, 400, 401, 402, 404, 412, 414, 415, 416, 418, 419, 443, 448, 472, 473, 479, 480, 482, 484, 485, 489, 490, 497, 508, 513, 515, 516, 549, 723, 724, 728, 730, 731, 732, 734, 735, 739, 740, 741, 742, 743, 745, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 759
R random walk search algorithm 131, 132, 133 real option 471, 472, 473, 474, 478, 483 real-time deep packet inspection 859 real-time packet inspection 858 real-time requirement 859, 862
real-time system, development of 606 reliability 421, 422, 427, 439, 441, 488, 492, 495, 496, 500, 502, 509, 543 reorder buffer (ROB) 561, 562, 563, 567, 568, 570, 571 replica consistency 486, 487, 489, 491, 501, 502, 512, 516 replica location service (RLS) 5, 10, 13 replica placement 489, 490, 492, 494, 495, 496, 497, 504, 505, 507, 508, 509, 510, 511, 515, 516 replica selection 490, 498, 499, 500, 501, 512, 515, 516 replica selection service 490, 498, 501, 516 replica servers 488, 496, 499, 501, 508 reservation-based analysis (RBA) 608, 609, 630, 631, 640, 641, 642, 643, 644 resource broker 530, 531, 532, 533, 539, 547, 549 resource discovery 144, 192, 193, 194, 195, 201, 203, 204, 216, 217 resource management (RM) 23, 27, 40, 59, 62, 63, 64, 83, 84, 92, 120, 741, 742, 743, 750, 752, 753, 757, 758 resource management system (RMS) 442, 443, 444, 445, 446, 448, 449, 450, 451, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 466, 467, 470, 535 resource reliability 220 resource sharing control 552, 562, 563 resource sharing networks 521, 526, 543 result certification 44, 46, 53, 60 running time curves 385
S scalability 89, 90, 95, 97, 101, 102, 104, 105, 109, 110, 112, 116, 118, 120, 415 scalable algorithm 645 ScaLAPACK algebra library 776, 778, 781 scatternet 707 schedule 247, 254, 255, 262 scheduler, adaptive 355 scheduler, non-adaptive 355 scheduling 90, 91, 92, 102, 103, 117, 118, 119, 120, 121, 338, 339, 342, 346, 347, 349, 350, 352, 354, 355, 356, 357, 358, 362,
365, 375, 376, 377, 384, 388, 389, 396, 397, 398, 401, 404, 405, 406, 407, 408, 409, 411, 415, 416, 418 scheduling decisions 338, 346 scheduling, distributed 339, 346 scheduling, non-preemptive 355 scheduling, preemptive 355, 377 scientific collaboration 397, 404, 405, 406, 407, 408, 409, 411, 416 scientific workflow 396, 397, 398, 400, 401, 404, 405, 406, 407, 409, 411, 412, 413, 414, 415, 416, 418 scientific workflow execution 396, 397, 398, 401, 404, 405, 406, 407, 412, 413, 414, 416 scientific workflow management system 397 semantic knowledge 14, 15, 19 sequence alignment 841, 842, 843, 844, 846, 848, 855, 856, 857 server allocation 645, 647, 648, 655 servers 33, 34, 36, 37, 38, 39, 44, 53, 65, 70, 82 service-based workflow system 404 service level agreements (SLA) 442, 443, 444, 445, 446, 448, 451, 468, 469, 470, 480, 481, 485, 529, 537, 545, 548, 740, 742, 745, 756 service-oriented provider 6, 8, 9 service repository 8 shaders 290, 291 Shared Hierarchical Academic Research Computing Network (SHARCNET) 473, 477, 480, 481, 482, 484 simple API for grid applications (SAGA) 66, 67, 84, 85, 86 simple storage service (S3) 472, 484 simultaneous multi-threading (SMT) 552, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 570, 571, 573, 577, 578, 579, 580, 581 single process, multiple data (SPMD) 247, 253, 255, 258, 273 SLA workflow broker 444 Software as a service (SaaS) 79 space of modeling (SM) 297, 298, 302, 303, 304, 305, 306, 308
spanning tree 234, 237 Spiral Architecture 808, 811, 814, 816, 819, 820, 821, 822, 823, 824, 825, 826, 827, 830, 831, 832, 833, 834, 836, 837, 838, 839, 840, SPMD, group 253, 258 stabilization 141, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 158, 159, 171 star topology 224, 227, 234, 243 static evaluation 339 storage broker 10 store queue (SQ) 567, 570, 571 string editing problem 383 string similarity problem 378, 379, 381, 382, 383 strong consistency algorithm 501 Sun Grid Engine 74, 75, 92 Sun Network.com 79 supernode 145, 147, 149, 150, 152, 153, 154, 155, 156, 157, 158, 172 superscalar processors 552, 555, 556, 561 superscalar sequential applications (STARSs) 71 symmetric multi-processing (SMP) 563, 564 synchronization 552, 563, 564, 565, 569, 577, 578, 581, 589 synergistic processor element (SPE) 315, 320, 321, 322, 324, 327, 328, 329, 330, 335
T Tabu search 375, 376, 377 task coordination 261 task graph 254, 257, 260, 262 task parallelism 248, 249, 250, 256, 268, 269, 271 temporal-dependable service 411 temporal-dependency relations 397 temporal disciplines 407, 412 temporal model 398, 401, 404, 407, 416 thread migration 874, 877, 878, 880, 881, 882, 889, 891, 892, 894 threads 277, 278, 282, 283, 284, 287, 291, 319, 322, 324 throttling 562, 563 timing analysis 606, 607, 608, 609, 610, 611, 612, 614, 615, 622, 628, 630, 643, 644
timing analysis, abstract 609 timing analysis, target-specific 609 timing requirements 606, 607, 608, 609, 623, 624, 631, 632, 634, 635 trace queue (traceQ) 568, 569, 570, 572 tracing 340, 341, 347 traffic engineering (TE) 759 trailing thread (TT) 568, 569, 570, 571, 572 transaction processing 421, 427, 428, 432, 439, 441 transient faults 566, 567, 579 tree topology 224, 234 tuning, dynamic 585, 588 tuning, static 585
U uncertainty 18, 19, 20, 25 uncertainty, application 19 uniform index caching 128 uniform interface for computing resources (UNICORE) 2, 77, 78, 79, 86, 88
V value prediction 565, 566, 580, 581 vertex shader 290, 291 very long instruction word (VLIW) 555, 557 virtual cluster 523, 528 virtual execution environment 523 virtualisation technology 519, 528, 534, 543 virtual machine 874, 876, 880, 881, 889, 891, 893, 894
virtual organizations (VOs) 24, 88, 510, 518, 519, 520, 524, 525, 526, 528, 530, 533, 536, 537, 538, 539, 541, 542 virus 858, 861, 870, 871 volatile consistency 666, 667, 668, 670, 671, 678 volunteer computing 38, 39, 48, 49, 54, 56, 57, 59
W Wavefront algorithm 379, 383, 387, 393, 394, 395 weak consistency algorithms 501 Web services 71, 87, 88, 682, 683, 684, 687, 695, 698, 703 wireless and mobile networks (WMNs) 718, 719, 720, 724, 726, 730, 734 wireless grid 4, 12, 13, 28 workers 37, 39, 40, 41, 46, 47, 57, 60 workflow model 402, 406 workflows 7, 62, 63, 71, 72, 73, 80, 90, 92, 93, 94, 98, 99, 102, 105, 106, 107, 108, 109, 110, 115, 119, 120 workflow, scientific 108
X XtremWeb 33, 36, 37, 39, 40, 41, 42, 43, 44, 46, 52, 56, 58, 61