Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
6082
Ching-Hsien Hsu Laurence T. Yang Jong Hyuk Park Sang-Soo Yeo (Eds.)
Algorithms and Architectures for Parallel Processing 10th International Conference, ICA3PP 2010 Busan, Korea, May 21-23, 2010 Workshops, Part II
Volume Editors

Ching-Hsien Hsu
Chung Hua University, Department of Computer Science and Information Engineering
Hsinchu, 300 Taiwan, China
E-mail: [email protected]

Laurence T. Yang
St. Francis Xavier University, Department of Computer Science
Antigonish, NS, B2G 2W5, Canada
E-mail: [email protected]

Jong Hyuk Park
Seoul National University of Technology, Department of Computer Science and Engineering
172 Gongreund 2-dong, Nowon-gu, Seoul, 139-742, Korea
E-mail: [email protected]

Sang-Soo Yeo
Mokwon University, Division of Computer Engineering
Daejeon 302-729, Korea
E-mail: [email protected]
Library of Congress Control Number: 2010926694
CR Subject Classification (1998): F.2, H.4, D.2, I.2, H.3, G.2
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-642-13135-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-13135-6 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
It is our great pleasure to present the proceedings of the symposia and workshops on parallel and distributed computing and applications associated with the ICA3PP 2010 conference. These symposia and workshops provide vibrant opportunities for researchers and industry practitioners to share their research experience, original research results and practical development experiences in the new challenging research areas of parallel and distributed computing technologies and applications. This was the first time that the ICA3PP conference series added symposia and workshops to its program in order to provide a wide range of topics that extend beyond the main conference. The goal was to provide better coverage of emerging research areas and also forums for focused and stimulating discussions. With this objective in mind, we selected three workshops to accompany the ICA3PP 2010 conference:

• FPDC 2010, the 2010 International Symposium on Frontiers of Parallel and Distributed Computing
• HPCTA 2010, the 2010 International Workshop on High-Performance Computing, Technologies and Applications
• M2A2 2010, the 2010 International Workshop on Multicore and Multithreaded Architectures and Algorithms
Each of the symposia / workshops focused on a particular theme and complemented the spectrum of the main conference. All papers published in the workshop proceedings were selected by the Program Committee on the basis of referee reports. Each paper was reviewed by independent referees who judged the papers for originality, quality, contribution, presentation and consistency with the theme of the workshops. We deeply appreciate the tremendous efforts and contributions of the Chairs of the individual symposia / workshops. Our thanks also go to all authors for their valuable contributions and to all the Program Committee members and reviewers for providing timely and in-depth reviews. In particular, we thank the Local Arrangements Committee for exceptionally nice arrangements. We hope you will enjoy the proceedings.
May 2010
Laurence T. Yang Jong Hyuk Park J. Daniel Garcia Ching-Hsien (Robert) Hsu Alfredo Cuzzocrea Xiaojun Cao Kuo-Chan Huang Yu Liang
FPDC 2010 Foreword
We would like to welcome you to the proceedings of the 2010 International Symposium on Frontiers of Parallel and Distributed Computing (FPDC 2010), held in Busan, Korea, May 21–23, 2010. The FPDC 2010 symposium was intended to bring together researchers from industry and academia, practitioners, scientists and engineers to discuss novel and innovative research activities, on-going research efforts, and emerging parallel/distributed computing technologies and applications. Each paper in the FPDC 2010 symposium was reviewed by at least three Technical Program Committee members of the ICA3PP 2010 conference. After the reviewing process, 29 papers of high quality were selected from 110 submissions for presentation and publication in the FPDC symposium, an acceptance rate of 26%. The selected papers cover various topics in parallel and distributed computing systems and technologies, with focus on the following areas:

- Parallel Programming and Multicore Technologies
- Grid / Cluster Computing
- Parallel Algorithms and Architectures
- Bioinformatics and Application
- Mobile Computing and Web Services
- Distributed Operating Systems and P2P Computing
- Fault-Tolerant and Information Security
Many individuals contributed to the success of this symposium directly or indirectly. First of all, the symposium Program Co-chairs would like to thank the symposium General Chairs, Laurence T. Yang and Jong Hyuk Park, for their excellent guidance and continuous support. We are very grateful to the ICA3PP 2010 General Chair and Program Chair, Laurence T. Yang and Robert C. Hsu, who helped us in selecting papers for this symposium. Last but not least, we would like to thank all authors for accepting our invitation to publish their papers in this symposium. We hope you will enjoy the proceedings.
May 2010
Laurence T. Yang Jong Hyuk Park Ching-Hsien (Robert) Hsu Sang-Soo Yeo
HPCTA 2010 Foreword
It gives us great pleasure to introduce this collection of papers that were presented at the 2010 International Workshop on High-Performance Computing Technologies and Applications (HPCTA 2010), May 21–23, 2010, at the Busan Lotte Hotel, Busan, Korea. The Program Committee received 23 submissions, from which it selected 12 for presentation and publication. Each paper was evaluated by three referees. Technical quality, originality, relevance, and clarity were the primary criteria for selection. We wish to thank all who submitted manuscripts for consideration. We also wish to thank the members of the HPCTA 2010 Program Committee who reviewed all of the submissions.
Whey Fone Tsai Hsi-Ya Chang Ching-Hsien Hsu Kuo-Chan Huang
M2A2 2010 Foreword
It is with great pleasure that we present the proceedings of the 2010 International Workshop on Multicore and Multithreaded Architectures and Algorithms (M2A2 2010), held in conjunction with the 10th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2010) in Busan, Korea. In recent years, multicore systems have come to dominate the processor market, and it is expected that the number of cores will continue to increase in most commercial systems, such as high-performance, desktop, or embedded systems. This trend is driven by the need to increase the efficiency of the major system components, that is, the cores, the memory hierarchy, and the interconnection network. For this purpose, the system designer must trade off performance against power consumption, which is a major concern in current microprocessors, and therefore new architectures or architectural mechanisms addressing this trade-off are required. In this context, load balancing and scheduling can help to improve energy savings. In addition, it remains a challenge to identify and productively program applications for these architectures with a resulting substantial performance improvement. The M2A2 2010 workshop provided a forum for engineers and scientists to address these challenges and to present new ideas, applications, and experience on all aspects of multicore and multithreaded systems. This year, and because of the high quality of the submitted papers, only about 40% of the papers were accepted for the workshop. We would like to express our most sincere appreciation to everyone contributing to the success of this workshop. First, we thank the authors of the submitted papers for their efforts in their research work. Then, we thank the TPC members and the reviewers for their invaluable and constructive comments. Finally, we thank our sponsors for their support of this workshop.
Houcine Hassan Julio Sahuquillo
Reviewers
FPDC 2010

Jemal Abawajy, Deakin University, Australia
Ahmad S. Al-Mogren, Al Yamamah University, Saudi Arabia
Hüseyin Akcan, Izmir University of Economics, Turkey
Giuseppe Amato, ISTI-CNR, Italy
Cosimo Anglano, Università del Piemonte Orientale, Italy
Alagan Anpalagan, Ryerson University, Canada
Amnon Barak, The Hebrew University of Jerusalem, Israel
Novella Bartolini, University of Rome "La Sapienza", Italy
Alessio Bechini, University of Pisa, Italy
Ladjel Bellatreche, ENSMA, France
Ateet Bhalla, Technocrats Institute of Technology, India
Taisuke Boku, University of Tsukuba, Japan
Angelo Brayner, University of Fortaleza, Brazil
Massimo Cafaro, University of Salento, Lecce, Italy
Mario Cannataro, University "Magna Græcia" of Catanzaro, Italy
Jiannong Cao, Hong Kong Polytechnic University, Hong Kong
Andre C.P.L.F. de Carvalho, Universidade de Sao Paulo, Brazil
Denis Caromel, University of Nice Sophia Antipolis-INRIA-CNRS-IUF, France
Tania Cerquitelli, Politecnico di Torino, Italy
Hangbae Chang, Daejin University, Korea
Ruay-Shiung Chang, National Dong Hwa University, Taiwan
Yue-Shan Chang, National Taipei University, Taiwan
Jinjun Chen, Swinburne University of Technology, Australia
Tzung-Shi Chen, National University of Tainan, Taiwan
Zizhong Chen, Colorado School of Mines, USA
Allen C. Cheng, University of Pittsburgh, USA
Francis Chin, University of Hong Kong, Hong Kong
Michele Colajanni, Università di Modena e Reggio Emilia, Italy
Carmela Comito, University of Calabria, Italy
Raphaël Couturier, University of Franche-Comté, France
Mieso Denko, University of Guelph, Canada
Bronis R. de Supinski, Lawrence Livermore National Laboratory, USA
Julius Dichter, University of Bridgeport, USA
Der-Rong Din, National Changhua University of Education, Taiwan
Susan K. Donohue, The College of New Jersey, USA
Shantanu Dutt, University of Illinois at Chicago, USA
Todd Eavis, Concordia University, Canada
Giuditta Franco, University of Verona, Italy
Karl Fuerlinger, University of California, Berkeley, USA
Jerry Zeyu Gao, San Jose State University, USA
Jinzhu Gao, University of the Pacific, Stockton, CA, USA
Irene Garrigós, University of Alicante, Spain
Amol Ghoting, IBM T. J. Watson Research Center, USA
Harald Gjermundrod, University of Nicosia, Cyprus
Janice Gu, Auburn University, USA
Hyoil Han, Drexel University, USA
Houcine Hassan, Universidad Politécnica de Valencia, Spain
Pilar Herrero, Universidad Politécnica de Madrid, Spain
Michael Hobbs, Deakin University, Australia
JoAnne Holliday, Santa Clara University, USA
Ching-Hsien Hsu, Chung Hua University, Taiwan
Tsung-Chuan Huang, National Sun Yat-sen University, Taiwan
Yo-Ping Huang, National Taipei University of Technology, Taiwan
Young-Sik Jeong, Wonkwang University, Korea
Qun Jin, Waseda University, Japan
Xiaolong Jin, University of Bradford, UK
Soo-Kyun Kim, PaiChai University, Korea
Jongsung Kim, Kyungnam University, Korea
Dan Komosny, Brno University of Technology, Czech Republic
Gregor von Laszewski, Rochester Institute of Technology, USA
Changhoon Lee, Hanshin University, Korea
Deok Gyu Lee, ETRI, Korea
Yang Sun Lee, Chosun University, Korea
Laurent Lefevre, INRIA, University of Lyon, France
Casiano Rodriguez Leon, Universidad de La Laguna, Spain
Daniele Lezzi, Barcelona Supercomputing Center, Spain
Jikai Li, The College of New Jersey, USA
Keqin Li, State University of New York, USA
Keqin Li, SAP Research, France
Keqiu Li, Dalian University of Technology, China
Minglu Li, Shanghai Jiaotong University, China
Xiaofei Liao, Huazhong University of Science and Technology, China
Kai Lin, Dalian University of Technology, China
Jianxun Liu, Hunan University of Science and Technology, China
Pangfeng Liu, National Taiwan University, Taiwan
Alexandros V. Gerbessiotis, New Jersey Institute of Technology, USA
Yan Gu, Auburn University, USA
Hai Jiang, Arkansas State University, USA
George Karypis, University of Minnesota, USA
Eun Jung Kim, Texas A&M University, USA
Minseok Kwon, Rochester Institute of Technology, USA
Yannis Manolopoulos, Aristotle University of Thessaloniki, Greece
Alberto Marchetti-Spaccamela, Sapienza University of Rome, Italy
Toma Margalef, Universitat Autonoma de Barcelona, Spain
María J. Martín, University of A Coruña, Spain
Michael May, Fraunhofer Institute for Intelligent Systems, Germany
Eduard Mehofer, University of Vienna, Austria
Rodrigo Fernandes de Mello, University of Sao Paulo, Brazil
Peter M. Musial, University of Puerto Rico, USA
Amiya Nayak, University of Ottawa, Canada
Leandro Navarro, Polytechnic University of Catalonia, Spain
Andrea Nucita, University of Messina, Italy
Leonardo B. Oliveira, Universidade Estadual de Campinas, Brazil
Salvatore Orlando, Ca' Foscari University of Venice, Italy
Marion Oswald, Hungarian Academy of Sciences, Budapest, Hungary
Apostolos Papadopoulos, Aristotle University of Thessaloniki, Greece
George A. Papadopoulos, University of Cyprus, Cyprus
Deng Pan, Florida International University, USA
Al-Sakib Khan Pathan, BRAC University, Bangladesh
Dana Petcu, West University of Timisoara, Romania
Rubem Pereira, Liverpool John Moores University, UK
María S. Pérez, Universidad Politécnica de Madrid, Spain
Kleanthis Psarris, The University of Texas at San Antonio, USA
Pedro Pereira Rodrigues, University of Porto, Portugal
Marcel-Catalin Rosu, IBM, USA
Paul M. Ruth, The University of Mississippi, USA
Giovanni Maria Sacco, Università di Torino, Italy
Lorenza Saitta, Università del Piemonte Orientale, Italy
Frode Eika Sandnes, Oslo University College, Norway
Claudio Sartori, University of Bologna, Italy
Erich Schikuta, University of Vienna, Austria
Martin Schulz, Lawrence Livermore National Laboratory, USA
Seetharami R. Seelam, IBM T.J. Watson Research Center, USA
Edwin Sha, The University of Texas at Dallas, USA
Rahul Shah, Louisiana State University, USA
Giandomenico Spezzano, ICAR-CNR, Italy
Peter Strazdins, The Australian National University, Australia
Domenico Talia, Università della Calabria, Italy
Uwe Tangen, Ruhr-Universität Bochum, Germany
David Taniar, Monash University, Australia
Christopher M. Taylor, University of New Orleans, USA
Parimala Thulasiraman, University of Manitoba, Canada
A. Min Tjoa, Vienna University of Technology, Austria
Paolo Trunfio, University of Calabria, Italy
Jichiang Tsai, National Chung Hsing University, Taiwan
Emmanuel Udoh, Indiana University-Purdue University, USA
Gennaro Della Vecchia, ICAR-CNR, Italy
Lizhe Wang, Indiana University, USA
Max Walter, Technische Universität München, Germany
Cho-Li Wang, The University of Hong Kong, China
Guojun Wang, Central South University, China
Xiaofang Wang, Villanova University, USA
Chen Wang, CSIRO ICT Centre, Australia
Chuan Wu, The University of Hong Kong, China
Qishi Wu, University of Memphis, USA
Yulei Wu, University of Bradford, UK
Fatos Xhafa, University of London, UK
Yang Xiang, Central Queensland University, Australia
Chunsheng Xin, Norfolk State University, USA
Neal Naixue Xiong, Georgia State University, USA
Zheng Yan, Nokia Research Center, Finland
Sang-Soo Yeo, Mokwon University, Korea
Eiko Yoneki, University of Cambridge, UK
Chao-Tung Yang, Tunghai University, Taiwan
Zhiwen Yu, Northwestern Polytechnical University, China
Wuu Yang, National Chiao Tung University, Taiwan
Jiehan Zhou, University of Oulu, Finland
Sotirios G. Ziavras, NJIT, USA
Roger Zimmermann, National University of Singapore, Singapore
HPCTA 2010

Hamid R. Arabnia, USA
Rajkumar Buyya, Australia
Jee-Gong Chang, Taiwan
Ruay-Shiung Chang, Taiwan
Yue-Shan Chang, Taiwan
Wenguang Chen, China
Khoo Boo Cheong, Singapore
Yeh-Ching Chung, Taiwan
Chang-Huain Hsieh, Taiwan
James J.Y. Hsu, Taiwan
Suntae Hwang, Korea
Hae-Duck Joshua Jeong, Korea
Jyh-Chiang Jiang, Taiwan
Hai Jin, China
Pierre Kestener, France
Chung-Ta King, Taiwan
Jong-Suk Ruth Lee, Korea
Ming-Hsien Lee, Taiwan
Weiping Li, China
Kuan-Ching Li, Taiwan
Chao-An Lin, Taiwan
Fang-Pang Lin, Taiwan
Pangfeng Liu, Taiwan
Carlos R. Mechoso, USA
Rodrigo Mello, Brazil
Nikolay Mirenkov, Japan
Chien-Hua Pao, Taiwan
Depei Qian, China
Gudula Ruenger, Germany
Cherng-Yeu Shen, Taiwan
Tony Wen-Hann Sheu, Taiwan
Michael J. Tsai, Taiwan
Cho-Li Wang, Hong Kong
Jong-Sinn Wu, Taiwan
Yongwei Wu, China
Chao-Tung Yang, Taiwan
Jaw-Yen Yang, Taiwan
Chih-Min Yao, Taiwan
Weimin Zheng, China
Albert Y. Zomaya, Australia
M2A2 2010

Hideharu Amano, Japan
Hamid R. Arabnia, USA
Luca Benini, Italy
Luis Gomes, Portugal
Zonghua Gu, Hong Kong
Rajiv Gupta, USA
Houcine Hassan, Spain
Seongsoo Hong, Korea
Shih-Hao Hung, Taiwan
Eugene John, USA
Seon Wook Kim, Korea
Jihong Kim, Korea
Chang-Gun Lee, Korea
Yoshimasa Nakamura, Japan
Hiroshi Nakashima, Japan
Sabri Pllana, Austria
Julio Sahuquillo, Spain
Zili Shao, Hong Kong
Kenjiro Taura, Japan
Sami Yehia, France
Table of Contents – Part II
The 2010 International Symposium on Frontiers of Parallel and Distributed Computing (FPDC 2010)
Parallel Programming and Multi-core Technologies
Efficient Grid on the OTIS-Arrangement Network . . . . . . . . . . . . . . . . . . . . . Ahmad Awwad, Bassam Haddad, and Ahmad Kayed
Single Thread Program Parallelism with Dataflow Abstracting Thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tianzhou Chen, Xingsheng Tang, Jianliang Ma, Lihan Ju, Guanjun Jiang, and Qingsong Shi
1
11
Parallel Programming on a Soft-Core Based Multi-core System . . . . . . . . Liang-Teh Lee, Shin-Tsung Lee, and Ching-Wei Chen
22
Dynamic Resource Tuning for Flexible Core Chip Multiprocessors . . . . . . Yongqing Ren, Hong An, Tao Sun, Ming Cong, and Yaobin Wang
32
Ensuring Confidentiality and Integrity of Multimedia Data on Multi-core Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eunji Lee, Sungju Lee, Yongwha Chung, Hyeonjoong Cho, and Sung Bum Pan A Paradigm for Processing Network Protocols in Parallel . . . . . . . . . . . . . Ralph Duncan, Peder Jungck, and Kenneth Ross Real-Time Task Scheduling on Heterogeneous Two-Processor Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chin-Fu Kuo and Ying-Chi Hai
42
52
68
Grid/Cluster Computing A Grid Based System for Closure Computation and Online Service . . . . . Wing-Ning Li, Donald Hayes, Jonathan Baran, Cameron Porter, and Tom Schweiger
79
A Multiple Grid Resource Broker with Monitoring and Information Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chao-Tung Yang, Wen-Jen Hu, and Bo-Han Chen
90
Design Methodologies of Workload Management through Code Migration in Distributed Desktop Computing Grids . . . . . . . . . . . . . . . . . . Makoto Yoshida and Kazumine Kojima
100
Dynamic Dependent Tasks Assignment for Grid Computing . . . . . . . . . . . Meddeber Meriem and Yagoubi Belabbas
112
Implementation of a Heuristic Network Bandwidth Measurement for Grid Computing Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chao-Tung Yang, Chih-Hao Lin, and Wen-Jen Hu
121
Parallel Algorithms, Architectures and Applications
An Efficient Circuit-Switched Broadcasting in Star Graph . . . . . . . . . . . . Cheng-Ta Lee and Yeong-Sung Lin
Parallel Domain Decomposition Methods for High-Order Finite Element Solutions of the Helmholtz Problem . . . . . . . . . . . . . . . . . . . . . . . . Youngjoon Cha and Seongjai Kim
Self-Organizing Neural Grove and Its Distributed Performance . . . . . . . . Hirotaka Inoue
A Massively Parallel Hardware for Modular Exponentiations Using the m-ary Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marcos Santana Farias, Sérgio de Souza Raposo, Nadia Nedjah, and Luiza de Macedo Mourelle
Emulation of Object-Based Storage Devices by a Virtual Machine . . . . . . Yi-Chiun Fang, Chien-Kai Tseng, and Yarsun Hsu
Balanced Multi-process Parallel Algorithm for Chemical Compound Inference with Given Path Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jiayi Zhou, Kun-Ming Yu, Chun Yuan Lin, Kuei-Chung Shih, and Chuan Yi Tang
Harnessing Clusters for High Performance Computation of Gene Expression Microarray Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . Philip Church, Adam Wong, Andrzej Goscinski, and Christophe Lefèvre
131
136 146
156
166
178
188
Mobile Computing/Web Services Semantic Access Control for Corporate Mobile Devices . . . . . . . . . . . . . . . Tuncay Ercan and Mehmet Yıldız A New Visual Simulation Tool for Performance Evaluation of MANET Routing Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Md. Sabbir Rahman Sakib, Nazmus Saquib, and Al-Sakib Khan Pathan A Web Service Composition Algorithm Based on Global QoS Optimizing with MOCACO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wang Li and He Yan-xiang
198
208
218
Distributed Operating System/P2P Computing Experiences Gained from Building a Services-Based Distributed Operating System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrzej Goscinski and Michael Hobbs
225
Quick Forwarding of Queries to Relevant Peers in a Hierarchical P2P File Search System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tingting Qin, Qi Cao, Qiying Wei, and Satoshi Fujita
235
iCTPH: An Approach to Publish and Lookup CTPH Digests in Chord . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhang Jianzhong, Pan Kai, Yu Yuntao, and Xu Jingdong
244
Fault-Tolerant and Information Security Toward a Framework for Cloud Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Brock and Andrzej Goscinski
254
Cluster-Fault-Tolerant Routing in Burnt Pancake Graphs . . . . . . . . . . . . . Nagateru Iwasawa, Tatsuro Watanabe, Tatsuya Iwasaki, and Keiichi Kaneko
264
Edge-Bipancyclicity of All Conditionally Faulty Hypercubes . . . . . . . . . . . Chao-Ming Sun and Yue-Dar Jou
275
The 2010 International Workshop on High Performance Computing Technologies and Applications (HPCTA 2010)
Session I
Accelerating Euler Equations Numerical Solver on Graphics Processing Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pierre Kestener, Frédéric Château, and Romain Teyssier
281
An Improved Parallel MEMS Processing-Level Simulation Implementation Using Graphic Processing Unit . . . . . . . . . . . . . . . . . . . . . . Yupeng Guo, Xiaoguang Liu, Gang Wang, Fan Zhang, and Xin Zhao
289
Solving Burgers’ Equation Using Multithreading and GPU . . . . . . . . . . . . Sheng-Hsiu Kuo, Chih-Wei Hsieh, Reui-Kuo Lin, and Wen-Hann Sheu
297
Support for OpenMP Tasks on Cell Architecture . . . . . . . . . . . . . . . . . . . . . Qian Cao, Changjun Hu, Haohu He, Xiang Huang, and Shigang Li
308
Session II A Novel Algorithm for Faults Acquiring and Locating on Fiber Optic Cable Line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ning Zhang, Yan Chen, Naixue Xiong, Laurence T. Yang, Dong Liu, and Yuyuan Zhang A Parallel Distributed Algorithm for the Permutation Flow Shop Scheduling Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Samia Kouki, Talel Ladhari, and Mohamed Jemni A Self-Adaptive Load Balancing Strategy for P2P Grids . . . . . . . . . . . . . . Po-Jung Huang, You-Fu Yu, Quan-Jie Chen, Tian-Liang Huang, Kuan-Chou Lai, and Kuan-Ching Li Embedding Algorithms for Star, Bubble-Sort, Rotator-Faber-Moore, and Pancake Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mihye Kim, Dongwan Kim, and Hyeongok Lee
318
328 338
348
Session III Performance Estimation of Generalized Statistical Smoothing to Inverse Halftoning Based on the MTF Function of Human Eyes . . . . . . . . . . . . . . . Yohei Saika, Kouki Sugimoto, and Ken Okamoto
358
Power Improvement Using Block-Based Loop Buffer with Innermost Loop Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ming-Yuan Zhong and Jong-Jiann Shieh
368
An Efficient Pipelined Architecture for Fast Competitive Learning . . . . . Hui-Ya Li, Chia-Lung Hung, and Wen-Jyi Hwang
381
Merging Data Records on EREW PRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . Hazem M. Bahig
391
The 2010 International Workshop on Multicore and Multithreaded Architecture and Algorithms (M2A2 2010) Session I Performance Modeling of Multishift QR Algorithms for the Parallel Solution of Symmetric Tridiagonal Eigenvalue Problems . . . . . . . . . . . . . . Takafumi Miyata, Yusaku Yamamoto, and Shao-Liang Zhang
401
A Parallel Solution of Large-Scale Heat Equation Based on Distributed Memory Hierarchy System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tangpei Cheng, Qun Wang, Xiaohui Ji, and Dandan Li
413
A New Metric for On-line Scheduling and Placement in Reconfigurable Computing Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maisam Mansub Bassiri and Hadi Shahriar Shahhoseini
422
Session II Test Data Compression Using Four-Coded and Sparse Storage for Testing Embedded Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhang Ling, Kuang Ji-shun, and You zhi-qiang Extending a Multicore Multithread Simulator to Model Power-Aware Hard Real-Time Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jos´e Luis March, Julio Sahuquillo, Houcine Hassan, Salvador Petit, and Jos´e Duato
434
444
Real-Time Linux Framework for Designing Parallel Mobile Robotic Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Joan Aracil, Carlos Dom´ınguez, Houcine Hassan, and Alfons Crespo
454
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
465
Table of Contents – Part I
Keynote Papers Efficient Web Browsing with Perfect Anonymity Using Page Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shui Yu, Theerasak Thapngam, Su Wei, and Wanlei Zhou
1
InterCloud: Utility-Oriented Federation of Cloud Computing Environments for Scaling of Application Services . . . . . . . . . . . . . . . . . . . . . Rajkumar Buyya, Rajiv Ranjan, and Rodrigo N. Calheiros
13
Parallel Algorithms Scalable Co-clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bongjune Kwon and Hyuk Cho
32
Parallel Pattern Matching with Swaps on a Linear Array . . . . . . . . . . . . . . Fouad B. Chedid
44
Parallel Prefix Computation in the Recursive Dual-Net . . . . . . . . . . . . . . . Yamin Li, Shietung Peng, and Wanming Chu
54
A Two-Phase Differential Synchronization Algorithm for Remote Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yonghong Sheng, Dan Xu, and Dongsheng Wang
65
A New Parallel Method of Smith-Waterman Algorithm on a Heterogeneous Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bo Chen, Yun Xu, Jiaoyun Yang, and Haitao Jiang
79
Improved Genetic Algorithm for Minimizing Periodic Preventive Maintenance Costs in Series-Parallel Systems . . . . . . . . . . . . . . . . . . . . . . . . Chung-Ho Wang and Te-Wei Lin
91
A New Hybrid Parallel Algorithm for MrBayes . . . . . . . . . . . . . . . . . . . . . . Jianfu Zhou, Gang Wang, and Xiaoguang Liu Research and Design of Deployment Framework for Blade-Based Data Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Haiping Qu, Xiuwen Wang, Lu Xu, Jiangang Zhang, and Xiaoming Han Query Optimization over Parallel Relational Data Warehouses in Distributed Environments by Simultaneous Fragmentation and Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ladjel Bellatreche, Alfredo Cuzzocrea, and Soumia Benkrid
102
113
124
Parallel Architectures Function Units Sharing between Neighbor Cores in CMP . . . . . . . . . . . . . . Tianzhou Chen, Jianliang Ma, Hui Yuan, Jingwei Liu, and Guanjun Jiang
136
A High Efficient On-Chip Interconnection Network in SIMD CMPs . . . . . Dan Wu, Kui Dai, Xuecheng Zou, Jinli Rao, and Pan Chen
149
Network-on-Chip Routing Algorithms by Breaking Cycles . . . . . . . . . . . . . Minghua Tang and Xiaola Lin
163
A Fair Thread-Aware Memory Scheduling Algorithm for Chip Multiprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Danfeng Zhu, Rui Wang, Hui Wang, Depei Qian, Zhongzhi Luan, and Tianshu Chu
174
Efficient Partitioning of Static Buses for Processor Arrays of Small Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Susumu Matsumae
186
Formal Proof for a General Architecture of Hybrid Prefix/Carry-Select Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Feng Liu, Qingping Tan, Xiaoyu Song, and Gang Chen
193
An Efficient Non-Blocking Multithreaded Embedded System . . . . . . . . . . Joseph M. Arul, Tsung-Yun Chen, Guan-Jie Hwang, Hua-Yuan Chung, Fu-Jiun Lin, and You-Jen Lee A Remote Mirroring Architecture with Adaptively Cooperative Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yongzhi Song, Zhenhai Zhao, Bing Liu, Tingting Qin, Gang Wang, and Xiaoguang Liu
205
215
SV: Enhancing SIMD Architectures via Combined SIMD-Vector Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Libo Huang and Zhiying Wang
226
A Correlation-Aware Prefetching Strategy for Object-Based File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Julei Sui, Jiancong Tong, Gang Wang, and Xiaoguang Liu
236
An Auxiliary Storage Subsystem to Distributed Computing Systems for External Storage Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . MinHwan Ok
246
Grid/Cluster Computing Checkpointing and Migration of Communication Channels in Heterogeneous Grid Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . John Mehnert-Spahn and Michael Schoettner On-Line Task Granularity Adaptation for Dynamic Grid Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nithiapidary Muthuvelu, Ian Chai, Eswaran Chikkannan, and Rajkumar Buyya
254
266
Message Clustering Technique towards Efficient Irregular Data Redistribution in Clusters and Grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shih-Chang Chen, Tai-Lung Chen, and Ching-Hsien Hsu
278
Multithreading of Kostka Numbers Computation for the BonjourGrid Meta-desktop Grid Middleware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Heithem Abbes, Franck Butelle, and Christophe C´erin
287
Adaptable Scheduling Algorithm for Grids with Resource Redeployment Capability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cho-Chin Lin and Chih-Hsuan Hsu
299
Using MPI on PC Cluster to Compute Eigenvalues of Hermitian Toeplitz Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fazal Noor and Syed Misbahuddin
313
Cloud Computing/Virtualization Techniques idsocket: API for Inter-domain Communications Base on Xen . . . . . . . . . . Liang Zhang, Yuein Bai, and Cheng Luo Strategy-Proof Dynamic Resource Pricing of Multiple Resource Types on Federated Clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marian Mihailescu and Yong Meng Teo
324
337
Adapting Market-Oriented Scheduling Policies for Cloud Computing . . . Mohsen Amini Salehi and Rajkumar Buyya
351
A High Performance Inter-VM Network Communication Mechanism . . . . Yuebin Bai, Cheng Luo, Cong Xu, Liang Zhang, and Huiyong Zhang
363
On the Effect of Using Third-Party Clouds for Maximizing Profit . . . . . . Young Choon Lee, Chen Wang, Javid Taheri, Albert Y. Zomaya, and Bing Bing Zhou
381
A Tracing Approach to Process Migration for Virtual Machine Based on Multicore Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Liang Zhang, Yuebin Bai, and Xin Wei
391
GPU Computing and Applications Accelerating Dock6’s Amber Scoring with Graphic Processing Unit . . . . . Hailong Yang, Bo Li, Yongjian Wang, Zhongzhi Luan, Depei Qian, and Tianshu Chu
404
Optimizing Sweep3D for Graphic Processor Unit . . . . . . . . . . . . . . . . . . . . . Chunye Gong, Jie Liu, Zhenghu Gong, Jin Qin, and Jing Xie
416
Modular Resultant Algorithm for Graphics Processors . . . . . . . . . . . . . . . . Pavel Emeliyanenko
427
A Novel Scheme for High Performance Finite-Difference Time-Domain (FDTD) Computations Based on GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tianshu Chu, Jian Dai, Depei Qian, Weiwei Fang, and Yi Liu
441
Parallel Programming, Performance Evaluation A Proposed Asynchronous Object Load Balancing Method for Parallel 3D Image Reconstruction Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jose Antonio Alvarez-Bermejo and Javier Roca-Piera
454
A Step-by-Step Extending Parallelism Approach for Enumeration of Combinatorial Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hien Phan, Ben Soh, and Man Nguyen
463
A Study of Performance Scalability by Parallelizing Loop Iterations on Multi-core SMPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Prakash Raghavendra, Akshay Kumar Behki, K. Hariprasad, Madhav Mohan, Praveen Jain, Srivatsa S. Bhat, V.M. Thejus, and Vishnumurthy Prabhu Impact of Multimedia Extensions for Different Processing Element Granularities on an Embedded Imaging System . . . . . . . . . . . . . . . . . . . . . Jong-Myon Kim
476
487
Fault-Tolerant/Information Security and Management Reducing False Aborts in STM Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel Nic´ acio and Guido Ara´ ujo
499
Fault-Tolerant Node-to-Set Disjoint-Path Routing in Hypercubes . . . . . . Antoine Bossard, Keiichi Kaneko, and Shietung Peng
511
AirScope: A Micro-scale Urban Air Quality Management System . . . . . . . Jung-Hun Woo, HyungSeok Kim, Sang Boem Lim, Jae-Jin Kim, Jonghyun Lee, Rina Ryoo, and Hansoo Kim
520
Wireless Communication Network Design of a Slot Assignment Scheme for Link Error Distribution on Wireless Grid Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Junghoon Lee, Seong Baeg Kim, and Mikyung Kang
528
Wireless Bluetooth Communications Combine with Secure Data Transmission Using ECDH and Conference Key Agreements . . . . . . . . . . . Hua-Yi Lin and Tzu-Chiang Chiang
538
Robust Multicast Scheme for Wireless Process Control on Traffic Light Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Junghoon Lee, Gyung-Leen Park, Seong-Baeg Kim, Min-Jae Kang, and Mikyung Kang
549
A Note-Based Randomized and Distributed Protocol for Detecting Node Replication Attacks in Wireless Sensor Networks . . . . . . . . . . . . . . . . Xiangshan Meng, Kai Lin, and Keqiu Li
559
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
571
Efficient Grid on the OTIS-Arrangement Network

Ahmad Awwad¹, Bassam Haddad², and Ahmad Kayed¹

¹ Department of Computer Science, Faculty of Computing, Fahad Bin Sultan University, Tabuk, Saudi Arabia
[email protected], [email protected]
² Department of Computer Science, Faculty of Information Technology, University of Petra, Amman, Jordan
[email protected]
Abstract. Many recent studies have revealed that the Optical Transpose Interconnection Systems (OTIS) are promising candidates for future high-performance parallel computers. In this paper, we present and evaluate a general method for algorithm development on the OTIS-Arrangement network (OTIS-AN), as an example of an OTIS network. The proposed method could be used and customized for any other OTIS network. Furthermore, it allows efficient mapping of a wide class of algorithms onto the OTIS-AN. This method is based on grids, a popular structure that supports a vast body of parallel applications including linear algebra, divide-and-conquer algorithms, sorting, and FFT computation. This study confirms the viability of the OTIS-AN as an attractive alternative for large-scale parallel architectures.
1 Introduction

The choice of network topology for parallel systems is a critical design decision that involves inherent trade-offs in terms of efficient algorithm support and network implementation cost. For instance, networks with large bisection width allow fast and reliable communication. However, such networks are difficult to implement using today's electronic technologies, which are two-dimensional in nature [19]. In principle, free-space optical technologies offer several fronts on which to improve this trade-off; improved transmission rates, dense interconnects, power consumption, and signal interference are a few examples [1, 2, 6, 7, 10, 13, 14]. In this paper, we focus on the Optical Transpose Interconnection System Arrangement Network (OTIS-AN), proposed by Al-Sadi [21], which can be easily implemented using free-space optoelectronic technologies [1]. In this model, processors are partitioned into groups, where each group is realized on a separate chip with electronic inter-processor connects. Processors on separate chips are interconnected through free-space interconnects. The philosophy behind this separation is to utilize the benefits of both the optical and the electronic technologies. The advantage of using OTIS as an optoelectronic architecture lies in its ability to exploit the fact that free-space optical communication is superior in terms of
speed and power consumption when the connection distance exceeds a few millimetres [6]. In the OTIS-AN, shorter (intra-chip) communication is realized by electronic interconnects, while longer (inter-chip) communication is realized by free-space interconnects. Extensive modelling results for the OTIS have been reported in [7]. The achievable terabit throughput at a reasonable cost makes the OTIS-AN a strong competitor to its factor network [1, 6, 7, 11, 12]. These encouraging findings prompt the need for further testing of the suitability of the OTIS-AN for real-life applications. A number of recent studies have been conducted in this direction [3, 4, 5, 8, 15, 17, 18]. Sahni and Wang [3, 4] have presented and evaluated various algorithms on OTIS networks, such as basic data rearrangements, routing, selection and sorting. They have also developed algorithms for various matrix multiplication operations [18] and image processing [17]. Zane et al. [20] have shown that the OTIS-mesh efficiently embeds four-dimensional meshes and hypercubes. Aside from the above-mentioned works, the study of algorithms on the OTIS is yet to mature [16]. In this paper we contribute towards filling this gap by presenting a method for developing algorithms on the OTIS-AN. This method is based on the grid as a popular structure that supports a vast body of applications ranging from linear algebra to divide-and-conquer algorithms, sorting, and FFT computation. The proposed method is discussed in the sequel, but first we give the necessary definitions and notation.
2 Preliminary Notations and Definitions

Let n and k be two integers satisfying 1 ≤ k ≤ n−1, and let us denote 〈n〉 = {1, 2, …, n} and 〈k〉 = {1, 2, …, k}. Let Pkn be the set of arrangements of k elements taken out of the n elements of 〈n〉. The k elements of an arrangement p are denoted p1, p2, …, pk.

Definition 1 (Arrangement Graph)
The (n,k)-arrangement graph An,k = (V, E) is an undirected graph given by:

V = {p1 p2 … pk | pi in 〈n〉 and pi ≠ pj for i ≠ j} = Pkn,  (1)

E = {(p, q) | p and q in V and, for some i in 〈k〉, pi ≠ qi and pj = qj for j ≠ i}.  (2)

That is, the nodes of An,k are the arrangements of k elements out of the n elements of 〈n〉, and the edges of An,k connect arrangements which differ in exactly one of their k positions. For example, in A5,2 the node p = 23 is connected to the nodes 21, 24, 25, 13, 43, and 53. An edge of An,k connecting two arrangements p and q which differ only in one position i is called an i-edge; in this case, q is called the (i, qi)-neighbour of p. An,k is therefore a regular graph with degree k(n−k) and n!/(n−k)! nodes. As an example of this network, Figure 1 shows the arrangement graph A4,2, which has 12 nodes and a symmetric degree of 4.
OTIS-networks are basically constructed by "multiplying" a known topology by itself: the set of vertices is equal to the Cartesian product of the set of vertices in the factor network with itself, and the set of edges consists of edges from the factor network and new edges called the transpose edges. The formal definition of OTIS-networks is given below.
Fig. 1. The arrangement graph A4,2
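To make Definition 1 and the figure concrete, here is a short, illustrative Python sketch (not part of the original paper) that generates An,k by brute force and verifies the size and degree stated above for A4,2:

```python
from itertools import permutations

def arrangement_graph(n, k):
    """Build A_{n,k}: nodes are arrangements of k elements out of
    {1,...,n}; two nodes are adjacent iff they differ in exactly
    one position (an i-edge)."""
    nodes = list(permutations(range(1, n + 1), k))
    edges = set()
    for a in nodes:
        for b in nodes:
            if sum(1 for i in range(k) if a[i] != b[i]) == 1:
                edges.add(frozenset((a, b)))
    return nodes, edges

nodes, edges = arrangement_graph(4, 2)
assert len(nodes) == 12                      # n!/(n-k)! = 4!/2! = 12
assert len(edges) == 12 * 2 * (4 - 2) // 2   # regular of degree k(n-k) = 4
```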
Definition 2 (OTIS-Network)
Let G0 = (V0, E0) be an undirected graph representing a factor network. The OTIS-G0 = (V, E) network is the undirected graph obtained from G0 as follows: V = {〈x, y〉 | x, y ∈ V0} and E = {(〈x, y〉, 〈x, z〉) | (y, z) ∈ E0} ∪ {(〈x, y〉, 〈y, x〉) | x, y ∈ V0 and x ≠ y}. The set of edges E in the above definition consists of two subsets: one comes from G0 and contains the G0-type edges, and the other contains the transpose edges. The OTIS-AN approach suggests implementing Arrangement-type edges by electronic links, since they involve short intra-chip links, and implementing transpose edges by free-space optics. Throughout this paper the terms "electronic move" and "OTIS move" (or "optical move") will be used to refer to data transmission based on electronic and optical technologies, respectively.
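Definition 2 likewise admits a direct construction. The sketch below (again illustrative, not from the paper, reusing nodes and edges from the previous snippet) builds OTIS-G and counts its links; the count of 354 links for OTIS-A4,2 recurs in Section 3.

```python
def otis(nodes, edges):
    """Build OTIS-G: nodes are pairs <g, p>; each group g carries a copy
    of the factor edges (electronic), and transpose (optical) edges join
    <g, p> with <p, g> for g != p."""
    onodes = [(g, p) for g in nodes for p in nodes]
    oedges = set()
    for g in nodes:
        for e in edges:
            p, q = tuple(e)
            oedges.add(frozenset(((g, p), (g, q))))      # G0-type edge
    for g in nodes:
        for p in nodes:
            if g != p:
                oedges.add(frozenset(((g, p), (p, g))))  # transpose edge
    return onodes, oedges

onodes, oedges = otis(nodes, edges)
assert len(onodes) == 12 ** 2                          # 144 processors
assert len(oedges) == 12 * len(edges) + 12 * 11 // 2   # 288 + 66 = 354
```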
4
A. Awwad, B. Haddad, and A. Kayed
Definition 3 (Cross Product)
The cross product G = G1 ⊗ G2 of two undirected connected graphs G1 = (V1, E1) and G2 = (V2, E2) is the undirected graph G = (V, E), where V and E are given by: V = {〈x1, x2〉 | x1 ∈ V1 and x2 ∈ V2} and E = {(〈x1, y〉, 〈y1, y〉) | (x1, y1) ∈ E1} ∪ {(〈x, x2〉, 〈x, y2〉) | (x2, y2) ∈ E2}. So for any u = 〈x1, x2〉 and v = 〈y1, y2〉 in V, (u, v) is an edge in E if and only if either (x1, y1) is an edge in E1 and x2 = y2, or (x2, y2) is an edge in E2 and x1 = y1. The edge (u, v) is called a G1-edge if (x1, y1) is an edge in E1, and it is called a G2-edge if (x2, y2) is an edge in E2 [9]. The size, degree, diameter and number of links of the cross product of two networks are defined next.

Definition 4 (Topological properties of cross product networks)
If G1 and G2 are two undirected connected graphs of respective sizes s1 and s2 and respective diameters δ1 and δ2, then [8, 9]:
1) G1 ⊗ G2 is connected.
2) The diameter δ of G1 ⊗ G2 is δ = δ1 + δ2.
3) The size s of G1 ⊗ G2 is given by s = s1·s2.
4) The degree of a node u = 〈x1, x2〉 in G1 ⊗ G2 is equal to the sum of the degrees of the vertices x1 and x2 in G1 and G2, respectively.
5) The number of links of the product network is given by (size·degree)/2.
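As an illustrative check (not from the paper), the following sketch evaluates properties 2–5 for two regular factors; the example numbers use A4,2, which by Definition 1 has 12 nodes, degree 4 and diameter ⌊1.5·2⌋ = 3.

```python
def product_properties(s1, d1, delta1, s2, d2, delta2):
    """Size, degree, diameter and link count of G1 x G2 for
    regular factors, per Definition 4."""
    size = s1 * s2               # property 3
    degree = d1 + d2             # property 4 (uniform for regular factors)
    diameter = delta1 + delta2   # property 2
    links = size * degree // 2   # property 5
    return size, degree, diameter, links

print(product_properties(12, 4, 3, 12, 4, 3))  # (144, 8, 6, 576)
```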
3 Topological Properties of OTIS-AN

This section reviews some of the basic topological properties of the OTIS-Arrangement network, including size, degree, diameter, number of links, and the shortest distance between two nodes [7, 21]. The topological properties of the OTIS-Arrangement network, along with those of the Arrangement network, are discussed below. We will refer to g as the group address and p as the processor address. An inter-group edge of the form (〈g, p〉, 〈p, g〉) represents an optical link and will be referred to as an OTIS or optical move. The following notations are used:

• |An,k| = size of the graph An,k.
• |OTIS-An,k| = size of the graph OTIS-An,k.
• Deg. An,k(p) = degree of the graph An,k at node p.
• Deg. OTIS-An,k(g, p) = degree of the graph OTIS-An,k at node address 〈g, p〉.
• Dist-An,k(p1, p2) = the length of a shortest path between the two nodes p1 and p2 in the Arrangement graph.
• Dist. OTIS-An,k(〈g1, p1〉, 〈g2, p2〉) = the length of a shortest path between the two nodes 〈g1, p1〉 and 〈g2, p2〉 in the OTIS-Arrangement.
In the OTIS-Arrangement the notation 〈g, p〉 is used to refer to the group and processor addresses respectively; Figure 2 shows OTIS-A3,2 as an example. The figure shows that two nodes 〈g1, p1〉 and 〈g2, p2〉 are connected if and only if g1 = g2 and (p1, p2) ∈ E0 (where E0 is the set of edges in the Arrangement network), or g1 = p2 and p1 = g2, in which case the two nodes are connected by a transpose edge. The distance in the OTIS-Arrangement is defined as the length of a shortest path between any two processors 〈g1, p1〉 and 〈g2, p2〉, and such a path takes one of the following forms:

1. When g1 = g2, the path involves only electronic moves from the source node to the destination node.

2. When g1 ≠ g2 and the number of optical moves is even (two or more), the path can be compressed into a shorter path of the form 〈g1, p1〉 —E→ 〈g1, p2〉 —O→ 〈p2, g1〉 —E→ 〈p2, g2〉 —O→ 〈g2, p2〉, where the symbols O and E stand for optical and electronic moves respectively.

3. When g1 ≠ g2 and the path involves an odd number of OTIS moves, the path can be compressed into a shorter path of the form 〈g1, p1〉 —E→ 〈g1, g2〉 —O→ 〈g2, g1〉 —E→ 〈g2, p2〉.

The following are the basic topological properties of the OTIS-Arrangement. The factor Arrangement network has size n!/(n−k)!, degree k(n−k) and diameter ⌊1.5k⌋ [7]. The size, degree, diameter, number of links, and shortest distance of the OTIS-Arrangement network are then as follows:
• Size of OTIS-An,k = (n!/(n−k)!)².
• Degree of OTIS-An,k = Deg.(An,k) if g = p, and Deg.(An,k) + 1 if g ≠ p.
• Diameter of OTIS-An,k = 2⌊1.5k⌋ + 1.
• Number of links: let N0 be the number of links in An,k and let M be the number of nodes in An,k. The number of links in OTIS-An,k is (M² − M)/2 + M·N0, counting one transpose link per unordered pair of distinct groups plus a copy of the N0 factor links in each of the M groups. For instance, the number of links in OTIS-A4,2, which consists of 144 processors, is (12² − 12)/2 + 12·24 = 354.
• Shortest distance of OTIS-An,k: given by Theorem 1 below.
Theorem 1 The length of the shortest path between any two processors 〈g1, p1〉 and 〈g2, p2〉 in OTIS-Arrangement is d(p1, p2) when g1 = g2 and min{d(p1, p2) + d(g1, g2) +2, d(p1, g2)
+ d(g1, p2) + 1} when g1 ≠ g2, where d(p, g) stands for the shortest distance between the two processors p and g using any of the possible shortest paths as seen in the above forms 1, 2 and 3 [15].
It is obvious from the above theorem that when g1 = g2, the length of the path between the two processors 〈g1, p1〉 and 〈g2, p2〉 is d(p1, p2). From the shortest path construction methods in forms 2 and 3 above, it can easily be verified that when g1 ≠ g2 the length of the path equals min{d(p1, p2) + d(g1, g2) + 2, d(p1, g2) + d(g1, p2) + 1}.

To send a message M from the source node 〈g1, p1〉 to the destination node 〈g2, p2〉, it must follow a route along one of the three possible paths 1, 2, and 3. The length of the shortest path between the nodes 〈g1, p1〉 and 〈g2, p2〉 is therefore

Length = d(p1, p2) if g1 = g2, and Length = min(d(p1, g2) + d(g1, p2) + 1, d(p1, p2) + d(g1, g2) + 2) otherwise,  (3)

where d(p1, p2) is the length of the shortest path between the two processors 〈g1, p1〉 and 〈g1, p2〉. If δ0 is the diameter of the factor network An,k, then from (3) it follows that the diameter of the OTIS-An,k is max(δ0, 2δ0 + 1), which is equal to 2δ0 + 1. The proof of the above theorem follows directly from the path constructions 1, 2 and 3 above.
Fig. 2. OTIS-A3,2 network
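The case analysis in Theorem 1 and equation (3) translates directly into a few lines of code. The sketch below is illustrative and not from the paper; it assumes a caller-supplied function d that returns shortest distances in the factor arrangement graph (computed, for example, by breadth-first search).

```python
def otis_distance(d, g1, p1, g2, p2):
    """Shortest-path length between <g1,p1> and <g2,p2> in the
    OTIS-Arrangement, given factor-graph distances d (equation (3))."""
    if g1 == g2:
        return d(p1, p2)                      # form 1: electronic moves only
    return min(d(p1, g2) + d(g1, p2) + 1,     # form 3: one optical move
               d(p1, p2) + d(g1, g2) + 2)     # form 2: two optical moves
```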
4 Hierarchical Decomposition for the OTIS-Networks

In this section the hierarchical structure of the OTIS-AN is discussed. The properties of a new decomposition method for the OTIS-AN are presented and proved. These
properties are then used in the subsequent sections to develop grids and pipelines as methods for developing various parallel algorithms on the OTIS-AN. An OTIS-AN-based computer contains N² processors partitioned into N groups with N processors each. A processor is indexed by a pair 〈x, y〉, 0 ≤ x, y < N, where x is the group address and y the processor address; for a node u = 〈x, y〉, let γ(u) = x and ρ(u) = y denote its group and processor parts. For each i in V0, let Ψi denote the subgraph induced by the node set {〈i, x〉 | x ∈ V0}, and let Φi denote the subgraph induced by {〈x, i〉 | x ∈ V0}. The quotient graph GΨ is obtained by contracting each Ψi into a single vertex labeled by i and having a link between i and j if Ψi and Ψj share a perfect matching, i.e. VGΨ = V0 and EGΨ = {(i, j) | Ψi perfectly matches Ψj}.

Theorem 2
The two Ψ and Φ decomposition methods of the OTIS-AN0 have the following properties (Fig. 3):

1. Ψi is isomorphic to AN0.
2. VΨi ∩ VΦj = {〈i, j〉}.
3. Ψi and Φi share a perfect matching for all i values.
4. Ψi and Ψj share a perfect matching for all i and j values, and hence GΨ is a complete graph.
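Before turning to the proof, property 2 can be sanity-checked numerically. The snippet below is an illustrative check (not part of the paper) and reuses the nodes list of the factor graph built in the earlier sketches.

```python
Psi = lambda i: {(i, x) for x in nodes}   # node set of the group subgraph
Phi = lambda j: {(x, j) for x in nodes}   # node set of the "column" subgraph
i, j = nodes[0], nodes[1]
assert Psi(i) & Phi(j) == {(i, j)}        # Theorem 2, property 2
```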
Fig. 3. The grid structure of the OTIS-AN
Proof
Property 1 is a direct consequence of the definition of Ψi. The function ρ maps nodes from VΨi to V0; in fact, the set {ρ(u) | u ∈ Ψi} is equal to V0 for any i. Since any two neighboring nodes u and v in Ψi have γ(u) = γ(v), and since (ρ(u), ρ(v)) is an edge in E0, the subgraph Ψi is isomorphic to AN0.

Property 2 states that for any two labels i and j from V0, the two subgraphs Ψi and Φj have exactly one node in common. Since VΨi = {〈i, x〉 | x ∈ V0} and VΦj = {〈x, j〉 | x ∈ V0}, the intersection VΨi ∩ VΦj contains only the node 〈i, j〉.

Let fi : VΨi → VΦi be the function that maps nodes from Ψi into Φi, for all i values, defined as follows: fi(〈x, y〉) = 〈y, x〉. First we have |VΨi| = |VΦj| for all i and j. For any two distinct nodes u and v in VΨi we have fi(〈γ(u), ρ(u)〉) = 〈ρ(u), γ(u)〉 ≠ fi(〈γ(v), ρ(v)〉) = 〈ρ(v), γ(v)〉, because ρ(u) ≠ ρ(v). Hence the function fi is one-to-one and onto, and property 3 follows.
Let tij : VΨi → VΨj be the function that maps nodes from Ψi into Ψj, for any i and j, as follows: tij(〈i, x〉) = 〈j, x〉. For any two distinct nodes u and v from VΨi we have tij(〈i, ρ(u)〉) = 〈j, ρ(u)〉 ≠ tij(〈i, ρ(v)〉) = 〈j, ρ(v)〉. Since |VΨi| = |VΨj|, it follows that Ψi and Ψj share a perfect matching for all i and j values, and hence GΨ is a complete graph.

Lemma 1: GΨ can be embedded into OTIS-AN0 with dilation δAN0 + 2.
Proof
Since GΨ is complete, any two distinct nodes i and j in VGΨ are neighbors. The "virtual" path between 〈i, x〉 and 〈j, x〉 in OTIS-AN0 that corresponds to the edge (i, j) in EGΨ is constructed as follows: 〈i, x〉 → 〈x, i〉 || πG0(i, j) || 〈x, j〉 → 〈j, x〉. An arrow represents an edge connecting the two nodes, and the operation "||" means appending two paths (i.e. connecting the last node of the left path to the first node of the right path). Notice that the choice of x from V0 affects neither the construction of this path nor its length. The path segment πG0(i, j) is an isomorphic copy of the optimal-length path from i to j in AN0. It can be verified that the above constructed path is of optimal length, equal to dAN0(i, j) + 2. Hence, the longest such path cannot exceed δAN0 + 2.
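The proof's path construction can be written out mechanically. The helper below is a hypothetical illustration (the name factor_path and its interface are assumptions, not from the paper): given a routine that returns a shortest path [i, …, j] in the factor network, it builds the virtual path realizing the GΨ edge (i, j), whose length is dAN0(i, j) + 2 as claimed.

```python
def virtual_path(i, j, x, factor_path):
    """Virtual path in OTIS-AN0 for the G_Psi edge (i, j):
    <i,x> -optical-> <x,i>, then a copy of pi(i,j) inside group x
    ending at <x,j>, then -optical-> <j,x>."""
    middle = [(x, v) for v in factor_path(i, j)]  # pi(i,j) embedded in group x
    return [(i, x)] + middle + [(j, x)]
```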
5 Conclusion

The study of algorithms on Optical Transpose Interconnection Systems (OTIS) is still far from mature. In this paper, we have contributed towards filling this gap by proposing a method for algorithm development on the OTIS-AN network. This method is based on the grid structure as a popular framework for supporting a vast body of important real-world parallel applications. Utilizing this method to develop parallel algorithms for linear algebra will be discussed as a future case study. Several topological properties, including size, degree, diameter, number of links and the shortest distance between any two nodes, have been discussed. The proposed OTIS-AN is shown to be an attractive alternative to its factor network in terms of routing, by utilizing both electronic and optical technologies. As future research we could utilize the proposed framework in solving real-life problems on the OTIS-AN, including matrix problems and fast Fourier transforms.
References

1. Agelis, S.: Optoelectronic router with a reconfigurable shuffle network based on micro-optoelectromechanical sys. Journal of Optical Networking 4(1), 1–10 (2005)
2. Akers, S.B., Harel, D., Krishnamurthy, B.: The Star Graph: An Attractive Alternative to the n-Cube. In: Proc. Intl. Conf. Parallel Processing, pp. 393–400 (1987)
3. Al-Sadi, J., Awwad, A., AlBdaiwi, B.: Efficient Routing Algorithm on OTIS-Star Network. In: Proceedings of the IASTED International Conference on Advances in Computer Science and Technology, pp. 157–162. ACTA Press (2004)
4. Awwad, A., Al-Ayyoub, A., Ould-Khaoua, M., Day, K.: Solving Linear Systems Equations Using the Grid Structural Outlook. In: Proceedings of the 13th IASTED Parallel and Distributed Computing and Systems (PDCS 2001), USA, pp. 365–369 (2001)
5. Chatterjee, S., Pawlowski, S.: Enlightening the Effects and Implications of Nearly Infinite Bandwidth. Communications of the ACM 42(6), 75–83 (1999)
6. Dally, W.J.: Performance Analysis of k-ary n-cubes Interconnection Networks. IEEE Trans. Computers 39(6), 775–785 (1990)
7. Day, K., Tripathi, A.: Arrangement Graphs: A Class of Generalised Star Graphs. Information Processing Letters 42, 235–241 (1992)
8. Day, K., Al-Ayyoub, A.: Topological properties of OTIS-Networks. IEEE Transactions on Parallel and Distributed Systems 13(4), 359–366 (2002)
9. Day, K., Al-Ayyoub, A.: The Cross Product of Interconnection Networks. IEEE Trans. Parallel and Distributed Systems 8(2), 109–118 (1999)
10. Hendrick, W., Kibar, O., et al.: Modeling and Optimisation of the Optical Transpose Interconnection System. In: Optoelectronic Technology Centre, Program Review, Cornell University (September 1995)
11. Krishnamoorthy, A., Marchand, P., Kiamilev, F., Esener, S.: Grain-size Considerations for Optoelectronic Multistage Interconnection Networks. Applied Optics 31(26), 5480–5507 (1992)
12. Marsden, G., Marchand, P., Harvey, P., Esener, S.: Optical Transpose Interconnection System Architecture. Optics Letters 18(13), 1083–1085 (1993)
13. Wang, C., Sahni, S.: Matrix Multiplications on the OTIS-Mesh Optoelectronic Computer. IEEE Trans. Computers 40(7), 635–646 (2001)
14. Yayla, G., Marchand, P., Esener, S.: Speed and Energy Analysis of Digital Interconnections: Comparison of on-chip, off-chip, and Free-space Technologies. Applied Optics 37(2), 205–227 (1998)
15. Sahni, S., Wang, C.: OTIS-star an attractive alternative network. In: Proceedings of the 4th WSEAS International Conference on Software Eng., Parallel & Distributed Systems (2005)
16. Sahni, S.: Models and Algorithms for Optical and Optoelectronic Parallel Computers. In: Proceedings of the International Symposium on Parallel Algorithms and Networks, pp. 2–7. IEEE Computer Society Press, Los Alamitos (1999)
17. Sahni, S., Wang, C.: BPC Permutations on the OTIS-mesh Optoelectronic Computer. Technical Report 97-008, CISE department, University of Florida (1997)
18. Wang, C., Sahni, S.: Image Processing on the OTIS-mesh Optoelectronic Computer. IEEE Trans. Parallel and Distributed Systems 11(2), 97–109 (2000)
19. Wang, C., Sahni, S.: Computational Geometry On The OTIS-Mesh Optoelectronic Computer. In: Proceedings International Conference on Parallel Processing, pp. 501–507 (2002)
20. Zane, F., Marchand, P., Paturi, R., Esener, S.: Scalable Network Architecture Using the Optical Transpose Interconnection System (OTIS). Journal of Parallel and Distributed Computing 60, 521–538 (2000)
21. Al-Sadi, J., Awwad, A.: A New Efficient Interconnection Network. The International Arab Journal of Information Technology (accepted, 2009) (to appear)
Single Thread Program Parallelism with Dataflow Abstracting Thread Tianzhou Chen, Xingsheng Tang, Jianliang Ma, Lihan Ju, Guanjun Jiang, and Qingsong Shi College of Computer Science, Zhejiang University, Hangzhou, Zhejiang, 310027, China {tzchen,tang1986,majl,lhju,libbug,zjsqs}@zju.edu.cn
Abstract. A chip multiprocessor (CMP) offers clear benefits over a uni-core processor, but it serves legacy code poorly, because legacy code was written for uni-core machines. This paper presents a novel thread-level parallelism technique called Dataflow Abstracting Thread (DFAT). DFAT builds a United Dependence Graph (UDG) for a program and decouples its single thread into multiple threads that can run on a CMP in parallel. DFAT analyzes the program's data, control and anti-dependences to obtain a dependence graph; the dependences are then merged and annotated with attributes to form the UDG. The UDG fixes the execution order of instructions, and according to this order instructions are assigned to different threads one by one, with an assignment algorithm deciding the placement. DFAT considers both communication overhead and thread balance after the initial thread division. Thread communication in DFAT is implemented with a producer-consumer model. DFAT extracts multiple threads from a single thread automatically and can be implemented in a compiler. In our evaluation, we decouple a single thread into up to eight threads with DFAT, and the results show that decoupling into four to six threads yields the best benefit.
1 Introduction

As uni-core processors run into many problems, multi-core processors are becoming more and more popular. A multi-core processor is also called a chip multiprocessor (CMP). There is no doubt that a CMP brings many benefits to parallel or multi-threaded programs, but most current applications were built for uni-core processors and are usually single-threaded. A CMP does not serve single-threaded programs well because per-core clock frequency is declining. Automatically transforming a single-threaded program into a multi-threaded one is therefore an attractive approach, and many researchers are devoting themselves to this area.

Nowadays the main idea behind single-thread parallelization on a CMP is to uncover the parallel parts of the code. Instruction-level parallelism (ILP), thread-level parallelism (TLP) and thread-level speculation (TLS) are the usual methods. ILP is a fine-grained form of parallelism that executes instructions concurrently. TLP, also called software-level parallelism, extracts multiple threads from a single-threaded program and assigns each extracted thread to a core to perform part of the work; several extracted threads can form a kind of software pipeline. Speculation is
another parallelization method and can be considered a kind of TLP: it executes some parts of the code ahead of time according to a special algorithm. Speculation is an important direction of current research and has many implementations, but a wrong speculation does not improve program performance and may even lose some.

Whatever one does in the parallelization area, dependences are the basic constraint on program parallelism, so all current research builds on dependence analysis. A program has two kinds of dependences, control dependence and data dependence. But if a sequential code slice is to be executed in parallel, anti-dependence must also be considered. If we regard a branched code slice as value-dependent on its branch condition, control dependence becomes a kind of data dependence; together with the original data dependences, data and control dependences link instructions into chains, which we call data-flows. Usually many data-flows exist in a program at the same time, and they interweave to form the program's overall data-flow. Adding anti-dependences to the program's dependences, it can be decided definitely that some instructions must execute before certain others, some after, and some can execute at the same time. Based on this fact, a graph of instruction execution order can be constructed. The graph indicates which instructions can run at the same time, so all instructions in a code slice can be divided into groups according to it: instructions with strong connections are assigned to the same group and those with weak connections to different groups. Grouping separates one slice into two or more slices so that the code can run in parallel on a CMP.

This paper extracts program parallelism with Dataflow Abstracting Thread (DFAT) at compile time. DFAT is an independent thread-parallelization method and is compatible with other parallel technologies. DFAT analyzes the program's data, control and anti-dependences to obtain a dependence graph; the dependences are then merged and annotated with attributes to form a United Dependence Graph (UDG). The UDG decides the execution order of all instructions in the graph, and according to this order instructions are assigned to different threads one by one, with an algorithm deciding the assignment. DFAT considers both communication overhead and thread balance after the initial thread division. Thread communication in DFAT is implemented with a producer-consumer model. DFAT extracts multiple threads from a single thread automatically and can be implemented in a compiler.
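As a concrete illustration (our own example, not taken from the paper's benchmarks), all three dependence kinds can appear within a few lines of C:

int a = 0, b = 0, c = 0, t;  /* illustrative variables */
t = a + 1;                   /* S1 */
if (t > 0)                   /* S2: data dependence on S1 (reads t)       */
    b = t * 2;               /* S3: control dep. on S2, data dep. on S1   */
a = c + 3;                   /* S4: anti-dependence on S1: S1 reads a and */
                             /*     S4 writes it, so S4 cannot move first */

In a UDG, each of these relations contributes an edge; DFAT's point is that all three kinds constrain the execution order in the same way.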
2 Related Work

Since the single-core processor faces many problems that are difficult to solve, the CMP was born at an appropriate time. The CMP architecture has many advantages over a single-core processor, and the most important one is that a CMP can run different threads at the same time, achieving real parallelism rather than mere concurrency.

ILP tries to exploit the parallelism available among instructions. It is fine-grained and can be supported by the compiler, but most often ILP is supported by the underlying hardware and is invisible to the program. Kumar et al. proposed an
architecture called Carbon to support efficient fine-grained parallelism [8]. Carbon has relatively simple hardware and is devoted to accelerating dynamic task scheduling on scalable CMPs. Ottoni et al. use a global instruction scheduler, GREMIO, to assign instructions to different cores [11]. GREMIO uses control dependence analysis to extract non-speculative thread-level parallelism from sequential code. It fits applications with parallel control flow, but if the control dependences are sequential, the assignment is heavily constrained. Sampson et al. exploit the potential of CMPs for fine-grained data-parallel tasks [16]. They present barrier filters, a mechanism for fast barrier synchronization on chip multiprocessors that enables vector computations to be distributed efficiently across the cores of a CMP. The barrier guarantees data synchronization, but if the parallel parts are not of similar length, waiting can impose a huge overhead; in particular, when one part needs more time than the others, it causes a lot of latency.

ILP was proposed long ago, and many modern processors take advantage of it; [19] and [20] are representative studies. These works share the usual characteristics of ILP: it is implemented by the underlying hardware and is invisible to the compiler and programmer, although compiler optimization can make it more efficient. The benefit of ILP is great, but it also adds hardware design complexity: the hardware must provide faster data transmission and consumes more energy, and ILP brings more resource contention.

Most programs are compute-intensive and have abundant thread-level parallelism, and are therefore good targets for running on a CMP. Thies et al. focus on streaming applications in legacy C code and propose a pragmatic approach to leveraging coarse-grained pipeline parallelism [1]. It inserts annotations into the legacy code to indicate pipeline boundaries and tracks communication across those boundaries. The difficulty of this method is that the programmer has to know where the annotations should be inserted; in much legacy code the parallel parts are implicit and not easy for the programmer to find, so an automatic way to insert the annotations would be welcome. DSWP is another way to parallelize legacy code into threads, extracted automatically by the compiler [4]. It works on loops, analyzing their dependences to construct a dependence graph; according to the graph, some instructions are merged into new nodes so that all dependences in the graph become ordered. If the loop is nested, balancing the inner loop is a challenge for DSWP. The benefit of DSWP is that thread extraction is done by the compiler without manual help; the disadvantage is that it brings much communication overhead and cannot handle cyclic blocks.

Speculation can be used for both ILP and TLP, but on a CMP it is usually applied as TLP, sometimes called thread-level speculation. Hammond et al. [14] describe a complete implementation of thread-level speculation support on the Hydra chip multiprocessor, which improves performance well when there is a substantial amount of medium-grained loop-level parallelism; the data path, however, may become the constraint of the whole system.
Actually, speculation can be adopted over a wider range of areas; speculative thread parallelism in SPEC2000 is exposed in [10], which manually parallelizes several SPEC2000
benchmarks to discuss how and where parallelism is located within each application. Such work can guide future advanced TLS compiler design. The difficulty with the speculation in that paper is that some of it is hard to implement: people can analyze a program's characteristics from its structure and then apply efficient speculation by hand, but a compiler cannot do so as intelligently. TLP and speculation have no clear dividing line, and much research implements speculation on top of TLP; speculation can pipeline the execution of code blocks [5] [6]. TLP needs compiler support to extract parallel parts from sequential code, and its keys are thread communication and balance. Speculation usually also needs compiler support, and sometimes hardware support as well; the speculation policy decides its performance.
3 DFAT

3.1 Motivation

There is no simple and efficient multi-threaded program design technology today. Because of the variety of CMPs, building a general multi-threaded programming model will take a long time. Existing parallel programming models such as OpenMP and MPI can exploit the advantages of a CMP and improve program performance, but they require programmers to understand clearly which parts of a program are parallel and to specify how those parts execute. Programmers must learn new skills for writing parallel programs, and they may fail to find parallel parts that are implicit.

In this section we propose a technique called DFAT to improve the parallelism of single-threaded applications on a CMP. The technique automatically analyzes the dependences among instructions and groups parallelizable instructions into new threads, so that different cores of the CMP can run those threads in parallel. DFAT is an automatic thread-decoupling technique carried out by the compiler; it preserves the semantics of the original single-threaded application when the decoupled threads execute on the CMP. DFAT is not exclusive: it can coexist with ILP, software pipelining and speculation.

An in-order program has two kinds of dependences, data dependence and control dependence. These two decide how the program runs, but they also constrain parallelization. If we want to split a single-threaded program into two or more parts and run the parts in parallel, a new dependence is introduced: anti-dependence, which resembles the write-after-read (WAR) hazard in an out-of-order processor. That is, the values of some variables would be changed by instructions that, in the original program, execute later than the instructions using those variables. In this paper we mix the three dependences and call the combination the United Dependence (UD). The UD decides the instruction execution order and is the key point of our analysis.

Figure 1 is an example of instruction parallelization. Figure 1(a) is a code slice, a basic block taken from SPEC2000 twolf; (b) is the UDG of (a); how a UDG is obtained is described in the following section. (c) and (d) are two divisions of the original code, and the
Fig. 1. (a) Code slice from SPEC2000 twolf; (b) UDG of (a); (c) Decoupling (a) into two groups; (d) Communication overhead with a different decoupling; (e) UDG with attributes added
Fig. 2. An example of control value transmission
dashed lines denote value transmission. (c) is an ideal division and is the result DFAT produces; (d) swaps instructions 4 and 5 and brings more communication. For a basic block, division is simple, but in a loop or branch the control value must be transmitted to each group. Figure 2 is an example of branch-value transmission: when the original branch is decoupled into more than one branch, the original branch value must be sent to each new branch, which is implemented by thread communication. A loop condition is handled the same way as a branch.

3.2 DFAT Algorithm

We can treat the three dependences as a single kind, because what all of them ultimately do is decide the instruction execution order. So we merge them into one dependence called the United Dependence; the graph based on it is the United Dependence Graph (UDG), which is therefore also a graph of instruction execution order. The rest of this section presents the DFAT algorithm; Figure 3 gives its pseudo code.

The DFAT algorithm is recursive and works on un-nested loops, i.e., loops that contain no inner loops and consist only of basic blocks and branches. After finding such a loop, it extracts the sequential block (all instructions in the loop except the loop condition), analyzes the dependences among instructions, and obtains a Dependence Graph (DG) with control, data and anti-dependences. If any two nodes in the DG are connected by more than one edge, the redundant edges are eliminated and just one is left; this single edge is the United Dependence and decides the instruction execution order. When all redundant edges have been eliminated, the DG has become the UDG.
Fig. 3. Pseudo code of the DFAT algorithm
To account for communication overhead and for balance among the decoupled threads, communication volume and node latency are attached to edges and nodes: since data and control dependences require value transmission, each edge carrying values is annotated with the number of values transmitted, and each node with its estimated execution time. Figure 1(e) is the UDG with these attributes added.

The next step of DFAT divides the UDG into groups; the number of groups is the number of threads, set beforehand by the system, the compiler or the user, and the instructions in one group form one thread. The goal of the division is the best thread performance on the CMP after decoupling. Every UDG has an optimal division, but finding it is an NP-hard problem, so in this paper we use a heuristic algorithm to reduce time and space consumption; in our evaluation the heuristic achieves performance close to the optimal solution. In our division we consider communication overhead first, i.e., instructions with value dependences should be assigned to the same thread, but thread balance is also important and must be guaranteed. For balance we set a threshold: the largest running-time difference among threads must not exceed it. There are many ways to set the threshold; it could be a constant, a random number within a range, or a ratio of the longest thread's running time. In this paper we chose a ratio of the original code's running time as the threshold.

The detailed division works as follows. First, it takes a free node from the UDG, a free node being one that depends on no other node. The free node is tentatively placed into every group, and for each group DFAT calculates the communication overhead
and checks whether the placement remains balanced; if not, the node cannot be put into that group. The free node is finally put into the group with the minimum communication overhead; if the communication overhead is the same for several groups, the more balanced group (the one giving the smallest estimated running-time difference) is chosen. The communication overhead is the total number of value transmissions. Balance is judged against the threshold, and the estimated thread running time has two parts, instruction execution time and estimated thread waiting time, the latter calculated from the instruction division so far. If the largest estimated running-time difference among threads exceeds the threshold, the threads are unbalanced. After a node is placed, it is deleted from the UDG together with its edges, which may produce new free nodes; the algorithm then takes another free node and repeats until the UDG is empty.

After the division of an un-nested loop, each group is treated as a single node in the original code. DFAT estimates the running time of every group and calculates each group's live-in and live-out sets. When estimating running time, the loop is assumed to execute only once: since every group contains some instructions of the loop, the groups stay balanced no matter how many iterations the loop has.
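A compact sketch of this greedy division loop is shown below. It is our own rendering under assumed helper names (udg_empty, pick_free_node, comm_cost, balanced_if_added, est_time), not the authors' implementation from Figure 3:

#include <limits.h>

typedef struct Node Node;
typedef struct UDG  UDG;

/* Assumed helpers, declared only to make the sketch self-contained. */
int    udg_empty(UDG *g);
Node  *pick_free_node(UDG *g);           /* node with no remaining deps */
int    comm_cost(Node *n, int group);    /* values to be transmitted    */
int    balanced_if_added(Node *n, int group, double threshold);
double est_time(int group);              /* exec + estimated wait time  */
void   assign(Node *n, int group);
void   remove_node_and_edges(UDG *g, Node *n);

void divide_udg(UDG *udg, int num_groups, double threshold) {
    while (!udg_empty(udg)) {
        Node *n = pick_free_node(udg);
        int best = -1, best_cost = INT_MAX;
        for (int g = 0; g < num_groups; g++) {
            if (!balanced_if_added(n, g, threshold))
                continue;                /* would break the threshold   */
            int cost = comm_cost(n, g);
            if (best < 0 || cost < best_cost ||
                (cost == best_cost && est_time(g) < est_time(best))) {
                best = g;
                best_cost = cost;
            }
        }
        if (best < 0)
            best = 0;                    /* fallback; policy not given  */
        assign(n, best);
        remove_node_and_edges(udg, n);   /* may create new free nodes   */
    }
}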
Fig. 4. A special case that should be considered (sketch: 1: j = …; 2: while or if (condition) { 3: … = j, or not; 4: j = …; 5: … = j, or not; } 6: … = j)
Once all instructions are grouped, producer and consumer instructions are inserted into the threads. One special case must be considered, illustrated in Figure 4: the value of variable j read by instruction 6 may come from instruction 1, or from inside the loop or branch. In such a case, instructions like 1 and 4 are adjusted during the earlier division and put into the same thread; then, if j is used in other threads, the producer instruction is placed after the closing bracket.

3.3 Communication

We use a producer-consumer model for thread communication, which requires some modifications to the compiler: it must add two instructions, producer and consumer. The format of producer is an operation code, a register and a memory address; the register holds the value to be delivered and the memory address points to where the value is stored. The format of consumer is likewise an operation code, a register and a memory address. The difference between producer and consumer is that the former writes the value to memory and the latter reads it back.
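At the C level, the semantics of this instruction pair can be modeled roughly as follows; the ready-flag handshake is our assumption, since the paper only specifies that producer writes a register value to a memory slot and consumer reads it back:

#include <stdint.h>

typedef struct {
    volatile uint32_t value;   /* the transmitted register value        */
    volatile int      ready;   /* assumed flag; not stated in the paper */
} comm_slot_t;

static inline void producer(comm_slot_t *slot, uint32_t reg_value) {
    slot->value = reg_value;   /* write the value to the memory address */
    slot->ready = 1;           /* make it visible to the consumer       */
}

static inline uint32_t consumer(comm_slot_t *slot) {
    while (!slot->ready)       /* wait until the producer has written   */
        ;
    slot->ready = 0;           /* consume the value exactly once        */
    return slot->value;        /* read the value back into a register   */
}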
4 Evaluation

4.1 Test Cases and Simulator

Table 1. The baseline microarchitecture of every core

I cache:         16K 4-way set-associative, 32 byte blocks, 1 cycle latency
D cache:         16K 4-way set-associative, 32 byte blocks, 1 cycle latency
L2 cache:        1024K 8-way set-associative, 128 byte blocks, 6 cycle latency
Branch Pred:     hybrid
Execution:       Out-of-Order issue, Out-of-Order execution, In-Order commit
Function Unit:   2 integer ALU, 1 integer MULT/DIV, 2 load/store units, 2 FP ALU, 1 FP
Size (insts):    IQ:8, RUU:32, LSQ:8
Width (insts/c): decode:2, issue:2, commit:2
In this section we evaluate the proposed idea. We choose benchmarks from SPEC2000, the ones mentioned and tested in [10]. We do not evaluate each whole benchmark; we implement only the important parts indicated in [10], because those parts take almost all of the running time of the whole benchmark. We decouple the chosen parts into two to eight threads and simulate them on a CMP. The simulation implements only the method proposed in this paper and uses no other parallelization technology.

The simulation is implemented in SimpleScalar; the baseline microarchitecture is shown in Table 1. We modified SimpleScalar heavily and merged multiple cores into it. The cores are homogeneous, and instruction execution latency, cache construction and memory hierarchy follow the SimpleScalar defaults. Thread communication is handled through the producer-consumer model, and two new instructions (producer and consumer) are added to the SimpleScalar instruction set. The L1 cache is private to each core, the L2 cache is shared among cores, and cache coherence is guaranteed by snooping.

Table 2. The threshold values set in this paper

# of threads:  2    3    4    5    6    7    8
Threshold:     0.4  0.5  0.4  0.3  0.3  0.2  0.2
The balance of threads is based on the threshold. There are many ways to define the threshold, and every candidate can be tested by running a set of programs; a changeable threshold gives the system flexibility. In our evaluation we took a ratio of the original program's running time as the threshold and tested reasonable values for different numbers of threads; Table 2 shows the result.

4.2 Performance Evaluation Compared to the Baseline

The baseline performance is the original single-threaded program running on the baseline microarchitecture of Table 1. Figure 5 shows the performance improvement normalized to this baseline. In the figure, the performance
Fig. 5. Speedup with multiple threads (x-axis: number of threads, 2-8; y-axis: normalized performance; curves: gzip, gcc, parser, vortex, bzip, art)

Fig. 6. The ratio of waiting time (x-axis: number of threads, 2-8; y-axis: waiting time as a percentage; curves: gzip, gcc, parser, vortex, bzip, art)
improvement is similar across programs when the number of threads is at most 4, but beyond 4 it varies: some programs keep improving while others degrade. The reason is the constraint of program dependences, which makes threads spend a lot of time waiting. If the overhead of thread communication were zero, performance after decoupling would never degrade as the number of threads increases; in the worst case the improvement would stay constant beyond some thread count. But thread communication causes more memory accesses and cache-coherence traffic, so as the number of threads grows, communication overhead kills the performance improvement. Decoupling into 4 to 6 threads works best in DFAT, though there are exceptions: vortex still improves when decoupled into 8 threads.

Figure 6 shows the average waiting time of each core as a percentage of the thread running time. First, all programs share the same trend: cores spend more time waiting as the number of threads increases. But some programs spend more time waiting than others at the same thread count, so programs obtain different improvements for the same number of threads. Second, the percentages of average waiting time are close to each other while the performance improvements are scattered, especially beyond 6 threads. The main reason is that the percentage indicates only the ratio of waiting: different programs have different running times after decoupling, so programs with the same waiting percentage have different absolute waiting times.
Acknowledgement

This work is supported by the Research Foundation of the Education Bureau of Zhejiang Province under Grant No. Y200803333, the State Key Laboratory of High-end Server & Storage Technology (No. 2009HSSA10), the National Key Laboratory of Science and Technology on Avionics System Integration, and the Special Funds for Key Programs of China Nos. 2009ZX01039-002-001-04, 2009ZX03001-016 and 2009ZX03004-005.
References

1. Thies, W., Chandrasekhar, V., Amarasinghe, S.: A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (December 2007)
2. Kaul, M., Vemuri, R., Govindarajan, S., Ouaiss, I.: An Automated Temporal Partitioning and Loop Fission Approach for FPGA Based Reconfigurable Synthesis of DSP Applications. In: Proceedings of the 36th ACM/IEEE Conference on Design Automation (June 1999)
3. Bondhugula, U., Ramanujam, J., Sadayappan, P.: Automatic Mapping of Nested Loops to FPGAs. In: Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (March 2007)
4. Ottoni, G., Rangan, R., Stoler, A., August, D.I.: Automatic Thread Extraction with Decoupled Software Pipelining. In: Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (November 2005)
5. Bhowmik, A., Franklin, M.: A General Compiler Framework for Speculative Multithreading. In: Proceedings of the 14th ACM Symposium on Parallel Algorithms and Architectures, pp. 99–108 (2002)
6. Johnson, T.A., Eigenmann, R., Vijaykumar, T.N.: Min-Cut Program Decomposition for Thread-Level Speculation. In: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pp. 59–70 (2004)
7. Gordon, M.I., Thies, W., Amarasinghe, S.: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs. ACM SIGPLAN Notices 41(11) (November 2006)
8. Kumar, S., Hughes, C.J., Nguyen, A.: Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors. In: Proceedings of the 34th Annual International Symposium on Computer Architecture (June 2007)
9. Vachharajani, N., Iyer, M., Ashok, C., Vachharajani, M., August, D.I., Connors, D.: Chip Multi-Processor Scalability for Single-Threaded Applications. ACM SIGARCH Computer Architecture News 33(4) (November 2005)
10. Prabhu, M.K., Olukotun, K.: Exposing Speculative Thread Parallelism in SPEC 2000. In: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (June 2005)
11. Ottoni, G., August, D.: Global Multi-Threaded Instruction Scheduling. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (December 2007)
12. Chen, J., Juang, P., Ko, K., Contreras, G., Penry, D., Rangan, R., Stoler, A., Peh, L.-S., Martonosi, M.: Hardware-Modulated Parallelism in Chip Multiprocessors. ACM SIGARCH Computer Architecture News 33(4) (November 2005)
13. Chu, M., Ravindran, R., Mahlke, S.: Data Access Partitioning for Fine-Grain Parallelism on Multicore Architectures. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (December 2007)
14. Hammond, L., Willey, M., Olukotun, K.: Data Speculation Support for a Chip Multiprocessor. ACM SIGOPS Operating Systems Review 32(5) (December 1998)
15. Sankaralingam, K., Nagarajan, R., Liu, H., Kim, C., Huh, J., Burger, D., Keckler, S.W., Moore, C.R.: Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture. In: Proceedings of the 30th Annual International Symposium on Computer Architecture (June 2003)
16. Sampson, J., Gonzalez, R., Collard, J.-F., Jouppi, N.P., Schlansker, M., Calder, B.: Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers. In: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (December 2006)
17. Rangan, R., Vachharajani, N., Vachharajani, M., August, D.I.: Decoupled Software Pipelining with the Synchronization Array. In: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (September 2004)
18. Ferrante, J., Ottenstein, K.J., Warren, J.D.: The Program Dependence Graph and Its Use in Optimization. ACM Transactions on Programming Languages and Systems 9(3), 319–349 (1987)
19. Colwell, R.P., Nix, R.P., O’Donnell, J.J., Papworth, D.B., Rodman, P.K.: A VLIW Architecture for a Trace Scheduling Compiler. In: Proceedings of the 2nd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 180–192 (April 1987)
20. Lam, M.S.: Software Pipelining: An Effective Scheduling Technique for VLIW Machines. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 318–328 (June 1988)
Parallel Programming on a Soft-Core Based Multi-core System Liang-Teh Lee, Shin-Tsung Lee, and Ching-Wei Chen Department of Computer Science and Engineering, Tatung University Taipei City 10452, Taiwan [email protected]
Abstract. A soft-core system lets designers conveniently modify the components of the architecture they design. In some systems a uni-core processor cannot provide enough computing power for applications with heavy computation. To improve the performance of a multi-core system, parallel programming is as important an issue as the hardware architecture design, yet current parallelizing compilers struggle to parallelize programs effectively, so the programmer must think from the start about how to allot tasks to each processor. In this paper we present a software framework for designing parallel programs. The proposed framework provides a convenient parallel programming environment for writing the software of a multi-core system. Our experiments show that the framework can parallelize programs effectively using the provided functions.
1 Introduction

Recently, embedded-system applications have become more popular, and the computing-speed requirements of embedded systems keep increasing. The operating system used in an embedded system is usually an RTOS (Real-Time Operating System) with timing constraints. To improve the performance of an embedded system, besides modifying the scheduling algorithm, the other possibility is to modify the hardware architecture. In some systems a uni-core processor cannot provide enough computing power for specific applications: in image processing [1], for instance, the graphics card uses multi-core processors to process images [2]. For this reason, multi-core system development has become more important.

To improve the performance of a multi-core system, in addition to the hardware architecture and task scheduling [3], parallel programming is an important issue. Today's parallelizing compilers struggle to parallelize programs effectively, so the programmer must think from the start about how to allot tasks to each processor. On general computer systems there are many parallel programming languages, such as MPI [4], OpenMP [5], CUDA [6] and OpenCL [7], but on a soft-core based embedded system no parallel programming language is available. So we expect
to provide a library-based parallel programming framework for designing algorithms on a soft-core based embedded system.
2 Background

2.1 Multi-core Architectures

Compared with a uni-core system, a multi-core system has two or more cores on a single chip. A heterogeneous multi-core system is programmed under the MPMD model; a homogeneous one under the SPMD model. In the MPMD model each core runs a different program on a different set of data; in the SPMD model the programs running on the cores are identical, and each core has its own memory space and works on a different set of data. In this paper we use a homogeneous multi-core system and the SPMD model to construct our environment, which simplifies the system architecture design.

For communication between cores, two major methods have been proposed: shared memory and message passing. In a shared-memory system, several cores access the same memory element, so they must access the data in shared memory in a consistent manner. Message passing is inter-process communication between the cores of a multi-core system: communication happens by sending messages to recipients, and the passed messages can include function invocations, signals and data packets. Message passing needs to share nothing, as long as the cores that want to exchange messages are connected.

2.2 Multi-core Architecture with a Soft-Core System

A soft-core system offers flexibility in designing the system architecture: designers can conveniently build the desired hardware by modifying an existing architecture. The existing architecture uses shared memory for data sharing; however, shared memory must solve the problem of data corruption. If one core is writing data to the shared memory while another core reads or writes at the same time, data corruption occurs; synchronization and memory-consistency issues must also be solved in a shared-memory system. To keep the design simple, we provide a Mailbox in the constructed multi-core architecture for message passing. The Mailbox is easy to use and avoids the data-corruption, synchronization and memory-consistency problems of shared memory: if updated data have not been sent to the Mailbox, other cores cannot get the data for further processing.

2.3 Parallel Programming

On general computer systems there are many parallel programming languages and frameworks, such as MPI, OpenCL, CUDA and SWARM [8]. Message Passing Interface (MPI) is a language-independent communications protocol used to program parallel computers.
For communication it supports point-to-point and collective operations. MPI has the advantages of portability and speed; because it is built on message passing, it can be used on almost every distributed-memory architecture. SWARM (SoftWare and Algorithms for Running on Multi-core) is a parallel programming framework for multi-core processors; it targets homogeneous multi-core systems and communicates through a shared-memory architecture. SWARM offers an open-source library, so we can use its functions and read its source code to understand each function's processing flow.
3 Parallel Programming Environment

In this section the proposed soft-core based multi-core system is presented. Referring to other resources [9] [10], we construct a hardware architecture consisting of three cores and the other necessary components. A multi-core architecture, in its simplest form, should include the CPUs, memory, I/O, a MUTEX, and a bus that connects all the components together; Fig. 1 shows this simple architecture. Some important components deserve detailed discussion: the CPUs, memory, bus, MUTEX, Mailbox, System ID and I/O.
Fig. 1. A Simple Multi-core Hardware Architecture
3.1 Hardware Architecture

A basic multi-core system includes several processors connected by a system bus to memory and peripherals [9]. In this paper we implement the multi-core system on Altera's NEEK FPGA platform; the number of logic elements in the NEEK allows a multi-core system of three cores. Fig. 1 shows the system architecture. In the SPMD model each core has its own memory space to store program instructions and data. Between the CPUs and the memories we use a bridge as the connector: whenever a CPU reads an instruction or data, the access goes through the bridge. In this model the bridge's bandwidth becomes the bottleneck of system performance, so we place an instruction cache in each CPU to decrease memory access time. For communication between the CPUs we offer shared memory and the Mailbox to send data, and for shared resources we offer the hardware MUTEX to protect shared data and keep it correct.
The System ID component assigns a serial number to every CPU. A compiled program is loaded onto its target CPU according to the System ID number, which ensures the program executes correctly. At run time we use the SPMD model to implement our parallel programming environment: each CPU runs the same program, and if we want to direct a specific code block to a particular CPU, we can use the System ID number to decide which CPU should execute that section. For the I/O devices, to simplify management, every I/O device is controlled by CPU1; if another CPU needs to access an I/O device, it must send the instruction to CPU1, and once CPU1 receives the instruction it controls the operation of the device.

3.2 Software Environment

Our programming framework is library-based and consists of many functions used in parallel programming. With these functions, programmers can parallelize their programs with only small changes. Before using the functions, programmers need to include the header file parallel.h. A program using our framework is structured as follows:

#include "parallel.h"
int main()
{
    p_Init();
    /* the routine you want to parallelize */
    p_Finalize();
}

In the main function, the p_Init() function must first be called to initialize the environment; during initialization every thread sets up components such as the hardware mutex, the shared memory and some useful variables we define. THREADS is defined as the total number of cores in the system; MYTHREAD is the ID number of the current core, starting from 0; THREADED is a structure containing the information required for running the parallel routine. After initialization, the offered functions can be used to write the code to be processed in parallel. At the end, the p_Finalize() function must be called to release the allocated memory.

3.3 Partition Function

3.3.1 Pardo

If a program computes inside a loop, executing the loop takes a lot of time. We design a loop-partition function, pardo, to execute the loop concurrently on the cores: pardo partitions the loop into several parts according to the number of CPUs. When the data being computed are independent, executing the loop with pardo gives good performance. pardo is a macro that expands into a "for" loop cutting the iteration space into several parts.
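The paper does not show pardo's definition; one plausible way to write such a macro, assuming a cyclic distribution of iterations over the cores, is:

/* Hypothetical definition of pardo (the paper does not give one).
 * MYTHREAD and THREADS come from parallel.h as described above;
 * the cyclic distribution is our assumption, a blocked one works too. */
#define pardo(i, start, end, step)                       \
    for ((i) = (start) + MYTHREAD * (step);              \
         (i) < (end);                                    \
         (i) += THREADS * (step))

Used as pardo(i, 0, SIZE, 1) { ... }, core k would then execute iterations k, k+THREADS, k+2*THREADS and so on, matching the framework's usage in Section 4.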
3.3.2 on_one_thread, on_thread()

For program control we provide two functions, on_one_thread and on_thread(thread_id). When programmers want to dispatch a code block to a particular thread, they can use these two functions. on_thread(thread_id) allocates the code block surrounded by braces to the assigned thread: given the thread number in the thread_id position, each CPU decides from its own thread number whether it should execute the block. Both are defined as macros that expand to if statements: on_one_thread is replaced by "if (MYTHREAD == 0)", and on_thread(thread_id) by "if (MYTHREAD == (thread_id))".

3.4 Synchronization Function

Because the amount of work differs after the operation blocks are allocated to different threads, the threads running on different cores are asynchronous. When we want to synchronize the threads, we can use the barrier function, which synchronizes all threads in the system. When a thread reaches the barrier, it checks whether all threads have reached it; if not, the arrived threads wait until all threads reach the barrier. To implement this function we need a mutex core and a shared-memory buffer that counts the arrived threads. When a thread reaches the barrier, it locks the mutex, increments the counter and checks its value. If the counter equals THREADS, the thread resets the counter to zero and unlocks the mutex; otherwise it unlocks the mutex and enters a busy-waiting loop, in which it reads the counter and checks whether it is 0, jumping out of the loop when it is. The flowchart of the barrier function is shown in Fig. 2.
Fig. 2. The flowchart of barrier function
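In C, the algorithm of Fig. 2 could look roughly like this; mutex_lock/mutex_unlock are assumed wrappers around the hardware MUTEX, not names documented by the framework:

extern int THREADS;                   /* from parallel.h                   */
void mutex_lock(void);                /* assumed hardware-MUTEX wrappers   */
void mutex_unlock(void);

volatile int *barrier_count;          /* shared-memory buffer, initially 0 */

void barrier(void)
{
    mutex_lock();
    if (++(*barrier_count) == THREADS) {
        *barrier_count = 0;           /* last arrival releases everyone    */
        mutex_unlock();
    } else {
        mutex_unlock();
        while (*barrier_count != 0)   /* busy-wait until the reset         */
            ;
    }
}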
3.5 Collective Communication Function (Broadcast and Reduce)

In the SPMD (Single Program Multiple Data) architecture each core runs the same program in its own memory space. When variables on every CPU need to take part in an operation synchronously, we can use the collective communication functions. The broadcast function sends the value of a variable on one thread to all the others, synchronizing the value of the same variable across the cores. In the broadcast function we declare a shared-memory buffer to hold the broadcast value: the master thread stores the value into the buffer when it executes the instruction, and every other thread copies it into its local variable when it executes the instruction. The reduce function combines the partial values on the cores into their sum, maximum or minimum, depending on the argument passed: SUM computes the sum of the partial values provided by the threads, MAX finds their maximum, and MIN their minimum.

3.6 Memory Management Function (p_malloc, p_free)

To use the shared memory we provide two functions that allocate and release memory space: p_malloc allocates space in the shared memory and p_free releases it. At the start of the program we obtain the base address of the shared memory, so we can manage it through pointers. As an example, we declare an integer pointer A; after calling p_malloc, A holds an address in the shared memory that the function allocated. When the memory is no longer needed, p_free should be called to release the allocated space.
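A usage sketch follows. The paper does not spell out the call signatures, so the names p_broadcast and p_reduce and their argument order are our guesses modeled on the framework's p_ prefix; SUM and MAX are the operation arguments named in the text:

#include "parallel.h"

int main(void)
{
    p_Init();
    int seed  = 42;                    /* meaningful only on core 0       */
    int local = MYTHREAD + 1;          /* each core's partial value       */

    p_broadcast(&seed, 0);             /* hypothetical: core 0 to all     */
    int total = p_reduce(local, SUM);  /* 1+2+3 = 6 on three cores        */
    int peak  = p_reduce(local, MAX);  /* 3 on three cores                */

    p_Finalize();
    return 0;
}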
4 Experiment and Discussion

4.1 Experimental Environment

In this experiment we use Altera's NEEK FPGA platform to construct our environment. Given the number of logic elements in the NEEK, we can build a multi-core system of at most three cores. Each core is a 32-bit RISC CPU running at 100 MHz; it has a 4-Kbyte instruction cache and includes branch prediction, hardware multiply and hardware divide. The memory subsystem contains two kinds of memory components: a 3-Mbyte, 16-bit-wide DDR SDRAM and a 16-Mbyte Flash memory. The SDRAM is used to execute processes; the Flash stores the programs. In both memories each core has its own space. The system bus is a 32-bit-wide bridge connecting the cores and memories. On the software side, we use two programs in our experiments: the first is a matrix multiplication program and the second is the Livermore loop 1 program.

4.2 Matrix Multiplication

In the matrix multiplication program we create two matrices and give them initial values using two "for" loops, then multiply them and store the result in another matrix. Here the data computations are independent, so we can distribute the program with high parallelism. After parallelizing, the program looks like:
#include "parallel.h"
....
int main()
{
    ....
    p_Init();
    z = (matrix *)p_malloc(SIZE * SIZE * sizeof(double));
    ....
    pardo(i, 0, SIZE, 1) {
        for (j = 0; j < SIZE; j++) {
            for (k = 0; k < SIZE; k++) {
                z->h[i][j] += x[i][k] * y[k][j];
            }
        }
    }
    p_Finalize();
}

We use pardo to divide the outer "for" loop and p_malloc to place the result matrix in the shared-memory component; using the shared memory reduces the data-transfer control needed. In this experiment we set the problem size to (i) a 45 x 45 matrix and (ii) a 60 x 60 matrix, then run the program on one, two and three cores and record the execution time.

4.3 Livermore Loop

The Livermore loops are a standard benchmark for testing system performance. In this section we use Livermore loop 1 as an experiment in our environment. At the beginning we declare three one-dimensional arrays and initialize them in a "for" loop; the floating-point computation is then done in a second "for" loop. The parallelized Livermore loop program is:

#include "parallel.h"
....
int main()
{
    ....
    p_Init();
    z = (array *)p_malloc(SIZE * SIZE * sizeof(double));
    ....
    pardo(i, 0, LOOP, 1) {
        for (k = 0; k < SIZE; k++) {
            x->h[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
        }
    }
    p_Finalize();
}

The array x is placed in the shared memory, and the second "for" loop is rewritten with pardo so it executes concurrently on the cores. In this experiment we also set two problem sizes: (i) 30 loops with array size 720 and (ii) 90 loops with array size 1440. We again run the program on one, two and three cores and record the execution time.
4.4 Experimental Results

After running all the experimental programs, we obtain the execution times under the different conditions. Table 1 gives the results of parallel programming on the multi-core system: the execution times of the matrix multiplication and Livermore loop 1 programs for different numbers of cores and problem sizes. The table shows that in both cases the work distributes evenly among the cores, so parallelizing the task over two or three cores brings the processing time close to one half or one third. Figs. 3 and 4 plot the execution time and speedup.

Table 1. The experimental results: execution time (ms)

                   Livermore loop1                        Matrix Multiplication
Number of cores    Loop=30 Size=720   Loop=90 Size=1440   45x45      60x60
1                  7142.870           44717.277           5840.373   15556.318
2                  3681.697           23036.129           3018.139   8086.112
3                  2499.206           15015.596           2090.792   5431.313
Fig. 3. Execution Time of Each Program. X-axis is the Number of Cores and Y-axis is Execution Time (ms).
With this framework we can turn an existing sequential process into a parallel one with only a few modifications, and in the experiments the parallel programs reach the performance we expected.
Fig. 4. Speedup of Each Program. X-axis is the Number of Cores and Y-axis is Speedup.
5 Conclusions and Future Work

Parallel programming is an important issue in multi-core systems: a well-parallelized program speeds up computation on a multi-core system and delivers better performance, and a convenient parallel programming environment makes software design faster. We constructed a Mailbox that gives the multi-core architecture a mechanism for passing data, and we proposed a software framework for designing parallel programs. The framework contains several basic parallelizing functions for multi-core programming: a partition function, a synchronization function, collective communication functions and memory management functions. With this framework an existing serial process can be turned into a parallel one with only a few modifications, and the experimental results show that applying the proposed framework on a soft-core based multi-core system lets parallel programs reach the expected performance. So far the framework implements only some basic parallel programming functions; in the future, more functions will be added to support complicated operations and achieve higher programmability in parallel programming.
Acknowledgements

Financial support for this work was provided by the National Science Council, Taiwan, R.O.C., under contract NSC 98-2221-E-036-034.
References 1. Huang, Q., Huang, Z., Werstein, P., Purvis, M.: GPU as a General Purpose Computing Resource. In: Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 151–158 (2008)
2. Park, J., Ha, S.: Performance Analysis of Parallel Execution of H.264 Encoder on the Cell Processor. In: IEEE/ACM/IFIP Workshop on Embedded Systems for Real-Time Multimedia, pp. 27–32 (2007)
3. Chen, J.J., Yang, C.Y., Kuo, T.W., Shih, C.S.: Energy-Efficient Real-Time Task Scheduling in Multiprocessor DVS Systems. In: Design Automation Conference 2007, ASP-DAC’07, Asia and South Pacific, pp. 342–349 (2007)
4. Message Passing Interface Forum, http://www.mpi-forum.org/
5. OpenMP: Simple, Portable, Scalable SMP Programming, http://openmp.org/wp
6. NVIDIA, CUDA Programming Guide, Version 2.2.1 (2009), http://www.nvidia.com/object/cuda_develop.html
7. OpenCL: The open standard for parallel programming of heterogeneous systems, http://www.khronos.org/opencl/
8. Bader, D.A., Kanade, V., Madduri, K.: SWARM: A Parallel Programming Framework for Multicore Processors. In: International Conference on Parallel and Distributed Processing Symposium 2007, pp. 1–8 (2007)
9. Altera Corporation: Creating Multiprocessor Nios II Systems. San Jose (2007)
10. Sun, W.T., Salcic, Z.: Modeling RTOS for Reactive Embedded Systems. In: 20th International Conference on VLSI Design, pp. 534–539 (2007)
Dynamic Resource Tuning for Flexible Core Chip Multiprocessors Yongqing Ren, Hong An, Tao Sun, Ming Cong, and Yaobin Wang School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui, China, 230027 [email protected], [email protected], {renyq,mcong,wyb1982}@mail.ustc.edu.cn
Abstract. Technology evolution has forced the arrival of the chip multiprocessor (CMP) era and enables architects to place an increasing number of cores on a single chip. Given this abundance of computing resources, a fundamental problem is how to map an application onto them, i.e., how many cores should be assigned to each application. As the available concurrency varies widely across applications and across execution phases of an individual program, the amount of resources allocated should be adjusted dynamically for a high utilization rate without compromising performance. In this paper, aiming at resource management in flexible architectures, we introduce an implementation of a confidence predictor, referred to as the speculative depth estimator (SDE), which is able to conduct real-time resource tuning. Applying the speculative depth estimator to dynamic resource tuning, our experimental results show that a good trade-off between concurrency exploitation and resource utilization is achieved.

Keywords: Flexible Core Chip Multiprocessors, Resource Tuning, Concurrency Exploitation, Resource Utilization.
1 Introduction

With the evolution of semiconductor technology, several physical constraints (power, wire delay, etc.) have pushed processor design into the multicore era, and in the near future the number of cores available on a single chip will reach more than one thousand [1]. On the one hand this means almost infinite computing capability; on the other, how to make use of so many resources is an open challenge. As the amount of available instruction-level parallelism (ILP) varies widely across applications [4] and across execution phases of an individual program [3], the resources allocated to an application should be adjusted dynamically for a high utilization rate without compromising performance. Moreover, for systems that schedule multiple tasks, a dynamic resource-tuning mechanism also yields highly efficient system performance, because resource-intensive applications get more chances to obtain extra computing power for exploiting concurrency.

But modern commercial CMPs are designed to exploit instruction-level parallelism within processors and thread-level parallelism across processors, which is not suitable
for dynamically tuning resources for each individual application. Some recently proposed alternatives to these fixed-core multiprocessors are flexible-core chip multiprocessors (FCCMP) [5] [6] [7], in which the number of cores attached to each program is determined at run time. For these flexible architectures, the decision mechanism for resource allocation becomes a major challenge. Some resource-tuning mechanisms [2] [8] have been proposed recently that work through OS or runtime control with profile-based or dynamic scheduling algorithms; as they depend heavily on software, the overhead of resource reconfiguration is very high.

In this paper we borrow ideas from the success of confidence estimation [9] in system resource control and introduce a scheme to estimate the optimal number of cores to allocate for different applications or execution phases. The confidence estimator, referred to as the speculative depth estimator (SDE), generates quantitative estimates of the maximum instruction-window size for different execution phases. On top of a flexible architecture, TFlex [6], a "speculative depth table" is added to record speculative-depth history patterns and predict the estimation result. Although the initial design of the speculative depth estimator is insufficient, remarkably improved accuracy is obtained after introducing a large-value filter and eliminating pseudo mispredictions. Applying this predictor to actual resource tuning, the results demonstrate a good trade-off between concurrency exploitation and resource utilization: while greatly increasing the instruction-window size relative to the fixed 8-core allocation, the dynamic scheme improves the resource utilization rate from 47.7% to 71.8% compared to the fixed 32-core configuration.

The rest of this paper is organized as follows. Section 2 covers related work and the original idea. Section 3 introduces the initial implementation of the speculative depth estimator. Section 4 describes evaluation and optimizations. Section 5 applies the predictor to dynamic resource tuning, and Section 6 concludes.
2 Related Work and Original Idea

2.1 Confidence Estimation

Speculation plays a vital role in exploiting instruction-level parallelism, for example in branch prediction, value prediction and thread-level speculative execution. Given the extensive use of speculation, confidence estimation [9] [10] was proposed to balance the benefit of speculation against its overheads. A confidence estimator attempts to corroborate or assess the predictions made by branch predictors, value predictors or thread predictors: each prediction is eventually determined to have been correct or incorrect, and for each prediction the estimator assigns "high confidence" or "low confidence". With this information, several applications are possible [9], such as thread-switch decisions in a multithreading CPU, controlling the number of instructions issued per thread in an SMT, speculation control for energy and resource reduction [12] [13], and improving the accuracy of branch predictors [10].
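As a minimal illustration (our own, not a design from [9] or [10]), a confidence estimator can be as simple as a table of resetting counters indexed by branch address: a counter is incremented on each correct prediction, cleared on a misprediction, and a prediction is tagged high-confidence once the counter passes a threshold.

#define CONF_ENTRIES 1024     /* table size: an assumption             */
#define CONF_MAX     15       /* 4-bit saturating counters             */
#define CONF_HIGH    12       /* high-confidence threshold, assumed    */

static unsigned char conf_table[CONF_ENTRIES];

int is_high_confidence(unsigned long pc) {
    return conf_table[pc % CONF_ENTRIES] >= CONF_HIGH;
}

void conf_update(unsigned long pc, int was_correct) {
    unsigned char *c = &conf_table[pc % CONF_ENTRIES];
    if (was_correct) {
        if (*c < CONF_MAX)
            (*c)++;           /* count consecutive correct predictions */
    } else {
        *c = 0;               /* reset on a misprediction              */
    }
}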
2.2 Flexible Core Chip Multiprocessors

Addressing the inefficiency of modern commercial CMPs, recently proposed alternatives are flexible-core architectures [5] [6] [7], in which the cores attached to each program are determined by runtime configuration. These designs provide a large optimization space for tuning resources to applications or execution phases so as to achieve an ideal match between concurrency characteristics and resource utilization. Core Fusion [5] and Federation [7] provide mechanisms that fuse multiple out-of-order or in-order cores into a single, more powerful core. TFlex [6] exploits a hyperblock-based execution model: during execution, several small cores group together to form a powerful logical core, onto which one non-speculative hyperblock and multiple speculative ones can be dispatched. As TFlex embodies several novel and promising technique trends, we choose it as the platform of our work to explore the feasibility of dynamic resource tuning.

2.3 Confidence Estimation: Speculative Depth

In a sequential program, only one hyperblock is non-speculative at any time, so achieving a huge instruction window requires an aggressive hyperblock branch predictor. Then, besides the non-speculative hyperblock, several speculative hyperblocks build a huge instruction window, offering high potential for performance exploitation.
Fig. 1. Description of Speculative Depth
To quantify the capability of speculative execution in a hyperblock-based program, we adapt the concept of confidence estimation into what we call "speculative depth": the number of correct predictions that can be obtained on consecutive hyperblocks from a given point of execution. In Fig. 1, for a non-speculative hyperblock (HB0), the following hyperblocks can be predicted with the branch predictor. The second one (HB1) is obtained from the global history and the base address of HB0. The third one (HB2) can be predicted from the address of HB1 and the global history including the prediction of HB1, assuming that prediction is correct. The same behavior repeats, and a huge instruction window is constructed from one non-speculative hyperblock (HB0) and several speculative ones (HB1, HB2, ...). But later in execution, the prediction of HB6 proves to be a mis-prediction. For this execution phase, from HB0 to HB5, the speculative depth is therefore 5, which is a rough estimate of the capability of speculative execution in this code region. When execution enters this region again later, the value 5 can be used to guide resource allocation.
3 Speculative Depth Estimator (SDE)

Based on this idea, tables can be introduced to track and predict the number of consecutive predictable hyperblocks in different phases of the executing program. Once execution enters a phase with high concurrency, the estimator recognizes the situation and directs the processor to execute more aggressively with a more powerful logical core. Otherwise, fewer speculative hyperblocks should be fetched. Concurrency here means the number of hyperblocks predicted correctly, which determines the size of the instruction window.
Fig. 2. Diagram of the SDE
Similar to the history table in a conventional branch predictor, a depth table records the number of correct consecutive predictions from the address of the hyperblock that starts each phase. For example, in Fig. 2, hyperblock HB0 is the start block of an execution phase. The two basic operations of the SDE are:

Request: Once a mis-prediction is encountered, a request is sent to the SDE. In Fig. 2, a mis-prediction occurs when predicting the next hyperblock at the execution point of HBn. HBn is then considered the start of a new phase, like HB0, and a request is sent to the depth table with the base address of HBn. The value in the depth table entry indexed by the address of HBn is read out and can be used to guide the resource allocation of the next execution phase. In Fig. 2, the value y from the depth table indicates that a reasonable number of speculative hyperblocks is y, and that about y cores match the concurrency requirement.

Update: After an execution phase ends (in other words, upon encountering a mis-prediction), the information related to this region is updated in the depth table. In Fig. 2, the entry indexed by the start hyperblock (HB0) is updated, replacing x with the value n.

In addition, in an actual implementation the speculative depth value cannot be arbitrarily large. For loops with a large number of iterations, for example, it is not reasonable to spawn hundreds of instances: this would require a complex scheme to maintain the sequential semantics of the program and would also cause a huge amount of communication overhead on a distributed chip multiprocessor. Therefore, an
execution phase can comprise at most 32 hyperblocks, and the depth counter saturates at 32, which matches the maximum number of cores in TFlex [6].
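To make the table mechanics concrete, the following C sketch models the request and update operations described above with a minimal direct-mapped depth table. The saturation at 32 comes from the text; the table size and the address-to-index mapping are illustrative assumptions.

#include <stdint.h>

#define DEPTH_TABLE_ENTRIES 1024   /* illustrative table size (assumption) */
#define MAX_DEPTH 32               /* saturate at 32, the TFlex core limit */

static uint8_t depth_table[DEPTH_TABLE_ENTRIES];

/* Map a hyperblock start address to a table entry (illustrative hash). */
static unsigned sde_index(uint64_t hb_addr) {
    return (unsigned)((hb_addr >> 2) % DEPTH_TABLE_ENTRIES);
}

/* Request: on a mis-prediction, the mis-predicted hyperblock starts a new
 * phase; read out the recorded depth to guide core allocation. */
unsigned sde_request(uint64_t new_phase_start) {
    return depth_table[sde_index(new_phase_start)];
}

/* Update: when a phase ends (a mis-prediction is detected), record the
 * number of correctly predicted consecutive hyperblocks for its start. */
void sde_update(uint64_t phase_start, unsigned observed_depth) {
    if (observed_depth > MAX_DEPTH)
        observed_depth = MAX_DEPTH;    /* saturating counter */
    depth_table[sde_index(phase_start)] = (uint8_t)observed_depth;
}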
4 Evaluation and Optimizations

As with a conventional branch predictor, the basic metric for the SDE is accuracy. A higher accuracy rate means more opportunities to save resources without compromising performance. In this section, the initial SDE is evaluated and two optimizations are introduced.

4.1 Simulation and Benchmarks

In this work, we focus on the accuracy of the SDE. Each correct branch prediction increases the speculative depth by 1, and a branch mis-prediction triggers a request to and an update of the SDE, as described in Sections 2.3 and 3. Six benchmarks from SPEC CPU2000 compile and execute correctly with the TFlex compiler and simulator and are used in the following evaluation (three SPEC-INT: 164.gzip, 181.mcf, 255.vortex; and three SPEC-FP: 179.art, 183.equake, 188.ammp).
Fig. 3. Accuracy rate of the SDE. The three bars correspond to the results of the initial design (Section 4.2), of introducing the large value filter (Section 4.3), and of eliminating pseudo mis-predictions (Section 4.4).
4.2 Initial Design of SDE

If the predicted depth equals the actual number of correct consecutive predictions, it is counted as a correct prediction; otherwise, it is a mis-prediction. In Fig. 3, the white bars indicate the accuracy rate of the simple SDE on the six selected benchmarks from SPEC CPU2000. The average accuracy rate is only about 25%, a very disappointing result: none of the benchmarks exceeds 35%, and for the worst one, 183.equake, the accuracy rate is only a little above 10%. Obviously, this accuracy is unacceptable. In execution, the diverse execution phases fall into two categories. The first consists of phases with intricate control flows that are hard for the branch predictor; for these code regions, the actual depth is usually small and varies widely across execution instances. The second consists of phases with a high branch prediction accuracy, such as large loops, in which consecutive hyperblocks are divided into sub-phases of 32 hyperblocks each, since the upper bound on the size of an execution phase is 32. Because the execution phases with high branch prediction accuracy are the major
source of large instruction windows and concurrency exploitation, they are the main target of the SDE.

4.3 Optimization with a Large Value Filter

Based on the analysis in the last section, a value filter can be inserted into the SDE to eliminate prediction outcomes that are extremely small. In our implementation, the threshold is set to 16, meaning that only depth values greater than or equal to 16 are counted and passed to the processor for resource allocation. If the predicted value is less than 16, a default value of 8 is used instead, and the prediction is not included in the calculation of the accuracy rate. The results with the large value filter are plotted as the gray bars of Fig. 3 and are much better than those of the initial SDE without any optimization. For the six SPEC benchmarks, the average rate increases to 70%, far above the previous 25%. Even the worst case, 164.gzip, exceeds 50%, and the best case, 255.vortex, exceeds 90%.

4.4 Eliminating Pseudo Mis-Predictions

The large value filter of the last section seems too strict and hides some useful information: while erasing small values is reasonable, some large values are also counted as mis-predictions and filtered out. For example, suppose the SDE predicts a depth of 24 but the actual execution reaches 28; in our evaluation this counts as a mis-prediction. Yet 24 is a reasonable prediction, only a little less than 28, and tuning resources according to the value 24 still benefits performance, even if a little potential performance is lost. Consequently, it is unreasonable to require exact equality between the predicted and actual depth for a correct prediction. During execution, the branch predictor keeps training itself: when an execution phase ends with a depth of 24, the exit predictor also updates its contents and can be expected to predict better later, so once execution enters the same phase again, the speculative depth is expected to be larger than before. We therefore modify the large value filter to eliminate these pseudo mis-predictions. If the prediction is reasonable (that is, at least 16) and the actual depth is larger than the predicted value, it is also counted as a correct prediction; if the predicted value is larger than the actual value, it is still counted as a mis-prediction, as before. The black bars in Fig. 3 show the results of this refinement: the accuracy rate on these SPEC programs increases to more than 90%, which makes the SDE suitable for practical use.
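The two optimizations reduce to a small decision rule when scoring each prediction. The C sketch below restates that rule, with the threshold of 16 and the default allocation of 8 taken from the text above; the enum and function names are illustrative.

/* Classify one SDE prediction against the actual observed depth, applying
 * the large value filter (threshold 16) and the pseudo mis-prediction rule. */
typedef enum { SKIPPED, CORRECT, MISPREDICTED } sde_outcome_t;

sde_outcome_t score_prediction(unsigned predicted, unsigned actual) {
    if (predicted < 16)
        return SKIPPED;       /* filtered out: a default of 8 cores is used
                                 and the case is excluded from the accuracy rate */
    if (actual >= predicted)
        return CORRECT;       /* exact match, or a pseudo mis-prediction where
                                 the actual depth exceeds the predicted value */
    return MISPREDICTED;      /* over-prediction is still a mis-prediction */
}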
5 Apply the SDE to Resource Allocation

In this section, the efficiency of applying the SDE to resource allocation is explored. First, the resource allocation scheme is introduced. Following the result of the SDE, the predicted number of consecutive hyperblocks on the control flow path is used to guide resource allocation. Assuming each hyperblock consumes one processor core (the default configuration of Multiscalar and TFlex), the number of processor cores allocated to a program is determined as follows: if the value returned from the SDE is greater than or equal to 16, then the number of cores allocated equals that value.
Fig. 4. (a) Average granularity of the actual execution phases, in terms of the number of consecutive hyperblocks predicted correctly, under fixed 8-core, fixed 32-core, and SDE-driven dynamic configurations. (b) Resource utilization of the three configurations. The left panel shows the actual depth, i.e., the number of consecutive hyperblocks predicted correctly, and the right panel shows how many cores are allocated.
Otherwise, the number of cores remains fixed at 8, the same as the default configuration of TRIPS [17], the precursor of TFlex. What we are then concerned with is the size of the instruction window that is constructed and the resource utilization rate when the SDE drives resource allocation. The former can be measured by the average number of consecutive hyperblocks predicted correctly, and the latter by the average number of cores allocated to the program. Upon each mis-prediction, two status counters, one for the instruction window and one for resource utilization, are updated together: the actual speculative depth is added to the instruction window statistics, and the predicted depth is counted as the resource allocated. Several update rules govern the two counters, as summarized in the C sketch below. In the following, RA means the Resource Allocated through the SDE, and IW means the size of the Instruction Window, counted as the average number of consecutive hyperblocks of each phase in actual execution.
(1) 15 < RA < IW: Counter_RA += RA, Counter_IW += RA. If the instruction window is larger than the number of cores allocated and the latter is greater than 15, then both counters increase by RA, as the instruction window is constrained by the resources.
(2) 15 < RA && IW ≤ RA: Counter_RA += RA, Counter_IW += IW. If the number of cores allocated is greater than 15 and at least as large as IW, then Counter_RA increases by RA and Counter_IW increases by IW.
(3) RA ≤ 15 && IW > 8: Counter_RA += 8, Counter_IW += 8. If the predicted speculative depth is less than or equal to 15, only 8 cores are allocated to the execution phase, so even if the actual speculative depth is greater than 8, both the resources used and the instruction window are 8 (cores and hyperblocks, respectively).
(4) RA ≤ 15 && IW ≤ 8: Counter_RA += 8, Counter_IW += IW. Similar to the previous rule, but when the actual depth is extremely low, the actual IW is less than 8.
For comparison, two static resource allocation schemes, with fixed 8 and 32 cores, are also evaluated. Figure 4 shows the results. Fix-8 means statically allocating 8 cores, like TRIPS, and Fix-32 means keeping 32 cores unchanged, the extreme TFlex configuration for an individual program. The label "Dynamic" means that resource allocation is determined dynamically by the SDE. The results show that a fixed allocation of 8 cores is insufficient for exploiting concurrency.
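A minimal C sketch of the four counter update rules above; the function and variable names are illustrative.

/* Update the resource-allocated (RA) and instruction-window (IW) counters
 * once per execution phase, i.e., at each mis-prediction.  ra is the depth
 * predicted by the SDE; iw is the actual number of consecutive correctly
 * predicted hyperblocks observed in the phase. */
void update_counters(unsigned ra, unsigned iw,
                     unsigned long *counter_ra, unsigned long *counter_iw) {
    if (ra > 15) {                           /* rules (1) and (2) */
        *counter_ra += ra;
        *counter_iw += (iw > ra) ? ra : iw;  /* window capped by resources */
    } else {                                 /* rules (3) and (4) */
        *counter_ra += 8;                    /* default allocation of 8 cores */
        *counter_iw += (iw > 8) ? 8 : iw;
    }
}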
With fixed 32 cores, the average size of the instruction window reaches up to 15.2 hyperblocks; but the obvious shortcoming of that configuration is wasted resources, with only about half of the cores used efficiently on average. After introducing the SDE for resource allocation, the average number of cores allocated decreases to 18.9. At the cost of a small loss of concurrency exploitation (the instruction window size decreases from 15.2 to 13.6), the average number of cores allocated for the benchmarks is reduced from 32 to 18.9, raising the resource utilization rate from 47.7% to 71.8%. With the SDE, a simple resource allocation model for a flexible architecture is evaluated and a remarkable improvement is achieved.

Two factors make prediction on control flow effective: temporal locality and steady branch behavior. They also cause some trouble. For example, the granularity of an execution phase may vary across instances of the same code region; but since the branch predictor trains itself gradually through execution, the granularity of the execution phases grows coarser as training proceeds. A possible optimization is to follow an evolution similar to that of branch predictors, such as adding global history or a tournament predictor; however, these structures are more complex to implement in the SDE than in a branch predictor, and we will explore their feasibility in future work. Besides, although a high branch predictor accuracy means more opportunities to construct a huge instruction window, and thus potential performance improvement, the actual situation is more complicated: in addition to control flow transfer, data dependences, communication patterns, and code organization are all critical factors for final performance. In general, the SDE can be considered a performance tuning technique for the whole system. Similar work has been proposed before [18], and more sophisticated methods such as machine learning [15] could also be applied to highly accurate performance tuning.
6 Conclusions

Flexible Core Chip Multiprocessors offer many potential benefits over current designs. To support adaptive configuration for diverse applications or execution phases, an efficient resource allocation mechanism is the first challenge, as the balance between concurrency exploitation and resource utilization is hard to determine with static methods, and software schemes based on the OS or a runtime controller are inefficient for dynamic real-time resource tuning. In this paper, based on the idea of confidence estimation, a simple and efficient prediction scheme for resource tuning is proposed. The confidence estimator, named the SDE, estimates the capability of speculative execution and thus predicts how large an instruction window can be constructed. Although the initial design is insufficiently accurate, after introducing the large value filter and eliminating pseudo mis-predictions, the SDE achieves an average accuracy rate of more than 90% on the SPEC benchmarks. Applying the SDE to dynamic resource tuning, the experimental results show that it achieves a good trade-off between concurrency exploitation and resource utilization: while greatly increasing the size of the instruction window compared to the fixed 8-core allocation, the dynamic resource tuning mechanism improves the resource utilization rate from 47.7% in the fixed 32-core configuration to 71.8% with adaptive allocation.
This paper is an attempt to solve the challenging problem of resource tuning on flexible architectures, and many open issues remain. As a next step, more of the SDE design space will be explored and more factors affecting performance will be considered.
Acknowledgement

This work is supported financially by the National Basic Research Program of China under contract 2005CB321601, the National Natural Science Foundation of China under grants 60633040, 60970023 and 60736012, the National Hi-tech Research and Development Program of China under contracts 2006AA01A102-5-2 and 2009AA01Z106, the Important National Science & Technology Specific Projects under grant 2009ZX01036-001-002, and the China Ministry of Education & Intel Special Research Foundation for Information Technology under contract MOE-INTEL-08-07.
References
1. Asanovic, K., et al.: The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley (2006)
2. Gulati, D., Kim, C., Sethumadhavan, S., Keckler, S.W., Burger, D.: Multitasking workload scheduling on flexible-core chip multiprocessors. In: Moshovos, A., Tarditi, D., Olukotun, K. (eds.) PACT, pp. 187–196. ACM, New York (2008)
3. Sherwood, T., Perelman, E., Hamerly, G., Sair, S., Calder, B.: Discovering and Exploiting Program Phases. IEEE Micro 23(6), 84–93 (2003)
4. Wall, D.W.: Limits of instruction-level parallelism. In: International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 176–188 (1991)
5. Ipek, E., Kirman, M., Kirman, N., Martínez, J.F.: Core fusion: accommodating software diversity in chip multiprocessors. In: International Symposium on Computer Architecture, pp. 186–197 (2007)
6. Kim, C., Sethumadhavan, S., Govindan, M., Ranganathan, N., Gulati, D., Keckler, S.W., Burger, D.: Composable lightweight processors. In: International Symposium on Microarchitecture (2007)
7. Tarjan, D., Boyer, M., Skadron, K.: Federation: Out-of-order execution using simple in-order cores. Technical Report, University of Virginia, Department of Computer Science (2007)
8. Corbalan, J., Martorell, X., Labarta, J.: Performance-driven processor allocation. IEEE Transactions on Parallel and Distributed Systems 16(7), 599–611 (2005)
9. Grunwald, D., Klauser, A., Manne, S., Pleszkun, A.: Confidence estimation for speculation control. In: Proceedings of the 25th Annual International Symposium on Computer Architecture, SIGARCH Newsletter, Barcelona, Spain (1998)
10. Jacobsen, E., Rotenberg, E., Smith, J.E.: Assigning Confidence to Conditional Branch Predictions. In: International Symposium on Microarchitecture, pp. 142–152 (1996)
11. Mahlke, S.A., et al.: Effective compiler support for predicated execution using the hyperblock. In: Proceedings of the 25th Annual International Symposium on Microarchitecture, Portland, Oregon, United States, December 1–4, pp. 45–54 (1992)
12. Manne, S., Klauser, A., Grunwald, D.: Pipeline gating: speculation control for energy reduction. In: Proc. ISCA-25, pp. 132–141 (1998)
13. Vandierendonck, H., Seznec, A.: Fetch gating control through speculative instruction window weighting. In: Stenström, P. (ed.) Transactions on High-Performance Embedded Architectures and Compilers II. LNCS, vol. 5470, pp. 128–148. Springer, Heidelberg (2009)
14. Franklin, M.: The Multiscalar Architecture. Ph.D. Dissertation, University of Wisconsin-Madison (1993)
15. Ganapathi, A., Datta, K., Fox, A., Patterson, D.A.: A Case for Machine Learning to Optimize Multicore Performance. In: HotPar '09, Berkeley, CA (2009)
16. Ranganathan, N., Nagarajan, R., Burger, D., Keckler, S.W.: Combining hyperblocks and exit prediction to increase front-end bandwidth and performance. Technical Report TR-02-41, Department of Computer Sciences, The University of Texas at Austin (September 2002)
17. Sankaralingam, K., et al.: TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP. ACM Transactions on Architecture and Code Optimization (TACO) 1(1), 62–93 (2004)
18. Luo, Y., Packirisamy, V., Hsu, W.-C., Zhai, A., Mungre, N., Tarkas, A.: Dynamic performance tuning for speculative threads. In: Keckler, S.W., Barroso, L.A. (eds.) ISCA, pp. 462–473 (2009)
Ensuring Confidentiality and Integrity of Multimedia Data on Multi-core Platforms

Eunji Lee1, Sungju Lee1, Yongwha Chung1,∗, Hyeonjoong Cho1, and Sung Bum Pan2

1 Department of Computer and Information Science, Korea University, Korea {achee,peacfeel,ychungy}@korea.ac.kr http://algolab.korea.ac.kr/
2 Dept. of Control, Instrumentation, and Robot Engineering, Chosun Univ., Korea [email protected]
Abstract. SECMPEG selectively encrypts data to protect MPEG video streams with minimum overhead. However, this scheme is known to provide only confidentiality, not integrity, for the data. In this paper, we apply AES-CCM to SECMPEG to ensure integrity as well as confidentiality for multimedia data. Furthermore, in order to satisfy the real-time requirement, we use a multi-core processor. After analyzing the computational characteristics of SECMPEG/AES-CCM on a multi-core processor, we parallelize SECMPEG/AES-CCM with both static and dynamic task assignments. Our experimental results show that a near-optimal speed-up can be obtained on a 4-core processor while ensuring both confidentiality and integrity of multimedia data.
1 Introduction

With advances in multimedia technology, multimedia contents can be delivered to users with portable devices such as mobile phones. Multimedia data delivery raises issues of content ownership and privacy, and thus protecting multimedia data becomes important in multimedia applications. Since most multimedia data are huge, we need an efficient multimedia encryption scheme that protects the multimedia contents while satisfying the real-time requirement. Recently, SECMPEG (Secure MPEG) [1-2], a tightly-coupled encryption and compression scheme, has been proposed to reduce the computational workload while keeping an appropriate level of security. However, such a multimedia encryption scheme is still computationally heavy for embedded processors under the real-time requirement. Furthermore, SECMPEG ensures only confidentiality, not integrity. As multi-core processors have been increasingly used in embedded systems, parallel processing techniques on multi-core processors become attractive for satisfying the real-time requirement of embedded systems. Many parallel approaches have been proposed for video compression on parallel computers [3]. Those
∗ Corresponding Author.
approaches are divided into two categories, static and dynamic. Static approaches assign each task to a processor (i.e., divide the data space equally) at the beginning of the computation to avoid heavy communication overhead across the processors. Static task assignment is easy to implement and works very well with regularly-structured computations such as matrix multiplication. However, some of the MPEG computation is irregularly-structured (i.e., the computational workload depends on data content rather than data size), and such static assignment may cause a load balancing problem [4-6]. On the contrary, assigning tasks dynamically based on the relative workload of each processor can solve the load imbalance problem, at the expense of heavy communication overhead. Therefore, one of the major research issues in dynamic task assignment on parallel computers is to minimize the communication overhead across processors while efficiently providing load balancing. Multi-core processor chips designed to have low communication overhead between cores are of particular interest here. In this paper, we propose to apply AES-CCM [7-10] to SECMPEG (i.e., video compression plus selective encryption) in order to ensure both confidentiality and integrity of multimedia data. In addition, we parallelize the proposed scheme on a multi-core processor with both static and dynamic task assignment techniques. More specifically, we first analyze the computational characteristics of SECMPEG on a multi-core processor. Then, static and dynamic task assignments are applied to the regularly- and irregularly-structured computations in compression and encryption, with a minimum number of synchronizations. Our experimental results show that a near-optimal speed-up can be obtained on a 4-core processor while ensuring both confidentiality and integrity of multimedia data. The rest of the paper is structured as follows. Section 2 describes SECMPEG and how to provide data integrity. Section 3 explains the parallel SECMPEG with AES-CCM for a multi-core processor. Finally, Sections 4 and 5 describe the experimental results and conclusion, respectively.
2 Background

2.1 Multimedia Data Protection

Generally, the MPEG video encoding [1] scheme represents the video signal as a repetition of Groups of Pictures (GOPs), where each GOP is a sequence of selected I-, P- and B-frames. I-frames are encoded as standard JPEGs, whereas P- and B-frames encode only the difference from nearby frames, based on motion compensation. When motion compensation cannot find an approximately matching block in the nearby frames, that block is encoded in intra-coded mode, and such a macro block is called an I-block. Thus an I-block, i.e., a block with a large amount of motion in a P- or B-frame, is encoded like a standard JPEG. When real-time performance is required in embedded systems, encrypting the full video data may not be adequate. To reduce the computational workload while providing an appropriate level of security, SECMPEG [2], a tightly-coupled encryption and compression scheme, has been proposed. Using the characteristics of MPEG video compression, SECMPEG selectively encrypts data such as I-frames and I-blocks.
44
E. Lee et al.
SECMPEG provides five security levels, where level 4 represents full encryption and level 0 no encryption. At security level 1, it encrypts the headers from the sequence layer down to the slice layer; the motion vectors and the DCT-coded blocks are unaltered. At security level 2, SECMPEG encrypts parts of the I-blocks in addition to the encryption performed at level 1. At security level 3, SECMPEG encrypts I-frames and all I-blocks. Table 1 shows the encrypted part and the data size at each security level. Note that we focus on SECMPEG security level 3.

Table 1. Security levels of SECMPEG [2]

Security Level | Encryption Part | Data Size
0 | X | X
1 | Sequence Header; GOP Header; Picture Header; Slice Header | 40 byte; 8 byte/GOP; 16 byte/Frame; 6 byte/Slice
2 | Security Level 1 + relevant I-blocks | Security Level 1 + 384 byte
3 | I-frames + whole I-blocks in P-/B-frames | Variable
4 | Encoded video sequence | Variable
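As a concrete reading of the level-3 row used throughout this paper, the following C sketch states the selection rule for what gets encrypted; the type and field names are illustrative.

/* Decide whether a coded unit must be encrypted at SECMPEG level 3:
 * all I-frames, plus every I-block inside P- or B-frames. */
typedef enum { I_FRAME, P_FRAME, B_FRAME } frame_type_t;

int must_encrypt_level3(frame_type_t frame, int is_i_block) {
    if (frame == I_FRAME)
        return 1;             /* the whole I-frame is encrypted */
    return is_i_block;        /* only I-blocks in P-/B-frames */
}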
2.2 Ensuring Data Integrity

In cryptography, a Hash-based Message Authentication Code (HMAC) is a specific construction for calculating a Message Authentication Code (MAC) that involves a cryptographic hash function in combination with a secret key. The cryptographic strength of an HMAC depends on the size of the secret key used; the most common attack against HMACs is a brute-force attack to uncover the secret key. Block cipher modes of operation are used with block ciphers such as AES [7]. For ensuring confidentiality and integrity of data, the National Institute of Standards and Technology (NIST) has proposed five modes of operation [8]. One of the five, Counter (CTR) mode, can be used as a stream cipher and encrypts data very fast. Another method to verify integrity is the Cipher Block Chaining-Message Authentication Code (CBC-MAC), which uses the final block to verify message integrity. AES-Counter with CBC-MAC (AES-CCM) [9, 10] is a combined encryption and authentication block cipher mode. The CCM mode specification supports either both authentication and encryption or encryption only. The AES-CCM algorithm consists of two processes: (1) CCM requires two block cipher encryption operations per block of encrypted and authenticated message; and (2) one encryption per block of associated authenticated data. Fig. 1 shows the block diagram of AES-CCM.
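As an illustration of combined encryption and authentication in practice, the sketch below runs AES-128-CCM through OpenSSL's EVP interface. OpenSSL is our choice for illustration only and is not part of the paper's implementation; the function name and parameters are ours.

#include <openssl/evp.h>

/* Encrypt pt[0..pt_len) and authenticate it together with aad[0..aad_len)
 * under AES-128-CCM, producing ct and an authentication tag (the MAC).
 * Returns 1 on success, 0 on failure. */
int aes_ccm_encrypt(const unsigned char *key,            /* 16 bytes */
                    const unsigned char *nonce, int nonce_len,
                    const unsigned char *aad, int aad_len,
                    const unsigned char *pt, int pt_len,
                    unsigned char *ct,
                    unsigned char *tag, int tag_len) {
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len, ok = 1;
    if (!ctx) return 0;
    ok &= EVP_EncryptInit_ex(ctx, EVP_aes_128_ccm(), NULL, NULL, NULL);
    ok &= EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_CCM_SET_IVLEN, nonce_len, NULL);
    ok &= EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_CCM_SET_TAG, tag_len, NULL);
    ok &= EVP_EncryptInit_ex(ctx, NULL, NULL, key, nonce);
    /* CCM requires the total plaintext length before the associated data. */
    ok &= EVP_EncryptUpdate(ctx, NULL, &len, NULL, pt_len);
    ok &= EVP_EncryptUpdate(ctx, NULL, &len, aad, aad_len);
    ok &= EVP_EncryptUpdate(ctx, ct, &len, pt, pt_len);
    ok &= EVP_EncryptFinal_ex(ctx, ct + len, &len);
    ok &= EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_CCM_GET_TAG, tag_len, tag);
    EVP_CIPHER_CTX_free(ctx);
    return ok;
}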
Fig. 1. Block Diagram of AES-CCM [9, 10]
3 Parallel SECMPEG/AES-CCM

In this section, we first apply AES-CCM to SECMPEG security level 3 (denoted SECMPEG/AES-CCM). Then, we analyze the computational characteristics of SECMPEG/AES-CCM on a multi-core processor. Finally, we assign tasks both statically and dynamically, based on the relative workload of each processor, to provide load balancing.

3.1 SECMPEG/AES-CCM

At SECMPEG security level 3, only I-frames and the I-blocks of P- or B-frames are encrypted, instead of the whole multimedia data. In this paper, we use AES-CCM, combining CTR and CBC. Note that AES-CCM ensures confidentiality and integrity of the data at the same time. In general, a MAC is generated with a block cipher mode such as CBC after applying full data encryption. However, only about 2% of the whole multimedia data is encrypted at SECMPEG security level 3. To apply AES-CCM to SECMPEG, we define the V-MAC (Video MAC) and the S-MAC (Small MAC). As shown in Fig. 2, the V-MAC, which is generated from all S-MACs, can ensure the integrity of all the video data. Also, an S-MAC can detect forgery in selected data such as an I-frame or I-block. Fig. 3 shows the comparison of full data encryption and the modified SECMPEG with AES-CCM (i.e., SECMPEG/AES-CCM). Although this approach requires some redundant memory space, it reduces the execution time through selective encryption.
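The relationship between S-MACs and the V-MAC can be stated in a few lines of C. This is a conceptual sketch only: mac() below is a deliberately trivial stand-in (not cryptographically secure) for the CBC-MAC computation of AES-CCM, and all names and sizes are our assumptions.

#include <stddef.h>

#define MAC_LEN 16   /* illustrative MAC size in bytes */

/* Placeholder MAC for illustration only (NOT secure); in the paper's
 * scheme this would be the CBC-MAC half of AES-CCM. */
static void mac(const unsigned char *data, size_t len,
                unsigned char out[MAC_LEN]) {
    for (size_t i = 0; i < MAC_LEN; i++) out[i] = 0;
    for (size_t i = 0; i < len; i++) out[i % MAC_LEN] ^= data[i];
}

/* Compute one S-MAC per protected unit (I-frame or I-block), then a
 * V-MAC over the concatenated S-MACs to cover the whole video. */
void compute_vmac(const unsigned char **units, const size_t *unit_lens,
                  size_t n_units, unsigned char smacs[][MAC_LEN],
                  unsigned char vmac[MAC_LEN]) {
    for (size_t i = 0; i < n_units; i++)
        mac(units[i], unit_lens[i], smacs[i]);
    mac((const unsigned char *)smacs, n_units * MAC_LEN, vmac);
}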
Fig. 2. AES-CCM for SECMPEG Security Level 3
Fig. 3. Comparison of Full Encryption and Selective Encryption
Moreover, the total size of the V-MAC and S-MACs is negligible compared to that of the compressed multimedia data.

3.2 Parallel SECMPEG/AES-CCM

To satisfy the real-time requirement, we need to apply parallel processing techniques on a multi-core processor. To do so, we use ALPBench [3], a parallel benchmark code for MPEG-2. We first apply AES-CCM to ALPBench and analyze the Thread-Level Parallelism (TLP) of ALPBench, which assigns macro blocks statically to each core before run-time. As shown in Fig. 4, the static assignment cannot guarantee a balanced workload across cores, because MPEG-2 involves many branch and comparison operations. Note that the AES-CCM workload depends on the MPEG-2 computational workload. We also observed that the load imbalance problem can occur across cores in both B- and P-frames, because the computational workload of B- and P-frames depends on the characteristics of the macro blocks (i.e., intra, forward, or backward encoding). To solve this problem, we need to assign the workload to cores dynamically.
Fig. 4. Execution Time of MPEG2 at Each Core with 25 Frames (4-Core AMD Phenom II)
3.3 Load Balancing SECMPEG/AES-CCM

Static task assignment is easy to implement and works well with regularly-structured computation such as matrix multiplication. However, some of the MPEG computation is irregularly-structured, and static assignment may cause a load balancing problem. On the contrary, assigning tasks dynamically based on the relative workload of each processor can solve the load imbalance, at the expense of heavy communication overhead. Since motion estimation in P- or B-frames is the most time-consuming and unpredictable workload, we assign the motion estimation tasks dynamically to minimize computation time and provide load balancing across the cores. For parallelizing SECMPEG/AES-CCM on a multi-core processor, an efficient data assignment needs to be considered for each of the I-, B-, and P-frames. Data assignment focuses on distributing data among the cores to balance the workload. I-frames, whose workload can be predicted accurately from the size of the data to be encrypted, are assigned statically to the cores (see Fig. 5). Each core then encrypts its assigned frame and performs the synchronization needed for correct execution.
Fig. 5. Static Assignment for I-frame
As I-blocks depend on the motion in the video content, their frequency cannot be predicted accurately. Therefore, when compressing I-blocks, dynamic assignment is applied. As shown in Fig. 6, macro blocks are assigned to the cores dynamically in the variable length coding stage (i.e., "putpict").
Fig. 6. Parallel Processing for SECMPEG/AES-CCM of B- and P-frame
Fig. 7. Static and Dynamic Task Assignment for SECMPEG/AES-CCM
Fig. 7 shows the dynamic assignment of the motion estimation and the static assignment of the remaining tasks for B- and P-frames. Note that in this dynamic assignment, multiple cores share memory resources, so the data consistency problem must be addressed. To maintain consistency, we use a synchronization technique (i.e., a mutex) around the critical section. We also need to minimize the number of synchronizations because of their relatively high overhead; in the proposed approach, every core synchronizes once, after executing its motion estimation task.
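A minimal Pthread sketch of the dynamic scheme: worker threads pull the next macro block index from a shared cursor, with a mutex guarding the critical section as described above. The macro block count and names are illustrative.

#include <pthread.h>
#include <stdio.h>

#define NUM_MACROBLOCKS 396   /* e.g., a 352x288 frame has 22x18 macro blocks */
#define NUM_CORES 4

static int next_mb = 0;       /* shared work-queue cursor */
static pthread_mutex_t mb_lock = PTHREAD_MUTEX_INITIALIZER;

static void process_macroblock(int mb) { (void)mb; /* motion estimation etc. */ }

/* Each worker repeatedly claims the next unprocessed macro block; the
 * mutex-guarded fetch is the only synchronization per block. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mb_lock);
        int mb = (next_mb < NUM_MACROBLOCKS) ? next_mb++ : -1;
        pthread_mutex_unlock(&mb_lock);
        if (mb < 0) break;    /* frame done */
        process_macroblock(mb);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NUM_CORES];
    for (int i = 0; i < NUM_CORES; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_CORES; i++) pthread_join(t[i], NULL);
    printf("frame processed\n");
    return 0;
}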
4 Experimental Results

For evaluating the proposed approaches, we used an AMD Phenom II X4 955 processor with 4 cores (3.2 GHz) and 3.0 GB of RAM, and the code was parallelized with Pthreads. We used a data set of 25 YUV frames, with three frames per video sequence. The size of the test video image was 352×288, with 24 bits per pixel. First, we measured the execution time of AES-CCM for various data sizes: the execution time of AES-CCM was 0.03 sec/MB. Since AES-CCM encrypts only the I-blocks (i.e., about 0.0002%), I-frames (i.e., about 10%), and S-MACs (i.e., about 0.01%), the execution time of SECMPEG/AES-CCM was 0.002 sec/MB. Since the multimedia data include only a small number of I-blocks, I-frames, and S-MACs, SECMPEG security level 3 is very effective in reducing the computational overhead on resource-constrained platforms such as embedded systems. Fig. 8 also shows the effect of SECMPEG/AES-CCM compared with full encryption. The execution time of selective encryption is shorter than that of full encryption over the whole range of data sizes, so SECMPEG security level 3 can protect a large amount of multimedia data efficiently.
Fig. 8. Effect of Selective Encryption with SECMPEG/AES-CCM
Finally, we measured the execution times of the sequential and parallel approaches. To satisfy the real-time requirement, the MPEG2 encoder must sustain 7.3 MB/sec (= 25 frames/sec × 101,376 pixels/frame × 3 B/pixel). As shown in Table 2,
we confirmed that the sequential approach cannot satisfy the real-time requirement even with SECMPEG (i.e., it only achieves a throughput of 3.55 MB/sec). However, the parallel approach achieves a speedup of 3.5~3.8 on a 4-core processor and satisfies the real-time requirement (i.e., it achieves 13.55 MB/sec). To balance the workload with minimum overhead, we carefully applied not only static but also dynamic task assignment for encrypting the I-blocks of P- or B-frames.

Table 2. Execution time comparison with 500 MB data
                        Sequential Approach with AES-CCM       Parallel Approach with SECMPEG/AES-CCM
                        Full Encryption   SECMPEG              Static Task Assignment   Static + Dynamic Task Assignment
AES-CCM                 15.47 sec         1.46 sec             0.39 sec                 0.39 sec
MPEG2 Encoding          139.5 sec         139.5 sec            49.0 sec                 36.5 sec
Total                   154.97 sec        140.96 sec           49.39 sec                36.89 sec
Throughput              3.23 MB/sec       3.55 MB/sec          10.12 MB/sec             13.55 MB/sec
Real-Time Satisfaction  NO                NO                   YES                      YES
5 Conclusion

In this paper, to ensure both the confidentiality and the integrity of multimedia data, we applied AES-CCM to SECMPEG. Also, to satisfy the real-time requirement, we parallelized SECMPEG/AES-CCM on a multi-core processor. After analyzing the characteristics of the computation on a multi-core processor, static and dynamic load balancing techniques were applied to the regularly- and irregularly-structured computations, with a minimum number of synchronizations. Our experimental results confirmed that a speedup of up to 3.8 can be obtained on a 4-core processor. To the best of our knowledge, this is the first result ensuring both confidentiality and integrity of multimedia data in real-time on a multi-core processor with a portable Pthread program. We will extend this research to video surveillance applications in order to provide a secure and cost-efficient solution.
Acknowledgement

This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (No. 2009-0086148).
References
1. Watkinson, J.: The MPEG Handbook. Elsevier, Amsterdam (2004)
2. Furht, B., Kirovski, D.: Multimedia Security Handbook. CRC Press, Boca Raton (2005)
3. Ahmad, I., He, Y., Liou, M.: Video Compression with Parallel Processing. Parallel Computing 28(7), 1039–1078 (2002)
4. Hennessy, J., Patterson, D.: Computer Architecture. Elsevier, Amsterdam (2007)
5. Akhter, S., Roberts, J.: Multi-Core Programming - Increasing Performance through Software Multi-Threading. Intel Press (2006)
6. Park, K., et al.: On-Chip Multiprocessor with Simultaneous Multithreading. ETRI Journal 22(4), 13–24 (2000)
7. Daemen, J., Rijmen, V.: Advanced Encryption Standard (AES) (November 26, 2001)
8. NIST Special Publication 800-38A, Recommendation for Block Cipher Modes of Operation - Methods and Techniques, U.S. DoC/NIST (2001)
9. Jonsson, J.: On the Security of CTR+CBC-MAC. In: Nyberg, K., Heys, H.M. (eds.) SAC 2002. LNCS, vol. 2595, pp. 76–93. Springer, Heidelberg (2003)
10. Dworkin, M.: Recommendation for Block Cipher Modes of Operation: The CCM Mode for Authentication and Confidentiality. NIST Special Publication 800-38C (2002)
11. AMD (2009), http://multicore.amd.com/Products/
A Paradigm for Processing Network Protocols in Parallel

Ralph Duncan, Peder Jungck, and Kenneth Ross

CloudShield Technologies, an SAIC company, 212 Gibraltar Drive, Sunnyvale, CA 94089, USA [email protected]
Abstract. Network packet processing applications increasingly execute at speeds of 1-40 Gigabits per second, often running on multi-core chips that contain multithreaded network processing units (NPUs) and a general-purpose processor core. Such applications are typically programmed in a language that exposes NPU specifics needed to optimize low-level thread control and resource management. This facilitates optimization at the cost of increased software complexity and reduced portability. In contrast, our approach provides portability by combining coarse-grained, SPMD parallelism with programming in the packetC language's high-level constructs. This paper focuses on searching packet contents for packet protocol headers. We require the host system to locate protocol headers for layers 2, 3 and 4, and to encode their offset data in a packet information block (PIB). packetC provides descriptors, C-style structures superimposed on the packet array at runtime-calculable, user- or PIB-supplied offsets. We deliver state-of-the-practice performance via an FPGA for locating layer offsets and via micro-coded interpretation that treats PIB layer offsets as a special addressing mode.
1 Introduction

Pressure for faster network packet processing continues to increase as transmission media become faster (e.g., those specified by SONET/SDH [2, 3] and 10GbE [4] offer speeds in the 10-40 Gigabits per second range) and the volume of data to be transmitted continues its own relentless increase. Packets contain protocol headers for communications standards, such as IPv4. A header is a contiguous set of fields with routing, service and standards data. Many protocols exist, each with distinctive content. Since multiple headers may be present and since their relative offset from the packet's start may vary from packet to packet, a key aspect of packet processing is to determine which headers are present and where they are. Header searching is a common task for network applications, which typically run in a multithreaded environment where they are partitioned into light-weight threads that swap themselves out for each memory access. This programming style exploits low-level machine features to optimize performance. However, the resulting
machine-specific code can require extensive redesign and recoding when the application is ported.
2 Our Approach and Contributions

Our approach to packet processing as a whole has three major elements: a model of parallel packet processing, a specialized language to express the model and an ensemble of heterogeneous processors to implement the language. In this paper, we focus on the features of all three that are involved in header processing. The model's key characteristics for protocol processing are as follows:
• Task granularity is at the level of a complete program that processes a packet. Thus, the model's parallelism embodies the Single Program Multiple Data (SPMD) paradigm.
• The host system locates protocol headers in a packet before a copy of the program is executed on that packet.
• Each program copy works with a current packet and system-provided data on the presence and location of layer offsets.
Fig. 1. A parallel packet processing model Copyright CloudShield Technologies, 2009
This model is expressed through packetC [5], a C-style language that takes C99 [6] as its point of departure and provides features for packet operations and protocol header processing:
• Each program copy works on a single packet stored as a byte array in big-endian order (matching network order).
• The protocol information predetermined by the system, such as layer locations and protocol flags, is provided in a C-style structure, termed the packet information block (PIB).
• Protocols can be represented by a descriptor, a special kind of structure that is superimposed on the packet at a user-specified, runtime-calculable offset location.
• packetC redefines structure bit fields in a manner calculated to remove ambiguity and to allow descriptors to overlay protocol fields with non-standard bit widths in a predictable fashion.
Finally, implementation support for header processing includes using an FPGA to predetermine the presence and offsets of L2, L3 and L4 headers. Thus, we compensate for forgoing thread-level parallelism in protocol header processing by executing optimized header processing on a dedicated processor, while micro-coded interpreters on NPUs run other application tasks in end-to-end fashion on each packet. The next section presents the model in more detail.
3 A Parallel Packet Processing Model

The model provides an intuitive view of parallel packet processing to the developer, one in which task management mechanics are hidden, memory is partitioned in a straightforward way, and protocol detection and analysis are performed by the system as part of setting up each execution run of a program copy. The principal aspects of the model (Fig. 1) are listed below:
• Concurrency is provided by copies of a small program, each of which completely processes one packet each time it is run.
• A 'global' memory is shared by all the program copies, and each program copy has its own private memory.
• The host system manages the program copies and routes packets to and from them.
• The host system ensures that a program copy has two kinds of pre-processed data when it is triggered for a packet (Fig. 2):
• A copy of the packet itself, in the form of an array of unsigned bytes (in big-endian order).
Fig. 2. System-calculated protocol layer offsets for a packetC program copy Copyright CloudShield Technologies, Inc., 2009
• A collection of values that indicate whether protocol headers for network layers [7] 2, 3 and 4 are present in the packet and, if they are, where they are located (in terms of offsets from the packet array's start). These values are assembled in a packet information block (PIB), along with checksum information and protocol type information.
In this paper we are most interested in the model's requirement that the host system predetermine certain protocol layer offsets and assemble them in the PIB. Determining which protocol headers are present and at what offsets could be done by run-time support software routines; however, requiring the host to perform this function makes it possible to dedicate a specialized processor or offload engine to performing it effectively. Much of the PIB data's usefulness is a consequence of how its layer offset values can be used with packetC descriptor constructs and how the implementation rapidly fetches packet data via those offset values. The next section sketches the packetC language as a whole.
4 packetC Language Overview

The packetC language's [1] high-level characteristics include:
• Emphasizing strong typing by removing implicit type casts, coercions and promotions,
• Promoting real-time reliability by eliminating dynamic memory allocation, as well as pointer and address manipulation,
• Using a C-style program main as the unit of SPMD parallelism,
• Providing extended data types and operators to support classic packet-processing capabilities, such as matching packets against masked databases and searching packet payloads for contents that match user-defined strings and regular expressions.
For high-speed protocol processing, these language features are most relevant:
• Presenting the PIB information in the form of a C-style structure,
• Providing a descriptor construct that superimposes a C-style definition of a protocol header on the packet array at a user-specified offset expression (which may contain a PIB offset value and runtime-variable calculations),
• Defining a container-based alternative to C-style bit fields to ensure that protocol header structures will match the corresponding data in the packet byte array.
These elements are discussed in detail below.
5 The PIB as C-Style Structure A key aspect of the PIB’s header offset values for network layers 2, 3 and 4 is that these values are determined afresh for each packet, i.e., each time a parallel copy of the packetC main program is executed. Thus, the results are always packet-specific, since the location of a particular kind of protocol header may vary from one packet to another. The PIB definitions are shown below with some omissions for readability.
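The PIB structure definition itself did not survive in this copy of the text. The following packetC-style sketch is a reconstruction based on the PIB fields used elsewhere in the paper (l3Offset, l3Type, l4Offset, l4Type, action); the types, the l2Offset field, and the omitted members are assumptions.

// Reconstructed sketch of the packet information block (assumption: field
// names follow this paper's code samples; types and omitted fields are
// illustrative).
struct pibStruct {
    int   l2Offset;    // layer 2 header offset into the packet array
    int   l3Offset;    // layer 3 header offset (zero if no valid L3 header)
    int   l4Offset;    // layer 4 header offset (zero if no valid L4 header)
    short l3Type;      // e.g., L3TYPE_IPV4
    short l4Type;      // e.g., L4TYPE_TCP
    int   action;      // e.g., FORWARD_PACKET
    // ... checksum information and protocol flags omitted
} pib;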
Fig. 3. Positioning descriptors on the basis of header offset values in the PIB Copyright CloudShield Technologies, 2009
The PIB provides packetC applications with easily accessed information about whether given protocols exist and (if they do) where they are. To exploit that information, applications must be able to define those protocol headers and apply those definitions to the packet array where the data resides. packetC provides this capability through the descriptor construct described below.
6 The Descriptor Construct

packetC's descriptor construct is a structure that corresponds to a portion of the packet array of the same size. In a sense, it is an alias for an array-slice within the packet.
A descriptor declaration consists of its structure base type, its name and its location, an integer expression that defines the descriptor's starting point as an index into the packet array. This location specification, or at-clause, may contain three kinds of elements: compile-time constants, variables with values known only at run-time, and PIB layer-offset fields. By combining a descriptor's structure definition with an offset location based on a PIB layer offset value, we can create a precise, high-level descriptor of a protocol header that gravitates to the correct packet array location each time a new packet is prepared for a packetC program (Fig. 3).

descriptor ipv4Descr {
    bits byte {
        version     : 4;
        headerLength: 4;
    }
    byte  typeOfService;
    short totalLength;
    short ipv4_identification;
    short ipv4_fragmentOffset;
    byte  ipv4_ttl;
    byte  ipv4_protocol;
    short ipv4_checksum;
    int   ipv4_sourceAddress;
    int   ipv4_destaddress;
    int   ipv4_payload;
} ipv4Header at pib.L3_offset;
Consider the IPv4 protocol shown above. First, the descriptor defines a structure that matches the fields of an IPv4 header. The at-clause then states that it will always be found at the packet’s layer 3 offset (when a valid layer 3 header is present). Descriptor at-clauses can be constant or run-time determined, simple or arbitrarily complex. Complex at-clause expressions are especially relevant when the start of one header depends on the presence of optional fields in a preceding header. For example, if we did not provide Layer 4 offsets, it would be possible to calculate them in terms of an IPv4 Layer 3 header as follows:
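The example code did not survive extraction at this point. The following is a reconstruction consistent with the surrounding text, under the assumption that the IPv4 headerLength field counts 32-bit words, so the layer 4 header begins headerLength * 4 bytes past the layer 3 offset; the TCP descriptor fields are illustrative.

descriptor tcpDescr {
    short sourcePort;
    short destinationPort;
    // ... remaining TCP header fields omitted
} tcpHeader at pib.L3_offset + (ipv4Header.headerLength * 4);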
The descriptor construct is useful for describing a stack of protocols, i.e., a sequence of consecutive protocol headers that appear in a packet as a group, as shown below and in Fig. 4.
• Layer 2: Ethernet
• Layer 3: IP (e.g., IPv4 or IPv6)
• Layer 4: TCP or UDP
The flexibility of the descriptor’s at-clause construct allows programmers to specify stacks (i.e., to link a sequence of protocols) when the involved protocol headers • Have layer offsets other than those available in the PIB. • Have a variable size (e.g., because the header has optional fields). Handling these characteristics is particularly valuable when the developer is dealing with custom or proprietary protocols. The descriptor construct’s at-clause provides a means to flexibly superimpose a Cstyle structure onto the packet array. However, in order to use structures to represent protocol headers, we have to address fundamental problems with C bit-fields, as the next section describes.
7 Container-Based Bit Fields

Given an entire descriptor's starting location as an offset into the packet array (via the at-clause's value), we should know which portions of the packet array correspond to individual fields of the descriptor. However, many standard protocols have fields smaller than typical integer storage units (32, 16, 8 bits) or fields that do not take up an integral number of bytes. C's bit field construct is not adequate because the implementation freedom it bestows creates a variety of uncertainties (see Section 6.7.2.1 of the C99 standard [6]).
struct structTag {
    unsigned int  notAbitField;
    unsigned char a: 4;
    unsigned int  b: 2;
    unsigned int  c: 4;
} myStruct;
Fig. 4. A protocol ‘stack’ with its corresponding PIB and descriptor information Copyright CloudShield Technologies, 2009
• 'Straddle' behavior – The entire field named c cannot fit in the byte allocated for a and b. Some compilers let it 'straddle' bytes, with 2 bits in the byte allocated for the first two fields and the remaining bits in a trailing byte, but others do not. The C99 specification comments: "if insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined."
• Container size – Similarly, the compiler may or may not heed user specifications of the storage unit to use for the bit fields. The specification says an implementation can use "any addressable storage unit large enough" to hold a bit field.
• Bit field layout – Finally, we cannot be certain whether the compiler allocates the topmost fields in the declaration to the least significant bytes of the corresponding memory, or how the containing unit is aligned: "the order of allocation of bit-fields within a unit is implementation-defined. The alignment of the addressable storage unit is unspecified."
It is highly desirable to port a packetC application to new processors or compilers without recoding it to reflect new bit field implementation peculiarities. Thus, the packetC bit field syntax and conventions discussed below strictly control implementation. The example code below shows the packetC equivalent of the C99 structure shown above.
struct structTag {
    int notAbitField;
    bits short {      /* 1 */
        a: 4;         /* 2 */
        b: 2;
        c: 4;
        pad: 6;       /* 3 */
    } containerName;  /* 4 */
} myStruct;
1. Related bit fields are organized inside a container, which has one of packetC's four unsigned integer types.
2. Since a bit field is always part of a container that has a type, each bit field declares a name and a size, not a type.
3. Pad fields are declared explicitly and cannot be accessed for test or set operations.
4. The container name can be used to access and manipulate the bit field collection as a whole.
This approach removes the container size, straddling and boundary uncertainties. To achieve portability we must also manage byte allocation order. C structures do not use the same byte allocation order when they are compiled on big-endian and little-endian processors. User operations on fields that match whole integer storage units do not show host-specific endianness effects, but operations on bit fields, which can be sub-elements of integer storage units or straddle them, do show such effects. Big-endian machines store the most significant bits of a word at the lowest byte address, while little-endian machines store the least significant bits starting at the lowest byte address. packetC packet arrays and descriptors are required to be in big-endian order to ensure that they are portable. Although CloudShield's products are all big-endian platforms, little-endian processors could implement packetC descriptors and packet arrays via compiler adjustments and byte swapping. Having reviewed portability matters, we turn to performance issues in the next section.
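On a little-endian host, the byte-swapping approach amounts to assembling multi-byte fields explicitly from packet bytes. A minimal C illustration of reading a 16-bit big-endian (network order) field from the packet array, independent of host endianness:

#include <stdint.h>

/* Read a 16-bit big-endian field at a byte offset in the packet array. */
static uint16_t get_be16(const uint8_t *pkt, unsigned offset) {
    return (uint16_t)((pkt[offset] << 8) | pkt[offset + 1]);
}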
8 Current Hardware Implementation The packetC language specifies that the host system predetermines specific layer offsets for certain PIB fields but does not dictate how those calculations are done. In practice, CloudShield Technologies’ products, such as the CS-2000 [8], use multi-core NPUs much as our rivals do. However, our approach uses micro-coded interpreters running on NPUs to interpret user programs and, when appropriate, to push data to specialized processors, as indicated below. • Microcode running on some of the NPU cores works with FPGAs to control the packet pipeline and pre-locate headers. • An ensemble of NPUs, each running multiple contexts, executes the SPMD program copies and system programs.
• A custom FPGA provides the shared memory for scalars and C-style aggregates. • T-CAM (Ternary Content Addressable Memory) chips and Regular Expression processing chips implement operations on packetC’s database and searchset types, respectively [1].
Fig. 5. Layer offsets test. (a) Output speed at 1, 2, 5, 10G; (b) Details at 10Gbps Copyright CloudShield Technologies, 2010
Two system hardware aspects stand out for this discussion and support the performance encountered in the next section: • A Xilinx Virtex-5 family FPGA is the ingress processor that locates offsets for layer 2, 3 and 4 protocol headers (Fig. 3). • The NPU-based interpreters efficiently cache PIB data in registers and effectively treat PIB layer offsets as an optimized addressing mode for packet access.
9 PIBs, Descriptors and Performance
CloudShield's fundamental approach is to combine several factors in order to mitigate foregoing the performance benefits of user-directed, fine-grained parallelism and low-level machine resource control. Those factors consist of using:
• Specialized processors to speed up key operations.
• High-level programming constructs that neither preclude nor dictate specialized hardware implementation.
• Mainstream NPU parallelism, but with distinctive elements of program interpretation and packet-level SPMD parallelism.
In the case of applications that primarily manipulate protocol headers, these factors involve using:
• A Xilinx Virtex-5 family FPGA as an off-load engine to pre-calculate often-used layer offsets,
• The packetC descriptor construct to define and access protocol header fields in terms of run-time calculable values and PIB layer offsets,
• Support for interpreters handling PIB layer offsets as a packet array addressing mode.
Thus, the data reported here shows that for network traffic with relatively large packets, our approach lets small programs with multiple accesses to PIB layer offsets and descriptor fields run ‘at wire-speed’ for a spectrum of 1-10 Gbps input speeds. We produced the performance metrics described below using a CloudShield CS-2000 model that uses a DPPM-800 (Deep Packet Processing Module) to execute parallel packetC programs. We used CloudShield's CPOS 3.0.2 operating system and the packetC compiler available as part of the PacketWorks IDE (Integrated Development Environment) 3.1. The network traffic for the tests was produced by an IXIA 1600T [9] with a 10G interface and line-rate processor. Test results were produced by selecting packet size and line utilization (e.g., utilizing 50% of the 10G interface to model a 5G link). After several wall-clock seconds of execution, the experiment would be manually halted and an output metric would be calculated as the total number of output bytes the application produced divided by the total number of bytes the IXIA had pumped into the application. The first test centers on four PIB accesses; it consists of accessing PIB layer 3 and 4 information and forwarding the packet if it contains IPv4 TCP protocol header data.

packet module layerAccess_test;
#include "protocols.ph"
void main( $PACKET curpacket )
{
   if ( pib.l3Offset && pib.l3Type == L3TYPE_IPV4 ) {
      if ( pib.l4Offset && pib.l4Type == L4TYPE_TCP ) {
         pib.action = FORWARD_PACKET;
      }
   }
}
As Fig. 5 shows, the CS-2000 runs the test at wire speed, with output speeds matching input speeds at 1, 2 and 5 Gbps. Differences based on packet size emerge at 10 Gbps, although the greatest lag (at a 500-byte packet size) still produces an output speed that is 85.8% of the input. For this particular platform and system software release, the ‘sweet spot’ is found near a packet size of 1500 bytes, where output is 97.2% of the input speed. The second test takes the four PIB accesses from the initial test and adds eight descriptor field accesses (four reads and four writes). This spoofing program simply swaps the source and destination IP addresses, as well as the source and destination ports, before sending the otherwise unaltered packet back to its sender.
[Fig. 6 data: (a) output matches input at 1, 2 and 5 Gbps; (b) at 10 Gbps, output as a percentage of input across packet sizes of 500, 1000, 1500 and 2000 bytes ranges from 91.2% to 99.5%]
Fig. 6. Spoofing test. (a) Output speed at 1, 2, 5, 10G; (b) Details at 10 Gbps. Copyright CloudShield Technologies, 2010
// protocols.ph defines the ipv4 and tcp descriptors used below
void main( $PACKET curpacket )
{
   int   myIpSrc, myIpDest;
   short mySrcPort, myDestPort;
   if ( pib.l3Offset && pib.l3Type == L3TYPE_IPV4 ) {
      if ( pib.l4Offset && pib.l4Type == L4TYPE_TCP ) {
         myIpSrc    = ipv4.sourceAddress;
         myIpDest   = ipv4.destinationAddress;
         mySrcPort  = tcp.sourcePort;
         myDestPort = tcp.destinationPort;
         // Swap addresses
         ipv4.sourceAddress      = myIpDest;
         ipv4.destinationAddress = myIpSrc;
         // Swap ports
         tcp.sourcePort      = myDestPort;
         tcp.destinationPort = mySrcPort;
         pib.action = FORWARD_PACKET;
      }
   }
}
The results are akin to those of the previous test, although everything runs slightly faster (Fig. 6). Again, the program keeps up with its input until the input speed is around 10G, at which time it falters very slightly. The marginally better numbers for a more complex program are possibly the result of the buffering and scheduling software favoring larger programs.
The results indicate that small packetC programs with roughly a dozen or more descriptor and PIB accesses can run at wire speed up to roughly 10G on the CS-2000/DPPM-800 platform (with its current configuration and software). The results also have broader implications. Most importantly, they show that the benefits of the CloudShield implementation, which include using an FPGA off-load engine for layer calculations and using a fast PIB offset addressing mode, compensate for any advantages of having users implement layer detection via low-level, machine-specific tasking. In sum, the high-level, portable programs run at state-of-the-practice speeds. Second, the application code for protocol definitions and manipulation can be written at a high level that is succinct and readily comprehensible. For example, both test programs fit within the space of a paragraph or two and, even for a reader unfamiliar with network programming, are immediately accessible. The next section reviews other languages for parallel network processing in general and protocol processing in particular.
10 Related Work
L. George and M. Blume describe the NOVA language for the IXP network processor in [10]. NOVA features include record and tuple aggregates, familiar control-flow constructs, functions and exceptions. NOVA does not automate protocol header discovery, but it provides ample means for precisely specifying protocol representation through a layout construct for unambiguously describing protocol header bit fields. A layout can describe a given bit field in two forms: packed and unpacked. The packed form approximates a C bit field description. NOVA's overlay construct provides the capability to define alternative organizations for a given bit range within a layout. The unpacked form accords a word of storage, or a nested unpacked form, to each field. NOVA provides pack and unpack operations to manage the two forms. Intel Corporation's Microengine C is targeted to its IXP1200 network processor family [11,12]. The language is a C dialect that omits features that embedded network applications are unlikely to use (the floating-point data type, function pointers and a runtime stack). The IXP architecture features a general-purpose processor core and multiple microengines, RISC processors optimized for packet processing. The processing model involves breaking programs into multiple threads that are partitioned across multiple microengines. The user manages communications among threads and processes and typically swaps out a task that reads from or writes to memory. Microengine C provides access to many machine-specific IXP features, such as associating variables with one of three memory classes, with FIFOs or with five register classes. In addition, the language facilitates exploiting machine specifics through an extensive battery of intrinsics and support for inlining. NetPDL (Network Protocol Description Language) [13] is an XML-based language aimed at developing a database of standard network protocols, i.e., header descriptions for standard protocols. The designers appear primarily interested in providing a reusable repository of protocol definitions, rather than in defining a programming language as such, although the language does include provisions for expressions and application development.
J. Wagner and R. Leupers describe a C dialect and compiler for the Infineon network processor [14]. Language extensions provide capabilities for mapping packet protocol contents to special registers and accessing arbitrary bit-width operands without the typical alignment restrictions. The compiler maps C arrays with the register qualifier to a special register file. Programmers can specify operands in the register file by using a bit-field pointer and bit-width arguments. Such arguments can have arbitrary sizes and alignments; they can also span multiple registers. A set of compiler intrinsics (compiler-known functions) is the key language extension that provides users with a mechanism to employ these capabilities at a high language level. packetC shares precise bit field specification with NOVA and shares protocol header support with NOVA and the Infineon C dialect. However, only packetC requires off-loading header detection. NOVA, Microengine C and the Infineon C dialect are all geared to specific NPUs and reflect machine specifics. In contrast, packetC hides processor specifics, although it is currently implemented with the same NPU family as the first two languages. packetC's machine-independence goals and its current implementation's exploitation of familiar NPU technology drive our conclusions about its portability and performance.
11 Conclusions
This paper is one of several related papers: each isolates a machine-independent packetC language construct, then describes CloudShield's specialized hardware implementation that delivers high-speed performance for that construct. In this paper, we concentrate on protocol header processing, so the relevant features in each area are as follows.
• Relevant language constructs are the descriptor type and PIB definitions.
• Relevant implementation specifics are the ingress processing FPGA and the use of PIB layer offsets for a fast addressing mode during user program interpretation.
• Relevant performance experiments are focused on reading and writing protocol header information.
Our general intent is to demonstrate that it is possible for packet processing applications to enjoy both portable, high-level language programming and state-of-the-practice packet processing performance (in the 1-10 Gbps range). CloudShield's approach to maintaining portability while providing packet processing parallelism rests on adopting coarse-grained, SPMD parallelism and packetC programming constructs that hide host machine specifics. Consider the test programs used in Section 9. Although simple, each example is a complete network program, reduced to a paragraph's size. Esoteric system data structures have been replaced by references to predefined structure fields and enumeration literals. The SPMD-level parallelism relieves the programmer of responsibility for tasking at the every-memory-access level. In sum, many of the more arcane aspects of network programming are replaced by familiar, comprehensible constructs. Our overall approach to performance compensates for the lack of user-specified, low-level thread optimizations by employing specialty chips and processors and by
exploiting classic NPU multi-core parallelism in the service of an interpreted virtual machine with domain-specific operations. In protocol header processing, the CloudShield approach is manifested by using an FPGA off-load engine and by having the interpreters cache and exploit PIB layer offset information for rapid packet access. The CS-2000 system used for these tests is typically deployed as a deep packet inspection platform. Our tests essentially use it as a network router, which is appropriate for this discussion's focus on protocol header operations. Not surprisingly, the CS-2000's observed performance exceeds that of many commercial routers. The system easily handles the test programs at 1, 2 and 5 Gbps but falters slightly at 10 Gbps. In summary, this language and system implementation approach does deliver concise, portable network programs that operate in the 1-10 Gbps range.
Acknowledgements
Peder Jungck, Dwight Mulcahy and Ralph Duncan are the co-authors of the packetC language. Gary Oblock and Matt White also contributed to the language and to compiler development. Andy Norton, Greg Triplett, Kai Chang, Mary Pham, Alfredo Chorro-Rivas and Minh Nguyen provided valuable help. Professors Rajiv Gupta and Rainer Leupers provided expert advice.
References
1. Duncan, R., Jungck, P.: packetC Language for High Performance Packet Processing. In: Proceedings of the 11th IEEE International Conference on High Performance Computing and Communications, Seoul, South Korea, pp. 450–457. IEEE, Los Alamitos (June 2009)
2. ANSI T1.105: SONET - Basic Description including Multiplex Structure, Rates and Formats
3. ANSI T1.119/ATIS PP 0900119.01.2006: SONET - Operations, Administration, Maintenance, and Provisioning (OAM&P) - Communications
4. IEEE Std 802.3ae-2002. To be superseded by the more recent but not finalized IEEE Std 802.3-2008
5. CloudShield Technologies: packetC Programming Language Specification. Rev. 1.128 (October 10, 2008)
6. ISO/IEC 9899:1999: Standard for the C programming language (‘C99’) (May 2005)
7. International Organization for Standardization (ISO) 7498: Open Systems Interconnection (OSI) reference model (1983)
8. CloudShield Technologies: CS-2000 Technical Specifications. Product datasheet available from CloudShield Technologies, 212 Gibraltar Dr., Sunnyvale, CA, USA 94089 (2006)
9. IXIA: IXIA 1600T/400T. Datasheet (retrieved 11/29/2009), http://www.ixiacom.com/pdfs/datasheets/ch_1600t_400t.pdf
10. George, L., Blume, M.: Taming the IXP network processor. In: Proceedings of the ACM SIGPLAN '03 Conference on Programming Language Design and Implementation, San Diego, California, USA, pp. 26–37. ACM, New York (June 2003)
11. Intel Microengine C Compiler Language Support: Reference Manual. Intel Corporation, order number 278426-004 (August 10, 2001)
12. Intel Microengine C Networking Library for the IXP1200 Network Processor: Reference Guide. Intel Corporation (December 2001)
13. Risso, F., Baldi, M.: NetPDL: an extensible XML-based language for packet header description. Computer Networks: The International Journal of Computer and Telecommunications Networking 50(5) (2006)
14. Wagner, J., Leupers, R.: C compiler design for a network processor. IEEE Trans. on CAD 20(11), 1–7 (2001)
Real-Time Task Scheduling on Heterogeneous Two-Processor Systems

Chin-Fu Kuo and Ying-Chi Hai

Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan 106, ROC
[email protected]
Abstract. A heterogeneous multiprocessor system is usually composed of one general purpose processor and one or more special purpose computing components. As heterogeneous multiprocessor systems become more popular, this domain has attracted increasing research attention. In such a system, tasks often need to be processed by multiple different functional processing units. Therefore, a task is usually divided into several subtasks according to its execution requirements, each of which is executed on a particular processing unit under precedence constraints. In this paper, we present an EDF-based algorithm to schedule tasks in a heterogeneous multiprocessor system and propose the corresponding schedulability analysis. A series of simulation experiments is conducted to verify the analytic results and to show the capability of the proposed algorithm.
1 Introduction
Applications that need higher computational capability while meeting real-time constraints in embedded systems face much more rigorous demands as application complexity grows. In multimedia systems, for example, the processor must have enough power to handle video or audio streams, such as the decoding or encoding of the high-quality video compression format H.264 [1] or Advanced Audio Coding (AAC) [2]. There are two main ways to meet the higher computational demands of real-time applications. The first is to speed up the processor by means of a deeper pipeline and advances in IC technology. However, tasks are preempted by asynchronous events to meet real-time constraints, so the system not only loses the benefits of the pipeline but also incurs additional overhead. Moreover, although processor speed becomes faster with the advance of IC technology, it brings the problems of higher power consumption and heat dissipation. It is not a trustworthy approach for devices that must limit their power supply. The second way is to process data simultaneously by means of extra processors, giving the system more computational capacity. Today's Intel Core 2 Duo and Intel Core 2 Quad products are based on this concept [3]. Besides, uniprocessor architecture development not only suffers from the bottleneck of IC processes
but also costs more than a multiprocessor architecture. Thereby, multiprocessor platforms have become more and more popular for applications with higher computational requirements, meeting the fast-growing demands of applications and increasing whole-system performance. Besides, heterogeneous multiprocessor platforms have become the major trend. For example, the TMS320DM6446 of Texas Instruments is a powerful embedded system platform supporting the highly computational requirements of multimedia applications [4]. Multiprocessor systems can typically be classified into two kinds of architectures: homogeneous and heterogeneous. A homogeneous multiprocessor system, as implied by the name, has several identical processors. In contrast, a heterogeneous multiprocessor system has processors with different characteristics, including general purpose processors (GPPs) and special purpose processors such as digital signal processors (DSPs). Of these two kinds of processors, the DSPs are dedicated to handling complex mathematical computations and the decoding or encoding of multimedia streams. Moreover, because DSPs have a large number of registers to speed up computation, they operate in a non-preemptable fashion to avoid the overhead of unnecessary context switches. In contrast, because GPPs are not dedicated to a special purpose, their frequency must be much higher than that of DSPs to match them on such specialized applications. Hence, homogeneous multiprocessor systems lose their advantage once cost is considered. Furthermore, because the DSP architecture provides predictable execution times, it is advantageous in real-time systems. Consequently, heterogeneous multiprocessor systems have become noteworthy in industry, with applications such as automotive systems, communication networks, consumer electronics and mobile handhelds. The scheduling approaches mentioned above are preemptive algorithms. On the other hand, some researchers have developed schedulability tests based on non-preemptive algorithms; for example, Baruah proposes a feasibility analysis based on the non-preemptive earliest-deadline-first algorithm upon several identical processors in [5]. An on-line non-preemptive scheduling algorithm that is easy to implement, for both underload and overload, upon homogeneous multiprocessor systems is investigated in [6]. In contrast, research on heterogeneous multiprocessor systems is relatively scarce. One of the most significant reasons is that real-time scheduling problems are NP-hard [7], and the characteristics of heterogeneous multiprocessor systems make the problems even more complex. Our paper focuses on the problem model of previous research on heterogeneous multiprocessor systems [8,9,10]. Gai et al. first point out that when a traditional scheduling algorithm such as [10] schedules tasks that need to be executed on the DSP, a hole within each job is generated in the schedule of the GPP; they then analyze the scenario in which tasks are scheduled via the Rate-Monotonic (RM) scheduling algorithm [11] to tighten the schedulability bound [9]. The system model in [9] has two waiting queues (one for the GPP and the other for the DSP); when the DSP is active, the scheduler selects tasks from the GPP queue only. Nevertheless, a task that needs to be executed on the DSP should still have the
right to be executed on the GPP while the part of its work that must run on the GPP is unfinished. This problem model is improved in [8]. In traditional real-time applications, the system decides the execution priorities of tasks according to their periods or deadlines. Much real-time systems research over the years builds on this foundation, in areas such as networking, databases and disk I/O. Even so, with the growing complexity of real-time applications in embedded systems, tasks often need to be split to execute in different processor environments, i.e., on a heterogeneous multiprocessor architecture (GPPs and DSPs), to satisfy performance and real-time constraints. For instance, a video decode program must decode 30 frames per second. During the digital processing, the GPP reads the video data from disk in the first stage; the second stage calls the DSP to decode the data; and the GPP displays the processed data on the monitor in the final stage. However, this behavior reduces the efficiency of traditional real-time scheduling algorithms. In our paper, we adopt the previous two-queue architecture as our strategy, but unlike [9], scheduling from the DSP queue does not break off while the DSP is active. Furthermore, in order to improve the upper bound, we choose the Earliest-Deadline-First (EDF) scheduling algorithm as our basis instead of the RM algorithm. Note that the schedulability tests of prior research calculate only the total utilization of the GPP, whereas we count the DSP and GPP utilizations separately in our schedulability tests. This gives a more precise analysis than previous research. Finally, we evaluate the effect of the proposed algorithms through a series of simulations. Our paper offers a precise schedulability test analysis for task scheduling in heterogeneous multiprocessor systems. In Section 2, we first introduce the architecture of a heterogeneous multiprocessor system. In Section 3, we then propose our deadline-assignment method, considering that the platform has different processors and tasks have non-preemptive execution requirements on the DSP. We describe the simulation and present the evaluation results in Section 4. Finally, Section 5 concludes the paper and outlines future work.
2 System Model

2.1 Processor Model
In this section, we describe the processor architecture considered in this paper. The system architecture is composed of two different kinds of computational units, a general purpose processor (GPP) and a digital signal processor (DSP), built on the same chip. The processors of this heterogeneous multiprocessor system can therefore exchange messages with each other via a shared common bus. Figure 1 illustrates the abstract heterogeneous dual-core architecture. For instance, in playback on the TMS320C64x of Texas Instruments [4], the general purpose processor (an ARM9) saves the captured raw frame data from the video capture device to the common memory shared by the GPP and the DSP, and then invokes the DSP to begin encoding those data using a video
Fig. 1. Architecture configuration of the dual-core system
Fig. 2. Execution of a DSP task
encoder algorithm. After the above steps, the GPP writes the processed data to the file system. The host port interface (HPI) in Figure 1 lets the GPP directly access the cache of the DSP and thereby speeds up communication. Furthermore, since the two processors are built on the same chip and both share the memory address space of the architecture, we can assume that the communication overheads between the GPP and the DSP are negligible. Even if some communication overheads still exist when tasks invoke requests on the GPP or migrate to another core, we can account for them a priori. Hence, we can regard the system model as similar to a system-on-a-chip (SoC) architecture. Note that the GPP not only communicates with the DSP but also directly controls the other peripherals. In the system, all the jobs launched on the DSP are invoked by a remote procedure call (RPC) and are executed in a non-preemptive fashion. Such an RPC is also called a DSP activity and is scheduled by the GPP. Therefore, the GPP has the responsibility, as the real-time kernel, to prevent a task from invoking a DSP request while there is already a task on the DSP.
2.2 Task Model
Figure 2 illustrates the real-time task model considered on the two heterogeneous processor system. Each task $\tau_i$ is a stream of job instances generated at runtime. For example, a decoder task needs to decompress about 30 encoded frames per second. The task reads the video file on the GPP, decompresses the data on the DSP, and then displays the frame on the GPP. For frame encoding, the execution sequence is similar to the decoding procedure [12]. Therefore, each job $\tau_{i,j}$, which
arrives at time $r_{i,j}$, needs $c_{i,gpp} = c_{i,pre} + c_{i,post}$ units of time on the GPP. In a decoder task, $c_{i,pre}$ and $c_{i,post}$ are for the data reading and frame displaying, respectively. The job also needs $c_{i,dsp}$ units of time on the DSP. Besides, we assume that there is at most one DSP request invoked by each job during its execution, after $c_{i,pre}$ units of time. When a job finishes its DSP activity it then executes for $c_{i,post}$ units of time, so that $c_{i,pre} + c_{i,post} = c_{i,gpp}$.
Definition 1. If a task needs to be executed on the DSP, it is regarded as a DSP task; otherwise it is regarded as a regular task.
Moreover, jobs arrive either periodically (i.e., $r_{i,j} - r_{i,j-1} = T_i$, where $T_i$ denotes the task's period) or sporadically (i.e., $r_{i,j} - r_{i,j-1} \ge T_i$, where $T_i$ is the minimum interarrival time) and must end before the absolute deadline $ad_{i,j}$. Therefore, the relative deadline is $d_i = ad_{i,j} - r_{i,j}$. A regular task, in fact, is equal to a DSP task with $c_{i,dsp} = 0$, but for the sake of simplicity we do not regard regular tasks as DSP tasks when their $c_{i,dsp} = 0$. Finally, $u_i$ denotes the utilization of task $\tau_i$, where $u_i = c_i / \min(d_i, T_i)$.
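To make the task model concrete, here is a minimal C++ sketch of the task parameters and the utilization formula above; the type and member names are our own illustration, not from the paper.

#include <algorithm>

struct Task {
    double c_pre;   // GPP time before the DSP request
    double c_dsp;   // non-preemptive DSP time (0 for a regular task)
    double c_post;  // GPP time after the DSP request
    double T;       // period or minimum interarrival time
    double d;       // relative deadline
    double c_gpp() const { return c_pre + c_post; }
    bool isDspTask() const { return c_dsp != 0; }
    // u_i = c_i / min(d_i, T_i); we take c_i as the job's total demand,
    // an assumption since the paper does not spell out c_i.
    double utilization() const {
        return (c_pre + c_dsp + c_post) / std::min(d, T);
    }
};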
3 Real-Time Task Scheduling on the Two Heterogeneous Processor System
In this section, we propose the online EDF-based scheduling algorithm and apply the deadline assignments for the off-line schedulability test and online admission control.

3.1 Deadline Assignment for Subtasks on Processors
In order to guarantee the real-time requirements of tasks on the system with two heterogeneous processors, the idea behind our proposed approach is to assign relative deadlines to the subtasks of a DSP task and to perform an off-line schedulability analysis. At runtime, when a processor (i.e., the GPP or the DSP) is idle, the global scheduler dispatches the subjob with the most urgent absolute deadline in the corresponding ready queue to the processor. As mentioned above, the system maintains a global scheduler that has the responsibility to dispatch subjobs to
Fig. 3. Illustration of schedulers on dual-core systems
appropriate processors, as shown in Figure 3, in contrast to a local scheduler that deals only with its own processor (GPP or DSP). Furthermore, the scheduling strategy of the scheduler is EDF [11]. In this section, we first discuss how to assign the relative deadlines to the subtasks of a DSP task. A DSP task is composed of three subtasks that execute on different processors according to their execution requirements. Each subtask can be considered a stream of subjobs. The subtasks of task $\tau_i$ can be viewed as individual and independent tasks with equal period but different absolute deadlines. Let $d_{i,pre}$, $d_{i,dsp}$ and $d_{i,post}$ represent the relative deadlines of the three subtasks $\tau_{i,pre}$, $\tau_{i,dsp}$, and $\tau_{i,post}$ of task $\tau_i$, respectively. If a job $\tau_{i,j}$ arrives at $r_{i,j}$, its corresponding subjobs are $\tau_{i,j,pre}$, $\tau_{i,j,dsp}$, and $\tau_{i,j,post}$. Let $ad_{i,j,pre}$, $ad_{i,j,dsp}$, and $ad_{i,j,post}$ denote the corresponding absolute deadlines of the subjobs. The absolute deadline of subjob $\tau_{i,j,pre}$ (resp. $\tau_{i,j,dsp}$) is the ready time of subjob $\tau_{i,j,dsp}$ (resp. $\tau_{i,j,post}$). The absolute deadlines are derived by the following equations: $ad_{i,j,pre} = r_{i,j} + d_{i,pre}$, $ad_{i,j,dsp} = ad_{i,j,pre} + d_{i,dsp}$, and $ad_{i,j,post} = ad_{i,j,dsp} + d_{i,post}$.
The subjobs $\tau_{i,j,pre}$ and $\tau_{i,j,post}$ are then saved in the corresponding ready queue for the GPP, while subjob $\tau_{i,j,dsp}$ is saved in that for the DSP. The scheduler schedules the subjobs in the queues based on their absolute deadlines. The relative deadlines of subtasks can be assigned according to different deadline assignment strategies such as Effective Deadline (ED), Equal Slack (EQS) and Equal Flexibility (EQF) in [13], and Equal Deadline (EQD). In this paper, we adopt the EQF deadline assignment strategy, whose formulas are as follows:
$$d_{i,pre} = T_i \times \frac{c_{i,pre}}{c_{i,gpp}+c_{i,dsp}}, \quad d_{i,dsp} = T_i \times \frac{c_{i,dsp}}{c_{i,gpp}+c_{i,dsp}}, \quad d_{i,post} = T_i \times \frac{c_{i,post}}{c_{i,gpp}+c_{i,dsp}}.$$
For instance, consider a DSP task $\tau_i$ with $d_i = 9$, $T_i = 10$, $c_{i,pre} = c_{i,dsp} = c_{i,post} = 1$ under the EQF deadline assignment strategy. If $\tau_{i,j,pre}$ arrives at time 0, we can visualize the subtask $\tau_{i,pre}$ of $\tau_i$ as an independent and individual task $\tau_k$ with $d_k = 3$, $T_k = 10$, $c_{k,gpp} = 1$, $c_{k,dsp} = 0$ and $ad_k = ad_{i,j,pre} = 3$. The subjobs $\tau_{i,j,dsp}$ and $\tau_{i,j,post}$ have properties similar to $\tau_{i,j,pre}$. The periods of the three subtasks are all equal to 10. The main difference is $ad_{i,j,dsp} = ad_{i,j,pre} + d_{i,dsp} = 3 + 3 = 6$ and $ad_{i,j,post} = ad_{i,j,dsp} + d_{i,post} = 6 + 3 = 9$. Note that the different color boxes (in Fig. 2) represent the subtasks of the DSP task.
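A small C++ sketch of this proportional split is given below; it is our own illustration. The formulas above distribute $T_i$, while the worked example distributes $d_i = 9$; the budget parameter covers either reading.

#include <cstdio>

struct SubDeadlines { double pre, dsp, post; };

// EQF-style split: each subtask gets a share of the budget proportional
// to its execution time.
SubDeadlines eqfAssign(double cPre, double cDsp, double cPost, double budget) {
    double total = cPre + cDsp + cPost;
    return { budget * cPre / total, budget * cDsp / total,
             budget * cPost / total };
}

int main() {
    SubDeadlines d = eqfAssign(1, 1, 1, 9);  // the worked example
    double adPre = 0 + d.pre;                // release at time 0
    double adDsp = adPre + d.dsp;
    double adPost = adDsp + d.post;
    std::printf("%g %g %g\n", adPre, adDsp, adPost);  // prints: 3 6 9
    return 0;
}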
3.2 Online EDF-Based Scheduling Algorithm
The proposed Online EDF-based Scheduling Algorithm (OESA) is based on the EDF scheduling algorithm [11], which always assigns the highest priority to the job whose absolute deadline is nearest to the current time. The most significant difference in the scheduling strategy of this paper is that the scheduler uses the absolute deadlines of subjobs, not of jobs, as the foundation of priority assignment at runtime; i.e., the subjob whose absolute deadline is the closest to the current time, among all subjobs of all ready jobs, has the
Algorithm 1. Online EDF-based Scheduling Algorithm (OESA)
/* Deals with the operations when a subjob of job τi,j finishes or job τi,j arrives */
1  if subjob of job τi,j finishes then
2      if the status of job τi,j == pre then
3          if ci,dsp != 0 then
4              The status of job τi,j := dsp;
5              Insert nowGPP into D-queue;
6              nowGPP := Null;
7      else
8      if the status of job τi,j == dsp then
9          if ci,post != 0 then
10             The status of job τi,j := post;
11             Insert nowDSP into G-queue;
12             nowDSP := Null;
13     else
14     if the status of job τi,j == post then
15         nowGPP := Null;
16 if job τi,j arrives then
17     Insert job τi,j into G-queue;
   /* Checking the urgency of jobs */
18 if the first element of G-queue has a more urgent deadline than the deadline of nowGPP then
19     Insert nowGPP into G-queue; nowGPP := the first element of G-queue;
20 if nowDSP == Null then
21     nowDSP := the first element of D-queue;
22 Execute the corresponding operation of nowGPP and nowDSP.
highest priority to be executed. If a job finishes its partial work on one processor (GPP or DSP) and still needs to be executed on the other processor, whose characteristics differ from the previous one, the job is migrated to the other processor. Algorithm 1 describes the scheduling procedure in detail. Whenever a scheduling event occurs, such as a subjob finishing or a new job being released, the global scheduler is invoked. The scheduler has two queues, G-queue and D-queue, to maintain job information. The elements in the queues are kept in order of non-decreasing absolute deadlines. Besides, we use two variables named nowGPP and nowDSP to keep the information of the jobs running on the GPP and the DSP, respectively. In steps 1 to 15, the scheduler deals with the operations after a subjob's completion by checking the statuses of jobs, such as changing statuses or clearing nowGPP and nowDSP. There are three kinds of job status, named pre, dsp, and post, respectively, which a job takes in turn. Each status represents the current execution requirement of the job. Hence, a job never has status dsp if it is an instance of a regular task. In steps 16 to 19, the scheduler checks whether the subjob of the newly arrived job is more urgent than the currently running subjob; if so, the running subjob is preempted. Finally, because of the non-preemptive fashion of the DSP, the scheduler must check whether the DSP is active before assigning a subjob to it in step 20.
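A partial C++ sketch of the dispatch portion of Algorithm 1 (steps 18 to 22) follows; the names and types are our own, and the status bookkeeping of steps 1 to 15 is omitted.

#include <optional>
#include <queue>
#include <vector>

struct Subjob { double absDeadline; int jobId; };

// Min-heap ordered by absolute deadline (EDF order).
struct LaterDeadline {
    bool operator()(const Subjob& a, const Subjob& b) const {
        return a.absDeadline > b.absDeadline;
    }
};
using EdfQueue =
    std::priority_queue<Subjob, std::vector<Subjob>, LaterDeadline>;

struct GlobalScheduler {
    EdfQueue gQueue, dQueue;               // ready queues for GPP and DSP
    std::optional<Subjob> nowGPP, nowDSP;  // subjobs currently executing

    // Called on every scheduling event, after the queues are updated.
    void dispatch() {
        // Steps 18-19: the GPP is preemptive, so the most urgent ready
        // subjob may preempt the one currently running.
        if (!gQueue.empty() &&
            (!nowGPP || gQueue.top().absDeadline < nowGPP->absDeadline)) {
            if (nowGPP) gQueue.push(*nowGPP);  // re-queue the preempted subjob
            nowGPP = gQueue.top();
            gQueue.pop();
        }
        // Steps 20-21: the DSP is non-preemptive; assign only when idle.
        if (!nowDSP && !dQueue.empty()) {
            nowDSP = dQueue.top();
            dQueue.pop();
        }
    }
};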
3.3 Off-Line Schedulability Test Analysis
In our study, we propose an off-line schedulability test analysis, divided into two parts, to determine whether a system with n tasks is schedulable. The first part is used for the GPP and the other for the DSP. If either part fails, the system is unschedulable.
On the General Purpose Processor. As mentioned in Section 3.2, the scheduling algorithm we adopt is based on the EDF algorithm. Hence, the utilization upper bound that determines whether a system is feasible is equal to 1. Theorem 1 gives a complete depiction.
Theorem 1. [14] A system of n independent, preemptable tasks is schedulable upon one processor via the EDF algorithm if the total utilization of the tasks is not more than 1.
Theorem 1 applies to a uniprocessor system without considering the existence of DSP tasks. However, as mentioned above, the subtasks of the same task are viewed as individual and independent tasks with equal period but different deadlines. Therefore, the utilization of subtask $\tau_{i,pre}$ is $u_{i,pre} = c_{i,pre}/\min(d_{i,pre}, T_i)$; the derivations of $u_{i,dsp}$ and $u_{i,post}$ are similar. Besides, because the relative deadlines of subtasks are smaller than the period, the utilization of subtask $\tau_{i,pre}$ simplifies to $u_{i,pre} = c_{i,pre}/d_{i,pre}$. Before extending Theorem 1 to the system we study, we first define some notation: $U_T$ is the total utilization of task set $T$ in the system; $\Phi$ is the set of tasks whose subtasks need to execute on the DSP; $u_{i,dsp} = c_{i,dsp}/d_{i,dsp}$ denotes the utilization of $\tau_{i,dsp}$; and $u_{i,post} = c_{i,post}/d_{i,post}$ denotes the utilization of $\tau_{i,post}$. Because of the characteristics of splitting jobs, we conclude the following corollary that deals with GPP schedulability.
Corollary 1. A system with the task set $T$, where $T$ consists of n independent and preemptable tasks, is schedulable via the EDF algorithm if
$$U_T = \sum_{\tau_i \in \Phi} (u_{i,pre} + u_{i,post}) + \sum_{\tau_i \in T - \Phi} u_i \le 1. \quad (1)$$
Proof. The proof is a direct consequence of the visualization viewpoint discussed after Theorem 1.
On the Digital Signal Processor. We show the schedulability test on the DSP in this subsection. First, we define some basic notation. $C_{max}$ is the maximal blocking time of all non-preemptable subtasks on the DSP, i.e., $C_{max} = \max_{\forall \tau_i \in \Phi} \{c_{i,dsp}\}$. $D_{min}$ is the minimal relative deadline of all non-preemptable subtasks on the DSP, that is, $D_{min} = \min_{\forall \tau_i \in \Phi} \{d_{i,dsp}\}$.
Theorem 2. [15] A system with non-preemptable subtasks is schedulable via the EDF algorithm if
$$\sum_{\tau_i \in \Phi} u_{i,dsp} \le 1 - \frac{C_{max}}{D_{min}}. \quad (2)$$
Therefore, we follow Theorem 2 as our scheduling test on the DSP. By taking the different processors into consideration separately, the advantage is not only enhanced schedulability but also a relaxed bound on the total utilization $U_T$.
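A minimal C++ sketch of the two tests (Equations 1 and 2) is given below, assuming subtask deadlines have already been assigned (e.g., via EQF); the type and function names are our own.

#include <algorithm>
#include <vector>

struct TaskParams {
    double cPre, cDsp, cPost;  // subtask execution times
    double dPre, dDsp, dPost;  // subtask relative deadlines
    double d, T;               // task relative deadline and period
    bool isDspTask() const { return cDsp > 0; }
};

// Equation (1): GPP-side test.
bool gppSchedulable(const std::vector<TaskParams>& ts) {
    double u = 0.0;
    for (const auto& t : ts)
        u += t.isDspTask()
                 ? t.cPre / t.dPre + t.cPost / t.dPost
                 : (t.cPre + t.cPost) / std::min(t.d, t.T);
    return u <= 1.0;
}

// Equation (2): DSP-side test with the non-preemption term Cmax / Dmin.
bool dspSchedulable(const std::vector<TaskParams>& ts) {
    double u = 0.0, cMax = 0.0, dMin = 1e300;
    for (const auto& t : ts) {
        if (!t.isDspTask()) continue;
        u += t.cDsp / t.dDsp;
        cMax = std::max(cMax, t.cDsp);
        dMin = std::min(dMin, t.dDsp);
    }
    return u <= 1.0 - cMax / dMin;  // trivially true when Φ is empty
}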
4 Simulation and Performance Evaluations
The purpose of this section is to evaluate the performance of the proposed scheduling scheme. We develop a simulation model for the scheme and use numerical examples to show its capabilities. We investigate two kinds of experiments; first, we examine whether task sets in the heterogeneous multiprocessor system can be scheduled or not.
4.1 Experiment Environment Setting
In our experiments, the subtasks on each processor are scheduled via the EDF algorithm [11]. We designed a task generator to produce end-to-end periodic tasks and their workload for each set of experiments; a sketch of this generator is given below. The performance of our schedulability test analysis is inspected on a large number of task sets using the schedulability test equations (i.e., Equations 1 and 2); each simulation result is an average over twenty independent simulation runs. The execution times of tasks in a task set are generated by random variables with a uniform distribution, while the utilization of a task is assigned with a Poisson distribution. The task numbers are 10, 20, 30, 40 and 50. The total utilization of the system is first randomly generated between 0 and 1, and then the average utilization of each task is calculated as the seed of the Poisson distribution. We assign the utilization of each task via a Poisson-distributed variable with mean equal to the average utilization of each task. In each experimental task set, the proportion of DSP tasks to total tasks is 0.9. The $c_{i,pre}$ and $c_{i,post}$ are chosen between 1 and 10 with uniform probability, and $c_{i,dsp}$ is assigned in the range of 0.5 to 0.8 of $c_{i,gpp}$. The period is derived by $T_i = (c_{i,pre} + c_{i,dsp} + c_{i,post}) / u_i$. We first examine the performance and correctness of our off-line schedulability test. The performance metric is the average schedulability ratio, where the schedulability ratio is the number of feasible solutions divided by the number of task sets. We compare the evaluation of three different approaches, MDSS [9], RSHD [8] and DPCP [10], as described in Section 1, against our proposed method.
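The following C++ sketch of the task generator follows the description above; the discretization of the Poisson draw in hundredths of utilization is our own assumption, since the paper does not specify how a discrete Poisson variable yields a fractional utilization.

#include <algorithm>
#include <random>
#include <vector>

struct GenTask { double cPre, cDsp, cPost, T; };

// Generate one task set of size n with target total utilization totalU.
std::vector<GenTask> generateTaskSet(int n, double totalU, std::mt19937& rng) {
    std::uniform_real_distribution<double> cDist(1.0, 10.0);   // c_pre, c_post
    std::uniform_real_distribution<double> ratio(0.5, 0.8);    // c_dsp / c_gpp
    std::poisson_distribution<int> uDist(100.0 * totalU / n);  // mean = avg u
    std::vector<GenTask> ts(n);
    for (auto& t : ts) {
        t.cPre  = cDist(rng);
        t.cPost = cDist(rng);
        t.cDsp  = ratio(rng) * (t.cPre + t.cPost);
        double u = std::max(1, uDist(rng)) / 100.0;  // Poisson-drawn utilization
        t.T = (t.cPre + t.cDsp + t.cPost) / u;       // T_i from the utilization
    }
    return ts;
}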
4.2 Simulation Result
Figure 4(a) shows the schedulability ratios under the different compared algorithms. We can observe that when the system load is light
Fig. 4. Performance of different schedulability test methods. (a) Task number is 30; (b) Task number is 50.
(i.e., the total utilization, which is defined as $\sum_i (c_{i,pre}+c_{i,dsp}+c_{i,post})/T_i$, is less than or equal to 0.4), the results of OESA are similar to those of DPCP, MDSS and RSHD. The schedulability ratios start to drop when the total utilizations are equal to 0.4, 0.5, 0.6 and 0.8 under DPCP, OESA, MDSS and RSHD, respectively. We can also observe that the performance of OESA is better than MDSS and DPCP in most situations. However, the performance of the OESA algorithm is worse than that of RSHD when the total utilization of the system is less than 0.9. That is because Equation 1 of our OESA algorithm makes the counted utilization of DSP tasks twice as large as under the other algorithms. Therefore, the performance of the OESA algorithm decays when the total utilization of the system is about 0.5. However, although the point where the schedulability ratio of our OESA algorithm starts decreasing comes earlier than for RSHD and MDSS, the decay of our OESA algorithm is smoother, and it outperforms the others when the total utilization of the system is larger than 0.9. The reason is that we use an EDF-based scheduling algorithm, unlike the compared RM-based algorithms, and the schedulability upper bound of an EDF-based algorithm is higher than that of an RM-based algorithm. Moreover, the previous research only takes into account the total utilization, including DSP activities as blocking time on the GPP. Our proposed method separates the utilization calculation for the GPP and the DSP, respectively. Hence, our method is better when the system has a higher load. When the task number is equal to 50, similar results are obtained, as shown in Figure 4(b).
5 Conclusion and Future Work
In this paper, we propose an online EDF-based scheduling algorithm, named OESA, for a heterogeneous multiprocessor system with a GPP and a DSP. It is motivated by systems that must run tasks in a real-time fashion. A schedulability test is proposed to determine whether the system can schedule the tested task set without missing any deadline. The capability of the proposed algorithm is evaluated by a series of experiments, which show that the proposed algorithm can achieve better performance than other related approaches.
Beyond the features mentioned above, there are still some appealing problems worth investigating in the future. For example, we could derive a more accurate schedulability bound for this dual-core heterogeneous multiprocessor system and extend our OESA algorithm to deal with systems that have multiple GPPs and DSPs.
References
[1] Wiegand, T., Sullivan, G.J., Bjontegaard, G., Luthra, A.: Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 13, 560–576 (2003)
[2] Bosi, M., Brandenburg, K., Quackenbush, S., Fielder, L., Akagiri, K., Fuchs, H., Dietz, M., Herre, J., Davidson, G., Oikawa, Y.: ISO/IEC MPEG-2 advanced audio coding. Journal of the Audio Engineering Society 45, 789–814 (1997)
[3] Intel, http://www.intel.com
[4] TMS320DM6446 Digital Media System-on-Chip – Datasheet. Texas Instruments (March 31, 2008)
[5] Baruah, S.K.: The non-preemptive scheduling of periodic tasks upon multiprocessors. Real-Time Systems 32, 9–20 (2006)
[6] Dolev, S., Keizelman, A.: Non-preemptive real-time scheduling of multimedia tasks. In: Proc. Third IEEE Symposium on Computers and Communications ISCC '98, June 30-July 2, pp. 652–656 (1998)
[7] Lee, C., Lehoczky, J., Siewiorek, D., Rajkumar, R., Hansen, J.: A scalable solution to the multi-resource QoS problem. In: Proc. 20th IEEE Real-Time Systems Symposium, December 1-3, pp. 315–326 (1999)
[8] Kim, K., Kim, D., Park, C.: Real-time scheduling in heterogeneous dual-core architectures. In: Proc. 12th International Conference on Parallel and Distributed Systems ICPADS 2006, vol. 2, p. 6 (July 12-15, 2006)
[9] Gai, P., Abeni, L., Buttazzo, G.: Multiprocessor DSP scheduling in system-on-a-chip architectures. In: Proc. 14th Euromicro Conference on Real-Time Systems, pp. 231–238 (June 19-21, 2002)
[10] Sha, L., Rajkumar, R., Lehoczky, J.P.: Priority inheritance protocols: an approach to real-time synchronization. IEEE Transactions on Computers 39, 1175–1185 (1990)
[11] Liu, C.L., Layland, J.W.: Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the ACM (JACM) 20(1), 46–61 (1973)
[12] Paulin, P.G., Pilkington, C., Langevin, M., Bensoudane, E., Benny, O., Lyonnard, D., Lavigueur, B., Lo, D.: Distributed object models for multi-processor SoCs, with application to low-power multimedia wireless systems. In: DATE '06: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 482–487 (2006)
[13] Kao, B., Garcia-Molina, H.: Deadline assignment in a distributed soft real-time system. IEEE Transactions on Parallel and Distributed Systems 8, 1268–1274 (1997)
[14] López, J.M., Díaz, J.L., García, D.F.: Utilization bounds for EDF scheduling on real-time multiprocessor systems. Real-Time Systems 28(1) (2004)
[15] Liu, J.W.S.: Real-Time Systems. Prentice-Hall, Englewood Cliffs (2000)
A Grid Based System for Closure Computation and Online Service

Wing-Ning Li 1, Donald Hayes 1, Jonathan Baran 1, Cameron Porter 2, and Tom Schweiger 2

1 Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72701 USA
[email protected]
2 Acxiom Corporation, 301 Ward Ave., Conway, AR 72032
Abstract. Record linkage deals with finding records that identify the same real world entity, such as an individual or a business, from a given file or set of files, and has many applications. This problem is also referred to as the entity resolution or record recognition problem. To locate those records identifying the same real world entity, in principle, pairwise record analyses have to be performed among all records. The analytical operations are complex and take a lot of time, and the number of such analyses is quadratic in the number of records given, so the process is very time consuming. To reduce the number of pairwise record comparisons, blocking techniques are introduced to partition the records into blocks, and records in each block are analyzed against one another. One of the effective blocking methods is the closure approach. In this paper, we describe the design and implementation of a parallel and distributed closure prototype system running in an enterprise grid. The system can either produce all closures to a file in a batch fashion or run as a service where, upon receiving a record, it returns the closure of that record. Preliminary experiments indicate the approach is efficient and scalable.
1 Introduction
Determining records that represent the same real world entity is an important and challenging problem with many applications. For instance, it addresses data quality issues such as "data accuracy, redundancy, consistency, currency and completeness" [1]. Ensuring data quality is becoming a critical issue that impacts organizational performance [2,3,4,5]. This problem is also referred to in the literature as the record linkage problem [6], data cleaning problem [7], object identification problem [8], or entity resolution problem [9]. All these research efforts deal with the fundamental question of how to effectively identify record "duplicates" when unique identifiers are unavailable or do not exist in records. The main idea is to rely on matching of other fields in records, such as name, address, and so on. The set of fields chosen is application dependent and is often referred to as keys. The most basic application is to identify duplicates within a single file or between two files. In the single file situation, in principle, each record must be
checked against every other record in the same file in order to find its duplicates. Similarly, in the two-file scenario, each record in one file must be compared against every record in the other file. Both schemes amount to carrying out all pairwise analyses among records and have a time complexity that is quadratic in the number of records (the input size of the algorithm). Since analytical tools are complex and time consuming, each pairwise analysis takes much more time than a simple instruction. For large files having hundreds of millions to billions of records, the performance of such a scheme is unacceptable. To overcome the poor performance, the total number of pairwise record analyses must be reduced. To understand how this could be done, let us consider the case where records are in a single file. Conceptually, each record may be viewed as being associated with a potential set of records from which to find its duplicates. Records not in the potential set are guaranteed not to be duplicates, and therefore need not be compared. Hence, pairwise analyses are only needed between the records in each potential set. For a record, a straightforward way of defining its potential set is to let all other records in the file be its potential set. This leads to quadratic pairwise comparisons. Now imagine that a scheme exists that reduces the potential set from the whole file to a small fraction of it, and that this scheme can be implemented efficiently. What the scheme does is effectively partition the records in the input file into many relatively small groups within which all pairwise record analyses are performed. A scheme that reduces the record pairs to be compared is called blocking in the literature [10]; a simple illustration is sketched below. The closure operation, whose definition is given in the preliminary section, is one of the blocking schemes. The closure operation is useful not only in reducing the number of pairwise comparisons in single file record linkage applications but also in applications that involve multiple files. For instance, sometimes two files are provided as input, where file A has the distinct records (no duplicates) and file B contains potential "duplicate" records of file A, and the goal is to find the "duplicate" records in file B for each record in file A. To apply the same idea introduced earlier, for each record in file A, we need to know its potential set or closure in file B. One way to solve this problem is to combine the two files into a single file and then perform the closure (blocking or partitioning) operation. Once this is done, each record in file A is associated with a closure (a potential set) that could contain some records from file B. Notice that records in file B are partitioned into closures and each closure could be associated with some record in file A. To determine its duplicates in file B, a record of file A only needs to be checked against those records of file B that are in the same closure to which the record of file A belongs. Other record linkage applications that involve multiple files can use the closure idea to reduce the number of record analyses in a similar way. Since the primary objective of a blocking scheme such as closure is to reduce the time and space complexity of the overall record linkage process, the blocking scheme itself must be implemented efficiently. To this end, it is necessary to investigate the question of how to design efficient algorithms and their implementations to realize any proposed blocking method.
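As a minimal illustration of blocking (our own sketch, not code from the paper), the following C++ fragment groups records by a single key; pairwise comparison then runs only inside each block instead of across the whole file.

#include <string>
#include <unordered_map>
#include <vector>

struct Record { std::string id, name, address; };

// Group records by one key (here, name); records sharing a name value
// land in the same block.
std::unordered_map<std::string, std::vector<const Record*>>
blockByName(const std::vector<Record>& records) {
    std::unordered_map<std::string, std::vector<const Record*>> blocks;
    for (const auto& r : records) blocks[r.name].push_back(&r);
    return blocks;
}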
For the closure method,
efficient sequential algorithms are proposed and empirical studies are carried out in [1]. A parallel and distributed algorithm is proposed and implemented in MPI [11,12,13]. A graph-theoretic view of the closure problem is also introduced in [1,11]. This paper adapts the parallel and distributed closure algorithm proposed in [11] to a service-based grid environment. In this paper, a grid-based closure prototype system is reported. The system is developed in C++ with pthreads and CORBA and runs in an enterprise grid. The system can either produce a closure file in a batch fashion or run as a service where, upon receiving a record, it returns the closure of that record. Preliminary experiments indicate the approach is efficient and scalable.
2 Preliminary and Background
The development of the prototype system is based on previous research efforts on the closure problem [1,11,12,13], where the reader will find more thorough discussions of the transitive closure problem and of the proposed parallel and distributed algorithm that has been adopted and implemented in the grid architecture. A brief description is provided below to make the paper self-contained.
2.1 The Closure Problem
Since the application of a blocking scheme such as transitive closure is basically the same for a single file or multiple files, in the sequel a single file is assumed. A file contains a sequence of records. Each record contains a sequence of fields. Certain fields or some combinations of fields are chosen as keys. The selection of keys depends on the type of record linkage and could vary from application to application. Once keys are selected, two records are defined as directly related if for some key the two records have the same value. For example, if the name and address fields are the keys, two records A and B with the same name value "John Doe" and different address values "723 main street" and "123 main street", respectively, are directly related because of the same value "John Doe" in the name key. Two records X and Y are defined as transitively related if there is a sequence of records of which each adjacent pair is directly related and X and Y are directly related to the first and last records in the sequence, respectively. Let us expand the above example by introducing record C, where C has name value "Jane Doe" and address value "123 main street". In the example, A and C are transitively related because A and B are directly related (through the name key with value "John Doe") and B and C are directly related (through the address key with value "123 main street"). In this simple example, the sequence in the definition of transitive relatedness has only one record (record B), even though in reality the sequence may contain many records. Two records are defined as related if and only if they are either directly related or transitively related. Based on the relatedness just introduced, records in the input file are partitioned into groups in the closure (transitive closure) computation, where all related records are in one group, which is called a closure. Typically, a closure is a set of record identifiers
instead of a set of physical records with all the fields. Record identifier and record are used interchangeably in the sequel. Given a file of records and chosen keys as the input to the transitive closure problem, in the batch-processing mode the output is a set of closures. In a service-oriented mode, a file of records and chosen keys are given once to set up the closure service. Afterwards, each record in the file may be sent to the service; upon receiving a record, the service returns the closure associated with the record. The design and implementation of the closure software to be presented support both modes of operation; an illustrative interface is sketched below. Even though the transitive closure problem is defined in terms of a file of records containing many fields, of which some are chosen as keys, the input to the prototype system is not a file of records. Instead it is a file of pairs of record identifiers. Each pair of record identifiers signifies that the corresponding records are directly related. For any two directly related records, at least one pair of the corresponding record identifiers exists in the input file. The reason for the change of input is as follows. As observed in [11,13], the process of parallel and distributed computation of closure may be broken into two steps: in step one, all directly related pairs of records are computed; in step two, upon getting the result from step one, all closures are computed. Since step one can be computed efficiently using a parallel and distributed processing algorithm that is scalable [11,13], this research effort focuses on step two. For that reason, the input to the prototype is a file of pairs of record identifiers. Conceptually, the parallel and distributed solutions of steps one and two can easily be combined into a single system where the input is a file of records.
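The following C++ sketch illustrates the two modes as an abstract interface; it is our own illustration, not the system's actual CORBA interface definition.

#include <string>
#include <vector>

// Set up once with the input file of directly related identifier pairs;
// then either emit all closures (batch mode) or answer per-record
// closure queries (service mode).
class ClosureService {
public:
    virtual ~ClosureService() = default;
    virtual void setup(const std::string& pairFile) = 0;
    // Service mode: return the identifiers in the given record's closure.
    virtual std::vector<std::string> closureOf(const std::string& recordId) = 0;
    // Batch mode: write every closure to the output file.
    virtual void writeAllClosures(const std::string& outFile) = 0;
};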
2.2 The Basic Parallel and Distributed Algorithm
The basic parallel and distributed algorithmic idea and the algorithm that the prototype uses are provided in [11,12,13,14]. The main ideas are recapped as follows. First, hashing is used to evenly assign records to processors conceptually. A record identifier is an alphanumeric string; the string's hash value modulo the number of processors is the ID of the processor that owns the record. Records owned by a processor are called local records with respect to that processor. Once this is done, record pairs (directly related records) in the input file are classified with respect to each processor as local pairs, global pairs, and irrelevant pairs. A record pair is said to be a local pair of a processor if both records in the pair are local to the processor, a global pair if only one of the records is local to the processor, and an irrelevant pair if neither is local to the processor. In parallel, processors read the record pairs from the input file, where local pairs are processed immediately using disjoint set find and union data structures [15] (the two records are directly related and should be in the same closure, or the same set), global pairs are stored so they may be processed after all the pairs are read, and irrelevant pairs are ignored (note that an irrelevant pair to one processor is always a local pair or a global pair to some other processor). The disjoint set data structures maintain that the local records that belong to the same closure
are grouped together as a set, called a local cluster. Once the reading of the input file is finished, each processor in parallel goes through its global pairs to generate new local pairs and new global pairs for other processors. Then, in parallel, the processors interchange the new local pairs and global pairs among themselves. The new local pairs are processed first using the disjoint set data structures, which allows more local records in a closure to be grouped together (combining local clusters determined in the previous iteration). Once the new local pairs are processed, the new global pairs are examined to generate new local and global pairs for the next phase in the iteration. After that, the global pairs are collapsed or reduced. Each phase is synchronized, and within a phase communication and computation among the processors are asynchronous, or concurrent. The number of iterations is logarithmically related to the number of processors [11]. Note that after processing the local pairs, no local pair is stored. And after processing the global pairs, each local cluster has at most one global pair from its processor to another processor; the rest of the global pairs are thrown away. The existence of global pairs indicates that records in a closure were assigned to different processors by the hashing scheme. At the end, all local records belonging to a closure form a single local cluster, and local clusters in different processors are linked to each other by global pairs to establish a closure. The disjoint sets (local clusters) and the associated global pair structures allow the generation of the closure file and support the closure query service. The above algorithm is realized in the prototype, whose design is given next; a sketch of the core building blocks follows. For a more detailed description and examples of the algorithm, the reader is referred to [11,12,13,14]. One more thing should be noted before describing the design. The efficient implementation of disjoint set find and union requires that integer values be the members of the set [1,15]. On the other hand, record identifiers are alphanumeric strings in general, and this is particularly true for real data files, where record identifiers contain both letters and digits. To bridge the gap, a mapping scheme is needed that maps each alphanumeric string to an integer. The mapping scheme is also used to convert pairs of strings to pairs of integers. Once the conversion is done, the computation of closures is based on integer pairs, which have a much smaller memory footprint and reduce computation time as well as communication time of the parallel and distributed algorithm outlined earlier. An empirical study of the time reduction in computation and communication from using integer pairs is conducted in [11]. Even though the mapping operation is expensive, as shown in the experimentation (Setup pair), the time and space reduction it brings to the overall computation outweighs its initial cost.
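The core building blocks, sketched below in C++ under our own naming, are the hash-based ownership rule, the pair classification, and a disjoint-set structure over local integer ids; the choice of std::hash is illustrative, since the paper does not name a hash function.

#include <functional>
#include <numeric>
#include <string>
#include <vector>

// Owner of a record: hash of its identifier modulo the processor count.
int owner(const std::string& recordId, int numProcs) {
    return int(std::hash<std::string>{}(recordId) % size_t(numProcs));
}

// Classification of a pair with respect to processor `me`.
enum class PairKind { Local, Global, Irrelevant };
PairKind classify(const std::string& a, const std::string& b,
                  int me, int numProcs) {
    int mine = (owner(a, numProcs) == me) + (owner(b, numProcs) == me);
    return mine == 2 ? PairKind::Local
         : mine == 1 ? PairKind::Global
                     : PairKind::Irrelevant;
}

// Disjoint-set find and union over local integer ids, with path
// compression; each set is a local cluster.
struct DisjointSets {
    std::vector<int> parent;
    explicit DisjointSets(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }
    void unite(int a, int b) { parent[find(a)] = find(b); }  // merge clusters
};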
3 Architecture Design
The system runs on a set of grid nodes, which may conceptually be thought of as the processors of a parallel computer. Multiple threads are used to carry out the processing in each grid node. The system uses a client-server architecture to implement the parallel and distributed closure algorithm. In particular, the architecture provides non-blocking communication between processors so that
computation and communication can be concurrent within a phase of the algorithm. The client and server are separate threads in the same grid node so that computation and communication are carried out concurrently. The current design uses one thread for the client and one thread for the server. Each grid node has a server, which is a remote object to clients running on other grid nodes. The server provides remote procedure calls for clients. These procedures allow a remote client to send the new local and global pairs derived in each iteration (pushing the data to the grid node running the server) and to query the string-to-integer mapping information of the local records of the grid node. Each grid node spawns a client thread and keeps the main thread running as a server. The client thread does the majority of the processing and makes remote procedure calls to servers running on other grid nodes. The server thread handles incoming requests from remote clients and delivers the data to the client thread via memory shared between the two threads. Each grid node has three major data structures: one for mapping between local records and integers from 1 to n (n is the total number of records), one for supporting disjoint-set find and union, and one for relating global pairs and local clusters. The current design delegates the mapping structure to the server thread and the other two structures to the client thread. It is conceivable that the mapping operation may be separated out as a grid service to reduce the memory requirement of each grid node in the closure computation; the current design makes that transition easier to implement. For batch processing, the system goes through three phases: the assignment phase, the closure computation phase, and the output phase. For service processing, the only change is that the output phase is replaced by a repeated query-processing phase. In the assignment phase, the local records of each grid node are determined and mapped to integers. At the end of the assignment phase, a single communication occurs in which all grid nodes communicate the largest integer used in their local maps. This information is used to establish a global map consisting of all the local maps (each grid node adds an offset to its local map). The global map is used to map string pairs to integer pairs. The information is needed so that a remote record (an integer) can be mapped back to the grid node to which it is assigned. The client thread of each grid node sends its largest-integer-used information to the servers (server threads) on all other grid nodes. Once a grid node receives this communication from all other grid nodes, it has the assignment information, and the next phase begins. During the closure computation phase, in each grid node, the client thread goes through the record pairs (first local pairs, then global pairs) of the current iteration, combining local clusters, generating local and global pairs for other grid nodes, and sending the newly generated pairs to other grid nodes (remote servers) for the next iteration. The server thread receives record pairs from remote clients and gathers all the pairs received so that they are delivered when the client is ready for the next iteration.
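As an illustration of the offset bookkeeping just described (the function names are invented for this sketch):

    #include <vector>

    // maxUsed[k] is the largest integer used in node k's local map, as
    // communicated by every node at the end of the assignment phase.
    int myOffset(const std::vector<int>& maxUsed, int me) {
        int off = 0;
        for (int k = 0; k < me; ++k) off += maxUsed[k];  // global ID = local ID + offset
        return off;
    }

    // Route a global integer ID back to the grid node that owns it.
    int owningNode(const std::vector<int>& maxUsed, int globalId) {
        int upper = 0;
        for (int k = 0; k < (int)maxUsed.size(); ++k) {
            upper += maxUsed[k];
            if (globalId <= upper) return k;             // local IDs start at 1
        }
        return -1;                                       // not a valid ID
    }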
The problem of handling communication between multiple, simultaneously running programs in each grid node is solved by using multiple threads. The client thread is used for processing, while the server thread is used to handle incoming messages. The server thread handles incoming messages by storing the data (pairs) in a temporary buffer and then writing them to a shared data structure at the end of a closure computation phase. This allows the client thread to continue processing while information is being received. This threading approach realizes a significant practical improvement over the single-threaded MPI implementation in [11]. The communication between grid nodes is logically a fully connected network: each grid node communicates with every other grid node. More detailed discussions of the design are given in [14].
4 System Implementation
The software is implemented in the C++ programming language using a multithreaded client-server architecture. Each grid node runs the same copy of the program. The POSIX Pthreads library is used to implement multithreading, and a CORBA library is used to implement the remote objects, allowing communication between servers and clients running on different grid nodes. Each grid node has a client thread that does the bulk of the processing. A separate server thread is started in each grid node in order to asynchronously receive data from remote grid nodes. The specific communication API used is CORBA. CORBA was selected to allow the software to conform to the current grid architecture of the project sponsor and to be more easily adapted to a service-oriented architecture in the future (for instance, a closure query service). The total number of grid nodes and the ID of the current grid node must be provided to the program as command-line parameters. A potential improvement would be to move this information to a configuration file; in either case, the number of running grid nodes is known. The total number of grid nodes and the ID of the current grid node are required for the implemented algorithm to perform correctly. MD5 is used as the hash function in the assignment phase. The mapping structure is implemented using the C++ map class in the Standard Template Library. The mapping object maps the alphanumeric strings (record identifiers) to integers. Records may be inserted into the object; after all initial records are inserted, the object must be finalized. This finalization is done for performance reasons, as the object is read-only after finalization and does not require further locking. The local server reads this object in order to map incoming global records from strings to integers. During the output phase, a reverse map is also constructed to allow the string records to be output after completion of the entire algorithm, or to support the closure service. A two-pass scan of the input file is used in this particular implementation. The first scan assigns records to grid nodes and establishes the mapping object in each grid node. The second scan converts string pairs from the input file to integer pairs. The mapping object of each grid node, which is established in
the first scan, is used to realize the conversion step. To be more specific, the local mapping object is consulted twice to convert a local pair; the local mapping object and a remote mapping object are consulted to convert a global pair. Since consulting remote mapping objects involves remote procedure calls, a bulk lookup was implemented and used to improve efficiency. In bulk lookup, an array of strings is provided as input and an array of integers is returned as output. More detailed discussions of the implementation are given in [14].
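The bulk lookup can be pictured with the following sketch, in which a local map stands in for what is really a remote CORBA object; the point is that one call converts a whole batch, amortizing the round-trip latency. The names and sample data are illustrative.

    #include <map>
    #include <string>
    #include <vector>

    // Local stand-in for a remote mapping object.
    std::map<std::string, int> mappingObject = {
        {"rec-A", 1}, {"rec-B", 2}, {"rec-C", 3}};

    // One call converts a whole array of record-identifier strings.
    std::vector<int> bulkLookup(const std::vector<std::string>& ids) {
        std::vector<int> out;
        out.reserve(ids.size());
        for (const std::string& s : ids) {
            auto it = mappingObject.find(s);
            out.push_back(it == mappingObject.end() ? -1 : it->second);
        }
        return out;
    }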
5 Experimental Results
Preliminary tests were conducted for the system. The tests were run on dual-processor Intel Pentium III 1.266 GHz machines with 4 GB of memory, using 100 Mbit Ethernet for the grid node connection. The operating system was CentOS 4.5 with the 2.6.9-55.0.2.ELsmp Linux kernel. Two grid nodes were run on each machine. In Tables 1 to 3, each column except the first identifies a grid node and shows the resources used by that grid node in various steps of the computation. The second row shows the peak memory usage; for example, in Table 1, "Memory at 8M" means the global pairs array was allocated to contain 8 million record pairs, and node 0's peak memory usage is 276 MB. As mentioned in [14], CORBA does not define or make available its memory allocation and deallocation routines; as such, the memory usage is more indicative of the implementation than of the algorithm. All times are in seconds. "Max Pair" is the largest number of global pairs that existed at any point in the processing, though this usually occurred during the first pass. "Scan Time" gives the time to complete the initial pass of reading the file. "Pair Set-up" is the time to convert string pairs to integer pairs and to process the initial local pairs. "Compute Closure" is the time to process the initial global pairs and to generate and send new local and global pairs in all iterations.

Table 1. Synthetic data of 7.3 million pairs

Node ID           0          1          2          3          4          5
Memory at 8M      276 MB     274 MB     281 MB     268 MB     280 MB     271 MB
Max Pair          2,130,875  2,044,955  2,204,346  1,837,045  2,317,816  1,968,969
Scan Time         14         13         14         16         16         16
Pair Set-up       27         28         28         25         30         27
Compute Closure   11         10         11         14         9          12
Total Time        52         51         53         55         55         55
Table 1 contains the results of a synthetic data set running on six grid nodes. Tables 2 and 3 contain the results from running a real data set on two and six grid nodes. The average runtime on two grid nodes is 423 seconds, compared to 345 seconds on six grid nodes.
Table 2. Real data of 26.9 million pairs on 2 nodes

Node ID           0           1
Memory at 14M     1.8 GB      1.8 GB
Max Pair          13,549,308  13,549,308
Scan Time         159         159
Pair Set-up       225         225
Compute Closure   39          39
Total Time        423         423
With two grid nodes the memory usage is 1.8 GB per node, but only 1.1 GB per node when using six grid nodes. This shows the speed and memory benefits of parallelism.

Table 3. Real data of 26.9 million pairs on 6 nodes

Node ID           0          1          2          3          4          5
Memory at 14M     1.1 GB     1.1 GB     1.1 GB     1.1 GB     1.1 GB     1.1 GB
Max Pair          7,482,705  7,481,988  7,488,032  7,474,246  7,477,869  7,489,096
Scan Time         101        101        102        102        98         99
Pair Set-up       179        178        173        179        180        183
Compute Closure   62         66         70         67         64         64
Total Time        342        345        345        348        342        346
Note that in all tests a large portion of the time is taken by the conversion of alphanumeric string pairs to integer pairs. Note also that in the two-grid-node test both grid nodes are on the same machine, so the measured time (its communication portion) could be shorter than that of two grid nodes running on different machines. The closure query service uses the results and structures of the closure computation and has two major steps. Upon receiving a record identifier for closure service, the service grid node uses the hash function to determine which grid node owns the record as a local record. Once the grid node is determined, the service grid node performs an initial fetch by sending the record identifier to that grid node, which returns a list of (processor ID, remote key) pairs. The processor ID identifies a processor that has some of the records in the closure as its local records; the remote key is the corresponding mapped integer value of one local record in the closure. The list is derived from the mapping object and the global pairs of the local cluster to which the queried record identifier belongs. Next, for each pair in the list, the service grid node sends the remote key to the processor identified by the processor ID, which returns all the record identifiers of the local cluster to which the remote key belongs. The returned information is derived from the disjoint-set and mapping object structures. When every queried processor has returned, the service grid node has all the records in the closure corresponding to the query record identifier and returns them.
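In outline, the two-step query could be coded as in the following sketch; the three helper functions are local stand-ins for what are remote calls in the prototype, and all names and values are illustrative.

    #include <string>
    #include <utility>
    #include <vector>

    int hashOwner(const std::string& id) {            // stands in for MD5 mod #nodes
        return (int)id.size() % 6;
    }
    std::vector<std::pair<int, int> > initialFetch(int, const std::string&) {
        return {{0, 1}, {3, 7}};                      // (processor ID, remote key) pairs
    }
    std::vector<std::string> fetchLocalCluster(int, int) {
        return {"rec-X", "rec-Y"};                    // members of one local cluster
    }

    std::vector<std::string> queryClosure(const std::string& recordId) {
        int owner = hashOwner(recordId);              // step 1: locate the owning node
        std::vector<std::pair<int, int> > links = initialFetch(owner, recordId);
        std::vector<std::string> closure;
        for (const auto& link : links) {              // step 2: gather each linked cluster
            std::vector<std::string> members =
                fetchLocalCluster(link.first, link.second);
            closure.insert(closure.end(), members.begin(), members.end());
        }
        return closure;
    }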
The closure query service is evaluated with a synthetic data set, the same data set on which Table 1 is based. It has about two million records and 7.3 million pairs. This data was processed by the transitive closure solution and subsequently queried through the service addition, selecting a known closure of 1,472 members for the purpose of timing comparison. The timing method consisted of repeating a request for an entry 1,000 times and recording the average request time for the fetch process. This was performed for multiple members of the closure, using one to six processors to maintain the distributed closure. Both the initial fetch, which returns the (processor ID, remote key) pairs for a given record identifier, and the subsequent queries for the transitive closure components were timed. The initial fetch was noted as negligible, as it requires no more than one value search into local data and returns at most 2n integers, with n being the number of processes (nodes in the grid) in the system. The time spent for a query to each processor was averaged over the 1,000 attempts and then grouped by processor count to record the average amount of time spent on each processor.

Table 4. Closure queries to 2 million records (1,472 records retrieved)

Nodes to maintain closure        1       2       3       4       5       6
Average fetch time               0.0420  0.0730  0.0720  0.0680  0.0560  0.608
Average busy time per node       0.0420  0.0365  0.0240  0.0170  0.0112  0.0101
Reduced work time (tn − tn−1)    −       0.0055  0.0125  0.0070  0.0058  0.0011
As shown in Table 4, independent of the number of grid nodes used, the average fetch time remains more or less the same. This is expected, as the service node runs a sequential process, which is the bottleneck. However, the average busy time per node decreases as more nodes are used, suggesting that the nodes holding the closure solution are not very busy when more nodes are used. This in turn suggests that multiple service nodes could query the closure solution in parallel while maintaining the average fetch time for each service node.
6 Conclusions
A parallel and distributed transitive closure prototype system running in an enterprise grid was developed. The system can speed up various record linkage applications. The design and implementation are based on a novel parallel and distributed closure algorithm, multithreading, and a CORBA-based client-server model. The system can either produce all closures to a file in batch fashion or run as a service that, upon receiving a record, returns the closure of that record. Preliminary experimental results are encouraging and seem to indicate that the approach is efficient and scalable.
A more thorough evaluation of the prototype is needed in the future so that it may be fine-tuned and enhanced for efficiency and scalability.

Acknowledgment. This research was supported in part by Acxiom Corporation through the Acxiom Laboratory for Applied Research.
References

1. Li, W., Zhang, J., Bheemavaram, R.: Efficient algorithms for grouping data to improve data quality. In: Proc. 2006 International Conference on Information and Knowledge Engineering, Las Vegas, pp. 149–154 (2006)
2. Ballou, D., Wang, H., Pazer, G.: Modeling information manufacturing systems to determine information product quality. Management Science 44(4), 462–484 (1998)
3. Ballou, D.: Enhancing data quality in data warehousing environments. Comm. ACM 42(1), 73–78 (1999)
4. Delone, W., Mclean, E.: Information systems success: The quest for the dependent variable. Information Systems Research 3(1), 60–95 (1992)
5. Redman, T.: The impact of poor data quality on the typical enterprise. Comm. ACM 41(2), 79–82 (1998)
6. Fellegi, I., Sunter, A.: A theory for record linkage. Journal of the American Statistical Association 64, 1183–1210 (1969)
7. Do, H., Rahm, E.: COMA – a system for flexible combination of schema matching approaches. In: Proc. ACM SIGKDD '02 (2002)
8. Tejada, S., Knoblock, C.A., Minton, S.: Learning domain-independent string transformation for high accuracy object identification. In: Proc. Very Large Data Bases 2002, pp. 610–621 (2002)
9. Benjelloun, O., Garcia-Molina, H., Su, Q., Widom, J.: Swoosh: A generic approach to entity resolution. Tech. rep., Stanford University (2005)
10. Baxter, R., Christen, P., Churches, T.: A comparison of fast blocking methods for record linkage. In: ACM SIGKDD 2003 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Washington, DC, pp. 25–27 (2003)
11. Bheemavaram, R.: Parallel and distributed grouping algorithms for finding related records of huge data sets on cluster grids. M.S. thesis, University of Arkansas (2006)
12. Li, W., et al.: Parallel and distributed grouping algorithms for finding related records of huge data sets on cluster grids. In: Proc. ALAR Conference on Applied Research in Information Technology, Conway (2007)
13. Li, W., Bheemavaram, R., Zhang, J.: Transitive closure of data records. In: Chan, Y., Talburt, J., Talley, T. (eds.) Data Engineering: Mining, Information and Intelligence, pp. 39–74. Springer, New York (2010)
14. Hayes, D.: A CORBA-based distributed and multithreaded algorithm for finding related records in a large data set. M.S. thesis, University of Arkansas, Fayetteville, Arkansas (2008)
15. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms. McGraw-Hill, Cambridge (2002)
A Multiple Grid Resource Broker with Monitoring and Information Services∗ Chao-Tung Yang∗∗, Wen-Jen Hu, and Bo-Han Chen Department of Computer Science, Tunghai University, Taichung, 40704, Taiwan (ROC) [email protected]
Abstract. Grid computing is now in widespread use; it integrates geographically distributed computing resources across multiple virtual organizations to achieve high-performance computing. We build a resource broker on a multiple-grid environment, which integrates a number of single grids from different virtual organizations without cross-organization limits, so that the multiple grid resources can be used efficiently and waste of resources avoided. In addition, we propose a Multi-Grid Resource Selection Strategy for the resource broker that selects a better allocation of resources before submitting a job, avoiding the performance degradation caused by network congestion. Keywords: Multi-Grid, Resource Broker, MGRSS.
1 Introduction

In a Grid environment, applications make use of shared Grid resources to improve performance. The target function usually depends on many parameters, e.g., the scheduling strategies, the configurations of machines and links, the workloads in the Grid, the degree of data replication, etc. [3, 4]. In this work, we examine how those parameters may affect performance. We choose an application's overall response time as the objective function and focus on dynamically scheduling independent tasks. We define the job, the scheduler, and the performance model of a Grid site, and conduct experiments on the Multi-Grid platform [1]. We propose a multi-grid resource selection strategy, called "MGRSS". The strategy helps users select the machines with better performance in order to shorten the execution time of programs; furthermore, each grid user is given a quota of resources. Experimental results show that MGRSS exhibits better performance than other strategies [7].
∗ This work is supported in part by the National Science Council, Taiwan R.O.C., under grants no. NSC 96-2221-E-029-019-MY3 and NSC 98-2218-E-007-005.
∗∗ Corresponding author.

2 System Design and Implementation

2.1 Resource Broker

The system architecture of the resource broker is shown in Fig. 1. Users can easily make use of our resource broker through a common Grid portal [5, 6, 7-11]. The
primary task of the Resource Broker is to match the requests of users against the resource information provided by the Information Service [2, 13]. After choosing an appropriate job assignment scheme, Grid resources are assigned, and the Scheduler is responsible for submitting the job. The results are collected and returned to the Resource Broker. Then, the Resource Broker records the results of the execution in the database of the Information Center through the Agent of the Information Service. The user can query the results from the Grid portal. Our resource broker architecture includes four layers: the web portal, the resource brokering subsystem, the multi-grid manager center, and the multi-grid resources. The multi-grid resource layer consists of many single-grid environments.
Fig. 1. Resource Broker system architecture
2.2 Cross Grid Authentication Service

Globus Toolkit authentication is based on certificates issued by GSI. Each user and service is certified, to identify and authenticate the trust between users or services [12, 17]. If two parties have certificates and both of them trust the CAs, then these two parties can trust each other; this is known as mutual authentication. Each certificate carries a subject name, which identifies the person or object that the certificate represents. The cross grid authentication service manages the certificates and the subjects of the grids, and it obtains this information from the registrations made via the web portal.
All nodes in the multi-grid environment must set up the cross grid service tool, which is written as a shell script program, as shown in Fig. 2. The cross grid service tool contains procedures for regular automatic updates with the multi-grid manager center, covering the IP addresses and domains of the host list, the certificates, and the subjects. Any single-grid user in our multi-grid environment makes use of the multi-grid resources through this tool, which synchronizes with the multi-grid manager center. The primary task of the cross grid information service is to gather the IDL file, which contains several attributes of the hosts on the grid; the details are described in the next section.
Fig. 2. Multi-Grid Manager Center
2.3 Cross Grid Information Service

The CGIS consists of three layers: the Core Service Layer, the Translator Layer, and the Resource Layer, as shown in Fig. 3. The Core Service Layer contains the Agent, Filter, Getter and Setter, and Gather. These components are installed in every grid environment for information gathering and maintenance. The Translator Layer supports format conversion for a variety of monitoring tools, such as Ganglia, Cacti, and MDS; the information from the different monitoring tools is transformed into the IDL format. The Resource Layer describes resource information from the different grid environments; in this study, it includes the Tiger Grid, Medical Grid, and Bio Grid [14, 15, 16]. We define a new information format to translate between the different formats. The proposed format is called IDL (Information Describing Language). It is responsible for exchanging and translating resource information among grids, and for performing post-transfer filtering to ensure that only necessary information is passed to clients, end users, and software components.
Fig. 3. Cross Grid Information Service architecture
2.4 Web Portal

Through the resource broker, the user can execute a parallel program, choose the hosts required for its execution after submitting the job, as shown in Fig. 4, and get the result from the Job Monitor Service. The page shows many detailed messages about the execution, such as the scheduling log, machine list, result message, running log, debug message, and turnaround time, as shown in Fig. 5. We enhanced the graphical display of the monitoring function, using the JFreeChart [18] tool to draw more meaningful graphics, including Job Monitor, Login Info, and Utilization of Resource views. JFreeChart is a free, 100% Java chart library that makes it easy for developers to display professional quality charts in their applications.
Fig. 4. Job Monitor
Fig. 5. Job Monitor
3 Multi-Grid Resource Selection Strategy

3.1 Parameters and MGRSS Algorithm

In this section, the parameters used in the algorithm are listed and explained:

• jobi: the ith job dequeued from the job queue, where i = 1..n. The job contains information such as the job name, program name, program arguments, input data, and the number of processors to run on. The program is usually a parallel program written with the MPI library and compiled by MPICH-G2, a Grid-enabled implementation of the MPI standard. The Resource Broker allocates resources according to the information provided by the job.
• NPmGrid: the total number of processors on the multiple grid.
• NPlGrid: the total number of processors on the local grid.
• NPmf: the total number of processors with idle status on the multiple grid.
• NPlf: the total number of processors with idle status on the local grid.
• NPmax: the maximum number of processors on the multiple grid.
• NPvalid: the number of available processors on the multiple grid.
• NPreq: the processors used for executing jobi; if the Resource Broker dispatches jobi successfully, the resources distributed over several nodes or sites and used for jobi are locked until jobi finishes.
• Smax: the highest score among the machines.
• NPvalid: the number of processors for which a user may apply, given by equation (1):

    NPvalid = …    (1)

• Loadi: the weighted sum of the Load1min, Load5min, and Load15min values of Nodei:

    Loadi = (3 · Load1min + 2 · Load5min + Load15min) / 6    (2)

where Load1min, Load5min, and Load15min are the load averages over the last one, five, and fifteen minutes, respectively. Each term has a different weight: the more recent the measurement, the higher its weight.

• PEi: the computing performance efficiency of the host, where HPLi is the benchmark value of the host obtained by benchmarking:

    PEi = …    (3)

• Flowi: the average network flow of the host:

    Flowi = ( Σj=1..N (Byte_inj + Byte_outj) / N ) / 1024²    (4)

where Byte_in and Byte_out are obtained from the IDL and estimate the network flow through these two parameters. We also consider that the other machines in the same domain affect the flow of the entire site, where N is the number of nodes in the same domain; therefore the average over all nodes in the same domain is used. The division by 1024² expresses Flowi in megabytes per second.

• TPi: the total performance power:

    TPi = (1 − β) · PEi + β · Flowi    (5)

where β is the effect ratio used to regulate the relative proportions of PE and Flow.

The main policy of MGRSS is that a single grid selects resources by itself when it possesses sufficient resources; otherwise it selects resources from the multiple grid. The number of available resources varies according to the local grid's own resources.

3.2 MGRSS Flowchart

Job submission is the most important part of the Resource Broker. Among the available resources, choosing the appropriate resource for an assignment is a very important topic, because a fine resource selection strategy can shorten the execution time of jobs and improve the overall performance of the Multi-Grid environment. For these purposes, we provide a Multi-Grid Resource Selection Strategy, called MGRSS, as shown in Fig. 6.
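Before walking through the flowchart, the ranking step implied by equations (2) and (5) can be sketched as follows. This is an illustration only: the field names are invented, and it assumes that β weights the network term, consistent with MGRSS3 (β = 0.7) being the network-heavy level described in Section 4.

    #include <algorithm>
    #include <vector>

    struct Host {
        double load1, load5, load15;  // 1/5/15-minute load averages from the IDL
        double pe;                    // performance efficiency PEi, equation (3)
        double flow;                  // average network flow Flowi, equation (4)
    };

    double loadIndex(const Host& h) {                   // equation (2)
        return (3 * h.load1 + 2 * h.load5 + h.load15) / 6;
    }

    double totalPower(const Host& h, double beta) {     // equation (5), as reconstructed
        return (1 - beta) * h.pe + beta * h.flow;
    }

    // Sort so that the broker tries the highest-scoring host first.
    void rankHosts(std::vector<Host>& hosts, double beta) {
        std::sort(hosts.begin(), hosts.end(),
                  [beta](const Host& a, const Host& b) {
                      return totalPower(a, beta) > totalPower(b, beta);
                  });
    }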
Fig. 6. Multi-Grid resource selection strategy flowchart
First, a job is dequeued from the job queue for scheduling. Second, the scheduler obtains the latest information from the IDL file of the multi-grid information server to determine whether the number of requested processors is smaller than the total number of processors of the Multi-Grid. If not, the scheduler discards the job. If so, the scheduler determines whether to use the single-grid or the multi-grid strategy based on the number of requested processors. For the single-grid strategy, the number of requested processors must be smaller than the number of free processors of the local grid; otherwise the job is enqueued and waits for five minutes, and our scheduler resubmits the job when the waiting time exceeds five minutes. If the number of requested processors is still larger than the number of free processors of the local grid, we adopt the multi-grid strategy. For the multi-grid strategy, if the number of requested processors is larger than the number of valid processors of the multiple grid, the scheduler discards the job; otherwise, if it is smaller than the number of free processors of the multiple grid, the job is dispatched under the multi-grid strategy.
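The decision flow can be condensed into a small function; this sketch is an interpretation of the flowchart (the names are invented, and the five-minute wait is reduced to a flag):

    enum Decision { RUN_LOCAL, RUN_MULTI_GRID, ENQUEUE_AND_RETRY, DISCARD };

    Decision dispatch(int npReq, int npLocalFree, int npMultiGridTotal,
                      int npMultiGridValid, bool waitedFiveMinutes) {
        if (npReq > npMultiGridTotal) return DISCARD;           // can never be satisfied
        if (npReq <= npLocalFree)     return RUN_LOCAL;         // single-grid strategy
        if (!waitedFiveMinutes)       return ENQUEUE_AND_RETRY; // wait, then resubmit
        if (npReq > npMultiGridValid) return DISCARD;           // multi-grid cannot satisfy it
        return RUN_MULTI_GRID;                                  // multi-grid strategy
    }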
4 Experimental Results

In this section, we present four experimental results. The first two experiments measure the execution time of specific programs, using different metrics, under the five strategy environments. In the third experiment, we execute all the parallel programs. The last experiment randomly selects twenty different types of parallel programs and uses different numbers of processors to execute them. All of these results serve to compare the execution times of the different resource selection strategies. We calculate the performance and network values for each machine and sort the list of machines by score, which represents the resource selection priority, as described by the MGRSS algorithm in the previous section. We divided MGRSS into three levels, from level one to level three, by different weighting values β. The maximum weighting of the network is MGRSS3 (β = 0.7);
the medium weighting is MGRSS2 (β = 0.5); and the maximum weighting of performance is MGRSS1 (β = 0.3). The first experiment shows the total time of the bucket-sort MPI program over ten runs, with program parameter sizes of 512, 1024, and 4096, respectively. The bucket-sort MPI program uses a small number of resources; in this case, rapid transmission normally decides the program execution time. MGRSS2 and MGRSS3 are the better strategies in this experiment, because they give more weight to the network among the three levels of the MGRSS algorithm. Experiment results are shown in Figs. 7 and 8.
Fig. 7. Result for Bucketsort_MPI in different parameters sequence
Fig. 8. Result for Bucketsort_MPI in different strategies sequence
The second experiment shows the total time of the matrix multiplication MPI program over ten runs, with program parameter sizes of 256, 512, and 1024, respectively. Matrix multiplication requires a large amount of computing power and network, but network flow still carries the larger proportion, so MGRSS3 spent the least time executing the jobs. Experiment results are shown in Fig. 9 and Fig. 10.
Fig. 9. Result for Mmd_MPI in different parameters sequence
Fig. 10. Result for Mmd_MPI in different strategies sequence
We executed the nine parallel programs described in the previous section and again compared MGRSS with the other strategies. The programs have different characteristics, so the different strategies produce different results across these situations. Overall, the three levels of the MGRSS strategies
are still better than the other strategies. Experiment results are shown in Fig. 11. In the last experiment, we randomly selected twenty different types and parameter settings of parallel programs and executed them under the different strategies twenty times, as shown in Fig. 12. In some cases the speed-only and the network-only strategies are better than MGRSS, but MGRSS is better than the other two strategies in most cases.
Fig. 11. Result for all MPI programs
Fig. 12. Result for a variety of MPI programs
5 Conclusion and Future Work

In this paper, we proposed a multi-grid resource selection strategy, which helps users select the machines with better performance in order to shorten the execution time of programs; furthermore, each grid user is given a quota of resources. Experimental results show that MGRSS exhibits better performance than the other strategies applied. Overall, we constructed a multi-grid platform that integrates four single-grid systems — the Tiger Grid, the Medical Grid, the Bio Grid, and the GCA Grid — and designed and implemented a Multi-Grid Resource Broker with a multi-grid resource selection strategy. In the future, we will continue to enhance and improve the functions of the resource broker and the multi-grid resource selection strategy.
References

1. Foster, I., Kesselman, C.: Globus: A Metacomputing Infrastructure Toolkit. International Journal of Supercomputer Applications 11, 115–128 (1997)
2. Czajkowski, K., Fitzgerald, S., Foster, I., Kesselman, C.: Grid Information Services for Distributed Resource Sharing. In: Proceedings of the Tenth IEEE International Symposium on High-Performance Distributed Computing, pp. 181–194 (2001)
3. Foster, I., Karonis, N.: A Grid-Enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems. In: Proceedings of the 1998 Supercomputing Conference, p. 46 (1998)
4. Tang, J., Zhang, M.: An Agent-based Peer-to-Peer Grid Computing Architecture. In: First International Conference on Semantics, Knowledge and Grid, SKG '05, p. 57 (2005)
5. Aloisio, G., Cafaro, M.: Web-based access to the Grid using the Grid Resource Broker portal. Concurrency and Computation: Practice and Experience 14, 1145–1160 (2002)
6. Krauter, K., Buyya, R., Maheswaran, M.: A taxonomy and survey of grid resource management systems for distributed computing. Software: Practice and Experience 32, 135–164 (2002)
7. Yang, C.T., Lai, C.L., Shih, P.C., Li, K.C.: A Resource Broker for Computing Nodes Selection in Grid Environments. In: Jin, H., Pan, Y., Xiao, N., Sun, J. (eds.) GCC 2004. LNCS, vol. 3251, pp. 931–934. Springer, Heidelberg (2004)
8. Yang, C.T., Shih, P.C., Li, K.C.: A high-performance computational resource broker for grid computing environments. In: 19th International Conference on Advanced Information Networking and Applications, AINA 2005, vol. 2, pp. 333–336 (2005)
9. Yang, C.T., Li, K.C., Chiang, W.C., Shih, P.C.: Design and Implementation of TIGER Grid: an Integrated Metropolitan-Scale Grid Environment. In: Proceedings of the 6th IEEE International Conference on PDCAT '05, pp. 518–520 (2005)
10. Yang, C.T., Lin, C.F., Chen, S.Y.: A Workflow-based Computational Resource Broker with Information Monitoring in Grids. In: Proceedings of the Fifth International Conference on Grid and Cooperative Computing (GCC '06), pp. 199–206 (2006)
11. Yang, C.T., Chen, S.Y., Chen, T.T.: A Grid Resource Broker with Network Bandwidth-Aware Job Scheduling for Computational Grids. In: Cérin, C., Li, K.-C. (eds.) GPC 2007. LNCS, vol. 4459, pp. 1–12. Springer, Heidelberg (2007)
12. Foster, I., Kesselman, C.: Globus: A Metacomputing Infrastructure Toolkit. International Journal of Supercomputer Applications 11, 115–128 (1997)
13. Kim, D.H., Kang, K.W.: Design and Implementation of an Integrated Information System for Monitoring Resources in Grid Computing. In: Computer Supported Cooperative Work in Design, pp. 1–6 (2006)
14. Tiger Grid, http://gamma2.hpc.csie.thu.edu.tw/ganglia/
15. Medical Grid, http://eta1.hpc.csie.thu.edu.tw/ganglia/
16. Bio Grid, http://140.128.98.25/ganglia/
17. Globus, http://www.globus.org/
18. JFreeChart, http://www.jfree.org
Design Methodologies of Workload Management through Code Migration in Distributed Desktop Computing Grids Makoto Yoshida and Kazumine Kojima Information and Computer Engineering, Okayama University of Science, 1-1 Ridai-cho, Okayama, 700-0005, Japan {yoshida,kojima}@ice.ous.ac.jp
Abstract. Large numbers of loosely coupled PCs can organize clusters and form desktop computing grids by sharing their processing power; the power of the PCs, the transaction distributions, and the load balancing characterize the performance of such computing grids. This paper describes design methodologies for workload management in distributed desktop computing grids. Based on a prototype experiment, several simulations were performed: several centralized and decentralized algorithms for the location policy were examined, and design methodologies for distributed desktop computing grids were derived from the simulation results. The methodologies for domains, language, and control algorithms for computing grids are described. A language for distributed desktop computing is designed to accomplish the design methodologies.
1 Introduction

A grid system is a type of distributed system, and it can be categorized into three classes: computing grids, data grids, and service grids [3, 6]. This paper deals with the computing grid. The purpose of the computing grid in this paper is to use effectively the idle computing power scattered around one or more organizations. Several hundreds of PCs, such as notebook PCs, desktop PCs, and workstations, are assumed. User jobs can be executed on local or remote computer systems through migration [1]. For the desktop grids to work efficiently, the transfer policy, which determines the migration decision, and the location policy, which determines the site to be migrated to, must be provided adequately for the application environment [8]. However, little work has taken the application environment into account, and methodologies are needed to design the workload of transactions in the application environment. In previous papers [2, 7], we compared and evaluated several centralized and decentralized algorithms for the location policy. We summarize these results in this paper, together with some new results on scalability, and describe the design methodologies for desktop computing grids derived from several simulation results. A language for distributed computing is designed to accomplish the methodologies derived.
Fig. 1. Taxonomy of Distributed Computing System (Distributed Computing branches into Grid System, P2P System, Distributed File Sharing, and Platform Collaboration; Grid System into Data Grid, Computing Grid, and Service Grid; Computing Grid into Load Balancing and Scheduling; Load Balancing into Transfer Policy (Migration State Table) and Location Policy, the latter into Active Method (Pull, Push) and Evaluation Method (Centralized/Decentralized; Static/Dynamic; Non Migration, CPU based, Workload based, Working Buffer based))
The rest of the paper is organized as follows. Sections 2 and 3 summarize the simulation results: Section 2 describes the grid computing model and the prototype experiment, and Section 3 describes the results of the simulations. Section 4 describes the design methodologies; the language designed for workload sharing in distributed desktop computing is also introduced there. Section 5 concludes with future work.
2 Computing Model

Figure 1 shows a taxonomy of distributed computing systems [4, 5]. The shaded area is the area addressed in this paper. The transfer policy and the location policy must be adequately defined for computing grids [8]. The transfer policy in our model is determined by the migration state table, which was obtained by implementing and evaluating the prototype experiment [2, 7]. To satisfy the location policy, an algorithm that detects the appropriate site for object migration must be provided [8]. Many load balancing algorithms have been proposed: pull and push approaches for the active method, and several algorithms for the evaluation method, as shown in Figure 1 [3, 6]. We compared several centralized algorithms [7] and a decentralized algorithm [2] by varying the transaction patterns, and derived some design methodologies.

2.1 Centralized Algorithms

There exist several centralized location policy algorithms [7]: the random based selection algorithm, the CPU power based selection algorithm, the traffic based selection algorithm, and the working buffer based selection algorithm. The random based algorithm (RB) selects the site randomly from the available sites.
SITE i (Si)

SENDER Process
  Send (MigrateRequest) to all members;
  Receive (Ack_with_Timestamp Txo) from members;
    /* Txo stands for the timestamp T for object o on site x */
  Select the site_X that has the earliest timestamp Txo from received messages;
  Send (Move_Transaction with Txo) to the members;
  Receive (Result);

RECEIVER Process
  if Receive(MigrateRequest) then
    Statecheck();  /* check Migration State Table */
    if state = idle then
      Reserve&Enque_Request with timestamp Tio;
        /* Tio stands for the timestamp T for the object on site i */
      Return (Ack_with_Timestamp Tio);
  if Receive(Move_Transaction) then
    /* compare the timestamps */
    if (it has sent the Ack with timestamp Tio) & (Tio = Txo) then
      Migrate_Code & Execute_Code;
      Return (Result);
    else
      Cancel&Deque the request that has the timestamp Tio;

Fig. 2. Decentralized Algorithm (DC Algorithm)
The CPU based algorithm (CB) selects the site by the order of CPU power, and the traffic based algorithm (TB) selects the site by the order of the workload at each site. The working buffer based algorithm (WB) assigns to each site a number of transactions it can afford to accept from remote sites, and the algorithm selects the site with the largest such number; if a site is selected for migration, its number is decreased. The number, which defines the processing power a site can afford to offer other sites, must be assigned properly [7]. These location policy algorithms are compared to the non-migration policy (NM).

2.2 Decentralized Algorithm

Figure 2 shows the autonomous decentralized algorithm adopted at each site [2]. Each site determines the migration site autonomously by using the DC algorithm described in Figure 2. Both the sender and the receiver processes shown in the figure exist at each site. The sender broadcasts the migration request to the members of its domains when its state is not idle, and waits for acknowledgements from the members. If several acknowledgements are returned, it selects the site with the earliest attached timestamp and migrates the code to that site. On the other hand, if the receiver receives a migration request and its state is idle, it returns an acknowledgement with its current timestamp. After acknowledging, if it receives a Move_Transaction message with an earlier timestamp than its own, or if it does not receive a Move_Transaction message, it cancels the migration request it had accepted.

2.3 Prototype Experiment

A middleware prototype system that migrates objects among PCs was implemented.
Fig. 3. Migration State Table (x-axis: transaction workload, 0–100%; y-axis: threshold point (CPU utilization of a site), 0–100%; the threshold band separates the Idle, Normal, and Busy regions)

Fig. 4. Simulation Model of Each Site (transactions arrive from users and other sites into a queue handled by the transaction manager; an agent consults the migration state table; transactions either proceed to CPU processing and depart at the end of the transaction, or migrate to other sites)
The performance of two systems — one that never migrates objects and one that always migrates objects to a remote site — was compared [7], and the threshold points were obtained by varying the CPU utilization and the transaction workload. Figure 3 plots the threshold points, with the X axis showing the workload of a transaction and the Y axis the average CPU utilization of the other transactions at the site. We call the table shown in Figure 3 the migration state table (MST); it provides the mechanism for the transfer policy in the following simulations. The migration state table is divided into three states [9] — busy, normal, and idle — by setting some widths around the threshold points. When a transaction arrives at a site whose current state is normal or busy, the transaction can be moved to a remote site whose state is idle, given the mechanism for the location policy.
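For illustration, a transfer-policy check driven by such a table might look like the following sketch. The threshold curve used here is an assumed placeholder — the real table is the measured one of Figure 3 — and the state width of ±0.05 is likewise invented.

    enum SiteState { IDLE, NORMAL, BUSY };

    SiteState mstLookup(double cpuUtil, double txWorkload) {
        double threshold = 0.8 - 0.6 * txWorkload;    // placeholder curve
        if (cpuUtil < threshold - 0.05) return IDLE;  // width around the threshold
        if (cpuUtil > threshold + 0.05) return BUSY;
        return NORMAL;
    }

    // Transfer policy: migrate (to an idle remote site) unless this site is idle.
    bool shouldMigrate(double cpuUtil, double txWorkload) {
        return mstLookup(cpuUtil, txWorkload) != IDLE;
    }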
3 Simulation

A simulation model of each site of the system was built based on the prototype experiment; it is shown in Figure 4. Each site runs a transaction manager that directs each transaction that originates at that site. If the processor is idle,
a transaction is processed at that site. Otherwise, the transaction must wait in the queue for the other transactions to finish, or must move to another site to finish computing. This is determined by checking the Migration State Table located at each site. Under centralized control, the agent negotiates with the information server for the movement of transactions. The simulation is carried out using an event-based algorithm. The simulation model shown in Figure 4 was organized and several simulations were carried out. The parameters we selected for the model are the following:

1) Site number: 50 to 500 computing sites are assumed, plus 1 site for the information server under centralized control.
2) Average arrival time: the transaction arrival patterns at each site follow one of three distributions: a normal distribution, a Poisson distribution, or a uniform distribution. 1000 to 15000 transactions at each site are assumed.
3) Processing time: the processing time of transactions at each site is assumed to follow a normal distribution. The average processing time is assumed to be 2 simulation units.
4) CPU power: the CPU power at each site follows a normal distribution.
5) Transmission time: the time taken to transmit a message from one site to another for migration is assumed to vary from 1 to 20 simulation units.
6) Object migration time: the time taken to migrate the object code of a transaction to a remote site is assumed to be constant at 10 simulation units.
7) Transfer policy: the migration state table given in Figure 3 is provided at each site.
8) Location policy: six location policy algorithms are compared: the random based algorithm (RB), the CPU power based algorithm (CB), the traffic based algorithm (TB), the working buffer based algorithm (WB), the decentralized algorithm (DC), and the non-migration algorithm (NM).

The resulting statistics collected are the average response time, the average throughput, the remote processing rate, the rejected transaction rate, and the average number of messages transferred.

3.1 Simulation 1: Performance Comparison of Various Load Balancing Algorithms

The algorithms for the location policy were examined in the simulations. The result for the response time was as follows [2, 7]:

(good) DC > WB > {TB, CB} > RB > NM (bad)

In this paper, we pick the two best algorithms from the simulation results [2, 7]: the WB, which is the best centralized algorithm for response time, and the DC, which is decentralized and the best of all the algorithms simulated, are selected and compared to the non-migration method.
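As a reminder of how the WB location policy selects a site, the following minimal sketch illustrates the selection step (the variable names are invented; how the affordable numbers are assigned is a separate policy decision, as noted in Section 2.1):

    #include <vector>

    // workingBuffer[s] = number of remote transactions site s can still absorb.
    int selectSiteWB(std::vector<int>& workingBuffer, int self) {
        int best = -1;
        for (int s = 0; s < (int)workingBuffer.size(); ++s) {
            if (s == self || workingBuffer[s] <= 0) continue;
            if (best == -1 || workingBuffer[s] > workingBuffer[best]) best = s;
        }
        if (best != -1) --workingBuffer[best];  // reserve one slot at the chosen site
        return best;                            // -1: no site can take the transaction
    }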
Design Methodologies of Workload Management through Code Migration 400
NM DC WB
350 300 e) m tit niu 250 ( e im t 200 es no 150 spe R 100 50 0
4000
5000
6000 7000 8000 work load (number of transaction)
9000
10000
Fig. 5. Comparison of the Response Time (Normal Distribution) 400
NM DC WB
350
)e 300 tim itn 250 u( e tim200 es no 150 ps Re 100 50 0
4000
5000
6000 7000 8000 work load (number of transaction)
9000
10000
Fig. 6. Comparison of the Response Time (Poisson Distribution) 400
NM DC WB
350 300 e) im tti 250 nu ( e im t 200 es no 150 spe R 100 50 0
4000
5000
6000 7000 8000 work load (number of transaction)
9000
10000
Fig. 7. Comparison of the Response Time (Uniform Distribution)
Fig. 8. Migration Domain (for uniform, Poisson, and normal transaction patterns, the workload ranges between 4,000 and 10,000 transactions over which NM and DC/WB are applicable; ranges marked Non Applicable indicate where migration should not be used)
3.2 Simulation 2: Performance Comparison of Response Time Varying Transaction Patterns

The distribution of transactions at each site was varied and evaluated [2, 7]. Three distributions — a normal distribution, a Poisson distribution, and a uniform distribution — were provided, and the response times were compared. Figures 5, 6, and 7 show the comparison of the response time. The effectiveness of the traffic patterns can be ordered as follows:

(good) Uniform > Poisson > Normal (bad)

The non-migration policy worsens the response time immediately when the traffic increases. The DC and WB have a threshold at which the response time worsens: around 7000 in the normal distribution and around 9000 in the Poisson distribution. Figure 8 shows the application domain of each migration algorithm for the different transaction patterns. If transactions arrive in a normal distribution pattern and the number of transactions is between 4000 and 7000, the DC and WB algorithms work efficiently; at a workload of 7000, they achieve about 9 times better response time than the NM. When the workload exceeds 7000, both algorithms degrade exponentially, so it is better not to migrate transactions. When transactions arrive in a Poisson distribution pattern, there are likewise two threshold points, at 4500 and 9000; between them, the migration algorithms work efficiently, about 15 times better than the NM. When transactions arrive in a uniform distribution pattern, the migration algorithms have a tremendous effect: though the NM has a threshold point at 8500, the migration methods have no threshold point and continue to work beyond it.

Fig. 9. Response Time (50 sites) (response time in unit time at each site, for the NM, RB, DC, and WB policies)
Fig. 10. Response Time (500 sites) (response time in unit time at each site, for the NM, RB, DC, and WB policies)

Fig. 11. Response Time (4000 Transactions/site Assumed) (response time in unit time against network delay of 1–20 unit times, for the NM, RB, CB, TB, and WB policies)

Fig. 12. Response Time (5000 Transactions/site Assumed) (same axes and policies as Fig. 11)
3.3 Simulation 3: Performance Comparison of Response Time Varying Grid Scales

The computing sites are scaled up from 50 to 500 sites, and the same simulations are performed. Figure 9 shows the response time of each site in the case of 50 sites, and Figure 10 shows the response time when scaling up to 500 sites. It can be observed that the increase in the number of sites does not affect the response time of each site. Even though the number of sites increases, the
average response time in the grids will not change, as long as the computing scale is in the order of hundreds.

3.4 Simulation 4: Performance Comparison of Response Time Varying Network Delay

To observe the influence of network delay, the transmission time is varied and the response time is observed. The ratio of transmission time to processing time ranges from 1:1 to 1:20. Figures 11 and 12 show the response time when changing both the transmission time and the number of transactions. The intersections of the NM curve and the others shift to the right as the number of transactions increases. Though the response worsens as the network delay increases, we observe that the larger the transmission delay, the further the migration domain can be extended. When there are 4000 transactions and the network delay is less than 10 times the processing time, migration works effectively. When there are 5000 transactions, migration works effectively if the network delay is less than 20 times the processing time of a transaction. As the network delay increases, the response time worsens linearly. The equation for the response time graph can be described as follows:

    ResponseTime = NetworkDelay / 2 + C    (1)
Network delay here is the relative time unit compared to the processing time, and C is a constant determined by the workload of transactions. Given the network delay and C, the upper bound of the migration domain can be defined by the equation. If the value given by equation (1) is greater than the average non-migrating response time, then migration should not be performed, or the constant C has to be adjusted. Given the network delay and the value of C, we can calculate the response time and compare it to the anticipated response time.

3.5 Simulation Results

The results of the simulations are summarized as follows:

1) Several load balancing algorithms were evaluated and compared to the non-migration policy (NM). The effective order of the location policy algorithms for response time was observed. The migration ratio converged to a different point depending upon the location policy algorithm.
2) The transaction pattern affects the response time of the system tremendously. The order of effective response time for the transaction patterns was observed, and the application domains of the location policy algorithms can be defined.
3) The increase in the number of sites does not affect the response time of each site when the grid scale is in the order of hundreds; the results we obtained are effective for grid scales of that order.
4) When the network delay is very large compared to the processing time of a transaction, it is useless to migrate the transaction. The relationship between the network delay and the response time was obtained; it sets the upper bound of the transaction workloads.
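As a purely hypothetical numeric check of equation (1) — the values here are illustrative, not measured: with a network delay of 10 time units and C = 8, the predicted response time under migration is 10/2 + 8 = 13 units. If the average non-migrating response time at that workload were only 12 units, migration would not pay off and should be disabled, or the workload would have to be reduced so that C becomes smaller.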
4 Design Methodologies

The language and the design methodologies for desktop computing grids are described.

4.1 Language Designed

We designed a language that specifies the domains of the system for load balancing. A domain defines the applicable area of migration, and can be used not only for load balancing but also for fault tolerance or other purposes. Figure 13 (a) shows the syntax of the language, and (b) shows the BNF notation. In the language, one of four parameters — "ALL", "domain names", "direct address", or "NULL" — must be selected. If "ALL" is selected, the migration request is sent to all the desktop PC sites in the grids. If "domain names" is selected, the migration request is multicast to the sites that the domains involve; the addresses of the sites must have been registered under the domain name in advance. If "direct address" is selected, the request is sent directly to the address described. If "NULL" is selected, the request is not sent anywhere, and the job is executed at its own site.

4.2 Design Methodologies

This section describes the design methodologies obtained from the simulation results, assuming the following strategies:

1) The Migration State Table must be constantly observed by the SMT. It defines the transfer policy for migration.
2) Transactions should be distributed uniformly.
3) The relationship between the network delay and the processing time must be observed to define the upper bound of the migration domain. The simulation results show that the network delay should be less than 20 times the processing time when the number of sites is in the order of hundreds.
4) If the number of grid sites is in the order of hundreds and heavy traffic is in the migration domain, the distributed algorithm (DC) for decentralized control and the working buffer algorithm (WB) for centralized control are the best solutions to apply.

The following are the design methodologies we learned from the simulation results:

1) Before designing the migration, trace the workload of transactions and analyze the average workload of a transaction and the pattern of transactions at each site.
2) Obtain the Migration State Table for the transfer policy, described in Section 2.3, by tracing or calculating the workload of transactions.
3) Define the upper bound of the migration domain using the equation (1) in section 3.4. Given the network delay, workload of transactions, or anticipating response time, we can define the upper bound of transactions for migration. Migration effectively works within the migration domain. 4) Design the domain of each site. Try to group each site to the domains as each domain to have the uniform distribution of the workloads, in the migration domain described in Figure 8. 5) Define the sites of the domains by the language described in section 4.1. 6) Select the migration algorithms for location policy. Either DC algorithm for distributed control or WB algorithm for centralized control can be selected. If it does not have powerful machines for servers, select the DC algorithm. However, the WB algorithm has an advantage for controlling the response time at each site to be uniformly distributed. #DIST
#DIST {ALL | <domain names> | <direct address> | NULL}
## Follows the original Java source code
……………
(a) DIST Syntax

DIST ::= #DIST {ALL | <domain names> | <direct address> | NULL}
<domain names> ::= <domain name> | <domain name>,<domain names>
<domain name> ::= <address> | <address>,<domain name>
<address> ::= <direct address> | <symbolic address>
<symbolic address> ::= <name> | <name>,<symbolic address>
(b) BNF Notation

Fig. 13. Language Syntax for Distributed Computing
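To illustrate how the directive is meant to precede ordinary source code, the following is a hypothetical example of an annotated Java file; the domain names officeA and officeB and the class itself are invented for illustration and are not taken from the paper.

#DIST officeA,officeB
## Follows the original Java source code
public class StockTransaction {
    // The #DIST header above tells the runtime that this transaction
    // may migrate to any site belonging to domain officeA or officeB.
    public void execute() {
        // ... transaction logic ...
    }
}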
5 Conclusion

This paper summarized the performance evaluation of load balancing control [2, 7] and described the design methodologies derived from the simulation results in distributed desktop computing grids. The domain, the language, and the algorithms for computing grids were described, and the distributed language that supplements the design methodologies was introduced. The language can be used effectively within a company or within a university. The validation of the effectiveness of the system we described is our next research topic. We are now implementing the platform of the distributed desktop computing grids.
References
1. Coulouris, G., Dollimore, J., Kindberg, T.: Distributed Systems: Concepts and Design, 4th edn. Addison Wesley, Reading (2005)
2. Yoshida, M., Sakamoto, K.: Performance Comparison of Decentralized Workload Control through Code Migration in Distributed Desktop Computing Grids. In: The 5th IEEE International Symposium on Embedded Computing (2008)
3. Krauter, K., Buyya, R., Maheswaran, M.: A Taxonomy and Survey of Grid Resource Management Systems for Distributed Computing. Software-Practice and Experience 32 (2002)
4. Venugopal, S., Buyya, R., Ramamohanarao, K.: A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing. ACM Computing Surveys 28 (2006)
5. Theotokis, S.A., Spinellis, D.: A Survey of Peer-to-Peer Content Distribution Technologies. ACM Computing Surveys 36(4) (2004)
6. Choi, S., Buyya, R., et al.: A Taxonomy of Desktop Grids and its Mapping to State-of-the-Art Systems. Technical Report GRIDS-TR-2008-3, The University of Melbourne, Australia (2008)
7. Yoshida, M., Sakamoto, K.: Performance Comparison of Load Balancing Algorithms through Code Migration in Distributed Desktop Computing Grids. In: The 3rd IEEE Asia-Pacific Services Computing Conference (2008)
8. Shah, R., Veeravalli, B., Misra, M.: On the Design of Adaptive and Decentralized Load-Balancing Algorithms with Load Estimation for Computational Grid Environments. IEEE Trans. on Parallel and Distributed Systems 18(12) (2007)
9. Alonso, R., Cava, L.: Sharing Jobs Among Independently Owned Processors. In: Proc. of the 8th ICDCS (1988)
Dynamic Dependent Tasks Assignment for Grid Computing

Meddeber Meriem¹ and Yagoubi Belabbas²

¹ Department of Computer Science, University of Mascara, Algeria
[email protected]
² Department of Computer Science, University of Oran, Algeria
[email protected]
Abstract. In grid computing, task execution time depends on the machine to which a task is assigned and on the task precedence constraints, represented by a directed acyclic graph. In this paper we propose a hybrid assignment strategy for dependent tasks in grids which integrates static and dynamic assignment techniques. We consider a computing grid to be a set of clusters, each formed by a set of computing elements and a cluster manager. Our main objective is to arrive at a task assignment that achieves minimum response time and reduces the transfer cost induced by task transfers, while respecting the dependency constraints.
1 Introduction
Grid computing originated from a new computing infrastructure for scientific research and cooperation and is becoming a mainstream technology for large-scale resource sharing and distributed system integration. Current efforts towards making the global infrastructure a reality provide technologies on both grid services and application enabling [1]. A task is defined to be a program segment that can be individually scheduled. A grid computing element is defined to be any processor that can receive tasks from a central scheduler; it may be a single-processor node or one of the processors within a multi-processor node. The problem of obtaining an optimal matching of tasks to machines in any distributed system is well known to be NP-hard, even when the tasks are independent. The problem is much more difficult when the tasks have dependencies, because the order of task execution as well as the task-machine pairing affects the overall completion time [2]. A precedence relation from task i to task j means that j needs data from i before being started. If these two tasks are not assigned to the same computing element, a delay cij must be considered between the completion of i and the beginning of j to transfer the data. Dynamic tasks assignment assumes a continuous stochastic stream of incoming tasks. Very few parameters are known in advance for dynamic tasks assignment. Obviously, it is more complex than static tasks assignment for implementation,
but achieves better throughput. It is also the most desired because of application demand [3]. In this paper, we propose a hybrid assignment strategy for dependent tasks in grids which integrates static and dynamic assignment techniques. This strategy meets the following objectives: (i) reducing, whenever possible, the average response time of tasks submitted to the grid; (ii) respecting the constraints of dependency between tasks; and (iii) reducing communication costs by using a static task placement based on the connected-components algorithm to minimize the delay cij between tasks i and j, and by favoring dynamic task placement within the cluster rather than the entire grid. The rest of this paper is organized as follows. We begin with an overview of some related works in Section 2. Section 3 presents the task assignment problem. In Section 4 we present our system model. Section 5 describes the main steps of the proposed assignment strategy. We evaluate the performance of the proposed scheme in Section 6. Finally, Section 7 concludes the paper.
2 Related Work
There have been many heuristic algorithms proposed for the static and dynamic task assignment problem. Many of these algorithms apply only to the special case where the tasks are independent, i.e., there are no precedence constraints [4,5,6]. Many heuristic algorithms have been proposed for static scheduling of dependent tasks where the task precedence constraints are modelled as a directed acyclic graph (DAG). In [7], Yang Qu et al. target dependent task models and propose three static schedulers that use different problem-solving strategies. The first is a heuristic approach developed from traditional list-based schedulers; it presents high efficiency but the least accuracy. The second is based on a full-domain search using constraint programming; it can guarantee optimal solutions but requires significant searching effort. The last is a guided random search technique based on a genetic algorithm, which shows reasonable efficiency and much better accuracy than the heuristic approach. Boyer et al. [2] propose a non-evolutionary random scheduling (RS) algorithm for efficient matching and scheduling of inter-dependent tasks in a DHC system. RS is a succession of randomized task orderings and a heuristic mapping from task order to schedule. Randomized task ordering is effectively a topological sort where the outcome may be any possible task order for which the task precedence constraints are maintained. However, static task assignment is performed offline, or in a predictive manner, and can be used whenever the task information is known a priori, such as at compile time of a parallelized application. Despite the good results these approaches provide, they remain limited to a static assignment. Large and non-dedicated computing platforms such as grids may require dynamic task assignment methods to adapt to run-time changes such as increases in the workload or resources, processor failures, and link failures [8]. In this paper, we address these issues.
3 Tasks Assignment
As a grid is a distributed system utilizing idle nodes scattered in every region, the most critical issue pertaining to distributed systems is how to integrate and apply every computer resource into a distributed system, so as to achieve the goals of enhancing performance, resource sharing, extensibility, and increased availability. Task assignment is very important in a distributed environment. In distributed systems, every node has a different processing speed and different system resources, so in order to enhance the utilization of each node and shorten the overall processing time, task assignment plays a critical role. On the other hand, in distributed systems, the policies and methods used for task assignment directly affect the performance of the system. The task assignment policies for distributed systems can generally be categorized into static and dynamic policies [9].
3.1 Static Tasks Assignment
Static task assignment policies use some simple system information, such as information related to the average operation, the operation cycle, etc.; according to these data, tasks are distributed through mathematical formulas or other adjustment methods, so that every node in the distributed system processes its assigned tasks until completion. The merit of this method is that system information does not need to be collected at all times, and the system can run with only a simple analysis. However, some of the nodes may have low utilization rates. Because the assignment does not adjust dynamically to the system state, there is a certain degree of burden on system performance.
3.2 Dynamic Tasks Assignment
Dynamic task assignment policies refer to the current state of the system, or the most recent known state, to decide how to assign tasks to each node in a distributed system. If any node in the system is overloaded, the excess tasks are transferred to other nodes and processed there, in order to achieve the goal of a dynamic assignment. However, the migration of tasks incurs extra overhead, because the system has to reserve some resources for collecting and maintaining the information on system states. If this overhead can be controlled and limited to an acceptable range, dynamic task assignment policies outperform static ones in most conditions.
4 System Model

4.1 Grid Model
In our study we model a grid as a collection of n clusters with different computational facilities. Let G = (C1, C2, ..., Cn) denote a set of clusters, where each cluster Ci is defined as a vector with four parameters: Ci = (NCEi, Mi, Bandi, Spdi), where NCEi is the number of computing elements, Mi is the manager node of the cluster Ci, Bandi is the bandwidth of the network, and Spdi corresponds to the cluster capability. Every cluster is connected to the global network (WAN). In this model, a cluster represents a set of R homogeneous computing elements connected by a local network (LAN) and located geographically in the same organization: Ci = (CEi1, CEi2, ..., CEir), where each computing element CEij has its own capability. The cluster manager CMi uses the following equation to calculate Spdi:

$Spd_i = \sum_{j \in NCE_i} Spd_{ij} \qquad (1)$
Figure 1 shows the grid system model. In highly distributed systems, centralized task assignment approaches become less feasible because they make use of a high degree of information, which causes a high assignment overhead. That is why we chose to develop a hybrid load balancing model that is centralized intra-cluster, but distributed inter-cluster. Each cluster in the grid has a manager, which assigns tasks to the cluster's computing elements.
Fig. 1. Grid model
We assume that in the grid under study there is a central resource broker (CRB), to which every cluster manager (CM) connects; the grid clients send their tasks to the CRB. The CRB is responsible for scheduling tasks among the CMs.
4.2 Application Model
DAG model. An application can be represented by a directed acyclic graph (DAG) D = (V, E), where V is a set of v nodes and E is a set of e directed edges. A node in the DAG represents a task, which in turn is a set of instructions that must be executed sequentially, without pre-emption, on the same processor. The edges in the DAG, each of which is denoted by (ni, nj), correspond to the precedence constraints among the nodes. The weight of an edge is called the communication cost of the edge and is denoted by Cij. The source node of an edge is called the parent node while the sink node is called the child node. A node with no parent is called an entry node and a node with no child is called an exit node [10].
Fig. 2. Task precedence graph
Task model. Tasks arrive randomly, each with a random computation length, an arrival time, and precedence constraints. In our work, we generate the precedence constraints between tasks randomly. We assume that tasks can be executed on any computing element, that each CE can only execute one task at a time, and that the execution of a task cannot be interrupted or moved to another CE during execution. We also assume that a task cannot start execution before it has gathered all of the messages from its parent tasks. The communication cost between two tasks assigned to the same processor is assumed to be zero.
5 Proposed Strategy
In order to reduce the global response time of the system and respect the task dependencies, this study proposes a hybrid task assignment policy, consisting of static and dynamic task assignment strategies. In the static case, when a user sends his tasks, they are assigned to appropriate computing elements to achieve the goal of placement. In the dynamic case, the system adjusts dynamically according to the clusters' workload.
5.1 Static Tasks Placement Strategy
Central resource broker. The role of the CRB in the system is to statically assign the tasks placed in the task queue. For that we propose the following steps, which are executed periodically:
– Partition all tasks waiting in the queue into x connected components by executing the connected-components algorithm. A connected component is defined as a collection of dependent tasks with inter-task data dependencies. The following figure shows a set of waiting tasks composed of three connected components.
Fig. 3. Example with three connected components
– Send each connected component CCk to a cluster manager CMi, using a round-robin strategy, as follows: (CC1, CM1), (CC2, CM2), ..., (CCp, CMn), (CCp+1, CM1), ..., (CCx, CMj).
– Send the tasks associated with the connected component CCk to the cluster manager CMi.

Cluster manager. Once the manager receives the connected components, it assigns them to the computing elements composing the cluster:
– randomly, or
– using a round-robin strategy: (CC1, CE1), (CC2, CE2), ..., (CCx, CEj).

Then the cluster manager sends the tasks composing each connected component CCj to the same computing element as CCj. (A sketch of the partitioning step is given below.)
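The paper does not give code for the connected-components step, so the following is a minimal Java sketch of one way to realize it; tasks are numbered 0..n−1, edge direction is ignored for grouping (weak connectivity), and all names and the example dependencies are illustrative assumptions, not details from the paper.

import java.util.*;

/** Groups dependent tasks into connected components via union-find. */
public class ConnectedComponents {

    private final int[] parent;

    public ConnectedComponents(int taskCount) {
        parent = new int[taskCount];
        for (int i = 0; i < taskCount; i++) parent[i] = i;
    }

    private int find(int t) {                  // path-compressing find
        while (parent[t] != t) {
            parent[t] = parent[parent[t]];
            t = parent[t];
        }
        return t;
    }

    public void addDependency(int i, int j) {  // DAG edge (i, j)
        parent[find(i)] = find(j);
    }

    /** Returns the components, each as a list of task ids. */
    public Collection<List<Integer>> components() {
        Map<Integer, List<Integer>> byRoot = new HashMap<>();
        for (int t = 0; t < parent.length; t++)
            byRoot.computeIfAbsent(find(t), r -> new ArrayList<>()).add(t);
        return byRoot.values();
    }

    public static void main(String[] args) {
        // Eight tasks with hypothetical dependencies forming three components.
        ConnectedComponents cc = new ConnectedComponents(8);
        cc.addDependency(0, 2);
        cc.addDependency(1, 3);
        cc.addDependency(3, 4);
        cc.addDependency(5, 6);
        cc.addDependency(6, 7);
        int cm = 0, n = 3;  // round-robin dispatch over n cluster managers
        for (List<Integer> comp : cc.components())
            System.out.println("CC -> CM" + (cm++ % n + 1) + " : " + comp);
    }
}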
5.2 Dynamic Tasks Placement Strategy
Computing element. The computing element performs these steps while its task queue is not empty:
– Run the first entry task Tj (one with no pending precedence constraints) of its task queue.
– Update the connected component CCj associated with task Tj.
– Execute the connected-components algorithm on CCj to obtain the new entry tasks.

Figure 4 shows an example of a connected component with one entry task. After the end of the execution of Tj, CCj is divided into three connected components.
Fig. 4. Example with one entry task
The computing element executes the following steps periodically:
– Compute its execution time Texij as follows:

$Tex_{ij} = \frac{\sum_{c \in CCN_{ij}} \sum_{l \in L} \sum_{k \in P} length_{c,k}}{Spd_{ij}} \qquad (2)$

where CCNij is the set of connected components assigned to the computing element, L is the set of levels of the connected components, and P is the set of tasks of level k.
– Send its execution time to the cluster manager and to all computing elements of the cluster.
– We define a threshold α, beyond which a resource CEij is considered more loaded than another: if Texij > Texik + α, then transfer some connected components from CEij to CEik until Texij ≤ Texik + α.
– Inform the cluster manager about the task movements.

Cluster manager. The cluster manager periodically receives the execution time of each resource of the cluster and performs the next steps:
– Compute the execution time of the cluster as follows:

$Tex_i = \sum_{j \in R} Tex_{ij} \qquad (3)$

– We define a threshold β, such that: if Texi > Texk + β, then transfer some connected components from Ci to Ck until Texi ≤ Texk + β. (An illustrative sketch of the α-threshold rule follows.)
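As an illustration of the α-threshold rule above, here is a small Java sketch of how a computing element might decide which connected components to hand off; the greedy heaviest-first choice and all names are assumptions made for illustration, not details specified by the paper.

import java.util.*;

/** Illustrative greedy realization of the alpha-threshold transfer rule. */
public class ThresholdBalancer {

    /** Load of one connected component: sum of its tasks' normalized lengths. */
    static double loadOf(List<Double> component) {
        double sum = 0.0;
        for (double w : component) sum += w;
        return sum;
    }

    /** Transfers components from CEij to CEik until Tex_ij <= Tex_ik + alpha. */
    static void rebalance(List<List<Double>> ceOver, List<List<Double>> ceUnder, double alpha) {
        double texOver = 0.0, texUnder = 0.0;
        for (List<Double> c : ceOver) texOver += loadOf(c);
        for (List<Double> c : ceUnder) texUnder += loadOf(c);
        // Heaviest components first -- a heuristic choice, not fixed by the paper.
        ceOver.sort(Comparator.comparingDouble(ThresholdBalancer::loadOf).reversed());
        Iterator<List<Double>> it = ceOver.iterator();
        while (it.hasNext() && texOver > texUnder + alpha) {
            List<Double> cc = it.next();
            double w = loadOf(cc);
            if (texOver - w < texUnder + w) continue;  // moving would overshoot the balance
            it.remove();
            ceUnder.add(cc);
            texOver -= w;
            texUnder += w;
        }
    }
}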
6 Simulation Results
To test and evaluate the performance of our model, we implemented our strategy under the GridSim [11] simulator, written in Java. With it we can:
(i) generate the grid configuration file (number of clusters, number of computing elements, their characteristics, period for sending load information, bandwidth, ...),
(ii) generate a set of tasks with all associated data (submission time, computation length, precedence constraints, ...); a sketch of such a generator is given below, after the discussion of Figure 5.
As performance measure, we are interested in the average response time of tasks. To obtain results that are as consistent as possible, we repeated the same experiments more than ten times. All these experiments were performed on a 3 GHz Pentium IV PC with 1 GB of memory, running Linux Redhat 9.0. The first results obtained for response time are shown in Figure 5.
Fig. 5. Response time results (4 clusters)
We observe that our strategy considerably reduces the average response time of tasks submitted to the system. When the number of tasks increases, the response time benefit increases. The lowest benefit is obtained with 500 tasks; it is due to an underloaded state of the grid.
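Complementing item (ii) above, the following is a hypothetical Java sketch of how a random set of dependent tasks could be produced; the record fields are assumptions, and edges are only drawn from lower- to higher-numbered tasks so the resulting graph stays acyclic.

import java.util.*;

/** Randomly generates tasks with lengths, arrival times, and a DAG of dependencies. */
public class TaskGenerator {

    record Task(int id, double submissionTime, long length, List<Integer> parents) {}

    static List<Task> generate(int n, double edgeProbability, long seed) {
        Random rnd = new Random(seed);
        List<Task> tasks = new ArrayList<>(n);
        double clock = 0.0;
        for (int i = 0; i < n; i++) {
            clock += rnd.nextDouble() * 10.0;          // random inter-arrival time
            long length = 1_000 + rnd.nextInt(9_000);  // random computation length
            List<Integer> parents = new ArrayList<>();
            for (int j = 0; j < i; j++)                // edges j -> i keep the graph acyclic
                if (rnd.nextDouble() < edgeProbability) parents.add(j);
            tasks.add(new Task(i, clock, length, parents));
        }
        return tasks;
    }

    public static void main(String[] args) {
        for (Task t : generate(10, 0.15, 42L))
            System.out.printf("T%d t=%.1f len=%d parents=%s%n",
                    t.id(), t.submissionTime(), t.length(), t.parents());
    }
}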
7 Conclusion
In this paper we have proposed a hybrid assignment strategy for dependent tasks in grids which integrates static and dynamic assignment techniques for solving the placement problem. A task placement strategy was introduced; it has the advantage of being able to divide the input task graph into a set of connected components in order to reduce the response time of the application. To test and evaluate the performance of our model, we implemented our strategy under the GridSim simulator, written in Java. We randomly generated clusters with different characteristics and a set of dependent tasks. The first experimental results are encouraging, since we can significantly reduce the average response time. To measure the efficiency of the strategy, we plan to compare its performance using other grid simulators such as SimGrid [12]. We also plan to integrate our strategy into the GLOBUS middleware [13].
References
1. Cao, J., Spooner, D.P., Jarvis, S.A., Nudd, G.R.: Grid load balancing using intelligent agents. Future Generation Comp. Syst. 21, 135–149 (2005)
2. Boyer, W.F., Hura, G.S.: Non-evolutionary algorithm for scheduling dependent tasks in distributed heterogeneous computing environments. J. Parallel Distrib. Comput. 65, 1035–1046 (2005)
3. Vidyarthi, D.P., Sarker, B.K., Tripathi, A.K., Yang, L.T.: Scheduling in Distributed Computing Systems: Analysis, Design and Models. A research monograph. Springer, New York (2009)
4. Leal, K., Huedo, E., Llorente, I.M.: A decentralized model for scheduling independent tasks in Federated Grids. Future Generation Comp. Syst. 25, 840–852 (2009)
5. Salcedo-Sanz, S., Xu, Y., Yao, X.: Hybrid meta-heuristics algorithms for task assignment in heterogeneous computing systems. Computers and OR 33, 820–835 (2006)
6. Maheswaran, M., Ali, S., Siegel, H.J., Hensgen, D., Freund, R.F.: Dynamic mapping of a class of independent tasks onto heterogeneous computing systems. J. Parallel Distrib. Comput. 59, 107–121 (1999)
7. Qu, Y., Soininen, J.-P., Nurmi, J.: Static scheduling techniques for dependent tasks on dynamically reconfigurable devices. Journal of Systems Architecture 53, 861–876 (2007)
8. Uçar, B., Aykanat, C., Kaya, K., Ikinci, M.: Task assignment in heterogeneous computing systems. J. Parallel Distrib. Comput. 66, 32–46 (2006)
9. Yan, K.Q., Wang, S.C., Chang, C.P., Lin, J.S.: A hybrid load balancing policy underlying grid computing environment. Computer Standards and Interfaces 29, 161–173 (2007)
10. Kwok, Y.-K., Ahmad, I.: Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys 31, 406–471 (1999)
11. Buyya, R., Murshed, M.: GridSim: A toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing. The Journal of Concurrency and Computation: Practice and Experience (CCPE) 14, 13–15 (2002)
12. Casanova, H.: SimGrid: a toolkit for the simulation of application scheduling. In: Proceedings of the IEEE International Symposium on Cluster Computing and the Grid (CCGrid'01), Brisbane, Australia, pp. 430–437 (2001)
13. Foster, I.: Globus toolkit version 4: Software for service oriented systems. In: IFIP International Conference on Network and Parallel Computing, Beijing, China, pp. 2–13 (2005)
Implementation of a Heuristic Network Bandwidth Measurement for Grid Computing Environments∗

Chao-Tung Yang∗∗, Chih-Hao Lin, and Wen-Jen Hu

Department of Computer Science, Tunghai University, Taichung, 40704, Taiwan (ROC)
[email protected]

∗ This work is supported in part by National Science Council, Taiwan R.O.C., under grants no. NSC 96-2221-E-029-019-MY3 and NSC 98-2622-E-029-001-CC2.
∗∗ Corresponding author.
Abstract. The grid computing technique is more and more popular. In general, Ganglia and NWS are applied to monitor grid nodes' status and network-related information, respectively. Comprehensive monitoring and effective management are criteria to achieve higher performance in grid computation. Unfortunately, owing to diverse user requirements, the information provided by Ganglia and NWS services is not sufficient in real cases, especially for application developers. In addition, NWS services deployed based on the "Domain-based Network Information Model" can greatly reduce the overheads caused by unnecessary measurements. This study proposes a heuristic QoS measurement architecture, constructed with a domain-based model, to provide effective information that meets user requirements, especially for application developers.

Keywords: Grid Computing, Heuristic, QoS, Network Information Model.
1 Introduction

As is well known, the grid computing technique is more and more widely adopted by organizations to obtain high-performance computing and heterogeneous resource sharing. Since all computing nodes in grid environments are connected by means of the network, all tasks executed in grid environments are influenced by the network status, due to complicated and numerous communications between computing resources [1, 2]. While we design algorithms for specific usages or assign tasks to grid environments, we have to evaluate the network performance and adjust algorithms to attain optimal performance in real-time execution [9, 11]. The best scenario is that our grid environments have mechanisms to retrieve the network status and evaluate performance automatically [3, 4]. Thus, applications or web service agents could provide higher performance due to dynamic parameter adjustment and algorithm optimization. While grid computing becomes popular, it brings about a new issue, i.e., how to manage and monitor the numerous resources of grid computing environments. In most cases, we use Ganglia and NWS to monitor machines' status and network-related information, respectively. Owing to diverse user requirements, the information provided by
these services is not sufficient. According to the mechanism designed in our previous work [10], we can retrieve the relevant network information in a real-time manner; even advanced customization for special purposes is available. With the customized shell scripts we wrote for the deployment of NWS services, we can easily and quickly deploy NWS services to each grid node and fetch network-related information at a regular time interval. Besides, we can obtain extra statistics for job scheduling in our grid environments. Beyond job scheduling, statistics are also helpful in many other respects.
2 Network Information Provider

The NWS (Network Weather Service) [6, 7] is a distributed system that detects network status by periodically monitoring and dynamically forecasting over a given time interval. This service operates through a distributed set of performance sensors (network monitors, CPU monitors, etc.) from which it gathers system information. It uses numerical models to generate forecasts of what the conditions will be for a given time period. The NWS system includes sensors for end-to-end TCP/IP performance (bandwidth and latency) [5], available CPU percentage, and available non-paged memory. The sensor interface, however, allows new internal sensors to be configured into the system. We primarily use NWS for end-to-end TCP/IP measurements. As Rich Wolski stated [7], NWS is designed to maximize four possibly conflicting functional characteristics. It must meet these goals despite the highly dynamic execution environment and evolving software infrastructure provided by shared metacomputing systems.
Fig. 1. NWS services integrated with Ganglia web portal
Fig. 2. Network statistics produced by NWS measurements demonstrated in web portal
• Predictive Accuracy
• Non-intrusiveness
• Execution longevity
• Ubiquity
We have successfully developed a number of shell scripts for automatic NWS deployment; these scripts form the basis of NWS service management. We have also successfully integrated the NWS services with the Ganglia web portal.
3 Heuristic QoS Measurement

QoS (Quality of Service) [13, 14] is the ability to provide different priorities to different applications, users, or data flows, or to guarantee a certain level of performance to a data flow. It was widely adopted in the field of computer networking, and we use it as a quality measurement of grid environments. In our previous project, we built an integrated grid environment including a web portal composed of Ganglia and NWS services. Afterward, we started another project about PACS (Picture Archiving and Communication System) [8], and most experiments were done on the same platform. The primary mission in this project is to exchange medical images efficiently with specific applications developed by our team. The application, named "Cyber" [12], has successfully integrated eight algorithms. For exchanging medical images efficiently with the algorithms integrated in Cyber, we have to configure a lot of parameters before tasks are submitted. Unfortunately, we have no idea in advance what the best combination of parameters is. Therefore, we unavoidably adopt a trial-and-error method, but it is definitely not practical in most conditions. For this reason, we aim to establish an automatic parameter self-optimization method. To guarantee the degree of QoS, we regard user requirements as constraints on tasks. With these constraints and the heuristic QoS measurements proposed in this paper, we can provide better QoS to meet user requirements.

3.1 Deploy Flowchart

We regard several grid nodes as a group, and each group has a header that deploys the nameserver and memoryserver. The simple NWS service deployment procedure that we used is divided into three steps:
1. Clean all NWS processes.
2. Load NWS services.
3. Register the NWS clique.
The standard procedure we wrote in shell scripts is shown in Fig. 3. Owing to the non-intrusiveness characteristic of NWS, these shell scripts can be executed without root privilege.
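The paper's shell scripts are not reproduced here; the following is a hypothetical Java driver showing how the three deployment steps might be run on each node of a group over ssh. The script names, the use of ssh, and the node names are assumptions for illustration only.

import java.io.IOException;

/** Hypothetical driver for the three NWS deployment steps on a node group. */
public class NwsDeployer {

    // Assumed script names; the paper's actual shell scripts are not given.
    private static final String[] STEPS = {
        "clean_nws.sh",       // 1. clean all NWS processes
        "load_nws.sh",        // 2. load NWS services (nameserver/memoryserver on the header)
        "register_clique.sh"  // 3. register the NWS clique for the group
    };

    static void deploy(String header, String[] nodes) throws IOException, InterruptedException {
        for (String step : STEPS) {
            for (String node : nodes) {
                // Non-intrusive: runs under an ordinary account, no root needed.
                Process p = new ProcessBuilder("ssh", node, "./" + step, header)
                        .inheritIO().start();
                if (p.waitFor() != 0)
                    System.err.println(step + " failed on " + node);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        deploy("zeta1", new String[] {"zeta1", "beta2", "delta2", "eta4"});
    }
}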
Fig. 3. Procedure of NWS services deployment
Fig. 4. The flowchart of gathering network information
Fig. 4 shows a simple flowchart. In this work, we edited the crontab (the Linux scheduler) to schedule routines for inserting NWS information into the database automatically and backing up raw data as plain text files locally. When the routines scheduled in the crontab are invoked, our customized shell scripts are executed. The first step of the shell script is to get the host groups from the database for NWS information gathering. Each host group is pre-defined in the database and is assigned a clique for measuring the network status. After the clique is created, it measures network information at an equal time interval, for example, 30 seconds. Then the script extracts bandwidth and latency from the NWS clique, respectively. If it succeeds, it inserts the bandwidth and latency information into the database. The second routine, keeping raw data as plain text files locally, is designed for future use. Currently, it just provides a different storage than the database to keep the raw information of the NWS services.

3.2 Heuristic Architecture

We collected historical network information of grid environments and found an approach to evaluate QoS. We can give applications dedicated parameters in a simple manner by means of database operations. A couple of functions have been designed for analyzing the historical network performance information. Statistics is helpful in many fields, especially for prediction. Some researchers have used statistical methods to monitor and predict bandwidth for QoS-sensitive tasks [13]. All network-related information is periodically categorized into the most used statistics. Besides, we have planned an innovative method to obtain the real-time network state that works with the Dynamic Domain-based Network Information Model, i.e., dynamically deploying a clique onto dedicated nodes, measuring network states, and then reporting the results to a database, to users, or to applications. The enhanced version of the current work, which supports the Dynamic Domain-based Network Information Model, is currently under development. We have designed a simple model for the integration of Ganglia, NWS, and NINO. Ganglia and NINO provide the UI for users to manage and monitor grid environments. NWS and Ganglia collect related information from hosts and the network regularly. And the "Smart Broker" provides parameters to applications like Cyber. The Smart Broker is the key component we use to evaluate QoS. Our previous work [8, 12] provided users an interface for tuning parameters, as shown in Fig. 5. But most parameters used by this application, Cyber, must be set manually, which is very inconvenient. We developed the "Smart Broker" to help us achieve automatic parameter self-optimization in diverse scenarios. The Smart Broker works as the evaluation layer between the applications and the information collection layer. We have pre-defined four task types that perform QoS measurement in various ways:
• Download
• Upload
• Computational
• Hybrid
Fig. 5. Smart Broker Model
4 Experimental Results

In this paper, we chose four grid nodes as "Headers" (called "borders" in the domain-based network information model) to register specific NWS services, namely the cliques for gathering inter-domain network performance. Besides these headers, we also registered an NWS clique named "cross-domain" to measure the network performance between the headers. The information collected by the NWS services is our basis to evaluate QoS. Hence, we have to ensure that the NWS service deployment we performed is applicable. We adopted a pull-based model to collect the network information measured by the NWS services, as shown in Fig. 6. All grid nodes were deployed with NWS sensors, and zeta1, beta2, delta2, and eta4 were deployed with both the NWS nameserver and memoryserver. Zeta1 was deployed with a routine to collect (pull) all network information measured by the NWS services and to load these raw data into the local database. The versions of the operating systems on these grid nodes differ, but this does not influence our work.
Fig. 6. NWS cliques deployed in our Grid environments
Fig. 7. NWS measurements as our basis for QoS evaluation
We can easily find that the numbers of measurements in the NWS cliques may be uneven. For example, eta7–eta9 has the minimum number of measurements, 615, while eta1–eta3 has the maximum, 754. Uneven measurements may influence the accuracy of our model when evaluating QoS with statistical approaches. The NWS services have the ability to avoid collisions, which may cause measurement inaccuracy, but this advantage is restricted to sensors under the same nameserver. In our test-bed, we found that collisions influence the accuracy frequently; network performance shows a great variation due to collisions of the NWS measurements. Fig. 8 shows the NWS measurements of our test-bed. Although the QoS evaluation model we adopted in this paper cannot absolutely predict the real performance of real-time task execution, we can still pick out the best selection of resources by means of the QoS evaluation model. To verify the usability of this QoS evaluation approach, we also performed a simple file transmission experiment, and the result is the same as our prediction using the QoS evaluation model.
Fig. 8. NWS measurements as our basis for QoS evaluation
5 Conclusions

In this paper, we propose a heuristic QoS measurement constructed with a domain-based information model and using a Relational Database Management System. According to this scheme, we can retrieve both real-time and historical network information. With customized shell scripts, NWS services can be quickly deployed to grid machines to fetch network information regularly. And with an RDBMS, we can not only keep historical information, but also design the statistical analyses we need. Statistics is helpful in many fields, for example, job dispatching or replica selection. We are planning to adopt approaches proposed by other researchers to reduce the number of measurements in the near future. This evaluation approach should also be adjusted to meet the requirements of the other three task types, i.e., Upload-oriented, Computational, and Hybrid. We will make a study of these kinds of tasks before long.
References
1. Krauter, K., Buyya, R., Maheswaran, M.: A taxonomy and survey of grid resource management systems for distributed computing. Softw. Pract. Exper. 32(2), 135–164 (2002)
2. Cao, J., Jarvis, S.A., Saini, S., Kerbyson, D.J.: Performance prediction and its use in parallel and distributed computing systems. Future Generation Computer Systems 22(7), 745–754 (2006), doi:10.1016/j.future.2006.02.008
3. Chung, W., Chang, R.: A new mechanism for resource monitoring in Grid computing. Future Generation Computer Systems 25(1), 1–7 (2009), doi:10.1016/j.future.2008.04.008
4. Krefting, D., Vossberg, M., Tolxdorff, T.: Simplified Grid Implementation of Medical Image Processing Algorithms using a Workflow Management System. Presented at the MICCAI-Grid Workshop, New York (2008), http://www.i3s.unice.fr/~johan/MICCAI-Grid/website.html (retrieved)
5. Legrand, A., Quinson, M.: Automatic deployment of the Network Weather Service using the Effective Network View. In: Proceedings of the 18th International Parallel and Distributed Processing Symposium (2004)
6. Network Weather Service, http://nws.cs.ucsb.edu/ewiki/ (retrieved)
7. Wolski, R., Spring, N., Hayes, J.: The network weather service: A distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems 15(5-6), 757–768 (1999)
8. Yang, C.T., Chen, C.H., Yang, M.F., Chiang, W.C.: MIFAS: Medical Image File Accessing System in Co-allocation Data Grids. In: IEEE Asia-Pacific Services Computing Conference, APSCC'08, pp. 769–774 (2008)
9. Yang, C., Chen, S.: A Multi-site Resource Allocation Strategy in Computational Grids. In: Advances in Grid and Pervasive Computing, pp. 199–210 (2008), http://dx.doi.org/10.1007/978-3-540-68083-3_21 (retrieved July 29, 2009)
10. Yang, C., Chen, T., Tung, H.: A Dynamic Domain-Based Network Information Model for Computational Grids. In: Future Generation Communication and Networking, vol. 1, pp. 575–578. IEEE Computer Society, Los Alamitos (2007), http://doi.ieeecomputersociety.org/10.1109/FGCN.2007.9
11. Yang, C., Shih, P., Chen, S., Shih, W.: An Efficient Network Information Model Using NWS for Grid Computing Environments. In: Grid and Cooperative Computing - GCC 2005, pp. 287–299, http://dx.doi.org/10.1007/11590354_40 (retrieved July 29, 2009)
12. Yang, C., Yang, M., Chiang, W.: Implementation of a Cyber Transformer for Parallel Download in Co-Allocation Data Grid Environments. In: Proceedings of the 2008 Seventh International Conference on Grid and Cooperative Computing, pp. 242–253. IEEE Computer Society, Los Alamitos (2008), http://portal.acm.org/citation.cfm?id=1471431 (retrieved July 29, 2009)
13. Yu, Y., Cheng, I., Basu, A.: Optimal adaptive bandwidth monitoring for QoS based retrieval. IEEE Transactions on Multimedia 5(3), 466–472 (2003), doi:10.1109/TMM.2003.814725
14. Cheng, Z., Du, Z., Zhu, S.: A Service Level QoS Mechanism and Algorithm for Data Distribution and Backup in a Grid Based Astronomy Data Management System. Presented at the Sixth International Conference on Grid and Cooperative Computing, GCC 2007, pp. 430–436 (2007), doi:10.1109/GCC.2007.25
An Efficient Circuit–Switched Broadcasting in Star Graph Cheng-Ta Lee and Yeong-Sung Lin Department of Information Management, National Taiwan University No. 1, Sec. 4, Roosevelt Road, Taipei, 10617 Taiwan {d90001,yslin}@im.ntu.edu.tw
Abstract. In this paper, we propose an algorithm for broadcasting in the star graph using circuit-switched, half-duplex, and link-bound communication. Using the algorithm, we show that broadcasting in an n-dimensional star graph can be done in n−1 time steps. We also study the lower bound on the number of time steps of circuit-switched broadcasting in the star graph, and we prove that the optimal number of broadcasting time steps in an n-dimensional star graph is ⌈log_n n!⌉. Finally, the computational results show that the proposed algorithm obtains nearly optimal solutions.

Keywords: Broadcasting, interconnection network, star graph, circuit-switched routing, link-bound.
1 Introduction

The star graph interconnection network, since being proposed in [1], [2], has been receiving increasing attention in the literature. It has been considered an attractive alternative to the popular hypercube as the network architecture for parallel processing, partly because of its symmetric and recursive nature, and its superior (lower) node degree and comparable diameter as opposed to the hypercube [14]. Many references can be found studying the star graph regarding its topological properties [3], [4], [13], embedding capability [5], [6], fault-tolerance capability [7], [8], [9], and even the construction of incomplete stars [10]. Among the efforts of studying the star graph, one of the central issues is the various versions of the broadcasting problem; broadcasting refers to the process by which a data set is sent from one node to all other nodes. Results about broadcasting are summarized in the papers by Hedetniemi et al. [11] and Fraigniaud et al. [12]. In this paper, we consider the problem of broadcasting in the star graph using circuit-switched, half-duplex, and link-bound communication. We propose an Efficient Circuit-Switched Broadcasting (ECSB) algorithm for an n-dimensional star graph with n! nodes. Using this algorithm, we show that broadcasting in an n-dimensional star graph is done in n−1 time steps. The rest of this paper is organized as follows. In Section 2, we describe our communication model. In Section 3, we discuss lower bounds on the optimal circuit-switched
broadcasting time steps. An efficient circuit-switched broadcasting algorithm is presented in Section 4. Finally, we give our concluding remarks in Section 5.
2 Communication Model

An n-dimensional star graph, also referred to as the n-star or Sn, is an undirected graph consisting of n! nodes (or vertices) and (n−1)n!/2 edges. Each node is uniquely assigned a label x0x1…xn−1, which is the concatenation of a permutation of n distinct symbols; without loss of generality, let these n symbols be {0, 1, …, n−1}. Given any node label x0…xi…xn−1, let the function gi, 1 ≤ i ≤ n−1, be such that gi(x0…xi…xn−1) = xi…x0…xn−1 (i.e., swap x0 and xi and keep the remaining symbols unchanged). In Sn, for any node x, there is an edge joining x and node gi(x), and this edge is said to be along dimension i. It is known that Sn is node- and edge-symmetric and has a diameter of Dn = ⌊3(n−1)/2⌋. In the circuit-switched model, a node x sends its message to a node y via a directed path. Between two neighboring nodes in the star graph there exists exactly one link, which can be used in both directions (but in only one direction at a time), i.e., a half-duplex link; link-bound communication is assumed, i.e., a node can use all of its links at the same time. Fig. 1 shows an example of circuit-switched broadcasting in S3 under our communication model. In Fig. 1(a), the source node 012 sends a message to nodes 210 and 201 during the first time step. Fig. 1(b) shows that the source node 012 and the informed nodes, 210 and 201, send messages to the remaining three nodes during the next time step.
Fig. 1. An example of the circuit-switched broadcasting in S3
3 Lower Bound of the Optimal Broadcasting Time Steps

In this section, we study the lower bound on the number of time steps of circuit-switched broadcasting in the star graph.

Theorem 1. The optimal number of broadcasting time steps for an n-dimensional star graph with n! nodes is ⌈log_n n!⌉.
Proof: The proof of Theorem 1 follows from the observation that each node can send the message to at most n−1 uninformed nodes at each time step (cf. Fig. 1), because each node has degree n−1 in Sn. For the fastest possible broadcasting, the source node and the informed nodes must each inform exactly n−1 other nodes at every time step except the last one. Therefore, the lower bound on the optimal number of broadcasting time steps is ⌈log_n n!⌉.
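As a quick numerical check of the bound (and of the values later listed in Table 1), the following small Java program computes ⌈log_n n!⌉ for n = 3, …, 10; it is an illustrative aid, not part of the original paper.

public class StarGraphLowerBound {
    public static void main(String[] args) {
        for (int n = 3; n <= 10; n++) {
            // log_n(n!) = (sum_{k=2..n} ln k) / ln n, computed in logs to avoid overflow.
            double logFactorial = 0.0;
            for (int k = 2; k <= n; k++) logFactorial += Math.log(k);
            // Small epsilon guards against floating-point error just above an integer.
            int lowerBound = (int) Math.ceil(logFactorial / Math.log(n) - 1e-9);
            System.out.printf("n=%d  lower bound=%d  ECSB=n-1=%d%n", n, lowerBound, n - 1);
        }
    }
}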
4 An Efficient Circuit-Switched Broadcasting Algorithm

In this section, we present an efficient circuit-switched broadcasting algorithm for an n-dimensional star graph with n−1 broadcasting time steps, n ≥ 3. In order to facilitate our discussion, we introduce the following definitions.

Definition 1. The generator g0 is defined by v g0 = v, where v is the source node or an informed node in Sn.

Definition 2. We define a function send[v, g_{x_0} g_{x_1} … g_{x_c}] to send the message from node v to the node located by the g_{x_0} g_{x_1} … g_{x_c} function, for c ≥ 0, x_i ≥ 0, where v is the source node or an informed node in Sn.

The proposed ECSB algorithm is shown in Fig. 2.
Algorithm. Efficient Circuit-Switched Broadcasting
Input: Sn and a source node
Output: Broadcast to all the nodes
1: begin
2:   for i = n−1 to 2 do
3:     for j = i to 1 pardo
4:       send[v, g_{j−1} g_i];
5:     end for
6:   end for
7:   send[v, g_1];
8: end

Fig. 2. The efficient circuit-switched broadcasting algorithm
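To make the generator notation concrete, here is a small Java sketch of the g_i operation on node labels and of the send schedule from Fig. 2; it prints which composite generator each sender would apply at each time step, illustrated from the source label. This is our illustrative reading of the algorithm, not code from the paper.

public class EcsbSketch {

    /** g_i: swap symbol 0 and symbol i of the label (g_0 is the identity). */
    static String g(String label, int i) {
        if (i == 0) return label;
        char[] s = label.toCharArray();
        char tmp = s[0]; s[0] = s[i]; s[i] = tmp;
        return new String(s);
    }

    public static void main(String[] args) {
        int n = 4;
        String source = "0123";
        // Lines 2-6 of Fig. 2: at time step n-i, a node v sends via g_{j-1} g_i.
        for (int i = n - 1; i >= 2; i--) {
            System.out.println("time step " + (n - i) + ":");
            for (int j = i; j >= 1; j--)  // performed in parallel (pardo)
                System.out.println("  send[v, g" + (j - 1) + " g" + i + "]  e.g. "
                        + source + " -> " + g(g(source, j - 1), i));
        }
        // Line 7: the last time step uses g_1 alone.
        System.out.println("time step " + (n - 1) + ": send[v, g1]  e.g. "
                + source + " -> " + g(source, 1));
    }
}

For v = 0123 this reproduces the mappings used in the proof of Lemma 2 below (g2 g3 gives 3102, g1 g3 gives 3021, g0 g3 gives 3120).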
Lemma 1. The efficient circuit-switched broadcasting algorithm for an n-dimensional star graph with n! nodes can be done in n−1 time steps.
Proof: In the broadcasting algorithm, lines 2–6 execute n−2 time steps and line 7 executes one time step. Hence, the algorithm executes n−1 time steps.
Lemma 2. The algorithm broadcasts from the source node to all other nodes.
Proof: In the efficient circuit-switched broadcasting algorithm, lines 2–6 send messages to the other substars in each iteration, because v can send messages to another substar via the g_{j−1} g_i function. For example, in S4, v = 0123 is mapped by g2 g3 to 3102, by g1 g3 to 3021, and by g0 g3 to 3120. Line 7 sends messages to the neighboring node via the function g1 in the last time step. Hence, the algorithm broadcasts from the source node to all the other nodes.
Fig. 3. An example of the efficient circuit-switched broadcasting in S4: (a) 1st time step, (b) 2nd time step, (c) 3rd time step

Table 1. The comparison of the time steps of our proposed algorithm and the lower bound time steps in an n-dimensional star graph
Star size: n    Number of nodes: n!    Lower bound: ⌈log_n n!⌉    ECSB algorithm: n−1
3               6                      2                          2
4               24                     3                          3
5               120                    3                          4
6               720                    4                          5
7               5,040                  5                          6
8               40,320                 6                          7
9               362,880                6                          8
10              3,628,800              7                          9
Fig. 3 shows an example of the broadcasting in S4. Figs. 3(a), 3(b), and 3(c) are the 1st time step, 2nd time step, and 3rd time step, respectively.
5 Conclusion

We considered the problem of broadcasting in the n-dimensional star graph using circuit-switched, half-duplex, and link-bound communication. The results show that the broadcasting algorithm completes in the star graph in a nearly optimal number of time steps. The comparison of the time steps of our proposed algorithm with the lower bound is listed in Table 1.
References
1. Akers, S.B., Harel, D., Krishnamurthy, B.: The star graph: an attractive alternative to the n-cube. In: International Conference on Parallel Processing, pp. 393–400 (1987)
2. Akers, S.B., Krishnamurthy, B.: A group-theoretic model for symmetric interconnection networks. IEEE Transactions on Computers 38(4), 555–566 (1989)
3. Day, K., Tripathi, A.: A comparative study of topological properties of hypercubes and star graphs. IEEE Transactions on Parallel and Distributed Systems 5(1), 31–38 (1994)
4. Qiu, K.: On some properties and algorithms for the star and pancake interconnection network. Journal of Parallel and Distributed Computing, 16–25 (1994)
5. Jwo, J.S., Lakshmivarahan, S., Dhall, S.K.: Embeddings of cycles and grids in star graphs. In: IEEE International Symposium on Parallel and Distributed Processing, pp. 540–547 (1990)
6. Nigam, M., Sahni, S., Krishnamurthy, B.: Embedding hamiltonians and hypercubes in star interconnection graphs. In: International Conference on Parallel Processing, vol. 3, pp. 340–343 (1990)
7. Bagherzadeh, N., Nassif, N., Latifi, S.: A routing and broadcasting scheme on faulty star graphs. IEEE Transactions on Computers 42(11), 1398–1403 (1993)
8. Jovanovic, Z., Mišic, J.: Fault tolerance of the star graph interconnection network. Information Processing Letters 49(3), 145–150 (1994)
9. Latifi, S.: On the fault-diameter of the star graph. Information Processing Letters 46(3), 143–150 (1993)
10. Latifi, S., Bagherzadeh, N.: Incomplete star: an incrementally scalable network based on the star graph. IEEE Transactions on Parallel and Distributed Systems 5(1), 97–102 (1994)
11. Hedetniemi, S.M., Hedetniemi, S.T., Liestman, A.L.: A survey of gossiping and broadcasting in communication networks. IEEE Networks 18(4), 319–349 (1988)
12. Fraigniaud, P., Lazard, E.: Methods and problems of communication in usual networks. Discrete Applied Mathematics 53(1-3), 79–133 (1994)
Parallel Domain Decomposition Methods for High-Order Finite Element Solutions of the Helmholtz Problem Youngjoon Cha1 and Seongjai Kim2 1 Department of Applied Mathematics Sejong University, Seoul, 143-747, South Korea [email protected] 2 Department of Mathematics & Statistics Mississippi State University, Mississippi State, MS 39762 USA [email protected]
Abstract. This article is concerned with a parallel iterative domain decomposition algorithm for high-order finite element solutions of the Helmholtz wave equation. The iteration is performed in a block-Jacobi manner. For the interface operator, a Robin interface boundary condition is employed in a modified form which allows possible discontinuities of the discrete normal flux on the subdomain interfaces. The convergence of the algorithm is analyzed using energy estimates. Numerical results are given to show the effectiveness and parallel efficiency of the algorithm for the simulation of high-frequency waves in heterogeneous media in the two-dimensional space. The algorithm is carried out on a 16-node Linux cluster; a parallel efficiency of more than 97% has been observed for all tested problems.
1 Introduction
Let Ω ⊂ ℝᵈ, d = 2 or 3, be a logically rectangular/cubic domain with its boundary Γ = ∂Ω. Consider the following Helmholtz problem

$\text{(a)}\; -\Delta u - K(x)^2\,u = S(x),\ x \in \Omega, \qquad \text{(b)}\; \frac{\partial u}{\partial \nu} + i\,\alpha(x)\,u = 0,\ x \in \Gamma, \qquad (1)$
where i is the imaginary unit, ν is the unit outward normal to Γ, and the coefficients K(x) and α(x) satisfy

$K(x)^2 = p(x)^2 - i\,q(x)^2, \quad 0 < p_0 \le p(x) \le p_1 < \infty, \quad 0 \le q_0 \le q(x) \le q_1 < \infty, \quad \alpha = \alpha_r - i\alpha_i,\ \alpha_r > 0,\ \alpha_i \ge 0, \qquad (2)$
and are sufficiently regular so that (1) admits a unique solution lying in H¹(Ω). Here (1b) is an absorbing boundary condition (ABC); for example, one can select
α appropriately so that (1b) represents a first-order ABC that allows normally incident waves to pass out of Ω transparently [1]. The Helmholtz problem is difficult to solve numerically, in particular when 0 ≤ q ≪ p. In addition to having a complex-valued solution, it is neither Hermitian symmetric nor coercive. As a consequence, most standard iterative methods either fail to converge or converge only very slowly. In many applications (e.g., geophysical wave simulation and seismic velocity inversion), it is often required to reproduce waves over 50–100 wavelengths in heterogeneous media. It is known that second-order discretization methods need to select at least 10–12 points per wavelength (2π/p) for stability reasons [2,3]. However, in practical high-frequency applications, one should choose at least 20–25 grid points per wavelength for reasonable accuracy [2,3,4]. Thus the algebraic system for the numerical solution of the Helmholtz problem becomes very large for realistic applications, besides being poorly conditioned. As Zienkiewicz [5] pointed out, "the problem remains unsolved and a completely new method is needed." Most of the computational methods in the literature have been suggested for lower-order numerical solutions of constant-coefficient Helmholtz problems. For general coefficient problems, Kim [6,7,8,9] studied nonoverlapping DD methods for solving the Helmholtz problem by finite difference and linear finite element (FE) methods; see also [2]. For the simulation of high-frequency waves in heterogeneous media, Kim et al. [10] suggested the so-called high-frequency asymptotic decomposition method, in which the wavefield is decomposed into two parts (the phase and the cumulative amplitude) and the solution can be simulated by solving two easier-to-solve equations. This article develops a parallel iterative DD method for high-order FE solutions of the Helmholtz problem with variable coefficients. DD methods can combine iterative methods at the interface level and direct algorithms at the subdomain level, which makes them attractive for poorly-conditioned large problems such as the Helmholtz problem. We will consider a nonoverlapping algorithm incorporating a Robin interface boundary condition (RIBC). Note that RIBCs impose the continuity of both the discrete solution uh and its normal flux on the subdomain interfaces, while most conforming FE methods admit discontinuities in the normal flux on the element interfaces. Thus, the RIBC should be modified appropriately in order for the DD method to converge to the original discrete solution.
2 Preliminaries
This section begins with a brief review of the existence and uniqueness of the weak solution of (1). We then present convergence properties for the FE solution of the Helmholtz problem. In the following, L²(D) denotes the space of all square-integrable functions f on a domain D; (·, ·)_D and ‖·‖_{0,D} are the corresponding inner product and norm, respectively. Analogously, H^m(D) is the usual m-th order Sobolev space on D with the norm ‖·‖_{m,D}, for a positive integer m.
2.1 Existence and Uniqueness of the Solution
The weak formulation of (1) is given by seeking u ∈ V = H¹(Ω) such that

$(\nabla u, \nabla v)_\Omega - (K^2 u, v)_\Omega + i\langle \alpha u, v\rangle_\Gamma = (S, v)_\Omega, \quad \forall\, v \in V, \qquad (3)$

where

$(f, g)_\Omega = \int_\Omega f\,\overline{g}\,dx, \qquad \langle f, g\rangle_\Gamma = \int_\Gamma f\,\overline{g}\,d\sigma.$
For a simpler presentation, we define the following bilinear form:

$L(u, v; D) = (\nabla u, \nabla v)_D - (K^2 u, v)_D + i\langle \alpha u, v\rangle_{\partial D \cap \Gamma}, \quad D \subset \Omega.$
We cite the following lemma.

Lemma 1. [11] The weak formulation (3) of the Helmholtz problem admits a unique solution u ∈ H¹(Ω) for S ∈ L²(Ω).
2.2 The Discrete Solution
Given an FE subspace $V^h \subset V \cap Q^h_r$, where $Q^h_r$ is the space of the r-th order splines corresponding to the set of finite elements $T^h$, r = 1, 2, · · ·, the FE approximation of the weak solution u of (3) is the function $u^h \in V^h$ such that

$L(u^h, v; \Omega) = (S, v)_\Omega, \quad \forall\, v \in V^h. \qquad (4)$
Let the approximation error $e^h = u - u^h$, where u is the solution of (3) and $u^h$ is the solution of (4). It is known [12] that (4) has a unique solution $u^h \in V^h$ for $p^2 h$ sufficiently small and that

$\|e^h\|_{0,\Omega} = O(p^{r+2} h^{r+1}), \qquad \|e^h\|_{1,\Omega} = O(p^{r+1} h^{r}), \qquad (5)$
for certain classes of data S, e.g., $\|S\|_{r-1,\Omega} \le C_r \|u\|_{0,\Omega}$, where $C_r$ is independent of p. The algebraic system for (4) can be written as

$A u^h = b. \qquad (6)$
It is extremely difficult to solve (6) for non-attenuated (q = 0) or slightly attenuated (q small) waves. The attenuation coefficient q is negligible in certain cases, e.g., ocean acoustics and optical waves in a vacuum. For the case 0 ≤ q ≪ p, it has been verified that relaxation methods such as Jacobi and SOR iterations do not converge and that nonsymmetric Krylov subspace algorithms (GCR [13], GMRES [14], etc.) either converge very slowly or suffer possible breakdowns [15,16,17]. The existence of a convergent nonsymmetric Krylov subspace algorithm for (6) is equivalent to the positive definiteness of the imaginary part of A (i.e., q(x) ≥ q0 > 0) [17].
Fig. 1. The domain decomposed into two nonoverlapping subdomains
3 The Domain Decomposition Method
This section introduces a nonoverlapping, iterative DD method for high-order FE solutions of the Helmholtz problem (1)-(2), whose convergence can be analyzed by applying energy estimates [8]. For simplicity, we will consider 2D problems and choose a rectangular reference finite element with the shape functions being tensor products of 1D quadrature-based shape functions; the arguments to be presented are applicable to the 3D case with minor modification. Also for simplicity, we will present the DD algorithm in the algebraic formulation, rather than the variational formulation.
3.1 The DD Method in the Algebraic Formulation
Consider a rectangular domain with a uniform mesh and two rectangular subdomains (Ω₁ and Ω₂); see Fig. 1. Let its degrees of freedom be ordered by Ω₁\Γ₁₂ first, the interface Γ₁₂ next, and then Ω₂\Γ₁₂. Then, the algebraic system (6) corresponding to this ordering can be written as

$\begin{bmatrix} A_{11} & A_{12} & 0 \\ A_{21} & A_{22} & A_{23} \\ 0 & A_{32} & A_{33} \end{bmatrix} \begin{bmatrix} u^h_1 \\ u^h_2 \\ u^h_3 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}. \qquad (7)$

However, when a DD method is considered, the degrees of freedom on the interface Γ₁₂ will be counted twice, once for each of Ω₁ and Ω₂. Thus, the algebraic system corresponding to the DD method reads

$\widetilde{A}\, \widetilde{u}^h = \widetilde{b}, \qquad (8)$

where

$\widetilde{A} = \begin{bmatrix} A_{11} & A_{12} & 0 & 0 \\ A_{21} & A_{22} & 0 & A_{23} \\ A_{21} & 0 & A_{22} & A_{23} \\ 0 & 0 & A_{32} & A_{33} \end{bmatrix}, \quad \widetilde{u}^h = \begin{bmatrix} u^h_1 \\ u^h_{12} \\ u^h_{21} \\ u^h_3 \end{bmatrix}, \quad \widetilde{b} = \begin{bmatrix} b_1 \\ b_2 \\ b_2 \\ b_3 \end{bmatrix}.$

Here $u^h_{12} = u^h_{21} = u^h_2$.
Fig. 2. An example of nonoverlapping domain decomposition
The above system can be solved in a block-Jacobi manner, parallelized, with each subdomain problem being solved by a direct method such as the LU factorization. The following iterative procedure for (8) is formulated via a matrix splitting, incorporating a stabilization term: For a given $\widetilde{u}^{h,0}$, find $\widetilde{u}^{h,n}$, $n = 1, 2, \cdots$, by recursively solving

$$\widetilde{P}\, \widetilde{u}^{h,n} = \widetilde{b} + \widetilde{R}\, \widetilde{u}^{h,n-1}, \qquad (9)$$

where

$$\widetilde{P} = \begin{bmatrix} A_{11} & A_{12} & 0 & 0 \\ A_{21} & A_{22} + i\beta D_{12} & 0 & 0 \\ 0 & 0 & A_{22} + i\beta D_{21} & A_{23} \\ 0 & 0 & A_{32} & A_{33} \end{bmatrix}, \qquad \widetilde{R} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & i\beta D_{12} & -A_{23} \\ -A_{21} & i\beta D_{21} & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$

Here $D_{jk}$ are diagonal matrices with their main diagonal elements being positive; one may choose $D_{12} = D_{21} = \mathrm{diag}\{1, 1, \cdots, 1\}$. The quantity $\beta$ denotes a complex-valued relaxation parameter, $\beta = \beta_r - i\beta_i$, $\beta_r > 0$, $\beta_i \ge 0$. Such a choice of $\beta$ not only introduces a convergent sequence of iterates $\{\widetilde{u}^{h,n}\}$ but also imposes the continuity of the iterates on the interface, i.e., $u_{12}^{h,n} \to u_{21}^{h,n}$ as $n \to \infty$. Note that $\widetilde{P} - \widetilde{R}$ is equivalent to $\widetilde{A}$ when $u_{12}^h = u_{21}^h$.

The above algorithm can be applied for a general number of subdomains $M$ in a similar way; see Figure 2. The difference in the algebraic formulation (9) is that $\widetilde{P}$ and $\widetilde{R}$ consist of $M$ blocks rather than two.

Theorem 1. Assume $q(x) \ge q_0 > 0$ and let the relaxation parameter $\beta = \beta_r - i\beta_i$ be given as

$$\frac{C_1 h^{-1}}{4} \le \beta_i < \beta_r \frac{q_0^2}{p_1^2},$$
where $C_1$ is a positive constant. Then the iterates $u^{h,n}$ of algorithm (9) of $M$ subdomains, $M \ge 2$, converge to the original discrete solution of (4), $u^h \in V^h$, in the following sense:

$$u^{h,n}|_{\Omega_j} \to u^h|_{\Omega_j} \ \text{ in } L^2(\Omega_j), \qquad j = 1, \cdots, M.$$
Furthermore, let the relaxation parameter $\beta = \beta_r - i\beta_i$ be chosen as

$$\beta_i = \frac{C_1 h^{-1}}{4}, \qquad \beta_r = \xi\, \beta_i\, \frac{p_1^2}{q_0^2}, \qquad \xi = 1 + \left( 1 + \frac{h^2}{4 C_1} \cdot \frac{p_1^4\, q_0^4}{p_1^4 + q_0^4} \right)^{1/2}.$$
Then the spectral radius of the iteration matrix of algorithm (9) is minimized and bounded as

$$\rho(\widetilde{P}^{-1} \widetilde{R}) \le 1 - C_2 \frac{q_0^4 h^2}{p_1^2}, \qquad (10)$$

for some $C_2 > 0$ independent of $h$, $p$, and $q$.

For general cases, i.e., $q \ge q_0 \ge 0$, we do not know of any convergence analysis for the DD algorithm (9). As mentioned in Section 2, the positivity of $q_0$ is equivalent to the existence of a convergent nonsymmetric Krylov subspace algorithm [17,18]. However, the right-hand side of (10) is most sensitive to the attenuation coefficient $q$. When $q$ increases (even a little), the convergence would be significantly improved. This is the motivation of the following artificial damping iteration (ArtDI): Given a constant $\eta > 0$ and an initial guess $u^{h,0} \in V^h$, find $u^{h,\ell} \in V^h$, $\ell \ge 1$, by recursively solving

$$L(u^{h,\ell}, v; \Omega) + (i\eta^2 u^{h,\ell}, v)_\Omega = (S, v)_\Omega + (i\eta^2 u^{h,\ell-1}, v)_\Omega, \quad \forall\, v \in V^h. \qquad (11)$$
One can show that the above iteration converges when the imaginary parts of all eigenvalues of $A$ are nonnegative, which also can be proved for $p_1^2 h^2$ sufficiently small. Each step of the ArtDI algorithm (11) is a perturbation of (4), with the wave number $K$ changed to a new one, $\widetilde{K}$, where $\widetilde{K}^2 = p^2 - i(q^2 + \eta^2)$. Thus one can solve each step by applying the DD method (9) as an inner iteration. The ArtDI algorithm in the algebraic formulation can be expressed as follows:

    Set η > 0, ε > 0, and ũ^{h,0};
    For ℓ = 1, 2, · · ·
        ũ^{h,ℓ,0} = ũ^{h,ℓ−1};
        For n = 1, 2, · · · , n*
            (P̃ + iη²D) ũ^{h,ℓ,n} = b̃ + iη²D ũ^{h,ℓ−1} + R̃ ũ^{h,ℓ,n−1};
        ũ^{h,ℓ} = ũ^{h,ℓ,n*};
        If ∥ũ^{h,ℓ} − ũ^{h,ℓ−1}∥₀ < ε, stop;                                  (12)
where $D$ is a diagonal matrix obtained from the numerical integration of the square of the shape functions for each of the nodal points. In the above ArtDI algorithm, the inner iteration is carried out incompletely, stopping after $n^*$ iterations; for $n^*$ sufficiently large, ArtDI would converge.
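To make the nesting of the two iterations in (12) concrete, the following C++ sketch outlines how the ArtDI outer loop can drive the DD sweep (9) as its inner iteration. This is only an illustration of the control flow, not the authors' code: we assume the matrices have been assembled into Eigen sparse types, and a single sparse LU factorization of P̃ + iη²D stands in for the independent per-subdomain LU solves used in the paper.

#include <Eigen/Sparse>
#include <complex>

using Cplx  = std::complex<double>;
using SpMat = Eigen::SparseMatrix<Cplx>;
using Vec   = Eigen::VectorXcd;

// ArtDI outer iteration (12) with the DD sweep (9) as inner iteration.
// P, R: splitting matrices of (9); D: diagonal damping matrix; b: RHS.
Vec artdi(const SpMat& P, const SpMat& R, const SpMat& D, const Vec& b,
          double eta, int n_star, double eps, int max_outer) {
  const Cplx i_eta2(0.0, eta * eta);          // i * eta^2
  SpMat Pd = P + i_eta2 * D;                  // P~ + i*eta^2*D
  Eigen::SparseLU<SpMat> lu;
  lu.compute(Pd);                             // factor once, reuse each sweep
  Vec u_prev = Vec::Zero(b.size());           // zero initial guess u^{h,0}
  for (int ell = 1; ell <= max_outer; ++ell) {
    Vec u = u_prev;                           // u^{h,ell,0} = u^{h,ell-1}
    Vec rhs_fixed = b + i_eta2 * (D * u_prev);
    for (int n = 0; n < n_star; ++n) {        // incomplete inner iteration
      Vec rhs = rhs_fixed + R * u;
      u = lu.solve(rhs);                      // one block-Jacobi DD sweep
    }
    if ((u - u_prev).norm() < eps * u.norm()) return u;
    u_prev = u;                               // u^{h,ell}
  }
  return u_prev;
}

Setting eta = 0 and n_star = 1 reduces the routine to the plain DD iteration (9), mirroring the remark about the choice (η, n*) = (0, 1) made in Section 4.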
4 Numerical Results
In this section, we verify the accuracy and efficiency of algorithms (9) and (12) for solving the Helmholtz equation in 2D media by FE methods, for various choices of $h$, $r$, $\eta$, and $n^*$. Set the domain $\Omega = (0,1)^2$; we consider uniform quadrilateral elements of edge length $h = 1/n_p$, $n_p > 0$, and FE methods incorporating Legendre-Gauss-Lobatto splines of order $r$. The algorithm is implemented in C++ for the main function and FORTRAN for the others, and carried out on a 16-node cluster of 2.4 GHz Pentium 4 processors, with 512 MB of RAM each. The wave number is selected as $p(x) = \omega/v(x)$, where $\omega := 2\pi f$ denotes the angular frequency and $f$ the frequency. The wave speed $v(x)$ is chosen as follows:

$$v_1(x,y) \equiv 1, \qquad v_2(x,y) = 1.6 + |\sin 3\pi x \, \cos 4\pi y|, \qquad v_3(x,y) = \begin{cases} 2, & (x,y) \in [0.45, 0.75] \times [0.55, 0.75], \\ 1, & \text{otherwise}. \end{cases}$$

Note that $v_2$ is continuous (but not smooth) and $v_3$ is piecewise constant. For the ABC, we set $\alpha(x) = p(x)$. Since we are interested in the propagation of waves in slightly-attenuate or non-attenuate media, the quality factor can be defined by

$$Q := \frac{p^2}{q^2} = \frac{\omega^2}{v^2 q^2} \in (0, \infty], \qquad (13)$$

where $Q = \infty$ for $q = 0$. The quality factor is known to be between 50 and 300 in most earth media. The higher it is, the less attenuate the medium and therefore the harder the problem is to solve. In this article the attenuation coefficient $q(x)$ is determined from the selected values of $Q$, $\omega$, and $v(x)$ utilizing (13).
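Solving (13) for $q$ makes this determination explicit; the numerical values below are our own sanity check, not taken from the paper:

$$q(x) = \frac{\omega}{v(x)\sqrt{Q}}\,; \quad \text{e.g., for } f = 10\ (\omega = 20\pi),\ v \equiv 1,\ Q = 100:\quad q = \frac{20\pi}{1\cdot\sqrt{100}} = 2\pi \approx 6.28.$$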
Let $S_t(x)$ be the source that corresponds to the following true solution

$$u(x) = \frac{\phi(x)\cdot\phi(y)}{\omega^2}, \qquad (14)$$

where $\phi(x) = e^{i\omega(x-1)} + e^{-i\omega x} - 2$, and let $S_{x_0}(x) = \delta(x - x_0)$, for some $x_0 \in \Omega$. One can decompose the domain in various ways; however, for simplicity, we consider the element-wise decomposition. For parallelism, strips of subdomains are equally divided and assigned to the processors; see Figure 2. The wall-clock time is denoted by CPU (in seconds) and the iteration is stopped when the iterates satisfy the following stopping criterion:

$$\frac{\|u^{h,m} - u^{h,m-1}\|_\infty}{\|u^{h,m}\|_\infty} < tol,$$
Table 1. Effectiveness of ArtDI. Set S = St, v = v1, Q = ∞, and tol = 1.0e-04.

1/h = 160, r = 1, f = 5:
   η   n*     N    CPU    r0N
   0    1   880    1.9    1.91e-02
  10   10   770    1.3    1.90e-02
  10   20   520    0.8    1.90e-02
  20   20   800    1.5    1.90e-02

1/h = 320, r = 2, f = 30:
   η   n*     N    CPU    r0N
   0    1   diverges
  30   25   diverges
  50   25   925   23.5    3.02e-03
  50   50  1100   26.8    3.20e-03
where $tol$ is the tolerance and $m$ is $n$ or $\ell$, the iteration index for (9) or (12), respectively. The tolerance is chosen as $tol = 10^{-\gamma}$, $\gamma = 4$ or $8$, depending on the solution accuracy. Zero initial values are given, $u^{h,0} \equiv 0$, for all examples dealt with in this article. The total number of DD iterations is denoted by $N$; therefore $N = m$ for (9) and $N = m \cdot n^*$ for (12). For $S = S_t$, the numerical error is measured by the relative $L^2$ norm

$$r_0^N = \frac{\|u^{h,N} - u\|_0}{\|u\|_0},$$
where $u$ is the true solution in (14).

In Table 1, we first verify the effectiveness of the ArtDI. We have tested the algorithm for various choices of $\eta$ and $n^*$. As one can see from the table, the choice of $(\eta, n^*)$ is important for the simulation of non-attenuate waves ($Q = \infty$). Note that the ArtDI method (12) with the choice $(\eta, n^*) = (0, 1)$ becomes the DD algorithm (9). As shown in the convergence analysis, the DD algorithm (without ArtDI) fails to converge for most high-frequency solutions in non-attenuate media. The iteration count $n^*$ must be set large enough to incorporate the ArtDI appropriately and effectively. It has been numerically verified that $n^*$ can be selected depending solely on the grid size $h$, while the selection of $\eta$ is more complicated due to its dependence on both $h$ and the wave number ($\omega/v$). However, the quality factor ($Q$) and the degree of the basis functions ($r$) have shown little effect on the selection of the parameters $\eta$ and $n^*$. For example, one may choose $n^*$ and $\eta$ as follows:

$$n^* = 1/(\kappa h), \quad \kappa = 8 \sim 16, \qquad \eta^2 = \overline{p}^{\,2}/Q_a, \quad Q_a = 10 \sim 20, \qquad (15)$$
where $\overline{p}$ denotes the $L^2$-average of $p = \omega/v$ over the domain and $Q_a$ is the artificial quality factor.

Table 2 shows the performance of the ArtDI (12) for various spline orders $r$. Set $S = S_t$, $Q = \infty$, and $tol = 1.0e{-}08$ for all the results. Note that $r/h$ is set to 384 for the first part ($f = 10$) and to 768 for the second part ($f = 30$), which implies that the number of grid points is the same within each part. As one can see from the table, the case $r = 2$ gives the smallest CPU time, while higher-order methods ($r \ge 3$) produce more accurate results. When the velocity is
Table 2. Accuracy-efficiency test. Set S = St, Q = ∞, and tol = 1.0e-08.

f = 10:
                  v = v1                 v = v2                 v = v3
  1/h  r     N    CPU   r0N          N    CPU   r0N          N     CPU   r0N
  384  1  2842   28.8  2.59e-02   3219   32.6  2.88e-03   6815    69.3  2.44e-02
  192  2  1140   10.0  9.22e-05   1580   13.8  1.56e-05   1320    11.6  8.65e-05
  128  3   840   11.7  1.19e-06   1060   14.7  1.06e-06   6180    85.2  1.18e-06
   96  4   720   12.2  7.76e-08    920   15.7  7.71e-08   9860   167.3  8.55e-08

f = 30:
                  v = v1                 v = v2                 v = v3
  1/h  r     N    CPU   r0N          N    CPU   r0N          N     CPU   r0N
  768  1  6390  263.8  1.72e-01   4410  182.2  7.21e-03   9135   376.7  1.62e-01
  384  2  2117   77.9  1.39e-03   2088   77.2  7.71e-05   3393   125.4  1.31e-03
  256  3  1606   91.3  1.98e-05   1430   81.9  8.11e-06  11572   659.1  1.89e-05
  192  4  1380   97.3  9.31e-07   1280   90.1  8.70e-07  12240   860.6  9.30e-07
continuous ($v = v_1$ and $v = v_2$), the higher-order FE methods ($r \ge 3$) turn out to be only slightly more expensive computationally, while they improve accuracy substantially. One can thus expect that higher-order FE methods ($r \ge 3$) may result in a more efficient method for a fixed accuracy, when the medium is smooth enough. On the other hand, for the discontinuous velocity ($v = v_3$), the higher-order methods cost much more for a relatively small accuracy improvement over the quadratic FE method. We therefore recommend employing the FE method of quadratic splines for the simulation of waves in discontinuous media. We have also tested the parallel efficiency, which turns out to be larger than 97% for all tested examples. Such a high efficiency is due to the application of the LU factorization for the subproblems.
5 Conclusions
A domain decomposition (DD) iterative procedure for solving the Helmholtz wave problem by high-order finite element (FE) methods has been considered. We have chosen nonoverlapping subdomains and employed a modified Robin interface boundary condition. Under certain assumptions on the mesh and the quadrature rule, we have proved the convergence of the algorithm for attenuate waves. For non-attenuate waves, we have introduced the artificial damping iteration (ArtDI) as an outer iteration of the DD method, the convergence of which can be proved when the inner iteration is solved accurately enough. The resulting algorithm combining the ArtDI and DD iterations has been tested for the numerical solution of the Helmholtz problem in 2D media for various spline orders r and diverse frequencies f . The FE method of quadratic splines is recommended for the simulation of waves in heterogeneous media; no apparent phase lag has appeared in the numerical solution for 10-15 grid points per wavelength.
Acknowledgment The work of S. Kim is supported in part by NSF grant DMS-0609815.
References

1. Clayton, R., Engquist, B.: Absorbing boundary conditions for acoustic and elastic wave calculations. Bull. Seismol. Soc. Amer. 67, 1529–1540 (1977)
2. Kim, S., Kim, S.: Multigrid simulation for high-frequency solutions of the Helmholtz problem in heterogeneous media. SIAM J. Sci. Comput. 24, 684–701 (2002)
3. Shaidurov, V., Ogorodnikov, E.: Some numerical methods of solving Helmholtz wave equation. In: Cohen, G., Halpern, L., Joly, P. (eds.) Mathematical and Numerical Aspects of Wave Propagation Phenomena, pp. 73–79. SIAM, Philadelphia (1991)
4. Douglas Jr., J., Hensley, J.L., Roberts, J.E.: An alternating-direction iteration method for Helmholtz problems. Appl. Math. 38, 289–300 (1993)
5. Zienkiewicz, O.C.: Achievements and some unsolved problems of the finite element method. Internat. J. Numer. Methods Engrg. 47, 9–28 (2000) (Richard H. Gallagher Memorial Issue)
6. Kim, S.: A parallelizable iterative procedure for the Helmholtz problem. Appl. Numer. Math. 14, 435–449 (1994)
7. Kim, S.: Parallel multidomain iterative algorithms for the Helmholtz wave equation. Appl. Numer. Math. 17, 411–429 (1995)
8. Kim, S.: Domain decomposition iterative procedures for solving scalar waves in the frequency domain. Numer. Math. 79, 231–259 (1998)
9. Kim, S., Lee, M.: Artificial damping techniques for scalar waves in the frequency domain. Computers Math. Applic. 31(8), 1–12 (1996)
10. Kim, S., Shin, C., Keller, J.: High-frequency asymptotics for the numerical solution of the Helmholtz equation. Appl. Math. Letters 18, 797–804 (2005)
11. Douglas Jr., J., Santos, J.E., Sheen, D.: Approximation of scalar waves in the space-frequency domain. Math. Models Methods Appl. Sci. 4, 509–531 (1994)
12. Bayliss, A., Goldstein, C., Turkel, E.: On accuracy conditions for the numerical computation of waves. J. Comput. Phys. 59, 396–404 (1985)
13. Eisenstat, S., Elman, H., Schultz, M.: Variational iterative methods for non-symmetric systems of linear equations. SIAM J. Numer. Anal. 20, 345–357 (1983)
14. Saad, Y., Schultz, M.: GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 7, 856–869 (1986)
15. Bayliss, A., Goldstein, C., Turkel, E.: An iterative method for the Helmholtz equation. J. Comput. Phys. 49, 443–457 (1983)
16. Faber, V., Manteuffel, T.: Necessary and sufficient conditions for the existence of a conjugate gradient method. SIAM J. Numer. Anal. 21, 352–362 (1984)
17. Joubert, W., Young, D.: Necessary and sufficient conditions for the simplification of the generalized conjugate-gradient algorithms. Linear Algebra Appl. 88/89, 449–485 (1987)
18. Freund, R.W.: Conjugate gradient-type methods for linear systems with complex symmetric coefficient matrices. SIAM J. Sci. Stat. Comput. 13, 425–448 (1992)
Self-Organizing Neural Grove and Its Distributed Performance

Hirotaka Inoue

Department of Electrical Engineering and Information Science, Kure National College of Technology, 2-2-11 Agaminami, Kure-shi, Hiroshima, 737-8506 Japan
[email protected]
Abstract. In this paper, we present the accuracy-improving capability and the parallel efficiency of self-organizing neural groves (SONGs) for classification on a MIMD parallel computer. Self-generating neural networks (SGNNs) were originally proposed for classification or clustering by automatically constructing a self-generating neural tree (SGNT) from the given training data. The SONG is composed of plural SGNTs, each of which is independently generated by shuffling the order of the given training data, and the output of the SONG is decided by voting the outputs of all the SGNTs. We allocate each SGNT to a processor of the MIMD parallel computer. Experimental results show that the classification accuracy increases as the number of processors increases, for all problems.
1 Introduction
Neural networks have been widely used in the field of intelligent information processing, such as classification, clustering, prediction, and recognition. Generally, the network structure and some parameters of these neural networks have to be decided by human experts. It is quite tricky to choose the right network structure suitable for a particular application at hand. Concerning the design of the network structure, the following must be decided: (i) the number of network layers, (ii) the number of neurons in each layer, and (iii) the weights on the connections between consecutive layers. During learning iterations, the weights on the connections of the given network are updated so as to converge to target values, conserving the initially decided static network structure. Consequently, obtaining the right structure of each network is the most important factor in learning and also the most difficult problem in the design of neural networks. In order to avoid these tricky and difficult situations, self-generating neural networks (SGNNs) have attracted attention because of their simplicity of network design [1]. SGNNs are an extension of the self-organizing maps (SOMs) of Kohonen [2] and utilize a competitive learning algorithm which is implemented as a self-generating neural tree (SGNT). The SGNT algorithm was proposed in [3] to generate a neural tree automatically and directly from training data. In our previous study concerning the performance
analysis of the SGNT algorithm [4], we showed that the main characteristic of the SGNT algorithm is its high-speed convergence in computation time, but that it is not always the best algorithm in accuracy compared with other existing feed-forward neural networks such as back-propagation (BP). In order to improve the generalization capability of SGNNs, we proposed ensemble self-generating neural networks (ESGNNs) for classification [5]. ESGNNs apply ensemble averaging [6] to SGNNs and fully utilize the high-speed convergence characteristics of the SGNT algorithm. Although ESGNNs improve the accuracy by using various SGNTs, the computation time and the memory capacity increase in proportion to the number of SGNTs. Therefore, we proposed a novel pruning method for the structure of ESGNNs to reduce the computation time and the memory capacity, and we called this model the self-organizing neural grove (SONG) [7]. Ensemble learning has been studied by many AI and neural network researchers. Breiman proposed bagging predictors to improve the accuracy of CART [8] and investigated bagging performance on CART and other methods for classification and regression problems in [9]. Since ensemble learning is a variance-reduction technique, it is well known that ensemble learning tends to work well for methods with high variance, such as neural networks and tree-based methods. In this paper, we present the accuracy-improving capability and the parallel efficiency of the SONG for classification on a MIMD parallel computer. We apply it to three problems in the UCI repository [10] which are given as benchmarks.
2 Self-Organizing Neural Grove

In this section, we describe how to prune redundant leaves in the SONG. First, we mention the on-line pruning method in the learning of the SGNT. Second, we show the optimization method in constructing the SONG.

2.1 On-Line Pruning of the Self-Generating Neural Tree
SGNT is based on SOM and implemented as a competitive learning. The SGNT can be constructed directly from the given training data without any intervening human effort. The SGNT algorithm is defined as a tree construction problem of how to construct a tree structure from the given data, which consist of multiple attributes, under the condition that the final leaves correspond to the given data. Before we describe the SGNT algorithm, we introduce some notation.

– input data vector: ei ∈ IR^m.
– root, leaf, and node in the SGNT: nj.
– weight vector of nj: wj ∈ IR^m.
– the number of the leaves in nj: cj.
– distance measure: d(ei, wj).
– winner leaf for ei in the SGNT: nwin.
Input: A set of training examples E = {e_i}, i = 1, ..., N.
       A distance measure d(e_i, w_j).
Program Code:
  copy(n_1, e_1);
  for (i = 2, j = 2; i <= N; i++) {
    n_win = choose(e_i, n_1);
    if (leaf(n_win)) {
      copy(n_j, w_win);
      connect(n_j, n_win);
      j++;
    }
    copy(n_j, e_i);
    connect(n_j, n_win);
    j++;
    prune(n_win);
  }
Output: Constructed SGNT by E.

Fig. 1. SGNT algorithm

Table 1. Sub procedures of the SGNT algorithm

Sub procedure         Specification
copy(nj, ei/wwin)     Create nj, copy ei/wwin as wj in nj.
choose(ei, n1)        Decide nwin for ei.
leaf(nwin)            Check whether nwin is a leaf or not.
connect(nj, nwin)     Connect nj as a child leaf of nwin.
prune(nwin)           Prune leaves if the leaves have the same class.
The SGNT algorithm is a hierarchical clustering algorithm. The pseudo C code of the SGNT algorithm is given in Fig. 1, where several sub procedures are used; Table 1 shows the sub procedures of the SGNT algorithm and their specifications. In order to decide the winner leaf nwin in the sub procedure choose(e_i, n_1), competitive learning is used. If an nj includes the nwin as its descendant in the SGNT, the weight wjk (k = 1, 2, ..., m) of the nj is updated as follows:

$$w_{jk} \leftarrow w_{jk} + \frac{1}{c_j}\,(e_{ik} - w_{jk}), \qquad 1 \le k \le m. \qquad (1)$$
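For illustration only, the following C++ fragment sketches the winner search and the update rule (1) of one SGNT insertion step. The Node layout and the helpers are simplified reconstructions of the sub procedures listed in Table 1, not the original implementation; in particular, the exact bookkeeping of cj during insertion is our reading of the algorithm.

#include <cstddef>
#include <vector>

// Simplified SGNT node, a sketch of the structures implied by Fig. 1.
struct Node {
    std::vector<double> w;        // weight vector w_j
    int c = 0;                    // number of leaves below n_j
    int label = -1;               // class label (meaningful at leaves)
    Node* parent = nullptr;
    std::vector<Node*> kids;
    bool leaf() const { return kids.empty(); }
};

// Squared Euclidean distance d(e_i, w_j).
double dist2(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) s += (a[k]-b[k])*(a[k]-b[k]);
    return s;
}

// choose(e_i, n_1): descend from the root, at each level following the
// child whose weight is nearest to e_i (competitive learning).
Node* choose(Node* root, const std::vector<double>& e) {
    Node* n = root;
    while (!n->leaf()) {
        Node* best = n->kids.front();
        for (Node* k : n->kids)
            if (dist2(k->w, e) < dist2(best->w, e)) best = k;
        n = best;
    }
    return n;
}

// Update rule (1): pull the winner and all of its ancestors toward e_i,
// keeping each weight the running average of the leaves below it.
void update_weights(Node* n_win, const std::vector<double>& e) {
    for (Node* n = n_win; n != nullptr; n = n->parent) {
        n->c += 1;
        for (std::size_t k = 0; k < e.size(); ++k)
            n->w[k] += (e[k] - n->w[k]) / n->c;   // w_jk += (e_ik - w_jk)/c_j
    }
}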
After all training data are inserted into the SGNT as leaves, each leaf holds its class label as the output, and the weights of each node are the averages of the corresponding weights of all its leaves. The whole network of the SGNT reflects the given feature space by its topology. For more details concerning how to construct and operate the SGNT, see [3]. Note that, to optimize the structure
Fig. 2. An MCS which is constructed from K SGNTs. The test dataset T is entered into each SGNT, the output oi is computed as the output of the winner leaf for the input data, and the MCS's output is decided by voting the outputs of the K SGNTs.
of the SGNT effectively, we remove the threshold value of the original SGNT algorithm in [3], which controls the number of leaves based on the distance, because of the trade-off between the memory capacity and the classification accuracy. In order to avoid the above problem, we introduce a new pruning method in the sub procedure prune(n_win). We use the class label to prune leaves. For leaves connected to the nwin, if those leaves have the same class label, then the parent node of those leaves is given the class label and those leaves are pruned. In the next sub-section, we describe how to optimize the structure of the SGNT in the SONG to improve the classification accuracy.

2.2 Optimization of the SONG
The SGNT has the capability of high-speed processing. However, the accuracy of the SGNT is inferior to that of conventional approaches, such as nearest neighbor, because the SGNT has no guarantee of reaching the nearest leaf for unknown data. Hence, we construct an MCS by taking the majority of plural SGNTs' outputs to improve the accuracy (Figure 2). Although the accuracy of the SONG is comparable to the accuracy of conventional approaches, the computational cost increases in proportion to the number of SGNTs in the SONG. In particular, the huge memory requirement prevents the use of the SONG for large datasets even with the latest computers. In order to improve the classification accuracy, we propose an optimization method of the SONG for classification. This method has two parts, the merge phase and the evaluation phase. The merge phase is performed as a pruning algorithm to reduce dense leaves (Figure 3); a code sketch follows the two listings below. This phase uses the class information and a threshold value α to decide which subtree's leaves to prune or not. For leaves that have the same parent node, if the proportion of the most common class is greater than or equal to the threshold value α, then these leaves are pruned and the parent node is given the most common class.
1 begin initialize j = the height of the SGNT
2   do for each subtree's leaves in the height j
3     if the ratio of the most common class ≥ α,
4       then merge all leaves to the parent node
5     if all subtrees are traversed in the height j,
6       then j ← j − 1
7   until j = 0
8 end.

Fig. 3. The merge phase

1 begin initialize α = 0.5
2   do for each α
3     evaluate the merge phase with 10-fold CV
4     if the best classification accuracy is obtained,
5       then record the α as the optimal value
6     α ← α + 0.05
7   until α = 1
8 end.

Fig. 4. The evaluation phase
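As an illustration of one merge step in Fig. 3, the following C++ fragment checks a parent whose children are all leaves and prunes them when the majority-class ratio reaches α. It reuses the hypothetical Node structure from the earlier sketch and is our own reconstruction, not the authors' implementation.

#include <algorithm>
#include <map>

// Merge phase (Fig. 3), one parent node: prune sibling leaves whose
// majority-class ratio reaches the threshold alpha.
bool try_merge(Node* parent, double alpha) {
    std::map<int, int> count;                // class label -> #leaf children
    int leaves = 0;
    for (Node* k : parent->kids)
        if (k->leaf()) { ++count[k->label]; ++leaves; }
    if (leaves == 0 || leaves != (int)parent->kids.size()) return false;
    auto best = std::max_element(count.begin(), count.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    if ((double)best->second / leaves < alpha) return false;
    parent->label = best->first;             // parent takes the majority class
    for (Node* k : parent->kids) delete k;   // prune all the leaves
    parent->kids.clear();
    return true;
}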
The optimum threshold values α of the given problems are different from each other. The evaluation phase is performed to choose the best threshold value by introducing 10-fold cross validation (Figure 4).

2.3 Simple Example of the Pruning Method
We show an example of the pruning algorithm in Figure 5. This is a two-dimensional classification problem with two equal circular Gaussian distributions that overlap. The shaded plane is the decision region of class 0 and the other plane is the decision region of class 1 by the SGNT. The dotted line is the ideal decision boundary. The number of training samples is 200 (class 0: 100, class 1: 100) (Figure 5(a)). The unpruned SGNT is given in Figure 5(b). In this case, 200 leaves and 120 nodes are automatically generated by the SGNT algorithm. In this unpruned SGNT, the height is 7 and the number of units is 320. Here, we define the unit to count the sum of the root, nodes, and leaves of the SGNT; the root is the node of height 0. The unit is used as a measure of the memory requirement in the next section. Figure 5(c) shows the pruned SGNT after the merge phase with α = 1. In this case, 159 leaves and 107 nodes are pruned away and 54 units remain. The decision boundary is the same as that of the unpruned SGNT. Figure 5(d) shows the pruned SGNT after the merge phase with α = 0.6. In this case, 182 leaves and 115 nodes are pruned away and only 23 units remain. Moreover, the decision boundary is improved over the unpruned SGNT because this case can reduce the effect of the overlapping classes by pruning the SGNT.
Fig. 5. An example of the SGNT's pruning algorithm: (a) a two-dimensional classification problem with two equal circular Gaussian distributions, (b) the structure of the unpruned SGNT, (c) the structure of the pruned SGNT (α = 1), and (d) the structure of the pruned SGNT (α = 0.6). The shaded plane is the decision region of class 0 by the SGNT and the dotted line shows the ideal decision boundary.
Fig. 6. An example of the MCS's decision boundary (K = 25): (a) α = 1 and (b) α = 0.6. The shaded plane is the decision region of class 0 by the MCS and the dotted line shows the ideal decision boundary.
152
H. Inoue
the number of SGNTs K as 25. The result of Figure 6(b) is a better estimation of the ideal decision region than the result of Figure 6(a). We investigate the optimization method for more complex problems in the next section.
3
Distributed Processing
Because each SGNT of the SONG can train and test independently, the SONG has a possibility of the parallel computation at the training process and the testing process. Hence, we allocate each of SGNTs to each of processors on the MIMD computer. The procedure of the parallelization of the SONG is presented as follows: Step1: In a master processor, read the training set D and the test set T in the disk. Step2: In the master processor, broadcast D and T for all K−1 slave processors. Step3: In all processors, generate the SGNT from D, then test the SGNT using T , and compute the ok independently. Step4: In all processors, each output ok for T is collected in the master processor by all to one communication. Step5: In the master processor, compute o by voting and write to the disk. Because the number of the communications between the master processor and each slave processor is only two times (Step2 and Step4), the parallel efficiency is approximately expected the linear speedup. In our case, all computations are performed on the Intel Paragon (Paragon XP/S15). This is a distributed memory multicomputer, and the architecture is multiple instruction multiple data (MIMD). The Paragon we use has 296 processors. Each processor is Intel i860XP (50MHz). The network topology of the Paragon is adopted the twodimensional mesh.
4
Experimental Results
We allocate each of SGNTs to each of processors on the Paragon, and compute 100 trials for each single/ensemble model. The number of processors (SGNTs) K for the ensemble averaging is changed from 1 to 201 (1,3,5,7,9,15,25,51,101,151 and 201), and the threshold value α is 1 for each SONG. In order to reduce the redundant execution, we repeated 100 trials from Step3 to Step5 in prior section continuously. In order to investigate the parallel performance of the SONG, we select three typical classification problems, which are given as benchmark problems in UCI repository [10]. Next, we describe the brief explanation of these problems. breast-cancer-wisconsin: This problem is a binary classification task for classify a tumor as either benign or malignant based on cell descriptions gathered by a microscopic examination. Input attributes are:
Self-Organizing Neural Grove and Its Distributed Performance
153
– the clump thickness, – the uniformity of cell size, – cell shape, – the amount of magical adhesion, – the frequency of bare nuclei, etc. This problem has 9 attributes, 699 examples. Each attribute consists of continuous real value. ionosphere: This problem is a binary classification task for a radar as either good or bad based on the complex electromagnetic signals. The targets were free electrons in the ionosphere. “Good” radar returns are those showing evidence of some type of structure in the ionosphere. “Bad” returns are those that do not; their signals pass through the ionosphere. This problem has 34 attributes, 351 examples. Each attribute consists of continuous real value. letter-recognition: The objective is to identify each of a large number of blackand-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts), which were then scaled to fit into a range of integer values from 0 through 15. In this paper, we use below defined classification accuracy. classification accuracy =
number of correct . number of test data
(2)
We evaluate the classification accuracy using 10-fold cross-validation [11] for above problems. Figure 7(a), (b), and (c) show the influence of the number of processors on classification accuracy for breast-cancer-wisconsin, ionosphere and letter-recognition problems respectively. Classification accuracies are improved by computing the ensemble averaging of various SGNTs for all problems. Here, each classification accuracies shows the average of 100 trials and its error-bar. It is shown that the classification accuracies are improved by computing the ensemble averaging of various SGNTs for all problems. Especially, the minimum classification accuracies are largely improved for all problems. The improvement ability is obtained from small K most effectively. The classification accuracy of larger than 51 SGNTs is convergence for all problems. Figure 8(a),(b), and (c) show the relation between the number of processors and the execution times for breast-cancer-wisconsin, ionosphere, and letterrecognition problems respectively. The execution times are gradually saturated as the number of processors increase. As the scale of the dataset grows, the proportion of the communication time, i.e. the difference of the total time and the training time + the testing time, for the total time is decrease. This means that this method have an approximately linear speedup for large-scale datasets.
154
H. Inoue 0.98
0.92
breast-cancer-wisconsin
0.97
ionosphere
0.96 0.955 0.95 0.945
Classification accuracy
0.965
letter
0.96
0.9
0.97
Classification accuracy
Classification accuracy
0.975
0.88 0.86 0.84 0.82
0.94
0.95 0.94 0.93 0.92 0.91 0.9 0.89
0.8
0.935 0.93
0.88
0.78 0
50
100 # of processors
150
0.87
200
0
50
(a)
100 # of processors
150
200
0
50
(b)
100 # of processors
150
200
(c)
2.4
230
1.7
2.3
225
1.6 1.5 1.4 1.3 1.2 Total time Training time + Testing time Training time
1.1 1 0
50
100 # of processors
150
Computation time (sec.)
1.8
Computation time (sec.)
Computation time (sec.)
Fig. 7. Influence of the number of processors on classification accuracy for (a) breastcancer-wisconsin, (b) ionosphere, and (c) letter-recognition
Fig. 8. Relation between the number of processors and execution time (seconds) for (a) breast-cancer-wisconsin, (b) ionosphere, and (c) letter-recognition
Consequently, parallel distributed computing using the SONG can obtain higher classification accuracy than the single SGNT by allocating each SGNT to a processor, while maintaining the high-speed processing property of the single SGNT.
5 Conclusions
In this paper, we presented distributed computing with the SONG to obtain a more effective implementation for classification on a MIMD parallel computer. From the experimental results, the following conclusions can be drawn:

– Distributed computing with the SONG can improve the classification accuracy by using various SGNTs which are allocated to processors on the MIMD computer.
– Distributed computing with the SONG can perform a task with high parallel efficiency by allocating each SGNT to a processor on the MIMD computer.

In future work, we will study incremental learning of the SONG for large-scale data mining.
Acknowledgements The authors would like to thank the Information Processing Center in Okayama University of Science for using the Paragon.
References

1. Wen, W.X., Pang, V., Jennings, A.: Self-generating vs. self-organizing, what's different? In: Simpson, P.K. (ed.) Neural Networks Theory, Technology, and Applications. IEEE Technology Update Series, pp. 210–214. IEEE Technical Activities Board, Piscataway (1996)
2. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (1995)
3. Wen, W.X., Jennings, A., Liu, H.: Learning a neural tree. In: The International Joint Conference on Neural Networks, Beijing, China, vol. 2, pp. 751–756, November 3-6 (1992)
4. Inoue, H., Narihisa, H.: Efficiency of self-generating neural networks applied to pattern recognition. Int. J. of Mathematical and Computer Modelling 38(11-13), 1225–1232 (2003)
5. Inoue, H., Narihisa, H.: Improving generalization ability of self-generating neural networks through ensemble averaging. In: Terano, T., Liu, H., Chen, A.L.P. (eds.) PAKDD 2000. LNCS (LNAI), vol. 1805, pp. 177–180. Springer, Heidelberg (2000)
6. Haykin, S.: Neural Networks: A Comprehensive Foundation, ch. 7, 2nd edn. Prentice-Hall, Upper Saddle River (1999)
7. Inoue, H.: Self-organizing neural grove: Efficient multiple classifier system with pruned self-generating neural trees. In: Proc. The 18th International Conference on Artificial Neural Networks, Part I, Prague, Czech Rep., pp. 613–622, September 3-6 (2008)
8. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth, California (1984)
9. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
10. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
11. Stone, M.: Cross-validation: A review. Math. Operationsforsch. Statist., Ser. Statistics 9(1), 127–139 (1978)
A Massively Parallel Hardware for Modular Exponentiations Using the m-ary Method

Marcos Santana Farias¹, Sérgio de Souza Raposo¹, Nadia Nedjah¹, and Luiza de Macedo Mourelle²

¹ Department of Electronics Engineering and Telecommunications
² Department of System Engineering and Computation, Engineering Faculty, State University of Rio de Janeiro, Brazil
Abstract. Most cryptographic systems are based on modular exponentiation, which is performed using successive modular multiplications. One way of improving the throughput of a cryptographic system implementation is to reduce the number of required modular multiplications. Existing methods attempt to reduce this number by partitioning the exponent into constant- or variable-size windows. In this paper, with the purpose of further accelerating the computation of modular exponentiation, a novel concurrent approach is proposed along with a hardware implementation of the concurrent m-ary method. We compare the proposed method to the sequential implementation.

Keywords: Modular exponentiation, m-ary method, concurrency.
1 Introduction

Modular exponentiation is a very important operation for public-key cryptography systems such as RSA [9], an encryption algorithm considered one of the safest. The algorithm encrypts and decrypts information by performing the operation C = T^E mod M, wherein E is called the exponent and the modulus M is chosen as the product of two primes. Note that the larger the prime numbers are, the more secure the process is. There are many efficient algorithms [2,1,6] that permit the computation of a modular power given the necessary data, i.e., T, E and M. Some of these algorithms [1,6] schedule the underlying modular multiplications in parallel in order to improve the processing time. Here, we propose a parallel implementation of the m-ary modular exponentiation method. The m-ary methods for exponentiation [10] may be thought of as a three-step procedure: (a) partitioning the exponent E into w d-bit windows; (b) pre-computing all possible powers of a window; (c) iteratively squaring the partial result and then multiplying it by the power in the next window. The m-ary methods scan the digits of E from the least significant to the most significant digit and group them into partitions of equal length. The most popular method is binary exponentiation, which uses the binary representation of the exponent E = e_{n−1} e_{n−2} ... e_1 e_0, as described in Algorithm 1. A generalization of Algorithm 1 using groups of d bits instead of a single bit, called the m-ary method, is given in Algorithm 2.
Algorithm 1. Binary Method [3]
Require: T, M, E;
Ensure: C = T^E mod M;
  C := T^{e_{n−1}} mod M;
  for i := n − 2 downto 0 do
    C := C² mod M;
    if e_i ≠ 0 then
      C := C × T^{e_i} mod M;
    end if
  end for
  return C;
Algorithm 2. m-ary Method [3]
Require: T, M, E, d = log₂ m;
Ensure: C = T^E mod M;
  for i := 2 to 2^d − 1 do
    Store T^i mod M;
  end for
  C := T^{v_{w−1}} mod M;
  for i := w − 2 downto 0 do
    C := C^{2^d} mod M;
    if v_i ≠ 0 then
      C := C × T^{v_i} mod M;
    end if
  end for
  return C;
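For readers who prefer running code, the following C++ function is a small software model of Algorithm 2. It is our own sketch and not part of the paper (the paper targets hardware): it uses 64-bit integers with __int128 intermediates so the modular products stay exact, which limits it to moduli below 2^64, and it makes no attempt to be constant-time.

#include <cstdint>
#include <vector>

// Exact a*b mod m for m < 2^64, via a 128-bit intermediate.
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)(((unsigned __int128)a * b) % m);
}

// Software model of the m-ary method (Algorithm 2) with d-bit windows.
uint64_t mary_pow(uint64_t T, uint64_t E, uint64_t M, unsigned d) {
    // Pre-computation stage: T^0 .. T^(2^d - 1) mod M.
    std::vector<uint64_t> pw((size_t)1 << d);
    pw[0] = 1 % M;
    for (size_t i = 1; i < pw.size(); ++i) pw[i] = mulmod(pw[i - 1], T % M, M);

    // Split E into w windows v_{w-1} .. v_0, most significant first.
    int bits = 0;
    for (uint64_t e = E; e != 0; e >>= 1) ++bits;
    int w = (bits + (int)d - 1) / (int)d;   // number of d-bit partitions
    std::vector<unsigned> v;
    for (int i = w - 1; i >= 0; --i)
        v.push_back((unsigned)((E >> (unsigned)(i * (int)d)) & ((1u << d) - 1)));

    // Squaring-and-multiplication stage.
    uint64_t C = pw[v.empty() ? 0 : v[0]];
    for (size_t i = 1; i < v.size(); ++i) {
        for (unsigned s = 0; s < d; ++s) C = mulmod(C, C, M);  // C := C^(2^d)
        if (v[i] != 0) C = mulmod(C, pw[v[i]], M);
    }
    return C;
}

With d = 1 the same loop degenerates to the binary method of Algorithm 1.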
There exist many hardware implementations of modular exponentiation [3,8,7]. However, these are all sequential. The solution developed in this paper seeks a parallel implementation of the exponentiation $T^E \bmod M$ using several multipliers. As in the m-ary method [10], the exponent $E$ is divided into $w$ partitions or windows of $d$ bits. The computation can be described as in (1):

$$T^E = \left(T^{p_{w-1}}\right)^{2^{(w-1)d}} \times \cdots \times \left(T^{p_i}\right)^{2^{id}} \times \cdots \times \left(T^{p_1}\right)^{2^{d}} \times T^{p_0}, \qquad (1)$$

wherein exponent $E$ is viewed as in (2):

$$E = p_{w-1} \times 2^{(w-1)d} + \cdots + p_i \times 2^{id} + \cdots + p_1 \times 2^{d} + p_0. \qquad (2)$$
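As a small worked instance of (1)-(2) (ours, not from the paper): take $d = 2$ and $E = 191 = (10\,11\,11\,11)_2$, so $w = 4$ and the windows are $p_3 = 2$, $p_2 = p_1 = p_0 = 3$. Then

$$E = 2\cdot 2^{6} + 3\cdot 2^{4} + 3\cdot 2^{2} + 3, \qquad T^{191} = \left(T^{2}\right)^{2^{6}} \times \left(T^{3}\right)^{2^{4}} \times \left(T^{3}\right)^{2^{2}} \times T^{3}.$$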
From (1), one can envision the computation of all $(T^{p_i})^{2^{id}}$ in parallel once the pre-computation of $T^{p_i} \bmod M$ has been completed. Note that the number of modular multiplications required in the computation of $(T^{p_i})^{2^{id}} \bmod M$ is always one multiplication shorter than that needed in $(T^{p_{i-1}})^{2^{(i-1)d}} \bmod M$ for all $i \ge 1$. Therefore, once the final result corresponding to the computation of $(T^{p_1})^{2^{d}} \bmod M$ has been yielded and, as the result of $T^{p_0}$ is already available (from pre-computation), one can immediately proceed with the computation of the modular multiplication $C = (T^{p_1})^{2^{d}} \times T^{p_0} \bmod M$. Concurrently with that multiplication, the last one in the exponentiation $(T^{p_2})^{2^{2d}} \bmod M$ can occur. Hence, the computation of $C = C \times (T^{p_2})^{2^{2d}} \bmod M$ may occur in the sequel. Similarly, we can infer that, as soon as the power $(T^{p_i})^{2^{id}} \bmod M$ is obtained, one can proceed with the modular multiplication $C \times (T^{p_i})^{2^{id}} \bmod M$, yielding a new partial result, and so forth until the final power is obtained. Note that no extra delays are needed to synchronize the squaring and multiplication steps.

The parallelization of the computation of $T^E \bmod M$ is formally described using the Petri net model of Fig. 1, wherein a transition represents the computation indicated by its label and the places with a token symbolize the availability of the result. The net shows the main stages of the proposed concurrent modular exponentiator. The upper place and transition refer to the pre-computation of the required powers of $T$; hereafter, this is called the pre-computation stage. Ideally, the end of this stage would trigger the start of all squarings (squaring stage), whose end would, in turn, trigger the prescribed modular multiplications (multiplication stage). However, in an implementation with a single-port shared power memory, this would not be the case, as the pre-computed powers are all kept in that memory. Among other improvements, the proposed hardware architecture avoids this kind of delay to further improve the performance of the parallel modular exponentiator.

Fig. 1. Petri net model for the parallelization of the computation of T^E

The rest of this paper is organized as follows: first, in Section 2, we describe the proposed architecture for a modular exponentiator and explain its underlying operation; thereafter, in Section 3, we present some performance results to assess the efficiency of the proposed implementations; last but not least, in Section 4, we draw some conclusions and point out some directions for future work.
2 The Proposed Architecture

Fig. 2 shows the macro-architecture of the proposed modular exponentiator. It includes the power memory (PMEM), a scalable number of modular multipliers (MMULTs), and the main controller (MCTRL), wherein the MMULTs receive their operands from PMEM via the shared data bus (DBUS). Memory PMEM is a shared RAM which stores the repository of the pre-computed powers of T. Each modular multiplier MMULT implements a modular multiplication using the Montgomery algorithm [10]; the modular multiplier details can be found in [5]. Each MMULT operates iteratively in order to provide a given power of one of the pre-computed values stored in PMEM. Controller MCTRL supervises the operation of the MMULTs by controlling several signals, which synchronize the data-path components through the three stages of the modular power computation. The binary-coded exponent is split into w ≥ 2 partitions; each partition comprises d ≥ 2 bits, and the number of MMULTs coincides with the number of partitions. In the following, we describe the three processing stages of the proposed modular exponentiator: pre-computation, squaring, and multiplication.

2.1 Pre-computation

Besides the basic operation that is performed by all modular multipliers MMULTs, the highest-order MMULT(w−1) also pre-computes the 2^d − 1 modular powers of T:
T² mod M, ..., T^{2^d − 1} mod M, and stores them in PMEM. Multiplier MMULT(w−1) performs these pre-computations thanks to a slight adjustment using three extra multiplexers, Mux3, Mux4 and MuxE, and two registers, RegE and Cnt1, as shown in Fig. 2 and Fig. 3. In this stage, MuxE forwards the signal from Cnt1, initially set to the constant 1. This counter always points to the PMEM address wherein the result of the multiplication will be recorded. During the computation of the first modular multiplication, Mux3 switches to T and Mux4 to 1. Then, the multiplexers Mux1 and Mux2 that are associated with MMULT(w−1) forward the Mux3 and Mux4 outputs, respectively. In addition to MCTRL, the MMULTs require local controllers to operate concurrently. During the first multiplication, the MMULT controller signals Mux1 and Mux2 to forward the data coming from the Mux3 and Mux4 outputs. The MMULT controller, through a tri-state buffer, writes data on DBUS when its down-counter reaches 0. At the present stage, this counter stores the value 1, and on every completed multiplication the result is written on DBUS. When the first multiplication has ended, the data (T mod M) is put on DBUS and MCTRL enables PMEM to write (RD = 0) at position 1, provided by Cnt1. Counter Cnt1 is incremented and Mux4 now switches to the data on DBUS. The modular product obtained in each iteration is stored at the proper PMEM address, e.g., T² mod M is stored at address 2, T³ mod M at address 3, and so on. Note that counter Cnt1 has d bits. When
all bits are high, an AND gate (not represented in Fig. 3, for readability reasons only) signals to MCTRL that pre-computation is through, so squaring must commence.
Fig. 2. Macro-architecture of the parallel exponentiator
2.2 Squaring

In this stage, controller MCTRL allows the MMULTs to be fed with the pre-computed powers stored in PMEM. Multiplier MMULTi is provided with the word whose address consists of the d bits of the corresponding partition of exponent E. Exponent E is stored in REGE, which is a shift register shifting d bits at a time. The most significant d bits of REGE are fed into the PMEM address, which allows the exponentiator to retrieve the necessary pre-computed power for each of the MMULTs before squaring starts. After each PMEM read cycle, MCTRL performs d left-shifts to get the next power address in PMEM. The launch of the MMULTs' operation is controlled by a single w-bit wide right-shift register, shown in Fig. 3 as REGw. This register starts with the value 2^{w−1}, i.e., 1 in the most significant bit and 0 in the remaining bits. By right-shifting this register, MCTRL automatically signals the corresponding MMULTi, setting bit i of REGw, to start operating. When REGw becomes zero, MCTRL knows that all MMULTs have been initialized. Now it is time to move to the next stage: multiplication.

2.3 Multiplication

This stage multiplies the partial results obtained by the MMULTs. Multiplier MMULT0 is elected to perform the necessary multiplications; it was chosen to run this task because it becomes idle first. Its controller waits for bit 0 of REGw to be set, so it loads 1 into one of the inputs of MMULT0 and the other with the value on DBUS, which is T mod M, retrieved from PMEM. Afterwards, whenever an MMULT completes squaring and flags this via signal ZeroDCi, MMULT0 uses the DBUS data as its other input. When this is done for all MMULTi, register REGFinal holds the final result of the exponentiation. Each MMULT, as described in Fig. 4, includes a modular multiplier (MM) that implements the Montgomery multiplication algorithm [5], a down-counter, which is used to control the number of times the power read from PMEM is squared, and an independent controller. Fig. 5 presents the state machine of the main controller. The description of each state is presented as follows.
Fig. 3. Main controller (MCTRL) details
– S0: initialize system; if the start signal is asserted then go to S1;
– S1: start MMULT(w−1) for pre-computation; go to S2;
– S2: if MM has finished then go to S3;
– S3: start result write into PMEM; go to S4;
– S4: finish result write into PMEM; go to S5;
– S5: increment counter Cnt1; go to S6;
– S6: if pre-computation complete then go to S12; else go to S7;
– S7: start MMULT(w−1) with Mux4 forwarding data from DBUS;
– S8: if MM has finished then go to S9;
– S9: start result write into PMEM; go to S10;
– S10: finish result write into PMEM; go to S11;
– S11: increment counter Cnt1; go to S6;
– S12: MuxE switches to RegE; Mux3 and Mux4 forward data from DBUS; start MMULT(w−1) to perform squaring;
– S13: read PMEM at the most significant partition;
– S14: left-shift REGE d bits; right-shift REGw one bit;
– S15: start MMULTi corresponding to bit i set in REGw;
– S16: if all MMULTs have been initiated then go to S17; else go to S13;
Fig. 4. MMULT details
Fig. 5. State machine of the main controller
Fig. 6. Simulation result
– S17: if MMULT(w−1) has finished then go to S18;
– S18: start MMULT0; if it has finished then go to S19;
– S19: go to S0;
3 Performance Results

A parametrized VHDL [4] code was written and simulated on ModelSim XE III 6.4 [11]. Fig. 6 shows that the modular exponentiator takes about 40 clock cycles to yield one result. Signal ExpStart in the high state flags the start of an exponentiation. Signal ExpDone is set when a run is through, and the power T^E is available in Final result. Table 1 presents a summary of the resource requirements of a Spartan-3E FPGA [11] for three different exponent sizes, compared to the sequential hardware, which uses a single modular multiplier to perform all the multiplications required to yield the final exponentiation result [3].

Table 1. Estimated area for the proposed method vs. the sequential method [3]

                      Parallel hardware       Sequential hardware
Size of E (bits)        4     8    16           4     8    16
Slice Registers       113   205   373          47   101   192
Slice LUT's           225   391   703          89   166   395
Used LUT-FF           103   193   361          34    71   153
Response time             40 cycles                289 cycles
It is clear that the sequential implementation, as expected, uses far fewer hardware resources than the proposed implementation. However, the gain in computation time suggests that the latter is about 7 times more efficient than the former. The use of the proposed hardware within security chips allows a much higher throughput, which is a prerequisite in most of today's cryptosystem implementations.
4 Conclusion

This paper describes a massively parallel implementation of modular exponentiation. It is a practical demonstration that parallelized modular exponentiation remarkably speeds up modular exponentiations, which are the critical stage of the encryption/decryption-related computation in most public-key cryptosystems. Performance depends on the parameters w, d, and the number of bits in E. We compared the performance of the proposed implementation to that of [3], which is a sequential implementation of the m-ary method. As one can expect, the sequential implementation uses far fewer hardware resources than the proposed implementation; however, the gain in computation time suggests that the latter is about 7 times more efficient than the former. Wider windows (exponent shares) extend the pre-computation stage, require a larger PMEM, cut down the number of MMULTs, reduce accesses to DBUS, and demand fewer iterations from the highest-order MMULT. In contrast, narrower windows reduce the pre-computation step and require less memory to store the pre-computed powers, but need more modular multipliers. The best configuration should find a balance between hardware area and response time. Enhancements to decrease the clock cycles per run through a better management of data bus accesses, bypassing the multiplication when the handled window is zero, and multiple feeds of the MMULTs after the pre-computation stage using a multi-port memory are some possible issues to be developed in an improved future implementation.
References

1. Wu, C.L., Lou, D.C., Lai, J.C., Chang, T.J.: Fast parallel exponentiation algorithm for RSA public-key cryptosystem. Informatica 17, 445–462 (2006)
2. Knuth, D.E.: The Art of Computer Programming: Seminumerical Algorithms, 2nd edn. Addison-Wesley, Reading (1981)
3. Mourelle, L.M., Nedjah, N.: Fast reconfigurable hardware for the m-ary modular exponentiation. In: Proc. Symposium on Digital System Design: Architectures, Methods and Tools, pp. 516–523. IEEE Computer Society, Los Alamitos (2004)
4. Navabi, Z.: VHDL – Analysis and Modeling of Digital Systems, 2nd edn. McGraw-Hill, New York (1998)
5. Nedjah, N., Mourelle, L.M.: Two hardware implementations for the Montgomery multiplication: sequential vs. parallel. In: Proc. of the 15th SBCCI, pp. 3–8. IEEE Computer Society, Los Alamitos (2002)
6. Nedjah, N., Mourelle, L.M.: Efficient parallel modular exponentiation algorithm. In: Yakhno, T. (ed.) ADVIS 2002. LNCS, vol. 2457, pp. 405–414. Springer, Heidelberg (2002)
7. Nedjah, N., Mourelle, L.M.: Efficient hardware for modular exponentiation using the sliding-window method with variable-length partitioning. In: Proc. ICYCS, pp. 1980–1985 (2008)
8. Nedjah, N., Mourelle, L.M.: High-performance hardware of the sliding-window method for parallel computation of modular exponentiations. International Journal of Parallel Programming 37(6), 537–555 (2009)
9. Rivest, R.L., Shamir, A., Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM 21(2), 120–126 (1978)
10. Montgomery, P.L.: Modular multiplication without trial division. Mathematics of Computation 44, 519–521 (1985)
11. Xilinx, Inc.: Foundation Series Software (2009), http://www.xilinx.com
Emulation of Object-Based Storage Devices by a Virtual Machine

Yi-Chiun Fang, Chien-Kai Tseng, and Yarsun Hsu

Department of Electrical Engineering, National Tsing Hua University, Hsinchu, 30013, Taiwan
[email protected], [email protected], [email protected]
Abstract. This work presents a new emulation framework for object-based storage devices (OSDs). The framework integrates an object-based storage emulator with a virtual machine. It resolves the limitations of host CPU resource consumption and network communication overhead found in other frameworks. The storage emulator in the virtual machine runs independently of the emulated CPU, and no network messages are required for communication. The OSD model is compared to a traditional file system on top of a block device. The benchmark results demonstrate an I/O performance improvement of 436% in the best case. The results suggest that the object-based storage architecture is an ideal choice for throughput machines, especially under heavy workloads.
1 Introduction
This work sets out to establish a new framework for software OSD emulations. The virtual storage emulation framework utilizes a virtual machine to implement the object-based storage system model. QEMU, chosen as the virtual machine, is an ideal choice to perform early emulations of new system architectures. The OSD model in this work avoids the common problems with conventional storage emulation frameworks. Early emulation before a real product is developed is a challenging task. The conventional frameworks for software storage emulation use either local or remote storage emulation [1]. Local storage emulation uses a device driver to redirect the requests to the user-space storage emulator. This framework provides simple implementation. One of its drawbacks is that the emulator consumes the resources of the host CPU. The other drawback is the context switch overhead between the driver and the emulator. Remote storage emulation runs the emulator on another computer and connects the computers by a network. This separates the emulator from the host CPU but introduces significant network overhead and a thick layer between the emulator and its underlying operating system. The overhead introduced by both frameworks decreases the accuracy of the system analysis.
This work is supported by the National Science Council (NSC) of Taiwan under grant 96-2221-E-007-131-MY3.
Virtual storage emulation utilizes the open-source virtual machine QEMU. Its internal design makes it suitable for system architecture explorations. The emulated devices function independently of the emulated CPU, and network messages are not required for the communication. Operating systems can be run inside QEMU to provide a realistic workload and a more in-depth analysis of the system.
2 Related Work
Other researchers have analyzed and implemented the object-based interface. Ceph [2] uses an object-interface file system in its object storage cluster. The prototype object store Antara [3] uses its own network protocol to provide its functionality. Lustre [4] uses an object store as its storage infrastructure for improved scalability. An implementation of the Parallel Virtual File System (PVFS) integrated with a software OSD emulator is presented in a project by the Ohio Supercomputer Center [5]. All of the OSDs in the research mentioned above communicate through network protocols, which introduces significant communication overhead. These projects also focus more on the performance of the overall storage architecture than on the analysis of the OSD itself.

Research has also been conducted on software storage emulation. A timing-accurate storage emulator is presented by the Parallel Data Lab at Carnegie Mellon University [1]. The emulator allows experiments to be performed with not-yet-existing storage components in the context of real systems executing real applications. An instructional disk drive simulator with statistical disk models is presented in [6]; it retains simplicity while providing timing statistics similar to those of real disk drives. These emulators are all based on the traditional block device and do not provide an object interface. They are implemented as standalone subsystems instead of emulated devices in a virtual machine, so they either consume the resources of their host CPUs or incur network communication overhead. Other researchers also use QEMU for the emulation of system behaviors and communication protocols. The Virtual 802.11 Fuzzing Project [7] utilizes QEMU to provide a framework to assess wireless communication software inside a virtual environment; it focuses on the communication system instead of the storage system.
3 Implementation Details
Most of the implementations are done in the SCSI subsystem and I/O scheduler in QEMU. QEMU consists of several other subsystems besides the CPU emulator to perform full-system emulation. An emulated device emulates the behavior of a device and calls its attached generic device for access to the underlying host device. The generic device subsystem provides an abstraction of the host devices and is capable of accessing them.
The overall scheme of the virtual storage emulation is depicted in Fig. 1. The Linux kernel is used as the operating system kernel on top of the emulated CPU. The sg (SCSI generic) driver and the sym53c8xx_2 driver are used as the SCSI upper-level and lower-level drivers inside the kernel. SCSI commands are passed from the sym53c8xx_2 driver to the modified LSI53C895A SCSI controller emulated in QEMU. The modified LSI53C895A SCSI controller attaches a SCSI OSD Disk in the emulated device layer of QEMU. The SCSI OSD Disk is implemented following the SCSI-3 standards, with minor modifications and extensions to the SPC-2 standard [8]. The block-osd generic device is used as the object interface for the emulated devices to access the underlying software storage emulator. The software storage emulator, EBOFS, is a part of the Ceph network file system [2]. EBOFS exports an object interface to its users, and manages block layouts and object caching internally. It has access to an underlying disk in the host computer.
Fig. 1. The overall scheme of the software OSD emulation
3.1 QEMU SCSI OSD Disk
The QEMU SCSI OSD Disk is implemented based on the emulated SCSI disk. The disk exports methods to its SCSI controller, and the controller calls the registered methods when it needs to process a SCSI request. The QEMU SCSI OSD Disk is responsible for parsing the SCSI command and preparing the request for the underlying EBOFS. The actual object data transfer is done in EBOFS.

The SCSI OSD Disk uses the SPC-2 commands with some extensions. The original SCSI-3 command set for SCSI disks only uses fixed-length CDBs. The OSD draft established by the T10 Technical Committee uses the variable-length CDB format for its commands; this format is also supported in the SPC-2 standard [8]. All OSD CDBs are 200 bytes long even though they use the variable-length CDB format. The fields in the CDB are organized so that the same field is in the same location in all OSD CDBs. The adapted service action specific fields exclude the fields in the original CDB unrelated to this work, such as security parameters.

The send_command method is responsible for parsing the SCSI command and preparing the internal request. Its internal request structure includes the command tag, the parameters needed for the OSD request, the total data transfer length, an internal buffer for the data transfer, the request status, and a callback structure used by the generic device layer. This method sets up the request structure and returns the length of the data expected by the command: a positive number for data transfers from the device, a negative number for transfers to the device, and zero if the command does not require any data transfer.

The read_data and write_data methods are responsible for issuing read and write requests to the underlying generic device for I/O scheduling. The SCSI OSD Disk uses an internal buffer of 128 KB for each request; data transfers with transfer lengths larger than 128 KB are split into multiple transfers. The request structure previously set up by the send_command method is located using the command tag. The request is then sent to the underlying generic device block-osd, where it is put into an internal queue for scheduling and completed asynchronously. The registered callback function is called by the generic device layer after the completion of the request. The callback function determines whether the request needs more transfers and notifies the SCSI controller of the finished transfer. The SCSI controller will issue more read or write transfers with the same command tag if they are required.
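The return-length convention of send_command can be summarized with a small illustrative model. The sketch below is written in Python for brevity; the real implementation is C code inside QEMU, and the request fields shown here are simplified assumptions rather than the emulator's actual structures.

```python
# Illustrative model of the send_command return convention
# (the real implementation is C code inside QEMU).

def send_command(requests, tag, cdb):
    """Parse a simplified CDB and return the expected data length:
    positive = transfer from the device, negative = transfer to the
    device, zero = no data phase."""
    opcode, direction, length = cdb  # stand-in for a parsed 200-byte OSD CDB
    requests[tag] = {
        "opcode": opcode,
        "remaining": length,
        "buf": bytearray(128 * 1024),  # 128 KB internal buffer per request
    }
    if direction == "from_device":    # e.g. a READ command
        return length
    if direction == "to_device":      # e.g. a WRITE command
        return -length
    return 0                           # e.g. FLUSH: no data transfer

requests = {}
assert send_command(requests, 1, ("read", "from_device", 4096)) == 4096
assert send_command(requests, 2, ("write", "to_device", 4096)) == -4096
```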
3.2 QEMU OSD Generic Device
The QEMU OSD Generic Device block-osd is implemented based on the block-raw generic device. It exports its methods to, and accepts requests from, the emulated devices. It also communicates with the underlying EBOFS when an OSD request is scheduled. The generic device interface in QEMU is extended with a new set of methods to adapt to the object interface; only the block-osd generic device implements the new methods.
Fig. 2. The overall sketch of the asynchronous request processing
The overall sketch of the asynchronous request processing is illustrated in Fig. 2. The block-osd generic device uses an internal queue for asynchronous reads and writes. It initializes the asynchronous I/O scheduler during system initialization. The scheduler initialization function sets up a pipe for completion tokens and registers the signal SIGUSR2 with its handler; the signal is fired when an OSD request is completed. The QEMU SCSI OSD Disk issues OSD requests by calling the exported methods of the generic device interface. The methods receive as parameters an OSD request set up by the emulated device, the address of the internal buffer in the emulated device, and a callback function. An asynchronous request structure is initialized, set up and inserted into the asynchronous request queue. The asynchronous I/O scheduler is awakened, and the issuing call returns with the address of the asynchronous request structure. The scheduler fetches an asynchronous request from the request queue, marks it as active, and calls the exported methods of EBOFS for the data transfer. Data is read from or written to the buffer address given by the issuing emulated device, based on the given object ID, offset and transfer length. The scheduler records the return value of the EBOFS call in the asynchronous request structure and fires the signal SIGUSR2 after the transfer completes.
On catching the signal, the handler writes a token byte to the pipe to indicate a request completion. The asynchronous I/O scheduler keeps fetching and processing requests until the request queue becomes empty. An I/O handling function is called every time the QEMU CPU emulator finishes one iteration; this function searches for the I/O handlers and calls the corresponding handling functions. The asynchronous I/O handling function reads the tokens from the pipe and searches for the completed asynchronous requests. The handling function calls the callback function registered in each completed asynchronous request and frees the request. The callback function is registered by the emulated SCSI OSD Disk, and the emulated device is responsible for handling the callback and performing further actions.
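The completion-notification path (signal handler, pipe token, main-loop callback) follows the classic self-pipe pattern. Below is a minimal, self-contained Python sketch of that pattern under simplified assumptions; the names such as complete tokens and io_handling_function are illustrative, not QEMU's own identifiers.

```python
import os, signal

# Self-pipe pattern: a signal handler may only do async-signal-safe work,
# so it writes one token byte; the main loop later drains the tokens and
# runs the registered callbacks (illustrative structures, not QEMU's).
rd, wr = os.pipe()
completed = []          # request ids finished by the scheduler before signaling
callbacks = {}          # request id -> callback registered by the device

def on_sigusr2(signum, frame):
    os.write(wr, b"x")  # async-signal-safe completion notification

signal.signal(signal.SIGUSR2, on_sigusr2)

def io_handling_function():
    """Called once per main-loop iteration: drain tokens, fire callbacks."""
    os.set_blocking(rd, False)
    try:
        tokens = os.read(rd, 4096)
    except BlockingIOError:
        return              # nothing completed since the last iteration
    for _ in tokens:
        req_id = completed.pop(0)
        callbacks.pop(req_id)(req_id)   # notify the emulated device
```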
4 Performance Evaluation
The performance of the proposed SCSI OSD model is compared to that of the traditional file interface. The ext2 file system is used for the benchmarks because of its popularity, speed and support for direct I/O. The computer system developed in this work lies within the QEMU virtual machine; the emulated SCSI OSD Disk can only be seen and accessed by the operating system running on top of the QEMU CPU emulator, and the benchmark programs used for the evaluations are all run by the operating system inside QEMU. Debian [9] 5.0.1 with Linux kernel [10] official release 2.6.26.2 is used as the operating system running in the virtual machine. The Linux kernel is applied with the modifications described in Section 4.3. The host computer is an IBM eServer with a 3.40 GHz Intel Pentium 4 CPU, 3 GB DDR133 RAM, and a 70 GB IBM-ESXS VPR073C3ETS10FN SCSI disk; the SCSI disk is attached to QEMU. Ubuntu [11] 7.04 with Linux kernel 2.6.20-16 is used as the host operating system.

Bonnie++ [12] is chosen as the benchmark program. It is a benchmark suite that performs a number of simple tests on hard drive and file system performance, and it is widely used for file system performance evaluation, for example in the project on Virtual Storage Allocation [13]. Bonnie++ is originally designed to use the file interface; SCSI commands need to be prepared and processed using the SG_IO ioctl() interface in order to access the sg device. Code segments from the Linux sg3_utils package [14] are adapted into Bonnie++, enabling it to access the SCSI OSD Disk through the sg interface.

The page cache subsystem in the Linux kernel caches the inodes, directory entries, and data recently accessed through the VFS interface. This subsystem cannot be completely eliminated from the underlying file system. The SCSI OSD Disk, implemented as a SCSI device, has no caching support in the operating system. This makes the evaluation unfair to the SCSI OSD Disk, since the effect of offloading the computation efforts of the storage component cannot be observed. Direct I/O and cache eviction are therefore used in the ext2 benchmark to suppress the effects of the page cache subsystem. Direct I/O allows the VFS layer to minimize cache effects of the I/O to and from the file. Cache eviction is accomplished by performing the posix_fadvise() system call with the POSIX_FADV_DONTNEED
option and by writing a 3 to the file /proc/sys/vm/drop_caches. The system call fsync() is required before calling posix_fadvise() to guarantee the effect. The time spent in cache evictions is not taken into account in the evaluations.
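The eviction sequence just described can be sketched in a few lines. The following Python 3 sketch is an illustration of the mechanism (writing to drop_caches requires root); it is not the authors' benchmark code.

```python
import os

def evict_cached_pages(path):
    """Flush and drop cached pages for one file, then drop the global
    page cache, dentries and inodes (the value 3 selects all of them)."""
    fd = os.open(path, os.O_RDWR)
    os.fsync(fd)  # flush dirty pages first so fadvise can evict them
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # length 0 = whole file
    os.close(fd)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")
```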
4.1 Single-File Performance
Three types of evaluation are performed in the single-file evaluation. The Sequential Write test is run at the beginning of the evaluation; it writes to a new file or object, sequentially and one block at a time, until the given total length is reached. The length of the data block is given as a program argument. The Rewrite test then reads a block from the file or object, changes a byte in the block, and overwrites the block back to the previous location; the rewriting is done sequentially until all the blocks are processed. The Sequential Read test starts after the Rewrite test; it reads from the file or object, sequentially and one block at a time, until the total length is reached. The file name and object ID are randomly generated. The Bonnie++ benchmark suggests using a total data length of at least twice the size of the main memory. This evaluation sets the total data length to 300 MB, since QEMU uses 128 MB of RAM from the host computer as its main memory.

Four different configurations are evaluated. The SCSI OSD Disk is evaluated with and without direct I/O (the SG_FLAG_DIRECT_IO flag); direct I/O for the sg device only supports transfer lengths of up to 256 KB per system call. The ext2 file system is evaluated with direct I/O and cache eviction, and without either of the two.

The result of the Sequential Write test is shown in Fig. 3a. The SCSI OSD Disk, without the file system management, outperforms the ext2 file system with cache suppression at all block sizes. The ext2 file system and the SCSI OSD Disk both benefit from the larger DMA size with larger data blocks. The increasing overhead of space management in ext2, however, results in a much slower throughput growth rate, so the throughput difference between the SCSI OSD Disk and the cache-suppressed ext2 file system increases as the block size increases. The overhead of the file system allocator grows with the block size: the allocator spends more time allocating space for larger blocks, and space fragmentation is more likely to occur. The SCSI OSD Disk shows a performance improvement of 330% over the cache-suppressed ext2 file system with a block size of 1 MB.

The direct I/O feature in the sg driver avoids extra kernel buffer copies, but it introduces significant per-command overhead; the feature is a performance win only for SCSI commands with large data payloads. Direct I/O benefits from avoiding buffer copies with block sizes larger than 32 KB. With a block size of 256 KB, the performance is improved by 30% compared to disabling the feature, and by 436% compared to the cache-suppressed ext2 file system.

The normal ext2 file system, with the benefits of the page cache subsystem, outperforms the rest of the configurations with block sizes smaller than 32 KB. The SCSI OSD Disk still outperforms the normal ext2 file system with block sizes larger than 32 KB, and the throughput difference increases as the block size increases. The throughput of the normal ext2 file system is relatively stable with respect to the block size; it is 14% smaller, however, than its cache-suppressed counterpart when the block size reaches 1 MB. This demonstrates the drawback of memory caches with large data transfers: the overhead of preparing buffers and doing buffer copies degrades the overall performance.

The results of the Rewrite and Sequential Read tests are shown in Fig. 4 and Fig. 3b, respectively. The results are similar to the Sequential Write test, with the exception of the normal ext2 file system performing much better in the Sequential Read test; this is a result of the caching effect in the page cache subsystem and prefetching in the file system. The normal ext2 file system outperforms the rest of the configurations with block sizes smaller than 64 KB, and outperforms its cache-suppressed counterpart in all cases. The SCSI OSD Disk, compared with the cache-suppressed ext2 file system, shows a performance improvement of 13% in the Rewrite test and of 172% in the Sequential Read test with a block size of 1 MB.
Fig. 3. The result of the Sequential Read/Write test: (a) Write, (b) Read
4.2 Multiple-File Performance
Three types of evaluation are performed in the multiple-file evaluation. Each type of evaluation includes a small-file and a large-file test; the small-file tests use 8 KB data blocks, and the large-file tests use 1 MB data blocks. The Create test creates the given number of files or objects and writes a block after each creation. The Read test reads the block from each file or object after the creation. The Delete test deletes all of the created files or objects. The throughput in this evaluation is defined as operations per second. The number of manipulated files or objects ranges from 1 K to 256 K.

Four different configurations are evaluated: sequential and random configurations are used for both the SCSI OSD Disk and the ext2 file system. The sequential configuration manipulates the files or objects in order with respect to the data structure storing the file names or object IDs; the random configuration manipulates them in a random order. The SCSI OSD Disk disables the direct I/O feature in small-file tests,
Fig. 4. The result of the Rewrite test
and enables it in large-file tests to accelerate the driver processing. The ext2 file system uses cache suppression to provide a fair comparison to the SCSI OSD Disk.

The results of the Create test are shown in Fig. 5a and Fig. 5b. The effort of allocation increases as the number of manipulated files increases, resulting in throughput degradation. The sequential and random configurations show similar results due to the early returns of write operations: write operations return after writing data into the memory cache, and data is written back to the disk asynchronously. All of the sequential configurations slightly outperform their random counterparts due to the increased overhead of allocation in the random configurations; the random allocation overhead of the SCSI OSD Disk decreases its performance by 10% with 256 K files. The SCSI OSD Disk underperforms the ext2 file system in all cases with small files, but outperforms it in all cases with large files. This is the result of the file system performing in-memory storage management and of the larger overhead the sg driver introduces. In-memory storage management shows its advantage with small I/O transfer sizes: the effort of allocation is relatively small, and the operation returns after the memory copy from the request to the page cache is completed. The generic design and blocking wait of the sg driver increase the driver overhead of the SCSI OSD Disk, widening the performance gap between the two models. With large transfer sizes, however, the in-memory storage management becomes a burden for the system; the overhead of buffer management degrades the performance, resulting in the SCSI OSD Disk outperforming the ext2 file system in all cases.

The results of the Read test are shown in Fig. 6a and Fig. 6b. The sequential small-file reads in the ext2 file system benefit from the in-memory storage management and outperform the rest of the small-file configurations in all cases; the sequential reads utilize the underlying disk well by performing reads from neighboring sectors. The random small-file reads in the ext2 file system underperform the rest of the configurations in all cases, since random reads do not have the benefit of neighboring sector reads and spend more time resolving inodes. The SCSI OSD Disk, without the in-memory storage management, outperforms the ext2 file system in all large-file read cases. Its sequential read shows a performance
Fig. 5. The result of the Create test: (a) small-file, (b) large-file
Fig. 6. The result of the Read test: (a) small-file, (b) large-file
improvement of 232% over the ext2 file system with 4 K files. Its random reads underperform their sequential counterparts due to more time spent locating data on the disk. The throughput of sequential reads in the SCSI OSD Disk increases as the number of objects increases in some cases; this results from the underlying I/O scheduler in EBOFS, which uses the elevator algorithm to search for pending requests and combines neighboring sector reads into one large read, thereby reducing the read overhead. The allocation becomes more fragmented as the number of objects keeps increasing, and the throughput drops due to more effort in resolving data locations and more read overhead.

The results of the Delete test are shown in Fig. 7a and Fig. 7b. Deletion in the SCSI OSD Disk requires the commands to be sent to the device, whereas deletion in file systems only requires marking the corresponding in-memory inode. Sequential deletes in the ext2 file system outperform the rest of the configurations in all cases; this results from less effort in resolving inodes and the benefit of in-memory inode processing. The effort of resolving inodes with random deletes
increases dramatically as the number of files increases. Random small-file deletes with more than 16 K files underperform the random small-object deletes. The SCSI OSD Disk outperforms the ext2 file system by 169% with 64 K small-object random deletes.
Fig. 7. The result of the Delete test: (a) small-file, (b) large-file
5 Conclusion
This work offers a new framework for software OSD emulation. It provides a new model for the object-based storage system by utilizing a virtual machine, and it addresses the drawbacks of other frameworks: the emulator does not consume the resources of the host CPU, and no network overhead exists. This framework allows more in-depth and accurate analyses of early system emulations at a reasonable speed. An object-based storage emulator is integrated into QEMU to implement the model, which allows realistic workloads to be applied to the storage system.

The results of the OSD model demonstrate the advantages of an object-based interface. The offloading of storage management into the device dramatically increases the throughput of the storage system. The throughput of the OSD model outperforms that of a local file system in most cases, especially under heavy workloads. The results suggest that the object-based storage architecture is an ideal choice for throughput machines.
References 1. Griffin, J.L., Schindler, J., Schlosser, S.W., Bucy, J.S., Ganger, G.R.: Timing-Accurate Storage Emulation. In: Proceedings of the Conference on File and Storage Technologies, January 2002, pp. 75–88 (2002) 2. Weil, S.A., Brandt, S.A., Miller, E.L., Long, D.D.E., Maltzahn, C.: Ceph: A Scalable, High-Performance Distributed File System. In: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pp. 307–320 (2006)
3. Azagury, A., et al.: Towards an Object Store. In: Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, pp. 165–176 (2003) 4. Schwan, P.: Lustre: Building a File System for 1,000-node Clusters. In: Proceedings of the Linux Symposium, July 2003, pp. 380–386 (2003) 5. Devulapalli, A., Dalessandro, D., Wyckoff, P., Ali, N., Sadayappan, P.: Integrating Parallel File Systems with Object-Based Storage Devices. In: Proceedings of the 2007 ACM/IEEE conference on Supercomputing, vol. 00(27) (2007) 6. DeRosa, P., Shen, K., Stewart, C., Pearson, J.: Realism and Simplicity: Disk Simulation for Instructional OS Performance Evaluation. In: Proceedings of the 37th SIGCSE technical symposium on Computer Science Education, pp. 308–312 (2006) 7. Virtual 802.11 Fuzzing, http://www.iseclab.org/projects/vifuzz/ 8. SCSI Primary Commands - 2 (SPC-2), T10 Project 1236-D (July 2001) 9. Debian - The Universal Operating System, http://www.debian.org/ 10. The Linux Kernel Archives, http://www.kernel.org/ 11. Ubuntu, http://www.ubuntu.com/ 12. Bonnie++, http://www.coker.com.au/bonnie++/ 13. Kang, S., Reddy, A.L.: An Approach to Virtual Allocation in Storage Systems. ACM Transactions on Storage 2(4), 371–399 (2006) 14. The Linux sg3 utils package, http://sg.danny.cz/sg/sg3_utils.html
Balanced Multi-process Parallel Algorithm for Chemical Compound Inference with Given Path Frequencies∗ Jiayi Zhou1, Kun-Ming Yu2,∗∗, Chun Yuan Lin3, Kuei-Chung Shih1, and Chuan Yi Tang1 1
Department of Computer Science, National Tsing Hua University, Hsinchu 300, Taiwan [email protected], [email protected], [email protected] 2 Department of Computer Science and Information Engineering, Chung Hua University, Hsinchu 300, Taiwan [email protected] 3 Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan 333, Taiwan [email protected]
Abstract. Enumerating chemical compounds with given path frequencies is a fundamental procedure in Chemo- and Bio-informatics, with applications that include structure determination and novel molecular development. The problem has been proven to be NP-hard. Many methods have been proposed to solve this problem, but most of them are heuristic algorithms. Fujiwara et al. propose a sequential branch-and-bound algorithm; although it finds all solutions and avoids exhaustive searching, the computation time still increases significantly as the number of atoms increases. Hence, in this paper, a parallel algorithm is presented for solving this problem. The experimental results showed that computation time was reduced as more processes were launched. Moreover, the speed-up ratio for most of the test cases was satisfactory and, furthermore, the algorithm showed potential for use in drug design. Keywords: Branch-and-bound algorithm, load-balancing, chemical compound inference, drug design.
1 Introduction

∗ This work is partially supported by the National Science Council (NSC98-2221-E-216-023 and NSC97-2221-E-216-020). ∗∗ Corresponding author.

The enumeration of chemical compounds that have the same characteristics is one of the fundamental issues in Chemo- and Bio-informatics. Its applications include structure determination using mass spectra [1-2], reconstructing molecular structures with given signatures [3-4], classification of compounds [5], etc. In an effort to improve on existing algorithms, many studies have proposed different ways of dealing with the
enumeration under constraints for other purposes, such as the virtual exploration of the chemical universe [6-7]. However, none can guarantee that all solutions will be found, since they are based on heuristic algorithms and require additional operations to avoid generating isomorphic results.

The Kernel Methods (KMs) approach maps the data in the input space into a high-dimensional feature space. The data are computed as points in the feature space, where each coordinate represents one feature of the data. In applying KMs to chemical compounds, all compounds are mapped to feature vectors in the feature space; the definition of the feature vector is widely based on frequencies of labeled paths [8] or frequencies of small fragments [5]. In this study, the pre-image problem was addressed by enumerating all chemical compounds with the same path frequency. The desired object is computed as a point in the feature space, and the point is then mapped back to the input space; this point is called the pre-image. A given point in the feature space (a path frequency) is thus mapped back to a point in the input space (a chemical compound). Let ψ be the mapping function; the pre-image is defined as follows: given a point y in the feature space, find all x in the input space such that y = ψ(x). Therefore, chemical compound inference (CCI) is defined as follows: given a target compound c and its computed path frequency ψ(c), infer all compounds c_1, …, c_n such that ψ(c_i) = ψ(c) for i = 1, …, n. Since the solution of CCI can be applied to new chemical compounds with the same path frequencies, it may be useful in drug design.

Akutsu and Fukagawa have proved that enumerating chemical compounds with given constraints is an NP-hard problem [9]. The branch-and-bound algorithm is generally used to solve a wide variety of NP-hard problems [10], such as Traveling Salesman, Knapsack and Vertex Cover, while avoiding exhaustive searches. For the CCI problem, Fujiwara et al. have proposed an efficient branch-and-bound algorithm [11]. The branching and bounding strategies are based on the path frequency and the valence constraints of atoms, respectively, to avoid the generation of invalid trees and thus reduce the search space. Although this algorithm outperforms the exhaustive search algorithms, the computation time increases significantly when the number of atoms in a compound grows.

A balanced multi-process parallel branch-and-bound algorithm for CCI (BMPBB-CCI) was designed in this study. In BMPBB-CCI, the search space is dynamically divided into p subspaces for p processors. Since each subspace is independent, it can be processed by one processor with the sequential branch-and-bound algorithm; this minimizes the parallel communication cost between processors, since only local communication is required. Moreover, two types of queue, a global queue (GQ) and a local queue (LQ), are used for load balancing during branching. A hashable, serializable data structure was used to store the path frequencies instead of a Trie structure, because it can be transferred between processors as a plain string.

The development trend in computers is one of multi-core processors. Adding more cores to a computer makes it faster, but it also leads to difficulties in designing programs. Although OpenMP [12] is a standard and easy-to-use multi-threading library that exploits the performance of multi-core processors, it is not flexible enough for
delicate operations or for designing complicated parallel algorithms. Therefore, BMPBB-CCI was implemented using a multi-process architecture (MPA). Unlike a multi-thread architecture, for long-running programs MPA gains the memory protection and access control benefits of the operating system. In addition, when a process crashes it does not affect the remaining running processes. Moreover, the design can easily be extended to distributed memory multi-processor systems, e.g. a multi-node cluster. The experimental results showed that BMPBB-CCI found the solutions for chemical compounds in a short time. In addition, it also achieved a satisfactory speed-up ratio.
2 Preliminaries

Unlike a simple graph, a graph that allows multiple edges is called a multigraph. A multigraph without self-loops and cycles is called a multitree. Let Σ denote a label set in which each label represents the symbol of an atom, e.g. Σ = {C, O, H}. A Σ-labeled multitree can be denoted as T = (V, E), with a vertex set V and an edge set E. A valence function, val : Σ → Z⁺, gives the maximum number of bonds an atom can hold, e.g. val(C) = 4 and val(O) = 2. For any vertex v in T, the valence of v is val(l(v)), where l is the function returning the label ∈ Σ of vertex v. A chemical compound can be treated as a (Σ, val)-labeled multitree, where the number of edges between two vertices (atoms) represents the number of chemical bonds between them.

The path frequencies of a (Σ, val)-labeled multitree are defined as follows. Let P = (v_0, …, v_s) be a path in multitree T and l(P) = (l(v_0), …, l(v_s)) be the label sequence of the path. For a label sequence t, let occ(t, T) denote the number of paths of t in multitree T, where a multi-edge is treated as a single edge. Σ^k denotes the set of sequences of k labels, and Σ^{≤k} = Σ^1 ∪ … ∪ Σ^k. The path frequencies at level K are defined as f_K(T) = (occ(t, T))_{t ∈ Σ^{≤K+1}}. Fig. 1 illustrates the path frequencies of an un-rooted (Σ, val)-labeled multitree (a chemical compound, C2O2H4) at level K = 1, where all paths are treated as directed; thus occ(OH, T) = occ(HO, T) = 1, and so on.

CCI is further defined as follows [11]. Given a finite label set Σ, a level K ≥ 1, a target with path frequencies g, and a valence function val : Σ → Z⁺, find all (Σ, val)-labeled multitrees T = (V, E) such that f_K(T) = g and deg(v) = val(l(v)) for each vertex v. The main concern in CCI is to avoid the generation of isomorphic chemical compounds. Many methods [13-14] solve this problem by choosing a unique vertex or a unique pair of adjacent vertices as a root. Fujiwara et al. [11] applied the centroid-rooted [15] and left-heavy properties (Theorem 1) to avoid the generation of isomorphic chemical compounds.
labeled multitree T such that fK (T ) = g and deg(v ) = val (l (v )) for each vertex v in multitree T = (V , E ) . However, the main concern with CCI is to avoid the generation of isomorphic chemical compounds. Many methods [13-14] propose to solve this problem by choosing a unique vertex or a unique pair of adjacent vertices as a root. Fujiwara et al. [11] applied centroid-rooted [15] left-heavy properties (Theorem 1) to avoid the generation of isomorphic chemical compounds.
Balanced Multi-process Parallel Algorithm for Chemical Compound Inference
181
Fig. 1. Example of (Σ, val ) -labeled multitree T
Theorem 1. For any tree with n vertices, either there is a unique vertex v * such that each subtree obtained by removing v * to contain at most ⎢⎣⎢(n − 1) / 2⎥⎦⎥ vertices, or there is a unique edge e * such that both of the subtrees obtained by removing e * contain exactly n / 2 vertices. Vertex v * is identified as a unicentroid of the tree, or a bicentroid for e * . In order to introduce left-heavy properties, multitree T is indexed using a depth-first search (DFS) order. In general, the vertex sequence, v0 , v1, …vn , is labeled by DFS from the root vertex. The depth function d (v ) of vertex v is the number of edges in P(v ) , where the depth of the root vertex is 0. The depth label sequence of T is defined as DL(T ) = (d(vo ), l(vo ), …, d(vn−1 ), l(vn−1 )) . T(v ) denotes a subtree rooted at vertex v and all of its descendants. The left-heavy properties are defined as follows. T is left-heavy if i < j implies DL(T(vi )) ≥ DL(T(v j )) for any two siblings vi and v j .
For any two depth label sequences DL(T1 ) = (d1,0 , l1,0 , d1,1, l1,1 …, d1,n , l1,n ) and 1
1
DL(T2 ) = (d2,0 , l2,0 , d2,1, l2,1 …, d2,n , l2,n ) , DL(T1 ) > DL(T2 ) means that there is a 2
2
i ∈ [1, min(n1 − 1, n2 − 1)] such that d1, j = d2, j and l1,j = l2, j for j = 0, …i − 1 . In addition to either (1) d1,i > d2,i , or (2) d1,i = d2,i and l1,i > l2,i . Fujiwara et al. [11] propose a branch-and-bound algorithm to enumerate all treelike chemical compounds with given path frequencies. It starts from an empty multitree, then iteratively creates a multi-tree rooted in atom v , where l (v ) ∈ Σ . After that, new children multitree offspring can be obtained by inserting vertex v ' where l (v ') ∈ Σ at the right-most path. If T violates (1) centroid-rooted constraints, (2)
fK (T ) ≤ g , and (3) deg(v ) ≤ val(l (v )) for each v ∈ T then candidate T is bounded immediately.
3 BMPBB-CCI BMPBB-CCI was designed on shared memory multi-core computers. However, the built-in shared memory facilities of the operating system (OS) somewhat restricted
portability to another OS, and they support only a few types of data structures. Therefore, a socket-based manager process was implemented which holds various types of data structures; the manager could also be extended to distributed memory computing architectures, such as cluster systems. Fig. 2 shows the framework of BMPBB-CCI. The manager module of the manager process implements a socket-based communication object which supports data structures such as lists, double-ended queues (deques) and dictionaries. A Global Queue (GQ) is implemented on the manager process, and a Local Queue (LQ) on each computing process, to balance the workload among processors and to reduce inter-process communication.

There are three different kinds of process: the main process (MP), the computing processes (CPs) and the manager process (MgP). Since the valence of the H atom is 1 and it is always attached at a leaf node, H atoms can be removed during branching operations. The path frequency with the H atoms is computed in step 2. After that, in steps 3 and 4, the MgP is created and started, and the required shared objects are allocated. In steps 5 and 6, a CP is created and launched on each computing core, and the Ids of the created shared objects are passed to each CP. Finally, the MP joins all launched CPs until they terminate, and then the MP writes the results to disk.
Fig. 2. Multi-process framework of BMPBB-CCI: a manager process (MgP) hosting the manager module and the Global Queue (GQ); the main process (MP) with its sub-process launcher and synchronizer; and computing processes CP1, …, CPn, each with a balanced branch-and-bound module and a Local Queue (LQ)
The branch and bound operations are done in the CPs; a CP also uses the shared objects to balance the workload. The CP algorithm is an infinite loop that runs until all solutions are found. In steps 2 to 6, a compound from the LQ is chosen first; if the LQ is empty, a compound from the GQ is chosen if the GQ is not empty. Otherwise, a new atom from insertAtomQueue is selected as the new root of a candidate compound in step 5. The
bounding operations are applied in steps 7 to 13. H atoms are inserted back into comp to check the feature-vector constraints and valence constraints in steps 8 and 9. The centroid-rooted and left-heavy properties are verified in step 10 to avoid the generation of isomorphic chemical compounds. If comp passes all verifications, it is inserted into resultQueue in step 11. comp is dropped immediately if its path frequency f_K(comp) is greater than g_noH (step 13). The remaining steps are branching operations. Each potentially new pair of atoms is verified against g_noH; if the pair is present in g_noH, the new candidate compounds newComps are generated. Moreover, we borrow the single-bond transformation proposed by Fujiwara et al. [11]: the number of bonds of a newly attached vertex is limited by the maximum number of bonds between that pair of atoms in the target compound. The search space can therefore be significantly reduced, saving computation time. In order to balance the workload, when there are too few compounds in the GQ the newComps are appended to the GQ; otherwise they are appended to the LQ. Since the GQ is filled during branching operations, a CP can immediately acquire candidate compounds from the GQ without waiting for other CPs to transfer candidates. The MgP creates and serves shared objects via socket-based connections. The benefits of the shared-object facility are that (1) it handles various types of data structures, (2) it provides built-in synchronization, and (3) it can share objects across different computers. The algorithm for BMPBB-CCI is given below.

Algorithm. BMPBB-CCI
Input: Target compound and valence function
Output: Chemical compounds that conform to the path frequencies of the given compound
Environment: Multi-core architecture computers

Main Process (MP)
Step 1: Remove the H atoms from the given target compound and compute its path frequency g_noH.
Step 2: Insert the H atoms back into the given target compound and compute its path frequency g_H.
Step 3: Create and start the MgP and create a global queue on the MgP.
Step 4: Allocate shared objects from the MgP: resultQueue and insertAtomQueue.
Step 5: Create a CP on each computing core of the processor.
Step 6: Start the CPs with the following parameters: g_noH, g_H, resultQueue, insertAtomQueue.
Step 7: Wait until all started CPs have terminated.
Step 8: Write the results to disk.

Computing Process (CP)
Step 1: while True:
Step 2:   if the local queue is not empty: comp = pop the last item of the local queue
Step 3:   else if the global queue is not empty: comp = pop the last item of the global queue
Step 4:   else if resultQueue is not empty: break
Step 5:   else if insertAtomQueue is not empty: comp = pop an atom from insertAtomQueue and create a new compound
Step 6:   else: break
Step 7:   if the number of atoms of comp equals that of the target compound without H:
Step 8:     add the H atoms to comp
Step 9:     if f_K(comp) != g_H: continue
Step 10:    check the centroid-rooted and left-heavy properties
Step 11:    if the checks pass, add comp to resultQueue
Step 12:  else:
            check whether f_K(comp) <= g_noH; if not, drop comp
Step 13:  for each v on the right-most path of comp:
Step 14:    if v is the last atom on the right-most path: insertAtom = Σ; else: insertAtom = {s : s < l(next vertex of v)}
Step 15:    for each s in insertAtom: if (s, v) appears in g_H, insert s on v to generate new candidates newComp_0, …, newComp_q such that the number of bonds between s and v is at most the maximum number of bonds of that pair in the target compound
Step 16:    if len(GQ) <= (numberOfProcessors)^2 / 2: insert newComp_i into the GQ for i = 0, …, q; else: insert newComp_i into the LQ for i = 0, …, q

Manager Process (MgP)
Step 1: Create a socket and wait for requests.
Step 2: If an incoming request is a shared-object creation, create a shared object of the requested data type and return the shared-object Id.
Step 3: If an incoming request is a shared-object read, read the corresponding shared object of the given Id and return the value.
Step 4: If an incoming request is a shared-object write, acquire a lock, write the given data to the shared object of the given Id, and release the lock.
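The socket-based manager pattern described for the MgP matches the manager facility available in Python's standard library. The following sketch illustrates that pattern under simplified assumptions; it is an illustration of the design, not the authors' implementation.

```python
from multiprocessing.managers import SyncManager

# A socket-served manager process holding shared objects (queues, lists,
# dicts) for the computing processes, in the spirit of the MgP above.
class ManagerProcess(SyncManager):
    pass

if __name__ == "__main__":
    mgr = ManagerProcess(address=("127.0.0.1", 50000), authkey=b"bmpbb")
    mgr.start()                      # forks the manager (MgP) process
    gq = mgr.Queue()                 # Global Queue (GQ)
    result_queue = mgr.list()        # resultQueue shared object
    insert_atom_queue = mgr.list(["C", "N", "O"])  # roots still to expand
    # CPs launched on each core would receive proxies to these objects;
    # every read and write travels over the manager's socket, with the
    # locking handled by the manager, as in MgP steps 2-4.
    gq.put("seed-compound")
    print(gq.get(), list(insert_atom_queue))
    mgr.shutdown()
```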
4 Experimental Results

BMPBB-CCI was implemented in the Python language (2.6) on an IBM System x3650 T consisting of 2 Intel Xeon 3.20 GHz CPUs (8 computing cores, 4 cores per CPU). Two data sets were used to verify BMPBB-CCI: one was the KEGG LIGAND database [16], the other a set of 22 compounds for neuraminidase (NA) inhibitors of the influenza A virus from Zhang et al. [17]. The chemical compounds in the KEGG LIGAND database were used to verify the performance of BMPBB-CCI, and path frequencies were computed for levels 1 and 2. Due to the page limitation, only selected compounds and their properties are shown in Table 1, where (1) C00064, C00073 and C00077 are the entries for L-Glutamine, L-Methionine and L-Ornithine in the KEGG LIGAND database, (2) n1 is the number of atoms of an entry and n2 is the number of atoms with the H atoms removed, and (3) fs is the number of feasible solutions found. Fig. 3 (a)-(c) shows the computation time of
BMPBB-CCI with different numbers of processes. It was found that launching more processes (computing processes, CPs) reduced the computation time. Moreover, smaller K values impose fewer constraints, and more feasible solutions were found (see Table 1); there were therefore more candidate compounds in the branch-and-bound operations, leading to larger solution spaces to be traversed, so the computation time was long when level K was small. Fig. 3 (d) illustrates the speed-up ratio of BMPBB-CCI for (a)-(c). The speed-up ratio increased as the number of processes increased, and the speed-up ratios were satisfactory even with 8 processes. This result showed that BMPBB-CCI is scalable.

Table 1. Properties of selected compounds

Entry    Formula      n1   n2   K   fs
C00064   C5H10N2O3    20   10   1   274
                                2   3
C00073   C5H11NO2S    21    9   1   339
                                2   2
C00077   C5H12N2O2    21    9   1   236
                                2   3
(a) Computation time for C00064; (b) computation time for C00073; (c) computation time for C00077, each for K = 1 and K = 2 over 1, 2, 4 and 8 processes; (d) speed-up ratio of the entries C00064, C00073 and C00077 for K = 1 and K = 2 with 1, 2, 4 and 8 processes.
Fig. 3. Computation time and speed-up ratio
BMPBB-CCI was used to infer novel chemical compounds for the second data set to show that it has potential for drug design. A pharmacophore model consists of a 3D arrangement of a collection of features necessary for the biological activity of ligands (compounds). First, a pharmacophore model (built with Accelrys DiscoveryStudio) was
constructed for the NA of an influenza A virus from Zhang et al. [17]. The model was then tested with Tamiflu and Zanamivir. The results (Table 2) showed that it is reliable, since the predicted IC50 and actual IC50 are of the same order of magnitude. Finally, the model was used to test new chemical compounds inferred by BMPBB-CCI. Due to the page limitation, only the compound Cpd is used as an example (Table 2). The new compound Cpd-Reb had a better predicted IC50 and Fit value than Cpd-Ori. The DiscoveryStudio CDOCKER docking program was also used to compute the interactions between compound and protein: the best CDOCKER interaction energy was 42.565 for Cpd-Ori and 45.324 for Cpd-Reb. The docked poses are given in Fig. 4. These results show that the novel compound Cpd-Reb may be a candidate inhibitor of the NA of the influenza A virus.

Table 2. Results for test compounds in the pharmacophore model

Compound    Actual IC50 (nM)   Estimated IC50 (nM)   Fit value   Mapped features*
Tamiflu     1                  8.502                 10.95       + + +
Zanamivir   1.3                4.848                 11.194      + + +
Cpd-Ori#    6300               5639.67               8.129       + +
Cpd-Reb#    NA                 6.577                 11.062      + + + +

* Features: HD1: hydrogen-bond donor 1; HD2: hydrogen-bond donor 2; HY: hydrophobic group; NI: negative ionizable group; PI: positive ionizable group.
# Cpd-Ori: original compound Cpd; Cpd-Reb: novel compound inferred by BMPBB-CCI.
Fig. 4. The Cpd-Ori (a) and Cpd-Reb (b) poses docked into the NA protein (PDB code: 2hu4)
5 Conclusions

An algorithm (BMPBB-CCI) was designed and verified on multi-core computers. From the experimental results, it was observed that BMPBB-CCI reduced the computation time with more processors in the case of the KEGG LIGAND database, and it achieved a satisfactory speed-up ratio for most of the test cases. Moreover, it showed potential for drug design.
Acknowledgement We are grateful to the National Center for High-performance Computing for computer time and facilities.
References 1. Buchanan, B., Feigenbaum, E.: DENDRAL and Meta-DENDRAL, Their Applications Dimension. Artif. Intell. 11, 5–24 (1978) 2. Funatsu, K., Sasaki, S.: Recent advances in the automated structure elucidation system, chemics. Utilization of two-dimensional nmr spectral information and development of peripheral functions for examination of candidates. J. Chem. Inf. Comput. Sci. 36(2), 190– 204 (1996) 3. Faulon, J., Churchwell, C., Visco Jr., D.: The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences. J. Chem. Inf. Comput. Sci. 43(3), 721–734 (2003) 4. Hall, L., Dailey, R., Kier, L.: Design of molecules from quantitative structure-activity relationship models. 3. Role of higher order path counts: path 3. J. Chem. Inf. Comput. Sci. 33(4), 598–603 (1993) 5. Deshpande, M., Kuramochi, M., Wale, N., Karypis, G.: Frequent substructure-based approaches for classifying chemical compounds. IEEE Trans. Knowl. Data Eng. 17(8), 1036–1050 (2005) 6. Fink, T., Reymond, J.: Virtual Exploration of the Chemical Universe up to 11 Atoms of C, N, O, F: Assembly of 26.4 Million Structures (110.9 Million Stereoisomers) and Analysis for New Ring Systems, Stereochemistry. J. Chem Inf. Model. 47(2), 342–353 (2007) 7. Mauser, H., Stahl, M.: Chemical fragment spaces for de novo design. J. Chem. Inf. Model. 47(2), 318–324 (2007) 8. Kashima, H., Tsuda, K., Inokuchi, A.: Marginalized kernels between labeled graphs. In: ICML, pp. 321–328 (2003) 9. Akutsu, T., Fukagawa, D.: Inferring a Graph from Path Frequency. In: Apostolico, A., Crochemore, M., Park, K. (eds.) LOPSTR 2004. LNCS, vol. 3573, pp. 371–382. Springer, Heidelberg (2005) 10. Yu, C., Wah, B.: Efficient branch-and-bound algorithms on a two-level memory system. IEEE Trans. Softw. Eng. 14(9), 1342–1356 (1988) 11. Fujiwara, H., Wang, J., Zhao, L., Nagamochi, H., Akutsu, T.: Enumerating Treelike Chemical Graphs with Given Path Frequency. J. Chem. Inf. Model. 48(7), 1345–1357 (2008) 12. OpenMP, http://openmp.org/ 13. Nakano, S., Uno, T.: Generating colored trees. In: Kratsch, D. (ed.) WG 2005. LNCS, vol. 3787, pp. 249–260. Springer, Heidelberg (2005) 14. Wright, R., Richmond, B., Odlyzko, A., McKay, B.: Constant time generation of free trees. SIAM J. Comput. 15, 540 (1986) 15. Jordan, C.: Sur les assemblages de lignes. J. Reine Angew. Math. 70(185), 81 (1869) 16. KEGG Ligand database, http://www.genome.jp/kegg/ligand.html 17. Zhang, J., Yu, K., Zhu, W., Jiang, H.: Neuraminidase pharmacophore model derived from diverse classes of inhibitors. Bioorg. Med. Chem. Lett. 16, 3009–3014 (2006)
Harnessing Clusters for High Performance Computation of Gene Expression Microarray Comparative Analysis Philip Church1, Adam Wong2, Andrzej Goscinski1, and Christophe Lefèvre3,4 1
School of Information Technology, Deakin University, Geelong, Australia {pcc,ang}@deakin.edu.au 2 Victorian Partnership for Advanced Computing (VPAC) [email protected] 3 Institute for Technology Research and Innovation (ITRI), BioDeakin, Deakin University 4 Victorian Bioinformatics Consortium, Monash University [email protected]
Abstract. Gene expression comparative analysis allows bio-informatics researchers to discover the functional regulation of genes. This is achieved through comparisons between data-sets representing the quantities of substances in a biological system. Unnatural variations can be introduced during the data collection and digitization process, so normalization algorithms must be applied to the data before any accurate comparison can be made. There exist many different normalization methods, each of which gives a different result. Comparing differently normalized datasets can allow the discovery of crucial regulated genes that may otherwise be hidden due to errors in a single normalization study. In this paper we introduce a web-based software package called EXP-PAC which makes use of a high performance computing platform of computer clusters to run multiple normalization methods in parallel. By generating multiple normalized datasets concurrently, we give researchers the ability to improve the accuracy of their research at almost no extra time cost. Keywords: Gene Expression, Normalization, Clusters, Statistical Algorithms.
1 Introduction

Gene expression comparative analysis is a field of bioinformatics that allows researchers to discover the function and regulation of genes. Its basic principle is to group datasets with similar genes and isolate the components that are different. By comparing the expressed genes of similar components with different features, we can determine the functionality of genes. For example, the mother wallaby produces two different types of milk over an eight-month period. The first milk has low levels of fat and protein and elevated levels of carbohydrate; in the second milk, the concentration of carbohydrate declines, whereas the concentrations of both protein and fat increase. Research has shown that pouch young fed the second type of milk develop quicker and are stronger than those fed the first type of milk [1]. By comparing the expressed genes of the two milks, we can isolate the components and quantities that are responsible for this improved growth.
The microarray approach [2] is the most common method of collecting gene expression data currently used in bioinformatics. Data from microarray experiments must be stored digitally using one of the many gene expression file formats before being analysed using statistical algorithms. Normalization is a key part of gene expression microarray analysis, since unnatural variations can be introduced during the data collection and digitization process; the data must be corrected and standardized against other arrays before being compared and analysed. There are many normalization algorithms, which balance accuracy of ratios against correctness of the returned probes. Comparison of differently normalized datasets can uncover expressed genes that were once overlooked; however, the time required to generate the normalized datasets means these additional comparisons are not often done.

In this paper we study the use of high performance computing platforms, such as clusters, in running multiple gene expression microarray normalization algorithms concurrently. We present a web-based package called EXP-PAC that makes use of high performance computing to prepare multiple gene expression datasets with common normalization methods in a reduced time. Using this package, it is hoped that gene expression researchers will be able to improve the accuracy of their results with minimal time and effort.

The rest of this paper is organized as follows. Section 2 gives a background of gene expression comparative analysis in bioinformatics. Section 3 explains the role of, and the mechanisms used in, normalization of gene expression data. Section 4 describes the implementation of our high performance normalization method through EXP-PAC [3], a gene expression comparative analysis framework. The experiments and performance evaluation of the high performance normalization method are presented in Section 5. Finally, Section 6 concludes by noting the importance of high performance computing in comparative microarray gene expression analysis.
2 Gene Expression Comparative Analysis

Gene expression comparative analysis is usually performed in the following three steps, as illustrated in Fig. 1.

1. Data is collected in a wet-lab using a gene expression platform (cDNA, microarrays, etc.).
2. Collected data is converted to a digital format and any un-natural variation is removed.
3. Data analysis is used to group together similar datasets to locate components responsible for biological functions.
Fig. 1. The three stages of gene expression comparative analysis
Microarrays are the most common gene expression collection platform. They are nylon boards that contain thousands of probes that react when exposed to a biological sample. A typical microarray experiment consists of many arrays. To be recognized by computers, the surface area of each array is scanned and converted to an image file. Image files are then converted to numerical datasets, the colour intensity of each probe is often represented by a ratio. Statistical algorithms can be applied in order to determine the numerical probability of similarity between datasets.
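For instance, a two-colour probe intensity pair is commonly reduced to a log ratio before analysis. The following one-function sketch is a generic illustration of that conversion, not tied to any particular scanner or file format:

```python
import math

def log_ratio(sample_intensity, reference_intensity):
    """Represent a two-channel probe measurement as a log2 ratio."""
    return math.log2(sample_intensity / reference_intensity)

print(log_ratio(1200.0, 300.0))  # 2.0, i.e. four-fold over-expression
```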
3 Normalization of Gene Expression Data

Microarray data collection is not perfect; un-natural variations can arise due to differences in sample preparation, production of the arrays and processing of the arrays. Normalization algorithms are applied to the collected data in order to standardize intensities and remove these un-natural variations. Common normalization approaches include background subtraction, construction of artificial reference arrays and quantile normalization.

• Background subtraction [4] removes variation introduced by background noise during the scanning process. This method involves subtracting the background from the values of the probes, and it relies on having space between probes, as background noise is usually not linear.
• An artificial reference array [5] is a model based on the median of gene expression levels over all other arrays. Using this constructed model it is possible to normalize values across microarrays.
• Quantile normalization [5] is another popular method for normalization across arrays, based on transforming each array-specific distribution of intensities so that they have the same values at specified quantiles, for example centring arrays on their medians.

A number of algorithms combine these methods to provide normalization. These algorithms balance the accuracy of ratios against the correctness of the returned probes [6], so it is reasonable to expect each method to return different results. Commonly used algorithms include the following (a sketch of quantile normalization is given after this list).

• The RMA measure [7] normalizes data using an artificial reference array created through quantile normalization of probes.
• Gcrma [8] performs a background adjustment based on the sequence information of each probe.
• Mas5 [9] makes use of the mismatch (MM) probes, which are designed to remove similarly targeted genes and thereby increase the accuracy of microarray data; normalization uses the robust average of the (logged) probe-MM values.
• Plier [10] is similar to the Mas5 method, but it assumes that variations between perfect match (PM) probes are more accurate than the MM probes.
• Qspline [11] fits splines to the quantiles of a target array; normalization is performed using these splines.
• Invariantset [12] normalizes arrays by selecting invariant sets of genes (or probes) and then using them to fit a non-linear relationship between "treatment" and "baseline" arrays.
Normalization using different algorithms can be applied to a dataset in multiple ways in a single experiment. Normalization algorithms such as RMA [7], Mas5 [9], gcrma [8] and plier [10] differ in their approach; some focus on delivering accurate data (value of ratios) and others on correctness of data (filtering expressed genes) [6]. Performing normalization using different methods can uncover different sets of expressed genes, increasing the accuracy of results from comparative analysis. Most researchers, however, do not take advantage of this opportunity to improve the accuracy of their results, because of the time necessary to run the extra analysis. Using high performance computing, multiple normalization methods can be run in parallel, allowing many normalized datasets to be collected without the disadvantage of an increased processing time. This method of high performance normalization has been applied to a software package called EXP-PAC.
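As an illustration of the quantile normalization idea described above, the following minimal sketch (in Python with NumPy; EXP-PAC itself drives R/Bioconductor, so this is not the package's code) maps every probe intensity to the mean intensity of its rank across arrays:

```python
import numpy as np

def quantile_normalize(intensities):
    """Quantile-normalize a probes-by-arrays matrix so that every
    array (column) ends up with the same intensity distribution."""
    order = np.argsort(intensities, axis=0)                  # sort order within each array
    ranks = np.argsort(order, axis=0)                        # rank of each probe within its array
    reference = np.sort(intensities, axis=0).mean(axis=1)    # mean intensity at each rank
    return reference[ranks]                                  # replace each value by its rank's mean

# Toy example: three probes measured on two arrays.
x = np.array([[5.0, 4.0],
              [2.0, 1.0],
              [3.0, 4.5]])
print(quantile_normalize(x))  # both columns now share the same set of values
```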
4 EXP-PAC

EXP-PAC is a web based system developed for gene expression comparative analysis. It combines the features of two pre-existing software packages, MammoSapiens [13] and EST-PAC [14] (see Fig. 2). MammoSapiens is a tool for gene expression analysis providing a unique method of post-analysis based on SQL queries. EST-PAC is a sequence analysis framework which provides data storage and management, security, and sequence analysis through embedded tools. EXP-PAC allows users to upload a number of gene expression file formats (raw microarray data, SOFT [15], MAGE-TAB [16], etc.). Uploaded raw microarray data files (also called CEL files) are normalized using the R statistical scripting language. EXP-PAC supports normalization on a distributed platform which uses the Sun Grid Engine [17] in order to speed up microarray data analysis. Normalized data can be linked to results from statistical analysis, which are uploaded in a tab-delimited format. Microarray data uploaded to the system can then be queried using an interface dynamically generated from the uploaded microarray and statistical data. In addition, through creation of a sequence-to-probe-ID map, it is possible for a user to perform comparisons on multiple species. Unique to EXP-PAC are cross-species analysis and high performance normalization. Cross-species analysis combines sequence and gene expression data to compare biological systems. High performance normalization allows users to apply multiple normalization algorithms to different uploaded datasets. The EXP-PAC system constructs normalization tasks, each of which encapsulates the computation of gene expression data for a particular dataset with a particular normalization algorithm. Multiple such compute tasks are submitted to the batch job scheduler of a high performance computing system, which then schedules them to run concurrently. Multiple normalized datasets can be compared using the cross-species analysis feature, highlighting differences between the different normalization methods.
Fig. 2. The structure of the EXP-PAC system
Fig. 3. Architecture showing EXP-PAC interacting with a high performance computing system
Accuracy of microarray comparative analysis can benefit from having access to multiple normalized datasets. The normalization methods described in Section 3 require that each array first be loaded into memory in order to generate a standardized dataset. Implementing truly parallel normalization methods would require that new normalization techniques be developed. However, common methods of normalization can take advantage of embarrassingly parallel execution to improve the speed of data generation: each normalization job can be run on an individual compute node of a cluster with no modification to the base algorithms or implementation. Our system (see Fig. 3) consists of a bio-server which hosts the EXP-PAC platform; through this interface users can submit normalization jobs and the raw microarray data to unoccupied cluster nodes. Datasets are made available as they are generated.
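To make the embarrassingly parallel pattern concrete, the following sketch shows how one independent normalization job per (dataset, method) pair might be handed to the Sun Grid Engine from Python. The R script names and qsub options here are illustrative assumptions, not EXP-PAC's actual code (EXP-PAC drives its R scripts through the web interface):

```python
import itertools
import subprocess

# Dataset accessions from Section 5; method names from Section 3.
DATASETS = ["E-GEOD-8191", "E-GEOD-14764", "E-MEXP-1594"]
METHODS = ["rma", "mas5", "gcrma", "justplier", "qspline", "invariantset"]

def submit_normalization_jobs():
    """Submit one independent SGE job per (dataset, method) pair;
    the scheduler then runs them concurrently on free nodes."""
    for dataset, method in itertools.product(DATASETS, METHODS):
        subprocess.run(
            ["qsub", "-N", f"norm-{dataset}-{method}",      # job name shown in qstat
             "-b", "y",                                     # command is a binary, not a job script
             "Rscript", f"normalize_{method}.R", dataset],  # hypothetical per-method R script
            check=True)

submit_normalization_jobs()
```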
5 Experiment and Results

Normalization of microarray data was performed on three sets of CEL files downloaded from the ArrayExpress [18] database. The CEL files were taken from experiments E-GEOD-8191 [19], E-GEOD-14764 [20] and E-MEXP-1594 [21]. Starting at 40 files, these experiments double in size and file quantity until a maximum of 160 files is reached. Choosing this range of CEL file sets allows the performance of the normalization algorithms to be measured as file size increases. Implementations of the normalization algorithms were provided by the R statistical language [22] and Bioconductor [23] libraries. R scripts were written for the six most commonly used normalization methods described in Section 3: RMA, mas5, gcrma, justPlier, qspline, and invariantset. Performance testing of EXP-PAC's normalization algorithms made use of a high performance computer cluster at Deakin University. This cluster contains 20 compute nodes connected by an InfiniBand network. Each compute node has two Intel quad-core processors (a total of 160 CPU cores across the cluster) running at 1.6 GHz and 8 GB of RAM. The CentOS Linux operating system runs on each node of the cluster, and the Sun Grid Engine [17] middleware is used for resource management (see Fig. 3). Computations to be performed on the cluster were submitted through the EXP-PAC system to the Sun Grid Engine scheduler. Jobs submitted to the scheduler are queued and distributed across the nodes of the cluster with processing power available. Normalization tasks carried out on single node and multi node (i.e., a computer cluster) platforms were implemented and compared as illustrated in Fig. 4. For each file size, single node normalization involved a sequential execution of six R scripts, each representing a commonly used normalization method; a normalized gene expression data file is generated for each R script run. Conversely, multi node processing speeds up this normalization process by submitting the same six R scripts to a computer cluster, where they are executed concurrently. The performances of the six normalization methods (RMA, mas5, gcrma, justPlier, qspline and invariantset) were examined on each downloaded gene expression
Fig. 4. Implemented single and multi node normalization methods
experiment on both single node and multi node platforms. Each method was run three times to ensure accurate performance results. Single node normalization showed major differences between the computation times of the different normalization methods (see Fig. 5). The RMA measure was the quickest of the six tested methods; its run time increased linearly with file count. The qspline normalization method was the second fastest algorithm, completing normalization of forty files in two minutes and twenty-three seconds. The gcrma method has a stable computation time when processing the smaller file sets, while the one-hundred-and-sixty-file dataset required a great amount of processing time. Invariantset normalization took longer than gcrma; however, it shows a linear increase in computation time. JustPlier is quicker when processing small amounts of data but shows a greater-than-linear increase in processing time as the number of input files increases. The Mas5 method is the most time consuming of the tested methods. In multi node processing, all of the 18 normalization tasks (3 file sizes, 6 normalization algorithms) were run concurrently (one task per CPU). The number of normalization tasks is less than the twenty nodes available on our cluster; therefore all normalization tasks were completed within the time taken by the largest file set and the slowest normalization method, namely the 160-file set processed with justPlier. Figure 6 shows that a dramatic improvement in computation time was achieved when normalization was moved from single node execution (a total time of 226 minutes) to multi node execution (a total time of 46 minutes). Comparison of these execution times shows that a speed-up of nearly five has been achieved.
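For reference, the reported speed-up follows directly from the two totals shown in Fig. 6; a two-line check:

```python
# Totals reported in the text for the complete batch of normalization tasks.
single_node_minutes = 226
multi_node_minutes = 46
print(f"speed-up = {single_node_minutes / multi_node_minutes:.2f}")  # prints 4.91
```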
Fig. 5. Average performance of pre-processing methods on a single node computer
Fig. 6. Comparison of single node and multi node computational time for normalization
6 Conclusion

In this paper, we have presented a study of how high performance computing platforms such as clusters can be employed to run multiple gene expression microarray normalization methods in parallel. We have presented a system called EXP-PAC that provides a framework for gene expression research. Compared to available gene expression software packages, EXP-PAC is unique in that it provides a method of cross-species gene expression analysis and a method to apply high performance computing to common normalization methods. We have performed an experiment to parallelize the normalization of gene expression data, using three different sizes of data file sets and six different normalization methods as a case study. The speed-up of the normalization process from single node execution to multi node execution is nearly five. Our result has demonstrated
that significant computational improvement in the normalization of gene expression data can be achieved, especially when large files and multiple normalization methods are applied. By generating these six normalized datasets in the time taken to produce one, researchers can efficiently compare the key differences in expressed genes.
References

1. Trott, J.F., Simpson, K.J., Moyle, R.L.C., Hearn, C.M., Shaw, G., Nicholas, K.R., Renfree, M.B.: Maternal Regulation of Milk Composition, Milk Production, and Pouch Young Development During Lactation in the Tammar Wallaby (Macropus eugenii). Biol. Reprod. 68, 929–936 (2003)
2. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., Vingron, M.: Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet. 29, 365–371 (2001)
3. Church, P., Goscinski, A., Wong, A., Lefevre, C.: Exp-Pac: A Web Based Package for the Comparative Analysis of Microarray Data. Bioinformatics Australia, Melbourne (2009)
4. Yang, Y.H., Buckley, M.J., Speed, T.P.: Analysis of cDNA microarray images. Briefings in Bioinformatics 2, 341–349 (2001)
5. Dudoit, S., Yang, Y.H., Callow, M.J., Speed, T.P.: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12, 111–140 (2002)
6. Irizarry, R.A., Wu, Z., Jaffee, H.A.: Comparison of Affymetrix GeneChip expression measures. Bioinformatics 22, 789–794 (2006)
7. Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., Speed, T.P.: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostat. 4, 249–264 (2003)
8. Wu, Z., Irizarry, R., Gentleman, R., Martinez-Murillo, F., Spencer, F.: A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. J. Am. Stat. Assoc. 99, 909–917 (2004)
9. Hubbell, E., Liu, W.-M., Mei, R.: Robust estimators for expression analysis. Bioinformatics 18, 1585–1592 (2002)
10. Affymetrix, Inc.: Technical note: guide to probe logarithmic intensity error (PLIER) estimation (2005)
11. Workman, C., Jensen, L., Jarmer, H., Berka, R., Gautier, L., Nielser, H., Saxild, H.-H., Nielsen, C., Brunak, S., Knudsen, S.: A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol. 3, research0048.1–research0048.16 (2002)
12. Li, C., Wong, W.H.: Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc. Natl. Acad. Sci. U.S.A. 98, 31–36 (2001)
13. Lefèvre, C., Nicholas, K.R., Kumar, A., Strahm, Y., Powell, D., Seemann, T., Daly, K.A., Brennan, A., Menzies, K., Sharp, J., Digby, M.: MammoSapiens: eResearch of the lactation program. Building online facilities for collaborative molecular and evolutionary analysis of lactation and other biological systems from gene sequences and gene expression data. eResearch Australasia, Melbourne, Australia (2008)
14. Strahm, Y., Powell, D., Lefevre, C.: EST-PAC: a web package for EST annotation and protein sequence prediction. Source Code Biol. Med. 1, 2 (2006)
15. Barrett, T., Suzek, T.O., Troup, D.B., Wilhite, S.E., Ngau, W.-C., Ledoux, P., Rudnev, D., Lash, A.E., Fujibuchi, W., Edgar, R.: NCBI GEO: mining millions of expression profiles–database and tools. Nucl. Acids Res. 33, D562–D566 (2005)
16. Rayner, T., Rocca-Serra, P., Spellman, P., Causton, H., Farne, A., Holloway, E., Irizarry, R., Liu, J., Maier, D., Miller, M., Petersen, K., Quackenbush, J., Sherlock, G., Stoeckert, C., White, J., Whetzel, P., Wymore, F., Parkinson, H., Sarkans, U., Ball, C., Brazma, A.: A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 7, 489 (2006)
17. Gentzsch, W.: Sun Grid Engine: Towards Creating a Compute Power Grid. In: Proceedings of the 1st International Symposium on Cluster Computing and the Grid. IEEE Computer Society, Los Alamitos (2001)
18. Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., Holloway, E., Kapushesky, M., Kemmeren, P., Lara, G.G., Oezcimen, A., Rocca-Serra, P., Sansone, S.-A.: ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucl. Acids Res. 31, 68–71 (2003)
19. Anderson, S., Rudolph, M., McManaman, J., Neville, M.: Key stages in mammary gland development. Secretory activation in the mammary gland: it's not just about milk protein synthesis! Breast Cancer Res. 9, 204 (2007)
20. Denkert, C., Budczies, J., Darb-Esfahani, S., Györffy, B., Sehouli, J., Könsgen, D., Zeillinger, R., Weichert, W., Noske, A., Buckendahl, A.-C., Müller, B.M., Dietel, M., Lage, H.: A prognostic gene expression index in ovarian cancer - validation across different independent data sets. The Journal of Pathology 218, 273–280 (2009)
21. Ayroles, J.F., Carbone, M.A., Stone, E.A., Jordan, K.W., Lyman, R.F., Magwire, M.M., Rollmann, S.M., Duncan, L.H., Lawrence, F., Anholt, R.R.H., Mackay, T.F.C.: Systems genetics of complex traits in Drosophila melanogaster. Nat. Genet. 41, 299–307 (2009)
22. Ihaka, R., Gentleman, R.: R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 5, 299–314 (1996)
23. Gentleman, R., Carey, V., Bates, D., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J., Zhang, J.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004)
Semantic Access Control for Corporate Mobile Devices

Tuncay Ercan1 and Mehmet Yıldız2

1 Department of Computer Engineering, Yasar University, Universite cad. No:35-37, 35100 Bornova, Izmir, Turkey
[email protected]
2 Global Technology Services, IBM Australia, Melbourne, Australia
[email protected]
Abstract. Many mobile business applications are executed in different domains as business-to-consumer (B2C) and business-to-business (B2B) operations. The computing environments of mobile wireless devices owned by individuals or organizations have become fully distributed between peers and partners. Designing applicable access control mechanisms in this environment is difficult from the standpoint of traditional security measures. Semantic web technologies offer appropriate access to corporate resources by using the related context on user devices and servers under a trust philosophy, combining user requests and service descriptions for efficient matchmaking. This paper examines various access control mechanisms and analyzes them semantically. Its purpose is to present a more secure access control mechanism for corporate mobile devices. The model can be used as an additional security framework that enforces access control mechanisms within organizations.
1 Introduction

Current web technologies have brought great benefit to pervasive users and their mobile devices. Together with the development of Information Technology (IT) products, employees working in different industrial sectors have been using the Internet. Both individuals and organizations tend to use mobile devices such as laptops, personal digital assistants (PDAs), and smart mobile phones, and have a strong demand for networking and computing with each other. The widespread use and characteristics of these devices make them ideal for different applications and allow employees to keep current with organizational activities while working away from the office. Typical users are sales and marketing people and technicians. They place orders or check the availability of organizational resources such as products, different data types, and maintenance information, and they want easy and secure access to corporate processes. A combination of wired and wireless hardware technologies creates new advantages for business applications on the Internet. However, the increasing number of mobile devices causes confusion and difficulty. B2B and B2C companies offer their business services to customers by delegating the necessary access rights for their corporate resources through web services. While initial user authentication checks the
validity of the users' credentials, service providers check the identities and users' privileges in the targeted company domain. The access decision is made by the company servers, based on an assessment of the security policies associated with mobile device functions and user requirements. Web 2.0 is the current World Wide Web (WWW) technology and enhances users' functionality, information sharing, and collaboration with each other. Since this type of environment is used by mobile users, their needs and the available services should meet at a common point where overall corporate management can be smoothly handled. Using semantic web technologies is the only approach for an efficient and effective access control mechanism. A mobile ontology describes a mobile domain covering mobile networks, mobile applications, mobile users, and mobile devices [1]. Mobile context changes are related to different mobile devices, physical locations, applications executed by the users, and different service requests from service providers on the Internet. Given the importance of interoperability for mobile networking services, a semantic ontology may capture how suitable each device is for different types of information and interaction demands. Kalaoja et al. [2] presented an analysis of ontologies in different functional domains to cope with the heterogeneity of service descriptions in discovery, integration, and composition. Mobile devices combine different features such as telephone, email, multimedia messaging, and web browsing to enable employees to keep current with organizational activities. However, this raises the problem of delegating the right access decisions for corporate resources. Traditional access control policies follow different rules to manage resource access between users and companies. In a pervasive environment, however, dynamic and spontaneous access rights among peers and service providers should be context dependent and specified by the company management. This paper reviews different access control mechanisms, analyzes them semantically, and offers a combined approach for secure access. The remainder of this paper is organized as follows. We first provide a detailed background through a comprehensive review of previous studies in the area. We present traditional access control policies and the semantic approach in Section 3. We propose our model in Section 4 and finally draw conclusions in Section 5.
2 Literature Review

Mobile communication and computing tools have been widely used for only the last five to six years, so best practices and standards for personal and corporate mobility must still be identified. Considerable research on how to design robust security mechanisms for mobile devices, and on what the right mix of technologies is, can be summarized under the following topics:
− Semantic web technologies and mobile devices
− Access control applications and models
− Role-based access control (RBAC)
− Activity-based access control model (ACM)
− Context-based services and information retrieval
− Friend-of-a-friend (FOAF) model
− Attribute-based access control (ABAC)
As mobile devices become smaller, more sophisticated, and easier to obtain, they have become common tools for all users, acting as both requestors and providers of data. Doulkeridis and Vazirgiannis [3] argued that context for mobile web services would play an important role in service discovery, and focused on the semantic matching of static attributes. Some previous works have examined how to overcome the main problems of mobile devices arising from their well-known limitations. Jou [4] designed a semantics-based web content adaptation framework to provide content to different mobile devices. Mobile devices run context-aware applications based on context information such as the relative position of users, user preferences, device capabilities, and available resources. Semantic languages are well suited for the proper use of context information and facilitate knowledge sharing and interoperability among heterogeneous devices, but they require complex and heavyweight features that may not fit the capabilities of all user devices [5-8]. The highly dynamic and context-dependent requirements of corporate services in distributed environments motivate the use of ontology-based techniques to combine user requests and service descriptions. The ontology-based approach for mobility information systems in [9] provides different companies and individual consumers with information to facilitate their mobility in a metropolitan area; its main advantages are semantic user queries that are independent of the system implementation and the possibility of delegating user tasks to mobile software agents. References [10] and [11] respectively define an automatically generated access mechanism based on a set of different access patterns in ontology-based systems, and a role signature based on generic role identifiers to verify authorization by roles. To handle problems with mobile agents and object frameworks, new firewall features have enabled flexible access control perimeters; the Domain and Type Enforcement (DTE) firewall runs application-level proxies in restrictive domains to increase security [12]. A criterion-based multilayer access control (CBMAC) approach, extracted from authorization rules and enhancing existing access control models such as Role-Based, Mandatory, and Discretionary Access Control to support multilayer access control, was presented in [13]. Role-based access control (RBAC) is a widely used solution to these problems. However, as the size of an organization increases, user attributes such as membership class or job position become more complicated, and it is difficult for a centralized role server to manage a large and complex role hierarchy. The distributed role hierarchy in [14], which can manage the role hierarchy effectively and practically, ensures web security. Yamazaki et al. [15] discuss how an access control system can manage dynamic security policies using RBAC, with agents deciding access rights dynamically by means of context-enabled rules and an inference engine for the defined roles. References [16-20] describe different models that dynamically assign users to roles by setting out attributes that are not accessible according to the specified access control policies. Hung et al. [21] proposed an activity-based access control model (ACM) that leverages a user's activities to determine the access permissions for that user.
In ACM, a user is assigned to perform a number of actions if s/he possesses a set of satisfactory attributes, and access permissions to hospital information are granted according to users' actions. Jung et al. [22] studied a collected dataset of user activities in the telecommunication
industry and investigated how much the personal context of a certain person is interrelated with those of other people, in order to build meaningful relationships through a semantic approach. Other information retrieval systems were proposed in [23-25] to enable semantic information to represent context and domain knowledge, and to aid interactions mediated by mobile devices in an easy and efficient way, anywhere and anytime, for wireless devices. A FOAF ontology schema makes context queries semantically searchable and shareable among local and remote context-aware applications [26]. A trust mechanism that takes advantage of both the capabilities of the Semantic Web and mobile ad-hoc networks was described in [27], since mobile devices enable social interaction with a level of trust for another person. The attribute-based access control (ABAC) model, as a new approach, is based on subject, object, and environment attributes and supports both mandatory access control (MAC) and discretionary access control (DAC) [27]. ABAC provides a promising approach to defining authorization over shared resources and is based on users' attributes rather than their identities. However, user attributes are asserted by different authorities that may not be accepted by the resource owner with the same degree of trust.
3 Traditional Access Control Policies and Semantic Approach

Current information security mechanisms are insufficient and not well suited to handle the dynamic and ad hoc nature of wireless and mobile environments. Access control mechanisms make the authentication and authorization decisions on an entity's (subject's) access to corporate resources. Credentials are accepted as obligations to be fulfilled by the subject when the network connection is established. The environment can include everything around the user; it is independent of the subject (user or mobile device) and the object (any kind of resource). Proper use of resources in a dynamic environment requires important and sensitive decisions. The current access policies summarized below explain how different access decisions are handled. Discretionary access control (DAC) is user centric: the ownership of every system resource is assigned to one or more system users, who control the use of the objects under their control. This model is commonly used because it is simple, flexible, and easy to implement. In mandatory access control (MAC), resource access privileges are determined by the system, as distinct from DAC; access to data resources remains unchanged, since it is predefined through administrative procedures [28]. Role-based access control (RBAC) emerged as an alternative to DAC and MAC policies. RBAC regulates access control actions according to the subject's role; users thus inherit all the permissions associated with their roles, and the role ordering simplifies the separate definition of policies. Attribute-based access control (ABAC) uses the attributes of subjects and objects for authorization instead of directly using the permissions between them; there are three kinds of attributes: subject, resource, and environment [29]. The organization-based access control (OrBAC) system was developed because of the problems experienced with corporate access control policies. It works with three components: subject, action, and object. The policy describes how some subjects have permissions to perform some actions on
some objects, with the aim of controlling access. The OrBAC mechanism permits writing policies independently of the application [30]. New access control mechanisms, policies, and languages have been developed for the definition and implementation of security requirements in real-world scenarios such as Internet usage and wireless pervasive networks. The representation of semantic knowledge in the access control mechanisms of the previous section has only been partly covered. The resource owner (subject), the source of information (object), and the desired action (action) to be performed on the source should be kept in ontological form in order to realize a semantic approach. When we examine access control mechanisms, these ontological data (the resource to access, the entity that accesses the resource, and the action to be performed on the resource) stand out. In this respect, the source, subject, and actions in the access control mechanism must be represented semantically to reach the targeted semantic level. If proper corporate policies are applied, a corporate policy decision maker can handle components such as subject, actions, conditions, domain ontology, and policy object. By using the framework proposed in [31], particular pieces of information relevant to the corresponding users can be summarized and transferred back into the corporate mobile users' environment by generating semantic templates from users' input. Given the importance of interoperability issues in heterogeneous mobile networking services, semantic ontologies may capture how suitable each device is for different kinds of information and different interaction demands. The advantages of adopting semantic-based models improve the efficiency and accuracy of selecting the right web services in corporate environments; many of the works in this area are referenced in [32]. Even though advances in mobile and wireless technologies increase the demand for significant mobile services, discovering such services from any device is still a major challenge due to the variety of available mobile devices and the lack of service descriptions. Ontology-based approaches create alternatives for discovering mobile services with respect to user preferences and device capabilities. Niazi et al. [33] provided a Delivery Context Ontology with enriched profile-based service descriptions to discover, personalize, adapt, and automatically execute mobile services with a maximum degree of user satisfaction.
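To make the subject/resource/environment attribute model of ABAC [29] concrete, the following minimal sketch evaluates attribute predicates against an access request. All names and rules here are invented for illustration; they do not come from any of the cited systems:

```python
from dataclasses import dataclass

@dataclass
class Request:
    subject: dict      # e.g. {"role": "technician", "company": "companyA"}
    resource: dict     # e.g. {"type": "maintenance-info", "owner": "companyA"}
    environment: dict  # e.g. {"network": "corporate-wifi", "hour": 14}

def is_allowed(request, rules):
    """Grant access iff every predicate of at least one rule holds."""
    return any(all(pred(request) for pred in rule) for rule in rules)

# One illustrative rule: employees of the resource-owning company,
# connecting during working hours, may access maintenance information.
rules = [[
    lambda r: r.subject["company"] == r.resource["owner"],
    lambda r: r.resource["type"] == "maintenance-info",
    lambda r: 9 <= r.environment["hour"] <= 18,
]]
```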
4 Activity and Context Based Access Control

Looking at previous works, we see that there are two types of security application for mobile users. The first is kept on the user device and automatically processes security-related mechanisms; the second consists of the rules of the system and the organization. Predetermined corporate policies ensure security and reliability in the use of mobile devices, at least for corporate applications and during working hours, provided a company employee uses them for work-related issues. Employees deal with specific applications and data tied to organizational activities. User-activity-based access control mechanisms on the company side check user activities to determine access rights for the user. Context-based applications handled with semantic web technologies have increased in today's web domains, particularly in the area
of e-commerce. Context data can be characterized as user profiles and rights, users’ networking environment, mobile devices, mobile applications and a particular time during the day.
Fig. 1. Additional classes
In our work, we outline a combined policy of activity-based and context-based access control approaches. Both use the web interfaces of the service provider domains: the activity-based application uses XML specifications, while the context-based approach relies on the creation of the necessary ontology. Reference [34] presented a new method for searching documents with similar topics, designed to help mobile device users search, in a peer-to-peer environment, for documents whose topics are similar to those on the users' own devices. The algorithms were deliberately designed for slower processors, smaller memory, and small data traffic between devices, which allows their application in an environment of mobile devices such as phones or PDAs. A common corporate ontology can therefore be developed using the aforementioned concepts and features. Figure 1 shows additional classes in dotted circles, with a few attributes determined by the corporate policy, in a sample FOAF format. We can define the following steps in the creation of corporate policies:
− Identification of corporate activities (beginning with the user access to the organization web site) related to the corporate policy.
− Description of conditions (networking type, bandwidth, user device, user applications, etc.) related to the activities, using the domain ontology.
− Determining the type of policy objects and the association of the described policy objects with the policy subject.
Predetermined corporate policies allow application-specific policies to move along with the logic of source, action, and object in a distributed computing system. Figure 2, shown after the sketch below, gives an example of application permission.
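The FOAF-style extension sketched in Fig. 1 might be expressed, for instance, with the rdflib library; the corporate namespace, class, and attribute names below are invented for illustration, since the figure defines them only schematically:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

CORP = Namespace("http://example.org/corp#")  # hypothetical corporate vocabulary

g = Graph()
employee = URIRef("http://example.org/people/alice")
g.add((employee, RDF.type, FOAF.Person))        # standard FOAF class
g.add((employee, FOAF.name, Literal("Alice")))
# "Additional classes" in the spirit of Fig. 1, set by the corporate policy:
g.add((employee, CORP.worksFor, CORP.companyA))
g.add((employee, CORP.allowedApplication, CORP.applicationA))
print(g.serialize(format="turtle"))
```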
Except the "companyA" employees, no one is allowed to practice "application A". (Deny)
"CompanyB" employees use "application A2" of "CompanyA". (Allow)

Fig. 2. Allow/Deny sentences
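A first-match-wins encoding of these two sentences might look as follows; the evaluator itself is our illustrative sketch, and only the company and application names come from Fig. 2:

```python
# Ordered rules: (effect, required company or None, application).
RULES = [
    ("allow", "companyA", "application A"),   # companyA employees may practice application A
    ("deny",  None,       "application A"),   # everyone else is denied application A
    ("allow", "companyB", "application A2"),  # companyB employees may use companyA's application A2
]

def decide(user_company, application):
    """Return the effect of the first rule matching the request."""
    for effect, company, app in RULES:
        if app == application and company in (None, user_company):
            return effect
    return "deny"  # default-deny when no rule matches

assert decide("companyA", "application A") == "allow"
assert decide("companyC", "application A") == "deny"
assert decide("companyB", "application A2") == "allow"
```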
The context of an organizational services infrastructure is expected to support all mobile users, other business partners, and remote users as well. Figure 3 illustrates the dynamic security model that we propose for access solutions in a pervasive environment.
Fig. 3. Proposed Architecture
This model has emerged from the authors' practical experience in developing solutions for service-oriented projects and is based on personal and corporate attributes and on user activities for the requested applications. Taking the characteristics of the user's environment into account allows relevant information to be delivered to the corporate policy database. It is imperative that any deployable technology be inserted into the policy database with the required attributes to allow or deny end users. This multilayer architecture, different from the traditional seven-layer security process design, can provide a more powerful solution for the needs of electronic commerce applications in different domains. The credentials are again kept in the AAA server. Our proposed model uses different aspects of semantic approaches in access control mechanisms: it builds on existing traditional access control mechanisms and uses recent semantic approaches to remedy the weaknesses introduced by the wireless environment.
5 Conclusion

Defining the access control policies and the connectivity of mobile systems to the corporate web site will reduce problems during service creation. Other partner organizations collaborating with the central company will also require similar types of security. Flexibility in allowing organizations to define their own security policies will assist the growth of secure collaborations through their own web sites. Mobile users are becoming more computing-aware and want the flexibility to initiate organizational processes on their portable devices in a safe mode. Dealing with new context information brings new access and privacy issues into the related organization. Facilitating a semantic approach to context-based access control for a broad range of applications in different industrial sectors enables benefits across different vertical markets. However, specific ontologies owned by private organizations are only feasible at the application or agency (company) level; there can be different ontologies and semantic models before a common consensus about the business complexities affecting organizations is reached. The semantic infrastructure should be integrated with today's active security terms and standards such as rules, signatures, encryption, proof, and trust. Computers on both the end user and company sides should have the required web-scale knowledge base of hyperlinked data in order to reason about the content exchanged between end user devices and company web servers. The web widget approach proposed in [35] allows semantic functionality to be added to systems with just a few lines of code. One of the benefits of publishing widgets as centralized services (between the end user and the organization) is that updates in content and functionality are instantaneously available to users. This approach can easily be integrated into applications owned by the company.
References

1. Veijalainen, J.: Developing mobile ontologies; who, why, where, and how? In: Mobile Services oriented Architectures and Ontologies Workshop, MoSO 2007 (2007)
2. Kalaoja, J., Kantorovitch, J., Carro, S., Miranda, J.M., Ramos, A.: The vocabulary ontology engineering for the semantic modelling of home services. In: 8th International Conference on Enterprise Information Systems (ICEIS 2006), May 23-27 (2006)
3. Doulkeridis, C., Vazirgiannis, M.: Querying and updating a context-aware service directory in mobile environments. In: WI 2004 IEEE/WIC/ACM International Conference on Web Intelligence, September 20-24 (2004)
4. Jou, C.: A semantics-based automatic web content adaptation framework for mobile devices. Web Information Systems and Technologies, 230–242 (2008)
5. Corradi, A., Montanari, R., Toninelli, A.: Adaptive semantic support provisioning in Mobile Internet environments. In: International Symposium on Applications and the Internet (SAINT 2005), January 31-February 4 (2005)
6. Weissenberg, N., Gartmann, R., Voisard, A.: An ontology-based approach to personalized situation-aware mobile service supply. Geoinformatica 10(1), 55–90 (2006)
7. Drogehorn, O., Wust, B., David, K.: Personalised applications and services for a mobile user. In: International Symposium on Autonomous Decentralized Systems (ISADS 2005), April 4-8 (2005)
8. Bianchini, D., De Antonellis, V., Melchiori, M., Salvi, D.: Lightweight ontology-based service discovery in mobile environments. In: 17th International Conference on Database and Expert Systems Applications, Proceedings, pp. 359–364 (2006)
9. Faro, A., Giordano, D., Musarra, A.: Ontology based intelligent mobility systems. In: IEEE International Conference on Systems, Man and Cybernetics, vol. 1-5, pp. 4288–4293 (2003)
10. Villanueva, F.J., Villa, D., Barba, J., Rincon, F., Moya, F., Lopez, J.C.: Ontology access patterns for pervasive computing environments. In: Mikulecky, P., Liskova, T., Cech, P., Bures, V. (eds.) Ambient Intelligence Perspectives, pp. 236–244 (2009)
11. Crampton, J., Lim, H.W.: Role signatures for access control in open distributed systems. In: 23rd International Information Security Conference held at the 20th World Computer Congress, September 7-10 (2008)
12. Oostendorp, K.A., Badger, L., Vance, C.D., Morrison, W.G., Petkac, M.J., Sherman, D.L., Sterne, D.F.: Domain and type enforcement firewalls (1997)
13. Pan, L., Zhang, C.N.: A Criterion-Based Multilayer Access Control Approach for Multimedia Applications and the Implementation Considerations. ACM Transactions on Multimedia Computing Communications and Applications 5(2) (2008)
14. Lee, G.H., Yeh, H.J., Kim, W.I., Kim, D.K.: Web security using distributed role hierarchy. In: 2nd International Workshop on Grid and Cooperative Computing, December 7-10 (2003)
15. Yamazaki, W., Hiraishi, H., Mizoguchi, F.: Designing an agent-based RBAC system for dynamic security policy. In: 13th IEEE International Workshop on Enabling Technologies - Infrastructure for Collaborative Enterprises, June 14-16 (2004)
16. Al-Kahtani, M.A., Sandhu, R.: A model for attribute-based user-role assignment. In: 18th Annual Computer Security Applications Conference, December 9-13 (2002)
17. Carminati, B., Ferrari, E., Tan, K.L.: Enforcing Access Control Over Data Streams. In: 12th ACM Symposium on Access Control Models and Technologies, June 20-22 (2007)
18. Park, J.S., Ahn, G.J., Sandhu, R.: Role-based access control on the web using LDAP. Database and Application Security XV, 19–30 (2002)
19. Schwartmann, D.: An attributable role-based access control for healthcare. In: Bubak, M., DickVanAlbada, G., Sloot, P.M.A., Dongarra, J.J. (eds.), pp. 1148–1155 (2004)
20. Greenhalgh, C., Glover, K., Humble, J., Robinson, J., Wilson, S., Frey, J., Page, K., De Roure, D.: Combining System Introspection with User-Provided Description to Support Configuration and Understanding of Pervasive Systems. In: 3rd International Conference on Pervasive Computing and Applications, October 6-8 (2008)
21. Hung, L.X., Lee, S., Lee, Y.K., Lee, H.: Activity-based access control model to hospital information. In: 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, August 21-24 (2007)
22. Jung, J.J., Lee, H., Choi, K.S.: Towards Efficient Reality Mining with Contexts and Semantics: a Case Study of Telecommunication. In: 2nd International Symposium on Intelligent Information Technology Application, December 21-22 (2008)
23. Naudet, Y., Aghasaryan, A., Toms, Y., Senot, C.: An Ontology-based Profiling and Recommending System for Mobile TV. In: 3rd International Workshop on Semantic Media Adaptation and Personalization (SMAP 2008), December 15-16 (2008)
24. Martins, D.S., Santana, L.H.Z., Biajiz, M., do Prado, A.F., de Souza, W.L.: Context-aware Information Retrieval on a Ubiquitous Medical Learning Environment. In: 23rd Annual ACM Symposium on Applied Computing, March 16-20 (2008)
25. Mena, E., Illarramendi, A., Royo, J.A., Goni, A.: A Software Retrieval Service Based on Adaptive Knowledge-Driven Agents for Wireless Environments. ACM Transactions on Autonomous and Adaptive Systems 1(1), 67–90 (2006)
26. Hu, D.H., Dong, F., Wang, C.L.: A Semantic Context Management Framework on Mobile Device. In: 6th International Conference on Embedded Software and Systems, May 25-27 (2009)
27. Yuan, E., Tong, J.: Attributed based access control (ABAC) for web services. In: IEEE International Conference on Web Services, Proceedings, July 11-15, vol. 1-2 (2005)
28. Benantar, M.: Access Control Systems: Security, Identity Management and Trust Models. Springer Science Business Media, Heidelberg (2006)
29. Yuan, E., Tong, J.: Attribute Based Access Control - A New Access Control Approach for Service Oriented Architecture (SOA). In: New Challenges for Access Control Workshop (2005)
30. Cuppens, F., Miège, A.: Modelling Contexts in the Or-BAC Model. In: 19th Annual Computer Security Applications Conference (2003)
31. Jung, J.J., Park, S.B., Jo, G.S.: Semantic template generation based information summarization for mobile devices. In: Shimojo, S., Ichii, S., Ling, T.-W., Song, K.-H. (eds.) HSI 2005. LNCS, vol. 3597, pp. 135–143. Springer, Heidelberg (2005)
32. Lau, B.Y.S., Pham-Nguyen, C., Lee, C.S., Garlatti, S.: Semantic Web Service Adaptation Model for a Pervasive Learning Scenario. In: 2nd IEEE Conference on Innovative Technologies in Intelligent Systems and Industrial Applications, July 12-13 (2008)
33. Niazi, R., Mahmoud, Q.H.: An Ontology-Based Framework for Discovering Mobile Services. In: 7th Annual Conference on Communication Networks and Services Research, May 11-13 (2009)
34. Csorba, K., Vajk, I.: Iterative Search for Similar Documents on Mobile Devices. In: Dengel, A.R., Berns, K., Breuel, T.M., Bomarius, F., Roth-Berghofer, T.R. (eds.) KI 2008. LNCS (LNAI), vol. 5243, pp. 38–45. Springer, Heidelberg (2008)
35. Makela, E., et al.: Enabling the semantic web with ready-to-use web widgets. In: Proceedings of the First Industrial Results of Semantic Technologies Workshop, ISWC 2007 (2007)
A New Visual Simulation Tool for Performance Evaluation of MANET Routing Protocols

Md. Sabbir Rahman Sakib1, Nazmus Saquib1, and Al-Sakib Khan Pathan2

1 Department of Electrical and Electronic Engineering, 2 Department of Computer Science and Engineering, BRAC University, 66 Mohakhali, Dhaka 1212, Bangladesh
{srsakib,nsaquib,spathan}@bracu.ac.bd
Abstract. A new user-friendly visual simulation tool, ViSim, is presented. ViSim could be useful to researchers, students, and teachers in their work, and for the demonstration of various wireless network scenarios on the computer screen. It can make the task of simulation more engaging and enhance the interest of users without requiring them to work with a complex command-only text interface. Using our simulation tool, we measured the performance of several Mobile Ad-hoc Network (MANET) routing protocols. In this paper, we present a performance analysis of three prominent MANET routing protocols, DSDV, DSR, and AODV, using ViSim. The details of the various features of ViSim, brief descriptions of the selected routing protocols and their comparison, details of the performed experiments, and the obtained results are also presented.

Keywords: Comparison, Graphical, ns-2, Performance, Routing, Simulation.
1 Introduction

A Mobile Ad-hoc Network (MANET) is a collection of autonomous nodes that communicate with each other by forming a multi-hop network, maintaining connectivity in a decentralized manner [1], [2]. It consists of a set of mobile nodes communicating among themselves using wireless links, without the use of any centralized entity. Because of node mobility, any node may sometimes go out of the range of other nodes in the network. MANET routing is therefore difficult, since mobility causes frequent network topology changes and requires more robust and flexible mechanisms to search for and maintain routes. Because of these challenging features, MANET routing has been under tremendous scrutiny and interest from the time of its emergence; in fact, it is one of the most addressed topics in the field. Though many routing protocols have already been proposed and well accepted in the research community because of their promise and performance, there remains the need for a flexible, user-friendly simulation tool that can make the task of simulating and visualizing routing protocols easier. Many simulators can successfully simulate various MANET routing protocols; however, only a few tools handle the simulations with a user-friendly graphical interface. This fact motivated us to design and develop our tool, so that users can deal with complex simulation scenarios in a much easier way without getting involved in a command-only interface.
We have named our simulation tool 'ViSim'. ViSim can help a network administrator choose a particular ad-hoc routing protocol for a specific scenario by analyzing the graphs for different routing protocols. Alongside describing various aspects and features of ViSim, we also analyze the performance of the prominent MANET routing protocols using our tool. The rest of the paper is organized as follows: Section 2 describes ViSim, performance evaluations using our tool are presented in Section 3, Section 4 discusses some relevant works, and finally Section 5 concludes the paper.
2 ViSim: A Visual Simulation Tool

2.1 Building Blocks of ViSim

We have used two software packages in the Windows environment: ActiveTcl 8.3.5 and Microsoft Visual Basic (VB) 6.0. ActiveTcl is an industry-standard Tcl distribution, available for Windows, Linux, Mac OS X, Solaris, AIX and HP-UX. This software creates an environment in Windows to run ns-2 [3] simulations and .tcl scripts, and it is capable of executing the simulations faster than cygwin [4]. For details of ActiveTcl, readers are encouraged to visit http://www.activestate.com/activetcl/. Microsoft VB is a popular tool that we used for developing the ViSim prototype so that it can connect the simulation-related tasks with a user-friendly graphical interface.

2.2 Overview of ViSim

Our graphical simulation tool, ViSim, is built using Visual Basic 6.0 in order to make comparisons among various MANET routing protocols, since very few prototypes are available today for performing such a task, and most of the available tools are not particularly user-friendly. Keeping that in mind, we built ViSim in such a way that even a novice user can use the tool to visualize the background simulations done in ns-2 (run with the help of ActiveTcl in the Windows operating system). ViSim runs the associated .tcl files for all three mentioned protocols (DSDV [5], DSR [6], AODV [7]) and extracts the required information from the trace files that are generated. Eventually, graphs are plotted for different performance indicators such as throughput, goodput, and routing load. ViSim can make it easy for a network administrator to decide which routing protocol would be better for a particular MANET scenario.

2.3 Different Working Areas

Figure 1 (left) shows the ViSim prototype/tool when it is run in the Windows environment for the first time. The graphical interface has some working areas and functionalities that should be known before using it for the analysis of various parameters:
(a) Simulation Area: In this area, three routing protocols are listed. Clicking on the name of each protocol gives the options of simulating three network scenarios using that particular protocol.
(b) Comparison Area: This area has the options Throughput vs Time, Goodput (Packets), Routing Load (Packets), Goodput (Bytes), and Routing Load (Bytes). All
these buttons are used to select the parameters that the user needs for the performance analysis and comparison among the routing protocols.
(c) Scenarios and Protocols Area: This area offers the options of three network scenarios (radio buttons) and three routing protocols (tick boxes). It also has two buttons, 'Simulate' and 'Create Graph': the 'Simulate' button is used for playing the simulations and 'Create Graph' is used to plot the comparison graphs.
(d) Output Area: This is shown as a blank window area when ViSim is run for the first time. Based on the chosen options, the outputs or further options are shown in this area; the graphs are also plotted in this area.
Fig. 1. (Left) ViSim Interface, version 1.0 (in Windows XP). (right) DSR simulation options.
2.4 Functionalities of ViSim with Examples

Now, let us see the functionalities of ViSim with some practical examples. Suppose that we want to visualize a DSR simulation for a particular network scenario. For this task, we first click the DSR button in the simulation area. After clicking the DSR button, ViSim shows three more options (DSR Simulation 1, DSR Simulation 2, and DSR Simulation 3) in the output area, as shown in Figure 1 (right). Any one of these three options can be chosen; for our task, let us choose DSR Simulation 3. After clicking this button, ViSim calls ns-2 in the background, reads the .tcl file that specifies simulation scenario 3, and generates the .nam and .tr files. Once the .nam and .tr files are generated, ViSim calls the NAM (Network Animator) tool in the background and reads the generated .nam files. Consequently, it shows a display for visual simulation [see Figure 2 (left)]. On the NAM screen, a few buttons such as play, forward, backward, and stop are available to control the simulation, as is usual in a Linux-based environment with ns-2 and NAM. To see the visual simulation on the screen, the play button should be clicked. Like any other simulation using NAM, we can also change the step size of the simulation. Now, if we want to make comparisons among the three protocols for performance analysis, we have to choose a specific network scenario. In our case, let us select Scenario 1. Then we have to select the three mentioned protocols (or any two or
Fig. 2. (Left) The output after choosing DSR Simulation 3. (right) A sample output graph (Throughput vs Time) using network scenario 1, all three protocols are compared.
one) and, alongside, the performance indicators should be selected from the five options in the comparison area. Once the simulations have been performed by clicking the 'Simulate' button, we can use the results generated in the background for plotting comparison graphs. Basically, the 'Simulate' button facilitates performing the simulations with three protocols for a particular network scenario at the same time; this reduces the burden of doing the tasks repeatedly or selecting one protocol at a time in the simulation area. Once all the simulations are completed, the graph can be generated by clicking the 'Create Graph' button. Clicking 'Create Graph' sends the command to read the generated .tr (trace) files and extract the required information from them. These values are used to plot the graphs for different protocols for a specific scenario and for different performance indicators. Figure 2 (right) shows a sample output of what we have done so far (note that each simulation and graph plotting takes some time, as required by ns-2; in our case it took about 25 seconds to plot the graph). Let us briefly describe the working mechanism of the ViSim buttons. When the user selects the simulation option in order to view the simulation of a particular scenario for the selected ad hoc network protocol, ViSim calls up a .bat file which contains a shell script. This shell script calls ns-2 and feeds it the .tcl file(s) according to the choice of simulation. Then ns-2 generates a trace file (extension .tr) and a nam file (extension .nam). After that, NAM is called via the shell script, which feeds the .nam file into NAM; this gives a GUI (Graphical User Interface) popup through which a user can actually observe the simulation. Again, when the user selects the comparison option and clicks Create Graph after performing simulations, ViSim gathers the .tr files according to the choice of protocols, reads them, filters the data according to the performance indicators, and picks up the important information to generate the graph. For ViSim, we used some given network specifications; any specification can be modified in the .tcl files according to the requirements to simulate another network setting, and the various parameters used in the ViSim code can be given new values. A diagram of the operational flows of ViSim is presented in Figure 3.
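The simulate-then-plot flow just described can be approximated from the command line as follows. This is our re-creation of what the .bat/shell scripts do, assuming the ns-2 and NAM executables are on the PATH and that the scenario script writes out.tr and out.nam (the actual output names are set inside each .tcl file):

```python
import subprocess

def visualize(tcl_script):
    """Run ns-2 on a scenario script, then replay the animation in NAM."""
    subprocess.run(["ns", tcl_script], check=True)   # produces the .tr and .nam files
    subprocess.run(["nam", "out.nam"], check=True)   # opens the Network Animator GUI

visualize("a12.tcl")  # e.g. DSR, scenario 3 under the naming scheme of Section 2.5
```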
Fig. 3. Diagram of the operational flows of ViSim
2.5 ViSim File Organization and Simulation Scenario Modification

In this sub-section we briefly describe the file organization used in our tool. After installing ViSim, a particular folder appears in which several .tcl files are kept. Each .tcl file follows the general format [naming convention: a + ad hoc protocol (x) + scenario (y) + .tcl] = axy.tcl, where x represents the protocol and ranges from 0 to 2, and y represents the scenario and also ranges from 0 to 2. Every value of x or y is thus encoded in the file name; Table 1 gives a clear idea of the naming convention used.

Table 1. File Naming Convention ([row+column] values in the cells)
0 (AODV) 1 (DSR) 2 (DSDV)
0 (Scenario 1)
1 (Scenario 2)
2 (Scenario 3)
00
01
02
10
11
12
20
21
22
Therefore, if a user needs to change a file's specifications, say, for example, the 2nd scenario of the DSR routing protocol, he needs to open the a11.tcl file and make the necessary changes. The user's manual, available on the ViSim website (mentioned later), tells more about these facilities. We note this information here to indicate that ViSim can be used both as a simulation demonstration tool and as a simulator tool for MANET routing protocols.
3 Performance Evaluations and Results

Table 2 shows the specifications and parameters that we used for our experiments.

Table 2. Simulation Parameters

Simulation Parameter                  Value
Channel Type                          Wireless Channel
Radio-propagation model               Two Ray Ground Model
Network interface type                Wireless Physical
MAC type                              802_11b
Interface Queue Type                  Drop Tail Primary Queue
Antenna model                         Omni Directional
Number of Mobile nodes                3-10
Ad Hoc Routing Protocol               DSDV, DSR, AODV
Simulation Area                       500m x 400m
Simulation Time                       150 ms
Traffic Type                          TCP
Nodal speed                           3-10 m/s
Packet size                           1040 bytes (data packets), 40 bytes (acknowledgement packets), 60 bytes (routing packets)
Total Number of different Scenarios   15
To evaluate the performance of the routing protocols, we took three network scenarios: Scenario 1 with 3 nodes, and Scenarios 2 and 3 with 10 nodes each, with different mobility characteristics. Comparisons among the protocols were based on the aggregate of the performance metrics resulting from the simulations of 15 different scenarios, performed for each protocol separately. To measure performance, we used the following metrics:

Throughput. The total bytes received by the destination node per second (data packets and overhead).

Goodput (in terms of number of packets). The ratio of the total number of data packets sent from the source to the total number of packets transmitted within the network to reach the destination.

Goodput (in terms of packet size in bytes). The ratio of the total bytes of data sent from the source to the total bytes transmitted within the network to reach the destination.

Routing Load (in terms of number of packets). The ratio of the total number of routing packets sent within the network to the total number of packets transmitted within the network to reach the destination.

Routing Load (in terms of packet size in bytes). The ratio of the total bytes of routing packets sent within the network to the total bytes transmitted within the network to reach the destination.

To illustrate our experimental results using our tool, we first present all the outputs of the 15 different cases in Figure 4 [(a) to (f)], Figure 5 [(a) to (c)], Figure 6 [(a) to (c)], and Figure 7 [(a) to (c)].
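As an illustration of how these five metrics relate to the quantities extracted from the .tr files, the following Java sketch computes them from packet tallies. The class, its field names, and the assumption that the tallies have already been parsed out of a trace are ours, not part of ViSim.

```java
// Illustrative sketch of the metric definitions above, computed from
// packet tallies assumed to have been extracted from an ns-2 .tr trace
// beforehand (the trace-parsing step is omitted here, since field
// layouts differ between ns-2 trace formats).
public class Metrics {
    long dataPackets, routingPackets, totalPackets;   // packet counts
    long dataBytes, routingBytes, totalBytes;         // byte counts
    long bytesReceivedByDestination;                  // data + overhead
    double simulationSeconds;

    double throughput()         { return bytesReceivedByDestination / simulationSeconds; }
    double goodputPackets()     { return (double) dataPackets    / totalPackets; }
    double goodputBytes()       { return (double) dataBytes      / totalBytes; }
    double routingLoadPackets() { return (double) routingPackets / totalPackets; }
    double routingLoadBytes()   { return (double) routingBytes   / totalBytes; }
}
```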
Figure 8 (left) shows the aggregated result for 'Throughput vs Time', where we analyzed the total bytes received by the destination node per second (data packets and overhead). Based on these results, the following observations can be made: (a) AODV starts off quickly and its data rate is more stable; (b) DSR starts off quickly, but there are many fluctuations in the data rate; (c) DSDV takes time to start off, but its data rate has fewer fluctuations.
Fig. 4. Throughput vs Time: (a) Scenario 1 (b) Scenario 2 (c) Scenario 3 [y axis represents Throughput (KB) and x axis represents Time (seconds)]. Goodput (packets): (d) Scenario 1 (e) Scenario 2 (f) Scenario 3.
Fig. 5. Routing Load (packets) (a) Scenario 1 (b) Scenario 2 (c) Scenario 3
We calculated Goodput in terms of both the number of packets and the packet size in bytes. If we analyze the graph presented in Figure 8 (right), we can see that, on average, if 100 packets are transmitted in the network, 19 would be data packets for AODV, 16 for DSR, and 24 for DSDV. In terms of bytes, on average, if 100 bytes are transmitted through the network, 36 bytes would be data for AODV, 28 bytes for DSR, and 48 bytes for DSDV. From these data we can deduce that, although DSDV takes time to converge, it actually sends more data, both in number of packets and in bytes, than AODV and DSR. The remaining percentage in each graph represents the overhead, consisting of routing packets and acknowledgements.
Fig. 6. Goodput (bytes) (a) Scenario 1 (b) Scenario 2 (c) Scenario 3
Fig. 7. Routing Load (bytes) (a) Scenario 1 (b) Scenario 2 (c) Scenario 3
Fig. 8. (Left) Throughput vs Time. (Right) Goodput for the three MANET routing protocols.
We also calculated routing loads in terms of the number of packets and the packet size in bytes. The results are presented in Figure 9. We can see that, although DSR has better throughput, it actually carries more routing-packet overhead, whereas DSDV has a relatively lower routing load than AODV and DSR.
4 Related Works

Rosen et al. [8] describe SHOPMET, a simulation tool used to study the creation and optimization of propagation maps for Node State Routing protocols within wireless ad hoc networks. Although the authors tried to develop an all-in-one tool, it is narrowly focused on a specific type of routing protocol, which limits the scope of using it for other purposes. In [9], the authors present their interactive ns-2 protocol and environment confirmation tool (iNSpect). Although iNSpect seems on the surface to be an elegant tool for visual simulation, it is not user-friendly and requires much effort to make the code run properly. Besides these works, some good surveys on simulation tools for MANETs (and other wireless networks) can be found in [10] and [11]. There are also many works, such as [12], [13], [14], and [15], that have evaluated the performance of different MANET routing protocols; in our work, however, we used our own tool to simplify the task of running multiple simulations at a time, and we considered various scenarios for our simulations. Compared to the works discussed above, ViSim offers more user-friendliness, flexibility, and functionality, and eases the use of complex simulation scenarios. The main difference between ViSim and other simulation tools is that ViSim uses ns-2 simulations in the background but makes the tasks a lot easier for users. The plotting of graphs is also a valuable feature that was not included in most previously developed graphical simulation tools. Our tool is designed so that it can be used by all types of users: a naive user can use it for visualizing or demonstrating simulations, while an expert user can write his own tcl script, run it using ViSim, and then generate various performance comparison graphs without resorting to other cumbersome programming methods.
Fig. 9. Routing loads for different experimented MANET protocols
5 Conclusions and Future Works

In this paper, we have presented our user-friendly simulation tool/prototype, which eases the task of simulating MANET routing protocols even in Windows-based environments. Many users dealing with ns-2 simulations face trouble in setting up Linux or other systems and environments. The use of ActiveTcl with the graphical ViSim interface could be of real benefit to the research community in general. Using our simulation tool, we obtained different graphs and analyzed the results for different MANET routing scenarios. As future work, we would like to add more functionalities to ViSim, with easy access to the programming code and parameter changes for various network scenarios. For information about the official release of ViSim 1.0, readers are encouraged to visit: http://faculty.bracu.ac.bd/~spathan/research/visim.html

Acknowledgments. Special thanks to Taufiq Abdur Rahman and Sadia Hamid Kazi for their cooperation in this work.
References

1. Pathan, A.-S.K., Hong, C.S.: Routing in Mobile Ad Hoc Networks. In: Misra, S., Woungang, I., Misra, S.C. (eds.) Guide to Wireless Ad Hoc Networks, pp. 59–96. Springer, Heidelberg (2009)
2. Marti, S., Giuli, T.J., Lai, K., Baker, M.: Mitigating routing misbehavior in mobile ad hoc networks. In: Proceedings of ACM MOBICOM, Boston, MA, USA, pp. 255–265 (2000)
3. The Network Simulator - ns-2, http://www.isi.edu/nsnam/ns/
4. Cygwin, http://www.cygwin.com/
5. Perkins, C.E., Bhagwat, P.: Highly Dynamic Destination-Sequenced Distance-Vector Routing (DSDV) for Mobile Computers. In: Proceedings of ACM SIGCOMM, pp. 234–244 (1994)
6. Broch, J., Johnson, D.B., Maltz, D.A.: The Dynamic Source Routing Protocol for Mobile Ad Hoc Networks. IETF Draft (1999), http://tools.ietf.org/html/draft-ietf-manet-dsr-03
7. Perkins, C.E., Royer, E.M., Chakeres, I.D.: Ad Hoc On-Demand Distance Vector (AODV) Routing. IETF Draft (2003), http://tools.ietf.org/html/draft-perkins-manet-aodvbis-00
8. Rosen, S.L., Stine, J.A., Weiland, W.J.: A MANET Simulation Tool to Study Algorithms for Generating Propagation Maps. In: Proc. of the 2006 IEEE WSC, pp. 2219–2224 (2006)
9. Kurkowski, S., Camp, T., Mushell, N., Colagrosso, M.: A Visualization and Analysis Tool for Wireless Simulations: iNSpect. In: Proc. of 13th IEEE Int. Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 503–506 (2005)
10. Hogie, L., Bouvry, P., Guinand, F.: An Overview of MANETs Simulation. Electronic Notes in Theoretical Computer Science 150(1), 81–101 (2006)
11. Rahman, M.A., Pakštas, A., Wang, F.Z.: Network Modelling and Simulation Tools. Simulation Modelling Practice and Theory 17, 1011–1031 (2009)
12. Das, S.R., Castaneda, R., Yan, J., Sengupta, R.: Comparative Performance Evaluation of Routing Protocols for Mobile Ad hoc Networks. In: Proc. of 7th IEEE ICCCN, pp. 153–161 (1998)
13. Rahman, A.H.A., Zukarnain, Z.A.: Performance Comparison of AODV, DSDV and I-DSDV Routing Protocols in Mobile Ad Hoc Networks. European Journal of Scientific Research 31(4), 566–576 (2009)
14. Schmidt, R.d.O.: MANETs Routing Protocols Evaluation in a Scenario with High Mobility: MANET Routing Protocols Performance and Behavior. In: Proc. of IEEE Network Operations and Management Symposium (IEEE NOMS'08), pp. 883–886 (2008)
15. Boukerche, A.: Performance Evaluation of Routing Protocols for Ad Hoc Wireless Networks. In: Mobile Networks and Applications, vol. 9, pp. 333–342. Kluwer Ac. Publ., NL (2004)
A Web Service Composition Algorithm Based on Global QoS Optimizing with MOCACO Wang Li and He Yan-xiang Computer School, Wuhan University, Wuhan, P.R. China [email protected], [email protected]
Abstract. Web services composition has gained considerable momentum as a means to create and streamline B2B collaborations within and across organizational boundaries. This paper focuses on web services composition and provides a novel selection algorithm based on global QoS optimization and Multi-objective Chaos Ant Colony Optimization (MOCACO). First, the web services selection model with global QoS optimization is converted into a multi-objective optimization problem. Then, MOCACO is used to select the services and optimize QoS to satisfy the user constraints. During the optimization procedure, a random and ergodic chaos variable is used in the optimal search; this overcomes the low efficiency and the tendency to become trapped in local optima exhibited by the basic ant colony algorithm. The simulation shows that MOCACO is more efficient and effective than the Multi-objective Genetic Algorithm (MOGA) when applied to services composition.
1 Introduction

Web services are self-contained, modular applications that can be described, published, located, and accessed over a network using open standards. As web applications become more and more complex, many simple web services are combined to complete a complex task that meets practical needs; this is called web services composition. It has gained considerable momentum as a means to create and streamline B2B collaborations within and across organizational boundaries. In the composition procedure, in particular, we can obtain many services with the same function but different QoS. Hence, how to build up a systematic way to select web services is a very important issue in web services composition [1], and QoS-based web services composition has been widely adopted and deeply researched.

There are many studies on web services composition based on QoS. In [2], a predictive QoS model is presented to compute the QoS for workflows automatically from atomic task QoS attributes, and the multiple QoS constraints are transformed into a single-goal function by a weighted linear method. Paper [3] builds up a QoS registry and presents a dynamic QoS computation model for web services selection. Paper [4] gives an online web services composition method based on global QoS optimization, which combines the modified simplex method and a heuristic enumeration method to solve the multi-objective global optimization problem. Paper [5] proposes a global optimizing, multi-objective algorithm based on the Multi-objective Genetic Algorithm that simultaneously minimizes two objective functions under two constraints. However, most research on service selection for web service composition addresses only local QoS optimization or a single objective, and is therefore not well suited to the problem of web services selection with global, multi-objective QoS optimization.

The purpose of this paper is to focus on web services composition and provide a novel, automatic composition algorithm based on global QoS optimization and Multi-objective Chaos Ant Colony Optimization (MOCACO). First, the web services selection model with global QoS optimization is converted into a multi-objective optimization problem. Then, MOCACO is used to select the services and optimize the different QoS parameters to satisfy the user constraints. In the optimization procedure, a random and ergodic chaos variable is adopted for the optimal search; it overcomes the low efficiency and the tendency to become trapped in local optima exhibited by the basic ant colony algorithm.

The remainder of this paper is organized as follows: Section 2 describes the web service composition model based on multi-objective optimization. Section 3 presents the Multi-objective Chaos Ant Colony Optimization method that efficiently solves the multi-objective optimization. Section 4 presents the simulation results. Finally, conclusions are drawn in Section 5.
2 Web Service Composition Model Based on QoS Multi-objective Optimization

2.1 Web Service Composition Model

From the viewpoint of graph theory and set theory, a web service composition can be regarded as an oriented graph G = <V, E> with multiple inputs/outputs. Each web service is represented as a node, and together the nodes compose the vertex set V. The interactions between services are described by the edges of the graph, which together form the edge set E. In particular, the QoS of a web service can be moved from the node onto the edges, representing the cost of moving from one web service to another. The web service composition problem is thereby transformed into the combinatorial optimization problem of finding the most suitable directed acyclic path from the input to the output of the graph G that satisfies the global QoS constraint.

Assume there is a composite service CS consisting of m services, that is, CS = {WS_1, WS_2, ..., WS_m}. Every service WS_i has n_i candidate services, which compose the service group SG_i = {WS_{i,1}, WS_{i,2}, ..., WS_{i,n_i}}; the resulting oriented graph G is shown in Fig. 1, where S represents the start and T represents the destination. The weights of the connections from one web service to another represent the QoS of the selected services, so the composite service can be abstracted as a weighted oriented graph G = <V, E, QoS>, in which V = <S, V_1, V_2, ..., T> represents the set of web services. The problem of composing services is thus equivalent to finding an optimized path from S to T in this weighted oriented graph.

Fig. 1. An oriented graph of composite service
2.2 Multi-objective Optimization for QoS

Generally, the universal QoS attributes of web services include several important parts: cost, time, network, dependability, and so on. Intuitively, a web service should provide its service with lower cost, less time, high dependability, and a high-bandwidth network. Based on the oriented graph G = <V, E, QoS>, we can set P = <S, T> as a path from the source node S to the destination node T. The following parameters can then be defined for the path P:
\mathrm{Cost}(P) = \sum_{m \in P} \mathrm{Cost}(m)    (1)

\mathrm{Time}(P) = \sum_{m \in P} \mathrm{Time}(m)    (2)

\mathrm{Reliability}(P) = \prod_{m \in P} \mathrm{Reliability}(m)    (3)
Based on the above definitions, the web services composition problem with QoS may be stated as a MOP [6] that tries to find an optimal/near-optimal service flow path P = <S, T> that simultaneously minimizes the cost and the time, maximizes the reliability, and satisfies the user QoS restrictions. Consider two solution paths P and P' from the source node S to the destination node T, with objective vectors

x = [Cost(P), Time(P), 1/Reliability(P)],
z = [Cost(P'), Time(P'), 1/Reliability(P')].
Exactly one of the following three conditions holds:

1) x ≺ z (x dominates z), if x_i ≤ z_i ∧ x_i ≠ z_i, ∀i ∈ {1, 2, 3};
2) z ≺ x (z dominates x), if z_i ≤ x_i ∧ z_i ≠ x_i, ∀i ∈ {1, 2, 3};
3) x ∼ z (x and z are non-comparable), if neither x dominates z nor z dominates x.
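As an illustration, the dominance test could be written as follows. This is a hedged sketch, not from the paper; note that the condition as stated above requires strict improvement in every component, which is stronger than the usual Pareto dominance.

```java
// Illustrative dominance test for objective vectors
// [Cost, Time, 1/Reliability] (all three to be minimized). As stated in
// the text, x dominates z when every component of x is strictly smaller.
public class Dominance {
    static boolean dominates(double[] x, double[] z) {
        for (int i = 0; i < x.length; i++) {
            if (x[i] >= z[i]) return false;  // violates x_i <= z_i and x_i != z_i
        }
        return true;
    }

    // x ~ z (non-comparable) when neither dominates the other.
    static boolean nonComparable(double[] x, double[] z) {
        return !dominates(x, z) && !dominates(z, x);
    }
}
```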
A decision vector P is non-dominated with respect to a set Q if no P' ∈ Q dominates P. The non-functional requirements for the concrete service flow are R_{CW·QoS} = [C, T, 1/R], where C, T, and R denote the global cost, global time, and global reliability constraints. When P satisfies R_{CW·QoS} and is non-dominated with respect to the whole domain of feasible solutions X_f, it is called a Pareto optimal solution; therefore, the Pareto optimal set P_true may be formally defined as:

P_true = { P ∈ X_f | P is non-dominated with respect to X_f }.

The corresponding set of objective vectors PF_true = f(P_true) constitutes the Optimal Pareto Front. This MOP can be solved by the Multi-objective Chaos Ant Colony Optimization algorithm.
3 Multi-objective Chaos Ant Colony Optimization for Web Service Composition

The Multi-objective Chaos Ant Colony Optimization (MOCACO) algorithm is based on the Multi-objective Ant Colony Optimization algorithm (MOACO) [6] and the chaos operator, bringing their advantages together. During the ant colony optimization procedure, the randomness, ergodicity, and initial sensitivity of the chaos operator are exploited by linearly mapping the chaos variable onto the domain of the optimization variable. This prevents the search from settling into a local optimum and makes up for the shortcomings of the basic ant algorithm; as a result, it improves the diversity and the global optimization ability of the algorithm.

3.1 MOCACO Optimization Procedure

In the MOCACO algorithm for the web services composition problem, a colony of ants constructs m solutions P at every generation. The detailed procedure is as follows [7,8]:

1) Initialize the source node S, the destination node T, N_r, and ϕ.
2) Initialize the pheromone matrix τ_ij with τ_ij(0) = C (C a constant).
3) For every ant, construct a solution P:
3.1) Set tabuList = ∅ and P = ∅;
3.2) Let N_i be the set of nodes in the neighborhood of node i that the ant has not visited yet. For all nodes in N_i, compute the selection probability

p_{ij} = \frac{[\tau_{ij}]^{\alpha} [\eta_{ij}]^{\beta}}{\sum_{s \in N_i} [\tau_{is}]^{\alpha} [\eta_{is}]^{\beta}}    (4)

and set P = P ∪ {j} and tabuList = tabuList ∪ {j}. In (4), η_ij is the heuristic value of moving from node i to node j, which we define as
\eta_{ij} = \frac{1}{\mathrm{Cost}^2(j) + \mathrm{Time}^2(j) + (1/\mathrm{Reliability}(j))^2}.

3.3) If |P| < n, go to 3.1); else go to 4).
4) Update the known Pareto front [9]: P_know = (P_know ∪ {P}) − {P_y ∈ P_know | P dominates P_y}, i.e., add P and discard any member of P_know that P dominates.
5) Update the pheromone matrix τ_ij:

\tau_{ij}(t+1) = \rho \cdot \tau_{ij}(t) + \sum_{k=1}^{m} \Delta\tau_{ij}^{k}    (5)

where ρ ∈ (0, 1] and Δτ_ij^k represents the pheromone that ant k left on the route in this iteration.
6) If not converged, go to 3); otherwise the optimized P is obtained.

3.2 Chaos Pheromone Update

At the start of the optimization procedure the pheromone matrix is uniform, τ_ij(0) = C (C a constant), so the pheromone strength on every route is equal and every route is equally likely to be chosen. This makes it hard for the ant colony to find an optimal route, and the convergence of the algorithm is slow. We therefore adopt a chaos operator for the pheromone matrix τ_ij; the randomness and ergodicity of chaos increase the efficiency of the search. The chaos variable generated by the Logistic mapping is:
\lambda_{i+1} = \mu \cdot \lambda_i \cdot (1 - \lambda_i)    (6)
In (6), i = 0, 1, 2, ..., and μ is the control parameter, with domain (2, 4]. When μ = 4, the Logistic mapping is a full mapping onto (0, 1), which is in a totally chaotic state. The iteration creates an ergodic chaotic sequence that can be used for optimization over the search space. After using the chaos operator and the chaos variable to update the pheromone trail strength, formula (5) becomes:
\tau_{ij}(t+1) = \rho \cdot \tau_{ij}(t) + \sum_{k=1}^{m} \Delta\tau_{ij}^{k} + A\lambda_i    (7)
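To make the procedure concrete, the following Java sketch illustrates the three computational building blocks of MOCACO: the transition probability of (4), the Logistic mapping of (6), and the chaotic pheromone update of (7). It is only an illustrative sketch under stated assumptions: the array-based graph representation, the heuristic values η, the per-ant deposits Δτ, and the value of the coefficient A (which the paper leaves unspecified) are placeholders, not the authors' implementation.

```java
import java.util.Random;

// Illustrative sketch of the MOCACO building blocks, following (4), (6), (7).
// The heuristic values eta and the per-ant deposits deltaTau are assumed to
// be computed elsewhere; A is the (unspecified) chaotic scaling coefficient.
public class Mocaco {
    static final double ALPHA = 1.0, BETA = 1.0, RHO = 0.3, MU = 4.0, A = 0.1;
    static final Random RNG = new Random();

    // Formula (4): roulette-wheel selection of the next node j among the
    // unvisited neighbours, weighted by tau^alpha * eta^beta.
    static int chooseNext(double[] tau, double[] eta, boolean[] unvisited) {
        double total = 0;
        double[] w = new double[tau.length];
        for (int j = 0; j < tau.length; j++) {
            if (!unvisited[j]) continue;
            w[j] = Math.pow(tau[j], ALPHA) * Math.pow(eta[j], BETA);
            total += w[j];
        }
        if (total == 0) return -1;  // no unvisited neighbour
        double r = RNG.nextDouble() * total;
        for (int j = 0; j < w.length; j++) {
            if (unvisited[j] && (r -= w[j]) <= 0) return j;
        }
        return -1;
    }

    // Formula (6): one step of the Logistic mapping generating the chaos variable.
    static double logistic(double lambda) {
        return MU * lambda * (1 - lambda);
    }

    // Formula (7): evaporate, add the deposits of all m ants, then add the
    // chaotic perturbation A * lambda_i.
    static double updatePheromone(double tau, double[] deltaTau, double lambda) {
        double sum = 0;
        for (double d : deltaTau) sum += d;
        return RHO * tau + sum + A * lambda;
    }
}
```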
4 Simulation

In this section we present web service composition simulation results based on the MOCACO algorithm of this paper; results based on Multi-objective Ant Colony Optimization (MOACO) [8] and the Multi-objective Genetic Algorithm (MOGA) [6] are also given, in order to compare the solution quality and performance of all three algorithms. In this simulation, three web service groups are adopted, with details as presented in Table 1: the composed web service combines m web services, each of which has n candidate services.

Table 1. Web service group for composition test

Service group    m     n
Group 1          10    5
Group 2          20    10
Group 3          20    20

All three algorithms, MOCACO, MOACO, and MOGA, were implemented on a computer with an Intel Pentium Dual 2.40 GHz processor, 2 GB RAM, and the Windows XP Professional operating system; the compiler used was Java. For each group, the test was run twice, with 100 and 200 iterations. The services' QoS values were drawn randomly from a Gaussian distribution. The user's constraint for the composed service flow is ϕ = [C, T, R] = [1000, 10, 0.5]. In each ant colony for MOCACO and MOACO, the number of ants is set to 20, and the algorithm parameters are α = 1.0, β = 1.0, ρ = 0.3.

Table 2 gives the number of solutions found for every web service group. From the results in Table 2, we find that the MOCACO algorithm demonstrates better performance than the MOACO and MOGA algorithms, achieving more optimal solutions P than both of the others, whether the group's number of web services is large or small. Furthermore, the running times of the three algorithms were also compared in this simulation; the results are shown in Fig. 2. It can be seen that the time convergence of the MOCACO algorithm is better than that of the others.

Table 2. Number of optimal solutions for each test group
Service group    Iterations    MOCACO    MOACO    MOGA
Group 1          100           14        12       9
Group 1          200           15        13       10
Group 2          100           24        21       17
Group 2          200           26        22       19
Group 3          100           36        33       31
Group 3          200           39        37       34

Fig. 2. Running time comparison result [y axis: Running Time (s); x axis: Service Group; series: MOCACO, MOACO, and MOGA, each at 100 and 200 iterations]
5 Conclusion

This paper has introduced a novel, automatic composition algorithm based on global QoS optimization and Multi-objective Chaos Ant Colony Optimization (MOCACO) for the web services composition problem. The algorithm brings together the advantages of multi-objective ant colony optimization and chaos operators. With the random, ergodic chaos variable, the ant colony optimization's problems of low efficiency and of becoming trapped in local optima are overcome; moreover, the blindness of the chaos search is reduced, and its efficiency improved, by the positive feedback of the multi-objective ant colony algorithm. The simulation presented in the paper shows that MOCACO finds more optimal solutions than the MOACO and MOGA algorithms, and that MOCACO is both feasible and efficient. As future work, we will focus on adaptive parameter adjustment in this algorithm and compare the performance of different chaos mapping functions within it.
Acknowledgement This paper is supported by the National High-Tech Research Development Program of China (863 program) under Grant No.2007AA01Z138 and the China Postdoctoral Science Foundation under grant No. 20090460978.
References

1. Kleijnen, S., Raju, S.: An Open Web Services Architecture, pp. 38–46. ACM Press, New York (2003)
2. Jorge, C., Amit, S., John, M.: Quality of Service for Workflows and Web Service Processes. Journal of Web Semantics 1(3), 281–338 (2004)
3. Liu, Y.T., Anne, H.H., Zeng, L.Z.: QoS Computation and Policing in Dynamic Web Services Selection. In: Proc. WWW 2004, pp. 66–73. ACM, New York (2004)
4. Wan, L., Gao, C., Xiao, W., Su, L. (eds.): Global Optimization Method of Web Services Composition Based on QoS. Computer Engineering and Applications, vol. 24 (2007)
5. Liu, S., Liu, Y., Jing, N., Tang, G., Tang, Y.: A Dynamic Web Service Selection Strategy with QoS Global Optimization Based on Multi-objective Genetic Algorithm. In: Zhuge, H., Fox, G.C. (eds.) GCC 2005. LNCS, vol. 3795, pp. 84–89. Springer, Heidelberg (2005)
6. Schaerer, M., Barán, B.: A Multi-objective Ant Colony System for Vehicle Routing Problem with Time Windows. In: Proc. IASTED International Conference on Applied Informatics, Innsbruck (2003)
7. Yang, H., Wang, H., Hou, L., Sun, X. (eds.): Application of Chaos Ant Colony Optimization in the Intelligent Transportation System and Its Algorithm. Journal of Chengdu University (Natural Science Edition) 4 (2007)
8. Qiqing, F., Xiaoming, P., Qinghua, L., Yahui, H.: A Global QoS Optimizing Web Services Selection Algorithm Based on MOACO for Dynamic Web Service Composition. In: 2009 International Forum on Information Technology and Applications, pp. 37–42 (2009)
9. Van Veldhuizen, D.A.: Multiobjective Evolutionary Algorithms: Classifications, Analyses and New Innovations. Ph.D. thesis, Air Force Institute of Technology (1999)
Experiences Gained from Building a Services-Based Distributed Operating System Andrzej Goscinski and Michael Hobbs School of Information Technology, Deakin University Waurn Ponds, Victoria, 3217, Australia {ang,mick}@deakin.edu.au
Abstract. The goal of this paper is to present the experiences gained over 15 years of research into the design and development of a services-based distributed operating system. The lessons learnt over this period, we hope, will be of value to researchers involved in the design and development of operating systems that wish to harness the collective resources of ever-expanding distributed systems. Keywords: Distributed Operating System Design, Service Oriented Computing.
1 Introduction

Research on a distributed operating system based on a set of services was begun by the first author in 1985, following a study of two major sets of works by Cheriton [1] and Tanenbaum [2]. The study confirmed that the author's idea of a service-based approach to building operating systems was sound. It was also a trigger to research the logical design of distributed operating systems [3], in which interprocess communication is the major platform for the cooperating modules of an operating system. This study formed the inception of a proof-of-concept project called RHODOS (Research Oriented Distributed Operating System) [4], in which the idea of a service-based operating system supported by a small microkernel (initially called a nucleus) was demonstrated. The project reached its research orbit when Dr Hobbs teamed up with Dr Goscinski. The authors received a strong positive kick when Tanenbaum presented his support for a microkernel-based architecture of operating systems in comp.os.minix in January 1992 [5]. That was the beginning of our journey into the research of service-based distributed operating systems.

Traditional operating system research has its heritage in monolithic architectures such as UNIX [6] and Linux [7], where all functionality and services are provided within a protected, consolidated set of kernel code. A major concern with this approach is the difficulty of building such systems and of adding functionality, due to the tight coupling of the kernel code, which can allow errors in one component of the kernel to affect other, non-faulty components (a situation that is difficult to debug). An alternative approach is to provide the core functionality or services of an operating system as processes that are supported by a minimalist kernel layer, termed a microkernel architecture. Examples of microkernel systems include Mach [8], QNX [9] and L4 [10]. The MIT exokernel [11] project took the microkernel paradigm further by also reducing the number of interfaces that the operating system supported, allowing users to define the services that were required. There are also a number of projects that looked at designing and building an operating system that harnesses the collective resources of a distributed system; these include Sprite [12], Plan9 [13], Amoeba [14], 2K [15] and SPIN [16]. Other distributed operating systems addressed usability by making the complete distributed system appear as a single, large computer; these include Mosix [17] and Kerrighed [18].

The goal of this report is to describe the authors' experiences in designing, developing and implementing a distributed operating system based on service-oriented principles. These experiences are drawn from over 15 years of research and development of distributed operating systems and will hopefully provide readers interested in developing future systems an insight into both the successes and the problems we encountered.

This report is structured as follows. In Section 2, the scope of our project is presented. Section 3 discusses our view of services and their relationship to the resources managed by an operating system. Section 4 presents the resources managed by distributed operating systems through services. The often conflicting expectations of users of operating systems and the key responsibilities of operating systems are highlighted in Section 5. A description of how the computational resources of a distributed system are harnessed and exposed for parallel processing is presented in Section 6. The broader role of the user interface and execution environment of a services-based distributed operating system is given in Section 7. Finally, Section 8 provides a summary of the important outcomes and experiences.
2 Project Scope

In the 1980s one could identify two major research and development streams in the area of operating systems. One was very much user oriented and demonstrated itself in the development and perfecting of menu-driven window environments. The other stream concentrated on issues such as architecture, development, maintenance and execution. We concentrated our effort on the latter, although we also devoted some projects to command-driven and window-based interfaces.

Computing systems in general and operating systems in particular depend on virtualization; if there is no virtualization there is no computing. Operating systems and their architectures reflect the need for, and application of, virtualization in a natural manner. We decided to provide virtualization by employing the service computing paradigm, identified some basic features of a distributed operating system, and carried out our research toward their development. These features were as follows:

1. Modularity: the variety of areas we envisioned, and the problems encountered with Unix (e.g., modification, debugging), required avoiding a monolithic architecture. This was one of the strongest factors that influenced our research and led us to modularity based on service orientation.
2. Efficient / Adaptive / Flexible Communications: interprocess communication in distributed systems depends on the interconnecting network, including network protocols, and on the design and implementation of communication primitives.
3. Transparency: any real distributed system should be designed and implemented in such a way that the distribution of all resources is hidden.
4. High Performance: any operating system, centralized or distributed, should manage resources in such a manner that the highest possible performance is offered (measured using response time, throughput, and system utilization).
5. Dynamic Instantiation: an operating system should be able to automatically recognize and use new services provided by new resources.
6. Single System Image (SSI): its provision means that the application developer sees the whole distributed system as a single powerful and reliable computer.
7. Distribution and Parallelism Execution Management: an operating system should manage all resources automatically and transparently toward high-performance / high-throughput execution without the direct involvement of programmers.

Our involvement in the provision of these features clearly demonstrated that transparency [3] is a subset of SSI [19]. The following dimensions were considered to offer SSI: transparency – distribution should be hidden; availability – achievable if a virtual machine can be established automatically and dynamically, thus needing a resource discovery service that can identify computers and peripheral devices automatically and record their presence, load state and fault events; fault tolerance – as the scale of the distributed system increases, the probability that components will fail also increases, thus fault tolerance mechanisms of replication, checkpointing and recovery should be employed; communication paradigms – both message passing and (distributed) shared memory methods should be provided.

Many of these features can also be generalized by the Autonomic Computing characteristics [20]. This implies that there is a need not only for high performance but also for ease of programming and use, reliability, and availability through proper reaction to unpredictable changes and transparency; this can be achieved through the provision of autonomic computing characteristics. According to Horn [20], an autonomic computing system possesses at least the following characteristics: it knows itself; configures and reconfigures itself under varying and unpredictable conditions; optimizes its working; performs something akin to healing; provides self-protection; knows its surrounding environment; exists in an open (non-hermetic) environment; and anticipates the optimized resources needed while keeping its complexity hidden. We addressed these characteristics in the Helos project [21].
3 Services vs. Resources

The management and execution of user processes requires all resources of a computer system, physical and logical, to be exposed. We proposed a distributed operating system built as a set of cooperating services able to expose these resources. These services (for transparency reasons) communicate using messages, both within a single computer and among remote computers. We proposed the provision of well-defined interfaces to invoke a service. This implies that messages must be designed with standardization in mind. For this purpose it was proposed that messages have a distinct format that reflects the way a service consumes them: the header, which is consumed first, contains information that indicates what to do next.
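A minimal sketch of such a standardized message, with a header that is consumed first, might look as follows; all field names and types are illustrative assumptions, not the actual RHODOS/GENESIS message layout.

```java
// Illustrative sketch of a standardized service message: a fixed header,
// consumed first, that tells the receiving service what to do with the
// body that follows. Field names are assumptions for illustration.
public final class Message {
    // Header, consumed first.
    final int  operationCode;   // indicates what to do next
    final long sourcePid;       // requesting process
    final long destinationPid;  // target service process
    final int  bodyLength;

    // Body: operation-specific payload.
    final byte[] body;

    Message(int op, long src, long dst, byte[] body) {
        this.operationCode = op;
        this.sourcePid = src;
        this.destinationPid = dst;
        this.bodyLength = body.length;
        this.body = body;
    }
}
```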
A general model of the relationship between a resource, a service exposing this resource, and a client invoking the service is shown in Fig. 1.

Fig. 1. General model of a client invoking a service that exposes a resource (1. client request; 2. load / save state; 3. operations on the resource)
Initially, we believed that a simple client-server model would satisfy our modularity requirements. We realized early that some operations (e.g., process migration, local and remote process creation) should, to achieve high performance, be performed concurrently, leading to a more general service-oriented approach. A service in a distributed operating system may, to complete a client's request, need to invoke additional services, which could be on the same computer or on remote computers. Implemented correctly, this model of cooperating, distributed services enables a level of transparency in which the user is unaware of where services (and resources) are located. This provides the foundation on which many more desirable features can be built, such as an SSI and the provision of autonomic principles.
4 Services Exposing Resources

The above analysis implies a need for the following basic system services that create virtual resources to hide and expose the major system resources and manage them: the Process Service, Space Service, File Service, and Driver Services for I/O devices.

One of the most crucial elements of decision making is the provision and exposure of resource state. There are two different kinds of state data: internal data structures and shared data structures. A process is an example of a resource that is maintained by the Process Service and, due to its virtual nature, requires its state to be shared. Conventionally known as a Process Control Block (PCB), this state contains information that relates to both the process and resources such as space, I/O devices, and files. A structure of the PCB was proposed such that it could be shared with the related services, i.e., the Space Service, I/O Service, and File Service, respectively. This also follows the modularity requirement and allows the associated services to work on the PCB concurrently. The modularity requirement led us to the concept of a nucleus, later renamed microkernel for uniformity with other projects. The following data structures were proposed to be stored in the microkernel: clock data; page maps for process address space population and simple page fault handling; interrupt and exception tables; and processor registers for context switching, in order to support local IPC, deal with interrupts, provide context switching, and perform basic memory (page) operations. The microkernel's architecture was proposed to support portability; this implied a need for a small hardware-dependent component and a hardware-independent component, made separate by a well-defined interface.
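The proposed partitioning of the PCB might be sketched as follows; the fields shown are illustrative assumptions derived from the description above, not the actual RHODOS/GENESIS PCB layout.

```java
// Illustrative sketch of a Process Control Block partitioned so that the
// Process, Space, I/O and File Services can each work on their own part
// concurrently, as proposed above. Field types are assumptions only.
public class ProcessControlBlock {
    // Part maintained by the Process Service.
    long pid;
    int  state;               // e.g., ready / running / blocked
    long[] registerContext;   // saved processor registers

    // Part shared with the Space Service.
    long[] pageMap;           // address space description

    // Part shared with the I/O (Driver) Services.
    int[] openDevices;

    // Part shared with the File Service.
    int[] openFiles;
}
```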
The microkernel and basic system services form a basic virtual computer system able to execute system and user processes. All these basic services were designed to run in user space, as privileged user processes. These processes communicate among themselves using messages by invoking the required microkernel IPC primitives, send and receive. If the destination process executes on the source computer, local message delivery is performed; "locality" was proposed to be resolved by storing in the microkernel information regarding locally executing processes. Communication with remote processes was proposed to be supported by an Interprocess Communication (IPC) Service. This service, invoked by a message from the microkernel, communicates over the network with a peer IPC Service running on the computer of the remote destination process (Fig. 2). This solution provided full transparency and very low communication overhead.
Fig. 2. IPC component interaction supporting transparent communication
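The transparent send() path of Fig. 2 can be sketched as follows; the class and method names are illustrative assumptions, not the actual RHODOS/GENESIS interfaces.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the send() path of Fig. 2: the microkernel
// delivers locally when the destination is a locally executing process,
// and otherwise hands the message to the user-space IPC Service, which
// forwards it over the network to its peer on the remote computer.
public class Microkernel {
    // The microkernel stores information about locally executing
    // processes, which is how "locality" is resolved.
    private final Map<Long, Object> localProcesses = new HashMap<>();
    private final IpcService ipcService;

    Microkernel(IpcService ipcService) { this.ipcService = ipcService; }

    void send(long destinationPid, byte[] message) {
        if (localProcesses.containsKey(destinationPid)) {
            deliverLocally(destinationPid, message);                 // same computer
        } else {
            ipcService.forwardToRemotePeer(destinationPid, message); // remote computer
        }
    }

    private void deliverLocally(long pid, byte[] message) {
        // Enqueue the message on the local receiver (omitted).
    }
}

// Privileged user-space process that talks to its peer IPC Service on the
// computer of the remote destination process, via the Network Service.
interface IpcService {
    void forwardToRemotePeer(long destinationPid, byte[] message);
}
```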
The IPC Service was proposed to be responsible for providing both an interface to the Network Service and information to support group communication. Thus, the IPC Service internally stores information about the most frequently accessed remote processes and communication groups.
5 Satisfying User Requirements and Resource Utilization

Two sets of requirements, those of users and those of compute service providers, were addressed from the high-performance and fault-tolerance perspectives.

Global Scheduling Service. To achieve high-performance execution of user applications there is a need for another service, a Global Scheduling Service [22, 23]. This service is responsible for the best possible allocation of compute resources to user processes, to provide high execution performance and maintain an efficient (balanced) utilization of the computer systems' resources. We proposed that the Global Scheduling Service provide placement based on both Static Allocation – at the beginning of running a distributed application, and later when the computational load does not change frequently – and Dynamic Load Balancing – used when the load changes frequently. This service could be provided centrally, or as a set of cooperating distributed services for a large distributed system.
The quality of decisions made by the Global Scheduling Service depends directly on the validity of its input data. To address this problem, a service called the Resource Discovery Service [24] was proposed. This service, located on all computers of a distributed computer system, is responsible for collecting both (i) static parameters – such as the number of processors and their characteristics, main memory size, disk size and I/O bandwidth – and (ii) dynamic parameters – such as current processor load, available memory, and the communication pattern and volume of each executing process. The execution of the decisions made by the Global Scheduling Service was proposed to be performed by additional services such as the Local and Remote Process Creation Service, the Local and Remote Process Duplication Service, and the Process Migration Service (discussed in Section 6).

Process Migration Service. The Process Migration Service is a clear demonstration of the advantages of a services-based architecture. It was envisioned that this service would only coordinate process migration, and carry it out in a transactional fashion, releasing resources on the source computer only once the process had been deployed on the destination computer.
Fig. 3. Service Cooperation to provide Process Migration
In migrating a process there is a need to transfer its state, address space memory, buffers, communication ports and open file information (Fig. 3). These components are stored in the process' PCB. The Process Migration Service invokes the relevant services, i.e., the Process, Space, I/O, and File Services, to transfer their respective resources of the process and confirm completion of their operations [25]. The Process Migration Service plays a critical role in the provision of the global scheduling service. It also confirms that, through a service-oriented approach, complex operating system functions can be implemented transparently using sets of cooperating individual services.

Checkpointing and Fault-Tolerance Service. A failure of a computer or a process could lead to great computational losses, as the process or the whole distributed application must be restarted from scratch. This implies a need for a service that provides fault tolerance. For this purpose we proposed to employ two services, supported by the Resource Discovery Service: the Checkpointing Service – which records the state of a process (and of all processes of a distributed application) during fault-free execution of the process; and the Recovery Service – which restores a failed process (and all processes of a distributed application) based on the last checkpointed state. Copies of both services are provided on each computer of a distributed computer system [26, 27].

Dynamic Provision of Services. All the services to satisfy user requirements and offer good resource utilization were designed in such a manner that they could be deployed dynamically [24]. For this purpose we proposed that: (i) a resource that is not exposed through a service is not accessible by other services (it is not part of the resource pool); and (ii) services can be dynamically stopped and restarted (potentially at a different location). Basic system services, however, were designed to be deployed as replacements: we ensured that a fundamental service (such as the Process Service) cannot be shut down, but that it is possible to 'hand over' to another instance of the service. Of course, all these services were proposed and designed to run in user space.
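The transactional coordination described for the Process Migration Service might be sketched as follows; the interfaces are illustrative assumptions that capture the "release the source only after the destination confirms" rule, not the actual implementation [25].

```java
// Illustrative sketch of transaction-style migration coordination: the
// Migration Service asks each related service (Process, Space, I/O, File)
// to transfer its part of the process, and source resources are released
// only after every part has been confirmed at the destination.
public class MigrationService {
    interface PartTransfer {
        boolean transferTo(long pid, String destinationHost);  // true on confirmed completion
        void releaseAtSource(long pid);
    }

    void migrate(long pid, String destinationHost, PartTransfer... services) {
        // Phase 1: transfer every part; abort on any failure, releasing
        // nothing, so the process keeps running at the source.
        for (PartTransfer s : services) {
            if (!s.transferTo(pid, destinationHost)) {
                return;
            }
        }
        // Phase 2 (commit): every part is deployed at the destination,
        // so the source resources may now be released.
        for (PartTransfer s : services) {
            s.releaseAtSource(pid);
        }
    }
}
```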
6 Services Supporting Parallel Computing

Following an analysis of parallel applications and of the processes of a parallel application executing on a cluster, we decided to introduce services embraced by a facility called the Parallelism Management Facility [28], including the following services:

• Local and Remote Group Process Creation Service – creates a group of new processes based on code supplied by the Process Service, on computers determined by the Global Scheduling Service;
• Local and Remote Group Process Duplication Service – duplicates a specified process on a set of local and/or remote computers determined by the Global Scheduling Service;
• Process Migration Service – migrates a group of processes to remote locations specified by the Global Scheduling Service;
• Distributed Shared Memory (DSM) Service – provides an environment that supports the shared memory programming paradigm; and
• System Discovery Service – responsible for forming a virtual cluster out of available (lightly loaded) computers to support individual parallel applications.

In the design, the System Discovery Service [24] provided aggregated information, based on data supplied by the Resource Discovery Services of the whole cluster, directly to the Global Scheduling Service.
7 Programming and Execution Environment

We decided that, for the study of distributed operating systems and the execution of distributed and parallel applications, there is a minimum need for a user interface and execution environment. A decision was made to provide a shell that offers standard I/O redirection. This allowed us to execute cross-compiled programs written in the C language. At one stage a POSIX interface was created to make it possible to execute user programs/applications written for Unix-like operating systems.

Users wish to have an easy and commonly used environment in which to execute their programs/applications. In the majority of execution environments, application developers did not have the opportunity to choose between the message passing (MP) and DSM communication paradigms; these paradigms, and the systems supporting them, were developed independently of an operating system, as separate services, rather than as integral parts of it. Two parallelization environments were considered in the area of message passing: at an early stage of the project a PVM environment was developed [29], and later an MPI environment was added [30]. As DSM addresses shared memory, an enhancement to the Space Service was made to provide shared memory in a distributed system [31]. The proposed DSM environment required a copy of the enhanced Space Service to be deployed on each computer of the distributed system.
8 Concluding Remarks

Our study of the design and implementation of service-based distributed operating systems has spanned over 15 years, during which two key versions of the project have been developed: RHODOS [4] and, following it, GENESIS [32]. RHODOS focused more on the microkernel architecture, the overall service-based system design, and the implementation issues of a distributed operating system. Outcomes of this stage of the project included a demonstration that a microkernel-based architecture can be used to support a distributed operating system, where system-level services such as process, memory and I/O management are supported by transparent, message-passing-based IPC mechanisms [4, 33, 34]. These outcomes verified that the set of features identified in Section 2 was valid and that common operating system functions could be supported by exposing them as services. GENESIS, also designed and implemented based on services and a microkernel, placed greater emphasis on higher-level operating system functions, including global scheduling, process migration, remote process management, fault tolerance and recovery, distributed shared memory, parallelism management, resource discovery and transparent remote resource use. These outcomes verified that complex autonomic characteristics are achievable through inter-service cooperation.

Over the lifetime of this project we have learned that service orientation forms an excellent basis for the design and implementation of distributed operating systems, although care needs to be taken to ensure that the performance of the system is maintained. Such systems can be easily configured, modified and debugged. Original solutions have also been achieved in the areas of parallel execution on non-dedicated clusters, the provision of SSI and autonomic services, and reliability and usability features supporting distributed systems. The two implemented systems demonstrated that service orientation allows for the incremental development of complex systems, supports changes in research directions, and achieves very good outcomes. We can confirm that research into operating systems is a very rewarding endeavor which, in our case, supported the completion of 10 PhD projects and contributed to knowledge in the areas of operating system design, system services and service-oriented computing. The challenges and overheads of such a large project are high, in particular because PhD students have to complete their projects and leave; this is especially so when undertaking the project from the ground up, as we did.
References

1. Cheriton, D.R.: The V Kernel: A Software Base for Distributed Systems. IEEE Software 1, 19–42 (1984)
2. Tanenbaum, A.S., Van Renesse, R.: Distributed operating systems. ACM Computer Surveys 17, 419–470 (1985)
3. Goscinski, A.: Distributed operating systems, the logical design. Addison-Wesley, Reading (1991)
4. De Paoli, D., Goscinski, A., Hobbs, M., Wickham, G.: The RHODOS Microkernel, Kernel Servers and Their Cooperation. In: IEEE 1st Intl. Conf. on Algorithms and Architectures for Parallel Processing, vol. 1, pp. 345–354. IEEE, Brisbane (1995)
5. Tanenbaum, A.S.: LINUX is obsolete. comp.os.minix, Google Groups (1992), http://groups.google.com/group/comp.os.minix/browse_thread/thread/c25870d7a41696d2/ (last accessed 20/1/2010)
6. Ritchie, D.M., Thompson, K.: The UNIX time-sharing system. Communications of the ACM 17, 10 (1974)
7. Torvalds, L.B.: What would you like to see most in minix? comp.os.minix, Google Groups (1991), http://groups.google.com/group/comp.os.minix/msg/b813d52cbc5a044b (last accessed 20/1/2010)
8. Golub, D.B., Julin, D.P., Rashid, R.F., Draves, R.P., Dean, R.W., Forin, A., Barrera, J., Tokuda, H., Malan, G., Bohman, D.: Microkernel operating system architecture and Mach. In: Proc. USENIX Workshop on Micro-Kernels and Other Kernel Architectures, pp. 11–30 (1992)
9. Hildebrand, D.: An Architectural Overview of QNX. In: Proceedings of the Workshop on Micro-kernels and Other Kernel Architectures, pp. 113–126. USENIX Assoc., CA (1992)
10. Liedtke, J.: On μ-Kernel Construction. In: Proc. 15th ACM Symposium on Operating Systems Principles (SOSP), pp. 237–250 (1995)
11. Engler, D.R., Kaashoek, M.F., O'Toole, J.: Exokernel: an operating system architecture for application-level resource management. In: Fifteenth ACM Symposium on Operating Systems Principles, pp. 251–266. ACM, New York (1995)
12. Ousterhout, J.K., Cherenson, A.R., Douglis, F., Nelson, M.N., Welch, B.B.: The Sprite Network Operating System. Computer 21, 23–36 (1988)
13. Presotto, D., Pike, R., Thompson, K., Trickey, H.: Plan 9, A Distributed System. In: Proceedings of the Spring 1991 EurOpen Conference (1991)
14. Mullender, S.J., van Rossum, G., Tanenbaum, A.S., van Renesse, R., van Staveren, H.: Amoeba: A distributed operating system for the 1990s. IEEE Computer 23, 44–53 (1990)
15. Kon, F., Campbell, R., Mickunas, M.D., Nahrstedt, K., Ballesteros, F.J.: 2K: A Distributed Operating System for Dynamic Heterogeneous Environments. In: 9th IEEE Intl. Symposium on High Performance Distributed Computing, pp. 201–209. IEEE, Pittsburgh (2000)
16. Bershad, B.N., Savage, S., Pardyak, P., Sirer, E.G., Fiuczynski, M.E., Becker, D., Chambers, C., Eggers, S.: Extensibility, safety and performance in the SPIN operating system. In: Proc. of the 15th ACM Symposium on Operating Systems Principles, pp. 267–283. ACM, Copper Mountain (1995)
17. Barak, A., La'adan, O.: The MOSIX Multicomputer Operating System for High Performance Cluster Computing. J. of Future Generation Comp. Systems 13, 361–372 (1998)
18. Morin, C., Lottiaux, R., Valle, G., Gallard, P., Margery, D., Berthou, J.-Y., Scherson, I.D.: Kerrighed and data parallelism: cluster computing on single system image operating systems. In: Sixth IEEE Intl. Conf. on Cluster Computing. IEEE, San Diego (2004)
19. Goscinski, A.: A single system image operating system for next generation application software. In: Glowacz, P.Z. (ed.) International Conference on Modern Directions in Electrotechnics, Automatics, Computer Science, Electronics and Telecommunication, pp. 147–152. University of Mining and Metallurgy, Cracow (2002)
20. Horn, P.: Autonomic computing: IBM's perspective on the state of information technology. IBM Corp. (2001), http://www.research.ibm.com/autonomic/ (last accessed 20/1/2010)
21. Goscinski, A., Silcock, J., Hobbs, M.: Building Autonomic Clusters: A Response to IBM's Autonomic Computing Challenge. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 27–35. Springer, Heidelberg (2004)
22. Goscinski, A., Hobbs, M., Silcock, J.: The Genesis Cluster Operating System Supporting Parallel Processing. In: High Performance Computing Systems and Applications, pp. 301–313. Kluwer Academic Publishers, The Netherlands (2002)
23. Goscinski, A., Jeffers, P., Silcock, J.: Data Collection for Global Scheduling in the GENESIS System. In: International Symposium on Parallel Architectures, Algorithms and Networks I-SPAN'02, pp. 193–198. IEEE Computer Society, Makati City (2002)
24. Dines, E., Goscinski, A.: Toward self discovery for an autonomic cluster. In: Hobbs, M., Goscinski, A.M., Zhou, W. (eds.) ICA3PP 2005. LNCS, vol. 3719, pp. 125–131. Springer, Heidelberg (2005)
25. De Paoli, D., Goscinski, A.: The RHODOS Migration Facility. Journal of Systems and Software 40, 51–65 (1998)
26. Rough, J., Goscinski, A.: Exploiting Operating System Services to Efficiently Checkpoint Parallel Applications in GENESIS. In: Wanlei Zhou, X.-b.C., Goscinski, A., Li, G.-j. (eds.) The Fifth International Conference on Algorithms and Architectures for Parallel Processing, pp. 261–268. IEEE Computer Society, Beijing (2002)
27. Maloney, A., Goscinski, A.: The Cost of Storing Checkpoints to Multiple Volatile Storage Locations Using at-least-k Semantics. In: Michael Hobbs, Y.H., Kuo, S.-Y., Zhou, W. (eds.) 13th IEEE International Symposium on Pacific Rim Dependable Computing (PRDC 2007), pp. 330–333. IEEE Computer Society, Melbourne (2007)
28. Hobbs, M., Goscinski, A.: The GENESIS parallelism management system employing concurrent process-creation services. Microprocessors and Microsystems 24, 415–427 (2000)
29. Rough, J., Goscinski, A., De Paoli, D.: PVM on the RHODOS Distributed Operating System, pp. 208–215. Springer, Heidelberg (1997)
30. Maloney, A., Goscinski, A., Hobbs, M.: An MPI Implementation Supported by Process Migration and Load Balancing. In: Dongarra, J., Laforenza, D., Orlando, S. (eds.) EuroPVM/MPI 2003. LNCS, vol. 2840, pp. 414–423. Springer, Heidelberg (2003)
31. Silcock, J., Goscinski, A.: A Comprehensive Distributed Shared Memory System that is Easy to Use and Program. Distributed Systems Engineering 6, 121–128 (1999)
32. Goscinski, A., Hobbs, M., Silcock, J.: GENESIS: an efficient, transparent and easy to use cluster operating system. Parallel Computing 28, 557–606 (2002)
33. Hobbs, M., Wickham, G., De Paoli, D., Goscinski, A.: Generic Memory Object for Supporting Distributed Systems. In: International Conference on Automation, pp. 363–366. Allied Publishers, Indore (1995)
34. Joyce, P., De Paoli, D., Goscinski, A., Hobbs, M.: Implementation and Performance of the Interprocess Communications Facility in RHODOS. In: Intl. Conference on Networks / Intl. Conference on Information Engineering, pp. 571–575. IEEE, Singapore (1995)
Quick Forwarding of Queries to Relevant Peers in a Hierarchical P2P File Search System

Tingting Qin, Qi Cao, Qiying Wei, and Satoshi Fujita

Graduate School of Engineering, Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi-Hiroshima, 739-8527, Japan
{tacit,caoqi,weiqy,fujita}@se.hiroshima-u.ac.jp
Abstract. In this paper, we propose a new file search scheme for a three-tier Peer-to-Peer (P2P) architecture. The proposed scheme consists of two parts. The first part determines a way of associating files held by each peer in the bottom layer with subservers in the middle layer, where each subserver plays the same role as the (centralized) server in conventional search engines such as Google and Yahoo!. The second part provides a way of forwarding a query received by the central server in the top layer to an appropriate subserver relevant to the query. The proposed scheme is based on the notion of "tags", and a technique based on a priority sequence of tags is introduced to realize quick forwarding of queries. The result of the performance evaluation indicates that the number of tags which must be examined in forwarding a given query is bounded by a small constant. Keywords: Hierarchical P2P architecture, super-peer, tag-based sieving method, subserver.
1 Introduction
Recently, Peer-to-Peer (P2P) systems have attracted considerable attention as a way of overcoming critical flaws of conventional client/server systems, such as the single point of failure and performance bottlenecks [2]. A P2P system consists of several nodes called peers, connected with each other through a logical network called a P2P overlay. Each peer holds abundant and varied digital contents such as documents, images, and music clips, which are shared with other peers through the P2P overlay, as in Napster [7] and Gnutella [1]. A key issue in realizing attractive services over P2P systems is how to find a file of interest to a user in an efficient and timely manner. For file search in the World Wide Web (WWW), it is common to utilize crawler-based search engines such as Google and Yahoo!, which first collect all web pages to a centralized server and construct a list of indexes of such pages so
Corresponding author.
as to realize a quick response to a query issued by each user. Although it can quickly identify the location of a target file in the network, such a centralized approach cannot be directly applied to file search in P2P systems, since crawling the network causes an inevitable delay in reflecting changes of files in the list of indexes. In fact, even in the case of the WWW, crawling does not capture all changes of the files, and locations indicated by the search engine are often stale. In this paper, we propose a new file search scheme for P2P systems. The proposed scheme is designed for a hierarchical P2P architecture consisting of top, middle, and bottom layers, which was originally proposed to realize real-time file search in P2P networks [3]. The proposed scheme consists of two parts. The first part determines a way of associating files held by each peer in the bottom layer with subservers in the middle layer, where each subserver plays the same role as the (centralized) server in conventional search engines (see Section 3.1 for the details). The second part provides a way of forwarding a query received by the central server in the top layer to an appropriate subserver relevant to the query. The proposed scheme is based on the notion of tags, and a technique based on a priority sequence of tags is introduced at the central server in order to realize quick forwarding of received queries. The result of a preliminary performance evaluation indicates that the number of tags which must be examined in forwarding a given query is bounded by a small constant. The remainder of this paper is organized as follows. Section 2 outlines related work. Section 3 describes our proposed algorithm, and the result of a preliminary evaluation is given in Section 4. Finally, Section 5 concludes the paper with future problems.
2 Related Work
There are several aspects to realizing an efficient, reliable file search in P2P systems, e.g., query forwarding rules, determination of the peers who receive query-associated messages, the message transformation format, and the definition and maintenance of local indexes. In general, the goodness of P2P search is evaluated in terms of the following three metrics: 1) accuracy of the search result, including the number and quality of objects discovered per request; 2) the amount of network bandwidth consumed; and 3) adaptiveness to dynamic changes of the network topology due to joins and leaves of participating peers. In this section, we overview related work on P2P file search schemes for each type of control policy of the network topology, i.e., unstructured P2Ps and structured P2Ps. An advantage of unstructured P2Ps is their flexibility in realizing complicated file search schemes. Blind search is the basic file search scheme adopted in the original Gnutella [1], in which the originator of a search process floods a query message to all peers within a predetermined number of TTL hops and collects reply messages from those peers indicating whether or not a requested file is held by them; each intermediate peer is not aware of the location of the target peers holding a requested file. Thus, although it is simple, it causes a large number of redundant message transmissions and consumes a large amount of network bandwidth.
Modified-BFS [5] tries to overcome this drawback of blind search by restricting the receivers of a transmitted query to a predetermined fraction of the neighbors of the transmitting peer, where the selection of receivers is conducted randomly. In k-random walk [6], the number of receivers is restricted to one, except for the originator. More concretely, the originator sends out k query messages to k randomly selected neighbors, and during a search process each message follows its own random search path. Such a search process terminates with a success or a failure, i.e., after finding a target file or after exhausting the TTL. A file search in structured P2Ps is conducted in a more systematic manner than in unstructured P2Ps, using techniques such as the Distributed Hash Table (DHT), skip graph, and Bloom filter. Chord is a typical DHT-based P2P [10]. The Chord protocol supports just one operation: given a key, it maps the key onto a peer. More concretely, data allocation in Chord is realized by associating a key with each data item and storing the key/value pair at the peer to which the key maps. Pastry realizes message routing and object allocation in potentially very large overlay networks connected via the Internet [9], and Tapestry is a P2P network that provides location-independent message routing to close-by endpoints, using only localized resources [11]. Each of the above search schemes for structured P2Ps tightly controls data allocation and the topology of the underlying network in order to realize a certain kind of search ordering that facilitates the search of requested files. However, although this certainly improves the efficiency of the file search process, such tight control leads to a high overhead, which increases the overall cost required for data allocation and topology maintenance.
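To make the unstructured baseline concrete, the following is a minimal Python sketch of the k-random-walk search described above. The graph representation and the has_file predicate are our own illustrative assumptions, not part of the original protocols.

import random

def k_random_walk(graph, origin, has_file, k=4, ttl=16):
    # Launch k independent walkers from the originator; each walker forwards
    # the query to one randomly chosen neighbor per hop until the requested
    # file is found or the walker's TTL is exhausted.
    for _ in range(k):
        peer = origin
        for _ in range(ttl):
            if has_file(peer):
                return peer                  # success: a walker located the file
            neighbors = graph.get(peer, [])
            if not neighbors:
                break                        # dead end: this walker terminates
            peer = random.choice(neighbors)  # continue along a random search path
    return None                              # failure: all walkers exhausted the TTL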
3 Proposed Algorithm

3.1 System Model
In this paper, we focus on a hierarchical P2P structure to realize an efficient file search in a distributed environment. More concretely, we adopt a three-tier architecture consisting of top, middle, and bottom layers, where the top layer consists of a centralized server, the middle layer consists of a number of subservers, and the bottom layer consists of a large number of user peers (UPs, for short). Note that the middle layer can be regarded as a collection of super peers, which are commonly adopted in many existing P2P applications. In the following, we denote the central server by C and the set of subservers by S = {S1, S2, ..., Sm} (note that the central server C can be realized as a collection of several computers). In this system, the UPs at the bottom layer are grouped according to the similarity of the users' interests behind the peers and/or the proximity of their geographical locations. Each group is associated with a subserver in the middle layer, and the UPs in a group are connected to the subserver corresponding to the group by logical links. Each subserver acts as an individual search engine by keeping a collection of "fresh" indexes of the contents held by the corresponding UPs; i.e., such indexes are repeatedly collected (and maintained) by the subserver, and a
query message concerned with the files held by a UP in the group will be (locally) processed by the subserver. Central server C takes responsibility for distributing such lookup services and maintaining the UP/SP correspondence. In Section 3.4, we describe a way of quickly identifying subservers relevant to a given query; i.e., we propose a scheme to deliver a given query to a target subserver over the subnetwork consisting of the top and middle layers. In Section 3.3, we describe a way of associating each UP with a group corresponding to a subserver, and a way of collecting indexes of files held by UPs to the corresponding subservers in the subnetwork consisting of the middle and bottom layers.
3.2 Basic Tools
Before describing the details of the proposed scheme, we introduce two basic tools which will play an important role in the succeeding subsections.

Tag-Based Sieving of Files: In the proposed scheme, the central server C maintains a set of tags, which are attached to each file held by the UPs and to each index held by the subservers. Let T = {t1, t2, ..., tn} denote the set of all tags. Each tag in T is a keyword or a key phrase representing the "meaning" of objects in the real world. For example, the tag "china" carries several meanings from various angles; e.g., it can denote the name of a country, a kind of porcelain, a kind of cuisine, and so on. In this paper, we assume that the set T is predetermined by experts and administrators; an efficient way for end users to insert, delete, and modify tags in T is left as future work. It is worth noting here that the set of tags attached to the files must be determined (or refined) by taking into account the popularity of tags. Zipf's first law, which describes a family of related discrete power-law probability distributions, states that given some corpus of natural language utterances, the frequency of each word is inversely proportional to its rank in the frequency table [8]. This indicates that we should avoid selecting high-frequency words as members of T, since they could not attain an efficient sieving of the files associated with the tags. On the other hand, a tag will not be useful if it is highly unpopular, i.e., if the number of files to which the tag is attached is very small. In other words, the tags contained in T must be low-frequency but, in some sense, representative words.

Priority Sequence of Tags: Let σ be a bijection from T to {1, 2, ..., |T|}. In the following, σ(t) is referred to as the priority of tag t, and we say that tag t1 is given a higher priority than tag t2 under σ if σ(t1) < σ(t2). Note that σ naturally defines the following sequence of tags, which will be referred to as a priority sequence of tags in what follows: σ⁻¹(1), σ⁻¹(2), ..., σ⁻¹(|T|), where σ⁻¹ denotes the inverse of function σ. We now introduce the notion of an "inclusion relation" between tag sets, which plays an important role in the proposed scheme.
Definition 1. Let T1, T2 ⊆ T be two subsets of tags. T1 is said to be included by T2 under σ, denoted by T1 ⊑σ T2, if the priority sequence of T2 is a prefix of the priority sequence of T1.

Example 1. Let T = {t1, t2, ..., t9} and assume that σ(ti) < σ(ti+1) for 1 ≤ i ≤ 8. Subset T1 = {t1, t2, t3} is included by subset T2 = {t1, t2} under σ, since the priority sequence of T2, i.e., t1, t2, is a prefix of the priority sequence of T1, which is t1, t2, t3. On the other hand, subset T3 = {t2, t3, t4} is not included by T2 = {t1, t2} under σ, since the priority sequence of T2, i.e., t1, t2, is not a prefix of the priority sequence of T3, which is t2, t3, t4.

Definition 2. Two tag sets T1 and T2 (⊆ T) are said to be incomparable under σ if neither T1 ⊑σ T2 nor T2 ⊑σ T1.

A function to check the inclusion of T1 by T2 is described as follows:

function INCLUSION(T1, T2)
Step 1: If |T1| < |T2|, then return false and stop, where |T| denotes the cardinality of set T.
Step 2: If T2 = ∅, then return true and stop.
Step 3: Let t1 be the highest-priority tag in T1 and t2 the highest-priority tag in T2. Let T1 := T1 \ {t1} and T2 := T2 \ {t2}.
Step 4: If t1 ≠ t2, then return false and stop. Otherwise, go to Step 2.
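As a concreteness check, the following is a minimal Python sketch of INCLUSION (all names are ours for illustration; sigma maps each tag to its priority). It collapses Steps 2-4 into a single prefix comparison over the priority sequences.

def inclusion(t1, t2, sigma):
    # Return True iff T1 is included by T2 under sigma (Definition 1), i.e.,
    # the priority sequence of T2 is a prefix of the priority sequence of T1.
    if len(t1) < len(t2):             # Step 1: T2's sequence cannot be a prefix of a shorter one
        return False
    seq1 = sorted(t1, key=sigma.get)  # priority sequence of T1
    seq2 = sorted(t2, key=sigma.get)  # priority sequence of T2
    return seq1[:len(seq2)] == seq2   # Steps 2-4 collapsed into a prefix test

# Example 1 revisited, with sigma(t_i) = i:
sigma = {"t%d" % i: i for i in range(1, 10)}
assert inclusion({"t1", "t2", "t3"}, {"t1", "t2"}, sigma)      # T1 is included by T2
assert not inclusion({"t2", "t3", "t4"}, {"t1", "t2"}, sigma)  # T3 is not included by T2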
3.3 File Uploading Process
This subsection describes a way of uploading the indexes of files held by a UP to a particular subserver. As claimed previously, each subserver is associated with a subset of tags, and each file held by a UP has at least one tag attached by the user. Our scheme associates files with subservers through the notion of inclusion of tag sets. A concrete procedure, which is executed by each UP holding indexes to be uploaded, is described below (a Python sketch follows at the end of this subsection).

procedure FILE UPLOAD
Step 1: Let T̂ be the set of tags attached to the file index to be uploaded.
Step 2: Find a subserver Si associated with a tag set T* including T̂.
Step 3: Connect to subserver Si and upload the file index to Si.

This procedure is invoked by a UP when a file is newly created and/or the contents of a file are modified by the UP. A request for uploading indexes is handled by the central server C, which determines the subserver to which the given file index should be transferred. As claimed previously, the correspondence between tags and subservers is maintained by C using a list. Therefore, once a subserver Si is confirmed in Step 2, the UP can immediately acquire the information on Si, including its IP address and port number. It should be noted that in our proposed scheme each subserver merely stores indexes of files, while the actual contents of files are held by each UP; i.e., the load of each subserver
concerned with an upload can be kept sufficiently low. Meanwhile, each UP keeps the information on the subserver corresponding to each file it holds, so that the index of a file can be updated as soon as the file is modified by the UP (i.e., we assume an event-driven upload rather than polling and/or crawling).
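A minimal sketch of FILE UPLOAD, reusing the inclusion helper sketched in Section 3.2 (the subservers mapping and the upload_index call are our own illustrative assumptions):

def file_upload(file_tags, subservers, sigma):
    # subservers: maps a subserver handle to its associated tag set T*,
    # mirroring the tag-to-subserver list maintained by the central server C.
    for server, server_tags in subservers.items():
        if inclusion(file_tags, server_tags, sigma):  # T-hat is included by T*
            server.upload_index(file_tags)            # hypothetical transport call
            return server
    raise LookupError("no subserver tag set includes the file's tags")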
3.4 Query Forwarding Process
We next consider the process of query forwarding, which is the key operation in the search process. The main difference between our three-tier P2P search engine and conventional search engines is that the central server plays the role of a controller that balances the network traffic in the whole system. A system variable NR, indicating the total number of files discovered so far, plays a similar role to the TTL in flooding-based schemes; i.e., every time a new file is discovered, NR is incremented by one, and the search process stops when NR reaches a predefined value. A pseudo-code for the query forwarding process is given below, followed by a Python sketch.

procedure QUERY FORWARD
Step 1: Let T̃ be the set of tags corresponding to a query q received by the central server C.
Step 2: C identifies a subserver Si associated with a tag set T* including T̃.
Step 3: C connects to subserver Si and forwards q to Si.
Step 4: After receiving query q, subserver Si conducts a file search similar to a conventional search engine and directly notifies the result to the requesting UP. The number of matching results is notified to C.
Step 5: If the number of matching results is smaller than the predetermined NR, C tries to find another subserver Sj such that the associated tag set Tj and T̃ are not incomparable, if any, and goes to Step 3. Otherwise, it stops.
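The sketch below (ours, for illustration) collapses Steps 2-5 into a single loop over the subservers whose tag sets are comparable with the query's tag set under Definition 2, reusing the inclusion helper from Section 3.2; the per-subserver search call is a hypothetical stand-in for the local file search of Step 4.

def query_forward(query_tags, subservers, sigma, nr):
    # Accumulate hit counts from comparable subservers until NR is reached.
    matches = 0
    for server, server_tags in subservers.items():
        comparable = (inclusion(query_tags, server_tags, sigma) or
                      inclusion(server_tags, query_tags, sigma))
        if not comparable:
            continue                      # incomparable tag sets are skipped (Step 5)
        matches += server.search(query_tags)  # hypothetical lookup returning a hit count
        if matches >= nr:
            break                         # enough files discovered: stop forwarding
    return matches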
4 Priority Sequence
In this section, we evaluate the performance of the proposed scheme in terms of the number of tags which must be examined in forwarding a given query to a relevant subserver.
4.1 Discrimination Tree
Let T = {t1, t2, ..., tn} be a set of tags, and let σ be a priority sequence defined over T. In the proposed scheme, each subserver is associated with a subset of tags in such a way that for any subset T′ of T, there exists a subserver which is associated with a set of tags including T′ under σ. Such an assignment can be represented by the tree structure described below:
– Each vertex in the tree is associated with a set of tags. In the following, let T(u) denote the set of tags associated with vertex u in the tree.
– The root of the tree is associated with the empty set of tags.
– Let u be a vertex in the tree, and let t′ be the lowest-priority tag in T(u); let i = σ(t′) for brevity. Then, in the tree structure, vertex u either has no children or has n − i children, associated with the tag sets T(u) ∪ {σ⁻¹(j)} for each i + 1 ≤ j ≤ n.
– Each leaf in the tree corresponds to a tag set associated with a subserver, and the subserver associated with a leaf plays the role of its parent if the leaf is the leftmost child of the parent (such assignment of the role of parent is recursively conducted up to the root vertex).
Observe that the collection of resultant tag sets certainly satisfies the requirement described above. In the following, we use this tree structure for the "discrimination" of a given query, in the sense that a query received from a client is placed at the root vertex and moves toward a leaf vertex associated with a tag set including the query. The number of children of a vertex (i.e., whether it has any children) depends on the number of files associated with the vertex. More concretely, if a vertex is associated with a number of files exceeding a predetermined threshold, its set of files is divided into several subclusters according to the predetermined priority sequence σ. The time required for determining a subserver relevant to a given query is proportional to the depth of the leaf vertex relevant to the query. Thus, the performance of the proposed scheme can be estimated by evaluating the maximum depth of the resultant discrimination tree.
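A minimal Python sketch of this discrimination step (our own illustration; it represents the tree by the set of internal vertices and ignores the leftmost-child-acts-as-parent rule):

def route_query(query_tags, internal_vertices, sigma):
    # Start from the root (empty tag set) and consume the query's tags in
    # priority order, descending one level per tag, until a leaf is reached.
    # internal_vertices: the set of tag sets (frozensets) that have children.
    vertex = frozenset()
    for tag in sorted(query_tags, key=sigma.get):  # query tags by priority
        if vertex not in internal_vertices:
            break                                  # reached a leaf
        vertex = vertex | {tag}                    # move to the matching child
    return vertex   # tag set of the leaf (subserver) handling the query

The number of tags examined equals the number of loop iterations, i.e., the depth of the leaf reached.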
4.2 Case Studies
Let N be the total number of files held by UPs in the system. In this section, we evaluate the maximum depth of a discrimination tree by assuming that the probability of attaching tags to files is provided for each tag in T. Let pi denote the probability of attaching tag ti to a file.¹ We do not assume any correlation between attachments of different tags; i.e., those attachments are assumed to be independent. We evaluate the goodness of several priority sequences in terms of the maximum depth of the resultant discrimination tree, assuming that the number of files associated with a vertex must be bounded by α × N for some constant α. In the following, we analytically evaluate the goodness of two concrete priority sequences.

Case 1: At first, we consider a case in which a popular tag is given higher priority. More concretely, we consider a priority sequence such that σ(ti) < σ(tj) iff pi > pj. In this case, the child with the highest priority is associated with a largest (sub)cluster for each vertex (in the following, we refer to such a highest-priority child as the "left-most" vertex).
¹ Under Zipf's first law, the probability that the ith most popular tag ti is selected as an attachment to a file is proportional to (1/i)^θ for some constant θ, where the parameter θ is generally referred to as the Zipf parameter.
Fig. 1. Result of numerical evaluation: (a) Case 1; (b) Case 2.
Thus, for each level of the discrimination tree, the left-most vertex at that level will be associated with a largest cluster at that level, where the level of a vertex means its distance from the root. According to the above observation, the expected size of a largest cluster at the ith level is calculated as follows. At the first level, the left-most child of the root is associated with a cluster of expected size N × p1, since there are N files at the root and the probability of attaching tag t1 to a file is p1. At the second level, the left-most vertex is associated with tags t1 and t2, and the expected size of its cluster is N × p1 × p2. Similarly, the expected size of a largest cluster at the ith level is given as N × ∏_{j=1}^{i} p_j. Thus, since the maximum size of a cluster is bounded by αN, the maximum level of the resultant discrimination tree coincides with the smallest integer i satisfying the inequality ∏_{j=1}^{i} p_j ≤ α.

Case 2: Next, we consider a case in which a popular tag is given lower priority; i.e., a priority sequence such that σ(ti) < σ(tj) iff pi < pj. In this case, the following selection of tags maximizes the size of the resultant cluster: for each i from 1 to n sequentially, 1) skip the selection of ti while pi < 0.5, and 2) add tag ti to the current set until the size of the resultant cluster becomes lower than αN. Let i′ be the maximum index such that p_{i′} < 0.5. If ∏_{j=1}^{i′} (1 − p_j) ≤ α, then the maximum depth of the discrimination tree is exactly one. Otherwise, the maximum depth of the discrimination tree is i′′ − i′ + 1, where i′′ is the smallest integer satisfying the inequality ∏_{j=1}^{i′} (1 − p_j) × ∏_{j=i′+1}^{i′′} p_j ≤ α.

The result of the numerical evaluation for both cases is given in Figure 1, assuming that the sequence of pi follows Zipf's first law. In this figure, we fix the parameters as follows: N = 10000, the size of T is 100, parameter α is varied from 0.01 to 0.1, and the Zipf parameter θ is varied from 0.05 to 0.15. It is easy to see that the (expected) maximum depth in Case 2 is much smaller than in Case 1.
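A small numeric sketch (entirely our own, for illustration) of the Case 1 bound: it generates Zipf-distributed tag probabilities and returns the smallest level i with p1 × p2 × ... × pi ≤ α; the Case 2 bound can be evaluated analogously from its two products.

def zipf_probs(n, theta):
    # Normalized Zipf probabilities: p_i proportional to (1/i)^theta.
    h = sum(1.0 / i ** theta for i in range(1, n + 1))
    return [(1.0 / i ** theta) / h for i in range(1, n + 1)]

def depth_case1(p, alpha):
    # Smallest i such that p_1 * p_2 * ... * p_i <= alpha, with tags
    # considered in decreasing order of popularity (Case 1 priority).
    prod = 1.0
    for i, pi in enumerate(sorted(p, reverse=True), start=1):
        prod *= pi
        if prod <= alpha:
            return i
    return len(p)

# e.g., depth_case1(zipf_probs(100, 0.1), 0.01)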
5 Concluding Remarks
In this paper, we proposed a new file search scheme for a three-tier P2P architecture. The core of the proposed scheme is the way of uploading file indexes and forwarding received queries to relevant subservers based on the notion of tags.
Our ongoing work focuses on the selection of the tag set T. At present, there is only one tag set T in our scheme; we need to extend it to handle multiple tag sets in order to improve the accuracy of the search results. Another open problem is how to refine the priority sequence, and how to determine an assignment of the resultant clusters to subservers.
References
1. Adar, E., Huberman, B.A.: Free Riding on Gnutella. First Monday 5(10) (2000)
2. Balakrishnan, H., Kaashoek, M.F., Karger, D., Morris, R.: Looking Up Data in P2P Systems. Communications of the ACM 46(2), 43–48 (2003)
3. Qin, T.T., Cao, Q., Wei, Q.Y., Fujita, S.: A Hierarchical Architecture for Real-Time File Search in Peer-to-Peer Networks. In: Proc. PDAA, in conjunction with PDCAT 2009 (December 2009)
4. Harren, M., Hellerstein, J.M., Huebsch, R., Loo, B.T., Shenker, S., Stoica, I.: Complex Queries in DHT-based Peer-to-Peer Networks. In: Druschel, P., Kaashoek, M.F., Rowstron, A. (eds.) IPTPS 2002. LNCS, vol. 2429, p. 242. Springer, Heidelberg (2002)
5. Kalogeraki, V., Gunopulos, D., Zeinalipour-Yazti, D.: A Local Search Mechanism for Peer-to-Peer Networks. In: Proc. CIKM (2002)
6. Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and Replication in Unstructured Peer-to-Peer Networks. In: Proc. ACM SIGMETRICS (2002)
7. Napster Homepage, http://www.napster.com/
8. Newman, M.: Power laws, Pareto distributions and Zipf's law. Contemporary Physics 46, 323–351 (2005)
9. Rowstron, A.I.T., Druschel, P.: Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In: Guerraoui, R. (ed.) Middleware 2001. LNCS, vol. 2218, p. 329. Springer, Heidelberg (2001)
10. Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., Balakrishnan, H.: Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In: Proc. ACM SIGCOMM (August 2001)
11. Zhao, B.Y., Huang, L., Stribling, J., Rhea, S.C., Joseph, A.D., Kubiatowicz, J.D.: Tapestry: A Resilient Global-scale Overlay for Service Deployment. IEEE Journal on Selected Areas in Communications 22(1), 41–53 (2004)
iCTPH: An Approach to Publish and Lookup CTPH Digests in Chord*

Zhang Jianzhong, Pan Kai, Yu Yuntao, and Xu Jingdong

Department of Computer Science, Nankai University, Tianjin, P.R. China
[email protected]
Abstract. In digest-based distributed anti-spam technology, the research concentrates on how to publish and lookup digests efficiently. Based on a detailed study of CTPH and DHT, we propose an approach to publish and lookup CTPH digests in Chord, called iCTPH, in which the high-dimensional CTPH digests are mapped into one-dimensional Chord identifiers by the iDistance method. Simulation experiments demonstrate that iCTPH has good publish and lookup performance. For randomly generated digests, iCTPH can publish 500 similar digests to less than 5.6% of the nodes and recall 85% of the similar digests by querying no more than 4% of the nodes.
1 Introduction

The research on anti-spam technology plays an important role in purifying the Internet environment. Based on spam's burstiness and similarity features, digest-based distributed anti-spam technology has been proposed. This kind of technology recognizes spam by collaboratively publishing and looking up email digests among the mail servers. The introduction of DHT provides a good platform for digest-based collaborative anti-spam technology. In order to avoid being blocked by the anti-spam system, instances of the same kind of spam often have similar but not identical content. Therefore, we must employ a locality-sensitive algorithm to generate mail digests. CTPH [1] is such a locality-sensitive algorithm: it generates similar digests for similar messages, and Kornblum verified in [1] that CTPH can be used in anti-spam systems. Since the similarity of messages cannot be determined by simply comparing the CTPH digests' numeric values, it is necessary to study the publish and lookup method in DHT-based systems. Based on a detailed study of Chord and CTPH, we propose an approach to publish and lookup CTPH digests in Chord: iCTPH. Simulation experiments show that iCTPH has good publish and lookup performance. For randomly generated digests, iCTPH can publish 500 similar digests to less than 5.6% of the nodes and recall 85% of the similar digests by querying no more than 4% of the nodes. *
This work was supported by a grant from Tianjin Natural Science Foundation (No. 08JCZDJC22100).
2 Background and Related Work

2.1 Similar Search in DHT

A DHT network is a kind of decentralized distributed system that provides a lookup service similar to a hash table. Each node in a DHT stores a partial view of the whole distributed system, which effectively distributes the routing information. DHTs can scale to extremely large numbers of nodes and automatically handle node arrivals, departures, and failures. Chord [2] is a well-known DHT. The nodes in Chord are organized as a ring and identified by Chord identifiers. Chord uses a consistent hash function to map a resource into a k-bit ID and publishes the resource to the successor of the ID. When a key is looked up, each node forwards the query to its successor on the identifier circle until one of the nodes determines that the key lies between itself and its successor. Chord guarantees that the lookup operation finishes in O(log N) time, where N is the total number of nodes in the Chord ring. In this paper, we use Chord as the supporting network to publish and lookup CTPH digests. Generally, a DHT only provides an exact-match service: given a key, the DHT returns the information of the node where the corresponding resource is located. Range search in DHT concentrates on how to return all the relevant nodes where the similar resources are located.

2.2 CTPH Algorithm

The CTPH algorithm combines piecewise hashing and fuzzy hashing. Before generating a digest, CTPH first computes a trigger value related to the input length. It uses a 7-byte window to scan the input sequence. Every time a new byte is scanned, a window hash value is generated and compared with the trigger value; if they are equal, a traditional block hash is performed. Each traditional hash value is mapped into one of the characters in a base64 character array, and all of these characters make up the message's CTPH digest. At the end of the algorithm, if the digest is too short, the algorithm halves the trigger value and executes another pass; because the trigger value is then hit with greater probability, the resulting CTPH digest becomes much longer. CTPH uses edit distance to measure the similarity between two digests. The edit distance is the minimum number of edit operations needed to transform one string into the other, where an "edit operation" is an insertion, deletion, or substitution of a single character. To determine the similarity of two mails, we compute their CTPH digests: the smaller the edit distance, the more similar the mails.

2.3 iDistance Method

The iDistance [3] technique was proposed to perform kNN search efficiently in database systems. It can map a high-dimensional object into a one-dimensional value. It divides the vector space into n clusters (O1, O2, ..., On), identified by their reference points (p1, p2, ..., pn). Data objects are mapped into the clusters according to their distances from the reference points. The iDistance value y for a given object p ∈ Oi is:

y = i × C + dist(p, pi) .    (1)
Here dist is a function calculating the distance between two objects, and the constant C is used to separate the clusters. In iCTPH, mail digests are regarded as metric-space objects. In order to publish similar digests to nearby nodes, we must map them into similar publish keys; the iDistance method provides the theoretical foundation for this kind of mapping.

2.4 Related Work

SpamNet [4] is a well-known digest-based anti-spam system. It stores all known spam digests on a centralized server, and a client determines whether a mail is spam by looking it up on the server. This centralized model suffers from the single-point-of-failure problem, and the load is not balanced. DCC [5] is another widely used digest-based anti-spam system: if a mail is verified to be spam, its digest is stored in a central, collaborative repository, and a mail server identifies spam by asking the repository. In this way the load-balance problem can be improved to a certain extent; however, DCC still carries the risk of single-point failure because it essentially works in a centralized manner. Mo [6] proposed a spam recognition method combining Chord and Bayes, which identifies spam by publishing and looking up the number of similar digests; however, that work does not provide an applicable algorithm to publish and lookup mail digests. DHTnil [7] introduces an applicable algorithm to publish and lookup Nilsimsa digests in a DHT. It treats the digest space as an N-dimensional hypersphere, divides the hypersphere into subspheres according to pre-selected reference points, and publishes digests with similar Euclidean distances to nearby reference points. In this paper we propose iCTPH to publish and lookup CTPH digests in Chord. The differences between iCTPH and DHTnil are as follows: (1) iCTPH employs the CTPH algorithm to generate mail digests while DHTnil employs Nilsimsa; (2) iCTPH adopts the metric space model while DHTnil adopts the vector space model; (3) iCTPH employs the iDistance method to map high-dimensional objects into one-dimensional values while DHTnil uses the serial number of the reference points as the publish key; (4) iCTPH employs an interval search algorithm while DHTnil traverses nodes according to the subspace serial number.
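For concreteness, a minimal Python sketch of the iDistance mapping of Equation (1) in Section 2.3 (the function names and the toy numeric example are ours, not part of iDistance itself):

def idistance_key(obj, reference_points, dist, C):
    # Find the cluster of obj, i.e., the closest reference point p_i, and
    # map obj to the one-dimensional value y = i * C + dist(obj, p_i).
    i, p = min(enumerate(reference_points, start=1),
               key=lambda ip: dist(obj, ip[1]))
    return i * C + dist(obj, p)

# Toy usage with a numeric "digest space" and absolute difference as distance:
key = idistance_key(42, [10, 40, 90], dist=lambda a, b: abs(a - b), C=1000)
# The closest reference point is 40 (cluster 2), so key = 2 * 1000 + 2 = 2002.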
3 iCTPH

In order to publish similar digests to nearby nodes, we must map them into similar publish keys. As the CTPH digests of similar e-mails share nothing in common in their numeric values, using the CTPH digest directly as the publish key would not meet our goals. iCTPH therefore employs the iDistance method to map similar digests into nearby clusters. For publishing, iCTPH uses Formula (1) to map similar digests into similar values. For lookup, iCTPH generates a query interval for each cluster and performs a lookup operation on every node of each interval.
3.1 Digest Space in iCTPH

The iDistance method can map a high-dimensional object into a one-dimensional value. In order to employ iDistance in iCTPH, we must prove that the CTPH digests form a metric space. A metric space consists of a collection of objects and a distance function. We denote the metric space as M = (D, d), where D is the domain of objects and d is the distance function. The function d must satisfy the following conditions for any given objects x, y, z ∈ D:

d(x,y) ≥ 0                  (non-negativity)
d(x,y) = 0 iff x = y        (identity)
d(x,y) = d(y,x)             (symmetry)
d(x,z) ≤ d(x,y) + d(y,z)    (triangle inequality)
In iCTPH, we employ the ordinary edit distance function, which assigns the same weight to each atomic edit operation. Table 1 enumerates the edit operations in the CTPH algorithm. For example, we can transform the string "three" into the string "sreat" through the following edit operation sequence: C4 = {r(2,'h'), c(4,'e','a'), i(4,'s'), w(1,5)}.

Table 1. CTPH edit operations

Operation      Explanation
i(m,'x')       Insert 'x' after the mth character
r(i,'x')       Remove the ith character 'x' from the string
w(i,j)         Swap the ith and jth characters
c(i,'x','y')   Replace the ith character 'x' with 'y'
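For reference, a minimal Python sketch of the unit-cost edit distance used throughout (our own illustration): the standard dynamic program below covers insertion, removal and replacement; the paper's swap operation w(i,j) on arbitrary positions is not modeled here.

def edit_dist(a, b):
    # Unit-cost edit distance via the classic DP, keeping two rolling rows.
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # remove a[i-1]
                         cur[j - 1] + 1,                        # insert b[j-1]
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # replace
        prev = cur
    return prev[n]

assert edit_dist("three", "tree") == 1   # one removal suffices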
It is easy to prove that the digests in iCTPH form a metric space. From the definition of edit distance, the distance between any two strings must be greater than or equal to zero (non-negativity). If the distance between two strings is zero then they are the same and, similarly, the distance between two identical strings is zero (identity). Since all operation weights are the same, the edit distance from string x to y is the same as the distance from y to x (symmetry). Finally, the triangle inequality also holds, because the number of edit operations transforming string x to z is at most the number used by the indirect transformation (from x to y and then from y to z). As the edit distance function satisfies all of the properties in the metric space definition, edit distance is a metric-space distance function.

3.2 Digest Space Partition

The main idea of iDistance is to map a high-dimensional object into a cluster and compute a one-dimensional value according to the cluster number. In order to adopt iDistance in iCTPH, it is necessary to find a way to partition the digest space. Aurenhammer surveyed an approach to dividing a multi-dimensional space, the Voronoi diagram, in [8]. A Voronoi diagram divides the space based on a point set of the space: every point in the set becomes the core of a subspace. The subspaces do not overlap with
each other, and the sum of all subspaces' volumes is equal to the volume of the entire space. To find out which subspace a given point belongs to, we can simply traverse all points in the point set; the point closest to the given point identifies the subspace we want. Borrowing from the Voronoi diagram, we choose some digests as the reference points of clusters and decide which cluster a digest belongs to by finding the closest reference point. Given a digest o, the cluster Cq to which o belongs is computed as follows:

Cq = Ck | edit(o, pk) = min{edit(o, pi)} .    (2)
In Equation (2), i varies from 1 to m, where m is the number of clusters and pi denotes the reference point of the ith cluster. If we get more than one closest reference point, we always choose the first one. From Equation (1), it is easy to see that similar digests are guaranteed to be mapped into similar (or even identical) values of y.

3.3 The Publish Algorithm of iCTPH

The digest of an e-mail is published in the following steps:
(1) Compute the mail's CTPH digest (denoted as o).
(2) Decide which cluster the digest belongs to according to Equation (2).
(3) Calculate the publish key according to Equation (3) and publish the digest to the corresponding node through the Chord interface.

key(o) = edit(o, pk) + k × C .    (3)
This algorithm divides the Chord ring evenly and guarantees that similar digests get similar publish keys; as a result, similar digests are published to nearby nodes. The publish algorithm is shown below (a function of the form foo.function denotes a remote procedure call):

iCTPH publish algorithm
Publish(mail, clusterpoints[]) {
  o = ComputeCTPH(mail);
  SortbyDistance(o, clusterpoints[]);
  i = clusterpoints[0].number;
  key = edit_dist(p_i, o) + i * C;
  Chord.publish(key, o);
}

3.4 The Lookup Algorithm of iCTPH

When iCTPH decides whether an email is spam, it performs a range query to get the number of digests that are similar to this email. Assume that node Nq wants to decide whether an email is spam; it proceeds as follows:
(1) Compute the mail's CTPH digest, denoted as o.
(2) Sort the reference points in increasing order of their distance from o. Assume the sorted order is p1, p2, ..., pn.
(3) For each pi, compute the one-dimensional query range according to Equation (4):

Ii = [edit(o, pi) + i × C − r, edit(o, pi) + i × C + r] .    (4)
(4) For each i, 1 ≤ i ≤ n, perform an InterQuery on node Ni, where Ni is the middle node of Ii. Nq waits for the responses and accumulates the numbers of similar digests. The query does not stop until enough similar digests are found. The process of InterQuery is as follows: if the interval managed by Ni is contained in Ii, the InterQuery is recursively performed by the successors and/or predecessors of Ni. Each node receiving a query executes a local search and returns the number of similar digests. The lookup algorithm is as follows:

iCTPH lookup algorithm
Query(mail, radius, clusterpoints[], thresh) {
  num = 0;
  o = ComputeCTPH(mail);
  SortbyDistance(o, clusterpoints[]);
  for each clusterpoint in clusterpoints[] {
    i = clusterpoint.number;
    range = ComputeRange(i);
    mnode = range.middle_node();
    num += mnode.InterQuery(o, radius, range);
    if (num > thresh) return num;
  }
  return num;
}

InterQuery(o, radius, range) {
  sum = 0;
  sum += LocalSearch(o, radius);
  my_range = GetMyRange();
  if (my_range.low > range.low)
    sum += pre.InterQuery(o, radius, range);
  if (my_range.high < range.high)
    sum += succ.InterQuery(o, radius, range);
  return sum;
}
4 Simulation Experiments

We adopt the Voronoi-like partitioning principle to select N reference points (N = 500): first, we generate M CTPH digests randomly; then we find the closest two digests and remove one of them, and repeat this for the remaining M−1 digests until only N digests are left. We make M ≫ N to ensure that the selected reference points are uniformly distributed in the digest space. We restrict the Chord identifier space to 64 bits, although the space is much larger in real Chord implementations; a smaller identifier space makes it more convenient to represent a Chord ID by ordinary machine words (say, 32-bit words). In order to select an appropriate value of the constant C in Equation (3), we divide the Chord ring evenly according to the number of nodes, which means that the clusters are bound to identifier intervals of the same length. In our simulation, since the interval length is much larger than the maximum edit distance (128) between two digests, the clusters do not overlap.

4.1 Simulation of Publish Algorithm

To guarantee the efficiency of similar lookup, we expect the digests of similar e-mails to be published to the same or nearby nodes. In this experiment, we generate 500 reference points and 250 nodes distributed uniformly on the Chord ring. In order to reflect the real situation we repeat the experiment 100 times; each time we generate 15 groups of similar digests, with group sizes varying from 1 to 500. We use the iCTPH algorithm to publish these digests and examine the number of nodes each group involves. As shown in Figure 1(a), the number of nodes involved grows with the group size while the growth rate decreases. When the group size is 500, the number of nodes involved is only 14, accounting for 5.6%. Figure 1(b) shows the distribution of one group of similar digests among the nodes involved, from which we can see that the similar digests are indeed published to only a few nodes.
Fig. 1. Simulation of the publish algorithm (a) and the distribution of one group of similar digests among the nodes involved (b)
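The reference-point selection described at the start of this section can be sketched as follows (a naive scan over all pairs, our own illustration; dist would be the CTPH edit distance):

def select_reference_points(candidates, n, dist):
    # Voronoi-like thinning: repeatedly find the closest pair among the
    # remaining candidates and discard one of the two, until only n
    # well-separated reference points remain. O(M^2) per step; fine for small M.
    points = list(candidates)
    while len(points) > n:
        i, j = min(((a, b) for a in range(len(points))
                           for b in range(a + 1, len(points))),
                   key=lambda ab: dist(points[ab[0]], points[ab[1]]))
        del points[j]   # drop one member of the closest pair
    return points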
4.2 Simulation of Lookup Algorithm

In this experiment we examine the relationship between the recall rate and the number of nodes involved in a lookup operation, i.e., how many nodes we should query to recall a certain amount of the similar digests (say 85%). We publish 200
randomly generated similar digests and recall a certain amount of digests that are similar to the target digest. This experiment is repeated 100 times to reflect the real situation. Table 2 shows the relationship between the recall rate and the number of nodes involved.

Table 2. Experiment results in priority selection mode and random selection mode

Recall Rate   Percentage (priority)   Percentage (random)
50%           2%                      3%
55%           2%                      3%
60%           2%                      3%
65%           2%                      3%
70%           2%                      3%
75%           3%                      4%
80%           3%                      4%
85%           4%                      6%
90%           7%                      7%
95%           9%                      11%
100%          23%                     30%
As shown in the second column of Table 2, the number of nodes involved grows very slowly while the recall rate stays at a lower level, but increases drastically when the recall rate reaches a high level, e.g., from 85% to 100%. The reason is that we sort the reference points by their distance from the target digest before starting a lookup operation and perform the interval queries in that order. Although most of the similar digests are published to the first several nodes, a few digests still reside on remote nodes. Under this premise, only a few nodes need to be queried if the recall rate is not very high; on the contrary, if the recall rate is extremely high, many more nodes must be traversed to reach the remote nodes and recall most of the similar digests. In spite of that, the result is still satisfactory: only about 3% of the nodes need to be traversed to recall 80% of the similar digests. In practice, we must make a trade-off between accuracy and efficiency. Taking the experiment above as an example, a recall rate of 50% leads to an effective query process but may impair accuracy; on the contrary, a recall rate of 100% leads to the most accurate result but needs to query about 23% of the nodes, which is very inefficient.

4.3 Simulation of Load Balance

In this experiment we evaluate the nodes' load balance after publishing a large number of digests. We publish 200 groups of similar digests (100 digests in each group) to the nodes evenly distributed on the Chord ring. Two alternative publish mechanisms are employed. The first, called priority selection, selects the first reference point to compute the publish key when there is more than one candidate. Figure 2(a) illustrates the load balance under this publish mechanism.
The upper and lower curves in Figure 2(a) indicate the digest load and the group load, respectively. We can see that the load balance is not very good when the priority selection mechanism is employed: the nodes in the front contain many more digests than those in the back. The second publish mechanism (called random selection) is employed to improve the load balance. In this mechanism we publish the digest to a randomly chosen node when there is more than one candidate. We repeat the experiment under the same conditions, and the result is shown in Figure 2(b).
Fig. 2. Load situation in priority selection mode (a) and random selection mode (b)
As shown in Figure 2(b), the load of the nodes is much more balanced than in Figure 2(a). However, the load balance comes at the expense of lookup efficiency. The third column of Table 2 shows the lookup efficiency in this situation: priority selection gets better results than random selection under the same conditions. For example, at a recall rate of 85%, the former needs to query only 4% of the nodes but the latter 5.77%. This is because the digests are more concentrated under the priority selection mechanism and the query process stops much earlier; under random selection, many more nodes have to be queried to recall enough similar digests because of the scattered distribution of the digests. So in practice, we must make a trade-off between lookup efficiency and load balance.
5 Conclusion and Future Work

This paper proposes an approach to publish and lookup CTPH digests in Chord. The publish algorithm employs the iDistance method to map high-dimensional CTPH digests into one-dimensional Chord identifiers. The lookup algorithm sorts the reference points by their distance from the target digest and performs interval queries in that order. As shown in the simulation experiments, iCTPH can effectively publish similar digests to very few nodes; for example, the percentage of nodes involved in publishing is only 5.6% when the group size is 500. iCTPH guarantees that it can recall 80% of the similar digests by querying no more than 3% of the nodes. Finally, the load is fairly balanced if we adopt the random selection method. In the future, we will focus our attention on improving the security of the CTPH algorithm and the lookup efficiency of iCTPH. CTPH cannot defend against the duplication attack: the trigger value selected by the CTPH algorithm depends on the input length, so a malicious user could generate dissimilar digests of one mail by duplicating the original mail text different numbers of times. Although the author of the
CTPH algorithm has taken some measures to handle this problem, it does not resolve the problem fundamentally. Meanwhile, the lookup efficiency of iCTPH is not yet optimal: because of the scattered distribution of the digests, many nodes have to be traversed in order to recall the "remote" digests. We hope to find solutions that avoid traversing so many nodes.
References
1. Kornblum, J.: Identifying almost identical files using context triggered piecewise hashing. Digital Investigation 3(s1), 91–97 (2006)
2. Stoica, I., Morris, R., et al.: Chord: a scalable peer-to-peer lookup service for internet applications. In: Proc. of ACM SIGCOMM, USA, pp. 149–160 (2001)
3. Jagadish, H.V., Ooi, B.C., Tan, K.-L., et al.: iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems 30(2), 364–397 (2005)
4. SpamNet: http://razor.sourceforge.net
5. DCC: http://www.rhyolite.com/anti-spam/dcc
6. Mo, G., Zhao, W., et al.: Multi-agent Interaction Based Collaborative P2P System for Fighting Spam. In: Proc. of the IEEE/WIC/ACM International Conference on Intelligent Agent Technology, pp. 428–431. IEEE Computer Society, USA (2006)
7. Zhang, J., Lu, H., Lan, X., et al.: DHTnil: An Approach to Publish and Lookup Nilsimsa Digests in DHT. In: Proc. of the 10th IEEE International Conference on High Performance Computing and Communications, pp. 213–218. IEEE Computer Society, USA (2008)
8. Aurenhammer, F.: Voronoi diagrams - a survey of a fundamental geometric data structure. ACM Computing Surveys 23(3), 345–405 (1991)
Toward a Framework for Cloud Security

Michael Brock and Andrzej Goscinski

School of Information Technology, Deakin University, Pigdons Road, Waurn Ponds, Australia
{mrab,ang}@deakin.edu.au
Abstract. While the emergence of cloud computing has made it possible to rent information technology infrastructures on demand, it has also created new security challenges. The primary security concern is trusting data (or resources in general) on another organization's system. This paper examines the current state of security in cloud computing and presents a set of challenges to address the security needs of clouds. The end result is a framework to help the design and implementation of effective cloud security infrastructures. Keywords: Cloud computing; cloud security; security evaluation; security concepts; security models.
1 Introduction

Cloud computing is the result of combining technologies such as the Service Oriented Architecture (SOA), Internet technologies (mainly Web services [1]), and virtualization [2]. To protect the clouds, the clients and the services hosted on clouds, Service Level Agreements are used to form legal agreements between all parties. The end result is a (Web) service-based, scalable, Internet-accessible distributed system that supports any client request regardless of hardware and software configuration. While cloud computing has made resources accessible, it has also immediately made them vulnerable to intruder attacks. This challenge stems primarily from sharing, virtualization and the use of Web services. While vendors have concentrated their efforts on improving performance and scalability, cloud security has been neglected. Security for clouds is important because more than just data is kept on clouds: the resources could be services themselves that take data and perform processing, or complete business-logic workflows where multiple services are used in a specified order. Since cloud computing is an instantiation of distributed computing, it brings its own inherent set of security problems (in particular data privacy and access control). These problems belong to three basic dimensions of cloud security: resource protection, which is strongly associated with identity administration and user provisioning; communication and storage security; and authentication and, following it, authorization. Before addressing security in clouds, what is needed is a framework of ideas and generic concepts so that security can be implemented to support individual clouds. In this paper, we focus on the creation of the Cloud Security Framework (CSF). To achieve this goal, we characterize the security problems of clouds, evaluate the security of current cloud environments, present current security countermeasures, and propose the framework.
2 Major Cloud Security Problems

Clouds are distributed systems where all resources are virtualized and offered via reusable services that are accessible over the Internet. While there are many cloud offerings, e.g., Amazon [3], Google [4], Azure [5], Salesforce [6], all can be placed in one of three categories. Infrastructure as a Service (IaaS) clouds offer very basic resources, specifically server virtualization [2] and data storage. Platform as a Service (PaaS) clouds offer complete hardware and software configurations. Software as a Service (SaaS) clouds offer complete software systems. When assessing cloud security, the underlying category of the cloud has to be considered: for example, how security is judged for a SaaS cloud differs from that for an IaaS cloud, because what services are offered, and how, differ significantly. A cloud, since it is an instantiation of a distributed system, is subject to security attacks. An intruder can intercept messages; actively insert messages into a connection; impersonate (spoof any field in a packet); hijack an ongoing connection between a legitimate client and a service; and carry out denial of service. Intruders can access and carry out operations on resources (e.g., data) despite not having the rights to do so. Since clouds exploit sharing, shared services (programs) can leak information. Currently, cloud service providers require clients accessing their cloud services to provide automatically generated passwords and a credit card number; this is not a secure access control solution. In traditional systems it is possible to apply strong controls to enforce policies over authorized access, authentication, confidentiality and integrity. The situation is more complicated in clouds: as a practical matter the client does not know the location of their data, the server that is performing the computation, the routes to servers, or even where data are stored, because the providers' systems react dynamically to the changing requests of clients and to changing clients. Protection is much more complex when applications are shared, because attackers can exploit information leakage; this implies that SaaS could be the subject of attacks that lead to information leakage. It has been demonstrated recently, using the Amazon EC2 service as a case study, that sharing combined with machine virtualization leaves clouds insecure [7]. The authors showed that it is possible to map the internal cloud infrastructure, identify a likely location of a particular target VM, and then initiate new VMs until one is placed co-resident with the target. This placement can be used "to mount cross-VM side channel attacks to extract information from a target VM on the same machine". It remains open whether such attacks could be performed within PaaS and SaaS clouds. Clouds can be located and managed within different organizations and in different countries; thus data have to satisfy different compliance regulations, the access policies of organizations differ, and the mechanisms used can be completely different. The problem of security at the management level adds one more dimension to cloud security. In summary, the following dimensions of security require addressing to provide cloud security: network security, data security, virtualization security, and management security.
3 Major Cloud Security Evaluation

The purpose of this section is to examine the security measures of the best-known clouds. The outcome of this section, together with that of Section 4, is used in Section 5, where all the problems in clouds are tallied and possible solutions proposed.

Security in EC2: Amazon's Elastic Compute Cloud (EC2) [3] is an IaaS cloud that allows clients to create, upload and run their own virtual machines, preloaded with the required software. As it is the client that places software in the virtual machines, EC2 becomes a security problem: Amazon does not take responsibility for the services that run inside the virtual machines. Amazon has proposed to move toward a stronger security approach such as the one-time token device [8]. A company that uses such a device also has monitoring and governance tools, including federated identity management, activity tracking, and remote control of authentication systems; the problem lies in revoking issued tokens. In general, security only goes as far as Amazon's own infrastructure. The protection of the data and software inside the virtual servers falls solely on EC2 clients; moreover, it has to be coded into the services, making policy updates extremely difficult.

Security in App Engine: Information about how an App Engine service can be secured is given in the code deployment documentation [9]. To secure a service, security information has to be specified in an XML configuration file used when the service is placed on App Engine. The problem with this solution is that App Engine only offers authentication against the Google Accounts service. This approach is simple username and password authentication; it is just as easy to break as it is to use. The granularity of security in App Engine is very coarse: security is only carried out on a per-service basis, and it is unclear whether selected elements of service functionality can be allowed or denied to clients. Resources behind App Engine services have to be secured by the services themselves; we could not find how to apply security to service resources.

Security in Azure: When it comes to security, Azure [5] is better equipped than all other clouds. A PaaS cloud, Azure allows clients to create and then host their own services. Azure offers a security service that allows service authors to decide which clients can access their services and how [10]. The security mechanism is based on the Security Assertion Markup Language (SAML): when requesting access to services, clients state a set of claims that identify themselves. The claims are issued by identity providers, i.e., services that are responsible for generating claims and signing them so they can be authenticated. To address differences between identity providers, Azure has built-in services that convert unknown claims into a readable form. Before a service in Azure is used, the claims are first authenticated; if the claims are correct, the client request is processed by the service, otherwise it is blocked. The disadvantage of the security system in Azure is that claim verification has to be invoked programmatically: Azure services have to be coded with calls to the Azure Access Control service, and the matching of functionality to claims needs to be coded into the service as well. While security exists in Azure and allows the service developer to set the policies, it is a manual process.
4 Current Security Countermeasures

The purpose of this section is to summarize current security models and countermeasures.

4.1 Resource Protection

The problem is which mechanisms are best suited to clouds. While clouds only offer resources via services, the underlying resources themselves have to be protected. As the services are accessible by clients, the services themselves can be compromised; security should therefore start with the resource and then work towards the service. Resource protection can be implemented as: (i) discretionary systems – access to a resource can be granted or denied to any client at the discretion of the service provider; one of their most critical weaknesses is that they do not take into consideration the semantics of stored data and client clearance; and (ii) non-discretionary systems – access to a resource can be granted or denied based on the classification of the data or application and the clearance of a client.

Access Control Matrix: One of the most commonly used security approaches is the Access Control Matrix (ACM) [11]. As a matrix is used, granting, revoking and determining access rights are easy. The problem with ACMs is that they do not work well in distributed environments because they impose centralization. In response, ACMs are implemented in two ways, by decomposing the matrix either by rows, which leads to Access Control Lists (ACLs), or by columns, which leads to Capabilities. Both approaches have advantages and disadvantages. With ACLs, each resource in a system has a list of services and, for each service, a set of rights it can exercise. ACLs are coarse grained and only go as far as the whole resource, but rights are easy to revoke. Secure systems that use capabilities assign rights to the clients, or to services acting on behalf of clients. In general, a user capability identifies a resource and the rights that exist for it. Capabilities are like a 'mirror's view' of ACLs: the advantages become disadvantages and vice versa. If a user leaves the environment, the subject is able to take any allocated capabilities with it, which makes rights revocation difficult.

Attribute Based Access Control: Another security model is Attribute Based Access Control (ABAC) [12]. ABAC differs significantly from the previous approaches in that attributes are allocated to services and resources, and rights are implied via policies. For services, attributes such as name and role are assigned; for resources, their characteristics, owner and domain are allocated. When a service attempts to perform an operation, the attributes of both the service and the resource are compared to each other and the operation is allowed if the policy rules are satisfied. However, as attributes are used, authenticating the attributes becomes an issue.

Information Flow Control: Access rights should only be granted taking into consideration the semantics of stored classified data and user clearance. A possible implementation of this model, which is an extension to capabilities (called trusted capabilities) and modified access control lists, is proposed in [13]. A clearance capability is a trusted identifier with additional redundancy for protection, and contains a security
level to provide the clearance of a client (a user or another service) to access certain classes of information. After receiving a capability, the following comparisons are made. First, the content of the clearance field is compared to the classification of the requested resource. If they are consistent, the protection state is considered secure; otherwise, the requesting client is refused access. Next, if the protection state is secure, the access rights field is compared against the requested operation to determine whether the requested access conforms to the mandatory and discretionary policy. If yes, the requested resource can be accessed.

4.2 Communication and Storage Security

Resources within clouds can vary from stored information to complete business workflows. Securing data (or any form of storage) in a shared environment is complicated, as the services have to know who the clients are. Encryption provides protection for stored data, but it leads to high costs whenever an operation is performed on these data.

Secure data communication is also a problem. Even if data security in clouds is solved, the communication path between the client and the cloud, and between the cloud and the target data service, has to be protected. During transfer, the confidentiality and integrity of data must be ensured. Transport Layer Security (TLS) and its predecessor Secure Sockets Layer (SSL), together with HTTPS, are cryptographic protocols that provide security for communications over networks such as the Internet, and as such are directly applicable to clouds.

Cryptographic systems belong to one of two classes: symmetric cryptosystems and asymmetric cryptosystems. While symmetric systems are simple, their key management is easily compromised; this article therefore focuses on asymmetric systems. An asymmetric key cryptosystem (AKC) is based on two keys: a private key and a public key, the latter publicly known. Encryption is thereby separated from decryption: if data is encrypted under an AKC using either the public or the private key, the opposite key is required to decrypt the data, and the use of either key is one way and cannot reverse the encryption process. However, as public keys are publicly available, an intruder can still compromise communication through a man-in-the-middle attack. To address this, trusted authorities, Certificate Authorities (CAs), are used. CAs act as an independent registry and are used to verify that a public key does belong to a given person.

4.3 Authentication

In clouds it is necessary to validate the identity of services, service providers and cloud clients. Login name and password authentication (i.e., single-factor authentication) is not strong enough to provide secure authentication. In response, two-factor authentication was proposed: a process where a client proves their identity with two of the three methods "something you know" (e.g., a password), "something you have" (e.g., a token or smartcard), and "something you are" (e.g., a fingerprint). Two-factor authentication could be too difficult to implement if two-way authentication is needed, that is, when clients want to authenticate cloud services, or when the cloud services of a workflow have to authenticate one another. For these purposes strong, encryption-based authentication such as signing is needed. With signing, the message itself is
not encrypted; only a hash of the message is encrypted. When the receiver receives the message, the hash is generated again and the transmitted hash is decrypted with the sender's public key. If the two hashes match, the message is authenticated. Kerberos [14] is an interesting approach because credentials, such as usernames and passwords, are never transmitted. Kerberos uses symmetric cryptography, specifically the encryption of tokens called tickets. To allow users to access services in remote domains (other than the one the user exists in), it is possible to share keys between Kerberos servers so that they trust each other. Overall, Kerberos provides a very powerful and secure infrastructure for distributed environments, and it suits cloud environments because it is itself distributed.
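To make the hash-then-sign scheme above concrete, here is a minimal Python sketch using the third-party cryptography package; the key pair, the message, and the padding choice are illustrative assumptions and not part of the original text.

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    # The sender owns the private key; the receiver needs only the public key.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()

    message = b"request: start VM instance"  # hypothetical message

    # Sign: only a SHA-256 hash of the message is encrypted with the private
    # key; the message itself stays in the clear.
    signature = private_key.sign(
        message,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                    salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )

    # Verify: the receiver recomputes the hash and checks it against the
    # signature; verify() raises InvalidSignature on a mismatch.
    public_key.verify(
        signature,
        message,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                    salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )
    print("message authenticated")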
5 Cloud Security Framework Proposal

An analysis of the cloud security problems and of the current state of security of the best known clouds shows that these problems do not have any comprehensive solution and that existing cloud security is in its infancy. All of this is despite the existence of the excellent security models, countermeasures, and systems summarized in Section 4. There is a need for an approach to cloud security that is holistic, adaptable, and reflects client requirements. As a starting point for the development of such an approach, we propose to set up the framework taking into consideration: (i) Cloud Infrastructure Protection – by providing access control to protect against security threats; (ii) Communication and Storage Security against passive and active attacks – by providing encryption; and (iii) Authentication and Authorization – to make sure that only authenticated and authorized clients can be provided with cloud services. The purpose of this section is to present the Cloud Security Framework (CSF). Specifically, this section looks at the requirements of the framework and how it is designed to add security to clouds regardless of their underlying category.

5.1 Framework Requirements

First, the CSF has to be service based. While there are various categories of clouds, from IaaS to SaaS, all clouds offer resources via services. Second, the CSF has to use the non-discretionary model so that resource and service semantics are considered. For services, their functionality has to be clearable, so that providers can control what functionality is usable by which client; for resources, especially data, information has to be classifiable. Third, the CSF has to be capable of assigning clearance to clients, i.e., to users and services. At this stage it is proposed that clearance be provided to users by their employers; if users are self-employed, they can obtain clearance based on their employment history from a cloud service provider. Services share clearance with their providers. Fourth, a single sign-on environment should be provided for each cloud. Users and services should not be forced to request access clearance for each individual service of a cloud – a must, as the offered resource could be a complete workflow. Fifth, the security method has to be transcendable: if a client has obtained clearance to use services in one cloud, the same clearance should be usable in another cloud. The reason is that a clearance is obtained from a user's employer. Sixth, communication among clients should be encrypted – only the initial request may be sent in the clear – and communicating entities should authenticate each other when a cloud session is initiated.
5.2 Framework Logical Design

The purpose of this subsection is to present the logical design of the CSF. We also present a simple case to demonstrate how our framework operates. The CSF is influenced by the Information Flow Control model and by Kerberos. The CSF has two main elements, a Gateway Server (GS) and a Single Sign-on Access Token (SSAT). GSs are hosted in clouds and manage the security of their host clouds. An SSAT is a time-limited, non-forgeable and non-transferable entity that is granted to cloud clients; it is constructed and used according to the Information Flow Control model. The token identifies the client, the services the client wishes to use, and also verification tokens that prove the SSAT itself is valid. Only the intended client can use the token, and once it expires it cannot be reused; this addresses the problem of revoking rights. To ease management, the classification of services, and of the resources behind them, is inherited from their providers.

Fig. 1 presents a simple example where our CSF is used with a single client and a series of services that exist in multiple clouds. Before a client can use any service in the cloud, access to the cloud has to be granted. To do this, the client contacts the Gate Keeper (GK) service in the GS (1 in Fig. 1). Communication with the GK (or any other service) uses Transport Layer Security to protect against eavesdropping attacks. For simplicity, all the services provided in this example are from the same provider.
Fig. 1. Proposed Security Model and Workflow
The outcome of (1) is that the client only has enough clearance to communicate with the Gateway Server itself. To use the cloud, the client has to request additional clearance by contacting the Clearance Broker (CB) (2). The reason for using a CB is that clouds themselves are very dynamic: depending on the cloud, a service could exist as multiple instances to support client demand, and services within a cloud might migrate between physical servers.
To address the changing state of clouds, requests to the CB indicate the types of services the client wishes to use, and the CB attempts to allocate to the client clearances for specific service instances, no matter where in the cloud they exist. To support the CB, we plan to incorporate the Dynamic Broker of the RVWS framework [15, 16]. The Broker is an attribute-based publication, discovery and selection service for clouds. Being attribute based, it makes it possible to use it to store access information (like the access control list for trusted capabilities), thus making it easier to develop and operate the CB.

If clearance is granted, it is returned to the client as a Single Sign-on Access Token (SSAT). In relation to the Information Flow Control model, our SSAT lists, in the rights field of the trusted capabilities, all services the client wishes to use, together with clearance sufficient to access the requested services. A slight change made to the trusted capability is that it carries a defined time period during which the SSAT can be used. If the client attempts to use the SSAT outside of its allocated time period, the attempt is blocked.

Upon getting an SSAT, the client can now make use of the services, specifically Service 1.1 (3). When the client accesses Service 1.1, Service 1.1 ensures the SSAT has been verified. If there is no verification in the SSAT, Service 1.1 contacts the Clearance Verifier (CV) (3.0). This step is a precaution against SSAT forging. If the CV reports back that the Gateway Server did not generate the SSAT, the request is blocked. If the SSAT is examined and proved valid, the CV attaches a verification token to the SSAT. Service 1.1 can then start processing the client request.

To give a full account of this example, our service makes use of a local service and of a remote service in another cloud. During processing, Service 1.1 requires the use of Service 1.2 and thus makes a request to it with the client's SSAT (3.1). Upon receiving the request, Service 1.2 starts processing, as the SSAT has already been verified by the CV. During processing, Service 1.2 eventually requires the use of Service 2.1. The problem is that Service 2.1 exists in another cloud, and thus in another security domain. Before using Service 2.1, Service 1.2 needs to get clearance, in particular an SSAT for Cloud 2 and its services. Thus, Service 1.2 acts on the client's behalf and contacts Cloud 2's Gateway Server (3.1.1). When contacting the GS on Cloud 2, the client's SSAT of Cloud 1 is presented, so the GS does not need to query the client for identifying information (it is already in the SSAT). When clearance is granted, trusted capabilities for services in Cloud 2 are added to the SSAT. This satisfies the framework's requirement that clearance should be transcendable. While capabilities are added, it is expected that they are removed once processing within Cloud 2 is complete; this prevents the client from accessing other services it was not originally cleared to use. After getting the required capabilities, Service 1.2 is able to make use of Service 2.1 (3.1.2). Once the processing is complete, Service 2.1 returns the outcome to Service 1.2. Service 1.2 in turn returns the response to Service 1.1 (3.2), and finally the complete result is returned to the client (4).
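As an illustration of what an SSAT could carry, the following Python sketch models the token fields named above (client identity, trusted capabilities in the rights field, verification tokens, and the allocated time period). The field names and the expiry check are our assumptions; the paper does not define a concrete data layout.

    import time
    from dataclasses import dataclass, field

    @dataclass
    class TrustedCapability:
        service: str        # service the client is cleared to use, e.g. "Service 1.1"
        clearance: int      # clearance level in the Information Flow Control sense
        rights: frozenset   # operations permitted on the service

    @dataclass
    class SSAT:
        client_id: str
        capabilities: list = field(default_factory=list)
        verification_tokens: list = field(default_factory=list)  # attached by the CV
        not_before: float = 0.0
        not_after: float = 0.0

        def usable(self, now=None):
            """The token cannot be used outside its allocated time period."""
            now = time.time() if now is None else now
            return self.not_before <= now <= self.not_after

    token = SSAT(client_id="client-1",
                 capabilities=[TrustedCapability("Service 1.1", 2, frozenset({"read"}))],
                 not_before=time.time(), not_after=time.time() + 3600)
    assert token.usable()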
6 Conclusions and Future Work

An analysis of the cloud security problems and of the current state of security of the best known clouds shows that these problems do not have any comprehensive solution. All of this is despite the existence of excellent security models, countermeasures,
and systems. In response, we proposed in this paper the Cloud Security Framework (CSF), which shows similarities to the Information Flow Control approach that uses trusted capabilities, and borrows some elements from Kerberos. Through the use of a Gateway Server, our CSF is designed to grant clients time-based clearances to access classified services, to protect against clearance forgery, and to allow access to services in remote clouds on behalf of clients. In the future we will focus mainly on refining the CSF so that it makes use of the powerful publication, discovery and selection features of the Dynamic Broker. Once the detailed CSF design is complete, we plan to implement the CSF in a cloud-like environment and test its tolerance to attacks.
References

[1] Papazoglou, M.: Web Services: Principles and Technology. Prentice Hall, Englewood Cliffs (2008) ISBN 978-0321155559
[2] TechTarget: What is server virtualization? (updated August 14, 2008), http://searchservervirtualization.techtarget.com/sDefinition/0,,sid94_gci1032820,00.html# (accessed August 6, 2009)
[3] Amazon: Amazon Elastic Compute Cloud (2007), http://aws.amazon.com/ec2/ (accessed August 1, 2009)
[4] Google: App Engine (2009), http://code.google.com/appengine/ (accessed February 17, 2009)
[5] Microsoft: Azure (2009), http://www.microsoft.com/azure/default.mspx (accessed May 5, 2009)
[6] Salesforce: CRM - salesforce.com (2009), http://www.salesforce.com/
[7] Ristenpart, T., et al.: Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds. In: Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS'09, Chicago, Illinois, November 9-13 (2009)
[8] Brooks, C.: Amazon adds one-time password token to entice the wary. SearchCloudComputing (2009), http://searchcloudcomputing.techtarget.com/news/article/0,289142,sid201_gci1367923,00.html# (updated September 11, 2009, accessed October 8, 2009)
[9] Google: The Deployment Descriptor: web.xml (2009), http://code.google.com/appengine/docs/java/config/webxml.html (accessed November 20, 2009)
[10] Chappell, D.: Introducing the Azure Services Platform, White Paper. David Chappell & Associates (May 2009), http://download.microsoft.com/download/F/C/B/FCB07D64-7D1F4776-8C65-02C266F71C7/Introducing_Azure_Services_Platform_v1.pdf
[11] Goscinski, A.: Resource Protection. In: Distributed Operating Systems: The Logical Design, pp. 585–649. Addison-Wesley, Reading (1991)
[12] Yuan, E., Tong, J.: Attributed based access control (ABAC) for Web services. In: IEEE International Conference on Web Services, ICWS 2005, Proceedings, p. 569 (2005)
[13] Goscinski, A., Pieprzyk, J.: Security in Distributed Operating Systems. Datenschutz und Datensicherung (5) (1991)
[14] Neuman, B.C., Ts'o, T.: Kerberos: an authentication service for computer networks. IEEE Communications Magazine 32(9), 33–38 (1994)
[15] Brock, M., Goscinski, A.: Attributed Publication and Selection for Web Service-based Distributed Systems. In: Proc. of the 3rd Int. Workshop on Service Intelligence and Computing (SIC 2009) with the 7th IEEE Int. Conf. on Web Services (ICWS 2009), Los Angeles, CA, USA, pp. 732–739 (2009)
[16] Brock, M., Goscinski, A.: A Technology to Expose a Cluster as a Service in a Cloud. In: 8th Australasian Symposium on Parallel and Distributed Computing (AusPDC 2010), Brisbane, Australia (2010)
Cluster-Fault-Tolerant Routing in Burnt Pancake Graphs Nagateru Iwasawa, Tatsuro Watanabe, Tatsuya Iwasaki, and Keiichi Kaneko Graduate School of Engineering Tokyo University of Agriculture and Technology Koganei-shi, Tokyo, Japan [email protected]
Abstract. This paper proposes a routing algorithm in an n-burnt pancake graph Bn, which is a topology for interconnection networks, with at most n − 1 faulty clusters whose diameters are at most 3. For an arbitrary pair of non-faulty nodes, the proposed algorithm constructs a fault-free path of length at most 2n + 10 between them in O(n²) time complexity. Keywords: Cluster Faults, Routing Algorithm, Polynomial Algorithm, Dependability, Interconnection Network, Disjoint Paths.
1 Introduction

Recently, with the rapid development of parallel computers, many interconnection networks have been proposed and studied [9]. An n-burnt pancake graph Bn is a variant of the Cayley graphs. Similar to a star graph or a pancake graph, Bn has nice symmetric and recursive structures. It can also connect a number of nodes different from those of a star graph, a pancake graph, and so on. Moreover, it is promising because it can connect many nodes relative to its small diameter and degree. Hence, there are many research activities with respect to it [1,3,4,5,10]. A routing algorithm in Bn was proposed by Cohen and Blum [1].

Meanwhile, fault tolerance is an important research field for interconnection networks [8]. In an interconnection network with many nodes, algorithms that are tolerant of faulty elements to some degree are necessary to make systems stable. A faulty cluster is a connected subgraph consisting of faulty nodes. A cluster-fault-tolerant algorithm for the star graph was proposed by Gu et al. [7], where the diameters of the faulty clusters are at most two and the number of faulty clusters is at most n. A cluster-fault-tolerant algorithm for the pancake graph was proposed by Kaneko et al. [6], where the diameters of the faulty clusters are at most two and the number of faulty clusters is at most n − 2. This paper proposes a cluster-fault-tolerant routing algorithm in Bn with faulty clusters; the algorithm constructs a path avoiding the faulty clusters.
2 Preliminaries

This section introduces requisite lemmas as well as the structure of Bn and its properties.
Definition 1. If a sequence u = (u1, u2, . . . , un) satisfies the condition {|u1|, |u2|, . . . , |un|} = ⟨n⟩, where ⟨n⟩ = {1, 2, . . . , n}, it is called a signed permutation of ⟨n⟩.

Definition 2. For a signed permutation u = (u1, u2, . . . , un) of ⟨n⟩ and an integer i (∈ ⟨n⟩), the signed prefix reversal operation u(i) is defined as follows: u(i) = (−ui, −ui−1, . . . , −u1, ui+1, . . . , un). In the rest of this paper, −i and (u(i,...,j))(k) are denoted by ī and u(i,...,j,k), respectively, to save space.

Definition 3. Bn is an undirected graph with n! × 2^n nodes. Each node is represented by a distinct signed permutation of n integers, and it is adjacent to exactly the nodes that belong to {u(i) | 1 ≤ i ≤ n}. Bn is a simple and symmetric graph whose degree and connectivity are both n. So far, no polynomial-time shortest-path routing algorithm is known for Bn. However, Cohen and Blum proposed a routing algorithm that constructs a path of length at most 2n in O(n²) time complexity [1].

Definition 4. A subgraph of Bn that is induced by the nodes that have k in the rightmost position of their signed permutations is isomorphic to Bn−1. The subgraph is called a sub burnt pancake graph and is denoted by Bn−1(k), specifying k as its index. Bn is decomposable into 2n mutually disjoint Bn−1's. Figure 1 shows an example of B3, which has 6 mutually disjoint B2 structures. In the figure, a signed permutation (u1, u2, . . . , un) is denoted by u1, u2, . . . , un for simplicity.

Fig. 1. An Example of B3

For two distinct nodes u and v in a graph G(V, E), an alternating sequence of nodes and edges a0, a0 → a1, a1, a1 → a2, . . ., ak → ak+1, ak+1, where a0 = u and ak+1 = v, is called a path from u to v. The length of a path P is the number of edges included in P. The distance between two nodes u and v is denoted by d(u, v) and is defined as the length of the shortest path between them. In the rest of this paper, a path from a node u to a node v is sometimes denoted by u ⇒ v if it does not cause confusion. In addition, an edge between two adjacent nodes s and t is sometimes denoted by s → t.

Definition 5. A connected subgraph in a graph is called a cluster. If all the nodes in a cluster C are faulty, C is called a faulty cluster. In this paper, we focus on n − 1 faulty clusters Ci (1 ≤ i ≤ n − 1) whose diameters are at most 3 in Bn. Let F represent the set of faulty clusters {C1, C2, . . . , Cn−1}.
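To make Definitions 2 and 3 concrete, the following short Python sketch implements the signed prefix reversal and the resulting neighbourhood of a node in Bn; encoding signed elements as Python integers is our own choice, not part of the paper.

    def reversal(u, i):
        """Return u(i): reverse the first i elements of u and negate each of them."""
        return tuple(-x for x in reversed(u[:i])) + u[i:]

    def neighbours(u):
        """The n nodes adjacent to u in Bn (Definition 3)."""
        return [reversal(u, i) for i in range(1, len(u) + 1)]

    u = (3, -1, 2)             # a node of B3
    print(reversal(u, 2))      # (1, -3, 2)
    print(len(neighbours(u)))  # 3: Bn is n-regular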
A center of a graph G(V, E) is a node c ∈ V that minimizes Σ_{v∈V} d(c, v). Let γ(G) denote the set of centers of a graph G. Each faulty cluster Ci in Bn has at most 2 centers, which are denoted by c(i,1) and c(i,2).

Definition 6. If a sub burnt pancake graph Bn−1(k) in Bn does not include any center of a faulty cluster and contains at most n − 2 faulty nodes, the sub burnt pancake graph is called a candidate sub burnt pancake graph and is denoted by CBn−1(k).

Definition 7. The set of the nodes that have j (1 ≤ |j| ≤ n) in the left-most positions of their signed permutations and i (1 ≤ |i| ≤ n) in the right-most positions is called a port set from Bn−1(i) to Bn−1(j). The port set is denoted by P(i, j) (j ≠ ±i).

Definition 8. For an arbitrary node s = (s1, s2, . . . , sn) in Bn, we say that si and si+1 are adjacent if they satisfy the condition

    si+1 = 1 if si = n;  si+1 = −n if si = −1;  si+1 = si + 1 otherwise (si ≠ −1, n).

The maximal runs of successively adjacent elements are called blocks. If a node has multiple blocks, we can make at least one pair of them adjacent by at most two signed prefix reversal operations. Hence, routing between two arbitrary nodes in Bn is reducible to routing in Bn−1 by at most two operations. Therefore, routing from a node with j blocks to the node of the sorted sequence is reducible to routing in Bj [10].
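The following sketch counts the blocks of Definition 8. Since the overbars of the original adjacency condition were lost in typesetting, the wrap-around cases in the code follow our reconstruction above and should be read as an assumption.

    def adjacent(a, b, n):
        # Our reading of Definition 8: the successor is a + 1, with two wrap cases.
        if a == n:
            return b == 1
        if a == -1:
            return b == -n
        return b == a + 1

    def blocks(s):
        """Number of maximal runs of successively adjacent elements in s."""
        n = len(s)
        count = 1
        for i in range(n - 1):
            if not adjacent(s[i], s[i + 1], n):
                count += 1
        return count

    print(blocks((1, 2, 3, 4)))    # 1: the sorted node is a single block
    print(blocks((-2, -1, 3, 4)))  # 2: the runs '-2, -1' and '3, 4'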
3 Algorithm

In this section, we show the main theorem together with several lemmas, some of which are stated without proofs. The algorithm for cluster-fault-tolerant routing is then introduced.

Lemma 1. For two distinct nodes u and v in a port set P(l, m) (1 ≤ |l|, |m| ≤ n, |m| ≠ |l|), the distance d(u, v) between them is at least 3.

Lemma 2. In Bn, there is no cycle whose length is less than 8.

Lemma 3. In Bn, for a node u = (u1, u2, . . . , un) and a sub burnt pancake graph Bn−1(k) where k ≠ |u1|, |un|, we can construct n disjoint paths Qi (1 ≤ i ≤ n) of length at most 4 from u to n distinct nodes in Bn−1(k) that include only nodes in Bn−1(un), Bn−1(k) and Bn−1(ū1), in O(n²) time complexity.

Proof: We give a constructive proof for this lemma by showing n disjoint paths that pass only through Bn−1(un), Bn−1(k), and Bn−1(ū1). The proof is divided into the two cases k = ul and k = ūl.

Case 1 (k = ul):

    Qi = u → u(i) → u(i,l) → u(i,l,n) ∈ Bn−1(ul)  (1 ≤ i < l)
    Qi = u → u(i) → u(i,n) ∈ Bn−1(ul)  (i = l)
    Qi = u → u(i) → u(i,i−l+1) → u(i,i−l+1,1) → u(i,i−l+1,1,n) ∈ Bn−1(ul)  (l < i ≤ n − 1)
    Qi = u → u(n) (∈ Bn−1(ū1)) → u(n,n−l+1) → u(n,n−l+1,1) → u(n,n−l+1,1,n) ∈ Bn−1(ul)  (i = n)
Case 2 (k = ūl):

    Qi = u → u(i) → u(i,l) → u(i,l,1) → u(i,l,1,n) ∈ Bn−1(ūl)  (1 ≤ i < l)
    Qi = u → u(l) → u(l,1) → u(l,1,n) ∈ Bn−1(ūl)  (i = l)
    Qi = u → u(i) → u(i,i−l+1) → u(i,i−l+1,n) ∈ Bn−1(ūl)  (l < i ≤ n − 1)
    Qi = u → u(n) (∈ Bn−1(ū1)) → u(n,n−l+1) → u(n,n−l+1,n) ∈ Bn−1(ūl)  (i = n)

Each path in both cases can be constructed in O(n) time complexity. Consequently, the n disjoint paths can be constructed in O(n²) time complexity.

Here, we show an example for u = (6, 4, 5, 2, 3, 1), l = 4, and k = 2̄ in B6, which falls under Case 2. Then the following disjoint paths can be constructed: Q1: u → (6̄, 4, 5, 2, 3, 1) → (2̄, 5̄, 4̄, 6, 3, 1) → (2, 5̄, 4̄, 6, 3, 1) → (1̄, 3̄, 6̄, 4, 5, 2̄); Q2: u → (4̄, 6̄, 5, 2, 3, 1) → (2̄, 5̄, 6, 4, 3, 1) → (2, 5̄, 6, 4, 3, 1) → (1̄, 3̄, 4̄, 6̄, 5, 2̄); Q3: u → (5̄, 4̄, 6̄, 2, 3, 1) → (2̄, 6, 4, 5, 3, 1) → (2, 6, 4, 5, 3, 1) → (1̄, 3̄, 5̄, 4̄, 6̄, 2̄); Q4: u → (2̄, 5̄, 4̄, 6̄, 3, 1) → (2, 5̄, 4̄, 6̄, 3, 1) → (1̄, 3̄, 6, 4, 5, 2̄); Q5: u → (3̄, 2̄, 5̄, 4̄, 6̄, 1) → (2, 3, 5̄, 4̄, 6̄, 1) → (1̄, 6, 4, 5, 3̄, 2̄); and Q6: u → (1̄, 3̄, 2̄, 5̄, 4̄, 6̄) → (2, 3, 1, 5̄, 4̄, 6̄) → (6, 4, 5, 1̄, 3̄, 2̄).
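As a check on the Case 2 construction, the following sketch rebuilds the six paths of the example (assuming the all-positive node u, since the original signs were lost in print) and verifies the properties claimed by Lemma 3.

    def reversal(u, i):
        return tuple(-x for x in reversed(u[:i])) + u[i:]

    def case2_paths(u, l):
        """The n disjoint paths of Case 2 (k = -u_l) in the proof of Lemma 3."""
        n = len(u)
        paths = []
        for i in range(1, n + 1):
            if i < l:
                ops = [i, l, 1, n]
            elif i == l:
                ops = [l, 1, n]
            elif i < n:
                ops = [i, i - l + 1, n]
            else:
                ops = [n, n - l + 1, n]
            path, v = [u], u
            for op in ops:
                v = reversal(v, op)
                path.append(v)
            paths.append(path)
        return paths

    u, l = (6, 4, 5, 2, 3, 1), 4
    paths = case2_paths(u, l)
    assert all(p[-1][-1] == -u[l - 1] for p in paths)  # endpoints lie in B5(2-bar)
    inner = [set(p[1:]) for p in paths]
    assert all(inner[a].isdisjoint(inner[b])
               for a in range(6) for b in range(a + 1, 6))
    print([len(p) - 1 for p in paths])                 # [4, 4, 4, 3, 3, 3]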
Lemma 4. In Bn, there are at least two candidate sub burnt pancake graphs.

Proof: Assume that the number of faulty clusters is equal to n − 1 and that their diameters are all equal to 3. Let k = |{Ci | c(i,1)(n) = c(i,2)}| (0 ≤ k ≤ n − 1). Then, |{Ci | c(i,1)(n) ≠ c(i,2)}| = n − k − 1 holds. The number of sub burnt pancake graphs that include a center of a faulty cluster is at most 2k + (n − k − 1) = n + k − 1. Hence, the number of sub burnt pancake graphs that do not include any center of a faulty cluster is at least 2n − (n + k − 1) = n − k + 1. Additionally, among these n − k + 1 sub burnt pancake graphs, there are at most two sub burnt pancake graphs that include n − k − 1 faulty nodes. Hence, if k = 0, there are at most two sub burnt pancake graphs that include n − 1 faulty nodes, and at least (n − k + 1) − 2 = n − 1 candidate sub burnt pancake graphs exist. If k ≠ 0, there are at least n − k + 1 candidate sub burnt pancake graphs, and if k = n − 1, there are exactly two candidate sub burnt pancake graphs. From the above discussion, there are at least two candidate sub burnt pancake graphs in Bn. Moreover, we can prove that if there are only two candidate sub burnt pancake graphs, they do not include any faulty node.

Lemma 5. In Bn, for a faulty cluster C and its diameter d, it takes O(n) time complexity to obtain γ(C), the set of centers of C.

Proof: If d = 0 or 1, then C = γ(C). If d = 2, then for two distinct nodes u, v ∈ C, check whether d(u, v) = 1 in O(n) time complexity. If d(u, v) = 1, select another node w ∈ C and check whether d(u, w) = 1 in O(n) time complexity; if it is equal to 1, {u} = γ(C), and otherwise {v} = γ(C). If d(u, v) ≠ 1, then d(u, v) = 2; find the shortest path u → w → v between them in O(n) time complexity, and {w} = γ(C). If d = 3, we take advantage of a property of a cluster: for a center c = (c1, c2, . . . , cn) of a cluster C and another center c(k) = (c̄k, c̄k−1, . . . , c̄1, ck+1, . . . , cn) of C, if we count how often each element occurs at the left-most positions of the faulty nodes in C, then each of the elements c1, c2, . . . , ck, c̄1, c̄2, . . . , c̄k occurs exactly once, while each of the elements ck+1, ck+2, . . . , cn occurs exactly twice. Here, we
select two distinct nodes u and v in C. Then, from the fact that d(u, v) ≤ 3, d(u, v) can be calculated in O(n) time complexity. Now, the proof is divided into three cases according to the value of d(u, v).

Case 1 (d(u, v) = 3): Construct the shortest path u → w → x → v from u to v in O(n) time complexity. Then, {w, x} = γ(C).

Case 2 (d(u, v) = 2): Construct the shortest path u → w → v from u to v in O(n) time complexity. Let x = (x1, x2, . . . , xn) be an arbitrary neighbor node of u other than w. From the above-mentioned property of a cluster, there are at most two nodes in C that have x1 as their left-most element. Hence, it is possible to check whether x ∈ C in O(n) time complexity. If x ∈ C, d(x, v) = 3 holds, and there is a path x → u → w → v of length 3; therefore, {u, w} = γ(C). Otherwise, let y be an arbitrary neighbor node of v other than w. If y ∈ C, d(u, y) = 3 holds, and there is a path u → w → v → y of length 3; hence, {v, w} = γ(C). If y ∉ C, then w ∈ γ(C). In addition, if we assume that w(k) ∈ γ(C), then from the property of a cluster we can find k in O(n) time complexity by counting the occurrences of the left-most elements of the faulty nodes in C.

Case 3 (d(u, v) = 1): For an arbitrary node w in C other than u and v, we can calculate d(u, w) and d(v, w) in O(n) time complexity. If either d(u, w) = 3 or d(v, w) = 3 holds, the situation reduces to Case 1. Otherwise, either d(u, w) = 2 or d(v, w) = 2 holds, and the situation reduces to Case 2.

Lemma 6. In Bn, for a non-faulty node u = (u1, u2, . . . , un) and a candidate sub burnt pancake graph Bn−1(k) (k ≠ |u1|, |un|), a faulty cluster of diameter d, where 1 ≤ d ≤ 3, blocks at most one of the n disjoint paths Qi (1 ≤ i ≤ n) given in the proof of Lemma 3.

Proof: From Lemmas 1 and 2, if d ≤ 2, it is easily proved that a faulty cluster cannot block multiple paths simultaneously. Hence, we assume that d = 3 and prove that a faulty cluster of diameter 3 cannot block any two of the n disjoint paths. From the definition of a candidate sub burnt pancake graph, there is no center of a faulty cluster in Bn−1(k). Therefore, it is enough to consider two cases, Case 1 where the lengths of the two paths are both 4, and Case 2 where they are 3 and 4. We further divide the cases depending on the value of k.

Case 1 (The lengths of the two paths are both 4):

Case 1-1 (k = ul): The paths of length 4 have two types: Qi: u ⇒ u(i,i−l+1,1,n) (2 ≤ l < i ≤ n − 1) and Qn: u ⇒ u(n,n−l+1,1,n). From the proof of Lemma 3, if k = ul, then 2 ≤ l < i ≤ n − 1 must hold for Qi. First, let us consider two paths Qi: u ⇒ u(i,i−l+1,1,n) and Qj: u ⇒ u(j,j−l+1,1,n) of the first type, as shown in Figure 2, where we assume that l < i < j ≤ n − 1. From Lemma 2, there cannot exist a path u(i,i−l+1) ⇒ u(j,j−l+1) of length 3. The distance between the nodes u(i,i−l+1,1) and u(j,j−l+1,1), which are obtained by reverting the signs of the left-most elements of u(i,i−l+1) and u(j,j−l+1), respectively, is also more than 3; otherwise, d(u(i,i−l+1), u(j,j−l+1)) = d(u(i,i−l+1,1), u(j,j−l+1,1)) ≤ 3 would hold. Let the non-faulty node u be u = (u1, u2, . . . , ul−1, ul, ul+1, . . . , ui−1, ui, ui+1, . . . , uj−1, uj, uj+1, . . . , un−1, un). Then the two nodes u(i,i−l+1,1) and u(j,j−l+1) are
represented by u(i,i−l+1,1) = (ūl, ul+1, . . . , ui, ūl−1, . . . , ū2, ū1, ui+1, . . . , uj, uj+1, . . . , un) and u(j,j−l+1) = (ul, ul+1, . . . , ui, ui+1, . . . , uj, ūl−1, . . . , ū2, ū1, uj+1, . . . , un). Here, we consider the five blocks 'ul', 'ul+1, . . . , ui', 'ui+1, . . . , uj', 'ūl−1, . . . , ū2, ū1', and 'uj+1, . . . , un−1, un', and map them to 1, 2, . . . , 5, respectively. Then the two nodes u(i,i−l+1,1) and u(j,j−l+1) are represented by a = (1̄, 2, 4, 3, 5) and b = (1, 2, 3, 4, 5), respectively. In this notation, at least one signed prefix reversal operation is necessary to make two elements adjacent. If there were a path a ⇒ b of length 3, some operation would have to make multiple pairs of elements adjacent; meanwhile, the right-most elements are never reversed, since they are identical. With one signed prefix reversal operation applied to a, we can obtain a(1) = (1, 2, 4, 3, 5), a(2) = (2̄, 1, 4, 3, 5), a(3) = (4̄, 2̄, 1, 3, 5), and a(4) = (3̄, 4̄, 2̄, 1, 5), where only a(1) makes two elements adjacent. However, a(1) = u(i,i−l+1), and from d(u(i,i−l+1), u(j,j−l+1)) ≥ 4 it follows that d(u(i,i−l+1,1), u(j,j−l+1)) ≥ 5. If more than one operation is required to make the left-most two elements adjacent, at least 4 operations are required in total, and hence d(a, b) ≥ 4 holds. Therefore, the distance between the two nodes u(i,i−l+1,1) and u(j,j−l+1) is at least 4.

Since u(i,i−l+1) and u(j,j−l+1,1) are obtained by reverting the signs of the first elements of u(i,i−l+1,1) and u(j,j−l+1), respectively, if d(u(i,i−l+1), u(j,j−l+1,1)) ≤ 3 held, then d(u(i,i−l+1,1), u(j,j−l+1)) ≤ 3 would also hold. However, this contradicts the above result. Hence, d(u(i,i−l+1), u(j,j−l+1,1)) ≥ 4 holds. Consequently, a faulty cluster cannot block the two paths u ⇒ u(i,i−l+1,1,n) and u ⇒ u(j,j−l+1,1,n) simultaneously. For the two paths u ⇒ u(i,i−l+1,1,n) and u ⇒ u(n,n−l+1,1,n), we can similarly prove that a single faulty cluster cannot block both of them at a time by taking j = n. Consequently, the lemma holds for Case 1-1.

Case 1-2 (k = ūl): The paths of length 4 are of the form u ⇒ u(i,l,1,n) (1 ≤ i < l). Here, we consider two paths u ⇒ u(i,l,1,n) and u ⇒ u(j,l,1,n) (i < j < l), as shown in Figure 3. By a discussion similar to that for the two paths u ⇒ u(i,i−l+1,1,n) and u ⇒ u(n,n−l+1,1,n) in Case 1-1, we can easily prove that a single cluster cannot block the two paths u ⇒ u(i,l,1,n) and u ⇒ u(j,l,1,n) at a time. Consequently, the lemma holds for Case 1-2.

Case 2 (The lengths of the two paths are 3 and 4):

Case 2-1 (k = ul): In this case, the path of length 3 is of the form u ⇒ u(i,l,n) (1 ≤ i < l). Let us consider the two paths u ⇒ u(j,j−l+1,1,n) and u ⇒ u(i,l,n) (i < l < j). From Lemma 2, there is no path of length 3 between any pair of their nodes other than the pair u(j,j−l+1,1) and u(i,l) shown in Figure 4. As in the proof of Case 1-1, it can be proved that the distance between the two nodes u(i,l) and u(j,j−l+1) is no less than 4. Hence, the lemma holds for Case 2-1.

Fig. 2. Case 1-1

Fig. 3. Case 1-2

Fig. 4. Case 2-1

Case 2-2 (k = ūl): The paths of length 3 are of the forms u ⇒ u(l,1,n), u ⇒ u(j,j−l+1,n) (l < j ≤ n − 1), and u ⇒ u(n,n−l+1,n). For the two paths u ⇒ u(i,l,1,n) and u ⇒ u(l,1,n), a faulty cluster cannot block them at a time, for a reason similar to the non-existence of a path u(i,i−l+1,1) ⇒ u(j,j−l+1,1). As in the case of u ⇒ u(i,i−l+1,1,n) and
u ⇒ u(j,j−l+1,1,n) in Case 1-1, a faulty cluster cannot block both u ⇒ u(i,l,1,n) and u ⇒ u(j,j−l+1,n). Furthermore, as in the case of u ⇒ u(i,i−l+1,1,n) and u ⇒ u(n,n−l+1,1,n) in Case 1-1, u ⇒ u(i,l,1,n) and u ⇒ u(n,n−l+1,n) cannot both be blocked by a faulty cluster. Hence, the lemma holds for Case 2-2. From the above discussion, Lemma 6 holds.

Theorem 1. In Bn, for a source node s, a destination node t, and a set of faulty nodes F where |F| ≤ n − 1, a fault-free path s ⇒ t of length at most 2n + 4 can be constructed in O(n²) time complexity [2].

Lemma 7. In Bn, for a non-faulty node u, a candidate sub burnt pancake graph Bn−1(k), and a set of faulty clusters F = {C1, C2, . . . , Cn−1} with the diameters of the clusters, we can obtain at least one fault-free path from u to Bn−1(k) in O(n²) time complexity.

Proof: From Lemma 3, n disjoint paths Qi (1 ≤ i ≤ n) from u to Bn−1(k) can be constructed in O(n²) time complexity. From Lemma 5, for all the faulty clusters Ci (1 ≤ i ≤ n − 1), we can obtain their centers c(i,1) and c(i,2) in O(n²) time complexity. Then, for c(i,1) and c(i,2), let c̃(i,1) and c̃(i,2) be the nodes obtained by reverting the signs of the elements whose absolute values are equal to |k| in the corresponding nodes. For each of the centers c(i,1) and c(i,2) and their variants c̃(i,1) and c̃(i,2), we can check in O(n) time complexity whether it is reachable from u by a path of length at most 3. If a path u → u(i) → u(i,j) → c of length at most 3 can be constructed for c ∈ {c(i,1), c(i,2)}, and u(i,j) is on the path Qi, then Qi includes the faulty node u(i,j); we can check whether u(i,j) is on the path Qi in O(n) time complexity. Otherwise, if a path u → u(i) → u(i,j) → c̃ of length 3 can be constructed for c̃ ∈ {c̃(i,1), c̃(i,2)}, u(i,j) is on the path Qi, and Qi is of length exactly 4, then Qi includes the faulty node u(i,j,1); we can check whether u(i,j) is on the path Qi in O(n) time complexity. Because Qi is of the form u → u(i) → u(i,j) → u(i,j,1) → u(i,j,1,n) (∈ Bn−1(k)), the left-most elements of u(i,j) and u(i,j,1) are k and k̄, respectively. Hence, note that, for example, the fact that u(i,j) is adjacent to c̃(i,1) is equivalent to the fact that u(i,j,1) is adjacent to c(i,1). Note also that, from Lemma 6, a cluster cannot block multiple paths. For each of c(i,1), c(i,2), c̃(i,1), and c̃(i,2), we can check in O(n) time complexity whether the corresponding cluster blocks one of the Qi's. Consequently, it takes O(n²) time complexity to detect all faulty paths among the Qi's. From the above, Lemma 7 holds.

Lemma 8. In Bn, for a source node s = (s1, s2, . . . , sn), a destination node t = (t1, t2, . . . , tn), and a set of faulty clusters F with their diameters, if there exists at least one candidate sub burnt pancake graph Bn−1(k) that satisfies k ∉ {s1, s̄n, t1, t̄n},
at least one fault-free path s ⇒ t of length at most 2n + 10 can be constructed in O(n²) time complexity.

Proof: The proof is divided into the following three cases.

Case 1 (k ∈ {s̄1, t̄1}): We can assume that k = s̄1 without loss of generality. Then, a fault-free path s → s(n) (= g) (∈ Bn−1(k)) of length 1 can be constructed in O(n) time complexity. From Lemma 7, at least one fault-free path t ⇒ h (∈ Bn−1(k)) of length at most 4 can be found in O(n²) time complexity. From Theorem 1, we can construct a fault-free path g ⇒ h of length at most 2(n − 1) + 4 = 2n + 2 in Bn−1(k) in O(n²) time complexity. Hence, a fault-free path s → g ⇒ h ⇒ t of length at most 1 + (2n + 2) + 4 = 2n + 7 can be obtained in O(n²) time complexity for Case 1.

Case 2 (k ∈ {sn, tn}): We can assume that k = sn without loss of generality. From Lemma 7, we can construct at least one fault-free path t ⇒ h (∈ Bn−1(k)) of length at most 4 in O(n²) time complexity. From Theorem 1, we can obtain a fault-free path s ⇒ h of length at most 2(n − 1) + 4 = 2n + 2 in Bn−1(k) in O(n²) time complexity. Consequently, a fault-free path s ⇒ h ⇒ t of length at most (2n + 2) + 4 = 2n + 6 can be constructed in O(n²) time complexity for Case 2.

Case 3 (Otherwise): From Lemma 7, each of the fault-free paths s ⇒ g (∈ Bn−1(k)) and t ⇒ h (∈ Bn−1(k)) of length at most 4 can be constructed in O(n²) time complexity. From Theorem 1, we can obtain a fault-free path g ⇒ h of length at most 2(n − 1) + 4 = 2n + 2 in Bn−1(k) in O(n²) time complexity. Consequently, a fault-free path s ⇒ g ⇒ h ⇒ t of length at most 4 + 4 + (2n + 2) = 2n + 10 can be obtained in O(n²) time complexity for Case 3.

From the above discussion, Lemma 8 holds.

Lemma 9. In Bn, for two non-faulty nodes s = (s1, s2, . . . , sn) and t = (t1, t2, . . . , tn), and a set of faulty clusters F with the diameters of the clusters, if there exist at most four candidate sub burnt pancake graphs Bn−1(ki), where ki ∈ {s1, s̄n, t1, t̄n}, we can construct a fault-free path s ⇒ t of length at most 2n + 10 in O(n²) time complexity.
be constructed in O(n2 ) time complexity. Assume that the path s ⇒ g(∈ Bn−1 (k1 )) can be constructed. From Lemma 6, a fault-free path t ⇒ h(∈ Bn−1 (k1 )) of length at most 4 can be constructed in O(n2 ) time complexity. From Theorem 1, a fault-free path of length at most 2n + 2 g ⇒ h can be constructed in Bn−1 (k1 ) in O(n2 ) time complexity. Therefore, a fault-free path s ⇒ g ⇒ h ⇒ t of length at most 4 + 2n + 2 + 4 = 2n + 10 can be constructed in O(n2 ) time complexity. When {k1 , k2 } = {t1 , tn }, the lemma also holds. Case 2 (i = 2 and {k1 , k2 } = {sn , tn }, {k1 , k2 } = {s1 , t1 }, {k1 , k2 } = {s1 , tn } or {k1 , k2 } = {sn , t1 }): We assume that {k1 , k2 } = {sn , tn }. From Lemma 7, the fault-free paths s ⇒ g(∈ Bn−1 (tn )), g ⇒ x(∈ Bn−1 (sn )), and t ⇒ h(∈ Bn−1 (sn )) whose lengths are at most 4 can be constructed in O(n2 ) time complexity. Since there is no faulty node in Bn−1 (sn ), we can obtain a path x ⇒ h of length at most 2(n − 1) in Bn−1 (sn ) in O(n2 ) time complexity by applying the algorithm by Cohen and Blum. Therefore, a fault-free path s ⇒ g ⇒ x ⇒ h ⇒ t of length at most 4 + 4 + 4 + 2(n − 1) = 2n + 10 can be constructed in O(n2 ) time complexity. When {k1 , k2 } = {s1 , t1 }, {k1 , k2 } = {s1 , tn } or {k1 , k2 } = {sn , t1 }, the lemma holds similarly. Case 3 (i ≥ 3): If there are three candidate sub burnt pancake graphs Bn−1 (k1 ), Bn−1 (k2 ) and Bn−1 (k3 ), either {s1 , sn } ⊂ {k1 , k2 , k3 } of {t1 , tn } ⊂ {k1 , k2 , k3 } holds. Hence, this case is reducible to Case1. Consequently, Lemma 9 holds. Theorem 2. In Bn , for a source node s, a destination node t, and a set of faulty clusters of diameters at most 3 F where |F | ≤ n − 1, a fault-free path s ⇒ t of length at most 2n + 10 can be constructed in O(n2 ) time complexity. Proof: Let s = (s1 , s2 , ..., sn ) and t = (t1 , t2 , ..., tn ). Then, if a candidate sub burnt pancake graph Bn−1 (k) such that k = |s1 |, |sn |, |t1 |, |tn | exists, a fault-free path s ⇒ t of length at most 2n + 10 can be constructed in O(n2 ) time complexity from Lemma 8. Even if only candidate sub burnt pancake graphs Bn−1 (k)’s for k = |s1 |, |sn |, |t1 |, and |tn | exist, a fault-free path s ⇒ t of length at most 2n + 10 can be constructed in O(n2 ) time complexity from Lemma 9. Consequently, Theorem 2 holds. Figure 6 shows the outline of the clusterfault-tolerant routing algorithm in Bn . We show an example of execution. For a source node s = (1, 2, 3, 4, 5, 6), a destination node t = (1, 2, 3, 4, 5, 6), and centers of faulty clusters c(1,1) = (6, 4, 5, 3, 2, 1), c(1,2) = (1, 2, 3, 4, 5, 6), c(2,1) = (3, 4, 5, 6, 1, 2), c(2,2) = (2, 1, 6, 5, 4, 3), c(3,1) = (4, 5, 6, 1, 2, 3), c(3,2) = (3, 2, 1, 6, 5, 4), c(4,1) = (2, 3, 4, 6, 1, 5), c(4,2) = (5, 1, 6, 4, 3, 2), Fig. 5. Path Construction in Case 2 in Proof of c(5,1) = (6, 1, 2, 3, 4, 5), and c(5,2) = Lemma 9 (5, 4, 3, 2, 1, 6), the algorithm constructs a fault-free path between s and t: (1, 2, 3, 4, 5, 6) → (4, 3, 2, 1, 5, 6) → (6, 5, 1, 2, 3, 4) → (3, 2, 1, 5, 6, 4) → (5, 1, 2, 3, 6, 4) → (6, 3, 2, 1, 5, 4) → (1, 2, 3, 6, 5, 4) → (5, 6, 3, 2, 1,
Cluster-Fault-Tolerant Routing in Burnt Pancake Graphs
273
CFT-route(n, F , s, t) Input: n for Bn ; a set of faulty clulsters F = {C1 , C2 , . . . , Cn−1 }, their diameters; non-faulty nodes s = (s1 , s2 , . . . , sn ), t = (t1 , t2 , . . . , tn ); Output: a fault-free path from s to t; begin calculate the centers c(i,1) and c(i,2) of Ci (1 ≤ i ≤ n − 1);
(n)
(n)
n−1 find I = {i | |Bn−1 (i) ∩ (∪n−1 j=1 {c(j,1) , c(j,2) })| = 0, |Bn−1 (i) ∩ (∪j=1 {c(j,1) , c(j,2) })| ≤ n − 2};
if ∃i ∈ I such that |i| = |s1 |, |sn |, |t1 |, |tn | then begin
construct n disjoint paths Qj from s to gj ∈ Bn−1 (i) (1 ≤ j ≤ n); construct n disjoint paths Rk from t to hk ∈ Bn−1 (i) (1 ≤ k ≤ n); find fault-free paths Qj and Rk ; construct a path from gj to hk in Bn−1 (i); return s ⇒ gj ⇒ hk ⇒ t end else if I = {sn , sn } then begin construct a path from s to hk in Bn−1 (sn ); return s ⇒ hk ⇒ t end else if I = {ii , i2 } and {|i1 |, |i2 |} = {|s1 |, |sn |} then begin construct n disjoint paths Tl from s to gl ∈ Bn−1 (i1 ) ∪ Bn−1 (i2 ) (1 ≤ l ≤ n); find a fault-free path Tl ; return s ⇒ gl ⇒ hk ⇒ t end else begin construct n disjoint paths Uj from s to gx ∈ Bn−1 (i1 ) (1 ≤ x ≤ n); construct n disjoint paths Vk from t to hy ∈ Bn−1 (i2 ) (1 ≤ y ≤ n); find fault-free paths Ux and Vy ; construct a path from gx to hy in Bn−1 (i1 ) ∪ Bn−1 (i2 ); return s ⇒ gx ⇒ hy ⇒ t end end;
Fig. 6. Cluster-Fault-Tolerant Routing Algorithm in Burnt Pancake Graphs
4) → (2, 3, 6, 5, 1, 4) → (1, 5, 6, 3, 2, 4) → (3, 6, 5, 1, 2, 4) → (2, 1, 5, 6, 3, 4) → (6, 5, 1, 2, 3, 4) → (4, 3, 2, 1, 5, 6) → (4, 3, 2, 1, 5, 6) → (1, 2, 3, 4, 5, 6)
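Independently of the full algorithm, one can sanity-check any constructed route by verifying that consecutive nodes differ by a single signed prefix reversal. The sketch below does this for the first steps of the example above, with signs restored under our reading (the printed overbars were lost).

    def reversal(u, i):
        return tuple(-x for x in reversed(u[:i])) + u[i:]

    def is_valid_route(path):
        return all(any(reversal(u, i) == v for i in range(1, len(u) + 1))
                   for u, v in zip(path, path[1:]))

    route = [(1, 2, 3, 4, 5, 6),
             (-4, -3, -2, -1, 5, 6),   # s^(4)
             (-6, -5, 1, 2, 3, 4)]     # then a reversal of length 6
    print(is_valid_route(route))       # True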
4 Conclusions

This paper proposed an algorithm that constructs a fault-free path between any two non-faulty nodes in Bn with at most n − 1 faulty clusters of diameter at most 3. We proved that the time complexity of the algorithm is O(n²) and that the length of the path given by the algorithm is at most 2n + 10. Future work includes an empirical evaluation by computer experiments, improvement of the maximum path length, and so on.
References 1. Cohen, D.S., Blum, M.: On the problem of sorting burnt pancakes. Discrete Applied Mathematics 61, 105–120 (1995) 2. Iwasaki, T., Kaneko, K.: A fault-tolerant routing algorithm of burnt pancake graphs. In: Proc. 2009 Int’l Conf. Parallel and Distributed Processing Techniques and Applications, pp. 307–313 (2009) 3. Kaneko, K.: An algorithm for node-to-node disjoint paths problem in burnt pancake graphs. IEICE Trans. Inf. and Systems E90-D(1), 306–313 (2007) 4. Kaneko, K.: An algorithm for node-to-set disjoint paths problem in burnt pancake graphs. IEICE Trans. Inf. and Systems E86-D(12), 2588–2594 (2003)
5. Kaneko, K.: Hamiltonian cycles and Hamiltonian paths in faulty burnt pancake graphs. IEICE Trans. Inf. and Systems E90-D(4), 716–721 (2007) 6. Kaneko, K., Sawada, N., Peng, S.: Cluster fault-tolerant routing in pancake graphs. In: Proc. 19th IASTED Conf. Parallel and Distributed Computing and Systems, pp. 423–428 (2007) 7. Gu, Q.P., Peng, S.: Cluster fault-tolerant routing in star graphs. Networks 35(1), 83–90 (2000) 8. Gu, Q.P., Peng, S.: Optimal algorithms for node-to-node fault tolerant routing in hypercubes. The Computer Journal 39(7), 626–629 (1996) 9. Akers, S.B., Krishnamurthy, B.: A group-theoretic model for symmetric interconnection networks. IEEE Trans. Computers 38(4), 555–566 (1989) 10. Gates, W.H., Papadimitriou, C.H.: Bounds for sorting by prefix reversal. Discrete Mathematics 27, 47–57 (1979)
Edge-Bipancyclicity of All Conditionally Faulty Hypercubes Chao-Ming Sun and Yue-Dar Jou Department of Electrical Engineering R.O.C. Military Academy, Kaohsiung 83059, Taiwan {sunzm,ydjou}@mail.cma.edu.tw
Abstract. In this paper, we consider the conditionally faulty hypercube Qn with n ≥ 2, in which each vertex of Qn is incident with at least m fault-free edges, 2 ≤ m ≤ n − 1. We generalize the limitation m ≥ 2 found in all previous results on edge-bipancyclicity. For every such integer m, we prove under this hypothesis that Qn is (n − 2)-edge-fault-tolerant edge-bipancyclic, and the result is optimal with respect to the number of edge faults tolerated. This improves some known results on the edge-bipancyclicity of hypercubes.
1 Introduction
The graph-embedding problem, which asks whether a guest graph is a subgraph of a host graph, plays an important role in evaluating a network. An embedding strategy provides a scheme to emulate a guest graph on a host graph. This problem has become the subject of many studies in recent years. To find a cycle of a given length in a graph G is a cycle embedding problem, and to find cycles of all lengths from 3 to |V(G)| is a pancyclic problem, which has been investigated in many interconnection networks [1,2,4,6,13,14]. In general, a graph is pancyclic if it contains cycles of all lengths [4]. Pancyclicity is an important property for determining whether a network's topology is suitable for an application where mapping cycles of any length into the topology of the network is required. The concept of pancyclicity has been extended to vertex-pancyclicity [10] and edge-pancyclicity [2]. A graph is vertex-pancyclic (edge-pancyclic) if every vertex (edge) lies on a cycle of every length from 3 to |V(G)|. Bipancyclicity is essentially the restriction of the concept of pancyclicity to bipartite graphs, whose cycles are necessarily of even length. Based on this definition, clearly, if a graph is edge-bipancyclic, then it is vertex-bipancyclic; moreover, if a graph is vertex-bipancyclic, then it is bipancyclic. However, both converses are false, as shown in Fig. 1. Therefore, the edge-bipancyclic property is not only more important but also stronger than the other properties.
Fig. 1. Both graphs are bipartite. (a) The graph is bipancyclic, but no cycle of length 4 contains the black vertex; thus, it is not vertex-bipancyclic. (b) The graph is vertex-bipancyclic, but no cycle of length 6 contains the edge (x, y); consequently, it is not edge-bipancyclic.
for any F ⊂ E(G) with |F | ≤ k [13]. However, each component of a network may have different reliability. Based on this insight, Harary [8] first proposed the concept of conditional connectivity. Afterwards Latifi et al. [11] defined the conditional vertex-faults, which each vertex is incident with at least m faultfree vertices. Chan and Lee [5] considered to replace the conditional vertexfaults with conditional edge-faults. In other words, a graph G is conditionally faulty if each vertex is incident with at least m fault-free edges, 2 ≤ m ≤ n − 1. The conditionally edge-fault-tolerant bipancyclicity, vertex-bipancyclicity, and v e (G), and Cm (G), are defined to be the maximum edge-bipancyclicity, Cm (G), Cm integer k such that a conditionally faulty bipartite graph G with δ(G − F ) ≥ m where F ⊂ E(G) with |F | ≤ k, is k-edge-fault-tolerant bipancyclic, vertexbipancyclicity, and edge-bipancyclicity, respectively, and undefined otherwise. The hypercube is one of the most versatile and unique interconnection networks discovered to date for parallel computation [12]. Embedding has been the subject of intensive study, with the hypercube being the host graph and various graphs being the guest graph. The problem of fault-tolerant embedding in the hypercube has been previously studied in [5,7,9,13,15,16,18]. Li et al. [13] proved that for n ≥ 2, Qn is (n − 2)-edge-fault-tolerant edge-bipancyclic. In this paper, we consider the conditionally edge-fault-tolerant edge-bipancyclicity. The following result improves a recent result presented by Li et al. e (Qn ) = n − 2 if n ≥ 2. Theorem 1. For every integer m, Cm
By Theorem 1, the following proof is straightforward. Corollary 1. [13] For n ≥ 2, Qn is (n − 2)-edge-fault-tolerant edge-bipancyclic. The remainder of this paper is organized as follows. In next Section, some basic definitions and related works are introduced. In Section 3, we prove the main results. Finally, Section 4 provides the conclusions.
2 Preliminaries
In this paper, graph-theoretical terminology and notation in [3] are used, and a graph G = (V, E) means a simple graph, where V = V (G) is the vertex set and
E = E(G) is the edge set of the graph G. For a vertex u, NG(u) denotes the neighborhood of u, which is the set {v | (u, v) ∈ E}, and |NG(u)| is the degree of u, denoted by dG(u). Moreover, the minimum degree of G, denoted by δ(G), is min{dG(v) | v ∈ V(G)}. Two vertices u and v are adjacent if (u, v) ∈ E. A graph P = ⟨v0, v1, . . . , vk⟩ is called a path if the k + 1 vertices v0, v1, . . . , vk are distinct and (vi−1, vi) is an edge of P for i = 1, 2, . . . , k. The two vertices v0 and vk are called the end-vertices of the path, and the number k of edges contained in the path is called the length of P, denoted by l(P). The distance between any two vertices u and v of G, denoted by dG(u, v), is the length of the shortest path joining u and v in G. For convenience, we use the sequence P = ⟨v0, . . . , vi, P[vi, vj], vj, . . . , vk⟩, where P[vi, vj] = ⟨vi, vi+1, . . . , vj⟩ and the two vertices vi and vj are the end-vertices of P[vi, vj]. Sometimes, we also use P = P[v0, vi] + P[vi, vj] + P[vj, vk], or the term v0-vk-path, to denote a path P. Let G be a graph and E′ ⊂ E(G). The graph obtained by deleting all edges of E′ from G is denoted by G − E′. A faulty edge of G is an edge that can be deleted from G.

The n-dimensional hypercube, denoted by Qn, is a bipartite graph with 2^n vertices; each vertex u is denoted by an n-bit binary string u = xn xn−1 . . . x2 x1, where xi ∈ {0, 1} for all i, 1 ≤ i ≤ n. Assume that e = (u, v) is an edge of Qn and that the two vertices u = xn . . . xi . . . x1 and v = xn . . . x̄i . . . x1 are joined by an edge along dimension i, where 1 ≤ i ≤ n and x̄i represents the one's complement of xi. Then e is called an edge of dimension i in Qn. In the rest of this paper, ui denotes the binary string xn . . . x̄i . . . x1. The set of all edges of dimension i in Qn is denoted by Ei. It is clear that |Ei| = 2^(n−1). For any given i ∈ {1, 2, . . . , n}, let Q0n−1 and Q1n−1 be the two (n − 1)-dimensional subcubes of Qn induced by all vertices with the ith bit being 0 and 1, respectively. Clearly, Qn − Ei = Q0n−1 ∪ Q1n−1. We say that Qn is decomposed into the two (n − 1)-dimensional subcubes Q0n−1 and Q1n−1 by the crossing edge set Ei. To prove Theorem 1, the following lemma is often used.

Lemma 1. There exist exactly n − 1 disjoint cycles of length 4 in Qn that contain an edge (u, v) in common.

Proof: Without loss of generality (w.l.o.g.), assume the two vertices u and v in Qn are joined by an edge along dimension one. Since Qn is n-regular, NQn(u) = {v = u1, u2, u3, . . . , un} and NQn(v) = {u = v1, v2, v3, . . . , vn}. Clearly, NQn(u) ∪ NQn(v) has 2n distinct vertices. Since dH(u, v) = 1, dH(ui, vi) = 1 for 2 ≤ i ≤ n; in other words, if the two vertices u and v are adjacent, each pair of vertices ui and vi is also adjacent. Therefore, for 2 ≤ i ≤ n, ⟨u, ui, vi, v, u⟩ forms exactly n − 1 disjoint cycles of length 4 that contain the edge (u, v) in common, as shown in Fig. 2.

Fig. 2. Illustration for Lemma 1
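A brute-force illustration of Lemma 1 (independent of the O(n) reasoning in the proof): the following Python sketch enumerates all cycles of length 4 through a given edge of Qn, encoding each vertex as the integer whose bits form its binary string.

    from itertools import combinations

    def neighbours(u, n):
        return [u ^ (1 << i) for i in range(n)]

    def four_cycles_through(u, v, n):
        """All 4-cycles u -> a -> b -> v -> u in Qn."""
        cycles = []
        for a in neighbours(u, n):
            for b in neighbours(v, n):
                if a != v and b != u and bin(a ^ b).count("1") == 1:
                    cycles.append((u, a, b, v))
        return cycles

    n = 4
    u, v = 0b0000, 0b0001                       # an edge along dimension one
    cycles = four_cycles_through(u, v, n)
    print(len(cycles))                          # 3 == n - 1
    inner = [frozenset(c[1:3]) for c in cycles]
    assert all(x.isdisjoint(y) for x, y in combinations(inner, 2))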
3 Proof of Theorem 1
First, we claim that C_m^e(Qn) ≥ n − 2 if n ≥ 2. For this purpose, we need to prove that Qn is (n − 2)-edge-fault-tolerant edge-bipancyclic if n ≥ 2. This claim is proved by induction on n.
Fig. 2. Illustration for Lemma 1
Clearly, the theorem is true for n = 2, 3. We assume that the theorem is true for every integer 3 ≤ k < n. W.l.o.g., let F be a faulty edge set of Qn with |F| = n − 2, and let e = (u, v) be any fault-free edge in Qn. It remains to construct fault-free cycles in Qn containing e whose lengths are 4, 6, . . . , 2^n.
For 1 ≤ i ≤ n, let F_i denote the set of i-dimensional edges in F; thus Σ_{i=1}^{n} |F_i| = |F|. W.l.o.g., assume that |F_1| ≤ |F_2| ≤ · · · ≤ |F_n|. We may split Qn into two (n − 1)-dimensional subcubes Q^0_{n−1} and Q^1_{n−1} by the crossing edge set E_n. We use F_L and F_R to denote the sets E(Q^0_{n−1}) ∩ F and E(Q^1_{n−1}) ∩ F, respectively. Thus, F = F_L ∪ F_R ∪ F_n and |F_L| + |F_R| ≤ n − 3. There are two scenarios.
Case 1: e ∈ E(Q^i_{n−1}) for some i ∈ {0, 1}. W.l.o.g., assume e ∈ E(Q^0_{n−1}). Since |F_L| ≤ n − 3, by induction, there exist fault-free cycles in Q^0_{n−1} containing e whose lengths are 4, 6, . . . , 2^{n−1}. It remains to construct a fault-free cycle containing e in Qn of every even length l with 2^{n−1} + 2 ≤ l ≤ 2^n. Let C0 be one of the fault-free longest cycles containing e in Q^0_{n−1}; obviously, l(C0) = 2^{n−1}. Let l1 = l − l(C0) − 1. Since both l and l(C0) are even, l1 is odd and l1 ∈ {1, 3, . . . , 2^{n−1} − 1}. Since n ≥ 4 and l(C0) = 2^{n−1}, we can choose an edge (x, y) in C0 such that {(x, x^n), (y, y^n), (x^n, y^n)} ∩ (F_n ∪ F_R) = ∅. Since d(x, y) = 1, d(x^n, y^n) = 1. By induction, there exists a fault-free cycle C1 containing (x^n, y^n) in Q^1_{n−1} of every even length from 4 to 2^{n−1}. Clearly, C1 contains fault-free x^n y^n-paths P1 of Q^1_{n−1} whose lengths are 1, 3, . . . , 2^{n−1} − 1. Therefore, C0 − (x, y) + {(x, x^n), (y, y^n)} + P1 forms the desired cycle, as shown in Fig. 3(a).
Case 2: e ∈ E_n. W.l.o.g., assume the vertex u belongs to Q^0_{n−1}. Since |F| = n − 2, by Lemma 1, there exists a fault-free cycle C of length 4 that contains the edge (u, v); write C as u, x, x^n, v, u.
Fig. 3. Illustration for Theorem 1
By induction, there exist fault-free cycles C0 in Q^0_{n−1} containing (u, x) whose lengths are 4, 6, . . . , 2^{n−1}. Similarly, there exist fault-free cycles C1 in Q^1_{n−1} containing (v, x^n) whose lengths are 4, 6, . . . , 2^{n−1}. Consequently, C0 + {(x, x^n), (u, v)} + C1 yields the desired fault-free cycle containing e of every even length from 4 to 2^n, as shown in Fig. 3(b). The claim is thus proved.
Conversely, assume that two vertices u and v in Qn are joined by an edge along dimension one. We can choose a set R consisting of the n − 1 faulty edges (u^i, v^i), 2 ≤ i ≤ n, in Qn (see Fig. 2). Clearly, each vertex of Qn − R has at least n − 1 fault-free edges incident with it; that is, δ(Qn − R) ≥ n − 1. However, in Qn with n ≥ 2, by Lemma 1, it is then impossible to form a cycle of length 4 that contains the edge (u, v). Assume F is an edge subset of E(Qn). Clearly, R ∈ {F | Qn − F with δ(Qn − F) ≥ m is not edge-bipancyclic} for every integer m. Thus, C_m^e(Qn) ≤ n − 2 if n ≥ 2. Hence, C_m^e(Qn) = n − 2 if n ≥ 2, and the theorem is proved.
Recently, Tsai [19] showed that C2(Qn) = 2n − 5. Naturally, the analogous problem can be modeled as finding the bipancyclicity and vertex-bipancyclicity of the graph; in other words, for every integer m, 3 ≤ m ≤ n − 1, how many faulty edges can be tolerated such that Qn is bipancyclic (or vertex-bipancyclic)? On the other hand, Shih et al. [17] showed that C2^e(Qn) = 2n − 5 when cycles of length 4 are excluded. We are curious whether, for every integer m, 3 ≤ m ≤ n − 1, the number of faulty edges that can be tolerated such that the graph still satisfies Shih et al.'s property can be determined.
4 Conclusion
Fault tolerance is the ability of a network to keep performing in the presence of one or more faults. The most significant piece of information about a network's fault tolerance is whether it can function at all in the presence of faults. Let F be an edge subset of Qn with |F| ≤ n − 2. In this paper, we proved that, for every integer m, the conditionally faulty hypercube with δ(Qn − F) ≥ m is (n − 2)-edge-fault-tolerant edge-bipancyclic, and that this result is optimal with respect to the number of edge faults tolerated.
Acknowledgment. The authors would like to thank the National Science Council of the Republic of China, Taiwan, for financially supporting this research under Contract No. NSC 98-2115-M-145-001.
References
1. Amar, D., Fournier, I., Germa, A.: Pancyclism in Chvátal-Erdős graphs. Graphs Combinat. 7, 101-112 (2004)
2. Alspach, B., Hare, D.: Edge-pancyclic block-intersection graphs. Discrete Math. 97(1-3), 17-24 (1991)
3. Bondy, J.A., Murty, U.S.R.: Graph Theory with Applications. North Holland, New York (1980)
4. Bondy, J.A.: Pancyclic graphs I. J. Combinat. Theory 11, 80-84 (1971)
5. Chan, M.-Y., Lee, S.-J.: On the existence of Hamiltonian circuits in faulty hypercubes. SIAM J. Discrete Math. 4, 511-527 (1991)
6. Day, K., Tripathi, A.: Embedding of cycles in arrangement graphs. IEEE Trans. Comput. 12, 1002-1006 (1993)
7. Fu, J.-S.: Fault-tolerant cycle embedding in the hypercube. Parallel Comput. 29, 821-832 (2003)
8. Harary, F.: Conditional connectivity. Networks 13, 347-357 (1983)
9. Harary, F., Hayes, J.-P., Wu, H.-J.: A survey of the theory of hypercube graphs. Comput. Math. Appl. 15, 277-289 (1988)
10. Hobbs, A.: The square of a block is vertex pancyclic. J. Combinat. Theory B 20, 1-4 (1976)
11. Latifi, S., Hegde, M., Naraghi-Pour, M.: Conditional connectivity measures for large multiprocessor systems. IEEE Trans. Comput. 43, 218-222 (1994)
12. Leighton, F.T.: Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, San Mateo (1992)
13. Li, T.-K., Tsai, C.-H., Tan, J.J.M., Hsu, L.-H.: Bipanconnectivity and edge-fault-tolerant bipancyclicity of hypercubes. Inform. Proc. Lett. 87, 107-110 (2003)
14. Mitchem, J., Schmeichel, E.: Pancyclic and bipancyclic graphs - a survey. Graphs and Applications, 271-278 (1982)
15. Saad, Y., Schultz, M.H.: Topological properties of hypercubes. IEEE Trans. Comput. 37, 867-872 (1988)
16. Simmons, G.: Almost all n-dimensional rectangular lattices are Hamilton laceable. Congr. Numer. 21, 649-661 (1978)
17. Shih, L.-M., Tan, J.J.M., Liang, T., Hsu, L.-H.: Edge-bipancyclicity of conditional faulty hypercubes. Inform. Proc. Lett. 105, 20-25 (2007)
18. Sun, C.-M., Hung, C.-N., Huang, H.-M., Hsu, L.-H., Jou, Y.-D.: Hamiltonian laceability of faulty hypercubes. J. Interconnection Networks 8(2), 133-145 (2007)
19. Tsai, C.-H.: Linear array and ring embeddings in conditional faulty hypercubes. Theoretical Computer Science 314, 431-443 (2004)
Accelerating Euler Equations Numerical Solver on Graphics Processing Units
Pierre Kestener1, Frédéric Château1, and Romain Teyssier2
1 CEA, Centre de Saclay, DSM/IRFU/SEDI, F-91191 Gif-Sur-Yvette, France
[email protected], http://irfu.cea.fr/en/index.php
2 CEA, Centre de Saclay, DSM/IRFU/SAp, F-91191 Gif-Sur-Yvette, France
Abstract. Finite volume numerical methods have been widely studied, implemented and parallelized on multiprocessor systems and on clusters. Modern graphics processing units (GPU) provide architectures and new programming models that make it possible to harness their large processing power and to design computational fluid dynamics simulations at both high performance and low cost. We report on solving the 2D compressible Euler equations on modern Graphics Processing Units (GPU) with high-resolution methods, i.e. methods able to handle complex situations involving shocks and discontinuities. We implement two different second-order numerical schemes, a Godunov-based scheme with a quasi-exact Riemann solver and a fully discrete second-order central scheme as originally proposed by Kurganov and Tadmor. Performance measurements show that these two numerical schemes achieve 30x to 70x speed-ups on recent GPU hardware compared to a mono-thread CPU reference implementation. These first results provide very promising perspectives for designing a GPU-based software framework for applications in computational astrophysics, by further integrating MHD codes and N-body simulations.
1 Introduction
We report on implementing different numerical schemes for solving the Euler equations on the massively parallel architectures available today in graphics hardware. The Euler equations govern inviscid flow and are the fundamental basis of most computational fluid dynamics (CFD) problems, which often require large computing resources due to the dimensions of the domain (space and time). Modern GPUs provide efficient, cost-effective computing power to potentially solve large problems and to prepare for running on capability supercomputers. The purpose of this paper is to show that one can efficiently perform high-order numerical scheme simulations of the Euler equations on a single-GPU system. GPUs used to be co-processors dedicated to graphics tasks. Before the advent of the Nvidia CUDA architecture (2006), deep knowledge of the graphics pipeline model and of the low-level architecture was required to adapt a CPU code to run on the GPU. In 2005, Hagen et al. [1] implemented a Lax-Friedrichs Euler solver using the graphics pipeline approach, and designed shader programs in the
Cg language to harness the growing computing power of the vertex and fragment processors. They obtained speed-ups ranging from 10 to 30 when solving a shock-bubble problem on grids with up to 1024² cells. Nvidia CUDA is a parallel computing architecture which introduced a new programming model based on high-level abstractions that avoid the former graphics pipeline concepts and ease the porting of scientific CPU applications. More recently, Brandvik et al. [2] compared a CUDA and a BrookGPU implementation of a 3D Euler numerical scheme, using a 300,000 grid-cell domain. They report runtime speed-ups of 16 for the GPU implementation (running on an Nvidia GTX8800) versus the reference CPU implementation (running on an Intel Core2 Duo, 2.33 GHz). Let us finally mention the ambitious and impressive work of Schive et al. [3], which presents a GPU-accelerated adaptive-mesh-refinement code for astrophysics applications; overall speed-up factors of ∼10 are demonstrated for large (4096³ and 8192³) effective grid sizes. The hydrodynamics part of this code uses a Riemann-solver-free relaxation scheme. In Section 2, we briefly describe the numerical schemes used to solve the 2D Euler equations in the finite volume framework. First the Godunov scheme using a quasi-exact Riemann solver is presented; then we recall the basics of the Riemann-solver-free Kurganov-Tadmor scheme. Details of the GPU implementation using the Nvidia CUDA tools are given in Section 3, and we report on a comparative CPU/GPU performance analysis in Section 4.
2 Finite Volume Numerical Schemes for Solving the Compressible Euler Equations
Let us consider the two-dimensional Euler equations of hydrodynamics for an ideal polytropic gas, expressing the conservation of mass, momentum and energy:

∂_t (ρ, ρu, ρv, E)^T + ∂_x (ρu, ρu² + p, ρuv, u(E + p))^T + ∂_y (ρv, ρuv, ρv² + p, v(E + p))^T = 0,  (1)

p = (γ − 1) (E − ρ(u² + v²)/2),  (2)

where U = (ρ, ρu, ρv, E)^T is the vector of conservative variables; ρ, u, v, p and E are the density, the x- and y-velocities, the pressure and the total energy, respectively. γ denotes the adiabatic index, i.e. the ratio of specific heats; the value γ = 1.4 (for H2 at temperature 100°C) is often used in astrophysics simulations. Equation (1) can be rewritten as ∂_t U + ∂_x F(U) + ∂_y G(U) = 0, where F and G are the flux vectors. The standard approach of finite volume methods is to discretize the integral form of the system of conservation laws, which allows the discrete approximation to satisfy the conservation property. The space cell-averaged vector of conserved variables is:
U_{i,j}(t) = (1/|Ω_{i,j}|) ∫_{Ω_{i,j}} U(x, y, t) dx dy  (3)
where Ω_{i,j} is the elementary grid cell. In the case of a Cartesian grid, Ω_{i,j} is simply a square whose center is (x = i, y = j), of sizes Δx, Δy. An overview of modern high-resolution schemes in the finite volume framework can be found in references [4,5]. We only summarize the main features of the two schemes considered here.

2.1 Multidimensional Godunov Scheme
The two-dimensional Euler equations in integral (conservative) form are discretized in the finite volume framework as follows:

U^{n+1}_{i,j} = U^n_{i,j} − (Δt/Δx) (F^{n+1/2}_{i+1/2,j} − F^{n+1/2}_{i−1/2,j}) − (Δt/Δy) (G^{n+1/2}_{i,j+1/2} − G^{n+1/2}_{i,j−1/2}),  (4)

where the flux functions are now time and space averaged. Algorithm 1 summarizes the Godunov scheme with the directional splitting technique,

Algorithm 1. Directional splitting Godunov scheme
  initialize U^0_{i,j} buffer
  initialize nstep = 0 (discrete time variable)
  while t < tend do
    dt = computeDt()                 // compute time step
    if nstep % 2 == 0 then
      Godunov(X, dt); Godunov(Y, dt)
    else
      Godunov(Y, dt); Godunov(X, dt)
    end if
    if nstep % noutput == 0 then
      output_U()                     // dump fluid variable arrays into a file
    end if
    t = t + dt; nstep = nstep + 1
  end while
  generate timing report
and Algorithm 2 shows the pseudo-code of the main routine implementing Eq. (4) to update the fluid cells U_{i,j}. At each time step, the routine Godunov is called twice, once for each direction.

Algorithm 2. Godunov time step routine (pseudo-code)
  Godunov(integer dir, float dt)
    apply boundary conditions to U
    for (i, j) ∈ {computing cell indexes} do
      get state U(i, j) and compute the primitive variables W(i, j) = (ρ, u, v, p)^T
      solve the Riemann problem at the current cell interfaces along direction dir, i.e. compute the Godunov state
      compute the incoming fluxes F^{n+1/2}_{i+1/2,j} (resp. G^{n+1/2}_{i,j+1/2}) from the Godunov state
      update U(i, j) (see Eq. 4)
    end for
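For concreteness, the driver loop of Algorithm 1 can be written in plain C as the following sketch; computeDt(), godunov() and outputU() are hypothetical placeholders for the routines named in the pseudo-code, not the authors' actual implementation.

/* Schematic C sketch of Algorithm 1 (directional splitting driver). */
enum Dir { DIR_X, DIR_Y };

double computeDt(void);            /* CFL-limited time step (assumed)  */
void   godunov(enum Dir, double);  /* one directional sweep (assumed)  */
void   outputU(void);              /* dump fluid variables (assumed)   */

void run_simulation(double t_end, int noutput)
{
    double t = 0.0;
    int nstep = 0;
    while (t < t_end) {
        double dt = computeDt();
        if (nstep % 2 == 0) {          /* alternate the sweep order */
            godunov(DIR_X, dt);
            godunov(DIR_Y, dt);
        } else {
            godunov(DIR_Y, dt);
            godunov(DIR_X, dt);
        }
        if (nstep % noutput == 0)
            outputU();
        t += dt;
        ++nstep;
    }
}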
2.2 Kurganov-Tadmor Central Scheme
Kurganov and Tadmor [6,7] introduced a class of Riemann-solver-free schemes based on a central approach: the solution of the Riemann problem is computed on a staggered cell, before being averaged back onto the standard grid. The numerical solution is updated on the edges of the staggered grid, where it is smooth, and can be computed via a Taylor expansion, with no need to solve the actual Riemann problem. Given the cell averages U^n_{i,j}, the fully discrete second-order Kurganov-Tadmor scheme is a two-step predictor-corrector method. Let us define the reconstructing piecewise linear polynomial of the form:

Ũ_{i,j}(x, y) = U^n_{i,j} + (x − i) U^{x,n}_{i,j} + (y − j) U^{y,n}_{i,j},  (5)
where U^{x,n}_{i,j} and U^{y,n}_{i,j} are approximations of the partial derivatives. By considering averages over the staggered cell (centered around (x = i + 1/2, y = j + 1/2)), one gets [8]

U^n_{i+1/2,j+1/2} = (1/4) (U^n_{i,j} + U^n_{i+1,j} + U^n_{i,j+1} + U^n_{i+1,j+1})
  + (1/16) (U^{x,n}_{i,j} − U^{x,n}_{i+1,j} + U^{x,n}_{i,j+1} − U^{x,n}_{i+1,j+1})
  + (1/16) (U^{y,n}_{i,j} − U^{y,n}_{i,j+1} + U^{y,n}_{i+1,j} − U^{y,n}_{i+1,j+1}).  (6)
The predictor step estimates the half-time-step values

U^{n+1/2}_{i,j} = U^n_{i,j} − (Δt/(2Δx)) (F^n_{i+1,j} − F^n_{i,j} + F^n_{i+1,j+1} − F^n_{i,j+1}) − (Δt/(2Δy)) (G^n_{i,j+1} − G^n_{i,j} + G^n_{i+1,j+1} − G^n_{i+1,j}),

which are used in the corrector step to update U:

U^{n+1}_{i,j} = (1/4) (U^n_{i,j} + U^n_{i+1,j} + U^n_{i,j+1} + U^n_{i+1,j+1}) + (1/16) (U^{x,n}_{i,j} − U^{x,n}_{i+1,j}) − λ_x (F(U^{n+1/2}_{i+1,j}) − F(U^{n+1/2}_{i,j})) + . . .  (7)
Let us note that updating the values U^{n+1}_{i,j} in the Kurganov-Tadmor scheme requires information from a larger neighborhood (5 × 5) compared to the Godunov scheme (3 × 3), due to the different ways the fluxes are calculated.
3 GPU Implementation
Over the past few years, the ever-growing computing power of GPUs has made them an interesting candidate for high performance general purpose computation (GPGPU). By unifying the different shader processors, the Nvidia CUDA architecture provides a new data-parallel programming model which does not require knowledge of graphics rendering techniques. NVIDIA also introduced a C-like environment [9] that is much easier to use for designing scientific applications running on
hybrid CPU/GPU systems. The current GPU architecture, e.g. Tesla S1070, has 4 devices, each equipped with 240 32-bit cores working at 1.44 GHz. This system delivers up to 4 × 1037 GFLOPS (Giga Floating Point Operations Per Second). In addition, each device can access a 4 GBytes GDDR3 memory at 110 GBytes/s. The CUDA programming model provides two high-level key abstractions, a virtual hierarchy of thread blocks and the shared memory space, which make data and thread parallelism explicit. A CUDA kernel, defined as the entry point for executing parallel code on the GPU, is parametrized by the grid-of-blocks and block-of-threads dimensions. Each thread of a block has access to a common on-chip low-latency memory space named shared memory. One of the major assets of this kind of architecture is cross-device scalability, which makes a program blind to the actual hardware resources of the GPU device (number of multiprocessors per chip, etc.). Let us mention that the advent of the OpenCL language, which essentially uses the same programming model concepts as CUDA, allows our results to apply to other GPUs. We developed a GPU CUDA-based implementation of the two numerical schemes described in Section 2 using the same parallel programming pattern: the actual computational domain is split into overlapping sub-domains. The width of the ghost region clearly depends on the complexity of the numerical scheme; the Godunov scheme only requires one surrounding ghost cell per sub-domain whereas the Kurganov-Tadmor scheme requires two. In the Godunov scheme, each inner cell only requires information from a 3 × 3 neighborhood to solve the local Riemann problem. We also implemented kernels for computing the time step as a parallel reduction and for computing boundary conditions, so that no transfer of data between CPU and GPU memory is required during a simulation time step, except at initialization and at the end of the simulation.
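The sub-domain/ghost-cell pattern can be sketched as the following CUDA kernel for a one-ghost-cell (3 × 3) stencil; the tile size, kernel name and the simple averaged fluxes are illustrative assumptions only (a real Godunov sweep would build the fluxes from the Riemann states at the cell interfaces).

#define TILE 16

__global__ void godunov_x_sweep(const float *rho_in, float *rho_out,
                                int nx, int ny, float dtdx)
{
    // interior row of the tile plus one ghost cell on each x side
    __shared__ float tile[TILE][TILE + 2];

    int i  = blockIdx.x * TILE + threadIdx.x;    // global cell index
    int j  = blockIdx.y * TILE + threadIdx.y;
    int ti = threadIdx.x + 1;                    // shifted index in the tile
    bool in = (i < nx && j < ny);

    if (in) {
        tile[threadIdx.y][ti] = rho_in[j * nx + i];
        if (threadIdx.x == 0 && i > 0)           // left ghost cell
            tile[threadIdx.y][0] = rho_in[j * nx + i - 1];
        if (threadIdx.x == TILE - 1 && i < nx - 1)  // right ghost cell
            tile[threadIdx.y][TILE + 1] = rho_in[j * nx + i + 1];
    }
    __syncthreads();

    if (in && i > 0 && i < nx - 1) {
        // placeholder interface fluxes (illustrative only)
        float fluxL = 0.5f * (tile[threadIdx.y][ti - 1] + tile[threadIdx.y][ti]);
        float fluxR = 0.5f * (tile[threadIdx.y][ti] + tile[threadIdx.y][ti + 1]);
        rho_out[j * nx + i] = tile[threadIdx.y][ti] - dtdx * (fluxR - fluxL);
    }
}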
4 Performance Analysis
The performance of the numerical schemes is evaluated on two systems whose GPU specifications are listed in Table 1. The performance in GFLOPS is calculated by the following formula:

k N_x N_y N_ts / t × 10^{−9},  (8)
where t is the execution time, k is a numerical prefactor (340 for the Godunov scheme and 320 for the Kurganov-Tadmor scheme), N_x and N_y are the domain sizes and N_ts is the number of time steps of the simulation run. In Fig. 1 are reported the timing measurements of a 200-time-step simulation run for the two numerical schemes on both CPU (Intel Xeon L5420) and GPU (Tesla S1070).

Table 1. The specifications of CUDA-capable systems

CPU                 GPU          # of SP  # of SM  SP clock   GFLOPS  Mem. B/W        Mem. capacity
Intel Core2 Q6600   GTX8800      128      16       1.35 GHz   518     86.4 GBytes/s   768 MBytes
Intel Xeon L5420    Tesla S1070  240      30       1.44 GHz   1037    110 GBytes/s    4.0 GBytes
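For reference, Eq. (8) amounts to the following small helper (a sketch; the function name is ours):

/* Effective GFLOPS per Eq. (8); k = 340 for the Godunov scheme,
 * k = 320 for the Kurganov-Tadmor scheme. */
double effective_gflops(double k, double nx, double ny,
                        double nts, double seconds)
{
    return k * nx * ny * nts / seconds * 1e-9;
}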
Fig. 1. Runtime (in seconds) versus grid size for a 200-time-step simulation. Execution times t_CPU and t_GPU are measured on the two different hybrid systems listed in Table 1. Runtime includes buffer transfer from host memory at initialization and to host memory at the end of the simulation for saving data to a file on the hard drive. Left: runtime for the Tesla-based system. Right: runtime for the GTX8800-based system. The red and orange plots correspond to runtimes measured on the CPU for the Godunov and the Kurganov-Tadmor schemes; the blue and light-blue plots are the corresponding runtimes measured on the GPU.
Note that the timing measurements include memory transfers between the host and the graphics accelerator. By examining Fig. 1, one can notice that the CPU timings for the two numerical schemes have different scaling behaviors as the simulation domain size increases. The Godunov scheme simulation behaves as expected from the algorithm complexity, i.e. t_simu ∼ N² (N_x = N_y = N). This is illustrated in Fig. 1, where the Godunov timing curve plotted with log-log axes has a slope of 1.95, whereas the corresponding Kurganov-Tadmor plot is characterized by a slope of 2.26, significantly larger than 2. This is due to the fact that the CPU version of the Kurganov-Tadmor scheme is based upon the software package CentPack (http://www.cscamm.umd.edu/centpack/), which is not optimized regarding memory storage. The GPU version, however, does not need to store full-grid intermediate variables because it uses the on-chip shared memory space. For small domain sizes (N ≤ 256), the GPU runtime is almost flat. This can be explained by the fact that the GPU occupancy factor is very low (not enough blocks of threads to fully load the device). In Fig. 2 are shown the CPU versus GPU speed-ups corresponding to the timings shown in Fig. 1. The Godunov scheme reaches a maximum speed-up of 70 for domain sizes larger than 1000² on the Tesla-based system. The Kurganov-Tadmor scheme shows a very high speed-up for domain sizes larger than 500²; this can be explained by the fact that the corresponding CPU timings scale as N^α with α larger than 2, whereas the GPU timing scales as N². In Fig. 3 are shown the effective GFLOPS
Fig. 2. Speed-up (t_CPU/t_GPU) versus grid size. Speed-ups are computed using the timings shown in Fig. 1. Left: speed-up for the Godunov scheme simulation. Right: speed-up for the Kurganov-Tadmor scheme.
Fig. 3. Effective GFLOPS comparison. GFLOPS are computed using Eq. (8). Left: GFLOPS for the Godunov scheme simulation. Right: GFLOPS for the Kurganov-Tadmor scheme.
measured for the numerical schemes. Note that the CPU version of the Kurganov-Tadmor scheme has a decreasing GFLOPS count as the domain size
increases. Once again, this is due to the fact that the corresponding CPU timings scale as N^α with α larger than 2.
5 Future Work
This work is the first step in parallelizing astrophysical simulation codes. It is shown that compressible Euler equations solvers can be efficiently implemented on modern GPUs and that speed-ups above 70 can be achieved compared to a single-threaded CPU program. Although at present only a 2D Euler solver is implemented, we believe further extension to 3D and to other fields (Poisson solver, magnetohydrodynamics, ...) will provide a framework for developing new high performance simulations for astrophysics.
References
1. Hagen, T.R., Henriksen, M.O., Hjelmervik, J.M.: How to solve systems of conservation laws numerically using the graphics processor as a high-performance computational engine. In: Quak (ed.) Geometric Modelling, Numerical Simulation, and Optimization: Industrial Mathematics at SINTEF. Springer, Heidelberg (2005)
2. Brandvik, T., Pullan, G.: Acceleration of a 3D Euler solver using commodity graphics hardware. In: 46th AIAA Aerospace Sciences Meeting, Reno, NV (2008)
3. Schive, H.Y., Tsai, Y.C., Chiueh, T.: GAMER: a graphic processing unit accelerated adaptive-mesh-refinement code for astrophysics. The Astrophysical Journal Supplement Series 186(2), 457-484 (2010)
4. Toro, E.: Riemann Solvers and Numerical Methods for Fluid Dynamics: A Practical Introduction, 2nd edn. Springer, Heidelberg (1999)
5. Leveque, R.: Finite Volume Methods for Hyperbolic Problems. Cambridge University Press, Cambridge (2002)
6. Kurganov, A., Tadmor, E.: New high-resolution central schemes for nonlinear conservation laws and convection-diffusion equations. Journal of Computational Physics 160, 241-282 (2000)
7. Kurganov, A., Tadmor, E.: Solution of two-dimensional Riemann problems for gas dynamics without Riemann problem solvers. Numer. Methods Partial Differential Equations 18, 548-608 (2002)
8. Jiang, G.H., Tadmor, E.: Nonoscillatory central schemes for multidimensional hyperbolic conservation laws. SIAM J. Sci. Comput. 19(6), 1892-1917 (1998)
9. NVIDIA: CUDA, http://developer.nvidia.com/object/gpucomputing.html
An Improved Parallel MEMS Processing-Level Simulation Implementation Using Graphic Processing Unit
Yupeng Guo, Xiaoguang Liu, Gang Wang, Fan Zhang, and Xin Zhao
Nankai-Baidu Joint Lab, Inst. of Robotics and Information Automatic System, College of I.T., Nankai University, Tianjin, 300071, China
{zick_gyp,liuxg74,wgzwp,fanzhang555}@yahoo.com.cn, [email protected]
Abstract. Micro-Electro-Mechanical System (MEMS) is the integration of mechanical elements, sensors, actuators, and electronics on a common silicon substrate through micro-fabrication technology. With MEMS technologies, micron-scale sensors and other smart products can be manufactured. Because of this micron scale, the structure of a MEMS product is nearly invisible; even the designer can hardly know whether the device is well designed and well produced. Therefore a visual 3D MEMS simulation implementation, named ZProcess [1], was proposed in our previous work to help designers understand and improve their designs. ZProcess shows the MEMS device's 3D model using the voxel method. It is accurate, but its speed is unacceptable when the scale of the voxel data is large. In this paper, an improved parallel MEMS simulation implementation is presented to accelerate ZProcess by using the GPU (Graphics Processing Unit). The experimental results show that the parallel implementation achieves a maximum 160 times speed-up compared with the sequential program.

Keywords: MEMS, Processing-level Simulation, Parallel, GPU, CUDA.
1 Introduction
While the electronics are fabricated using integrated circuit process sequences, the micromechanical components are fabricated using compatible 'micromachining' processes that selectively etch away parts of the silicon wafer or add new structural layers to form the mechanical and electromechanical devices. By modeling these 'micromachining' processes with Mathematical Morphology Operations (MO) on voxel data, ZProcess, which was developed in our previous work [1,4,6], becomes a MEMS processing-level simulation implementation. It uses voxel data to represent the 3D topography of a MEMS product. Considering the product's micrometer-scale dimensions (one millionth of a meter), we usually have to use 100,000,000 or more voxels to ensure the accuracy of the product's topography. The problem is that, running on a CPU, the simulation becomes very slow when the voxel data reach that scale. So it is necessary to develop a parallel simulation implementation which can accelerate the simulation program.
The rest of this paper is organized as follows. Section 2 introduces the basic MEMS fabrication processes, their MO model as constructed by ZProcess, and the details of the sequential algorithm. Section 3 gives the basic ideas on acceleration and the improved parallel algorithm. Section 4 shows the experimental data and the speed-up we obtain. Finally, we give our conclusions in Section 5.
2 Basic MEMS Fabrication Processes and Their MO Model
One of the basic building blocks in MEMS fabrication is the ability to deposit thin films of material, usually called deposition processing. MEMS deposition technology can be classified into two groups: one using chemical reactions and the other using physical reactions. Using deposition technology we can get a thin film with a thickness between a few nanometers and about 100 micrometers. Because the surface of the substrate, on which we deposit the thin film, may not be smooth, the device's surface will not be smooth either after deposition. For this reason, in the MEMS processing-level simulation implementation, we use Mathematical Morphology Operations (MO) to model the deposition processing [2]. The thickness of the deposited film can be obtained from the processing parameters. We can add the voxels within the sphere whose center is a surface voxel and whose radius is the thickness, just like rolling a ball on the substrate (Fig. 2 shows the concept of MO). As mentioned above, ZProcess is based on the voxel method: the MEMS device is treated as a set of voxels. The value of each voxel is mapped into 0 to 255: a voxel assigned 0 represents the transparent background, and a voxel assigned any other value represents an opaque object, the value denoting the material. We store the voxels in a one-dimensional array; the voxel's sequence number in the array is calculated as in formula (1), where we define Sn as the sequence number, dimX, dimY, dimZ as the dimensions of the volume data, and x, y, z as the voxel's coordinates.
Sn = z × dimX × dimY + y × dimX + x.  (1)

Algorithm 1 illustrates the sequential algorithm of deposition processing. Here x, y, and z are the dimensions of the MEMS device's voxel data.

Algorithm 1. Sequential Algorithm of Deposition Processing
for i := 1 to z do
  for j := 1 to y do
    for k := 1 to x do
      if voxel[k, j, i] is a surface point then
        modify (set value to 1) the voxels within the sphere whose center is voxel[k, j, i] and whose radius is the thickness
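A direct C transcription of this loop nest, using the flat indexing of formula (1), might look as follows; isSurfacePoint() and depositSphere() are hypothetical helpers standing in for the surface test and the spherical write-back.

#include <stddef.h>

int  isSurfacePoint(const unsigned char *vol, int x, int y, int z,
                    int dimX, int dimY, int dimZ);   /* assumed helper */
void depositSphere(unsigned char *vol, int x, int y, int z, int r,
                   int dimX, int dimY, int dimZ);    /* assumed helper */

/* Flat voxel index, formula (1). */
size_t voxelIndex(int x, int y, int z, int dimX, int dimY) {
    return (size_t)z * dimX * dimY + (size_t)y * dimX + x;
}

/* Sequential deposition (Algorithm 1): grow every surface voxel by a
 * sphere whose radius is the film thickness. */
void deposit(unsigned char *vol, int dimX, int dimY, int dimZ, int thickness)
{
    for (int z = 0; z < dimZ; ++z)
        for (int y = 0; y < dimY; ++y)
            for (int x = 0; x < dimX; ++x)
                if (isSurfacePoint(vol, x, y, z, dimX, dimY, dimZ))
                    depositSphere(vol, x, y, z, thickness,
                                  dimX, dimY, dimZ);
}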
Another basic process is etching. In order to form a functional MEMS structure, it is necessary to etch the thin films previously deposited or the substrate itself. Using lithography and a mask, we can transfer the desired pattern to the material through etching; in the program, we use the mask data directly. In general, there are two kinds of etching: wet etching, where the material is dissolved when immersed in a chemical solution, and dry etching, where the material is removed using
reactive ions. In ZProcess, we model the wet etching process in the same way as deposition; the change is that we erase the voxels within the sphere instead of adding them. A constraint is also added to the algorithm: a surface point about to be erased must lie within the etching mask. For dry etching, we vertically erase the voxels, which are surface points within the etching mask, down to the etching depth. Other basic processes, such as fabricating the substrate, stripping resist, bonding and so on, can also be represented simply by operating on the voxel data. In the sequential program, we traverse the whole voxel data and set the chosen voxels to the substrate material in the substrate-fabrication process, or set the voxels whose value equals the resist material to 0 in the resist-stripping process. With all the models presented above, we can produce the 3D appearance of a MEMS device manufactured by these basic processes. Fig. 1 shows the simulation result of a micro-gripper, which is fabricated by 10 processes in total: fabricating the substrate, three depositions, four etchings with different masks, bonding, and stripping resist. In Fig. 1, the left image is the micro-gripper's SEM photograph, and the right one is a screenshot of the micro-gripper's simulation result (3D model) produced by the MEMS processing-level simulation implementation.
Fig. 1. The Micro-gripper’s SEM Photograph and Simulation Result
3 Parallel Methods to Improve the Simulation Implementation on GPU

3.1 Introduction of CUDA and the Basic Parallel Considerations
CUDA is short for NVIDIA's Compute Unified Device Architecture [7,8,9,10]. It provides a programming interface to use the parallel architecture of NVIDIA's GPUs for general purpose computing. CUDA-capable GPUs have hundreds of cores that can collectively run thousands of computing threads, and each core has shared resources, including registers and memory. With the C language and CUDA's nvcc compiler, it is convenient for developers to write CUDA programs or embed them into other programs. In the sequential simulation implementation, we have to traverse most of the voxels in the volume data and set them to the right value when executing even a single processing step of the simulation. When the volume data are very large, the program becomes unacceptably slow. Unfortunately, the pattern of the etching mask is complex; if the volume data size is not large enough, the pattern in the mask will be blurred when we transform the
mask's vector graph into a scalar graph. For example, adjacent combs with small distances in the micro-gripper's structure layer may overlap, and the simulation result, which is the device's 3D model, will then be blurred too. For this reason, we usually have to use more than 100 million voxels to represent a MEMS device's appearance. Since the operations on different voxels are independent, the program is well adapted to run on a GPU, because the task can be made massively parallel: we can assign a single GPU thread to operate on one or more voxels of the device's volume data. We therefore improved all the sequential processing simulation programs with parallel methods and embedded them into the original program. Before the simulation, the volume data is transferred from host memory to device memory; after all the parallel processing simulation steps are completed on the GPU, the volume data is transferred back to host memory for display.

3.2 Three-Dimensional Fast Mathematical Morphological Operation (FMO)
In deposition and etching simulation, we can use FMO instead of MO [3]; Fig. 2 shows the 2D schematic illustration. Unlike the original morphological algorithm [2], it is not necessary to access every voxel inside the sphere when performing erasing operations, since a large number of voxels overlap between two adjacent spheres. As shown in Fig. 2, P_n and P_{n+1} are two adjacent spheres, and the overlapping part (oblique lines) only needs to be erased once. Since the MEMS device's appearance is irregular, we have to extend FMO to three dimensions. Firstly, we calculate the voxels inside a sphere whose radius equals the thickness. Secondly, from this result we calculate the voxels inside the half shells of the sphere in the positive halves of the x-, y- and z-axes. It is important to ensure the continuity of the shell's surface: since the voxel data are discrete, a mathematical method that uses the sphere-surface formula to compute the shell is not useful. If we want to get a half shell in the positive half of the x-axis, for example, we can, for each voxel position in the y-z plane, search the voxel data obtained in the first step along the x-axis, from x down to 0, until a sphere voxel is found. For the different processes, we calculate the sphere and the half shells with different radii, and then put them into the device's constant memory as templates.
Fig. 2. Schematic Illustration of MO and FMO. In MO, we need to access every voxel in sphere. In FMO, we only need to operate the different part of the two spheres, as the schematic illustration shows.
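A possible way to hold these precomputed templates is sketched below; the array names, the MAX_TEMPLATE bound and the int3 offset encoding are our assumptions (CUDA constant memory is limited to 64 KB, so the templates must stay small: here 4 × 1024 × 12 bytes = 48 KB).

#define MAX_TEMPLATE 1024   // max voxel offsets per template (assumed)

__constant__ int3 d_sphere[MAX_TEMPLATE];      // voxels inside the sphere
__constant__ int3 d_halfShellX[MAX_TEMPLATE];  // half shell, positive x side
__constant__ int3 d_halfShellY[MAX_TEMPLATE];  // half shell, positive y side
__constant__ int3 d_halfShellZ[MAX_TEMPLATE];  // half shell, positive z side

// Host side: copy a template computed on the CPU into constant memory.
void uploadSphereTemplate(const int3 *sphere, int n)
{
    cudaMemcpyToSymbol(d_sphere, sphere, n * sizeof(int3));
}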
3.3 Parallel Method for MEMS Processing Used in Simulation
For deposition and wet etching, before we use FMO to modify the volume data, we should first find the surface voxels. Otherwise, we would have to feed every voxel of the volume data into the deposition or etching kernel and decide inside the kernel which ones are surface points, which would lead to high thread divergence. To reduce the thread divergence, we use a GPU operation called compaction [11]. Firstly, we design a kernel to find all the surface points, and we use an array of the same size as the volume data, called the surface-point array, to store the surface voxel information: 1 represents a surface voxel and 0 a non-surface voxel. Secondly, we compact these surface voxels: using CUDPP [5], we scan the whole surface-point array, and the scan output for each element is the surface voxel's subscript in the compaction array. We then define the compaction kernel as Algorithm 2 shows.

Algorithm 2. Compaction Kernel
for each thread i do
  if surface_array[i] == 1 then
    compaction_array[output_data[i]] = i
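Fleshed out as CUDA kernels, the surface marking and the compaction of Algorithm 2 might read as follows; scan_out is the exclusive prefix sum of the surface array (computed with CUDPP in the paper), and isSurfaceVoxel() is a hypothetical device-side test.

__device__ int isSurfaceVoxel(const unsigned char *vol, int idx); // assumed

__global__ void markSurface(const unsigned char *vol, int *surface,
                            int nVoxels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nVoxels)
        surface[i] = isSurfaceVoxel(vol, i) ? 1 : 0;
}

__global__ void compactSurface(const int *surface, const int *scan_out,
                               int *compaction, int nVoxels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nVoxels && surface[i] == 1)
        compaction[scan_out[i]] = i;   // scatter voxel id to its compact slot
}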
After that, we create thread blocks whose total size matches the compaction array. Since the compaction array stores the voxels' sequence numbers in the volume data, we can easily calculate the coordinates of each voxel by:

Z = Sn / (dimX × dimY),  (2)
Y = (Sn − Z × dimX × dimY) / dimX,  (3)
X = Sn − Z × dimX × dimY − Y × dimX.  (4)
For each surface point, we use FMO to modify the volume data. If the surface point's neighboring voxel (in the positive half of the x-, y-, or z-axis) is also a surface point, we use the corresponding half shell of the sphere as our FMO template; otherwise, we use the whole sphere. For dry etching, we find the surface points and compact them too, but we do not need FMO: we look up each surface point in the etching mask data, and if the surface point is in the mask, we erase the voxels within the depth. For the other processes, such as fabricating the substrate and stripping resist, we define the thread block's dimensions (CUDA allows developers to define three-dimensional thread blocks) to fit the region of the volume data we want to operate on, so each thread operates on one voxel and the thread's ID is just the voxel's coordinate. With the voxel's coordinates and the processing parameters, we can decide to set the voxel as a part of the substrate or strip it away from the device.
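Inside a kernel, Eqs. (2)-(4) translate directly into integer arithmetic; a small device-function sketch:

// Recover (x, y, z) from the sequence number Sn, following Eqs. (2)-(4).
__device__ void snToCoord(int sn, int dimX, int dimY,
                          int *x, int *y, int *z)
{
    *z = sn / (dimX * dimY);                 // Eq. (2)
    *y = (sn - *z * dimX * dimY) / dimX;     // Eq. (3)
    *x = sn - *z * dimX * dimY - *y * dimX;  // Eq. (4)
}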
4 Experimental Results
We use the following devices in our experiments. CPU: AMD Phenom II X4 945 (4 cores at 3.0 GHz); memory size: 4 GB; GPU: NVIDIA Tesla 1060 with 4 GB global memory for parallel calculation (240 stream processors), and NVIDIA 8600GT for display; operating system: Red Hat AS 5.3; CUDA version: 2.2; GPU program compiler: nvcc; CPU program compiler: gcc version 4.2.3. With a small volume data size, we can finish the simulation quickly; with a large data size, we can build a more accurate 3D model of the MEMS device. So in the experiments, we choose volume data sizes from 3 million voxels to 200 million voxels. Table 1 shows the experimental results of the micro-gyroscope's simulation. We obtain speed-ups from 23.0 times to 27.9 times. Here, the IO execution time means the time used in reading the mask data from disk for each etching process. In Table 1, we can see that the IO execution time occupies most of the time used by the parallel program. To test the speed-up of simulating the MEMS processes with our parallel methods, we also calculate the sequential and parallel programs' runtimes without IO time; the speed-up we then obtain ranges from 69.3 times to 164.5 times. Compared with the sequential program, the parallel program achieves a large and stable acceleration.

Table 1. Micro-gyroscope simulation experimental results. We list the sequential program's runtime and the parallel program's runtime, and then calculate the speed-up. The timing unit is milliseconds.
Volume data size (million voxels)      3      12     50     100    200
Sequential runtime                     1428   5627   22336  49360  89762
Parallel runtime                       62     213    828    1769   3266
Speed-up                               23.0   26.4   27.0   27.9   27.5
IO execution time                      42     168    679    1478   2630
Sequential runtime (without IO)        1386   5459   21657  47882  87132
Parallel runtime (without IO)          20     45     149    291    636
Speed-up (without IO)                  69.3   121.3  145.3  164.5  137.0
To verify the stability of the parallel algorithm, another experiment, simulating a different MEMS device named the micro-gripper, was performed with the same volume data sizes as in the micro-gyroscope experiment. As Fig. 3 shows, the simulations of the micro-gripper and the micro-gyroscope both achieve high speed-ups at the different data sizes. Despite the different complexity of the processes, we obtain stable speed-ups. In particular, at the representative volume data size of 100 million voxels, at which we can show the MEMS device's appearance exactly without blurring, we stably obtain more than 100 times speed-up.
Fig. 3. The experimental results in simulating the micro-gripper and the micro-gyroscope (the x-axis is the volume data size and the y-axis is the speed-up)
5 Conclusions
The simulation of the different processes is the key technology in MEMS CAD software. Unfortunately, the simulation takes much time because of its high algorithmic complexity and the large data sizes corresponding to real MEMS devices. In this paper, we presented an improved parallel MEMS processing-level simulation implementation. By accelerating the simulation algorithm of every basic MEMS process on the GPU, we obtain a 28 times speed-up in the micro-gyroscope's simulation; without the IO execution time, the speed-up reaches 160 times at most. We tested different MEMS devices, produced with different processes and simulated with different volume sizes; at the representative volume size, the measured acceleration without IO is stably above 100 times.

Acknowledgement. This work was supported by the Program for New Century Excellent Talents in University (NCET-07-0464), the National Natural Science Foundation of China (60875059), the National High Technology Research and Development Program of China (2009AA04Z320), and the Science and Technology Development Plan of Tianjin (08JCZDJC22000).
References
1. Sun, G., Zhao, X., Lu, G.: Voxel-based modeling and rendering for virtual MEMS fabrication process. In: IEEE/RSJ IROS 2006, Beijing, China, pp. 306-311 (2006)
2. Sun, G., Zhao, X., Zhang, H., Wang, L., Lu, G.: 3-D simulation of Bosch process with voxel-based method. In: Proceedings of the 2nd IEEE International Conference on Nano/Micro Engineered and Molecular Systems, Bangkok, Thailand, pp. 45-49 (2007)
3. Zhang, F., Wang, G.: An improved parallel implementation of 3D DRIE simulation on multi-core. In: 10th IEEE International Conference on High Performance Computing and Communications, HPCC 2008, Dalian, China, pp. 891-896 (2008)
4. Zhao, X., Li, Y., Zhou, Y., Ren, L., Lu, G.: Virtual process: concept, problems and implementation framework. In: The Fourth International Conference on Control and Automation (ICCA'03), Montreal, Canada, pp. 659-663 (2003)
5. CUDPP, http://www.gpgpu.org/developer/cudpp/
6. Zhao, X., Sun, G., Ren, L., Lu, G.: On MEMS design automation. In: Proceedings of the 26th Chinese Control Conference, Zhangjiajie, Hunan, China, pp. 774-778 (2007)
7. NVIDIA: CUDA Compute Unified Device Architecture Programming Guide, V.2.0 (2008)
8. CUDA, http://developer.nvidia.com/object/cuda.html/
9. Nickolls, J., Buck, I.: NVIDIA CUDA software and GPU parallel computing architecture. Microprocessor Forum (2007)
10. Lefohn, A.E., Sengupta, S., Kniss, J., Strzodka, R., Owens, J.D.: Glift: generic, efficient, random-access GPU data structures. ACM Trans. Graph. 25(1), 60-99 (2006)
11. Horn, D.: Stream reduction operations for GPGPU applications. GPU Gems 2, 573-589 (2005)
Solving Burgers' Equation Using Multithreading and GPU
Sheng-Hsiu Kuo1, Chih-Wei Hsieh1, Reui-Kuo Lin2, and Wen-Hann Sheu3
1 National Center for High-Performance Computing, Hsinchu, Taiwan
2 Taiwan Typhoon and Flood Research Institute, Taipei, Taiwan
3 National Taiwan University, Taipei, Taiwan
Abstract. Many-core systems play a key role in high performance computing (HPC) nowadays. This platform shows great potential in performance per watt, performance per floor area, cost performance, and so on. This paper presents a finite difference scheme solving the general convection-diffusion-reaction equations, adapted for Graphics Processing Units (GPU) and multithreading. A two-dimensional nonlinear Burgers' equation was chosen as the test case. The best results we measured are a speed-up ratio of 12 times at mesh size 1026×1026 using the GPU and 20 times at mesh size 514×514 using all 8 CPU cores, compared with an equivalent single-CPU code.

Keywords: Finite difference scheme; multithreading; GPU.
1 Introduction
In this paper, two parallelism models are used for a computational fluid dynamics (CFD) application: one is multithreading and the other is using graphics processing units. OpenMP [1] is a parallelism model targeting multithreading. It is a set of compiler directives along with library routines that provides an environment supporting multi-platform shared-memory parallel programming in Fortran, C and C++ on all architectures. In 2007, Nvidia provided the Compute Unified Device Architecture (CUDA) library, an extended subset of the C language supported by all recent NVIDIA graphics cards, to encourage the use of GPUs. There have been applications of GPU implementations in computational fluid dynamics (CFD). Kruger and Westermann [2] proposed a framework for the implementation of direct solvers for sparse matrices, and applied these solvers to multi-dimensional finite difference equations, i.e. the 2D wave equation and the incompressible Navier-Stokes equations. Goodnight, Woolley, Lewin, Luebke and Humphreys [3] presented boundary value heat and fluid flow problems using the GPU. A Navier-Stokes flow solver for structured grids using the GPU was presented in [4]. Hagen, Lie, and Natvig [5] presented implementations for compressible fluid flows using the GPU. Brandvik and Pullan [6,7] presented 2D and 3D Euler equation solvers on the GPU, focusing on performance comparisons between GPU and CPU codes with considerable speed-ups, using exclusively structured grids. Corrigan, Camelli, Löhner, and Wallin [8] presented an application on 3D unstructured grids for inviscid, compressible flows on the GPU.
We discuss the performance of solving a fluid dynamics problem on a single computing node. The basic fluid dynamics model is the 2D viscous Burgers' equations. An implicit convection-diffusion-reaction (CDR) scheme [9] of high accuracy is used and solved by the red-black SOR algorithm with different parallel paradigms. Here we present a description of the numerical scheme, the code validation, and details of the computational expense with each model.
2 2D Nonlinear Viscous Burgers’ Equations Two-dimensional nonlinear viscous Burgers’ equations can be written in terms
∂u ∂u ∂u 1 ⎛ ∂ 2u ∂ 2 u ⎞ +u +v = + ⎜ ⎟ ∂t ∂x ∂y Re ⎝ ∂x 2 ∂y 2 ⎠
(1)
∂v ∂v ∂v 1 ⎛ ∂ 2 v ∂ 2v ⎞ +u +v = + ⎜ ⎟ ∂t ∂x ∂y Re ⎝ ∂x 2 ∂y 2 ⎠
(2)
where Re denotes the Reynolds number. To perform the precise comparison with results found in [10] and [11]. The initial condition are given by
u ( x, y,0 ) = sin (π x ) +cos (π y ) , 0 < x, y < 0.5 v ( x, y,0 ) = x + y,
0 < x, y < 0.5
(3)
and the boundary conditions are given by u ( 0, y, t ) = cos (π y )
⎫ ⎪ u ( 0.5, y, t ) = 1 + cos (π y ) ⎪ ⎬ 0 ≤ y ≤ 0.5, t ≥ 0 v ( 0, y, t ) = y ⎪ ⎪ v ( 0.5, y , t ) = 0.5 + y ⎭
(4)
u ( x,0, t ) = sin ( π x )
⎫ ⎪ u ( x,0.5, t ) = sin (π x ) + 1⎪ ⎬ 0 ≤ x ≤ 0.5, t ≥ 0 v ( x,0, t ) = x ⎪ ⎪ v ( x,0.5, t ) = x + 0.5 ⎭
(5)
3 Numerical Method Consider in this paper the finite-difference solution of the scalar convection– diffusion–reaction equation. uφ x + vφ y − k (φxx + φ yy ) + cφ = f
(6)
Solving Burgers’ Equation Using Multithreading and GPU
299
where u and v represent the velocity components along the x and y directions, respectively. In the above, k and c denote the diffusion coefficient and the reaction coefficient, respectively. Assume that f was a known value. Employ its general solution for Eq. (6) as follow
φ ( x, y ) = c1eλ x + c2 eλ x + c3 eλ y + c4 eλ y + 1
2
3
4
f c
(7)
where c 1~ 4 are constants. Substituting Eq. (7) into Eq. (6), we can determine λ1~ 4 as follows
λ1,2 =
u ± u 2 + 4ck v ± v 2 + 4ck and λ3,4 = 2k 2k
(8)
For the CDR model equation (6), we can discrete the equation at an interior node i. The idea is to approximate all the derivative terms using the center-like scheme
⎛ u m c ⎞ ⎛ u m c ⎞ ⎛ m c⎞ ⎜ − − 2 + ⎟ φi −1, j + ⎜ − 2 + ⎟ φi +1, j + 4 ⎜ 2 + ⎟ φi , j ⎝ 2h h 12 ⎠ ⎝ 2h h 12 ⎠ ⎝h 6⎠ v m c v m c ⎛ ⎞ ⎛ ⎞ +⎜− − 2 + ⎟ φi , j −1 + ⎜ − 2 + ⎟ φi , j +1 = fi , j ⎝ 2h h 12 ⎠ ⎝ 2h h 12 ⎠
(9)
where h is the uniform grid size. Given the above discrete representation of (6), the prediction quality depends solely on m in Eq. (9). By virtue of Eq. (7), we can substitute f λ y λ y φi ±1, j = c1eλ1( xi ±h ) + c2 eλ 2 ( xi ± h ) + c3e 3 j + c4 e 4 j + , φi , j = c1eλ1xi + c2 eλ 2 xi + c f f λ ( y ± h) λ ( y ± h) λ y λ y c3e 3 j + c4 e 4 j + , and φi , j ±1 = c1eλ1 xi + c2 eλ 2 xi + c3 e 3 j + c4 e 4 j + into Eq. c c (9) to get high accuracy. Then we can derive vh ⎛ uh ⎜ 2 sinh λ1 cosh λ2 + 2 sinh λ3 cosh λ4 ⎜ ch 2 ⎜ + cosh λ1 cosh λ2 + cosh λ3 cosh λ4 + 10 ⎜ 12 ⎝ m= cosh λ1 cosh λ2 + cosh λ3 cosh λ4 − 2
(
2
where λ1 =
)
⎞ ⎟ ⎟ ⎟ ⎟ ⎠
(10)
2
2 2 uh vh ⎛ uh ⎞ ch ⎛ vh ⎞ ch , λ3 = . For time , λ2 = ⎜ ⎟ + , and λ4 = ⎜ ⎟ + k k 2k 2k ⎝ 2k ⎠ ⎝ 2k ⎠
stepping scheme, we consider φt = (φit +1 − φit ) dt , which yields first-order accuracy.
Then the Burgers’ equations will be cast into generalized form for velocity as Eq. (6). The definitions for u, v, k, c and f are tabulated in the table 1.
Table 1. Summary of the Burgers' equations

              φ        u     v     k      c      f
x-direction   u^{t+1}  u^t   v^t   1/Re   1/dt   u^t/dt
y-direction   v^{t+1}  u^t   v^t   1/Re   1/dt   v^t/dt
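Under our reconstruction of Eqs. (9)-(10), the nodal coefficient m can be evaluated pointwise by a small C routine such as the following sketch (the function name and argument order are our choice):

#include <math.h>

/* Coefficient m of Eq. (10). u, v: local velocities; k: diffusion;
 * c: reaction; h: uniform grid size. */
double cdr_m(double u, double v, double k, double c, double h)
{
    double l1 = u * h / (2.0 * k);
    double l2 = sqrt(l1 * l1 + c * h * h / k);
    double l3 = v * h / (2.0 * k);
    double l4 = sqrt(l3 * l3 + c * h * h / k);
    double cc = cosh(l1) * cosh(l2) + cosh(l3) * cosh(l4);
    double num = 0.5 * u * h * sinh(l1) * cosh(l2)
               + 0.5 * v * h * sinh(l3) * cosh(l4)
               + c * h * h / 12.0 * (cc + 10.0);
    return num / (cc - 2.0);
}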
4 Iterative Algorithms and Parallel Paradigms
A simple method to accelerate the iterative procedure, called successive over-relaxation (SOR), is used with the Gauss-Seidel iteration. The representative iterative scheme can be written as Eq. (11).
φ^{(n+1)'}_{i,j} = φ^{(n)'}_{i,j} + ω (φ^{n+1}_{i,j} − φ^{(n)'}_{i,j})  (11)
Here, n is the iteration level and ω is the relaxation parameter; when 1 < ω < 2, over-relaxation is employed. For the parallel paradigms, a variation of the Gauss-Seidel procedure known as the red-black SOR scheme was used in this study. It has the same convergence properties as the Gauss-Seidel procedure but is vectorizable. Imagine that the calculation points are colored as in Fig. 1: the red points are surrounded by the black points. Red points are calculated first (using the previous black values), then black points are calculated using the just-updated red values.
Fig. 1. Red-black SOR ordering
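One colored half-sweep of this red-black update can be sketched as a CUDA kernel as follows; the kernel name, the per-point coefficient arrays aW, aE, aS, aN, aP (the five coefficients of Eq. (9)) and the launch layout are our assumptions, not the authors' code.

__global__ void sorSweep(double *phi, const double *rhs,
                         const double *aW, const double *aE,
                         const double *aS, const double *aN,
                         const double *aP,
                         int nx, int ny, double omega, int color)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1) return;
    if (((i + j) & 1) != color) return;      // red: color = 0, black: 1

    int id = j * nx + i;
    double gs = (rhs[id] - aW[id] * phi[id - 1]  - aE[id] * phi[id + 1]
                         - aS[id] * phi[id - nx] - aN[id] * phi[id + nx])
                / aP[id];                    // Gauss-Seidel value
    phi[id] += omega * (gs - phi[id]);       // over-relaxation, Eq. (11)
}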
4.1 Sequential Program Procedure
The sequential procedure for solving the coupled Burgers' equations is described as follows:
(1) Give the initial and boundary values for u and v.
(2) Give the CDR coefficients u, v, k, c, and f as in Table 1.
(3) Calculate m for each point.
(4) Solve u^{t+1} by the red-black SOR algorithm.
(5) Solve v^{t+1} by the red-black SOR algorithm.
(6) If the steady-state condition is satisfied, save the results and stop the program; else set t = t + 1 and go to step (2).
The steady-state condition is assumed to be

(1/(M·N)) Σ_{i,j=1}^{M,N} (φ^{t+1}_{i,j} − φ^t_{i,j})² < 10^{−10},  (12)
and the stopping criterion in the iterative SOR procedure for the interior points is given by

(1/(M·N)) Σ_{i,j=1}^{M,N} (φ^{(n+1)'}_{i,j} − φ^{(n)'}_{i,j})² < 10^{−12},  (13)
where M and N denote the number of points along the x and y directions.

4.2 OpenMP Model
The same procedure as in the sequential program was used; we just added OpenMP directives to the sequential code. To obtain good performance, we use only one parallel region for the red-black SOR solver, which saves the time spent on forking and joining threads. The flow chart for the red-black SOR algorithm is shown in Fig. 2.
Fig. 2. Flow diagram showing red-black SOR algorithm using multithreading
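A minimal C/OpenMP sketch of this single-parallel-region structure is given below; sweep_row() and converged() are hypothetical helpers for one colored row update (Eq. (11)) and the stopping test (Eq. (13)).

#include <omp.h>

void sweep_row(double *phi, int j, int nx, double omega, int color);
int  converged(const double *phi, int nx, int ny);   /* assumed helpers */

void sor_openmp(double *phi, int nx, int ny, double omega)
{
    int done = 0;
    #pragma omp parallel           /* threads are forked only once */
    {
        while (!done) {
            #pragma omp for        /* red half-sweep, implicit barrier */
            for (int j = 1; j < ny - 1; ++j)
                sweep_row(phi, j, nx, omega, 0);
            #pragma omp for        /* black half-sweep */
            for (int j = 1; j < ny - 1; ++j)
                sweep_row(phi, j, nx, omega, 1);
            #pragma omp single     /* one thread tests the stopping rule */
            done = converged(phi, nx, ny);
        }                          /* barrier after single keeps `done` consistent */
    }
}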
4.3 GPU Program Procedure
A flow chart shown in Fig. 3 describes the calculation procedure for the GPU application. Our main goal is to reduce the time spent on data transfers between host and GPU device memory. That is, for the red-black algorithm, it only needs to send a
single value of the global L2-norm, named Res, to host memory at each iteration step. From the sequential program procedure described in Section 4.1, we can easily see that the array m only needs to be calculated once before the iteration starts, but it is read at every iteration step. For this reason, the calculation of the array m is done on the host side and the array is stored in the GPU device's constant memory, which is cached.
Fig. 3. Flow diagram showing application of CDR scheme using GPU
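The device-side residual accumulation can be sketched as follows, so that only the scalar Res crosses the PCIe bus each iteration; this is an illustrative simplification (atomicAdd on double requires a newer GPU than the Tesla S1070 used here, where a block-wise reduction would be used instead).

__global__ void accumulateRes(const double *phiNew, const double *phiOld,
                              double *res, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double d = phiNew[i] - phiOld[i];
        atomicAdd(res, d * d);   // sum of squared updates (needs cc >= 6.0)
    }
}

// Host side per iteration (sketch): reset, launch, copy one double back.
//   cudaMemset(d_res, 0, sizeof(double));
//   accumulateRes<<<blocks, threads>>>(d_phiNew, d_phiOld, d_res, n);
//   cudaMemcpy(&res, d_res, sizeof(double), cudaMemcpyDeviceToHost);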
5 Results and Conclusions
The problem is investigated at Re = 50 and 500. In the current study, uniform meshes of 21×21, 41×41, and 1026×1026 nodal points are employed for the cases at both Re = 50 and 500; the 21×21 and 41×41 cases were run on the CPU and the 1026×1026 case on the GPU. The simulated velocity contours at Re = 50 are shown in Fig. 4. Comparison was made on the basis of the predicted mid-span velocity profiles along the vertical and horizontal center-lines at Re = 50 and 500 in Fig. 5. The solutions are also tabulated in Tables 3-6 and compare well with the referenced numerical solutions [10,11]. In the following we present the performance of our GPU and multithreading implementations. The sequential code for the CPU is written in Fortran; we then added OpenMP directives to this CPU code. The sequential and OpenMP codes were compiled with the PGI Fortran compiler using the -fast flag, which enables a generally optimal set of options including global optimization, SIMD vectorization, loop unrolling and cache optimizations [11]. Finally, a C/C++ code employing the GPU of a typical video card through the CUDA library was written. To ensure accuracy, double precision was used for both the GPU and CPU codes. We use the g++ compiler for the main program and nvcc for the kernel program, both with the -O3 flag. The CPU and GPU hardware of the testbed used for this problem are specified in Table 2.
Fig. 4. The simulated contours of u(x, y) and v(x, y) for the Burgers’ equation with Re=50 at steady state
Fig. 5. Comparison of the predicted velocity profiles u(0.25,0.25) and v(0.25,0.25) at Re =50 and Re =500
Fig. 6 shows the speed-up ratio at different mesh sizes using all 8 CPU cores. We can see a super-linear speed-up of 20 at mesh size 514×514 in Fig. 6; this super-linear speed-up is caused by thrashing when running on a single core. When the mesh size is larger than 514×514, the parallel efficiency decreases to 30%: since the data set no longer fits in the cache, we cannot get more benefit from multithreading, and the data size becomes a bottleneck for the parallel efficiency of multithreading. When the GPU is involved in the calculation, the speed-up ratio is only 1.5-2 times compared with a single CPU code at small mesh sizes. However, as the mesh size increases, the red-black SOR algorithm needs more iteration steps to reach the convergence condition and also needs to calculate more points. That means the fraction of the running time spent on memory copies between the host and the GPU device decreases as the mesh size grows, so we obtain better performance. Fig. 7 shows the comparison between multithreading with different numbers of CPU processors and one GPU card. The GPU performs well on large problems and achieves a 12 times speed-up at mesh size 1026×1026.
As presented in this paper, the implicit CDR scheme was adapted for calculation on the GPU. For a two-dimensional Burgers' equation benchmark, using one GPU card offers a 12 times speed-up compared with running on a single CPU and a 6 times speed-up compared with running on 8 CPU cores. This will allow, in the future, running large-scale problems for solving convection-diffusion-reaction equations in two or three dimensions without a traditional CPU cluster.
Fig. 6. Speed-up ratio of using full 8 CPU cores at different mesh size
Fig. 7. Comparison of speed-up ratio when using multithreading and GPU

Table 2. Details of computer hardware used to run the simulations of the CDR scheme

CPU: Intel Xeon X5472
  Frequency of processor cores: 3.0 GHz
  L2 cache size: 12 MB
  Cores: 4
GPU: NVIDIA Tesla S1070 GPU computing server
  Frequency of processor cores: 1.44 GHz
  RAM: 4 GB DDR3
  Streaming processor cores: 240
Table 3. Comparison of the predicted values for u(x, y) at Re = 50 with other solutions reported in [10][11]

(x, y)      Present                      Bahadir   Jain & Holla
            M=N=21   M=N=41   M=N=1026   M=N=21    M=N=21
(0.1,0.1)   0.97543  0.97103  0.96951    0.96688   0.97258
(0.3,0.1)   1.17374  1.15533  1.14852    1.14827   1.16214
(0.2,0.2)   0.86488  0.86181  0.86082    0.85911   0.86281
(0.4,0.2)   0.98567  0.97159  0.96665    0.97637   0.96483
(0.1,0.3)   0.66205  0.66262  0.66283    0.66019   0.66318
(0.3,0.3)   0.76398  0.76148  0.76072    0.76932   0.77030
(0.2,0.4)   0.57654  0.57670  0.57677    0.57966   0.58070
(0.4,0.4)   0.73365  0.72979  0.72854    0.75678   0.74435
Table 4. Comparison of the predicted values for v(x, y) at Re = 50 with other solutions reported in [10][11]
             Present                          Bahadir    Jain & Holla
(x, y)       M=N=21    M=N=41    M=N=1026     M=N=21     M=N=21
(0.1,0.1)    0.10031   0.09867   0.09810      0.09824    0.09773
(0.3,0.1)    0.14973   0.14262   0.14001      0.14112    0.14039
(0.2,0.2)    0.16862   0.16722   0.16676      0.16681    0.16660
(0.4,0.2)    0.17481   0.16937   0.16749      0.17065    0.17397
(0.1,0.3)    0.26376   0.26354   0.26347      0.26261    0.26294
(0.3,0.3)    0.22442   0.22280   0.22228      0.22576    0.22463
(0.2,0.4)    0.32809   0.32686   0.32645      0.32745    0.32402
(0.4,0.4)    0.32296   0.31886   0.31749      0.32441    0.31822
Table 5. Comparison of the predicted values for u(x, y) at Re = 500 with other solutions reported in [10][11]

             Present                          Bahadir    Jain & Holla
(x, y)       M=N=21    M=N=41    M=N=1026     M=N=21     M=N=21     M=N=41
(0.15,0.1)   0.98095   0.96204   *            0.96650    0.95691    0.96066
(0.3,0.1)    1.10408   1.02011   0.96937      1.02970    0.95616    0.96852
(0.1,0.2)    0.83719   0.84143   0.84441      0.84449    0.84257    0.84104
(0.2,0.2)    0.86041   0.86410   0.86915      0.87631    0.86399    0.86866
(0.1,0.3)    0.67189   0.67629   0.67877      0.67809    0.67667    0.67792
(0.3,0.3)    0.76252   0.76904   0.77406      0.79792    0.76876    0.77254
(0.15,0.4)   0.53886   0.54464   *            0.54601    0.54408    0.54543
(0.2,0.4)    0.57882   0.58476   0.58768      0.58874    0.58778    0.58564
* Not on the grid point.
Table 6. Comparison of the predicted values for v(x, y) at Re = 500 with other solutions reported in [10][11]

             Present                          Bahadir    Jain & Holla
(x, y)       M=N=21    M=N=41    M=N=1026     M=N=21     M=N=21     M=N=41
(0.15,0.1)   0.09581   0.08829   *            0.09020    0.10177    0.08612
(0.3,0.1)    0.12356   0.09409   0.07697      0.10690    0.13287    0.07712
(0.1,0.2)    0.17908   0.17894   0.17889      0.17972    0.18503    0.17828
(0.2,0.2)    0.16351   0.16254   0.16262      0.16777    0.18169    0.16202
(0.1,0.3)    0.26224   0.26194   0.26175      0.26222    0.26560    0.26094
(0.3,0.3)    0.21580   0.21585   0.21621      0.23497    0.25142    0.21542
(0.15,0.4)   0.31570   0.31506   *            0.31753    0.32084    0.31360
(0.2,0.4)    0.29940   0.29907   0.29894      0.30371    0.30927    0.29776
* Not on the grid point.
Acknowledgments. The computing facilities and financial support provided by the National Centre for High Performance Computing (NCHC) in HsinChu, Taiwan, are greatly appreciated.
References

[1] OpenMP home page, OpenMP: simple, portable, scalable SMP programming, http://www.openmp.org
[2] Kruger, J., Westermann, R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Trans. Graphics 22(3), 908–916 (2003)
[3] Goodnight, N., Woolley, C., Lewin, G., Luebke, D., Humphreys, G.: A multigrid solver for boundary value problems using programmable graphics hardware. Graphics Hardware, 1–11 (2003)
[4] Harris, M.J.: Fast fluid dynamics simulation on the GPU. In: GPU Gems, ch. 38, pp. 637–665 (2004)
[5] Hagen, T.R., Lie, K.A., Natvig, J.R.: Solving the Euler equations on graphics processing units. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3994, pp. 220–227. Springer, Heidelberg (2006)
[6] Brandvik, T., Pullan, G.: Acceleration of a two-dimensional Euler flow solver using commodity graphics hardware. In: Proc. Inst. Mech. Engineers, Pt C: J. Mech. Engrg. Sci., vol. 221(12), pp. 1745–1748 (2007)
[7] Brandvik, T., Pullan, G.: Acceleration of a 3D Euler solver using commodity graphics hardware. In: AIAA Paper 2008-607, 46th AIAA Aerospace Sciences Meeting and Exhibit (January 2008)
[8] Corrigan, A., Camelli, F., Löhner, R., Wallin, J.: Running unstructured grid based CFD solvers on modern graphics hardware. In: AIAA Paper 2009-4001, 19th AIAA Computational Fluid Dynamics (June 2009)
[9] Sheu, T.W.H., Wang, S.K., Lin, R.K.: An implicit scheme for solving the convection-diffusion-reaction equation in two dimensions. Journal of Computational Physics 164(1), 123–142 (2000)
[10] Bahadir, A.R.: A fully implicit finite-difference scheme for two-dimensional Burgers’ equations. Appl. Math. Comput. 137(1), 131–137 (2003)
[11] Jain, P.C., Holla, D.N.: Numerical solution of coupled Burgers’ equations. Int. J. Numer. Meth. Eng. 13, 213–222 (1978)
[12] PGI home page, PGI recommended Default Flags, http://www.pgroup.com/support
Support for OpenMP Tasks on Cell Architecture Qian Cao, Changjun Hu, Haohu He, Xiang Huang, and Shigang Li University of Science and Technology Beijing, 100083 Beijing, China [email protected]
Abstract. The OpenMP task is the most significant feature of the new specification; it provides a way to handle unstructured parallelism. This paper presents a runtime library of the task model on the Cell heterogeneous multicore, which attempts to maximally exploit the architectural advantages. Moreover, we propose two optimizations: an original scheduling strategy and an adaptive cut-off technique. The former combines the breadth-first with the work-first scheduling strategy, while the latter adaptively chooses the optimal cut-off technique, between the maximum number of tasks and the maximum task recursion level, according to application characteristics. Performance evaluations indicate that our scheme achieves a speedup factor from 3.4 to 7.2 compared to serial executions. Keywords: Task; OpenMP; parallel; Cell architecture.
1 Introduction

Modern processors are now moving to multicore architectures in order to extract more performance from the available chip area. Heterogeneous multicores take one more step along the power-efficiency trend. The Cell Broadband Engine (Cell BE) is a representative heterogeneous multicore. It comprises a conventional Power Processor Element (PPE) that controls eight Synergistic Processing Elements (SPEs). The PPE has two levels of cache, while the SPEs have no caches but each has 256 KB of local store (LS). The PPE can access main memory, while an SPE operates directly only on its LS.

With ever-increasing hardware complexity, modern applications are getting more complex. Irregular and dynamic structures, such as unbounded loops and recursive kernels, are widely used. To solve such problems, many mainstream programming models [1-5] use tasks as a high-level abstraction. OpenMP is a widely used programming model, and the OpenMP 3.0 specification [6] has shifted from a thread-centric to a task-centric execution model. It adds a new task model which allows the programmer to specify tasks explicitly. Explicit tasks are useful for expressing unstructured parallelism and dynamically defined units of work.

In this paper, we implement a task parallelism mechanism on the Cell processor. Considering that the Cell processor has a separate control core and accelerator cores, we propose an original strategy to maximally exploit the advantages of the heterogeneous multicore: the control core creates, destroys and distributes tasks, while the accelerator cores execute them. Moreover, the implementation conforms to the OpenMP 3.0 specification.
To further improve performance, two optimization approaches are proposed. The first is a combination of the work-first and breadth-first scheduling strategies. It not only reduces the number of communications between the PPE and the SPEs, but also improves load balance. The second optimization is an adaptive cut-off technique, which dynamically adjusts the optimal cut-off during application execution. The experimental results indicate that our task implementation combined with the optimizations achieves a speedup factor from 3.4 to 7.2 compared to serial executions, and it outperforms the XLC RTL in most kernels. Moreover, it achieves speedup factors comparable to the Nanos library and the Intel work-queue in most benchmarks, and even better ones in some kernels.

The rest of the paper is organized as follows. The task implementation on Cell is presented in Section 2. Section 3 describes the optimizations. Section 4 shows the evaluation results. Related work is presented in Section 5, and the last section concludes the paper.
2 Task Design and Implementation on Cell BE

2.1 Design of Task Queues

The runtime library sticks to the OpenMP 3.0 standard. The design of the task queues is shown in Fig. 1.
[Fig. 1. Task queues model (the global queue GQ, holding tasks Tg1-Tg5, and the per-SPE local queues LQ0-LQ7, each holding tied tasks Tl1, Tl2, Tl3, ..., all reside in main memory)]
#pragma omp parallel
#pragma omp single
{
    #pragma omp task
    { Task0 }
    #pragma omp task untied
    { Task1 }
    …
}

Fig. 2. A code segment including task constructs
We separate the conventional task queues into local task queues (LQs) for the SPE threads and a global task queue (GQ) shared by all threads. The LQs are located in main memory, since the LS is limited; the GQ is also located in global memory. A task tied to a thread is put into the corresponding LQ when the task is suspended, and only the thread to which the task is tied may resume its execution. To accelerate access to an LQ, we store the task id of every task item in the LQ. The task id is unique and is assigned dynamically at run time. Furthermore, we save a pointer to the corresponding task item in the LQ, which works because the Cell processor has a globally mapped address space covering global memory and the local stores. The GQ stores all suspended untied tasks and tasks whose execution has not yet been started by any thread. A task in the GQ can be resumed by any available SPE thread.
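As an illustration of this layout, the task descriptors and queues might be declared as in the following C sketch. The field and type names are assumptions, not the authors' code; the essential points taken from the text are the unique run-time task id, the tied/untied attribute, the saved context holding the resume breakpoint, and an effective address usable by the SPEs thanks to the globally mapped address space.

    #include <stdint.h>

    typedef struct task {
        int          id;         /* unique, assigned dynamically at run time */
        int          tied;       /* 1 = tied to one SPE thread, 0 = untied   */
        int          owner_spe;  /* meaningful only for tied tasks           */
        void        *context;    /* saved state, incl. the task breakpoint   */
        uint64_t     ea;         /* effective address of this item, reachable
                                    by SPEs through the global mapping       */
        struct task *next;
    } task_t;

    typedef struct {             /* both the GQ and the LQs live in main
                                    memory, since the LS is limited          */
        task_t *head, *tail;
    } task_queue_t;

    task_queue_t GQ;             /* untied and not-yet-started tasks         */
    task_queue_t LQ[8];          /* one per SPE, for suspended tied tasks    */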
To describe the mechanism clearly, a code segment including task constructs is given in Fig. 2, and Fig. 3 illustrates the detailed working principles. We assume that SPE3 encounters the first task construct. It sends a signal to the PPE, informing the PPE to generate Task0. When the PPE receives the signal, it generates a new tied task Task0 and puts it into the GQ; the tied attribute of Task0 is recorded in the task context. The generation strategy is described in Fig. 3 (a). Analogically, the PPE generates an untied task Task1 and puts it into the GQ.
[Fig. 3. Working principles of task queues (panels (a)-(d), showing SPE1-SPE3, the PPE, the global task queue, and local task queues 1 and 2)]
As shown in Fig. 3 (b), the PPE informs the SPEs that a new task is ready to be fetched after the task generation. The SPEs, which are looping infinitely, receive the signal and fetch the new task; the first SPE that requests the new task starts executing it. Whether an SPE tries to fetch the new task depends on its current execution status and working ability. Here we assume that SPE1 starts to execute Task0, while SPE2 starts to execute Task1. SPE1 and SPE2 have their own LQs.

In Fig. 3 (c), we suppose that SPE1 encounters a task scheduling point and is to suspend Task0 in order to start a new task or resume a previously suspended one. Under such conditions, SPE1 stops the execution of Task0 and puts it into LQ1. Meanwhile, SPE1 sends a signal to the PPE, informing it of its current execution status. Task0 is a tied task, which can be resumed only by SPE1; the task breakpoint is kept in the task context.

In Fig. 3 (d), we suppose that SPE2 encounters a task scheduling point and is to suspend Task1 in order to start a new task or resume a previously suspended one. Under such conditions, SPE2 stops the execution of Task1 and puts it into the GQ. Meanwhile, SPE2 sends a signal to the PPE to report its current execution status. Task1 can be resumed by any available thread running on an SPE.

2.2 System Implementation

The detailed system implementation of task parallelism is illustrated in Fig. 4. On the PPE side, the PPE initializes the runtime system when it encounters the parallel construct. The PPE first creates the SPE threads and loads the SPE runtime. Then it creates a GQ and 8 LQs for the 8 SPEs, and sends the entry address of each LQ to the corresponding SPE.
[Fig. 4. Overall implementation of the task parallelism mechanism on Cell BE (flow chart: the PPE creates SPE threads and work items, creates tasks into the GQ on "task create" signals, and waits for barrier signals; each SPE loops infinitely, fetching work items from the head of its LQ and then from the GQ, writing suspended tied tasks back to the LQ head and untied tasks to the GQ head, and finally sending a barrier signal)]
The PPE sends signals to the SPEs after partitioning the work items. Then the PPE enters an infinite loop, waiting for signals from the SPEs. When it receives a task-creation signal from any SPE, the PPE creates a new task and puts the newly generated task into the GQ; it then sends a signal to the available SPEs. The PPE continues executing until the final barrier signal arrives from all SPEs.

On the SPE side, all SPEs execute an infinite loop, waiting for signals from the PPE. After the creation of a new task, the PPE sends a signal containing the necessary information about the newly generated task. The first SPE to fetch the task item from the GQ starts executing the new task. An SPE invokes different outlined procedures according to the task type, such as for, sections, or task. If an SPE encounters a task construct while it is executing a task, a new task needs to be created; under such conditions, the SPE sends a signal to the PPE, notifying the PPE to create the task. As specified above, the PPE creates the task and sends a signal to the SPE. Whether the SPE continues executing the previous task or goes into the infinite loop to wait for a new task depends on the task scheduling strategy. Each SPE sends the final barrier signal to the PPE after the execution of a parallel region, so that the PPE can continue with the following work.
In our implementation there is one noteworthy point, marked in Fig. 4 by a numeral in brackets. If there is no if clause, or the scalar expression of the if clause is nonzero, either the newly generated task or the parent task may be executed first; this order is left unspecified in the OpenMP 3.0 specification. A small illustrative fragment is given below. There are two main task scheduling strategies, breadth-first (BF) and work-first (WF). In Fig. 4, BF branches to the right and WF to the left. In our baseline version, the scheduling is set to BF.
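As a hedged illustration of the point marked (1), consider the following C fragment; cond, child_work and parent_work are hypothetical names, and the behavior described in the comments is that of the OpenMP 3.0 if clause.

    extern void child_work(void), parent_work(void);

    void example(int cond)
    {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task if (cond)
            {
                child_work();  /* cond == 0: the task is undeferred and runs
                                  immediately on the encountering thread     */
            }
            parent_work();     /* cond != 0, or no if clause at all: whether
                                  child_work() or parent_work() runs first is
                                  left to the runtime, as discussed above    */
        }
    }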
3 Optimizations

As mentioned above, there are two main task scheduling strategies, BF and WF. In brief, WF works well when data locality is good, while in irregular data-access situations BF performs better. We propose a task scheduling strategy that combines BF and WF and maximally exploits the advantages of both. On one hand, our strategy adopts both a GQ and LQs, which is similar to BF; the BF scheme significantly reduces the number of communications between the PPE and the SPEs, as illustrated in Fig. 4. On the other hand, our strategy introduces the work-stealing technique from WF. This allows SPE threads with little workload to steal tasks from threads with heavy workloads, and thus the problem of unbalanced workloads is solved. An SPE fetching a task now follows the principle: first its own LQ, then the GQ, and finally stealing from others (a sketch of this fetch order is given after Fig. 5).

In order to avoid creating excessive tasks, a cut-off technique is introduced, which reduces the task-creation overhead of the runtime system. There are two simple but effective cut-off policies: a maximum number of tasks (max-task) and a maximum task recursion level (max-level) [7]. We have observed that the best cut-off technique depends on application characteristics, and thus we present an adaptive cut-off, which dynamically chooses the optimal cut-off between max-task and max-level.

/* 1st time */
Parallel region begin
    Cut-off := max-task;
    T1_maxtask := Current_Time;
    The parallel region is executed;
    T2_maxtask := Current_Time;
    Time_maxtask := T2_maxtask - T1_maxtask;
Parallel region end.

/* 2nd time */
Parallel region begin
    Cut-off := max-level;
    T1_maxlevel := Current_Time;
    The parallel region is executed;
    T2_maxlevel := Current_Time;
    Time_maxlevel := T2_maxlevel - T1_maxlevel;
Parallel region end.

if Time_maxtask > Time_maxlevel
    Optimal_Cut-off := max-level;
else
    Optimal_Cut-off := max-task;

Fig. 5. Algorithm of the adaptive cut-off technique
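The fetch order referred to above might look like the following C sketch, reusing the task_t and queue declarations sketched in Section 2.1; queue_pop_head and queue_steal_tail are hypothetical helpers, and in keeping with the tied-task rule only untied items would actually be eligible for stealing.

    extern task_t *queue_pop_head(task_queue_t *q);    /* hypothetical */
    extern task_t *queue_steal_tail(task_queue_t *q);  /* hypothetical */

    task_t *fetch_task(int my_spe)
    {
        task_t *t;
        if ((t = queue_pop_head(&LQ[my_spe])) != NULL)   /* 1. own LQ   */
            return t;
        if ((t = queue_pop_head(&GQ)) != NULL)           /* 2. the GQ   */
            return t;
        for (int v = 0; v < 8; v++)                      /* 3. stealing */
            if (v != my_spe &&
                (t = queue_steal_tail(&LQ[v])) != NULL)  /* untied only */
                return t;
        return NULL;                 /* nothing to do: keep looping     */
    }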
In most applications a parallel region is invoked many times during program execution, which enables our runtime to learn and adapt to characteristics specific to that parallel region. The algorithm of our adaptive cut-off choosing policy is given in Fig. 5. We use the first two executions of the parallel region to test performance: in the first execution the cut-off is set to max-task, and when the parallel region is invoked again we use max-level. We obtain two execution times, Time_maxtask and Time_maxlevel, and compare them to estimate which cut-off performs better. In subsequent executions, we apply the better one to the parallel region.
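The same first-two-invocations logic can be expressed in C roughly as follows; the function and variable names are ours, and a production runtime would keep this state per parallel region rather than globally.

    enum cutoff { MAX_TASK, MAX_LEVEL };

    static enum cutoff optimal_cutoff;
    static double time_maxtask, time_maxlevel;
    static int    invocations = 0;

    enum cutoff select_cutoff(void)          /* called on region entry */
    {
        if (invocations == 0) return MAX_TASK;   /* 1st run: probe max-task  */
        if (invocations == 1) return MAX_LEVEL;  /* 2nd run: probe max-level */
        return optimal_cutoff;                   /* later runs: the winner   */
    }

    void record_region_time(double elapsed)  /* called on region exit  */
    {
        if (invocations == 0)
            time_maxtask = elapsed;
        else if (invocations == 1) {
            time_maxlevel = elapsed;
            optimal_cutoff = (time_maxtask > time_maxlevel) ? MAX_LEVEL
                                                            : MAX_TASK;
        }
        invocations++;
    }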
4 Evaluation

In our experiments, performance is measured with several kernel applications: Alignment, N-Queens, SparseLU, Multisort, FFT and Strassen. These benchmarks have been used previously in the Intel work-queue model [8], the Nanos system [9] and the IBM XL parallelizing compilers [10]. The input parameters for the different benchmarks are summarized in Table 1.

Table 1. Input parameters for the different benchmarks

Application   Input parameters
Alignment     100 protein sequences
N-Queens      a chessboard size of 14*14
SparseLU      matrix size of 5000*5000, submatrix size of 100*100
Multisort     array size of 128 MB
FFT           array of 32M complex numbers
Strassen      matrix size of 1280*1280
In order to illustrate the optimization solutions, Fig. 6 shows the normalized speedups due to the optimized scheduling strategy and the adaptive cut-off technique. The baseline is the execution speed on one SPE using the BF strategy. In Fig. 6, "n SPE_before" and "n SPE_after" respectively represent the application execution on n SPEs before and after the optimizations. On the whole, the proposed optimizations achieve noticeable performance improvements, and with an increased number of SPEs the improvement is more pronounced. The first optimization, namely the scheduling strategy combining WF and BF, effectively achieves load balance through work stealing. Not surprisingly, Multisort and SparseLU, which suffer from severe load imbalance, benefit more from the optimizations than the other benchmarks.

The benchmarks are used to evaluate the task model in four environments. The first is Intel's work-queue; the compiler used is the Intel C compiler 9.1. We refer to this implementation as "Intel work-queue". Alignment is not evaluated in this environment since it could not be implemented effectively with Intel's work-queue.
[Fig. 6. Speedups due to the optimizations (bars for 1, 2, 4 and 8 SPEs, before and after the optimizations, for Alignment, Multisort, SparseLU, N-Queens, FFT and Strassen)]

[Fig. 7. Normalized speedups (speed-up versus number of processors, 1 to 8, for each benchmark, comparing Intel work-queue, Nanos RTL, XLC RTL and Task on Cell)]
The second version is the task model implementation in Nanos. The applications are compiled with the Mercurium compiler (as a source-to-source compiler) with the Intel C compiler 9.1 as the backend; we refer to it as "Nanos RTL" below. The third is the OpenMP task implementation in the IBM XL compilers, referred to as "XLC RTL"; these benchmarks are compiled with the IBM XL compiler V10.1 with -O3. The last one is our task implementation on Cell, referred to as "Task on Cell".

The experiment is conducted on a Cell BE blade [11] with two Cell processors running at 3.2 GHz and 1 GB of system memory. The PPE has a 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 512 KB L2 cache. In this experiment, the programs are bound to one Cell processor to avoid NUMA effects. The system runs Fedora 9 (Linux kernel 2.6.25-14), and our programs are compiled with the Cell SDK 3.1.
Fig. 7 illustrates the speedup for all benchmarks, with the sequential execution speed as the baseline; the serial version of each application is compiled with the Intel C compiler. The x-coordinate in Fig. 7, "number of processors", represents the number of processors in the former three task environments, while it represents the number of SPEs in our implementation.

On the whole, our task model on Cell achieves almost linear speedups for most benchmarks. Our version shows good scalability because every thread executes on an exclusive SPE. In particular, Alignment and SparseLU obtain even better scalability than the others since they contain a greater amount of parallel work. Additionally, in most of the kernels our version achieves performance similar to the Nanos RTL version and the Intel work-queue; in Alignment it even performs a little better. Only in FFT does it suffer a slight performance degradation. In our Task on Cell version, the execution information can be collected by the control core, the PPE, and this enables a proper scheduling strategy and load balance.

When we compare the Task on Cell environment with the XLC RTL, we can see that the former obtains speedups almost the same as the latter for the kernels Alignment, N-Queens and SparseLU. Our implementation achieves a much better speedup for Multisort, because the XLC RTL implements untied tasks as tied tasks: when task generation reaches the leaf nodes, all tasks are already bound to a specific thread, which obviously causes severe load imbalance. The Task on Cell version allows untied tasks to be executed by any available SPE thread, as presented in Section 2, so it obtains a better speedup. As the number of processors increases, different versions may obtain different performance improvements. Nevertheless, the Cell processor has only 8 SPEs; therefore, the benchmarks are evaluated on 1 to 8 processors in our experiments.
5 Related Work

There have been several proposals for expressing dynamic and irregular parallelism in programming languages. The Intel work-queue model [8] was the first to add dynamic task generation to OpenMP. This proprietary extension allows hierarchical generation of tasks by nesting taskq constructs, and synchronization of descendant tasks is controlled by implicit barriers at the end of taskq constructs.

Cilk [1] is a programming language and runtime system developed at MIT to express task-level parallelism. It is an effective extension of C for multithreading. It keeps all workers busy by creating plenty of logical threads, and a work-stealing strategy that steals the oldest task is adopted. Nevertheless, it lacks sections and loop constructs.

Intel Threading Building Blocks (TBB) [2] is a C++ runtime library without special compiler support. It provides support for task-based programming and loop-level parallelism: the user specifies tasks, and the library maps the logical tasks onto physical threads, exploiting natural cache locality. However, programmers who use TBB need to be familiar with C++.

The Task Parallel Library (TPL) developed by Microsoft [12,13] supports loop-level parallelism in a manner similar to OpenMP. It supports parallel constructs like
parallel for, as well as other constructs such as task and future, and it can be seen as an embedded domain-specific language.

Dynamic sections [14] were presented as an extension to the OpenMP sections construct: a thread that detects a section instance inserts the section into a queue, and the instances are executed by a team of threads.

The Mercurium compiler [8], utilizing the Nanos runtime [9], contains the first prototype implementation of OpenMP 3.0 tasks [15].

The OpenMP 3.0 task model has been implemented in the IBM XL compilers [10]. This implementation includes a compiler and a runtime library: the former transforms the input code into multithreaded code with calls to the runtime library, while the latter supports thread management, synchronization, and scheduling. Nevertheless, this implementation does not schedule tasks by work stealing.

Addison et al. [16] integrated the new OpenMP tasking model into the OpenUH compiler framework. Their work focuses on compiler front-end support for tasks, compiler translation, and extensions to the runtime library.

Specifically for the Cell, there have been several proposals [17, 18] for tasks. Rico et al. [17] analyzed the performance of Cell Superscalar in terms of its scalability and showed that the low performance of the PPE limits the scalability. Bellens et al. [18] presented CellSs, a flexible task-based programming model for heterogeneous architectures. It provides a higher-level abstraction and allows users to program the Cell BE using OpenMP-like annotations for functions.
6 Conclusions

We have designed a runtime library that implements the OpenMP task model on the Cell processor and maximally exploits the advantages of the heterogeneous architecture. Two optimizations are proposed: a scheduling strategy that combines WF with BF, and an adaptive cut-off technique. Evaluations indicate that our implementation matches the Intel work-queue and Nanos RTL in speedup factor, and even outperforms them in some applications.

Acknowledgments. The research is partially supported by the Hi-Tech Research and Development Program (863) of China under Grant No. 2008AA01Z109, the Key Project of the Chinese Ministry of Education under Grant No. 108008, and the National Key Technology R&D Program under Grant No. 2006BAK11B00.
References

1. Frigo, M., Leiserson, C.E., Randall, K.H.: The Implementation of the Cilk-5 Multithreaded Language. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 212–223. ACM Press, New York (1998)
2. Reinders, J.: Intel Threading Building Blocks. Technical report, O’Reilly Media Inc. (2007)
3. T.X.D. Team: Report on the Experimental Language X10. Technical report, IBM (2006)
4. Chamberlain, B., Callahan, D., Zima, H.: Parallel programmability and the Chapel language. J. Int. J. High Perform. Comput. Appl. 21, 291–312 (2007)
5. The Fortress Language Specification. Version 1.0 B (2007)
6. OpenMP Application Program Interface, Version 3.0. OpenMP Architecture Review Board (2008)
7. Duran, A., Corbalán, J., Ayguadé, E.: Evaluation of OpenMP task scheduling strategies. In: Eigenmann, R., de Supinski, B.R. (eds.) IWOMP 2008. LNCS, vol. 5004, pp. 101–110. Springer, Heidelberg (2008)
8. Shah, S., Haab, G., Petersen, P., Throop, J.: Flexible Control Structures for Parallelism in OpenMP. In: 1st European Workshop on OpenMP, pp. 1219–1239 (1999)
9. Teruel, X., Martorell, X., Duran, A., Ferrer, R., Ayguadé, E.: Support for OpenMP Tasks in Nanos v4. In: Proc. Conf. Center for Advanced Studies on Collaborative Research, pp. 256–259. ACM Press, New York (2007)
10. Teruel, X., Unnikrishnan, P., Martorell, X., et al.: OpenMP tasks in IBM XL compilers. In: Proc. of the 2008 Conference of the Center for Advanced Studies on Collaborative Research, pp. 207–221. ACM Press, New York (2008)
11. Altevogt, P.: IBM BladeCenter QS21 Hardware Performance. IBM Technical White Paper WP101245 (2008)
12. Leijen, D., Hall, J.: Optimize Managed Code for Multi-Core Machines. J. MSDN Magazine, 1098–1116 (2007)
13. Leijen, D., Schulte, W., Burckhardt, S.: The design of a task parallel library. In: International Conference on Object Oriented Programming, Systems, Languages and Applications, pp. 227–242. ACM Press, New York (2009)
14. Balart, J., Duran, A., González, M., Martorell, X., et al.: Nanos Mercurium: A Research Compiler for OpenMP. In: 6th European Workshop on OpenMP, pp. 103–109 (2004)
15. Ayguadé, E., Duran, A., Hoeflinger, J., et al.: An Experimental Evaluation of the New OpenMP Tasking Model. In: Adve, V., Garzarán, M.J., Petersen, P. (eds.) LCPC 2007. LNCS, vol. 5234, pp. 63–77. Springer, Heidelberg (2008)
16. Cody, A., James, L., Lei, H., Barbara, C.: OpenMP 3.0 Tasking Implementation in OpenUH. In: 2nd Open64 Workshop at CGO (2009)
17. Rico, A., Ramirez, A., Valero, M.: Available task-level parallelism on the Cell BE. J. Scientific Programming 17, 59–76 (2009)
18. Bellens, P., Perez, J.M., Badia, R.M., Labarta, J.: CellSs: a programming model for the Cell BE Architecture. In: Proc. of the 2006 ACM/IEEE Conference on Supercomputing. ACM Press, New York (2006)
19. Certner, O., Li, Z., Palatin, P., et al.: A Practical Approach for Reconciling High and Predictable Performance in Non-Regular Programs. In: 1st Workshop on Programmability Issues for Multi-Core Computers, pp. 740–745. ACM Press, New York (2008)
20. Duran, A., Corbalán, J., Ayguadé, E.: An adaptive cut-off for task parallelism. In: Proc. of the 2008 ACM/IEEE Conf. on Supercomputing, pp. 1–11. IEEE Press, Los Alamitos (2008)
21. Martorell, X., Labarta, J., Navarro, N., Ayguadé, E.: Nano-Threads Library Design, Implementation and Evaluation. Technical Report UPC-DAC-1995-33, DAC/UPC (1995)
22. Cong, G., Kodali, S., Krishnamoorthy, S., et al.: Solving large, irregular graph problems using adaptive work-stealing. In: Proc. of the International Conference on Parallel Processing, pp. 536–545. IEEE Press, New York (2008)
A Novel Algorithm for Faults Acquiring and Locating on Fiber Optic Cable Line Ning Zhang1, Yan Chen2, Naixue Xiong3, Laurence T. Yang4, Dong Liu2, and Yuyuan Zhang2 1
PLA Communication Network Technology Management Center, Beijing, China 2 PLA Academy of Communication Command, Wuhan, China [email protected] 3 Department of Computer Science, Georgia State University, Atlanta, USA [email protected] 4 Department of Computer Science, St. Francis Xavier University, Canada [email protected]
Abstract. The fiber optic communication transmission network is the basis of communication networks, responsible for a large number of long-distance transmissions of voice, data, images, and other traffic. Man-made construction, natural disasters and other unexpected events are important factors in fiber optic cable line blocking, and their occasional and sudden onset makes such blocking unpredictable. This paper proposes an algorithm for acquiring and locating faults on fiber optic cable lines, which can effectively reduce the search and processing time for a fault point, and establishes a fault point database. Keywords: Fiber Optic Cable, Fault Location, Algorithm Research.
1 Introduction

As large numbers of fiber optic cable lines are used in communication, fiber optic cable lines block frequently. The basic tasks of fiber optic cable line maintenance are to prevent blocking and to carry out rapid troubleshooting. In optical fiber communication systems, the blocking of optical fiber lines is the main cause of communication interruption. In engineering construction and maintenance work, ensuring the reliability and security of optical fiber communication systems is the most important task [26-28]. Many factors may cause communication cable faults, including a variety of natural factors and human damage. In dealing with fiber optic line blocking, if we can find the fault points more quickly and accurately, we can reduce the influence and loss.

A fiber optic cable fault is an optical communication system blocking caused by the fiber optic cable line itself. Considering the cause of the optical communication system blocking, there are two main forms of fiber optic line failure: fiber disruption and loss increase. Considering the difficulty of finding the fault point on a fiber optic cable line, there are obvious failures and hidden failures.

Obvious failures are mostly generated by external influences. The main causes include mining, fire, flooding, lightning, crushing, theft and so on. They can be found easily.
First, the OTDR instrument is used to get the approximate distance between the fault point and the test point/station, as well as the nature of the failure. Then maintenance workers can use the routing information to identify the approximate geographic location of the fault point. Finally, maintenance workers search along the fiber optic cable line to check whether there is digging or construction on the ground, an obvious pull-off, theft or fire on the aerial cable lines, or construction above the manholes and pipelines. By looking for these unusual conditions, we can find the exact location of the fault points.

Hidden failures are caused by external influences of nature. There are no obvious external signs along the fiber optic lines and no obvious abnormal changes on the roads, so it is difficult to find the exact location of the fault points. On the fiber optic cable lines, we cannot inspect the abnormal situations visually, such as lightning, ant damage, mouse damage, shooting at the aerial cable lines, pipeline collapse, temperature influence, aging of the optical fiber, terrain changes, pest bites, vibration and other damage to the fiber optic lines. In the search for hidden failures, if we cannot find the exact location of the fault points, it may cause unnecessary waste of money and human resources, such as greatly increased earthwork digging for buried fiber optic cable and greatly increased operations on the clips of the aerial cable lines, and it also extends the blocking time. Therefore, getting the exact location of the fault points is the key to dealing with fiber optic cable failures.

From the above analysis of fiber optic cable blocking we can conclude that most blockings are caused by natural disasters and man-made damage; only a few are caused by the quality of the fiber optic cables themselves. Although we can use an OTDR (Optical Time Domain Reflectometer) to measure the length of the fiber optic cable line between a fault point and the station, we still cannot get the exact location of the fault point, because most fiber optic cable lines are buried underground or under the sea bottom, and sometimes they are aerial cables. The routes of fiber optic cable lines are always curves, and in some areas there is a lot of reserved line, so the cable length does not equal the distance between the fault point and the stations. If a blocking is caused by a natural disaster or external force, it is easy to find the fault point; but if it is a hidden failure, it is hard to find the fault point from the OTDR test results alone. Therefore, we need to find the relationship between the measured length of the fiber optic cable lines and the distance between the fault points and the stations, which is the basis for finding the fault points.

In this paper, we explore a location algorithm for fault points on fiber optic cable lines, which is quite different from other algorithms, and we establish a GIS-based fiber optic line database with detailed descriptions of line routings. We then use the previous fault points to calculate the fault interval. The rest of this paper is organized as follows: Section 2 presents related work on fault locating. Section 3 introduces our method. Section 4 presents simulations and performance analysis. Finally, Section 5 concludes the paper and highlights future work.
2 Related Work

Some papers research the causes of fiber optic cable line blocking. They analyze the characteristics of different blockings, such as fiber optic connector failure and cable failure in the middle section.
They propose detailed troubleshooting principles, methods and specific processes, and also cite practical examples. Some papers analyze the causes of fiber optic cable connector failures that repeatedly occur on restored lines. Many problems can lead to increased loss, such as improper operation, improper cable selection, and excessive fiber optic cable connectors on the lines; these papers then propose preventive measures. There are also papers researching the digitization and visualization of fiber optic cable routing, which make full use of GPS technology and digital maps in order to get detailed routing descriptions of fiber optic cable lines. But the line information is relatively simple, and the elements should be chosen according to the specific application.

The OTDR is the main tool for fault point location. Some papers analyze the error of OTDR measurement and location. In optical cable fault location tests, the accuracy of the fault location is directly related to the accuracy of the OTDR instrumentation: if the instrument parameters, the choice of the instrument scale range or the cursor are set incorrectly, the test results will be in error. Causes of error include measuring reference error, cable structure differences, instrument operator error, differences between instruments, an incorrect optical refractive index value and so on. Some papers present a number of formulas to calculate the error. There are also studies on how to improve the measurement accuracy of the OTDR; some use mathematical transformations to analyze the location of the fault points. In order to get an exact test result, some discuss the OTDR setup and test curve, which are very practical approaches and have played an active role in the rush-repair of fault points.

For hidden failures, not only is the OTDR used to test the distance between the fault point and the station, but a variety of reference points and signs on the cable lines are also used to get the exact fault location on the road. This method uses the loss steps at the fiber optic cable welding points and the signs outside the fiber optic cable to get the exact location of the fault point. The nearest joint point is called the adjacent joint point. In order to get the distance between the connector and the fault point, this method first obtains the cable length between the adjacent joint point and the fault point, and then subtracts the reserved fiber optic cable length using the completion information. Although all fiber optic cable products carry length marks, the marks may become vague and shed off because of rubbing during construction, and it is difficult to maintain them. In order to get the fault point location, many reference points can be used, including the cable length in connector boxes and the fiber welding points. A series of formulas is also provided to locate fault points. All these methods use empirical data to estimate fault points, such as the relative length of the fiber optic cable lines, the reserved length in the connector boxes, characteristics of the fiber optic cable lines, signs of the lines at the adjacent joint point and so on. There are also some algorithms for fault point location which give complex formulas to calculate the length between fault points and test points; the length is associated with the length tested by the OTDR, the length of reserved cable lines, the number of optical cable joints, the natural curvature of the optical cable, and other parameters.
All these parameters are hard to obtain, and their accuracy is difficult to guarantee. Qualitative estimates are always used, so there are large arithmetic errors.
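For reference, the length an OTDR reports follows from the round-trip time of the backscattered pulse and the fibre's group refractive index, so an incorrect group-index setting scales the result directly, which is one of the error sources listed above. A minimal C sketch (the function name is ours):

    /* One-way fibre length from OTDR round-trip time t and group index n_g:
       d = c * t / (2 * n_g).                                                */
    double otdr_length(double round_trip_s, double n_group)
    {
        const double c = 299792458.0;   /* speed of light in vacuum, m/s */
        return c * round_trip_s / (2.0 * n_group);
    }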
3 The Faults Acquiring and Locating Algorithm

Because of project construction, natural disasters and other reasons, the fiber optic cable line information changes, and real-time updating of the data is very difficult. Therefore, accurate and complete line information is the basis of fault location. We describe the detailed line information in a database; as the line information changes constantly, a time element is added to the database. Dynamically updated line information and former fault points are used for fault location.

3.1 Database Construction

In order to quickly find fault points on fiber optic cable lines, we not only need appropriate equipment but also need to establish a basic routing information database of the fiber optic cable lines and a precise location database of the former fault points. The former database is used to determine the interval of the fault points, and the latter is used for refinement of the interval.

• Line Routing Information Database

In order to get the distance between any two fault points, we need to establish a GIS-based fiber optic cable management system. To realize the digitization and visualization of the fiber optic cable routing, we use GPS to get the geographical coordinates of identifications, manholes, stations and other related elements along the fiber optic cable lines, and mark them on a digital map. Fiber optic cable line information includes spatial coordinates and spatial topological relations. According to geometric form, it can be classified into three kinds of objects: point, line and surface. Each object has complex attributes, and there are complex topological relations between the objects. Most of the line information concerns point and line objects, and the pipeline information is the most important information on fiber optic cable lines. A pipeline object has many parameters; pipeline connections are very complicated, and an optical fiber can be connected to another optical fiber in a different fiber optic cable line.

Fiber optic cable line information includes geospatial data and equipment attribute data. We combine them for data entry, query, statistics and analysis, and then form special layers on the digital map. We store detailed routing information, including fiber optic cable set-up time, communication direction, laying mode, service type, maintenance unit, cable type, equipment type and so on. We can calculate the distance between any two points on the digital map from this coordinate information. We then compare the cable length tested by the OTDR with the line routing information, convert the cable length into the distance between the test point and the fault point on the ground, and finally locate the fault point.
• Fault Points Database
It includes the precise locations of the previous fault points and the precise lengths of cable lines between the fault points and their adjacent stations. We store the information of each fault, including recovery time, cause of the failure, fiber type, test equipment and other information; this information can help repair the lines. Once a fiber optic line blocks, we record the coordinates of the fault point and the lengths of cable lines between the fault point and the adjacent stations.
The coordinates are measured by GPS and the lengths are measured by the OTDR. If the routing information changes after a blocking, then in addition to adding the fault point information to the fault points database, we update the line routing information. Through data accumulation in the above two databases over some time, we will find the relationship between the geographic coordinates and the length of the fiber optic lines.

3.2 Interval Calculating

To facilitate the description of the interval calculating process, we take the model shown in Fig. 1 as an example. Fig. 1 includes three stations, two identifications, two previous fault points, and a new fault point.
Fig. 1. Interval calculating
Assume that fault point 101 is blocking now. L2' is the length of fiber optic line between the fault point and station 2 measured by the OTDR, and L3' is the length of fiber optic line between the fault point and station 3 measured by the OTDR. If we could measure the length of the fiber optic line underground, we could get the precise coordinates of the fault point; but we cannot, so we use the calculated distance instead. We can calculate the coordinates of nodes S1 and S2 on the digital map from the routing coordinates of the lines: the distance between node S1 and station 2 is L2', and the distance between node S2 and station 3 is L3'. Fiber optic lines are always curved and have some reserved length underground, while the locations of nodes S1 and S2 are calculated by accumulating the straight-line coordinates on the ground. So, considering L2' and station 2, the fault point must be between node S1 and station 2; considering L3' and station 3, the fault point must be between node S2 and station 3. We can conclude that the coordinates of the fault point (N,E)101 must be in the interval [S1, S2].

In Fig. 2, we use a simplified model for further explanation. The fiber optic lines are linear; the reserved lines are underground and invisible. Using the above-mentioned method, we can calculate the interval [S1, S2], and the fault point O must be in it. In formulas (1) and (2), AB, BS1, ED, DC and CS2 are calculated from the routing coordinates of the fiber optic lines.

S1: AB + BS1 = 300 m        (1)
S2: ED + DC + CS2 = 700 m   (2)
Fig. 2. A simplified model
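A minimal C sketch of how nodes such as S1 and S2 could be computed from the routing database: walk the stored route polyline from the station and stop once the OTDR-measured length has been covered. All names are ours; the point is that the walk accumulates straight-line ground distances, so the node always lies at or beyond the true fault point, which is why the fault is bracketed as argued above.

    #include <math.h>

    typedef struct { double n, e; } coord;      /* northing/easting, metres */

    static double seg_len(coord a, coord b)
    {
        return hypot(a.n - b.n, a.e - b.e);
    }

    /* Walk the route polyline from the station and return the point reached
       after 'cable_len' metres of ground distance (node S1 or S2 in Fig. 1). */
    coord locate_node(const coord *route, int npts, double cable_len)
    {
        double walked = 0.0;
        for (int k = 0; k + 1 < npts; k++) {
            double seg = seg_len(route[k], route[k + 1]);
            if (walked + seg >= cable_len) {    /* node lies on this segment */
                double f = (cable_len - walked) / seg;
                coord s = { route[k].n + f * (route[k + 1].n - route[k].n),
                            route[k].e + f * (route[k + 1].e - route[k].e) };
                return s;
            }
            walked += seg;
        }
        return route[npts - 1]; /* measured length exceeds the stored route */
    }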
3.3 Interval Refinement

In the solution process above, we only used the routing coordinates of the lines on the digital map. In order to refine the interval, we use the fault point database, because it contains not only the coordinates of previous fault points but also the cable lengths from these fault points to the nearest two stations. We can use the length of the lines underground instead of the distance on the ground. Figure 3 shows a specific interval refinement process.
Fig. 3. Interval refinement
For example, suppose we find a previous fault point F located between the interval [S1, S2] and station 3. The coordinates of fault point F are (N,E)F, the length of cable between fault point F and station 2 is L2'' (550 m), and the length of cable between fault point F and station 3 is L3'' (450 m). We can use L3'' instead of the distance between point F and station 3, which is much more precise. We thus get a new interval [S1, S3], smaller than [S1, S2], and the fault point O must be in it; we can easily find the fault point without a long-distance search. L3'' was measured by the OTDR when the blocking at F occurred. In formula (3), FC and CS3 are calculated from the routing coordinates of the fiber optic lines.

S3: L3'' + FC + CS3 = 700 m   (3)
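In code, the refinement amounts to replacing the unknown cable/ground relation up to F by the exact recorded cable length, and walking the ground route only for the remainder, as in the following hedged sketch that reuses locate_node from the previous sketch (argument names are ours):

    /* Refined bound S3: start from the recorded fault point F and walk the
       ground route only for the budget beyond F, i.e. L3' - L3'' above.    */
    coord refine_bound(const coord *route_from_F, int npts,
                       double otdr_len_new,   /* station to new fault (OTDR) */
                       double cable_len_to_F) /* station to F, from the DB   */
    {
        double remaining = otdr_len_new - cable_len_to_F;
        return locate_node(route_from_F, npts, remaining);
    }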
3.4 Data Update

As time goes by, the routing of the fiber optic cable lines changes because of construction along the lines and geographic changes. Some fault points in the database become invalid, and therefore we attach a time parameter to the fault points to determine which of them can still be used in the interval refinement. As shown in Figure 3, the latest fault point in the interval [A, S1] is used for the refinement of S1, and the latest fault point in the interval [E, S2] is used for the refinement of S2. Older fault points are only used for reference. If there is more than one fault point in the interval, we need to choose the most useful one; usually, the latest fault point has the highest priority. We must constantly update the line routing information database and the fault point database, changing the routing information of the fiber optic lines and removing the invalid fault points.
4 Simulations and Performance Analysis

In Figure 4, AB = BC = CD = DE = EF = 500 m. There are reserved lines at points B (100 m), C (50 m), D (50 m) and E (100 m). Points G1, G2, G3, G4, G5 and G6 are previous fault points; their locations are shown in Figure 4. Point O is a new fault point. By OTDR, the cable length between point O and station 2 is 1350 m, and the cable length between point O and station 3 is 1450 m. Through calculation with different selected reference points, the fault intervals are as shown in Table 1. Thus, when the information of a reference point is valid, selecting the reference point nearest to the fault point yields a relatively smaller fault interval, and selecting reference points at both ends of the fault point yields a smaller fault interval than selecting only one reference point.
Fig. 4. Line Model
Table 1. Relationship between Reference Point and Interval Size
Reference point   OS2    OS1    Interval size
null              150m   150m   300m
G1                150m   150m   300m
G2                50m    150m   200m
G3                0m     150m   150m
G4                150m   0m     150m
G5                150m   50m    200m
G6                150m   150m   300m
G1,G6             150m   150m   300m
G1,G5             150m   50m    200m
G1,G4             150m   0m     150m
G2,G6             50m    150m   200m
G2,G5             50m    50m    100m
G2,G4             50m    0m     50m
G3,G6             0m     150m   150m
G3,G5             0m     50m    50m
G3,G4             0m     0m     0m
Table 2. Influence of line changes on the fault interval calculation

Fault point   Change of line   OS2     OS1     Interval size   Result
G1            L(FG1)+30m       180m    150m    330m            Yes
G1            L(FG1)+20m       130m    150m    280m            Yes
G1            L(FG1)-180m      -30m    150m    120m            No
G2            L(FG2)+30m       80m     150m    230m            Yes
G2            L(FG2)-20m       30m     150m    180m            Yes
G2            L(FG2)-80m       -30m    150m    120m            No
G3            L(FG3)+30m       30m     150m    180m            Yes
G3            L(FG3)-20m       -20m    150m    130m            No
G4            L(AG4)+30m       150m    30m     180m            Yes
G4            L(AG4)-20m       150m    -20m    130m            No
G5            L(AG5)+30m       150m    80m     230m            Yes
G5            L(AG5)-20m       150m    30m     180m            Yes
G5            L(AG5)-80m       150m    -30m    120m            No
G6            L(AG6)+30m       150m    180m    330m            Yes
G6            L(AG6)-20m       150m    130m    280m            Yes
G6            L(AG6)-180m      150m    -30m    120m            No
In Table 2, L(FG1) denotes the cable length between point F and point G1. It can be seen that when the cable length from the reference point to its neighboring station increases due to construction or other reasons, the calculated fault interval may enlarge, but it can still be used to find the fault point. When the cable length from the reference point to the neighboring station gets shorter due to construction or other reasons, let L = (cable length from the fault point to the neighboring station) - (distance between the fault point and the neighboring station). When the reduction in cable length is greater than L, the fault point is not in the interval we calculated.
When the reduction in cable length is smaller than L, the fault point is still in the interval we calculated. Due to the uncertainty of old fault points, when we choose a reference point we should first consider the latest fault point recorded before the nearest old fault point.
5 Conclusion

The accuracy of the interval depends on the sampling accuracy of the fiber optic line routing and on the accuracy of the fault points in the database. With more sampling points on the fiber optic cable line routing, the fault interval will be smaller; with more accurate fault points, the interval refinement process is more effective. Therefore, updating the line information database and the fault point database is critical work, and it plays an important role in estimating the credibility of the results. If fiber optic cable line faults occur frequently at certain points or in certain regions, we need to reinforce the fiber optic cable lines there; if the loss or the joint loss increases in some regions, we should consider replacing the fiber core. Considering that there is still much room for improvement of the algorithm, our future task is to explore more powerful versions, in both efficiency and precision. One direction is to employ artificial intelligence to determine whether the time parameter has failed or whether a fault point is still usable.
References

1. Chen, Y.: OTDR Accurate Use of Optical Cable Lines to Find Obstacle Points. Telecom Engineering Technics and Standardization 21(5), 72–74 (2008)
2. Lu, G.: Analysis on the Methods for Acquiring and Locating the Faults on the Fiber Optic Cable Line. SCI-TECH Information Development & Economy 19(14), 193–194 (2009)
3. Bai, L.: Accurately Positioning of Optical Cable Circuit Obstacle. Journal of Hebei Energy Institute of Vocation and Technology 7(1), 78–79 (2007)
4. Tateda, M., Horiguchi, T.: Water penetration sensing using wavelength tunable OTDR. IEEE Photon. Technol. Lett. 3(1), 1–3 (1991)
5. Tani, Y., Sasaki, H., Kubota, Y., Watanabe, K.: Accuracy evaluation of a hetero-core splice fiber optic sensor. In: Proc. of SPIE, vol. 5952, pp. 59520L-1–59520L-8 (2005)
6. Cibula, E., Donlagic, D.: In-line short cavity Fabry-Perot strain sensor for quasi distributed measurement utilizing standard OTDR. Optics Express 15(14), 8719–8730 (2007)
7. Han, J.: Fault Position of Railway Optical Cable Line. Railway Signalling & Communication 43(10), 62–63 (2007)
8. Hao, G.: Using OTDR to Measure the Fault Point of Cabled Yarn. Telecom Engineering Technics and Standardization (7), 60–62 (2003)
9. Guo, Z.: Accurate location of fiber cable troubles by using Optical Time Domain Reflectometer and error analysis. Ningxia Engineering Technology 2(3), 274–276 (2003)
10. Hao, G.: Fault Location of Fiber Optic Link Using OTDR. Optical Fiber & Electric Cable and Their Applications (6), 36–38 (2004)
11. Duan, J., Liu, Q., Zhu, Y., Zhang, J.: A new way for fault location in fiber optic cable maintenance. Optical Fiber & Electric Cable and Their Applications (5), 40–43 (2000)
12. Zhong, Z., Wen, K., Wang, R.: Event Detection and Location in OTDR Data. Journal of PLA University of Science and Technology (Natural Science Edition) 5(5), 22–25 (2004)
13. Mei, L., Hu, S.: Maintenance of Long-distance Optical Fiber Cable. Tianjin Communications Technology (1), 48–51 (2002)
14. Zhao, Z., Huang, D., Mao, Q.: Optical Communication Engineering. People’s Posts & Telecom Press, Beijing (1998)
15. Yang, X.: Optical Fiber Communication Systems. National Defense Industry Press, Beijing (2000)
16. Su, H., Wu, L., Lu, Z., Wang, J.: The Application of GIS in Telecommunication and Research in Demand. Telecom. Science 18(2), 28–31 (2002)
17. Chen, W., Lu, J.: Application of GIS to Dynamic Resource Management of Local Telecommunication Network. Jour. of Geodesy and Geodynamics 27, 147–149 (2007)
18. Li, N., Guo, M.: The Application of GIS in the Optical Cable Fault Location. Laser Journal 26(4), 73–74 (2005)
19. Hou, G., Wang, J., Liu, J.: A Communication Network Management Information System Developed by Merging MIS and Geographic Information System. Transactions of Beijing Institute of Technology 24(4), 338–341 (2004)
20. Liu, X., Li, X., He, Y.: A Review of Application of GIS in Communications. Journal of Shanghai University (Natural Science Edition) 13(4), 389–393 (2007)
21. Wang, C., Yang, H.: Technical Research on Faults Location of Optical Cable. Electro-Optic Technology Application 20(2), 26–28 (2005)
22. Kong, F., Ju, T.: The Implementation of Telecom. Circuitry Management Based on GIS. Journal of Nanjing Univ. of Posts and Telecom. 21(2), 12–16 (2001)
23. Chai, Y., Tang, Y., Li, N., Dai, W.: Fault Check & Safeguard System of Optical Cable for Communication Based on GIS. Journal of Chongqing University 27(8), 65–68 (2004)
24. Guo, M., Li, N., Li, S., Chai, Y.: Intelligent Diagnosis Method of Optical Cable Based on GIS and OTDR. Journal of Chongqing University (Natural Science Edition) 28(7), 78–81 (2005)
25. Bai, X., Liu, S.: Design of Automatic Monitoring Optical Cable System. Telecommunications for Electric Power System 30(5), 20–23 (2009)
26. Xiong, N., Vasilakos, A.V., Yang, L.T., Yi, P., Wang, C.-X., Vandenberg, A.: Distributed Explicit Rate Schemes in Multi-input Multi-output Network Systems (to appear in IEEE T-SMC Part C)
27. Xiong, N., Jia, X., Yang, L.T., Vasilakos, A.V., Pan, Y., Li, Y.: A Distributed Efficient Flow Control Scheme for Multi-rate Multicast Networks. IEEE Transactions on Parallel and Distributed Systems (TPDS), TPDS-2008-10-0421
28. Xiong, N., Vasilakos, A.V., Yang, L.T., Song, L., Yi, P., Kannan, R., Li, Y.: Comparative Analysis of Quality of Service and Memory Usage for Adaptive Failure Detectors in Healthcare Systems. IEEE Journal on Selected Areas in Communications (JSAC) 27(4), 495–509 (2009)
A Parallel Distributed Algorithm for the Permutation Flow Shop Scheduling Problem

Samia Kouki1, Talel Ladhari2, and Mohamed Jemni1

1 Ecole Supérieure des Sciences et Techniques de Tunis, Research Laboratory UTIC, Tunis, Tunisia
[email protected], [email protected]
2 Ecole Supérieure des Sciences Economiques et Commerciales, Tunis, Tunisia
[email protected]
Abstract. This paper describes a new parallel Branch-and-Bound algorithm for solving the classical permutation flow shop scheduling problem, as well as its implementation on a cluster of six computers. The experimental study of our distributed parallel algorithm gives promising results and clearly shows the benefit of the parallel paradigm for solving large-scale instances in moderate CPU time.
1 Introduction

The Permutation Flow Shop Problem (PFSP) is one of the most widely studied scheduling problems in the literature. It is commonly used as a benchmark for testing new exact and heuristic algorithms, and it has become one of the most intensively investigated topics in combinatorial optimization and scheduling theory. This interest is motivated not only by its practical relevance, but also by its deceptive simplicity and challenging hardness. Even so, the PFSP is still considered a very hard nut to crack.

The PFSP can be defined as follows. Each job from the job set J = {1, 2, ..., n} has to be processed non-preemptively on m machines M1, M2, ..., Mm, in that order. The processing time of job j on machine Mi is pij. At any time, each machine can process at most one job and each job can be processed on at most one machine. The problem is to find a processing order of the n jobs, the same for each machine (i.e., passing is not allowed), such that the time Cmax at which all the jobs are completed (the makespan) is minimized. Using the notation specified in Pinedo [22], this problem is denoted F|prmu|Cmax.

The PFSP has attracted the attention of many researchers since the discovery of the well-known polynomial-time solution for F2||Cmax. In particular, during the last 30 years, the computational complexity of this combinatorial optimization problem has been clarified on the basis of the theory of NP-completeness, and it has been verified that the PFSP with three or more machines is NP-hard. This complexity result strongly suggests that an enumerative solution approach is essentially unavoidable in this case. The Branch and Bound (B&B) technique has proved to be one of the most powerful methods for solving NP-hard combinatorial problems exactly, and especially scheduling problems. Interestingly, one can notice that the PFSP has been one of the first
combinatorial optimization problems to which the B&B technique was applied, shortly after its development in 1960 by Land and Doig [2]. Following this pioneering work, several authors proposed sequential B&B algorithms. The first B&B algorithms for F|prmu|Cmax were developed simultaneously, but independently, by Ignall and Schrage [9] and Lomnicki [39]. The most significant contributions include McMahon and Burton [12], Carlier and Rebai [13], Cheng et al. [37] and Ladhari and Haouari [30]. All these algorithms, except the latter, can solve only instances of very limited size. In this context, and in connection with the previous work presented in [30], we propose in this paper a new parallel distributed version of a PFSP algorithm.
2 Notation of the PFSP

In this paper, we use the following traditional assumptions:

• All jobs are ready for processing at time zero.
• The machines are continuously available from time zero onwards.
• No pre-emption is allowed (that is, once the processing of a job on a machine has started, it must be completed without interruption).
• At any time, each machine can process at most one job, and each job can be processed on at most one machine.
• Only permutation schedules are allowed (i.e., all jobs have the same ordering sequence on all machines).
Using the notation specified in [22], this problem is denoted F|prmu|Cmax. For a given job sequence, we define the makespan as the time to complete the schedule. The PFSP is to find a job sequence with minimum makespan. More precisely, the PFSP can be stated mathematically as follows: given a permutation (sequence) σ = (σ(1), σ(2), ..., σ(n)) of the set J of jobs, we define the completion time C_{i,σ(j)} of job σ(j) on machine Mi as follows:

C_{1,σ(1)} = p_{1,σ(1)}
C_{i,σ(1)} = C_{i-1,σ(1)} + p_{i,σ(1)},  i = 2, ..., m
C_{1,σ(j)} = C_{1,σ(j-1)} + p_{1,σ(j)},  j = 2, ..., n
C_{i,σ(j)} = max(C_{i,σ(j-1)}, C_{i-1,σ(j)}) + p_{i,σ(j)},  i = 2, ..., m and j = 2, ..., n

The makespan is defined as Cmax(σ) = C_{m,σ(n)}. The PFSP is to find a permutation σ* in the set S of all permutations of the set {1, 2, ..., n} such that Cmax(σ*) = min_{σ∈S} Cmax(σ).
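This recurrence evaluates a given schedule in O(mn) time and O(m) space. The following is a minimal C sketch of it; the function name and the row-major array layout are ours, not the paper's:

#include <stdio.h>

/* Makespan Cmax(sigma) of a permutation for an m-machine, n-job PFSP.
 * p[i*n + j] is the processing time of job j on machine i (0-indexed);
 * sigma[k] is the job scheduled in position k.
 * C[i] holds the completion time of the last scheduled job on machine i,
 * which is all the recurrence needs. */
long makespan(int m, int n, const long *p, const int *sigma)
{
    long C[m];                          /* C99 variable-length array */
    for (int i = 0; i < m; i++) C[i] = 0;
    for (int k = 0; k < n; k++) {
        int j = sigma[k];
        C[0] += p[0 * n + j];           /* machine 1 never waits */
        for (int i = 1; i < m; i++)
            C[i] = (C[i] > C[i-1] ? C[i] : C[i-1]) + p[i * n + j];
    }
    return C[m-1];                      /* completion of the last job on Mm */
}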
The first study of the PFSP goes back to 1954, when Johnson published his seminal paper [29] showing that, for the two-machine case, the following property holds: job i precedes job j in an optimal sequence if min{p1i, p2j} ≤ min{p1j, p2i}.
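This property yields the standard O(n log n) procedure for F2||Cmax: schedule first, in nondecreasing order of p1j, the jobs with p1j < p2j, then the remaining jobs in nonincreasing order of p2j. A minimal C sketch of this well-known procedure (our rendering, not code from the paper):

#include <stdlib.h>

static const long *g_p1, *g_p2;        /* processing times on M1 and M2 */

static int cmp_front(const void *a, const void *b)   /* ascending p1 */
{
    long d = g_p1[*(const int *)a] - g_p1[*(const int *)b];
    return (d > 0) - (d < 0);
}
static int cmp_back(const void *a, const void *b)    /* descending p2 */
{
    long d = g_p2[*(const int *)b] - g_p2[*(const int *)a];
    return (d > 0) - (d < 0);
}

/* Johnson's algorithm for F2||Cmax: writes an optimal sequence into sigma. */
void johnson(int n, const long *p1, const long *p2, int *sigma)
{
    int *front = malloc(n * sizeof *front), nf = 0;
    int *back  = malloc(n * sizeof *back),  nb = 0;
    g_p1 = p1; g_p2 = p2;
    for (int j = 0; j < n; j++)
        if (p1[j] < p2[j]) front[nf++] = j; else back[nb++] = j;
    qsort(front, nf, sizeof *front, cmp_front);
    qsort(back,  nb, sizeof *back,  cmp_back);
    for (int k = 0; k < nf; k++) sigma[k] = front[k];
    for (int k = 0; k < nb; k++) sigma[nf + k] = back[k];
    free(front); free(back);
}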
A nice consequence of this result is that the two-machine flow shop problem can be solved in polynomial time [29]. However, for m ≥ 3, the PFSP is NP-hard. In this work, we consider the mono-objective case, which aims to minimize the overall completion time of all jobs, i.e., the makespan. The PFSP can be solved by two broad classes of methods: exact methods and heuristic methods. In this paper we are interested only in exact methods for solving the PFSP with the makespan criterion, using a parallel B&B algorithm.
3 The B&B Algorithm Solving the PFSP with Makespan Criterion

As noted in the introduction, among the sequential B&B algorithms proposed for F|prmu|Cmax, only that of Ladhari and Haouari [30] can solve instances beyond very limited sizes. The B&B method is presently the only way to solve the PFSP exactly and to give optimal solutions [9, 39]. Furthermore, as heuristics do not usually produce optimal solutions, the goal of the B&B algorithm is to solve a constrained optimization problem [17]. The principle of B&B is to make an implicit search through the space of all possible feasible solutions of the problem. Indeed, it is an intelligent search algorithm for finding a global optimum of problems of the form min f(x), x ∈ X. B&B is characterized by the three following basic components:
• a branching rule;
• a lower bounding scheme;
• a search strategy.

The following algorithm describes the ideas presented above:

1. LIST = {S};
2. UB := value of some heuristic solution; CurrentBest := heuristic solution;
3. While LIST ≠ Ø Do
4.   Choose a branching node k from LIST
5.   Remove k from LIST
6.   Generate the children child(i), i = 1, ..., nk, and compute the corresponding lower bounds LBi
7.   For i = 1 To nk Do
8.     If LBi < UB Then: if child(i) is a complete schedule, set UB := LBi and CurrentBest := child(i); otherwise insert child(i) into LIST (children with LBi ≥ UB are pruned)
9.   End For
10. End While
11. Return CurrentBest
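Rendered in C, the scheme above might look as follows. This is a sketch of ours, not the paper's implementation: recursion replaces the explicit LIST (giving a depth-first search, as adopted later in Section 5), and a simple bound, the prefix completion time plus the remaining work on the last machine, stands in for the stronger bounds of Section 5. All names are illustrative.

#include <limits.h>
#include <string.h>

#define MAXJ 32                    /* jobs (limited by the bitmask)   */
#define MAXM 16                    /* machines                        */

static int  g_m, g_n;
static const long *g_p;            /* row-major: g_p[i*g_n + j]       */
static long g_UB;                  /* incumbent (best makespan found) */
static int  g_cur[MAXJ], g_best[MAXJ];

/* Fix jobs position by position; C[i] is the prefix completion time on
 * machine i+1, rest_last is the total work of unscheduled jobs on the
 * last machine. A child is pruned when its bound reaches the incumbent. */
static void bb(int depth, const long *C, long rest_last, unsigned used)
{
    if (depth == g_n) {                        /* leaf: complete schedule */
        if (C[g_m-1] < g_UB) {
            g_UB = C[g_m-1];
            memcpy(g_best, g_cur, g_n * sizeof g_cur[0]);
        }
        return;
    }
    for (int j = 0; j < g_n; j++) {
        if (used & (1u << j)) continue;
        long Cc[MAXM];
        Cc[0] = C[0] + g_p[j];                 /* row 0 of p */
        for (int i = 1; i < g_m; i++)
            Cc[i] = (Cc[i-1] > C[i] ? Cc[i-1] : C[i]) + g_p[i*g_n + j];
        long rem = rest_last - g_p[(g_m-1)*g_n + j];
        if (Cc[g_m-1] + rem >= g_UB) continue; /* prune (valid lower bound) */
        g_cur[depth] = j;
        bb(depth + 1, Cc, rem, used | (1u << j));
    }
}

long solve(int m, int n, const long *p, int *best_out)
{
    long C[MAXM] = {0}, rest = 0;
    g_m = m; g_n = n; g_p = p;
    g_UB = LONG_MAX;                           /* or a heuristic UB */
    for (int j = 0; j < n; j++) rest += p[(m-1)*n + j];
    bb(0, C, rest, 0u);
    memcpy(best_out, g_best, n * sizeof *best_out);
    return g_UB;
}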
4 The State of the Art of Parallel B&B Algorithms

The parallelization of the B&B algorithm has been widely studied over the last two decades. For instance, an exhaustive survey of the state of the art of parallel B&B algorithms, covering the period from 1975 to 1994, was presented by B. Gendron and T. Crainic in [3]. After this period, in 1995, S. Okamoto, I. Wantanabe and H. Lizuka presented in [28] a parallel B&B algorithm for the flow shop problem, implemented on an nCUBE2 multiprocessor, a shared-memory parallel machine. In 2003, K. Aida et al. proposed in [16] a parallel algorithm dedicated to the resolution of the BMI Eigenvalue Problem with the hierarchical master-worker paradigm. In 2004, D. A. Bader proposed in [8] another algorithm devoted to shared-memory parallel machines; however, the author did not present experimental results dealing with the PFSP, but rather a general parallel B&B. D. Caromel et al. [7] presented a framework for using the B&B algorithm on grid computers; this framework uses a hierarchical master-worker strategy, and the flow shop problem was used to test it. Note that the authors used randomly generated instances rather than known benchmarks. Moreover, many other frameworks for parallel B&B algorithms have been designed to solve combinatorial optimization problems; we present some examples. PUBB [38] is a framework that proposes a C interface and allows B&B algorithms to be parallelized for any specific combinatorial optimization problem. The Bob++ library is a set of C++ classes; its goal is to allow the implementation of sequential and parallel search algorithms (Branch and X, dynamic programming, etc.) to solve a
specific combinatorial optimization problem. In [31], the authors presented many other frameworks dealing with the parallel B&B method in shared and distributed environments, such as PPBB-Lib, MallBa and ZRAM; these frameworks use a C interface. Other frameworks, such as ALPS/BiCePS, PICO [6] and MW, propose a C++ interface. In [1], the authors propose a P2P design and implementation of a parallel B&B algorithm on the ProActive grid middleware; they applied it to the flow shop scheduling problem and experimented on a computational pool of 1500 CPUs from the GRID'5000 nation-wide experimental grid. Note that the authors used benchmarks other than those used in our study. In [5], Benjamin W. Wah and Y. W. Ma proposed a parallel machine for processing nondeterministic polynomial-complete problems and evaluated their system on the vertex covering problem. In [7], D. Caromel, A. di Costanzo, L. Baduel and S. Matsuoka proposed in 2009 a framework built over the master-worker approach that helps programmers distribute problems over grids; they performed their experiments on the flow shop scheduling problem with randomly generated instances. In [15], J. Lemesre, C. Dhaenens and E. G. Talbi propose an exact parallel B&B algorithm for the bi-objective PFSP, where the optimal solutions are found using the Pareto approach and represent a set of best-compromise solutions. In [21], M. Mezmaz, N. Melab and E.-G. Talbi present a parallel B&B algorithm using Grid'5000; they solved Taillard's benchmark instance tail056 in 25 days. Although this may be considered a promising result, further effort is needed to reduce the running time, which is still huge.

Unlike the work cited above, in this paper we deal specifically with the PFSP. To our knowledge, the parallelization of this particular problem has not been studied in its generality, and only a few works have been carried out to solve particular instances. In this context, we propose a new parallel distributed algorithm for the PFSP with the makespan criterion. We are interested, in particular, in the well-known benchmarks of Taillard [11], which include some hard instances not yet solved; this constitutes the originality of our work. We highlight here that some instances of the PFSP [26] have still not been solved, either with sequential algorithms or with parallel ones. Indeed, up to the mid-1990s, the best available branch-and-bound algorithms experienced difficulty in solving instances with 15 jobs and 4 machines [10]. In this context, ETSI and INRIA at Sophia Antipolis, France, organized in 2005 the 2nd Grid PLUGTESTS Flow Shop Challenge Contest [36], focused on solving the well-known benchmark problems of Taillard [11]. In 2007, another challenge was launched to solve some instances of combinatorial optimization problems, including the PFSP, and a call for competition was announced on the web [34].
5 Our Parallel Algorithm

5.1 Introduction

The use of parallel machines makes it possible to reach optimal solutions in less time, as well as to increase the size of the problems that can be solved. Several factors motivate the use of parallel computing for combinatorial optimization problems. Indeed, finding an optimal solution can be very difficult, and sometimes impossible, on a single processor.

5.2 The Parallel Algorithm

Our algorithm is based mainly on the classical serial B&B algorithm applied to the PFSP. In this paper we focus on the one-machine-based lower bounds for the PFSP. The principle of this method can be described as follows. Consider an instance of the PFSP. If we relax the constraint that each machine can process at most one job at a time for all machines but one, say Mk (k = 1, ..., m), then a relaxation of F|prmu|Cmax is obtained by setting, for all j ∈ J, a release date rj = Σ_{i=1}^{k-1} p_{ij} (or rj = 0 if k = 1) and a delivery time qj = Σ_{i=k+1}^{m} p_{ij} (or qj = 0 if k = m). The resulting relaxation is a one-machine problem with release dates (heads) and delivery times (tails), denoted 1|rj,qj|Cmax; this latter problem is strongly NP-hard [14]. A further relaxation of 1|rj,qj|Cmax is obtained by setting the release dates and delivery times of all jobs j ∈ J to min_{j∈J} rj and min_{j∈J} qj, respectively. This yields for each machine Mk (k = 1, ..., m) a lower bound

LBk = min_{j∈J} rj + Σ_{j∈J} p_{kj} + min_{j∈J} qj .    (1)

Hence, a valid lower bound for F|prmu|Cmax is

LB = max_{1≤k≤m} LBk .    (2)
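Computed naively, as in the C sketch below, this bound costs O(nm) per machine and O(nm^2) overall; the function name and array layout are ours:

/* One-machine bound of Eqs. (1)-(2): for each machine k, every job needs
 * at least the minimum head r_j before Mk, all the work on Mk, and the
 * minimum tail q_j after Mk. p is row-major: p[i*n + j]. */
long one_machine_lb(int m, int n, const long *p)
{
    long LB = 0;
    for (int k = 0; k < m; k++) {
        long rmin = -1, qmin = -1, work = 0;
        for (int j = 0; j < n; j++) {
            long r = 0, q = 0;
            for (int i = 0; i < k; i++)   r += p[i*n + j];  /* head r_j */
            for (int i = k+1; i < m; i++) q += p[i*n + j];  /* tail q_j */
            if (rmin < 0 || r < rmin) rmin = r;
            if (qmin < 0 || q < qmin) qmin = q;
            work += p[k*n + j];
        }
        long LBk = rmin + work + qmin;                      /* Eq. (1) */
        if (LBk > LB) LB = LBk;                             /* Eq. (2) */
    }
    return LB;
}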
The first branching sequences a job in the first available position of the schedule, while the second branching sequences a job in the last available position of the schedule. In our distributed algorithm we establish a parallelization strategy that consists in distributing the search tree among all processors, and we adopt the depth-first search strategy. The parallelization approach of our algorithm can be described as follows:

1. Having initialized the list of nodes, the master processor distributes the tree among all processors.
2. Every processor receives the portion of data corresponding to the subtree that it has to treat.
3. All processors work in parallel to explore their local data by applying a local B&B algorithm.
4. Every processor, when finishing its treatment, sends the local optimal solution it finds to the master.
5. The master, after receiving the results from all the processors, computes the global optimal solution.

Our distributed algorithm can be parameterized to use as many processors as the number of jobs (or fewer). If the number of available processors is less than the number of jobs, more than one job can be assigned to a processor. A sketch of this master/worker scheme is given below.
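Steps 1-5 map naturally onto an MPI program. The sketch below is our minimal rendering, not the authors' code: the subtree partitioning and the serial B&B are reduced to a placeholder function, and a simple cyclic split stands in for the paper's job-to-processor assignment.

#include <limits.h>
#include <mpi.h>
#include <stdio.h>

/* Placeholder for the serial B&B of Section 3, run on the subtree whose
 * first scheduled job is root_job; returns the best makespan found. */
long local_branch_and_bound(int root_job)
{
    return 1000 + root_job;   /* dummy value, for illustration only */
}

int main(int argc, char **argv)
{
    int rank, size, n_jobs = 20;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Steps 1-2: the search tree is split by the job fixed in the first
     * position; each process takes the subtrees assigned to it (cyclic
     * split when there are fewer processors than jobs). */
    long local_best = LONG_MAX;
    for (int j = rank; j < n_jobs; j += size) {
        long v = local_branch_and_bound(j);   /* Step 3: local B&B */
        if (v < local_best) local_best = v;
    }

    /* Steps 4-5: the master collects the local optima and keeps the
     * global minimum. */
    long global_best;
    MPI_Reduce(&local_best, &global_best, 1, MPI_LONG, MPI_MIN, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("global optimal makespan: %ld\n", global_best);
    MPI_Finalize();
    return 0;
}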
6 Experimental Results

We implemented our algorithm in C and compiled it with Microsoft Visual C++ 2010. We used the MPI (Message Passing Interface) library to ensure communication between processors. All the computational results were carried out on a cluster of six Intel Pentium IV 3.2 GHz computers with 256 MB of RAM each, connected by a 100 Mb/s Ethernet network. In order to validate the efficiency of our algorithm in practice, we performed our experimental tests on a well-known set of benchmark instances from the literature [10] (i.e., the benchmarks of Taillard [11], which contain some unsolved instances). These instances are known to be very hard, and about a decade after their publication many of them are still open (instances with 20 machines and some others with 10 machines).

Table 1 presents the sequential and parallel running times, as well as the speedups we obtained using 2, 4 and 6 processors. The speedup is a criterion used to evaluate the improvement in time when more than one processor is used. Let T(p) denote the execution time of an application on p ≥ 1 processors; the speedup is defined as S(p) = T(1)/T(p). There are several possibilities for measuring T(1); in all our experiments we take T(1) to be the time required by the parallel algorithm running on one processor of the cluster.

Our methodology for the experimental study was the following: we first ran a set of measurements of our parallel program on the 50X10 instances (50 jobs and 10 machines), varying the number of processors used (2, 4 and 6). We then changed the number of jobs from 50 to 100 and then to 200, and ran a new set of measurements for each instance. The analysis of the measurements allows us to conclude the following:

- The experimental study clearly shows that parallelization of the B&B method significantly improved the running time of the sequential version of our program.
- Our algorithm scales well, since for all measurements the speedup increases with the number of processors used. In general, communication and the intrinsically serial parts of the program prevent the speedup from reaching p when p processors are used; even so, the obtained speedups are good overall.
- In some cases, the speedup reached values greater than the number of processors. The obtained results indicate that the number of generated subproblems decreases significantly when the number of processors increases; therefore, the speedup on p processors sometimes exceeds p.
Table 1. Sequential and parallel running times (ms) and speedups for the 50X10, 100X10 and 200X10 instances

Instance  | T(1)  | T(2) | T(4)  | T(6) | S(2) | S(4) | S(6)
----------|-------|------|-------|------|------|------|------
50X10 instances
tail041   | 40    | 23   | 16    | 8    | 1.3  | 2.86 | 5
tail042   | 31    | 25   | 9     | 7    | 2.42 | 3.96 | 4.75
tail043   | ~0    | ~0   | ~0    | ~0   | ---  | ---  | ---
tail044   | ~0    | ~0   | ~0    | ~0   | ---  | ---  | ---
tail045   | 47    | 27   | 13    | 11   | 1.7  | 3.4  | 4.23
tail046   | ~0    | ~0   | ~0    | ~0   | ---  | ---  | ---
tail047   | ~0    | ~0   | ~0    | ~0   | ---  | ---  | ---
tail048   | 33    | 11   | 7.5   | 6.8  | 3    | 3.01 | 4.85
tail049   | 31    | 16   | 7.5   | 6.8  | 1.93 | 4    | 4.5
tail050   | 31    | 22   | 11.8  | 7.3  | 1.37 | 2.61 | 4.24
100X10 instances
tail071   | 265   | 143  | 75    | 60   | 1.85 | 3.53 | 4.41
tail072   | 15    | 8    | 6     | 3    | 1.87 | 2.5  | 5
tail073   | 250   | 147  | 70    | 50   | 1.7  | 3.57 | 5
tail074   | ~0    | ~0   | ~0    | ~0   | ---  | ---  | ---
tail075   | ~0    | ~0   | ~0    | ~0   | ---  | ---  | ---
tail076   | 265   | 129  | 68    | 58   | 2.05 | 3.89 | 4.5
tail077   | 312   | 230  | 80    | 67   | 1.35 | 3.9  | 4.65
tail078   | 234   | 119  | 60    | 47   | 1.96 | 3.9  | 4.97
tail079   | ~0    | ~0   | ~0    | ~0   | ---  | ---  | ---
tail080   | 2059  | 935  | 680   | 420  | 2.2  | 3.02 | 4.90
200X10 instances
tail091   | ---   | ---  | ---   | ---  | ---  | ---  | ---
tail092   | 2560  | 1400 | 750   | 560  | 1.8  | 3.41 | 4.57
tail093   | 30    | 20   | 10    | 6    | 1.5  | 3    | 5
tail094   | ---   | ---  | ---   | ---  | ---  | ---  | ---
tail095   | 2137  | 1200 | 577.5 | 380  | 1.7  | 3.7  | 5.5
tail096   | ---   | ---  | ---   | ---  | ---  | ---  | ---
tail097   | ---   | ---  | ---   | ---  | ---  | ---  | ---
tail098   | 15050 | 6940 | 3860  | 2870 | 2.1  | 3.89 | 5.23
tail099   | 1670  | 900  | 535   | 350  | 1.85 | 3.12 | 4.77
tail0100  | 5800  | 3680 | 1480  | 1100 | 1.57 | 3.9  | 5.272
7 Conclusion and Perspectives

In this paper, we proposed a new distributed parallel algorithm for the PFSP. Our parallel version was implemented and validated. This allowed us to prove, first, the feasibility of our solution and, second, the interest of parallelization for this type of problem. This interest is reflected both in the computing time gained through faster execution and in the ability to solve instances unsolved on sequential machines. Furthermore, although we have not used a large number of processors, our experimental study showed the practical efficiency of our parallelization strategy, since we reached good speedups using an available cluster of six machines. In future work, we will use parallel machines with a large number of processors, as well as improve our algorithm by incorporating a new technique to balance the load among the different processors of the parallel machine.
References

1. Bendjoudi, A., Melab, N., Talbi, E.-G.: P2P design and implementation of a parallel B&B algorithm for grids. International Journal of Grid and Utility Computing 1, 159–168 (2009)
2. Land, A.H., Doig, A.G.: An automatic method for solving discrete programming problems. Econometrica 28, 497–520 (1960)
3. Gendron, B., Crainic, T.G.: Parallel B&B algorithms: survey and synthesis. Operations Research 42(6), 1042–1066 (1994)
4. Le Cun, B., Crainic, T.G., Roucairol, C.: Parallel Branch-and-Bound algorithms. In: Parallel Combinatorial Optimization. John Wiley & Sons, Chichester (2006)
5. Wah, B.W., Ma, Y.W.: MANIP: a parallel computer system for implementing B&B algorithms. In: International Symposium on Computer Architecture, pp. 239–262 (1981)
6. Phillips, C.A., Eckstein, J., Hart, W.E.: PICO: An object-oriented framework for parallel branch-and-bound. Technical report, RUTCOR Research Report (2000)
7. Caromel, D., Di Costanzo, A., Baduel, L., Matsuoka, S.: Grid'BnB: A parallel B&B framework for grids. In: International Conference on High Performance Computing, HiPC (2007)
8. Bader, D.A.: Parallel algorithm design for branch and bound. International Series in Operations Research & Management Science 76, 5-1–5-44 (2004)
9. Ignall, E., Schrage, L.E.: Application of the branch-and-bound technique to some flow shop problems. Operations Research 13, 400–412 (1965)
10. Anderson, E.J., Glass, C.A., Potts, C.N.: Machine scheduling. In: Local Search in Combinatorial Optimization, pp. 361–414. John Wiley and Sons, Chichester
11. Taillard, E.: Benchmarks for basic scheduling problems. European Journal of Operational Research 64, 278–285 (1993)
12. McMahon, G.B., Burton, P.G.: Flow-shop scheduling with the branch-and-bound method. Operations Research 15(3), 473–481 (1967)
13. Carlier, J., Rebai, I.: Two branch-and-bound algorithms for the permutation flow shop problem. European Journal of Operational Research 90(2), 238–251 (1996)
14. Lenstra, J.K., Rinnooy Kan, A.H.G., Brucker, P.: Complexity of machine scheduling problems. Annals of Discrete Mathematics 1, 343–362 (1977)
15. Lemesre, J., Dhaenens, C., Talbi, E.G.: An exact parallel method for a bi-objective permutation flowshop problem. European Journal of Operational Research 177(3), 1641–1655 (2007)
16. Aida, K., Natsume, W., Futakata, Y.: Distributed computing with hierarchical master-worker paradigm for parallel branch and bound algorithm. In: CCGrid 2003, 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, pp. 156–163 (2003)
17. Mitten, L.: Branch-and-bound methods: general formulation and properties. Operations Research 18, 24–34 (1970)
18. Haouari, M., Ladhari, T.: A branch-and-bound-based local search method for the flow shop problem. The Journal of the Operational Research Society 54(10), 1076–1084 (2003)
19. Haouari, M., Ladhari, T.: Minimising maximum lateness in a two-machine flowshop. The Journal of the Operational Research Society 51(9), 1100–1106 (2000)
20. Mezmaz, M., Melab, N., Talbi, E.-G.: B&B@Grid: une approche efficace pour la gridification d'un algorithme Branch and Bound. INRIA Research Report RR-6937 (May 2009)
21. Mezmaz, M., Melab, N., Talbi, E.-G.: A grid-enabled B&B algorithm for solving challenging combinatorial optimization problems. In: IEEE International Parallel and Distributed Processing Symposium, March 2007, pp. 1–9 (2007)
22. Pinedo, M.: Scheduling: Theory, Algorithms, and Systems. Prentice-Hall, Englewood Cliffs (1995)
23. Garey, M.R., Johnson, D.S., Sethi, R.: The complexity of flowshop and jobshop scheduling. Mathematics of Operations Research 1(2), 117–129 (1976)
24. Santoro, N.: Design and Analysis of Distributed Algorithms. Wiley, Chichester (2006)
25. Kacsuk, P., Fahringer, T., Németh, Z.: Distributed and Parallel Systems: From Cluster to Grid Computing. Springer, New York (2007)
26. Bellman, R., Esogbue, A.O., Nabeshima, I.: Mathematical Aspects of Scheduling and Applications, p. 202. Pergamon Press, Oxford (1982)
27. Čiegis, R., Baravykaite, M.: Implementation of a black-box global optimization algorithm with a parallel B&B template. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 1115–1125. Springer, Heidelberg (2007)
28. Okamoto, S., Wantanabe, I., Lizuka, H.: A new parallel algorithm for the n-job, m-machine flow-shop scheduling problem. Systems and Computers in Japan 26(2) (1995)
29. Johnson, S.M.: Optimal two- and three-stage production schedules with setup times included. Naval Research Logistics Quarterly 1, 61–68 (1954)
30. Ladhari, T., Haouari, M.: A computational study of the PFSP based on a tight lower bound. Computers & Operations Research 32, 1831–1847 (2005)
31. Crainic, T.G.: Parallel branch-and-bound algorithms: survey and synthesis. Operations Research 42(6), 1042–1066 (1994)
32. Bozejko, W.: Solving the flow shop problem by parallel programming. Journal of Parallel and Distributed Computing 69(5), 470–481 (2009)
33. Yu, W., Hoogeveen, H., Lenstra, J.K.: Minimizing makespan in a two-machine flowshop with delays and unit-time operations is NP-hard. Journal of Scheduling 7, 333–348 (2004)
34. http://www2.lifl.fr/~talbi/challenge2007/
35. http://www.utic.rnu.tn
36. http://wwwsop.inria.fr/oasis/plugtest2005/2ndGridPlugtestsReport/
37. Wang, X., Cheng, T.C.E.: Two-machine flowshop scheduling with job class setups to minimize total flowtime. Computers and Operations Research 32(11), 2751–2770 (2005)
38. Shinano, Y., Higaki, M., Hirabayashi, R.: A generalized utility for parallel branch and bound algorithms. In: 7th IEEE Symposium on Parallel and Distributed Processing (October 1995)
39. Lomnicki, Z.: A branch-and-bound algorithm for the exact solution of the three-machine scheduling problem. Operational Research Quarterly, 89–105 (1965)
A Self-Adaptive Load Balancing Strategy for P2P Grids

Po-Jung Huang1, You-Fu Yu1, Quan-Jie Chen1, Tian-Liang Huang1, Kuan-Chou Lai1, and Kuan-Ching Li2

1 Department of Computer and Information Science, National Taichung University, Taichung, Taiwan, R.O.C.
{bcs097107,bcs097103,bcs098112,bcs098105}@ms3.ntcu.edu.tw, [email protected]
2 Department of Computer Science and Information Engineering, Providence University, Taichung, Taiwan, R.O.C.
[email protected]
Abstract. The grid computing system has become a promising platform for high-performance computing, and grid systems now integrate P2P technology to enhance the efficiency of distributed computing. Load balancing is one of the most important issues in P2P grid systems, as is the efficiency of resource utilization. This study proposes a Self-Adaptive Load Balancing (SALB) strategy to improve the efficiency of load balancing. SALB selects appropriate neighbors according to the Small World Theory and then transfers jobs to these neighbors in order to distribute load. Experimental results show that the proposed algorithm performs well and that the SALB strategy can improve resource utilization and shorten the job completion time.

Keywords: P2P, Grid, Load balancing, Neighbor sites, Self-Adaptive.
1 Introduction

Advances in communication technology have made distributed computing environments a reality. The grid computing system is a high-performance distributed computing system and has become a promising platform that integrates and shares distributed resources for high-performance computing. Peer-to-Peer (P2P) computing is another important distributed computing architecture. A P2P system is composed of peers that make a portion of their resources directly available to other peers; peers are both suppliers and consumers of resources. Due to the demand for massive distributed computing and efficient data transmission, grid systems integrate P2P technology to support high-performance distributed computing.

To fully exploit P2P grid computing systems, load balancing is one of the key issues in achieving high performance. Load balancing aims to distribute the workload among computing sites so as to maximize resource utilization and minimize job execution time. Load balancing strategies can be categorized as static or dynamic. Static load balancing strategies are easy to implement, but they usually cannot obtain optimal solutions. On the contrary, dynamic approaches make appropriate
resource allocations at runtime to obtain better performance. However, a dynamic approach needs to collect dynamic information to make optimal decisions. This paper presents a distributed dynamic load balancing strategy, named the Self-Adaptive Load Balancing (SALB) strategy. The SALB approach consists of two main phases: the neighbor-selecting phase and the job-transferring phase. In the first phase, SALB chooses sites as neighbor sites according to the Small World Theory and the neighbor-selecting index (NSI). After the neighbors of a site are determined, each site collects only its neighbors' information, in a P2P manner, to minimize the information-gathering overhead. Our proposed strategy derives a relative load index (RLI) from the states of the site and its neighbors, and an absolute load index (ALI) from the state of the site itself. In the second phase, once the relative and absolute load indices of each site have been computed, each site determines its load balancing state according to RLI and ALI. If the site's state is "sendable", the load balancing mechanism is triggered, and SALB transfers jobs to the neighbor site with the minimal job turnaround time (JTT). Experimental results show that SALB can maximize resource utilization, shorten the job turnaround time and achieve load balancing.

The rest of this paper is organized as follows. Section 2 describes related work, and Section 3 presents the information system and the load balancing strategy. Experimental results are shown in Section 4. Section 5 concludes with the results and future work.
2 Related Work

The grid computing system is a distributed computing system that integrates distributed resources, which are shared by means of middleware. Many middleware systems have been proposed, such as gLite [5], Globus Toolkit [6] and UNICORE [15]. In grid middleware, one of the most important issues is how to schedule jobs and balance load; recent studies also consider the performance effect of system heterogeneity. Some projects, for example Jalapeno [14] and JNGI [8], combine P2P technology with grid systems. In this study, we adopt JXTA [10] to implement P2P functions for discovering computing resources in UniGrid [16]. JXTA supports the following P2P services: the peer discovery protocol, peer information protocol, peer resolver protocol, peer endpoint protocol, pipe binding protocol and rendezvous protocol.

One of the important issues in P2P grid systems is load balancing. Load balancing strategies can be categorized as dynamic or static. In general, a static load balancing strategy [18] uses prior information, such as the execution rate of each node, to distribute load. On the contrary, a dynamic load balancing strategy [19], [20], [21] uses system information to make decisions at run time. For example, JRT [19] uses node performance and workload information to choose the optimal site, without considering job heterogeneity in the grid system. This paper focuses on dynamic distributed load balancing with limited information to achieve optimal performance. Similar approaches include RESERV [4] and JRT [19].
Our proposed SALB approach evaluates the site status and the job turnaround time to make decisions; therefore, utilization can be improved and the job turnaround time reduced.
3 Load Balancing Strategy

This section presents the Self-Adaptive Load Balancing (SALB) strategy for P2P grid systems. In P2P grid systems, each site consists of a super node and several general nodes. Super nodes exchange site information with each other and manage the resources and jobs of the general nodes. In this study, SALB applies the sender-initiated strategy, which migrates jobs from overloaded nodes to lightly-loaded computing nodes in order to reduce the load of the overloaded nodes. In the SALB strategy, the super node is responsible for periodically collecting resource information in the local site, for exchanging information with other super nodes, and for transferring jobs to neighbor sites if the site is overloaded. The general nodes are responsible for executing jobs and for supplying site statuses to the super node.

We have implemented a prototype of our proposed SALB strategy, which follows the OGSA standard. The components of the load balancing strategy are organized in a layered architecture consisting of an upper layer and a lower layer. The upper layer builds on the services offered by the lower layer, in addition to interacting and cooperating with components on the same level. The upper layer consists of resource monitoring, load balancing and job migration; the lower layer consists of the configuration service, file transfer, information service and execution management.

3.1 Neighbor-Selecting Phase

The SALB approach consists of two main phases: the neighbor-selecting phase and the job-transferring phase. Before introducing the first phase, some terms are defined. Let t denote the time interval for re-selecting neighbors. Let rk(i) denote the amount of remaining resource k in node i, where k = 1, ..., m and m is the number of resource types in the P2P grid system. Let w denote the weight of resources in the nodes, with wk(i) the weight of resource k in node i. Let u denote the resource usage (%), with uk(i) the usage of resource k in node i, k = 1, ..., m. We define the neighbor-selecting index (NSI) as in Eq. (1); NSI is the index used for selecting neighbors. Following the Small World Theory [12] (six degrees of separation), we define Nr as the number of neighbors for each site, and take the minimal Nr such that Nr^6 > N, where N is the total number of sites. Let NSI be the difference between the local site i and site j; then
NSI(i,j) = w1(L) * r1(i)/r1(j) + ... + wk(L) * rk(i)/rk(j) .    (1)

When a node joins the P2P grid system, the proposed strategy randomly selects 2Nr sites as neighbor candidates and chooses the Nr sites with the smallest NSI as the neighboring sites.
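A sketch of the two neighbor-selecting quantities follows; the function names are ours. nsi() follows Eq. (1), and num_neighbors() returns the smallest Nr with Nr^6 > N. For example, with N = 100 sites it returns 3, since 2^6 = 64 ≤ 100 < 3^6 = 729.

#include <math.h>

/* Eq. (1): weighted sum of ratios of remaining resources between the
 * local site i and a candidate site j, over m resource types. */
double nsi(int m, const double *w, const double *r_i, const double *r_j)
{
    double s = 0.0;
    for (int k = 0; k < m; k++)
        s += w[k] * (r_i[k] / r_j[k]);   /* assumes r_j[k] > 0 */
    return s;
}

/* Smallest Nr such that Nr^6 > N (Small World Theory). */
int num_neighbors(long N)
{
    int Nr = 1;
    while (pow((double)Nr, 6.0) <= (double)N) Nr++;
    return Nr;
}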
In this study, we define un(i), un(i,Avg), un(i,Max) and un(i,Min) as follows. Let un(i) denote the utilization of resource n at site i; un(i,Avg), un(i,Max) and un(i,Min) denote, respectively, the average, maximum and minimum utilization of resource n among site i and its neighbors. We also define the relative load index (RLI) in Eq. (2) and the absolute load index (ALI) in Eq. (3). Let RLI be the load index between site L and its neighbors, and let ALI be the load index of site L itself; then

RLI(i) = ( w1(L) * (u1(i) - u1(i,Avg)) / (u1(i,Max) - u1(i,Min)) + ... + wk(L) * (uk(i) - uk(i,Avg)) / (uk(i,Max) - uk(i,Min)) ) * 100% ,    (2)

and

ALI(i) = w1(L) * u1(i) + ... + wk(L) * uk(i) .    (3)
Table 1. States of RLI and ALI

If RLI ≥ THR        | State of RLI: High
If THR > RLI > TLR  | State of RLI: Medium
If RLI ≤ TLR        | State of RLI: Low

If ALI ≥ THA        | State of ALI: High
If THA > ALI > TLA  | State of ALI: Medium
If ALI ≤ TLA        | State of ALI: Low
Table 2. Different combinations of a site's status

RLI status | ALI status | Site's status | Neighbor's status
-----------|------------|---------------|------------------
Low        | Low        | UnSendable    | Receivable
Low        | Medium     | UnSendable    | Receivable
Low        | High       | Sendable      | Unreceivable
Medium     | Low        | UnSendable    | Receivable
Medium     | Medium     | UnSendable    | Receivable
Medium     | High       | Sendable      | Unreceivable
High       | Low        | Sendable      | Unreceivable
High       | Medium     | Sendable      | Unreceivable
High       | High       | Sendable      | Unreceivable
In this paper, the SALB strategy pre-defines the threshold of high RLI (THR), the threshold of low RLI (TLR), the threshold of high ALI (THA) and the threshold of low ALI (TLA). Table 1 shows the relationship between THR, TLR, THA and TLA, and Table 2 shows the resulting combinations of site states. The state of a site can be "sendable" or "unsendable". The "sendable" state indicates that the site is overloaded and must transfer jobs to other sites; the "unsendable" state indicates that the site does not need to transfer jobs. The state of a neighbor site can be "receivable" or "unreceivable": a "receivable" site can receive jobs from other sites, while an "unreceivable" site is overloaded and cannot receive jobs.
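In code, the indices of Eqs. (2)-(3) and the threshold tests of Tables 1-2 reduce to a few lines. The sketch below uses the threshold values given later in Section 4; function names are ours:

typedef enum { LOW, MEDIUM, HIGH } Level;

/* Eq. (2): weighted, normalized deviation from the neighborhood average,
 * in percent. avg/max/min are per-resource statistics over the site and
 * its neighbors. */
double rli(int m, const double *w, const double *u,
           const double *avg, const double *max, const double *min)
{
    double s = 0.0;
    for (int k = 0; k < m; k++)
        s += w[k] * (u[k] - avg[k]) / (max[k] - min[k]);
    return s * 100.0;
}

/* Eq. (3): weighted utilization of the site itself. */
double ali(int m, const double *w, const double *u)
{
    double s = 0.0;
    for (int k = 0; k < m; k++) s += w[k] * u[k];
    return s;
}

static Level level(double x, double t_high, double t_low)
{
    return x >= t_high ? HIGH : (x > t_low ? MEDIUM : LOW);
}

/* Table 2 collapses to: a site is sendable iff its RLI or its ALI is
 * High, and it is receivable for its neighbors exactly when it is not
 * sendable. THR = 20, TLR = -20, THA = 60, TLA = 40 (Section 4). */
int is_sendable(double RLI, double ALI)
{
    return level(RLI, 20.0, -20.0) == HIGH || level(ALI, 60.0, 40.0) == HIGH;
}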
If the state of a site is "sendable", the job-transferring phase is triggered; if its state is "unsendable", the site does nothing.

3.2 Job-Transferring Phase

Before describing this phase, we define the terms used in it: RJ_time, IJ_time, JRT, J_time, T_time and JTT. Let RJ_time be the sum of the forecasted remaining times to complete all executing jobs, and Remain_time(s, Rk) be the forecasted remaining time to complete the executing job Rk at site s; then
RJ_time = Σ_{k=1}^{m} Remain_time(s, Rk) .    (4)

Let IJ_time be the sum of the execution times of all waiting jobs, and Execution_time(s, Sk) be the execution time of job Sk at site s; then

IJ_time = Σ_{k=1}^{n} Execution_time(s, Sk) .    (5)

Therefore, the job response time, JRT, is defined as

JRT = RJ_time + IJ_time .    (6)

Let J_time be the execution time of job J when migrated to site s:

J_time = Execution_time(s, J) .    (7)

Let T_time be the migration time for job J:

T_time = (Data size + File size) / Bandwidth .    (8)

We define the job turnaround time (JTT) as

JTT = (JRT + J_time + T_time) / CPU_Number .    (9)
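Eqs. (4)-(9) combine into a single scoring function; a sketch (the struct and field names are illustrative, not from the paper):

typedef struct {
    double rj_time;     /* Eq. (4): remaining time of executing jobs */
    double ij_time;     /* Eq. (5): execution time of waiting jobs   */
    int    cpu_number;  /* CPUs of the candidate site                */
    double bandwidth;   /* link bandwidth to the candidate site      */
} Site;

double jtt(const Site *s, double j_time, double data_size, double file_size)
{
    double jrt    = s->rj_time + s->ij_time;                 /* Eq. (6) */
    double t_time = (data_size + file_size) / s->bandwidth;  /* Eq. (8) */
    return (jrt + j_time + t_time) / s->cpu_number;          /* Eq. (9) */
}

The job-transferring phase then simply evaluates jtt() for every receivable neighbor and migrates the job to the minimizer.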
In this phase, if one site’s state is “unsendable”, the SALB strategy doses nothing. If one site’s state is “sendable”, the SALB strategy collects the necessary information, and chooses the neighbor site with the “receivable” state and with the minimal JTT to be the candidate site for job migration. In this phase, when the candidate site is the site itself, the job migration is ignored. When the candidate site is not itself, the SALB suspends the jobs which are to be migrated to the candidate site, and then migrates these jobs to the candidate site. The migrated job re-starts execution in the candidate site and then the original suspended job is deleted. When there is no neighboring site with smaller JTTs, SALB tries to find out the site with smaller JTTs in the non-neighboring sites. If such neighbor sites exist, SALB migrates the job to such a site with the minimal JTT; otherwise, the migration process is ignored. The following pseudocode shows the SALB algorithm, as shown in Fig. 1.
Algorithm SALB
Input:  LocalSite LS, Job j, total number of sites N, time interval T
Output: the optimal execution site of job j

Main() {
  LS's neighbors = Find_Neighbors(LS, N)
  time_last = time_now
  while (LS.state == sendable) {
    if ((time_now - time_last) > T) {
      LS's neighbors = Find_Neighbors(LS, N)
      time_last = time_now
    }
    Find the last idle job j in the local job queue
    if (the JTT of job j at the local site > the JTT of job j at some of LS's neighbors) {
      Find the neighbor with the minimal JTT to be the candidate
      Migrate job j to the candidate site
    } else {
      if (the JTT of job j at the local site > the JTT of job j at some non-neighbor) {
        Find the site with the minimal JTT to be the candidate
        Migrate job j to the candidate site
      }
    }
  }
  Return the neighbor site with the minimal JTT
}

Find_Neighbors(Site LS, number N) {
  Get the minimal Nr such that Nr^6 > N
  Randomly find at most 2*Nr sites
  Sort these 2*Nr sites according to NSI
  Return the first Nr sites with the smallest NSI
}
Fig. 1. SALB algorithm
4 Experimental Results

This study adopts JXTA 2.5.1, Java 1.6.0 and Condor 6.7.20 in the experimental environment to implement the SALB strategy on the Taiwan UniGrid for performance evaluation. To simplify the experimental environment, this study assumes that there are five sites and that each site consists of one super node and some general nodes; however, our approach can be extended to more complex systems. Table 3 shows the specification of our experimental platform; host222 is in charge of job submission. After jobs are submitted from host222, the super node of the site receiving them either allocates the general nodes in the same site to execute the jobs or determines whether load balancing is needed.
Table 3. System specification

Site | Host    | Peer Type    | CPU Speed             | Memory
-----|---------|--------------|-----------------------|-------
1    | Host201 | Super node   | Intel P-D 3.40GHz X 2 | 512M
1    | Host204 | General node | Intel P-D 3.40GHz X 2 | 512M
2    | Host205 | Super node   | Intel P-4 3.40GHz X 2 | 512M
2    | Host208 | General node | Intel P-4 3.40GHz X 2 | 512M
3    | Host206 | Super node   | Intel P-4 3.40GHz X 2 | 512M
3    | Host207 | General node | Intel P-4 3.40GHz X 2 | 512M
4    | Host221 | Super node   | Intel P-4 3.40GHz     | 256M
4    | Host223 | General node | Intel P-4 2.00GHz     | 256M
5    | Host222 | Super node   | Intel P-4 3.40GHz X 2 | 512M
5    | Host224 | General node | Intel P-4 3.40GHz X 2 | 512M
This study applies five benchmarks [9]: f77split, fd_predator_prey, fd1d_heat_explicit, satisfiability and linpack_bench. f77split splits a file containing multiple FORTRAN77 routines into separate files. fd_predator_prey solves a pair of predator-prey ODEs using a finite difference approximation. fd1d_heat_explicit implements a finite difference solution, explicit in time, of the time-dependent 1D heat equation. satisfiability performs, for a particular circuit, an exhaustive search for solutions of the circuit satisfiability problem. linpack_bench is the LINPACK benchmark program. Each benchmark is executed 50, 100, 150 and 200 times in this experiment. To evaluate the performance of SALB, we also implemented the FIFO (First-In First-Out) and JRT [19] strategies for comparison. In this experiment, we focus on the efficiency of the load balancing strategies and record the experimental data every 5 seconds. The time interval for re-selecting neighbors is 60 seconds. THR, TLR, THA and TLA are pre-defined as 20%, -20%, 60% and 40%, respectively. We also assume three types of resources in this experiment: CPU, memory and bandwidth, with weights 0.6, 0.3 and 0.1, respectively.
Fig. 2. (a) Average execution time, and (b) average CPU utilization under different approaches
Figure 2(a) shows the experimental results of applying the different load balancing approaches. The results show that the FIFO strategy has the worst execution time. When the number of jobs increases, the growth in execution time becomes
obvious under the FIFO strategy, because FIFO cannot ensure any improvement in load balancing. The difference in average execution times among the three approaches when executing 50 jobs is slight. The SALB approach outperforms the other two approaches because SALB can migrate idle jobs to under-loaded neighbor sites. In general, JRT does not consider the heterogeneity of jobs. Because SALB considers the heterogeneity of both jobs and sites, it can not only reduce system idle time but also migrate jobs to the sites with the best execution performance. In the meantime, because SALB chooses neighbor sites according to the Small World Theory and the neighbor-selecting index, it can reduce the dynamic information-gathering overhead and improve resource utilization. Figure 2(b) shows the CPU utilization for different numbers of jobs under the SALB, JRT and FIFO strategies. The experimental results show that the SALB strategy achieves the maximum CPU utilization, since JRT considers only the heterogeneity of sites. Because SALB considers job heterogeneity, site heterogeneity and the performance of executing jobs, it can maximize effective resource usage.
Fig. 3. Average CPU utilization when executing (a) 50 jobs, (b) 100 jobs, (c) 150 jobs, (d) 200 jobs
Figure 3 shows the average CPU utilization under the SALB, JRT and FIFO approaches when executing 50, 100, 150 and 200 jobs. SALB outperforms JRT and FIFO when the number of jobs is 100, 150 or 200. The experimental results show that a site adopting SALB can devote more computing resources to executing jobs. This is reasonable, since the SALB approach allocates resources depending on the job
response time, CPU number, memory and communication bandwidth. Therefore, SALB can improve resource utilization and obtain better system performance.
5 Conclusions and Future Work

In this paper, we propose an efficient decentralized load balancing strategy, named the Self-Adaptive Load Balancing (SALB) strategy, for P2P grid systems. The SALB approach consists of two main phases: the neighbor-selecting phase and the job-transferring phase. In the first phase, SALB chooses neighbors according to the Small World Theory and the NSI. Because each site makes decisions using only the information of its neighbors, SALB reduces the information-gathering overhead. In the second phase, when the state of a site is "sendable", the load balancing mechanism is triggered and SALB transfers jobs to the neighbor site with the minimal JTT. Experimental results show that SALB achieves the minimal average execution time and the maximal average CPU utilization; SALB can indeed improve resource utilization, shorten execution time and achieve load balancing. In the future, we plan to improve the neighbor-selecting mechanism and to deploy the SALB strategy on UniGrid to verify its performance.
Acknowledgements This study was sponsored by the National Science Council, Taiwan, Republic of China under contract numbers: NSC 98-2218-E-007-005, NSC 97-2221-E-142-001- MY3, and NSC96-2221-E-126-004-MY3.
References

1. Wang, C., Xiao, L.: An Effective P2P Search Scheme to Exploit File Sharing Heterogeneity. IEEE Transactions on Parallel and Distributed Systems 18(2) (February 2007)
2. Chen, C., Tsai, K.-C.: The Server Reassignment Problem for Load Balancing in Structured P2P Systems. IEEE Transactions on Parallel and Distributed Systems 19(2) (February 2008)
3. Condor, http://www.cs.wisc.edu/condor/
4. Vincze, G., Novák, Z., Pap, Z., Vida, R.: RESERV: A Distributed, Load Balanced Information System for Grid Applications. In: Eighth IEEE International Symposium on Cluster Computing and the Grid
5. gLite, http://glite.web.cern.ch/glite/
6. Globus, http://www.globus.org/toolkit/
7. Sun, J., Li, L., Chen, H., Tan, H.: A Proximity-Aware Load Balancing Algorithm in P2P Systems. In: The 3rd International Conference on Grid and Pervasive Computing - Workshops
8. JNGI, http://jngi.jxta.org
9. Burkardt, J.: http://people.sc.fsu.edu/~burkardt/index.html
10. JXTA, https://jxta.dev.java.net/
11. Kiran, M., Aisha-Hassan, Hashim, Kuan, L.M., Jiun, Y.Y.: Execution Time Prediction of Imperative Paradigm Tasks for Grid Scheduling Optimization. IJCSNS International Journal of Computer Science and Network Security 9(2) (February 2009)
12. Six Degrees of Separation, http://en.wikipedia.org/wiki/Six_Degrees_of_Separation
13. Sun, http://www.sun.com/
14. Therning, N., Bengtsson, L.: Jalapeno: decentralized Grid computing using peer-to-peer technology. In: CF '05: Proceedings of the 2nd Conference on Computing Frontiers (May 2005)
15. UNICORE, http://www.unicore.eu
16. UniGrid, http://140.114.91.35/index.html
17. Li, Y., Yang, Y., Zhu, R.: A Hybrid Load Balancing Strategy of Sequential Tasks for Computational Grids. In: 2009 International Conference on Networking and Digital Society (2009)
18. Pan, Y., Lu, W., Zhang, Y., Chiu, K.: A Static Load-Balancing Scheme for Parallel XML Parsing on Multicore CPUs
19. Wu, Y.-J., Lin, S.-J., Lai, K.-C., Huang, K.-C., Wu, C.-C.: Distributed Dynamic Load Balancing Strategies in P2P Grid Systems. In: The 5th Workshop on Grid Technologies and Applications (WoGTA'08), pp. 95–102. National University of Tainan (December 2008)
20. Yang, Y., Li, Y., Zhu, R.: A Hybrid Load Balancing Strategy of Sequential Tasks for Computational Grids. In: 2009 International Conference on Networking and Digital Society (2009)
21. Duan, Z., Gu, Z.: Dynamic Load Balancing in Web Cache Cluster. In: Seventh International Conference on Grid and Cooperative Computing (2008)
Embedding Algorithms for Star, Bubble-Sort, Rotator-Faber-Moore, and Pancake Graphs

Mihye Kim1, Dongwan Kim2, and Hyeongok Lee2,*

* Corresponding author.

1 Department of Computer Science Education, Catholic University of Daegu, 330 Hayangeup Gyeonsansi Gyeongbuk, South Korea
[email protected]
2 Department of Computer Education, Sunchon National University, 413 Jungangno Suncheon Chonnam, South Korea
{wandong,oklee}@scnu.ac.kr
Abstract. Star, bubble-sort, pancake and Rotator-Faber-Moore (RFM) graphs are well-known interconnection networks that have the node-symmetry, maximum fault tolerance and hierarchical partitioning properties. These graphs are widely held to improve on the network cost of the hypercube. This study proposes embedding methods for the star graph and its variations, and provides an analysis of the associated costs. The results show that a bubble-sort graph can be embedded in a star graph with dilation 3 and in an RFM graph with dilation 2, while a star graph can be embedded in a pancake graph with dilation 4. The results suggest that an embedding method developed for the bubble-sort graph can be simulated in star graphs and RFM graphs in constant time O(1).

Keywords: Interconnection network, Embedding, Star graph, Bubble-sort graph, RFM graph, Pancake graph.
1 Introduction

High-performance computing is becoming increasingly popular due to advances in semiconductor technology and the identification of new applications. The need for parallel processing as a method for achieving high performance has dramatically increased, resulting in much research into parallel computers. Parallel computers may be broadly classified into two fundamental types according to the topology of their interconnection: multi-processors with shared memory and multi-computers with distributed memory [1]. In a multi-computer system, each processor has its own memory and is connected to the others via an interconnection network. Inter-processor communication is achieved by sending messages from one computer to another through the interconnection network. The overall performance of the multi-computer system is thus significantly dependent on the effectiveness of its interconnection network. The most well-known topologies of interconnection networks are the mesh [2], hypercube [3], and star graph [4]. Parameters for measuring the performance of interconnection networks include degree, diameter, embedding, fault tolerance, scalability,
and symmetry [5]. The embedding of an interconnection network is intended to analyze the interrelationship between graphs (topologies), i.e., to observe whether a certain graph G is included in, or interrelated with, another graph H. The evaluation of embeddings is significant because, if a graph G can be embedded efficiently in another graph H at low cost, a method developed for the interconnection network with graph G can be used in the interconnection network with graph H at low cost [1]. Therefore, this paper offers an analysis of the embeddings among the common variations of the star graph, including the star, pancake [8], bubble-sort [6], and Rotator-Faber-Moore (RFM) [7] graphs. The rest of this paper is organized as follows. Section 2 presents the definitions and properties of the graphs from a theoretical perspective. Section 3 outlines the proposed embedding methods with an analysis of their dilations. Section 4 summarizes and concludes the paper.
2 Related Work

An interconnection network can be represented as an undirected graph G = (V, E), with each processor represented as a node (vertex) v of G and the communication channel between two processors represented as an edge (v, w). V(G) and E(G) denote the set of nodes and the set of edges of graph G, respectively; that is, V(G) = {0, 1, 2, ..., n-1} and E(G) consists of pairs of distinct nodes from V(G). There exists an edge (v, w) between two nodes v and w of G if and only if a communication channel between v and w exists. The interconnection networks proposed so far can be divided into three classes based on their number of nodes: the mesh class with n×k nodes, the hypercube class with 2^n nodes, and the star graph class with n! nodes. A variation of the star graph has approximately n! nodes and degree about n. Star, bubble-sort, pancake [8], transposition, macro-star, rotator [1], and RFM graphs have all been proposed as variations of the star graph. The star graph has a lower degree and a smaller diameter than a hypercube with a similar number of nodes.

An n-dimensional star graph Sn consists of n! nodes and n!(n-1)/2 edges. The address of each node is represented as a permutation of the n distinct symbols {1, 2, 3, ..., n}. An edge between nodes v and w exists if and only if the permutation corresponding to node w can be obtained from that of v by interchanging the first symbol of the permutation with any one of the remaining n-1 symbols. An undirected star graph Sn can be defined as in Eq. (1), where S = s1s2...sn is a permutation of the symbol set {1, 2, ..., n} [4]:

V(Sn) = {(s1s2...si...sn) | si ∈ {1, ..., n}, i ≠ j implies si ≠ sj}
E(Sn) = {((s1s2...si...sn), (sis2...s1...sn)) | (s1s2...si...sn) ∈ V(Sn), 2 ≤ i ≤ n}.    (1)
Fig. 1 shows an example of a 4-dimensional star graph. Sn has a recursive structure and can be partitioned into n disjoint (n-1)-dimensional stars Sn-1(1), Sn-1(2), ..., Sn-1(n). The diameter of Sn is ⌊3(n-1)/2⌋, and the average distance between nodes is n + 2/n + Hn - 4, where Hn = Σ_{i=1}^{n} 1/i is the n-th harmonic number [4].
Fig. 1. Example of a 4-dimensional star graph
An n-dimensional bubble-sort graph Bn consists of n! nodes and n!(n-1)/2 edges. The address of each node is represented as a permutation of the n symbols {1, 2, 3, ..., n}. An edge exists between two arbitrary nodes v and w if and only if the permutation corresponding to node w can be obtained from that of v by interchanging two adjacent symbols. The bubble-sort graph Bn can thus be defined as in Eq. (2), where B = b1b2...bn is a permutation of the symbol set {1, 2, ..., n} [6]:

V(Bn) = {(b1b2...bn) | bi ∈ {1, ..., n}, i ≠ j implies bi ≠ bj}
E(Bn) = {((b1b2...bibi+1...bn), (b1b2...bi+1bi...bn)) | (b1b2...bi...bn) ∈ V(Bn), 1 ≤ i ≤ n-1}.    (2)
We define as i-dimensional the edge connecting a permutation B = b1b2b3...bibi+1...bn to the permutation b1b2b3...bi+1bi...bn, in which the two symbols bi and bi+1 in consecutive positions have been exchanged. Because the number of dimensional edges adjacent to B is n-1, the bubble-sort graph Bn is a regular graph of degree n-1 and has a diameter of n(n-1)/2. It is node- and edge-symmetric as well as bipartite. Fig. 2 shows a 4-dimensional bubble-sort graph.
Fig. 2. Example of a 4-dimensional bubble-sort graph
An n-dimensional RFM graph RFMn represents a node with a permutation of the n symbols {1, 2, 3, ..., n}. An edge of RFMn is defined by applying both of the edge generators (i.e., dimensional edges) of the directed rotator graph and of the Faber-Moore graph together [7]. The dimensional edge of a rotator graph Ri is (123...i...n) → (23...i1...n), 2 ≤ i ≤ n, and that of a Faber-Moore graph Fj is (123...(j–1)j(j+1)...n) → (j123...(j–1)(j+1)...n), 2 ≤ j ≤ n. Therefore, an RFMn can be defined by Eq. (3), where R = r1r2...ri...rn is a permutation of the n distinct symbols {1, 2, ..., n} [7].
V(RFMn) = { r1r2...ri...rn | ri ∈ {1, 2, ..., n}, ri ≠ rj for i ≠ j },
E(RFMn) = Ri ∪ Fj, where
Ri = { (r1r2...ri...rn, r2r3...rir1...rn) | r1r2...ri...rn ∈ V(RFMn), 2 ≤ i ≤ n },
Fj = { (r1r2...rj–1rjrj+1...rn, rjr1r2...rj–1rj+1...rn) | r1r2...rj–1rjrj+1...rn ∈ V(RFMn), 2 ≤ j ≤ n }.   (3)

An RFMn graph is regular and has n! nodes, a degree of 2n–3, and a diameter of n–1. It has the maximum fault-tolerance and includes simple cycles. Fig. 3 shows an example of a 4-dimensional RFM graph.
Fig. 3. Example of a 4-dimensional RFM graph
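To make the two edge generators concrete, here is a small Python sketch (our illustration, using the same permutation-as-tuple convention as the star-graph sketch above) of the rotator and Faber-Moore generators; it checks that a node of RFM4 has degree 2n–3 = 5 because R2 and F2 coincide.

```python
def rotator_edge(perm, i):
    # R_i: left-rotate the first i symbols, 2 <= i <= n
    p = list(perm)
    return tuple(p[1:i] + [p[0]] + p[i:])

def faber_moore_edge(perm, j):
    # F_j: move the j-th symbol to the front, 2 <= j <= n
    p = list(perm)
    return tuple([p[j - 1]] + p[:j - 1] + p[j:])

# Node (1, 2, 3, 4) of RFM_4: R_2 and F_2 produce the same neighbor,
# so the degree is 2n - 3 = 5 rather than 2(n - 1).
v = (1, 2, 3, 4)
neighbors = {rotator_edge(v, i) for i in range(2, 5)}
neighbors |= {faber_moore_edge(v, j) for j in range(2, 5)}
assert len(neighbors) == 2 * 4 - 3
```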
3 Embedding Analysis

The embedding of one graph G into another graph H is a mapping mechanism for examining whether the former is included in the structure of the latter and/or how the two are interrelated. This can be interpreted as simulating one interconnection topology with another. The embedding of a graph G into a graph H is defined as a function f = (ø, ρ), where ø maps the set of vertices V(G) one-to-one into the set of vertices V(H), and ρ maps each edge (v, w) in G to a path in H that connects nodes ø(v) and ø(w). Parameters for measuring the efficiency of an embedding method include dilation, congestion, and expansion [9], [10], [11].
Fig. 4. Mapping example
The dilation of an edge e in G is the length of the path ρ(e) in H, and the dilation of the embedding f is the maximum of all edge dilations in G. The congestion of an edge e' in H is the number of paths ρ(e) that include e', and the congestion of the embedding f is the maximum of all edge congestions in H. The expansion of the embedding f is the ratio of the number of vertices in H to that in G. For instance, let each node (vertex) in the set V(G1) be mapped to the node with the same number in the set V(G2) in Fig. 4. Then edge e = (3, 6) in G1 can be mapped to edges (3, 1) and (1, 6) or to edges (3, 7) and (7, 6) in G2 (i.e., to a path from 3 to 6 in G2). Hence, the dilation of this embedding is 2, because the length of the path ρ(e) in G2 is 2. Now let us assume that edge e = (3, 6) in G1 is mapped to edges (3, 1) and (1, 6) in G2. In this case, the congestion is 2, because edge (1, 3) in G2 carries the images of the two edges (1, 3) and (3, 6) of G1, and the expansion is 8/7. In star graph Sn, we assume that two nodes S(=s1s2...si...sn) and S'(=sis2...si–1s1si+1...sn) exist, where the permutation of node S' is obtained by interchanging the symbols s1 and si in the permutation of node S. The edge that connects nodes S and S' is defined as an i-dimensional edge and is denoted as Si, 2 ≤ i ≤ n. In pancake graph Pn, the i-dimensional edge connects node P(=p1p2...pi...pn) and the node with permutation pipi–1pi–2...p1pi+1...pn, in which the symbols from pi down to the first symbol p1 have been reversed. This edge is denoted as Pi, 2 ≤ i ≤ n. In bubble-sort graph Bn, the edge connecting node B(=b1b2...bibi+1...bn) and the permutation b1b2...bi+1bi...bn, in which the two adjacent symbols bi and bi+1 have been exchanged, is defined as an i-dimensional edge and denoted as Bi, 1 ≤ i ≤ n–1. In a star graph Sn, node V is defined as V = Si(U) when node V is connected to an arbitrary node U by dimensional edge Si. We assume that node V is reached from node U by sequentially applying dimensional edges Si, Sj, and Sk. When the dimensional edges are applied in sequence, the permutation of node Si(U) is generated from node U via dimensional edge Si in the first time unit (i.e., Si(U) is adjacent to U via Si). In the second time unit, the permutation of node Sj(Si(U)) is generated from node Si(U) through Sj. In the third time unit, the permutation of Sk(Sj(Si(U))) is created from Sj(Si(U)) via Sk. Consequently, node V is reached from node U using the dimensional edges Si, Sj, and Sk, and is denoted by V = Sk(Sj(Si(U))). Sk(Sj(Si(U))) can be represented simply as SkSjSi(U). The dimensional edges applied sequentially to U are designated as the dimensional edge sequence <Si, Sj, Sk>. The basic principles of embedding applied in this study are as follows. Node mapping among star graph Sn, pancake graph Pn, and bubble-sort graph Bn is based on one-to-one mapping with identical node numbers. When mapping two adjacent nodes (U, V) of a source graph to a target graph, the dimensional edge sequence is defined using the edge definition of the target graph. This is formulated with dimensional edges of
the target graph used for the shortest path from ø(U) to ø(V). The dilation of the embedding is represented as the number of dimensional edges.

Theorem 1. A bubble-sort graph Bn can be embedded into a star graph Sn with dilation 3 and expansion 1.

Proof. We assume that two nodes B(=b1b2b3...bi–1bibi+1...bn) and B'(=b1b2b3...bi–1bi+1bi...bn) exist in bubble-sort graph Bn, and that nodes S(=s1s2s3...si–1sisi+1...sn) and S'(=s1s2s3...si–1si+1sisi+2...sn) exist in star graph Sn. We also assume that nodes B and B' of Bn, connected by an i-dimensional edge, are mapped to nodes S and S' of Sn, respectively. Because nodes S and S' are not adjacent, we prove Theorem 1 by analyzing the dilation of this mapping through the number of dimensional edges employed until the permutation of node S is transformed into the permutation of node S' in Sn. Here we examine the routing process of a specific edge sequence. Let <Si+1, Si, Si+1> be the edge sequence of the shortest path from S to S'. In the first time unit, the permutation of node Si+1(S) becomes si+1s2s3...si–1sis1...sn from node S via dimensional edge Si+1. It follows that SiSi+1(S) = sis2s3...si–1si+1s1...sn from node Si+1(S) through Si. Then, Si+1SiSi+1(S) becomes s1s2s3...si–1si+1si...sn from node SiSi+1(S) via Si+1. Note that the permutations of nodes Si+1SiSi+1(S)(=s1s2s3...si–1si+1si...sn) and S'(=s1s2s3...si–1si+1si...sn) are the same, and the number of dimensional edges employed to route from S to S' is 3. Therefore, nodes B and B' in Bn can be embedded into nodes S and S' in Sn, respectively, with dilation 3.

Theorem 2. The dilation cost for embedding a star graph Sn into a bubble-sort graph Bn is O(n).

Proof. The permutation of S' is sis2s3...si–1s1si+1...sn (2 ≤ i ≤ n), where node S' is adjacent to node S(=s1s2s3...si–1sisi+1...sn) via i-dimensional edge Si in Sn. Two nodes B(=b1b2b3...bi–1bibi+1...bn) and B'(=bib2b3...bi–1b1bi+1...bn) exist in bubble-sort graph Bn. Nodes S and S' of Sn are mapped to nodes B and B' of Bn, respectively. We prove Theorem 2 by showing that the number of dimensional edges on the shortest path from B to B' in Bn is c×n, where c is a constant. Let us examine the shortest path routing from B to B' in Bn. First, we continuously exchange symbol b1 in the first position of node B(=b1b2b3...bi–1bibi+1...bn) with adjacent symbols to move it to the ith position, and then exchange symbol bi with adjacent symbols to move it to the first position. The edge sequence applied in the first part of this routing is <B1, B2, ..., Bi–2, Bi–1>. When this edge sequence is applied sequentially to node B, symbol b1 is successively exchanged with i–1 adjacent symbols, so that b1 moves to the ith position. The resulting permutation is Bi–1Bi–2...B3B2B1(B) = b2b3...bi–1bib1bi+1...bn. Because symbol bi is now in the (i–1)th position of the permutation b2b3...bi–1bib1bi+1...bn, if the edge sequence <Bi–2, Bi–3, ..., B2, B1> is applied sequentially to the permutation of node Bi–1Bi–2...B3B2B1(B), symbol bi is interchanged with i–2 adjacent symbols to achieve a permutation in which bi appears in the first position: B1B2B3...Bi–3Bi–2(Bi–1Bi–2...B3B2B1(B)) = bib2b3...bi–1b1bi+1...bn. The worst case arises for the n-dimensional edge, which is the maximum dimensional edge among those used for routing via i-dimensional edge Si (2 ≤ i ≤ n) in Sn, and the number of dimensional edges applied to
route from B to B' in Bn, onto which nodes S and S' in Sn are mapped, is 2n–3. Hence, we obtain the result that the dilation cost for this embedding process is O(n) (2 ≤ i ≤ n).

Theorem 3. A bubble-sort graph Bn can be embedded into an RFM graph Rn with dilation 2 and expansion 1.

Proof. A one-to-one mapping between the n! nodes of Bn and the n! nodes of Rn is possible with identical addresses. The process of mapping the nodes of the two graphs is as follows. First, node B(=b1b2b3...bi...bn) in Bn is mapped to node R(=r1r2r3...ri...rn) in Rn; then node B', connected to B by an i-dimensional edge in Bn, is mapped to node R' in Rn. To analyze the dilation of this embedding, we divide the edges of Bn into three cases based on their dimension. We then analyze the dilation through the number of edges used to generate the address (permutation) of node R' from the address of node R when mapping nodes B and B' in Bn to nodes R and R' in Rn. Theorem 3 is proved by dividing the problem into three cases depending on the dimensional edges of Bn.

Case 1. 1-dimensional edge (i=1)
The permutation of node B' is b2b1b3...bi...bn, where B' is adjacent to node B(=b1b2b3...bi...bn) via a one-dimensional edge in Bn. Here, node R(=r1r2r3...ri...rn) is connected to node R'(=r2r1r3...ri...rn) by two-dimensional edge R2 in Rn. Therefore, it is clear that the bubble-sort graph Bn can be embedded into the RFM graph Rn with dilation 1 when the two nodes B and B', adjacent via B1, are mapped to nodes R and R' of Rn, respectively.

Case 2. 2-dimensional edge (i=2)
B'(=b1b3b2b4...bi...bn) is a node that is adjacent to node B(=b1b2b3b4...bi...bn) via a two-dimensional edge in Bn. Because node R(=r1r2r3...ri...rn) is not adjacent to node R'(=r1r3r2...ri...rn) in Rn, we analyze the dilation using the number of edges required for the shortest path routing from R to R' in Rn. Take the dimensional edge sequence required for routing from R to R' to be <F3, R2>. Following this edge sequence, the permutation of node F3(R) first becomes r3r1r2...ri...rn from node R via dimensional edge F3, and then R2(F3(R)) = r1r3r2...ri...rn from node F3(R) through dimensional edge R2. Because the permutation of node R2(F3(R))(=r1r3r2...ri...rn) is the same as that of node R'(=r1r3r2...ri...rn), the nodes B and B' adjacent via the two-dimensional edge in Bn can be embedded into RFM graph Rn with dilation 2.

Case 3. i ≥ 3-dimensional edge
The permutation of node B', adjacent to node B(=b1b2b3...bi–1bibi+1bi+2...bn) via an i-dimensional edge of Bn, is b1b2b3...bi–1bi+1bibi+2...bn. Because nodes R(=r1r2r3...ri–1riri+1ri+2...rn) and R'(=r1r2r3...ri–1ri+1riri+2...rn) are not adjacent to each other in Rn, we analyze the dilation using the number of dimensional edges required for the shortest path routing from R to R'. Take the dimensional edge sequence required for optimal routing from R to R' to be <Fi, Ri+1>. The routing process using this sequence is as follows. First, the permutation of node Fi(R) becomes rir1r2r3...ri–1ri+1ri+2...rn from node R via dimensional edge Fi. Next, Ri+1(Fi(R)) becomes r1r2r3...ri–1ri+1riri+2...rn from node Fi(R) through dimensional edge Ri+1. We can thus see that the permutation of node Ri+1(Fi(R))(=r1r2r3...ri–1ri+1riri+2...rn), obtained by applying the dimensional edge sequence
<Fi, Ri+1> to node R of Rn, is the same as the permutation of node R'(=r1r2r3...ri–1ri+1riri+2...rn). Therefore, node B', adjacent to node B through three- and greater-dimensional edges (i ≥ 3), can be mapped to node Ri+1(Fi(R)) in Rn, which is formed by applying two dimensional edges sequentially to node R in Rn. This embedding is thus possible with dilation 2.
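The three cases above are easy to check mechanically. The following self-contained Python sketch (our illustration; the generator functions are restated here so the fragment runs on its own) verifies that the edge sequences R2, <F3, R2>, and <Fi, Ri+1> reproduce the bubble-sort edges B1, B2, and Bi (i ≥ 3), respectively, starting from the identity permutation.

```python
def bubble_edge(perm, i):
    # B_i: swap the adjacent symbols at positions i and i+1 (1-based)
    p = list(perm)
    p[i - 1], p[i] = p[i], p[i - 1]
    return tuple(p)

def rotator_edge(perm, i):
    # R_i: left-rotate the first i symbols
    p = list(perm)
    return tuple(p[1:i] + [p[0]] + p[i:])

def faber_moore_edge(perm, j):
    # F_j: move the j-th symbol to the front
    p = list(perm)
    return tuple([p[j - 1]] + p[:j - 1] + p[j:])

n = 8
base = tuple(range(1, n + 1))
for i in range(1, n):
    target = bubble_edge(base, i)                 # B_i neighbor in B_n
    if i == 1:
        image = rotator_edge(base, 2)             # Case 1: R_2 (dilation 1)
    elif i == 2:
        image = rotator_edge(faber_moore_edge(base, 3), 2)      # <F_3, R_2>
    else:
        image = rotator_edge(faber_moore_edge(base, i), i + 1)  # <F_i, R_{i+1}>
    assert image == target
```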
Theorem 4. An n-dimensional star graph Sn can be embedded into an n-dimensional pancake graph Pn with dilation 4 and expansion 1.

Proof. The permutation of node S' is sis2s3...si–1s1si+1...sn, where S' is adjacent to node S(=s1s2s3...si–1sisi+1...sn) via i-dimensional edge Si (2 ≤ i ≤ n). Consider nodes S and S' in Sn to be mapped to nodes P(=p1p2p3...pi–1pipi+1...pn) and P'(=pip2p3...pi–1p1pi+1...pn) in Pn, respectively. Because nodes P and P' are not adjacent to each other in Pn, we analyze the dilation using the number of dimensional edges required for the shortest path routing from P to P' in Pn. We prove Theorem 4 by dividing it into four cases depending on the i-dimensional edge of the star graph.

Case 1. 2 ≤ i ≤ 3-dimensional edge
The permutation of node S' is s2s1s3...sn, where S' is adjacent to node S(=s1s2s3...sn) through two-dimensional edge S2 in Sn. In pancake graph Pn, node P(=p1p2p3...pn) is adjacent to node P'(=p2p1p3...pn) via two-dimensional edge P2. Consequently, nodes S and S', which are connected by a two-dimensional edge, are mapped to pancake graph Pn with dilation 1. The permutation of node S' connected to S by three-dimensional edge S3 is s3s2s1s4...sn. Similarly, because nodes P and P' are adjacent through three-dimensional edge P3 in Pn, the nodes S and S' of Sn are mapped to pancake graph Pn with dilation 1.

Case 2. 4-dimensional edge (i = 4)
The permutation of node S' is s4s2s3s1s5...sn, where node S' is adjacent to node S(=s1s2s3s4s5...sn) via four-dimensional edge S4 in Sn. When mapping nodes S and S'(=s4s2s3s1s5...sn) in Sn to nodes P(=p1p2p3p4p5...pn) and P'(=p4p2p3p1p5...pn) in Pn, respectively, take the dimensional edge sequence required for the shortest path routing from node P to node P' in Pn to be <P4, P3, P2, P3>. Then, examine the change in permutation from P to P' when this dimensional edge sequence is applied sequentially to node P. At first, the permutation of node P4(P) becomes p4p3p2p1p5...pn from node P through four-dimensional edge P4. Next, P3P4(P) becomes p2p3p4p1p5...pn from node P4(P) via P3. It follows that P2P3P4(P) = p3p2p4p1p5...pn from node P3P4(P) through P2. Then, the permutation of P3P2P3P4(P) is p4p2p3p1p5...pn from node P2P3P4(P) via P3. Consequently, the permutation of node P3P2P3P4(P)(=p4p2p3p1p5...pn), obtained by sequentially applying the dimensional edge sequence <P4, P3, P2, P3> to node P in Pn, is the same as the permutation of node S'(=s4s2s3s1s5...sn) in Sn. Therefore, nodes S and S' connected by four-dimensional edge S4 in Sn can be embedded into pancake graph Pn with dilation 4.

Case 3. 5 ≤ i ≤ (n–1)-dimensional edge
The permutation of node S' is sis2s3s4...si–1s1si+1...sn, where node S' is adjacent to node S(=s1s2s3s4s5...si–1sisi+1...sn) via i-dimensional edge Si. When mapping
nodes S and S'(=sis2s3s4...si–1s1si+1...sn) in Sn to nodes P(=p1p2p3...pi–1pipi+1...pn) and P'(=pip2p3p4...pi–1p1pi+1...pn) in Pn, respectively, we assume that the dimensional edge sequence <Pi–1, Pi–2, Pi–1, Pi> is used for the shortest path routing from P to P' in Pn. Now we examine the change in permutation from node P to node P' when this dimensional edge sequence is applied sequentially to node P. At first, the permutation of node Pi–1(P) becomes pi–1pi–2pi–3...p3p2p1pipi+1...pn from node P through (i–1)-dimensional edge Pi–1. Then, Pi–2Pi–1(P) is p2p3p4...pi–2pi–1p1pipi+1...pn from node Pi–1(P) via Pi–2, after which Pi–1Pi–2Pi–1(P) becomes p1pi–1pi–2...p4p3p2pipi+1...pn from node Pi–2Pi–1(P) through Pi–1. Then, PiPi–1Pi–2Pi–1(P) = pip2p3p4...pi–2pi–1p1pi+1...pn from node Pi–1Pi–2Pi–1(P) via Pi. Therefore, the permutation of node PiPi–1Pi–2Pi–1(P)(=pip2p3p4...pi–2pi–1p1pi+1...pn), obtained by sequentially applying the dimensional edge sequence <Pi–1, Pi–2, Pi–1, Pi> to node P in Pn, is the same as the permutation of node S'(=sis2s3s4...si–1s1si+1...sn) in Sn. Therefore, we observe that nodes S and S' connected by an i-dimensional edge (5 ≤ i ≤ (n–1)) in Sn can be embedded into pancake graph Pn with dilation 4.

Case 4. n-dimensional edge (i = n)
The permutation of node S' is sns2s3...sn–1s1, where S' is adjacent to node S(=s1s2s3...sn–1sn) via the n-dimensional edge in the star graph Sn. When mapping nodes S and S'(=sns2s3...sn–1s1) in the star graph Sn to nodes P(=p1p2p3...pn–1pn) and P'(=pnp2p3...pn–1p1) in the pancake graph Pn, respectively, take the dimensional edge sequence required for the shortest path routing from P to P' in Pn to be <Pn, Pn–1, Pn–2, Pn–1>. Then, examine the change in permutation from P to P' when this dimensional edge sequence is applied sequentially to node P. At first, the permutation of node Pn(P) becomes pnpn–1...p3p2p1 from node P via dimensional edge Pn. Then, Pn–1Pn(P) is p2p3p4...pn–1pnp1 from node Pn(P) through dimensional edge Pn–1. It follows that Pn–2Pn–1Pn(P) is pn–1pn–2...p4p3p2pnp1 from node Pn–1Pn(P) through dimensional edge Pn–2. Then, Pn–1Pn–2Pn–1Pn(P) becomes pnp2p3...pn–1p1 from node Pn–2Pn–1Pn(P) via Pn–1. Consequently, the permutation of node Pn–1Pn–2Pn–1Pn(P)(=pnp2p3...pn–1p1) in the pancake graph and the permutation of node S'(=sns2s3...sn–1s1) in Sn are the same. Therefore, nodes S and S' connected by the n-dimensional edge in Sn can be embedded into pancake graph Pn with dilation 4.
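The four flip sequences can again be checked mechanically. The following self-contained Python sketch (our illustration, restating the star generator) applies the sequences from the four cases of Theorem 4 and confirms that each star edge Si is realized in the pancake graph with at most four flips.

```python
def star_edge(perm, i):
    # S_i: swap the first symbol with the i-th symbol (1-based)
    p = list(perm)
    p[0], p[i - 1] = p[i - 1], p[0]
    return tuple(p)

def pancake_edge(perm, i):
    # P_i: reverse (flip) the first i symbols
    p = list(perm)
    return tuple(p[:i][::-1] + p[i:])

def embed_star_edge(perm, i, n):
    # flip sequence realizing S_i, following the four cases of Theorem 4
    if i in (2, 3):
        seq = [i]                        # Case 1: dilation 1
    elif i == 4:
        seq = [4, 3, 2, 3]               # Case 2: <P4, P3, P2, P3>
    elif i < n:
        seq = [i - 1, i - 2, i - 1, i]   # Case 3: <P_{i-1}, P_{i-2}, P_{i-1}, P_i>
    else:
        seq = [n, n - 1, n - 2, n - 1]   # Case 4: <P_n, P_{n-1}, P_{n-2}, P_{n-1}>
    for k in seq:
        perm = pancake_edge(perm, k)
    return perm

n = 7
base = tuple(range(1, n + 1))
for i in range(2, n + 1):
    assert embed_star_edge(base, i, n) == star_edge(base, i)
```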
4 Conclusion

The star graph, which is common in the topology of parallel computers, is an interconnection network with a node-symmetric hierarchical structure, a smaller diameter, and greater fault-tolerance than the hypercube. In this paper, we have proposed methods for embedding the star, bubble-sort, pancake, and RFM graphs into one another, and we have analyzed the relevant embedding costs. The embedding methods proposed in this study are based on computing the minimum number of edges used in a target graph when a source graph is mapped to it. The edge definitions of the graphs were used to compute the number of edges, assuming that the graphs have the same number of nodes. The dilation of an embedding was then analyzed by the number of edges used for the shortest path routing in the target graph. The use of this edge
definition to analyze embeddings is possible because the star, bubble-sort, pancake, and RFM graphs are all node-symmetric. The results of this study show that the bubble-sort graph Bn can be embedded into the star graph Sn with dilation 3 and expansion 1, and into the RFM graph Rn with dilation 2 and expansion 1. In addition, an n-dimensional star graph can be embedded into an n-dimensional pancake graph with dilation 4 and expansion 1. The results suggest that a method developed for the bubble-sort graph Bn can be simulated on both the star graph Sn and the RFM graph RFMn in additional constant time O(1).

Acknowledgement. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2009-0086676).
References

1. Feng, T.: A Survey of Interconnection Networks. IEEE Computer, 12–27 (December 1981)
2. Ranka, S., Wang, J., Yeh, N.: Embedding Meshes on the Star Graph. Journal of Parallel and Distributed Computing 19, 131–135 (1993)
3. Saad, Y., Schultz, M.H.: Topological Properties of Hypercubes. IEEE Trans. Comput. 37, 867–872 (1988)
4. Akers, S.B., Harel, D., Krishnamurthy, B.: The Star Graph: An Attractive Alternative to the n-Cube. In: Proc. International Conference on Parallel Processing, pp. 393–400 (August 1987)
5. Azevedo, M.M., Bagherzadeh, N., Latifi, S.: Low Expansion Packing and Embedding of Hypercubes into Star Graphs: A Performance-Oriented Approach. IEEE Trans. Parallel and Distributed Systems 9(3), 261–274 (1998)
6. Chou, Z., Hsu, C., Sheu, J.: Bubblesort Star Graphs: A New Interconnection Network. In: 9th International Parallel Processing Symposium, pp. 41–48 (1996)
7. Lee, H., Hye, Y., Lim, H.: RFM Graphs: A New Interconnection Network Using the Graph Product. The KIPS Transactions: Part D 5, 2615–2626 (1998)
8. Mohammad, H., Hal, I.: On the Diameter of the Pancake Network. Journal of Algorithms 25, 67–94 (1997)
9. Qiu, K., Akl, S.G., Meijer, H.: On Some Properties and Algorithms for the Star and Pancake Interconnection Networks. Journal of Parallel and Distributed Computing 22, 16–25 (1994)
10. Qiu, K., Meijer, H., Akl, S.G.: Parallel Routing and Sorting on the Pancake Networks. In: Dehne, F., Fiala, F., Koczkodaj, W.W. (eds.) ICCI 1991. LNCS, vol. 497, pp. 235–242. Springer, Heidelberg (1991)
11. Corbett, P.F.: Rotator Graphs: An Efficient Topology for Point-to-Point Multiprocessor Networks. IEEE Trans. Parallel and Distributed Systems 3(5), 622–626 (1992)
Performance Estimation of Generalized Statistical Smoothing to Inverse Halftoning Based on the MTF Function of Human Eyes

Yohei Saika1,2, Kouki Sugimoto1, and Ken Okamoto2

1 Department of Electrical and Computer Engineering, Wakayama National College of Technology, Nada Noshima 77, Gobo, Wakayama 644-0023, Japan
[email protected]
2 Department of Advanced Engineering, Wakayama National College of Technology, Nada Noshima 77, Gobo, Wakayama 644-0023, Japan
[email protected]
Abstract. We construct a method of generalized statistical smoothing (GSS) for the problem of inverse halftoning of a halftone image converted by the error diffusion method. In particular, we construct the present method so as to achieve the optimal performance on the basis of the mean square error (MSE) between the original and restored images, both of which are observed through the MTF function of the human vision system. Using numerical simulations for several 256-level standard images, we clarify that the optimal performance of the GSS is realized if we appropriately set the parameters controlling both the edge enhancement procedure and the generalized parameter scheduling. We also find that the GSS restores the original image more accurately than other conventional filters, such as the average and Gaussian filters.
1 Introduction

In the field of print technology, many image processing technologies have been playing important roles in printing digital images, such as grayscale and color images. Among these technologies, the technique called digital halftoning [1] is essential for converting a multi-level image into a halftone image. The halftone version of a grayscale image, which is expressed as a set of black and white dots, is visually similar to the original image if we observe it through the human vision system. For many years, various techniques have been proposed, such as the dither method [2] and the error diffusion method [3]. On the other hand, the technique called inverse halftoning [4] is also important for reconstructing the grayscale image from the halftone image. For this problem, various methods have been attempted. From the theoretical point of view, the MAP estimate based on Bayesian inference has been applied by Stevenson [5]. Recently, Saika and Inoue [6] have investigated inverse halftoning based on Bayesian inference for the halftone image obtained by the dither method. Then, Saika et al. [7] have investigated the Bayes-optimal solution of the maximizer of the posterior marginal (MPM) estimate for inverse halftoning on the basis of the statistical mechanics of the Q-Ising model
on the square lattice. They estimated the statistical performance of the MPM estimate. From the practical point of view, conventional filters such as the average and Gaussian filters have been used for this problem. Following this strategy, Wong [8] proposed statistical smoothing both for the dither and error diffusion methods. Then, in order to improve the performance of statistical smoothing, Saika and Yamasaki [9] introduced generalized parameter scheduling into statistical smoothing. In this study, from the practical point of view, we construct a method of inverse halftoning by making use of generalized statistical smoothing (GSS), so that the optimal performance is realized when the grayscale images are observed through the MTF function which approximates the human vision system [10]. In particular, we construct the method by making use of the edge-enhancing threshold and the parameter scheduling, both of which are introduced into the original statistical smoothing. Then, using numerical simulations for several 256-level standard images, such as “Lena” and “Cameraman” with 256×256 pixels, we show that the performance of the GSS is improved by introducing both the edge-enhancing threshold and the generalized parameter scheduling, if we estimate the performance based on the MTF function approximating the human vision system. We also show that the optimal performance of the GSS is superior to those of other conventional practical filters, such as the average and Gaussian filters. The contents of this article are as follows. In Section 2, we briefly show the general formulation of the GSS for the problem of inverse halftoning of the halftone image obtained by the error diffusion algorithm using the Floyd-Steinberg kernel. Then we show the MSE for the grayscale image observed through the MTF function which represents the human vision system. In Section 3, using numerical simulations for standard images, we investigate the performance of the GSS based on the performance estimation using the MTF function of the human eyes. Section 4 is devoted to a summary and discussion.
2 General Formulation

In this Section, as shown in Fig. 1, we show the general formulation of the GSS for the problem of inverse halftoning of the halftone image obtained by the error diffusion algorithm (Fig. 2) using the Floyd-Steinberg kernel (Fig. 3). We first consider an original grayscale image {ξx,y} (ξx,y = 0, …, 255; x, y = 0, …, L–1) arranged on the square lattice. Here the pixel value ξx,y represents the brightness at the (x,y)-th site on the square lattice. We use several standard images, such as “Lena” in Fig. 4(a) and “Cameraman” in Fig. 5(a). Then we convert the original gray-level image {ξx,y} into a halftone image {τx,y} (τx,y = 0, …, Q–1; x, y = 0, …, L–1) using the conventional error diffusion algorithm following the block diagram shown in Fig. 2. As shown in Fig. 4(b) and Fig. 5(b), the density of the black and white dots is approximately proportional to the original image, and therefore the halftone image is seen to be visually similar to the original grayscale image.
Fig. 1. Digital halftoning and inverse halftoning

Fig. 2. Error diffusion algorithm

Fig. 3. Floyd-Steinberg’s kernel
Next, we reconstruct the original grayscale image by making use of the GSS using the information in the halftone image {τx,y}. In this study, we tune the parameters controlling both the edge enhancement and the parameter scheduling so as to restore the original grayscale image with high image quality when the images are observed through the MTF function which approximates the human vision system. Following the original version of statistical smoothing [8], the present method also proceeds pixel by pixel in a raster scan. The core process at each pixel is composed of two steps, as follows. In the first step, we compute the mean value:
Fig. 4. (a) The 256-level standard image “Lena” with 256×256 pixels, (b) the halftone image converted from the standard image (a) by error diffusion using the Floyd-Steinberg kernel, (c) the restored image obtained by the GSS when κ=2.0 and D=0, (d) the restored image obtained by the GSS when κ=0.2 and D=0, (e) the restored image obtained by the GSS when D=35 and κ=2.5, (f) the restored image obtained by the GSS when D=20 and κ=2.5, (g) the restored image obtained by the Gaussian filter, (h) the restored image obtained by the average filter.
Fig. 5. (a) The 256-level standard image “Cameraman” with 256×256 pixels, (b) the halftone version of the standard image (a) converted by error diffusion using the Floyd-Steinberg kernel, (c) the restored image from the halftone image (b) obtained by the GSS when D=0, κ=0.2, (d) the restored image from the halftone image (b) obtained by the GSS when D=0, κ=2.0, (e) the restored image obtained by the Gaussian filter, (f) the restored image obtained by the average filter.
μ_{m,n} = Σ_{(i,j)∈R_{m,n}} a_{i,j} x^{old}_{i,j},   (1)
which is averaged over the pixels in the region Rm,n. Here, {ai,j} is the kernel of a conventional filter, such as the average or Gaussian filter. The region Rm,n is composed of the (m,n)-th site and the sites (m+δx, n+δy) around the (m,n)-th site for which the condition:
Performance Estimation of Generalized Statistical Smoothing
|x
− x |< D
old
old
m +δ x , n+δ y
363
(3)
m ,n
is satisfied. Here, D is the threshold introduced to detect edges in the halftone image; it is set appropriately depending on the choice of the halftone image. If we set D=256, the present method is regarded as the original version of statistical smoothing proposed by Wong [8]. On the other hand, as clearly seen from eq. (3), the smoothing procedure does not work if D=0. In this procedure, we then compute a measure vm,n given by the standard deviation:
v_{m,n} = [ (1/‖R_{m,n}‖) Σ_{(i,j)∈R_{m,n}} | x^{old}_{i,j} − μ_{m,n} |^r ]^{1/r},   (4)
averaged over the pixels included in the region Rm,n. Here ‖Rm,n‖ is the number of pixels in the region Rm,n. The next step of the core process is the smoothing procedure:
x^{new}_{m,n} = μ_{m,n} + γ v_{m,n}   if x^{old}_{m,n} > μ_{m,n} + γ v_{m,n},
x^{new}_{m,n} = μ_{m,n} − γ v_{m,n}   if x^{old}_{m,n} < μ_{m,n} − γ v_{m,n},
x^{new}_{m,n} = x^{old}_{m,n}   otherwise.   (5)

Here γ is a positive parameter which depends on the number of steps. Through the procedure of the GSS, we develop the parameter γ following the scheduling

γ = (n/5)^κ,   (6)
where n (n = 1, …, 5) is the number of the step. This scheduling is also shown in Fig. 6. If we set κ=1, the present method is the same as the original version of statistical smoothing proposed by Wong [8]. The present method reduces to a conventional smoothing filter, such as the average or Gaussian filter, when γ=0. In order to clarify the performance of the present method for several standard images, we numerically evaluate the MSE between the original and restored images, as well as the MSE between the original and restored images observed through the MTF function approximating the human vision system. First, to estimate the accuracy of restoring the original image, we use the MSE
σ = (1/L²) Σ_{x=0}^{L−1} Σ_{y=0}^{L−1} ( z_{x,y} − ξ_{x,y} )².   (7)
Here {ξx,y} and {zx,y} are the original and restored images. Next, as shown in Fig. 7, we also investigate the performance of our method when the images are observed through the MTF function:
H(kx, ky) = 5.05 exp[ −0.138 √(kx² + ky²) ] ( 1 − exp[ −0.1 √(kx² + ky²) ] )   if √(kx² + ky²) ≥ 5,
H(kx, ky) = 1   if √(kx² + ky²) < 5,   (8)
which approximates the human vision system. Here we can see that the human vision system works as a low-pass filter for grayscale images. When we estimate the performance of our method for image restoration, we evaluate the similarity between the original and restored images as
σ′ = (1/L²) Σ_{x=0}^{L−1} Σ_{y=0}^{L−1} ( ẑ_{x,y} − ξ̂_{x,y} )²,   (9)

where ẑ_{x,y} (and, in the same way, ξ̂_{x,y}) is constructed by filtering the image through the MTF function in Fourier space:

ẑ_{kx,ky} = Σ_{x=1}^{L} Σ_{y=1}^{L} exp[ −i(kx x + ky y) ] z_{x,y},   (10)

ẑ′_{kx,ky} = ẑ_{kx,ky} H(kx, ky),   (11)

ẑ_{x,y} = (1/L²) Σ_{kx=1}^{L} Σ_{ky=1}^{L} exp[ i(kx x + ky y) ] ẑ′_{kx,ky}.   (12)
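For concreteness, the following minimal Python/NumPy sketch (ours, not from the paper) implements the core step of eqs. (1) and (3)-(6) together with the MTF-weighted MSE of eqs. (8)-(12). A uniform kernel a_{i,j}, a 3×3 window, simple border handling, and the integer frequency grid are assumptions of this sketch.

```python
import numpy as np

def gss_step(x, gamma, D=35, half=1, r=2):
    # One raster-scan pass of the GSS core step, eqs. (1), (3)-(5).
    L = x.shape[0]
    out = x.copy()
    for m in range(L):
        for n in range(L):
            # region R_{m,n}: centre plus neighbours whose old value
            # differs from the centre by less than D        -- eq. (3)
            win = x[max(0, m - half):m + half + 1,
                    max(0, n - half):n + half + 1]
            region = win[np.abs(win - x[m, n]) < D] if D > 0 \
                     else np.array([x[m, n]])
            mu = region.mean()                                    # eq. (1)
            v = np.mean(np.abs(region - mu) ** r) ** (1.0 / r)    # eq. (4)
            if x[m, n] > mu + gamma * v:                          # eq. (5)
                out[m, n] = mu + gamma * v
            elif x[m, n] < mu - gamma * v:
                out[m, n] = mu - gamma * v
    return out

def gss(halftone, kappa=2.5, D=35, steps=5):
    # Generalized parameter scheduling, eq. (6).
    x = halftone.astype(float)
    for step in range(1, steps + 1):
        x = gss_step(x, (step / 5.0) ** kappa, D)
    return x

def mtf(L):
    # Dooley-Shaw-type MTF of eq. (8) on an L x L frequency grid.
    k = np.fft.fftfreq(L) * L
    kx, ky = np.meshgrid(k, k, indexing="ij")
    f = np.sqrt(kx ** 2 + ky ** 2)
    H = 5.05 * np.exp(-0.138 * f) * (1.0 - np.exp(-0.1 * f))
    return np.where(f < 5.0, 1.0, H)

def mse_mtf(z, xi):
    # MSE of eq. (9): both images filtered by H via eqs. (10)-(12).
    H = mtf(z.shape[0])
    zf = np.fft.ifft2(np.fft.fft2(z) * H).real
    xf = np.fft.ifft2(np.fft.fft2(xi) * H).real
    return np.mean((zf - xf) ** 2)
```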
3 Performance

In this Section, using numerical simulations for several 256-level standard images with 256×256 pixels, such as “Lena”, we investigate the performance of the GSS for the problem of inverse halftoning of the halftone version of the 256-level standard image “Lena” (Fig. 4(a)) converted by the error diffusion method using the Floyd-Steinberg kernel. First we show the conditions of the numerical simulations for this problem. Then we show the performance of the GSS for the halftone version of the 256-level standard image “Lena” based on the MTF function of the human vision system and on the pixel-wise similarity. In the numerical simulations, we use the 256-level standard image “Lena” with 256×256 pixels. We convert the gray-level image into its halftone version by making use of the error diffusion algorithm using the 3×5 Floyd-Steinberg kernel in Fig. 3. Next, we reconstruct the original grayscale image by the GSS. Further, we estimate the performance based on the MSE and on the MSE for the grayscale image observed through the MTF function which approximates the human vision system. First, we estimate the performance of the GSS for the halftone version of the 256-level standard image “Lena” with 256×256 pixels rewritten by the error diffusion method using the 3×5 Floyd-Steinberg kernel, based on the MSE in eq. (7) and the
Fig. 6. Generalized parameter scheduling
Fig. 7. Performance measure based on the MTF function representing the human vision system
Fig. 8. The MSE and the MSE for the grayscale image observed through the MTF function as a function of the parameter D obtained by the GSS to inverse halftoning for the halftone version of the 256-level standard image “Lena” with 256×256 pixels
MSE for the grayscale image observed through the MTF function in eq. (9). First, using the numerical simulation for the 256-level standard image “Lena” with 256×256 pixels, we obtain σ=201.25 and σ′=185.34 if we tune the parameter κ so as to minimize σ, that is, κ=2.0 when D=0. On the other hand, we obtain σ=213.28 and σ′=178.34 if we tune the parameter κ so as to minimize σ′, that is, κ=0.2 when D=0. The restored images are shown in Figs. 4(c) and (d). These results indicate that the edge enhancement by means of the generalized parameter scheduling is useful for image reconstruction with high image quality. Next, we further investigate the performance of the GSS for the halftone version of the 256-level standard image “Lena” rewritten by the error diffusion algorithm using the Floyd-Steinberg kernel, when the edge enhancing procedure caused by the threshold D (>0) is introduced at κ=2.5. As shown in Fig. 8, we make clear that the optimal performance of the GSS is realized at D=35 when the performance is examined via the MTF function which represents the human vision system, although the optimal performance of the GSS is realized around D=20 when the performance is examined without the MTF function. The restored images at D=20 and D=35 are shown in Figs. 4(e) and (f). These results indicate that the optimal image depends on which estimate we choose, and that a grayscale image of high quality through the human vision system is reconstructed by the GSS by introducing the edge enhancement procedure, i.e., the threshold D, together with the generalized parameter scheduling. Further, using the numerical simulation for the 256-level standard image “Lena” with 256×256 pixels, we clarify that the GSS restores the original image more accurately than other methods, such as the average and Gaussian filters. Actually, we obtain the result that the average/Gaussian filter realizes a performance of σ′=350.78/205.23, and therefore the restored images (Figs. 4(g) and (h)) obtained by the average and Gaussian filters are less accurate than that obtained by the GSS, if we estimate the performance through the MTF function approximating the human vision system. In addition, as shown in Figs. 5(a)-(f), the above properties clarified for the 256-level standard image “Lena” also hold for the 256-level standard image “Cameraman”.
4 Summary and Discussion

In the previous Sections, we constructed a method of inverse halftoning by making use of the GSS for the halftone version of a 256-level grayscale image, so that the optimal performance is realized when the grayscale image is observed through the human vision system. For this purpose, we introduced both the edge enhancement procedure and the generalized parameter scheduling into the original version of statistical smoothing in order to restore the original grayscale image with high image quality. To evaluate the performance of this method, we investigated the MSE between the original and restored images, both of which are observed through the MTF function. The numerical simulation for the 256-level grayscale standard image “Lena” makes clear that the performance of the present method is improved by introducing both the edge enhancing procedure and the generalized parameter scheduling into the original statistical smoothing, even when we investigate the
performance through the human vision system. We also find that the optimal performance of the present method is superior to those of other practical filters, such as the conventional average and Gaussian filters. These results suggest that the present method is practically useful for the problem of inverse halftoning. As future problems, in order to establish that this approach is a practically useful technique for inverse halftoning, we are going to investigate the performance of the GSS with respect to various aspects of image quality, such as the accuracy of edge reconstruction. We are also going to investigate the performance of the GSS based on performance estimation using an MTF function which approximates the human vision system more accurately.
References

1. Ulichney, R.: Digital Halftoning. The MIT Press, Cambridge (1987)
2. Bayer, B.E.: An Optimum Method for Two-Level Rendition of Continuous-Tone Pictures. In: ICC Conf. Record, pp. 11–15 (1973)
3. Floyd, R.W., Steinberg, L.: Adaptive Algorithm for Spatial Gray Scale. In: SID Int. Sym. Digest of Tech. Papers, pp. 36–37 (1975)
4. Miceli, C.M., Parker, K.J.: Inverse Halftoning. J. Electron. Imaging 1, 143–151 (1992)
5. Stevenson, R.: Inverse Halftoning via MAP Estimation. IEEE Trans. Image Processing 6, 574–583 (1995)
6. Saika, Y., Inoue, J.: Probabilistic Inference to the Problem of Inverse Halftoning Based on Statistical Mechanics of Spin Systems. In: Proceedings of SICE-ICCAS 2006, pp. 4563–4568 (2006)
7. Saika, Y., Inoue, J., Tanaka, H., Okada, M.: Bayes-Optimal Solution to Inverse Halftoning. Central European Journal of Physics, 444–456 (2009)
8. Wong, P.W.: Inverse Halftoning and Kernel Estimation for Error Diffusion. IEEE Trans. Image Processing 4, 486–498 (1995)
9. Saika, Y., Yamasaki, T.: Generalized Statistical Smoothing to the Problem of Inverse Halftoning for Error Diffusion. In: Proceedings of ICCAS 2007, pp. 781–784 (2007)
10. Dooley, R.P., Shaw, R.: Noise Reduction Perception in Electrography. J. Appl. Photogr. Eng. 5(4), 190–196 (1979)
Power Improvement Using Block-Based Loop Buffer with Innermost Loop Control

Ming-Yuan Zhong1 and Jong-Jiann Shieh2

1 [email protected]
2 Associate Professor, Department of Computer Science and Engineering, Tatung University, No 40 Chungshan North Road Section 3, Taipei, Taiwan 104
[email protected]
Abstract. The on-chip cache consumes a substantial portion of the energy in today's processors. Loops have temporal locality, so loop buffers have been proposed. We attempt to apply the concept of the trace cache to the architecture of the loop buffer; however, the trace cache is quite bulky and complicated. Using a trace cache as a loop buffer does save energy, but it degrades overall performance because of the long latency at the fetch stage. We therefore propose (1) performing innermost loop detection at the commit stage and filling/activating at the fetch stage, and (2) having the loop buffer store innermost loops with forward branches by packing the instructions captured from the instruction cache into basic blocks. With these modifications, we aim to strengthen the loop buffer to gain performance and reduce more power. Results with SPEC2000 indicate up to 45% (integer benchmarks) and 55% (floating-point benchmarks) reductions in instruction fetch power compared with the design without a loop buffer. Furthermore, we obtain 3% (integer benchmarks) and 2% (floating-point benchmarks) power improvement over the design of a loop buffer that deals with loops at the fetch stage. Keywords: Trace cache, loop buffer, instruction cache, low power.
1 Introduction

In the design of today's portable devices, beyond more applications, higher-quality audio and video, and a thinner and smaller body, users strongly want devices with longer operating time, lower power consumption, and longer battery life. However, compared with the growth of modern embedded apparatus, battery technology advances very slowly. In the near future, we should give first priority to exploiting low-power components when pursuing power reduction. Reducing the power of embedded processors is becoming increasingly important for mobile applications. Much of the dynamic power of a typical embedded processor is consumed by instruction fetching (for example, 30-50% in Lee et al. [1] and Segars), since instruction fetching happens on almost every cycle, involves switching large numbers of high-capacitance wires, and may involve access to a power-hungry set-associative cache.
Previous research on program behavior found that 46% of program instructions are in loops [4]. Our simulation shows that, on average, a loop has 63.4 instructions for integer benchmarks and 212.1 instructions for floating-point benchmarks. According to Amdahl's Law, we should improve the performance of the large parts of a program if the payoff is rewarding. Therefore, we can conclude that if we can efficiently buffer loops in a high-speed buffer, we can tremendously improve the bandwidth of the instruction stream. As object-oriented languages have become widely used in current programs, the instruction stream characteristics of object-oriented languages can affect the performance of a memory system design. Chu et al. [6] showed that object-oriented programs execute almost seven times more calls (4.6% versus 0.7%) and have smaller function sizes (48.7 versus 152.8) than traditional programs. Jason and Wayne [5] studied the instruction fetch characteristics of media processing. They showed that media applications are highly loop-centric: the media programs spend nearly 95% of their time processing within the two innermost loop levels of their programs. Consequently, their instruction memory working set sizes are very small, typically less than 8KB, and their instruction memory spatial locality is very high. In addition, they found that multimedia is characterized by a fairly regular, predictable control flow. While the frequency of branches is similar to that of general-purpose applications, the static branch misprediction rate is 2-3% lower than for general-purpose applications. They concluded that multimedia applications do indeed have nearly idealistic instruction fetch characteristics. The memory hierarchy has been proven to effectively improve processor performance. In recent years, with the advance of VLSI techniques and the falling cost of memory, the memory capacity in processors has become larger. The whole structure is very complicated and consumes the most energy among all blocks of the processor. Accordingly, many kinds of buffering techniques have been proposed to ease the load on the cache. The trace cache is found to be more complicated than other buffering techniques. We previously proposed an improvement of this structure, showed that it achieves good performance in our simulations, and then went a step further to simplify the trace cache into a buffer used only for handling innermost loops. The aspects our study emphasizes are as follows:
1. To avoid fetching wrong-path instructions: the controller performs innermost loop detection at the commit stage and (re)fills/activates at the fetch stage; in addition, it records the outcomes of the executed forward branches within the innermost loop.
2. To collect more than one loop body into the buffer: in general, only one detected loop body resides in a loop buffer, but we want more flexibility, so that the buffer can store one to four different innermost loops. On encountering a loop, the controller checks whether it already exists in the buffer. If not, the loop buffer controller fills this new loop rather than flushing the loop buffer more frequently.
3. To handle incorrect instruction filling and fetching due to a change of control flow: we adopt a method in which the loop buffer controller simply counts back n entries from the current position and then refills the instructions located on the other control path, which are usually in the instruction cache.
Where n is the number of in-flight instructions between instruction fetch (IF) and execution stage (EXE).
4. Implementability: the trace cache is so complicated that it may lengthen the fetch latency and, furthermore, increase the overall energy and the difficulty of implementation.
Results with SPEC2000 indicate up to 45% (integer benchmarks) and 55% (floating-point benchmarks) reductions in instruction fetch power compared with the design without a loop buffer. Furthermore, we obtain 3% (integer benchmarks) and 2% (floating-point benchmarks) power improvement over the design of a loop buffer that deals with loops at the fetch stage.
2 Related Works

[1] is capable of storing only innermost loops without forward branches. To indicate where a backward branch exists, [1] uses a special branch instruction “sbb”. If an “sbb” is detected and taken, the loop buffer controller starts to fill instructions into the loop buffer. Only if an “sbb” is detected and taken twice successively does the CPU core start to fetch instructions from the loop buffer, until this “sbb” is detected and not taken. To reduce design complexity, [1] uses a counter to generate loop buffer addresses, called the loop buffer program counter (LPC). Consequently, [1] is only capable of storing a loop without any forward branch, so the utilization of the loop buffer and the reduction in instruction energy are limited. Instead of using a special branch instruction “sbb” as in [1], [2] deploys a special register to record the address of the backward branch. Once a backward branch is detected and taken twice successively, the loop buffer controller starts to fill instructions into the loop buffer. After successfully filling, the CPU core begins to fetch instructions from the loop buffer. We also use this method to detect an innermost loop in this paper. Unlike [1], [2] can also store the instructions of an innermost loop that precede a forward branch, because the instruction addresses before a forward branch are sequential. To reduce design complexity, [2] also uses a counter to generate the LPC, so [2] encounters the same limitation as [1]. Tein [3] uses the BTB to assist the loop buffer in storing innermost loops with forward branches. An extra bit is added to the BTB to record the forward branch result. This bit indicates whether the loop buffer stores the fall-through or the target trace of a forward branch. A loop buffer miss occurs when this bit differs from the result of the branch predictor, i.e., the branch predictor has changed its direction prediction for this forward branch. There are two actions to handle a loop buffer miss: (1) Active to Idle state in the normal situation, and (2) Active to Fill state when the branch predictor changes its state from weakly taken/not-taken to strongly taken/not-taken. Unlike [1], this approach does not need a special branch instruction or compiler assistance for innermost loop detection. In the thesis [7], Chen improved trace fetching in the block-based trace cache and proposed a new next-trace prediction hash function achieving a higher trace table hit rate. Instead of detecting a stale block id during execution, as in the block-based design, Chen proposed a scheme that can detect the stale block id at fetch time by adding a next fetch address to each block.
3 Block-Based Loop Buffer

3.1 The Overall Architecture

The architecture of our approach is illustrated in fig. 1. It includes the loop buffer data stores, the loop buffer controller, and a multiplexer. The major differences between the structure proposed in [3] and fig. 1 are that our method (1) detects a loop at the back-end stage, (2) uses multiple-way data storage to preserve already-completed loops while filling a new innermost loop, and (3) includes a block buffer with many lines of blocks, which are grouped into one loop depending on the configuration. Cof/BB is the information about branch updating, such as taken/not-taken or the taken address. The loop buffer controller is responsible for: (1) determining whether the requested instruction should be fetched from the loop buffer or the instruction L1 cache; (2) receiving the information from the branch predictor/BTB to create or update the P-bits used to construct a trace; (3) determining when to fill instructions into the loop buffer; (4) innermost loop detection. According to the control signals from the loop buffer controller, the multiplexer decides whether a requested instruction should come from the loop buffer or the instruction L1 cache. Recording the forward branch outcomes (taken or not taken) of an innermost loop while filling instructions into the loop buffer assists the loop buffer controller in determining whether the requested instruction should be fetched from the loop buffer or the instruction L1 cache.

3.2 The Controller Operation

The operation mechanism of the controller, with its three states IDLE, FILL, and ACTIVE, is similar to [1]. The state diagram of the loop buffer controller's finite state machine is shown in fig. 2. When the CPU core initializes or resets, the loop buffer controller enters IDLE state. During IDLE state, the loop buffer controller continuously detects whether an innermost loop exists in the program, as action A in fig. 2. If and only if an innermost loop has been detected (in other words, the backward branch is not a function call/return or an indirect jump) and this innermost loop has not been stored in the loop buffer, action B is taken. To determine whether an innermost loop exists in the loop buffer, we use a special register for each way, called S_Addr 0-3, to record the starting address of the innermost loop stored in that way. If the starting address of an innermost loop differs from all the S_Addrs, this innermost loop does not exist in the loop buffer. Otherwise, the loop buffer controller enters ACTIVE state, as action F in fig. 2. During FILL state, instructions are sequentially filled into the loop buffer from the first entry, as action C in fig. 2. Meanwhile, the starting address of the loop is stored in S_Addr at the beginning, and the branch prediction result of each forward branch stored in the loop buffer is recorded in the corresponding P-bit. Normally, after successfully filling all instructions of an innermost loop, the loop buffer controller enters IDLE state. If the loop buffer controller also detects the loop just recorded at the same time, it changes to ACTIVE state to fetch the loop from the loop buffer immediately, as action K in fig. 2. The loop buffer controller returns to IDLE state in two situations. First, the loop buffer is full, caused by a BIG loop whose size is larger than the
capacity of the loop buffer, as action D in fig. 2. To cope with a BIG loop, the loop buffer stores only N instructions of the loop, where N is the number of entries in the loop buffer. Second, the loop has been filled completely but is not detected immediately at the fetch stage, as action E in fig. 2. To avoid the ping-pong effect, i.e., a branch outcome that changes between taken and not-taken iteratively, the loop buffer controller only refills instructions if the branch predictor changes its state from weakly taken/not-taken to strongly taken/not-taken, as action J in fig. 2. During ACTIVE state, the CPU core fetches instructions from the loop buffer instead of the instruction L1 cache, as action G in fig. 2. A loop buffer miss occurs when a P-bit differs from the branch predictor's result, i.e., the branch predictor has changed its direction prediction for this forward branch. We use action J to handle the loop buffer miss; in action J, the loop buffer controller returns to IDLE state. When the loop buffer controller encounters a branch misprediction (action H in fig. 2) or the CPU core has already fetched the last instruction of the loop buffer (action I in fig. 2), the loop buffer controller enters IDLE state. The first condition, the occurrence of a branch misprediction in the running loop, indicates that the execution flow has changed and that the instructions located on the other execution flow do not exist in the loop buffer. The second condition is that the loop buffer encounters a BIG loop. Since the loop buffer does not store a whole BIG loop, only part of its instructions can be stored. After the CPU core has fetched the last instruction of the loop buffer, the loop buffer controller must enter IDLE state to let the CPU core fetch the other instructions of the BIG loop from the instruction L1 cache.
Fig. 1. Architecture of trace-based loop buffer
Fig. 2. State diagram of loop buffer controller
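To summarize the state machine in executable form, here is a behavioral Python sketch (our simplification, not the authors' design): it models a single way, folds innermost loop detection at commit into one callback, and omits the P-bits and the weak-to-strong refill rule; the action letters follow fig. 2.

```python
class LoopBufferController:
    IDLE, FILL, ACTIVE = range(3)

    def __init__(self, capacity):
        self.state = self.IDLE
        self.capacity = capacity
        self.buffer = []          # instructions of the stored loop
        self.s_addr = None        # S_Addr: starting address of the loop
        self.last_bwd = None      # last committed taken backward branch

    def on_taken_backward_branch(self, branch_pc, target):
        # called at commit time for each taken backward branch
        if self.state == self.IDLE:
            if target == self.s_addr:
                self.state = self.ACTIVE      # action F: loop already stored
            elif branch_pc == self.last_bwd:  # taken twice successively:
                self.state = self.FILL        # action B: new innermost loop
                self.buffer, self.s_addr = [], target
            self.last_bwd = branch_pc
        elif self.state == self.FILL and target == self.s_addr:
            self.state = self.IDLE            # loop body completely filled

    def on_fill(self, insn):
        self.buffer.append(insn)              # action C
        if len(self.buffer) > self.capacity:  # action D: BIG loop
            self.state = self.IDLE
            self.buffer, self.s_addr = [], None

    def on_mispredict(self):
        if self.state == self.ACTIVE:
            self.state = self.IDLE            # action H
```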
3.3 Detecting a Loop at Backend Stage

When a branch occurs, the target of the branch is compared with the addresses in the base address registers. If there is a match, a signal is sent to the instruction cache to indicate that the request should be canceled. The state machine is placed in the ACTIVE state, and a read command for the loop buffer state elements is issued to read the first block of the hit way. If the branch target does not match any of the stored base addresses, then a new loop is detected. The controller collects noncontiguous basic blocks from the next loop iteration into a single contiguous cache memory location in the loop buffer at
completion time. The controller receives instructions from the commit buffer to construct a trace until one of the termination conditions is met. The trace termination conditions determine when a trace is complete. Here, a trace is terminated under either of the following conditions: meeting the same backward branch again, or the buffer becoming full.

3.4 Multiple-Way Data Storage

The loop buffer data stores are illustrated in fig. 3. Each way includes a direct-mapped cache with no tag compare needed at fetch time, in which each line stores a fixed number of instructions; this enables tight integration with the CPU core and a very low energy cost per access. In ACTIVE state, the loop counter adds the number of instructions of each line until the branch prediction differs from the P-bit, or until the loop counter exceeds the value stored in the number-valid register of the associated way. The loop controller sets the value of the counter to 0 when the end of a loop iteration is met or an existing loop is detected. In FILL state, the number of instructions is updated and a write command is generated for the loop buffer storage elements until the backward branch is met. The number-valid registers are also incremented to record how many valid executed instructions are stored in this way. If the controller detects a new loop while every way already stores a loop, it fills the new one by using the LRU counter to displace the oldest loop in one of the ways.
Fig. 3. Structure of the loop buffer storage
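A minimal sketch of the LRU choice mentioned above: when every way already holds a loop, the way with the smallest (oldest) LRU count is displaced by the newly detected loop. The counter semantics are our assumption.

/* Pick the way to displace; lru_count[w] is assumed to grow on each use. */
static int pick_victim_way(const unsigned lru_count[], int num_ways) {
    int victim = 0;
    for (int w = 1; w < num_ways; w++)
        if (lru_count[w] < lru_count[victim])
            victim = w;                 /* least recently used way */
    return victim;
}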
3.5 The Block-Based Loop Buffer

The block-based loop buffer shown in fig. 4 is a direct-mapped cache with no tag compare needed at fetch time. The most recently used loops are stored in the block buffers, and the loop buffer controller handles replacement. Unlike the block-based trace cache, each line of the block buffer stores not only b instructions along with the fetch address of the first instruction, but also a register "NI" holding the number of instructions in the line, which is calculated while the block is pending; a bit "B" indicating whether a forward branch exists in the block; and a P bit recording the forward branch's outcome. S_Addr is the starting address of a loop. During the fetch cycle, in the IDLE state the PC is matched against each stored S_Addr; on a match, the loop buffer
controller switches to the ACTIVE state and fetches blocks from the loop buffer in the next cycle. A register "REST" stores the number of remaining instructions of a block that have not yet been fetched. In the ACTIVE state, the fetch engine uses the value of REST to decide how many blocks to fetch. We propose a fetching mechanism for the loop buffer that lets the CPU fetch as many instructions per clock as possible; in favorable cases it obtains as many instructions as the hardware configuration allows. Note that the CPU fetches at most one branch per fetch cycle in our approach, to avoid the complexity of handling multiple branch predictions. For example, as shown in fig. 5, a loop with several blocks is stored in the loop buffer. The loop consists of 5 blocks: b1 with 8 instructions, b2 with 5, b3 with 6, b4 with 4, and b5 with 3. Assume that the fetch engine gets up to 8 instructions at a time, the block size is 8, and the branch prediction in each block always agrees with the P bit. If the value of REST is zero, there are three cases to handle. When processing b1 and b2, the CPU finds that b1 contains 8 instructions, so it fetches only this block and REST remains 0. In the next cycle, the CPU considers b2 and b3 and finds that together they contain more than 8 instructions, although each is individually smaller than 8; it therefore fetches 8 instructions and sets REST to (b2 + b3 − 8). The third case occurs when processing b4 and b5: together they contain fewer than 8 instructions, but the CPU fetches only b4 because we do not handle more than one branch per cycle. If REST is larger than 0, two situations must be considered. First, if REST plus the size of the following block is greater than 8, the CPU fetches 8 instructions. Second, if REST plus the following block is less than or equal to 8, the CPU fetches only the remaining instructions of the current block. A software sketch of this per-cycle decision is given after the figure captions below.
Fig. 4. Block-based loop buffer
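To make the line contents of Sect. 3.5 concrete, the following C sketch models one block-buffer line and one way; the field widths and array sizes are illustrative assumptions, not the paper's exact layout.

#include <stdint.h>

typedef struct {
    uint32_t fetch_addr;      /* fetch address of the first instruction     */
    uint32_t insts[8];        /* up to b = 8 instructions per line          */
    uint8_t  ni;              /* NI: number of instructions in this line    */
    unsigned b_flag : 1;      /* B: a forward branch exists in this block   */
    unsigned p_bit  : 1;      /* P: recorded outcome of that forward branch */
} BlockLine;

typedef struct {
    uint32_t  s_addr;         /* S_Addr: starting address of the loop       */
    uint16_t  num_valid;      /* number valid register of this way          */
    BlockLine lines[64];      /* direct-mapped, no tag compare at fetch     */
} LoopBufferWay;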
Fig. 5. An example for blocks in loop buffer
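The per-cycle fetch decision can be replayed in software. The sketch below (ours) reproduces the example of fig. 5 under the stated assumptions (fetch width 8, at most one branch fetched per cycle); it prints 8, 8, 3, 4 and 3 instructions for the five cycles.

#include <stdio.h>

int main(void) {
    int b[] = {8, 5, 6, 4, 3};      /* block sizes b1..b5 from fig. 5     */
    int n = 5, W = 8;               /* W: fetch width per cycle           */
    int rest = 0;                   /* REST register                      */
    int i = 0, cycle = 1;
    while (i < n) {
        int fetched;
        if (rest > 0) {
            if (i + 1 < n && rest + b[i + 1] > W) {
                fetched = W;                 /* rest of b[i] + part of b[i+1] */
                rest = rest + b[i + 1] - W;  /* leftover moves to next block  */
                i++;
            } else {
                fetched = rest;              /* only the rest of the block    */
                rest = 0;
                i++;
            }
        } else if (b[i] == W || i + 1 == n) {
            fetched = b[i]; i++;             /* whole block fills the width   */
        } else if (b[i] + b[i + 1] > W) {
            fetched = W;                     /* b[i] plus part of b[i+1]      */
            rest = b[i] + b[i + 1] - W;
            i++;
        } else {
            fetched = b[i]; i++;             /* at most one branch per cycle  */
        }
        printf("cycle %d: fetched %d instructions\n", cycle++, fetched);
    }
    return 0;
}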
4 Methodology

Our experimental environment consists of a PC with a 1.8 GHz AMD XP 2500+ processor, 1 GB RAM and a 512 KB L2 cache running RedHat 9.0, the SimpleScalar 3.0d simulators, and the SPEC CPU2000 benchmark suite. SimpleScalar is an execution-driven simulation tool with out-of-order issue and execution, and in-order commit. SimpleScalar includes several simulators: sim-fast, sim-safe, sim-cache, sim-profile, and sim-outorder. We modify sim-outorder,
the most detailed simulator, and collect the corresponding statistics for the different architectures. SPEC CPU2000, comprising integer (CINT2K) and floating-point (CFP2K) benchmark programs with corresponding input data sets, is widely used for simulation-based computer architecture research. We run ten integer and nine floating-point benchmark programs on the modified sim-outorder simulator. For each benchmark program, we first skip 250 million instructions to avoid the initial startup behavior and then execute the next 100 million instructions. In this paper, we experimented with different loop buffer sizes: 16, 32, 64, 128, 256, and 512 entries. Other parameters used in SimpleScalar are shown in Table 1.

Table 1. SimpleScalar configuration parameters for simulation
Parameter                  Value
Loop Buffer                16, 32, 64, 128, 256, 512 entries per way
Instruction L1 Cache       512 sets, direct-mapped, 32 bytes per block
Data L1 Cache              128 sets, 4-way, 32 bytes per block
L2 Cache                   1024 sets, 4-way, 64 bytes per block
Instruction Fetch Width    8 instructions per cycle
Branch Predictor           Bimodal, 2048 bytes
5 Simulation Results

5.1 Distribution of Loops

As shown in fig. 6, among the integer benchmarks the twolf benchmark has the maximum average number of instructions per loop (136.3) and the parser benchmark has the minimum (30.2). In fig. 7, the mgrid benchmark has the maximum (926.2 instructions per loop) and the art benchmark has the minimum (16.5). On average, a loop has 63.4 instructions for the integer benchmarks and 212.1 instructions for the floating-point benchmarks. Because our approach is a block-based loop buffer, we must measure loops in fixed-size (8-instruction) blocks instead of basic blocks. We can then evaluate the power consumption of loop buffers of different sizes to find which size saves the most energy. Fig. 8 shows the number of blocks per loop in the CINT2K benchmarks: 10.5 blocks on average. Fig. 9 shows the number of blocks in the CFP2K benchmarks: the maximum is 115.8 for the mgrid benchmark, the minimum is only 2.1, and the average is 27 blocks per loop.
Fig. 6. Number of instructions for a loop in CINT2K
Fig. 7. Number of instructions for a loop in CFP2K
Fig. 8. Number of blocks for a loop in CINT2K
Fig. 9. Number of blocks for a loop in CFP2K
5.2 Performance with Various-Way Loop Buffers

Fig. 10 shows the IPC of the CINT2K programs for the baseline without a loop buffer and for loop buffers of different sizes. The loop buffer does not affect performance much compared to the baseline; instead of degrading, IPC improves as the size of the loop buffer grows. For CFP2K, the results shown in fig. 11 are essentially identical to the baseline, indicating that the loop buffer is not beneficial for instructions that need long execution times. Figs. 12 and 13 show the 2-way loop buffer organization: the IPCs of gzip in fig. 12 improve conspicuously due to the loop buffer, and the IPCs of vortex are better than those in fig. 10. The IPC of galgel with 16 entries in fig. 13 is slightly better than the others; except for galgel, the IPCs are similar to those in fig. 11.
Fig. 10. IPC of CINT2K for 1-way
Fig. 11. IPC of CFP2K for 1-way
Fig. 12. IPC of CINT2K for 2-way
Fig. 13. IPC of CFP2K for 2-way
Fig. 14. IPC of CINT2K for 4-way
Fig. 15. IPC of CFP2K for 4-way
The loop buffer has a 4-way structure in fig. 14 and fig. 15. The IPC of crafty is much better due to the size of the loop buffer, and its IPC result is better than for 1-way and 2-way. For CFP2K, the IPCs in fig. 15 are very similar to those of the baseline. In conclusion, our approach has no negative impact on performance.

5.3 Power Reduction of Various-Way Loop Buffers

First, we discuss SPEC2000int, called CINT2K, which is an integer benchmark set. Fig. 16 shows the results for the 1-way configuration with different loop buffer sizes. The maximum power reduction, for the design with 128 entries, is about 48.2%, which is only 0.2% more than the 48% obtained with 64 entries. There are no significant changes among the results of the benchmark vortex for any size, which means the reduction of fetch power is nearly balanced by the power consumption of the loop buffer itself. Simulation results show that the different sizes decrease power by 41.2%, 43.8%, 44.8%, 45.2%, 42.1%, and 39.8% on average. Fig. 17 shows the simulation results for the 2-way configuration. The decline of the power saving from 128 to 512 entries is slower because two loops can be stored in the loop buffer, so the CPU accesses the loop buffer more frequently. Simulation results show that the different sizes decrease power by 42.6%, 45%, 46.3%, 47%, 45%, and 43.6% on average. In fig. 18, the result of the benchmark twolf is unlike those in figs. 16 and 17: the reduction does not decrease but increases from 128 to 512 entries. It can also be seen that the result for 512 entries does not change significantly, which means the saving of fetch power is nearly offset by the power consumption of the loop buffer. The result of mcf for 512 entries increases because the loop buffer frequently stores 4
Fig. 16. Reductions in instruction fetch power for CINT2K (1-way)
Fig. 17. Reductions in instruction fetch power for CINT2K (2-way)
Fig. 18. Reductions in instruction fetch power for CINT2K (4-way)
Fig. 19. Reductions in instruction fetch power for CFP2K (1-way)
Fig. 20. Reductions in instruction fetch power for CFP2K (2-way)
Fig. 21. Reductions in instruction fetch power for CFP2K (4-way)
sequential loops. Simulation results show that the different sizes decrease power by 42.5%, 45%, 50.1%, 51.6%, 50.4%, and 50.3% on average. In fig. 19, the power reduction of most of the floating-point benchmarks rises as the size of the buffer increases. This is because the benchmark programs contain many big loops, so the power benefit grows with the size of the loop buffer. Simulation results show that the different sizes decrease power by 44.1%, 47%, 51.8%, 57.4%, 61.7%, and 64.6% on average. In figs. 20 and 21, the results for 2-way and 4-way are almost the same as for 1-way. We therefore conclude that loops do not stay in the loop buffer for long: the loop buffer controller usually flushes the oldest stored loop and fills in a new one, so there is not much benefit from the multi-way design. The effects
of the multi-way policies are limited. In fig. 20, simulation results show that the different sizes decrease power by 44.3%, 47.1%, 52%, 57.5%, 61.9%, and 65.1% on average. Power savings, as shown in fig. 21, are 44.3%, 47.2%, 51.6%, 57.6%, 62%, and 65%.

5.4 Simulation Results with Different Loop Buffer Designs

For the CINT2K benchmarks, as shown in fig. 22, we compare different designs: 1way-f [3] (handling loops at the fetch stage), 1way-c (1 way, at the commit stage), 2way-c (2 ways, at the commit stage), and 4way-c (4 ways, at the commit stage). Because the simulation results of the design in [3] are based on the MiBench benchmarks, we reconstruct the loop buffer of [3] and run it on the SPEC2000 benchmarks for comparison with our block-based loop buffer. For each design, 64 or 128 entries yield the maximum power reduction; according to fig. 22, a larger loop buffer may not be beneficial for instruction fetch power. Comparing 1way-f with our 1way-c, 1way-c still improves instruction fetch power, and 2way-c and 4way-c improve it further. The bigger the loop buffer, the greater the power reduction: the power savings of 1way-c, 2way-c, and 4way-c exceed those of 1way-f by 0.4%, 2.5%, and 5.9%, respectively. Fig. 23 shows the results of the floating-point benchmarks for different sizes; the power reductions of our approach exceed those of 1way-f by 1.88%, 2.02%, and 2.07%.
Fig. 22. Reductions in instruction fetch power of different designs (CINT2K)
Fig. 23. Reductions in instruction fetch power of different designs (CFP2K)
6 Conclusions

Since we use a multi-bank memory to store the innermost loops, only the instructions of one way are accessed per fetch while the other ways still consume static power. Static energy consumption is becoming increasingly important. Although this work only addresses the reduction of instruction fetch energy, i.e. dynamic energy, it would be interesting to study the static energy reduction of our design as well, for example by keeping the inactive banks in a sleep state in which no power is consumed. Results with SPEC2000 indicate reductions in instruction fetch power of up to 45% (integer benchmarks) and 55% (floating-point benchmarks) compared with a design without a loop buffer. Furthermore, we obtain 3% (integer benchmarks) and 2% (floating-point benchmarks) more power improvement than the loop buffer design that deals with loops at the fetch stage.
References

[1] Lee, L., Moyer, B., Arends, J.: Low-Cost Embedded Program Loop Caching – Revisited. University of Michigan Technical Report CSE-TR-411-99 (1999)
[2] Anderson, T., Agarwala, S.: Effective Hardware-Based Two-Way Loop Cache for High Performance Low Power Processors. In: International Conference on Computer Design: VLSI in Computers & Processors (2000)
[3] Wu, I.-W., Tein, B.-H., Chung, C.-P.: Instruction Fetch Energy Reduction Using Forward-Branch and Subroutine Bufferable Innermost Loop Buffer. In: International Computer Symposium (2006)
[4] Wu, C.-K., Chiu, J.-C.: Design of Buffering Mechanism for Improving Instruction and Data Stream. Master Thesis, Department of Electrical Engineering, National Sun Yat-Sen University (June 2003)
[5] Fritts, J., Wolf, W.: Instruction Fetch Characteristics of Media Processing. In: SPIE Photonics West, Media Processors 2002, San Jose, CA, pp. 72–83 (January 2002)
[6] Chu, Y., Ito, M.R.: An Efficient Instruction Cache Scheme for Object-Oriented Languages. In: IEEE International Conference on Performance, Computing and Communications, pp. 329–336 (April 2001)
[7] Chen, S.-L., Shieh, J.-J.: Performance Evaluation of a Trace Cache Engine. Master Thesis, Department of Computer Science and Engineering, Tatung University (January 2000)
An Efficient Pipelined Architecture for Fast Competitive Learning

Hui-Ya Li, Chia-Lung Hung, and Wen-Jyi Hwang

Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei, 117, Taiwan
{royalfay,nicky730216}@gmail.com, [email protected]
Abstract. This paper presents a novel pipelined architecture for fast competitive learning (CL). It is used as a hardware accelerator in a system on programmable chip (SOPC) for reducing the computational time. In the architecture, a novel codeword swapping scheme is adopted so that the neuron competition processes for different training vectors can operate concurrently. The neuron updating process is based on a hardware divider with simple table lookup operations. The divider performs finite precision calculation for area cost reduction at the expense of slight degradation in training performance. Experimental results show that the CPU time of the proposed system is lower than that of other hardware or software implementations running the CL training program, with or without the support of custom hardware.
1 Introduction
Competitive learning (CL) [9] is an effective artificial neural network technique [4] for clustering analysis. Neurons in a CL network compete among themselves to be activated. The weight vector associated with each neuron corresponds to the center of its receptive field in the input feature space. The goal of CL is to minimize the error in clustering analysis/pattern classification [7] or the quantization distortion in vector quantization (VQ) [8]. The basic CL algorithm is based on the winner-take-all (WTA) [5] scheme, where adaptation is restricted to the winner, i.e. the single neuron best matching the input pattern. Although the basic CL is simple to implement, the average CPU time for identifying the best matching neuron of an input vector is long when the number of neurons is large and/or the dimension of the weight vectors associated with each neuron is high. Such a long training time may be a limitation for real-time applications. The high computational cost may be circumvented by employing partial distance search (PDS) [6] in the original or transform domains, which removes undesired neurons with fewer multiplications. PDS is a simple modification of the distortion computation: during the calculation of the distance sum, if the partial distance exceeds the distance to the current best matching neuron, the calculation is aborted. A moderate acceleration is achieved [13].
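As an illustration, a plain software rendering of PDS might look as follows; the function name and the flattened codebook layout are our own choices, not from the cited papers.

#include <float.h>

/* Winner search with partial distance search: the squared-distance sum for
   neuron j is aborted as soon as it can no longer beat the current best.  */
int pds_best_match(const float *x, const float *y, int N, int dim) {
    float best = FLT_MAX;
    int best_j = 0;
    for (int j = 0; j < N; j++) {
        float d = 0.0f;
        int k;
        for (k = 0; k < dim; k++) {
            float t = x[k] - y[j * dim + k];   /* y: N codewords, row-major  */
            d += t * t;
            if (d >= best) break;              /* partial distance too large */
        }
        if (k == dim) {                        /* completed, so d < best     */
            best = d;
            best_j = j;
        }
    }
    return best_j;
}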
Other alternatives [11,12] are based on the VLSI realization of search engines involving parallel search with various systolic arrays, which were originally used for the design of VQ encoders. As compared with their software counterparts, the hardware circuits provide higher throughput. However, these architectures do not provide functions for online neuron updating, which is important for CL training. In [10], a hardware CL architecture based on PDS in the wavelet domain is proposed. Although the architecture is able to perform online neuron updating in hardware, it can only process one training vector at a time; it therefore has only limited throughput for fast CL training. The objective of this paper is to present a novel architecture for the fast training of CL with WTA activations. The architecture is based on a field programmable gate array (FPGA) [3], so that it is reconfigurable for various competitive networks. FPGA-based reconfigurable hardware can be programmed almost like software, maintaining the most attractive advantage of flexibility at less cost than traditional application-specific integrated circuit (ASIC) implementations. Moreover, the FPGA hardware implementation can exploit the parallelism of the CL algorithm so that the training time can be lowered. In addition to the employment of FPGAs, a novel pipeline architecture is proposed to attain high throughput. The architecture adopts a novel codeword swapping scheme for online neuron updating so that neuron competitions for different training vectors can operate concurrently. By the employment of the codeword swapping scheme, each training vector can carry its current winning neuron as it traverses the pipeline. Consequently, neurons failing the competition for a training vector are immediately available for competitions for subsequent training vectors. The efficiency of the pipeline can then be effectively enhanced. When a training vector reaches the final stage of the pipeline, a hardware-based neuron updating process is activated. The process involves the computation of the learning rate and of the new codeword for the winning neuron. To accelerate the process, a novel lookup-table based circuit for finite precision division is proposed. It is able to reduce the computational time and lower the area cost at the expense of slight degradation in the training process. The combination of the codeword swapping scheme for neuron competition and the lookup-table based divider for neuron updating effectively expedites the CL training process. The proposed architecture has been adopted as a hardware accelerator for the softcore NIOS II processor [2] running at 50 MHz. Experimental results show that our design is an effective alternative for applications where real-time CL training is desired.
2 The Proposed Architecture
We first briefly review some basic facts of CL with WTA activation. Consider a CL network with N neurons. Let yj, where j = 1, ..., N, be the weight vectors of the network. In the CL algorithm with WTA activation, given a training vector x, the weight vector yj∗ satisfying

    j∗ = arg min_{1 ≤ j ≤ N} D(x, yj)    (1)
Fig. 1. The proposed CL architecture
will be updated, where D(u, v) is the squared distance between u and v, and both vectors have the same dimension 2^n × 2^n. The updated yj∗ is then given as

    yj∗ ← yj∗ + ηj∗ (x − yj∗),    (2)
where ηj∗ is the learning rate of the j∗-th neural unit. Finding the vector yj∗ requires an exhaustive search over N vectors. When N and/or n are large, the computational complexity of the CL algorithm is very high. Figure 1 shows the proposed architecture for fast CL training. The architecture is an (N + 1)-stage pipeline for a CL network with N neurons and can be divided into two units: the winner selection unit and the winner update unit. As shown in the figure, there are N stages in the winner selection unit, and each stage corresponds to one neuron of the CL network. In the unit, a novel codeword swapping scheme is employed for subsequent online neuron updating. The proposed codeword swapping scheme allows each training vector to carry its current winning neuron as it traverses the pipeline. Consequently, neurons failing the competition for a training vector are immediately available for the competitions for subsequent training vectors; the neuron competitions for different training vectors can therefore operate concurrently, and the throughput of the pipeline is enhanced. To further elaborate the codeword swapping scheme, assume the current training vector is located at stage i, and let yi be the codeword associated with stage i. The current minimum distortion between xk and the current best matching codeword yj∗, denoted by Dmin, is available at the input of stage i. When i = 1, we set the initial Dmin = ∞ and the initial yj∗ = null. As the current training vector xk enters stage i, Di, the squared distance between xk and yi, is computed and compared with Dmin. When Di < Dmin, no swapping is necessary; in this case, yi becomes the new yj∗. When Di ≥ Dmin, yi fails the competition and the codeword swapping scheme is activated: it swaps yj∗ and yi. After the swapping process, the new yi becomes yj∗, and the old yi goes to the previous stage. Consequently, when xk completes its operations at stage i, yi is always yj∗ regardless of the result of the competition. Therefore, xk is able to carry yj∗ as it traverses the pipeline. It should be noted that, while xk is at stage i, the next training vector xk+1 should not enter stage i − 1, because the current best matching vector yj∗ for xk stays at stage i − 1 until the completion of the swapping operation.
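Viewed sequentially, the compare-and-swap step at a single stage can be sketched as follows. Here reg[] models the per-stage codeword registers, the current winner travels in reg[i−1] as xk moves from stage i−1 to stage i, and the names and dimension are ours; the actual hardware performs the same decision with comparators and multiplexers rather than loops.

#define DIM 4                                  /* illustrative dimension */
typedef struct { float y[DIM]; } Codeword;

void stage_step(Codeword reg[], int i, const float x[DIM], float *dmin) {
    float d = 0.0f;
    for (int k = 0; k < DIM; k++) {            /* D_i: squared distance  */
        float t = x[k] - reg[i].y[k];
        d += t * t;
    }
    if (d < *dmin) {
        *dmin = d;          /* y_i wins: it is already in reg[i], no swap   */
    } else {
        Codeword tmp = reg[i];   /* y_i loses: swap winner into reg[i]      */
        reg[i] = reg[i - 1];
        reg[i - 1] = tmp;        /* loser drops to stage i-1, reusable      */
    }
    /* invariant: reg[i] now holds the best matching codeword seen so far  */
}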
Fig. 2. The architecture of stage i
In the CL algorithm, the best matching codeword yj∗ for xk should not join the competition for subsequent training vectors until yj∗ is updated. Therefore, allowing the next training vector xk+1 to enter stage i − 1 while xk is at stage i may result in premature competition; that is, the not-yet-updated best matching codeword for xk might be used in the competition for xk+1. Consequently, in the proposed pipeline, successive training vectors do not reside in consecutive stages; they are separated by at least one stage. An (N + 1)-stage pipeline therefore allows N/2 competitions to be performed concurrently. Because some stages in the pipeline may be vacant, an identifier Ii is needed at each stage i to indicate whether there is a training vector in that stage. In our design, each training vector xk is associated with a 1-bit identifier bk = 1. As xk enters stage i, we set Ii = 1 by letting Ii ← bk. When there is no training vector at stage i, Ii is reset to 0. The architecture of stage i (1 < i < N) in the pipeline is shown in Figure 2. It contains a swapping unit, registers, a comparator, multipliers, and multiplexers for competition and codeword swapping. As the current training vector xk enters stage i, Di, the squared distance between xk and yi, is computed and compared with Dmin. The output of the comparator at stage i is denoted by ci. When Ii = 0, we set ci = 0; otherwise, ci depends on the comparison of Di and Dmin: when Di < Dmin, ci = 0, and otherwise ci = 1. Figure 3 shows the architecture of the swapping unit, which contains only one multiplexer and one register. The register holds the codeword associated with stage i, and the multiplexer determines the result of the swapping operation. It has two control lines, ci and ci+1; the truth table for the multiplexer in the swapping unit is shown in Table 1. The operation of the swapping unit is as follows. When xk is at stage i, both stages i − 1 and i + 1 are vacant, i.e. Ii−1 = Ii+1 = 0, and therefore ci−1 = ci+1 = 0. When Di < Dmin, ci = 0. In this case yi wins the competition and no swapping is necessary; the output of the multiplexer is the same as the current output of the register. As a result, both yi and Di
Fig. 3. The architecture of the swapping unit

Table 1. The truth table for the multiplexer in the swapping unit

    ci    ci+1    MUX out
    0     0       yi
    1     0       yi−1
    ×     1       yi+1
become yj∗ and Dmin, respectively. The original best matching vector then stays at stage i − 1, where it becomes available for subsequent training vectors. On the other hand, when Di ≥ Dmin, ci = 1: yi fails the competition and the swapping operation is activated. The multiplexer output at stage i − 1 (resp. i) is actually the current register output at stage i (resp. i − 1). Therefore, after the swapping operation, the best matching codeword is moved to stage i, and the original codeword yi becomes the new yi−1. This allows the codeword to join the competition for subsequent training vectors. The architecture of stage 1 is a simplification of that of stage i (1 < i < N). Since no competition is necessary at stage 1, the comparator is removed; the codeword y1 is always the winner, and D1 is the current Dmin. Figure 4 shows the architecture of stage N, which can be viewed as the final stage of the winner selection unit. Consequently, there is no need to update Dmin. As shown in the figure, there is only one multiplexer in the circuit, which is used for codeword swapping between stages N − 1 and N. In addition, the multiplexer can be used to store the updated codeword obtained from stage N + 1. As in the other stages, there are two control lines for the multiplexer: cN and cN+1. The control line cN is used to activate the codeword swapping process, whereas cN+1 is not used for swapping. When cN+1 = 1, the updated codeword at stage N + 1 is available; in this case, stage N stores the updated codeword for subsequent competitions. Stage N + 1 of the pipeline is the winner update unit. It consists of a control module and an update module, as depicted in Figure 5. The control module generates cN+1 so that the updated codeword can be delivered back to stage N. The winner update unit may be vacant; in this case, cN+1 should not be
Fig. 4. The architecture of stage N
set to 1 because no updated codeword is available. To solve this problem, the bk associated with each training vector xk is used as the input to the control module. Since bk = 1, the control module will be activated as xk enters the winner update unit. The update module involves the computation of the learning rate and of the updated codeword. In our implementation, we set the learning rate as

    ηj∗ = 1 / (4 × rj∗),    (3)

where rj∗ denotes the current number of times the weight vector yj∗ has been selected for updating. The counter in the module is used for computing rj∗. To compute the learning rate, each codeword yi should be associated with its own ri. When yi−1 and yi are to be swapped, ri−1 and ri are swapped as well. For the sake of brevity, the circuit for swapping ri−1 and ri at each stage i is not shown in Figures 2 and 4. After the actual winner has been identified at the final stage, its rj∗ is increased by 1 by the counter. The computation of the learning rate involves a division. In our design, a lookup-table based divider is adopted to reduce the area complexity and accelerate the updating process. Given any integer w > 0, eq. (3) can be rewritten as

    ηj∗ = (2^w / rj∗) × 2^−(w+2).    (4)

In our design, a ROM-based divider is used for storing 2^w / rj∗. There are 2^w entries in the ROM, where the i-th entry contains the finite precision value of 2^w / i. The shift circuit then shifts the output of the ROM left by (w + 2) positions.
Fig. 5. The architecture of stage N + 1
In our implementation, each 2^w/i, i = 1, ..., 2^w, has only finite precision in fixed-point format. Since the maximum value of 2^w/i is 2^w, the integer part of 2^w/i has w bits. Moreover, the fractional part of 2^w/i contains z bits. Each 2^w/i is therefore represented by (w + z) bits. Since there are 2^w entries in the ROM, the ROM size is (w + z) × 2^w bits. Note that the ROM size of the divider circuit can be further reduced. Given w, the 2^w/i values may coincide for different values of i when they are stored with finite precision; this is especially true for large i. Such i values can share a single entry, reducing the storage size of the ROM. This can be accomplished by adding a simple encoder at the address port of the ROM: all the i values sharing the same entry are mapped to the same address by the encoder, and a significant number of entries can be eliminated. For example, when w = 12 and the fractional part of 2^w/i is truncated (i.e., z = 0), only 127 entries are necessary in the ROM instead of 2^12 entries. The proposed architecture is used as a custom user logic in an SOPC system consisting of a softcore NIOS CPU, a DMA controller and SDRAM, as depicted in Figure 6. The set of training vectors is stored in the SDRAM. The training vectors are then delivered to the proposed circuit one at a time by the DMA controller for CL training. When the delivery of the training vectors is completed, the softcore CPU retrieves the resulting codewords from the proposed architecture. This completes the process of CL training.
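The entry-sharing argument above is easy to check numerically. The following host-side C sketch (ours, not the hardware) builds the truncated 2^w/i quotients for w = 12 and counts the distinct values, and also evaluates the finite-precision learning rate of eq. (4) for one value of rj∗.

#include <stdio.h>

int main(void) {
    const int w = 12;                    /* ROM is indexed by r = 1 .. 2^w  */
    const int N = 1 << w;

    /* Count distinct truncated values of 2^w / i (z = 0): consecutive i
       sharing the same quotient can share a single ROM entry.             */
    int distinct = 0, prev = -1;
    for (int i = 1; i <= N; i++) {
        int q = N / i;                   /* finite-precision 2^w / i        */
        if (q != prev) distinct++;       /* quotient changed: new entry     */
        prev = q;
    }
    printf("distinct ROM entries: %d\n", distinct);   /* prints 127        */

    /* Learning rate of eq. (4) for a neuron updated r times:
       eta = (2^w / r) * 2^-(w+2); compare with the exact 1/(4r).          */
    int r = 5;
    double eta = (double)(N / r) / (double)(1 << (w + 2));
    printf("eta(r=%d) = %.6f, exact 1/(4r) = %.6f\n", r, eta, 1.0 / (4 * r));
    return 0;
}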
Fig. 6. The SOPC architecture
3 Experimental Results
Following are some numerical results of the proposed CL architecture. The target FPGA device for the hardware design is the Altera Cyclone III EP3C120 [1]. The vector dimension of the codewords is 2 × 2. The proposed architecture is adopted as a hardware accelerator of a NIOS II softcore processor. The area cost of the entire NIOS-based SOPC system with N = 128 is shown in Table 2. Three different types of area cost are considered in this experiment: the number of logic elements (LEs), the number of embedded memory bits, and the number of DSP blocks. Although the NIOS processor itself also consumes hardware resources, the entire SOPC uses 50153 LEs, which is 42% of the LEs of the target FPGA device. Our proposed table lookup-based divider is able to simplify the neuron updating process at the expense of slight performance degradation. As shown in Table 3, compared with the usual floating-point divider based on long division, the increase in average distortion using our divider is small for all the training images "House", "Lena" and "Tree". In this experiment, the number of weight vectors is N = 64.

Table 2. The area cost of the entire NIOS-based SOPC system with N = 128
                CL circuit              Entire SOPC system
LEs             40449/119088 (34%)      50153/119088 (42%)
Memory bits     0/3981312 (0%)          615664/3981312 (15%)
DSP blocks      520/576 (90%)           524/576 (91%)
Table 3. The influence of long division and lookup-table based division on PSNR values

Image name      Long division   Lookup-table based division
House           26.3675         26.3656
Lena            24.2825         24.2814
Tree            21.555          21.5478
Table 4. The CPU time of the proposed hardware architecture, its software counterpart and the hardware architecture proposed in [10] for different N

N       Software        Architecture in [10]    Proposed hardware architecture
4       72.5246 ms      30.176139 ms            7.525947 ms
8       136.434 ms      45.9108 ms              7.526027 ms
16      254.086 ms      77.389762 ms            7.526187 ms
32      489.116 ms      140.35643 ms            7.526507 ms
64      956.051 ms      266.267944 ms           7.527147 ms
128     1919.19 ms      518.116394 ms           7.528427 ms
Table 4 shows the CPU time of the proposed hardware architecture, its software counterpart, and the hardware architecture proposed in [10] for different numbers of codewords N. The CPU time is the execution time of the processor for CL over the entire training set. The software implementation is executed on a 2.8-GHz Pentium IV CPU with 1.5-Gbyte DDRII SDRAM. The architecture presented in [10] is also adopted as an accelerator for a NIOS II softcore processor, and the corresponding SOPC system is implemented on the same FPGA device. All implementations share the same training set of 65536 training vectors obtained from the 512 × 512 training image "Lena." We can see from Table 4 that the CPU time of the proposed architecture is lower than that of the other implementations. Because of the pipelined implementation, the CPU time of the proposed architecture is almost independent of N, whereas for the other implementations the CPU time may grow linearly with N. Therefore, as N becomes large, the proposed architecture attains significantly higher speed for CL training.
4 Concluding Remarks
The proposed architecture has been found to be effective for fast CL training. Owing to the novel codeword swapping scheme of the pipelined implementation, the CPU time of the architecture is almost independent of the number of neurons N; the architecture can therefore attain a high speedup over other hardware or software implementations for large N. The finite precision hardware divider for neuron updating reduces the area cost at the expense of a negligible degradation in CL training. The proposed architecture is therefore an effective alternative for attaining high performance with low area cost and low computational time.
References

1. Altera Corporation: Cyclone III Device Handbook (2008), http://www.altera.com/literature/lit-cyc3.jsp
2. Altera Corporation: NIOS II Processor Reference Handbook, ver. 9.1 (2009), http://www.altera.com/literature/lit-nio.jsp
3. Hauck, S., DeHon, A.: Reconfigurable Computing. Morgan Kaufmann, San Francisco (2008)
4. Haykin, S.: Neural Networks and Learning Machines, 3rd edn. Pearson, London (2009)
5. Hertz, J., Krogh, A., Palmer, R.G.: Introduction to the Theory of Neural Computation. Addison-Wesley, New York (1991)
6. Hwang, W.J., Lin, F.J., Zeng, Y.C.: Fast Design Algorithm for Competitive Learning. Electronics Letters 33, 1469–1470 (1997)
7. Hwang, W.J., Ye, B.Y., Lin, C.T.: A Novel Competitive Learning Algorithm for the Parametric Classification with Gaussian Distributions. Pattern Recognition Letters 21, 375–380 (2000)
8. Heskes, T.: Self-Organizing Maps, Vector Quantization, and Mixture Modeling. IEEE Trans. Neural Networks 12, 1299–1305 (2001)
9. Kohonen, T.: Self-Organizing Maps, 3rd extended edn. Springer, Heidelberg (2001)
10. Li, H.Y., Hwang, W.J., Yang, C.T.: High Speed k-Winner-Take-All Competitive Learning in Reconfigurable Hardware. In: Chien, B.-C., Hong, T.-P., Chen, S.-M., Ali, M. (eds.) IEA/AIE 2009. LNCS (LNAI), vol. 5579, pp. 594–603. Springer, Heidelberg (2009)
11. Park, H., Prasanna, V.K.: Modular VLSI Architectures for Real-Time Full-Search-Based Vector Quantization. IEEE Trans. Circuits Syst. Video Technol. 3, 309–317 (1993)
12. Wang, C.L., Chen, L.M.: A New VLSI Architecture for Full-Search Vector Quantization. IEEE Trans. Circuits Syst. Video Technol. 6, 389–398 (1996)
13. Yeh, Y.J., Li, H.Y., Hwang, W.J., Fung, C.Y.: FPGA Implementation of kNN Classifier Based on Wavelet Transform and Partial Distance Search. In: Ersbøll, B.K., Pedersen, K.S. (eds.) SCIA 2007. LNCS, vol. 4522, pp. 512–521. Springer, Heidelberg (2007)
Merging Data Records on EREW PRAM

Hazem M. Bahig

Computer Science Division, Department of Mathematics, Faculty of Science, Ain Shams University, Cairo, Egypt
[email protected]
Abstract. Given two sorted arrays A = (a1, a2, ..., an) and B = (b1, b2, ..., bn) of records such that (1) the records are sorted according to one field, called the key, and (2) the values of the keys are serial numbers. Merging data records has many applications in computer science, especially in databases. We develop an algorithm that runs in O(log n) time on the EREW PRAM to merge two sorted arrays of records using n/log n processors, even when the keys of the data records are repeated. The algorithm is cost-optimal, deterministic, stable, and uses a linear amount of space.

Keywords: Parallel algorithms, merging records, integer merging, EREW PRAM.
1 Introduction
The design of complex algorithms relies heavily on a set of fundamental problems. One of these problems is the merging of two sorted arrays. The merging problem has many applications in computer science, such as sorting [17], database design [26] and graphs [6]. For example, in a database the data are represented as records. Each record consists of many fields; the primary field of the record is called the key. In general, the values of the keys are represented as serial numbers. Merging such sorted records has important applications in databases [20,21,22,25], where an array of data records is sorted if the records are sorted according to the key field. The merging problem has been solved in both sequential and parallel computation. A linear time sequential algorithm for the merging problem has been proposed [17]. Many research efforts have studied the merging problem on different parallel models, shared and nonshared memory [3,7,8,10,11,12,13,14,16,18,24,27].

Parallel Model: In this paper, we assume the Parallel Random Access Machine (PRAM). The PRAM has been the model of choice for analyzing the parallel complexity of problems and describing parallel algorithms. A PRAM consists of p synchronous processors and a global shared memory accessible in unit time from each of the processors. The only means of inter-processor communication is through the shared memory. Different conventions exist regarding concurrent access to the memory [2]. Exclusive Read Exclusive Write (EREW) PRAM: for
each memory location, it may only be read from or written to by one processor in each cycle. Concurrent Read Exclusive Write (CREW) PRAM: each memory location may be read by several processors, but written to by only a single processor in each cycle. Concurrent Read Concurrent Write (CRCW) PRAM: each memory location may be read from or written to by several processors at the same time.

Previous works: Hagerup and Rub [14] showed that merging two sorted arrays, each of length n, requires O(n/p) time using p (≤ n/log n) processors on the EREW PRAM. Kruskal [18] showed that merging can be performed faster on the CREW PRAM, in O(log log n) time and O(n) work. If the elements of the two sorted arrays are taken from the restricted domain of integers [1, m], then the problem is called integer merging, where m is a function of n [7]. Integer merging algorithms can be used in databases, since the values of the keys are represented as serial integer numbers or can be mapped to an integer range. Berkman and Vishkin [7] proposed two algorithms on the CREW PRAM for integer merging. The first algorithm uses n/log log log m CREW PRAM processors and runs in O(log log log m) time; the algorithm is optimal and uses O(nmk) space, where m = nk. The second algorithm uses n/α(n) CREW PRAM processors and runs in O(α(n)) time, where α(n) is the inverse of Ackermann's function and n = n1 + n2; the algorithm is optimal and uses linear space for input elements drawn from the domain of integers [1, n]. Hagerup and Kutylowski [13] proposed a rank-based merging algorithm for two sorted sequences of length n on the EREW PRAM. The algorithm takes O(log log n + log min{n, m}) time, O(n) operations, and O(n) space. Two integer merging algorithms on the EREW PRAM were proposed by Bahig [4,5]; they take O(n/p) time when the elements are distinct.

Contribution: In this paper, we present a logarithmic time parallel algorithm for merging two sorted arrays of records on the EREW PRAM. The records are sorted according to the key, and the values of the keys are taken from a linear range in n. The running time of the algorithm matches the previous results, but our algorithm is based on a different technique. The idea of the technique is to determine the correct position of each distinct element of both arrays, A and B, in the sorted array C without comparing the elements of A with those of B; we only compare the elements of each array with themselves.

Structure of paper: The structure of the paper is as follows. We introduce the preliminary definitions of the problem and related subproblems in Section 2. In Section 3, we design an optimal logarithmic algorithm for merging two sorted arrays of integers on the EREW PRAM. In Section 4, we extend the algorithm to work when the elements of the two arrays are records and the values of the keys of the records are serial numbers. Finally, the conclusion of our work is given in Section 5.
2 Preliminaries
In this section we give the definitions of our problem and related subroutines that are used in designing our algorithm.
Definition 1 [2]: Given two sorted arrays A = (a1, a2, ..., an1) and B = (b1, b2, ..., bn2). The merge of A and B is a sorted array C = (c1, c2, ..., cn1+n2) such that: (1) every ci ∈ C belongs to A or B, ∀ 1 ≤ i ≤ n1 + n2; and (2) ai and bj appear exactly once in C, ∀ 1 ≤ i ≤ n1 and ∀ 1 ≤ j ≤ n2.

Definition 2 [2]: Given an array A = (a1, a2, ..., an) and an associated binary operator ⊕. The prefix sums problem is to compute the n prefix operations si = a1 ⊕ a2 ⊕ ... ⊕ ai, ∀ 1 ≤ i ≤ n.

Definition 3 [2]: Given an array A = (a1, a2, ..., an). Certain positions of A are referred to as leaders. Each leader holds a datum while all other positions of A are empty. Interval broadcasting copies the datum of each leader into all positions of the array following the leader, up to, but not including, the next leader (if it exists).

Definition 4 [2]: A parallel algorithm using time Tp(n) and p processors to solve a given problem Q is said to be optimal if Tp(n) × p = O(T∗(n)), where T∗(n) is the running time of the fastest sequential algorithm for the problem Q.

Proposition 1 [2]: For p ≤ n/log n, the prefix sums of n elements in an array can be computed in O(n/p) time using p EREW PRAM processors.

Proposition 2 [2]: For p ≤ n/log n, interval broadcasting can be performed in O(n/p) time using p EREW PRAM processors.
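It may help to see the two primitives in their plain sequential form before they are used below. The following C sketch (ours) gives reference implementations of prefix sums (with ⊕ = +) and interval broadcasting, where 0 marks an empty, non-leader position.

#include <stdio.h>

void prefix_sums(int *a, int n) {            /* s_i = a_1 + ... + a_i   */
    for (int i = 1; i < n; i++) a[i] += a[i - 1];
}

void interval_broadcast(int *a, int n) {     /* copy each leader forward */
    for (int i = 1; i < n; i++)
        if (a[i] == 0) a[i] = a[i - 1];
}

int main(void) {
    int s[] = {1, 2, 2, 9, 0, 1};
    prefix_sums(s, 6);                       /* -> 1 3 5 14 14 15       */
    int c[] = {7, 0, 0, 4, 0, 9};
    interval_broadcast(c, 6);                /* -> 7 7 7 4 4 9          */
    for (int i = 0; i < 6; i++) printf("%d ", s[i]);
    printf("\n");
    for (int i = 0; i < 6; i++) printf("%d ", c[i]);
    printf("\n");
    return 0;
}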
3 Merging Integer Numbers
In this section, we present a cost-optimal algorithm for merging two sorted arrays of integral elements on the EREW PRAM, since the values of the keys are serial numbers. Then, in the next section, we extend this work to merging two sorted arrays of records such that the values of the keys of the records are serial numbers. Suppose that we have two sorted arrays A = (a1, a2, ..., an1) and B = (b1, b2, ..., bn2) of equal length n1 = n2 = n. The elements of both arrays are taken from a linear range in n. We also assume that a0 = b0 = 0. The main idea behind our work is to determine the correct position of each distinct element of both arrays, A and B, in the sorted array C without comparing the elements of A with those of B; we only compare the elements of each array with themselves. The correct position of an element k in C is determined by computing the number of elements (with their repetitions) that are less than k in both arrays, A and B. If the element k has r repetitions, then we use the interval broadcasting technique to distribute it in the sorted array C. Our cost-optimal algorithm on the EREW PRAM consists of six steps.

Step 1: Determine the first and the last indices of each element in the array A using p processors. The first and the last indices of the elements of A are represented by the two arrays AFirst and ALast, respectively. The length of each
array is linear in n. The initial values of the elements of AFirst and ALast are 0 and −1, respectively. In this step, each processor pi does the following:

    for j = (i − 1)(n/p) + 1 to i(n/p) do
        if aj−1 ≠ aj then
            AFirst[aj] = j
            if i ≠ 1 or j ≠ 1 then ALast[aj−1] = j − 1

Also, the last processor does: ALast[an] = n.
Step 2: Apply Step 1 to B and construct the two arrays BFirst and BLast.

Step 3: Compute the total number of repetitions of each integer x in A and B, where x ∈ [1, n]. The total numbers of repetitions are represented by the array R, of length linear in n. The initial value of each element of the array R is zero. In this step, each processor pi does the following:

    for j = (i − 1)(n/p) + 1 to i(n/p) do
        R[j] = ALast[j] − AFirst[j] + BLast[j] − BFirst[j] + 2

Step 4: Compute the prefix sums, S, of the array R using p processors. The p processors do the following:

    for i = 1 to p do in parallel
        S[(i − 1)(n/p) + 1] = R[(i − 1)(n/p) + 1]
        for j = (i − 1)(n/p) + 2 to i(n/p) do
            S[j] = S[j − 1] + R[j]
    for h = 1 to log p do
        for i = 2^(h−1) + 1 to p do in parallel
            S[i(n/p)] = S[i(n/p)] + S[(i − 2^(h−1))(n/p)]
    for i = 2 to p do in parallel
        for j = 1 to n/p − 1 do
            S[(i − 1)(n/p) + j] = S[(i − 1)(n/p) + j] + S[(i − 1)(n/p)]

Step 5: Allocate the distinct elements of both arrays A and B in the sorted array C. The array C consists of 2n elements and the initial value of each element is zero. In this step, each processor pi does the following:

    for j = (i − 1)(n/p) + 1 to i(n/p) do
        if AFirst[j] ≠ 0 or BFirst[j] ≠ 0 then C[S[j − 1] + 1] = j

Step 6: Allocate the repetitions of each element in the sorted array C by applying the interval broadcasting algorithm to C. In this step, each processor pi does the following:
    for i = 1 to p do in parallel
        for j = 2(i − 1)(n/p) + 1 to 2i(n/p) do
            if C[j] = 0 then C[j] = C[j − 1]
    for h = 1 to log p do
        for i = 2^(h−1) + 1 to p do in parallel
            if C[2i(n/p)] = 0 then C[2i(n/p)] = C[2(i − 2^(h−1))(n/p)]
    for i = 2 to p do in parallel
        for j = 2(i − 1)(n/p) + 1 to 2i(n/p) − 1 do
            if C[j] = 0 then C[j] = C[j − 1]

The correctness of the algorithm depends on Step 5. The first position of the integer j in the sorted array C is equal to the total number of integers that are less than j, plus one. The total number of integers less than j is equal to S[j − 1]; therefore C[S[j − 1] + 1] = j, and hence the algorithm is correct. Also, the algorithm does not require any concurrent read or concurrent write operation; therefore, it works on the EREW PRAM. The running time of the algorithm is O(log n) using n/log n EREW processors, since in each step of the algorithm either:

1. each processor sequentially scans a subarray of length at most log n, executing a constant number of operations per iteration; or
2. the p processors apply the binary-tree strategy to an array of at most n elements.

The algorithm is cost-optimal since the product of the running time and the number of processors is O(n).

Example 1: Given two sorted arrays:

A = (1, 2, 3, 4, 4, 4, 4, 4, 4, 9, 10, 12, 12, 12, 14, 16)
B = (2, 3, 4, 4, 4, 6, 7, 10, 10, 10, 10, 11, 11, 13, 15, 16)

Step 1:
AFirst = (1, 2, 3, 4, 0, 0, 0, 0, 10, 11, 0, 12, 0, 15, 0, 16)
ALast = (1, 2, 3, 9, −1, −1, −1, −1, 10, 11, −1, 14, −1, 15, −1, 16)

Step 2:
BFirst = (0, 1, 2, 3, 0, 6, 7, 0, 0, 8, 12, 0, 14, 0, 15, 16)
BLast = (−1, 1, 2, 5, −1, 6, 7, −1, −1, 11, 13, −1, 14, −1, 15, 16)

Step 3:
R = (1, 2, 2, 9, 0, 1, 1, 0, 1, 5, 2, 3, 1, 1, 1, 2)

Step 4:
S = (0, 1, 3, 5, 14, 14, 15, 16, 16, 17, 22, 24, 27, 28, 29, 30, 32)
Step 5:
C = (1, 2, 0, 3, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 6, 7, 9, 10, 0, 0, 0, 0, 11, 0, 12, 0, 0, 13, 14, 15, 16, 0)

Step 6:
C = (1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 7, 9, 10, 10, 10, 10, 10, 11, 11, 12, 12, 12, 13, 14, 15, 16, 16)
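The six steps are designed for p processors, but their combined effect is easy to validate with a single-threaded run. The following C sketch (ours) executes Steps 1-6 sequentially on the data of Example 1; it is a functional reference, not the parallel EREW schedule.

#include <stdio.h>
#include <string.h>

#define N 16

int main(void) {
    int A[N+1] = {0,1,2,3,4,4,4,4,4,4,9,10,12,12,12,14,16};
    int B[N+1] = {0,2,3,4,4,4,6,7,10,10,10,10,11,11,13,15,16};
    int AF[N+1], AL[N+1], BF[N+1], BL[N+1], R[N+1], S[N+1], C[2*N+1];

    for (int k = 0; k <= N; k++) { AF[k] = BF[k] = 0; AL[k] = BL[k] = -1; }
    for (int j = 1; j <= N; j++) {                 /* Steps 1-2 */
        if (A[j-1] != A[j]) { AF[A[j]] = j; if (j != 1) AL[A[j-1]] = j-1; }
        if (B[j-1] != B[j]) { BF[B[j]] = j; if (j != 1) BL[B[j-1]] = j-1; }
    }
    AL[A[N]] = N;  BL[B[N]] = N;
    S[0] = 0;
    for (int j = 1; j <= N; j++) {                 /* Steps 3-4 */
        R[j] = AL[j] - AF[j] + BL[j] - BF[j] + 2;
        S[j] = S[j-1] + R[j];
    }
    memset(C, 0, sizeof C);
    for (int j = 1; j <= N; j++)                   /* Step 5 */
        if (AF[j] != 0 || BF[j] != 0) C[S[j-1] + 1] = j;
    for (int j = 2; j <= 2*N; j++)                 /* Step 6 */
        if (C[j] == 0) C[j] = C[j-1];
    for (int j = 1; j <= 2*N; j++) printf("%d ", C[j]);
    printf("\n");
    return 0;
}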
4 Merging Data Records
In this section, we study the merging problem when each element of the two arrays is a record. Each record is of the form (key, d), where d is a datum and key is the primary field of the record. The values of the keys are serial numbers; this means that they can be represented as integer numbers in a linear range in n. Therefore, the merging data records problem can be represented as merging two sorted arrays A = (a1, a2, ..., an) and B = (b1, b2, ..., bn) of records such that 1 ≤ ai.key, bj.key ≤ n, for 1 ≤ i, j ≤ n. If there exist two records having the same key, then the algorithm of the previous section fails to merge the two sorted arrays of records: it cannot distinguish between equal keys, since it uses interval broadcasting (Step 6) when elements are repeated. The following example illustrates how the algorithm fails to merge two sorted arrays of records.

Example 2: Given two sorted arrays of records, where each record is represented as (key, data):

A = ((1, d1), (2, d2), (3, d3), (4, d4), (4, d5), (4, d6), (4, d7), (4, d8), (4, d9), (9, d10), (10, d11), (12, d12), (12, d13), (12, d14), (14, d15), (16, d16))
B = ((2, d1), (3, d2), (4, d3), (4, d4), (4, d5), (6, d6), (7, d7), (10, d8), (10, d9), (10, d10), (10, d11), (11, d12), (11, d13), (13, d14), (15, d15), (16, d16))

Steps 1-4 are similar to Example 1.

Step 5:
C = ((1, d1), (2, d2), (0, 0), (3, d3), (0, 0), (4, d4), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (6, d6), (7, d7), (9, d10), (10, d11), (0, 0), (0, 0), (0, 0), (0, 0), (11, d12), (0, 0), (12, d12), (0, 0), (0, 0), (13, d14), (14, d15), (15, d15), (16, d16), (0, 0))

Step 6:
C = ((1, d1), (2, d2), (2, d2), (3, d3), (3, d3), (4, d4), (4, d4), (4, d4), (4, d4), (4, d4), (4, d4), (4, d4), (4, d4), (4, d4), (6, d6), (7, d7), (9, d10), (10, d11), (10, d11), (10, d11), (10, d11), (10, d11), (11, d12), (11, d12), (12, d12), (12, d12), (12, d12), (13, d14), (14, d15), (15, d15), (16, d16), (16, d16))

It is clear that the seventh element of C is not correct: it must be (4, d5). Also, for example, the last element of the array C must be (16, d16). So, we need to
modify the algorithm to merge two sorted arrays of records even when keys are equal. The main idea of the modification is to compute the position of the first occurrence of each repeated key, and then to compute the positions of the repeated keys by the prefix sums method instead of the interval broadcasting method. Our cost-optimal algorithm on the EREW PRAM consists of the following steps.

Steps 1-4: Similar to Steps 1-4 in Section 3, with aj replaced by aj.key.

Step 5: Compute the positions of the non-repeated records of A in the sorted array C. Two or more records are repeated if they have the same key. The position of a non-repeated record (keyi, di) in the array C is equal to the total number of records having keys less than keyi, plus one, i.e. S[keyi − 1] + 1. The positions of the repeated and non-repeated records of A in C are represented by the array APos, of length linear in n; i.e., APos[j] represents the position of aj in C, 1 ≤ j ≤ n. The initial value of APos for repeated records is one. In this step, each processor pi does the following:

    for j = (i − 1)(n/p) + 1 to i(n/p) do
        if aj−1.key ≠ aj.key then APos[j] = S[aj.key − 1] + 1
        else APos[j] = 1

Step 6: Compute the positions of the repeated records of A in the sorted array C by applying the prefix-sums strategy to APos. In this step, the p processors do the following:

    for i = 1 to p do in parallel
        for j = (i − 1)(n/p) + 2 to i(n/p) do
            if aj−1.key = aj.key then APos[j] = APos[j] + APos[j − 1]
    for h = 1 to log p do
        for i = 2^(h−1) + 1 to p do in parallel
            if a_(i(n/p)).key = a_((i−2^(h−1))(n/p)).key then
                APos[i(n/p)] = APos[i(n/p)] + APos[(i − 2^(h−1))(n/p)]
    for i = 2 to p do in parallel
        for j = (i − 1)(n/p) + 1 to i(n/p) − 1 do
            if a_((i−1)(n/p)).key = aj.key then APos[j] = APos[j] + APos[(i − 1)(n/p)]

Step 7: Compute the positions of the non-repeated records of B in the sorted array C. The position of a non-repeated record (keyj, dj) in the array C is equal to the total number of records, in both arrays, having keys less than keyj, plus the total number of records in A having keys equal to keyj, plus one, i.e. S[keyj − 1] + ALast[keyj] − AFirst[keyj] + 2. The positions of the repeated and non-repeated records of B in C are represented by the array BPos, of length linear in
n; i.e., BPos[j] represents the position of bj in C, 1 ≤ j ≤ n. In this step, each processor pi does the following:

    for j = (i − 1)(n/p) + 1 to i(n/p) do
        if bj−1.key ≠ bj.key then
            BPos[j] = S[bj.key − 1] + ALast[bj.key] − AFirst[bj.key] + 2
        else BPos[j] = 1

Step 8: Compute the positions of the repeated records of B in the sorted array C, as in Step 6.

Step 9: Allocate the elements of the two arrays, A and B, in the sorted array C using the two arrays APos and BPos. In this step, each processor pi does the following:

    for j = (i − 1)(n/p) + 1 to i(n/p) do
        C[APos[j]] = aj
        C[BPos[j]] = bj

It is clear that no step requires a concurrent read or write, and the number of processors required by each step is not greater than n/log n. Therefore, the algorithm merges the two sorted arrays of records using n/log n EREW processors. The running time of the algorithm is O(log n), as in the previous section. A sequential sketch of Steps 5-9 is given after Example 3.

Example 3: For simplicity, we remove all components other than the keys from the records. Given the keys of the two sorted arrays:

A = (1, 2, 3, 4, 4, 4, 4, 4, 4, 9, 10, 12, 12, 12, 14, 16)
B = (2, 3, 4, 4, 4, 6, 7, 10, 10, 10, 10, 11, 11, 13, 15, 16)

Step 1:
AFirst = (1, 2, 3, 4, 0, 0, 0, 0, 10, 11, 0, 12, 0, 15, 0, 16)
ALast = (1, 2, 3, 9, −1, −1, −1, −1, 10, 11, −1, 14, −1, 15, −1, 16)

Step 2:
BFirst = (0, 1, 2, 3, 0, 6, 7, 0, 0, 8, 12, 0, 14, 0, 15, 16)
BLast = (−1, 1, 2, 5, −1, 6, 7, −1, −1, 11, 13, −1, 14, −1, 15, 16)

Step 3:
R = (1, 2, 2, 9, 0, 1, 1, 0, 1, 5, 2, 3, 1, 1, 1, 2)

Step 4:
S = (0, 1, 3, 5, 14, 14, 15, 16, 16, 17, 22, 24, 27, 28, 29, 30, 32)

Step 5:
APos = (1, 2, 4, 6, 1, 1, 1, 1, 1, 17, 18, 25, 1, 1, 29, 31)
Step 6:
APos = (1, 2, 4, 6, 7, 8, 9, 10, 11, 17, 18, 25, 26, 27, 29, 31)

Step 7:
BPos = (3, 5, 12, 1, 1, 15, 16, 19, 1, 1, 1, 23, 1, 28, 30, 32)

Step 8:
BPos = (3, 5, 12, 13, 14, 15, 16, 19, 20, 21, 22, 23, 24, 28, 30, 32)

Step 9:
C = (1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 7, 9, 10, 10, 10, 10, 10, 11, 11, 12, 12, 12, 13, 14, 15, 16, 16)
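As with the integer algorithm, the record version can be validated sequentially. The following C sketch (ours) collapses Steps 5-8 into left-to-right scans over the keys, using the arrays S, AFirst and ALast of Section 3; the prefix-sum structure of Steps 6 and 8 degenerates into a running "+1" for repeated keys.

/* A, B hold keys (1-indexed, A[0] = B[0] = 0); S, AF (= AFirst) and
   AL (= ALast) are as computed in Section 3; C receives the merge.
   Since A[0] = 0 and keys are >= 1, the first iteration always takes
   the "first occurrence" branch, so APos/BPos are initialized safely. */
void place_records(const int *A, const int *B, const int *S,
                   const int *AF, const int *AL, int n, int *C) {
    int APos = 0, BPos = 0;
    for (int j = 1; j <= n; j++) {                   /* Steps 5-6 for A  */
        if (A[j-1] != A[j]) APos = S[A[j] - 1] + 1;  /* first occurrence */
        else                APos = APos + 1;         /* repeated key     */
        C[APos] = A[j];                              /* half of Step 9   */
    }
    for (int j = 1; j <= n; j++) {                   /* Steps 7-8 for B  */
        if (B[j-1] != B[j])                  /* records of A with the    */
            BPos = S[B[j] - 1]               /* same key come first      */
                   + AL[B[j]] - AF[B[j]] + 2;
        else
            BPos = BPos + 1;
        C[BPos] = B[j];                              /* other half of 9  */
    }
}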
5 Conclusion
We have proposed a logarithmic time algorithm on the EREW PRAM to merge two sorted arrays of integers drawn from a linear range in n. We extended the algorithm to work when the elements of the arrays are records, where each record is represented by a key. Neither algorithm depends on comparing the elements of A with the elements of B. Moreover, the algorithms are deterministic, optimal and stable.
References

1. Akl, S.: Parallel Sorting Algorithms. Academic Press Inc., London (1985)
2. Akl, S.: Parallel Computation: Models and Methods. Prentice Hall, Upper Saddle River (1997)
3. Akl, S., Santoro, N.: Optimal Parallel Merging and Sorting Without Memory Conflicts. IEEE Transactions on Computers 36, 1367–1369 (1987)
4. Bahig, H.: Parallel Merging with Restriction. The Journal of Supercomputing 43(1), 99–104 (2008)
5. Bahig, H., Bahig, H.: Merging on PRAM. International Journal of Computers and Applications 30(1), 51–55 (2008); Special Issue on High Performance Computing Architectures
6. Bang-Jensen, J., Huang, J., Ibarra, L.: Recognizing and Representing Proper Interval Graphs in Parallel Using Merging and Sorting. Discrete Applied Mathematics 155(4), 442–456 (2007)
7. Berkman, O., Vishkin, U.: On Parallel Integer Merging. Information and Computation 106, 266–285 (1993)
8. Borodin, A., Hopcroft, J.: Routing, Merging, and Sorting on Parallel Models of Computation. Journal of Computer and System Science 30(1), 130–145 (1985)
9. Cormen, T., Leiserson, C., Rivest, R.: Introduction to Algorithms. MIT Press, Cambridge (1990)
10. Deo, N., Jain, A., Medidi, M.: An Optimal Parallel Algorithm for Merging Using Multiselection. Information Processing Letters 50(2), 81–88 (1994)
11. Deo, N., Sarkar, D.: Parallel Algorithms for Merging and Sorting. Information Sciences 51, 121–131 (1990)
12. Gerbessiotis, A., Siniolakis, C.: Merging on the BSP Model. Parallel Computing 27(6), 809–822 (2001)
13. Hagerup, T., Kutylowski, M.: Fast Integer Merging on the EREW PRAM. Algorithmica 17, 55–66 (1997)
14. Hagerup, T., Rub, C.: Optimal Merging and Sorting on the EREW PRAM. Information Processing Letters 33(4), 181–185 (1989)
15. Karp, R., Ramachandran, V.: Parallel Algorithms for Shared-Memory Machines. In: Handbook of Theoretical Computer Science, vol. A, pp. 870–941 (1990)
16. Katsinis, C.: Merging, Sorting and Matrix Operations on the SOME-Bus Multiprocessor Architecture. Future Generation Computer Systems 20(4), 643–661 (2004)
17. Knuth, D.: The Art of Computer Programming: Sorting and Searching. Addison-Wesley, Reading (1973)
18. Kruskal, C.: Searching, Merging, and Sorting in Parallel Computation. IEEE Transactions on Computers 32(10), 942–946 (1983)
19. Kruskal, C., Rudolph, L., Snir, M.: A Complexity Theory of Efficient Parallel Algorithms. In: Handbook of Theoretical Computer Science, pp. 95–132 (1990)
20. Liu, G., Chen, H.: Parallel Merging of Lists in Database Management System. Information Systems 13(4), 423–428 (1988)
21. Mach, W., Schikuta, E.: Parallel Database Sort and Join Operations Revisited on Grids. In: Perrott, R., Chapman, B.M., Subhlok, J., de Mello, R.F., Yang, L.T. (eds.) HPCC 2007. LNCS, vol. 4782, pp. 216–227. Springer, Heidelberg (2007)
22. Matias, Y., Segal, E., Vitter, J.S.: Efficient Bundle Sorting. SIAM J. Computing 36(2), 394–410 (2006)
23. Merrett, T.: Relational Information Systems. Reston, Va. (1984)
24. Shiloach, Y., Vishkin, U.: Finding the Maximum, Merging, and Sorting in Parallel Computational Models. Journal of Algorithms 2, 88–102 (1981)
25. Taniar, D., Rahayu, W.: Parallel Double Sort-Merge Algorithm for Object-Oriented Collection Join Queries. In: Proceedings of the International Conference on High Performance Computing, HPC ASIA'97. IEEE Computer Society Press, Seoul (1997)
26. Valduriez, P., Gardarin, G.: Join and Semijoin Algorithms for Multiprocessor Database Machines. ACM Transactions on Database Systems 9, 133–161 (1984)
27. Varman, P., Iyer, B., Haderle, B., Dunn, S.: Parallel Merging: Algorithm and Implementation Results. Parallel Computing 15, 165–177 (1990)
Performance Modeling of Multishift QR Algorithms for the Parallel Solution of Symmetric Tridiagonal Eigenvalue Problems

Takafumi Miyata(1), Yusaku Yamamoto(2), and Shao-Liang Zhang(1)

(1) Nagoya University, Nagoya, Aichi, 464-8603, Japan
{miyata,zhang}@na.cse.nagoya-u.ac.jp
(2) Kobe University, Kobe, Hyogo, 657-8501, Japan
[email protected]
Abstract. Multishift QR algorithms are efficient for solving the symmetric tridiagonal eigenvalue problem on a parallel computer. In this paper, we focus on three variants of the multishift QR algorithm, namely, the conventional multishift QR algorithm, the deferred shift QR algorithm and the fully pipelined multishift QR algorithm, and construct performance models for them. Our models are designed for shared-memory parallel machines and, given the basic performance characteristics of the target machine and the problem size, predict the execution time of these algorithms. Experimental results show that our models can predict the relative performance of these algorithms to within 10% accuracy in many cases. Thus our models are useful for choosing the best algorithm to solve a given problem in a specified computational environment, as well as for finding the best values of the performance parameters.
1 Introduction
Symmetric eigenvalue problems play an important role in many areas of science and engineering, such as structural analysis, statistics and electronic structure calculation. The standard procedure for this problem consists of the following three steps: (a) transforming the input matrix to tridiagonal form, (b) solving the symmetric tridiagonal eigenvalue problem, and (c) recovering the eigenvectors of the original matrix by back-transformation [1][2]. There are several algorithms for the second step, including the tridiagonal QR algorithm, the divide-and-conquer algorithm, and the combination of bisection and inverse iteration [1][2]. Among them, the QR algorithm is widely used due to its stability. To deal with large tridiagonal eigenvalue problems, parallel versions of the tridiagonal QR algorithm have been developed [3][4][5][6]. The most promising ones among them are the multishift QR algorithms [4][5]. In contrast to the conventional single-shift tridiagonal QR algorithm, these algorithms use multiple shifts simultaneously. Thus, by allocating the computations associated with each shift to one processor, one can parallelize the algorithms. There are several variants of the multishift QR algorithms, such as the conventional multishift QR
algorithm [4][5], the deferred shift QR algorithm [7] and the fully pipelined multishift QR algorithm [8]. They differ in the way and the timing of computing the shifts; accordingly, their convergence properties and parallelism are different. It is therefore desirable to be able to choose the best variant given the computational platform and the input matrix. In this paper, we present performance models for the above three variants of the multishift QR algorithm. The performance model for the conventional multishift QR algorithm was proposed by the authors in [8]. In the present paper, we extend this model to deal with the deferred shift QR algorithm and the fully pipelined multishift QR algorithm. Our models are designed for shared-memory parallel machines and, given the basic characteristics of the target machine and the problem size, predict the execution time for one iteration of the multishift QR algorithms. Using these models, one can compare the performance of the algorithms for a given platform and problem. One can also use these models to choose the best performance parameters for the algorithms. This paper is structured as follows: in Section 2, we briefly explain the three variants of the multishift QR algorithm. Section 3 presents our performance models for these algorithms. Experimental results are given in Section 4. Section 5 reviews related work, and Section 6 gives some concluding remarks.
2 The Multishift QR Algorithms for the Symmetric Tridiagonal Eigenvalue Problem

2.1 The Single-Shift QR Algorithm
As a preparation for constructing the performance models, we briefly explain the three variants of the multishift QR algorithm. First, we show the single-shift QR algorithm for computing the eigenvalues of a tridiagonal matrix T as Algorithm 1. At the i-th step of this algorithm, we compute the QR decomposition of the shifted tridiagonal matrix Ti − si I, where I is the identity matrix and si is the shift used at the i-th step. We then multiply the Q and R factors in reverse order and shift the result again to obtain Ti+1. It is easy to see that Ti+1 is orthogonally similar to Ti. Then the next shift si+1 is computed from Ti+1 [1][8]. By repeating this process, the subdiagonal elements of Ti become smaller and smaller. When one of the elements is small enough to be neglected, the eigenvalue problem can be decomposed into two subproblems; this is called deflation. In the actual algorithm, the QR decomposition of Ti − si I and the formation of Ti+1 are done using an operation called bulge chasing: we introduce a small 3×3 bulge at the top-left corner of Ti and chase it along the diagonal by repeating orthogonal transformations until it disappears from the bottom-right corner. Thus the main operation of the QR algorithm is bulge chasing. Note that bulge chasing is a local operation that involves only three rows and columns at a given moment. Since bulge chasing is an inherently sequential operation, it is difficult to parallelize the single-shift QR algorithm in its original form.
[Algorithm 1: Single-shift QR]
T1 = T
Computation of Shift(2, T1)
s1 ← the shift which is closer to the (n, n) element of T1
for i = 1, 2, ... do
  Ti − si I → Qi Ri
  Ti+1 ← Ri Qi + si I = Qi^T Ti Qi
  Computation of Shift(2, Ti+1)
  si+1 ← the shift which is closer to the (n, n) element of Ti+1 (update of the shift)
end for
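To make the core transformation of Algorithm 1 concrete, the following C sketch performs one explicit single-shift QR step on a dense symmetric tridiagonal matrix via Givens rotations. This is our illustration only, not the authors' implementation: the paper's algorithms instead perform the same step by bulge chasing and never form Q or R explicitly.

#include <math.h>

/* One explicit single-shift QR step on an n-by-n symmetric tridiagonal
 * matrix T (dense row-major storage, n >= 2): T - sI = QR, then
 * T <- RQ + sI. Illustrative sketch; a real code would use bulge
 * chasing on band storage instead. */
void qr_step(int n, double *T, double s)
{
    double c[n - 1], w[n - 1];                 /* Givens cosines/sines */
    int i, j, k;

    for (i = 0; i < n; i++) T[i * n + i] -= s; /* subtract the shift   */

    /* QR factorization: zero the subdiagonal with rotations G_k acting
     * on rows k and k+1 (only columns k..n-1 can change). */
    for (k = 0; k < n - 1; k++) {
        double a = T[k * n + k], b = T[(k + 1) * n + k];
        double r = hypot(a, b);
        c[k] = (r == 0.0) ? 1.0 : a / r;
        w[k] = (r == 0.0) ? 0.0 : b / r;
        for (j = k; j < n; j++) {
            double t1 = T[k * n + j], t2 = T[(k + 1) * n + j];
            T[k * n + j]       =  c[k] * t1 + w[k] * t2;
            T[(k + 1) * n + j] = -w[k] * t1 + c[k] * t2;
        }
    }
    /* Form RQ by applying the transposed rotations on the right
     * (columns k and k+1); since R is upper triangular, only rows
     * 0..k+1 are affected. */
    for (k = 0; k < n - 1; k++) {
        for (i = 0; i <= k + 1; i++) {
            double t1 = T[i * n + k], t2 = T[i * n + k + 1];
            T[i * n + k]     =  c[k] * t1 + w[k] * t2;
            T[i * n + k + 1] = -w[k] * t1 + c[k] * t2;
        }
    }
    for (i = 0; i < n; i++) T[i * n + i] += s; /* add the shift back   */
}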
2.2 The Conventional Multishift QR Algorithm

[Algorithm 2: M-QR]
T1,1 = T
Computation of Shift(m, T1,1)
for j = 1, 2, ..., m do
  s1,j ← the j-th smallest shift
end for
for i = 1, 2, ... do
  for j = 1, 2, ..., m do
    Ti,j − si,j I → Qi,j Ri,j
    Ti,j+1 ← Ri,j Qi,j + si,j I
  end for
  Computation of Shift(m, Ti,m+1)
  for j = 1, 2, ..., m do
    si+1,j ← the j-th smallest shift (update of m shifts)
  end for
  Ti+1,1 = Ti,m+1
end for
The multishift QR (M-QR) algorithm [4][5] was proposed to parallelize the single-shift QR algorithm. It is shown as Algorithm 2. At the ith step of this algorithm, we use m shifts si,1 , . . . , si,m to perform m steps of the QR algorithm. Once they have been finished, the next m shifts are computed as the eigenvalues of the trailing m × m submatrix of the resulting matrix, Ti,m+1 . In this algorithm, the loops over i and j are still sequential. However, since bulge chasing is a local operation, once the first bulge goes sufficiently far from the top-left corner, we can start chasing the second bulge. Thus the loop over j can be carried out in a pipelined fashion. To avoid collision of the bulges, we partition the tridiagonal matrix into M (≥ m) row blocks and start chasing the second bulge when the first bulge enters the second block. Hence, by allocating one bulge to a processor, we can parallelize the M-QR algorithm using m processors. The execution of the conventional M-QR algorithm with m = 2 (two processors) and M = 4 is illustrated in Fig. 1. In the left figure, the light gray area denotes bulge chasing within a row block, while the white area denotes the idle time. The dark gray area shows the time for computing the shifts. As can be seen from the figure, the second processor must stay idle until the first bulge passes through the first block. Also, the first processor must stay idle until the second processor finishes bulge chasing and Ti,m+1 is formed. We can reduce this idle time by increasing M and making the size of each block as small as possible. On the other hand, inter-processor synchronization is needed when the bulge crosses the boundary of the blocks. Thus the cost of synchronization increases with M . The optimal value of M is determined from the trade-off between the
processor idle time and the synchronization cost and is a function of the machine characteristics and the problem size.
Fig. 1. Computational procedure of the M-QR algorithm (m = 2, M = 4)
2.3 The Deferred Shift QR Algorithm
In the conventional M-QR algorithm, processor idle time occurs because the shifts for the (i+1)-th step can be calculated only after Ti,m+1 has been computed. On the other hand, if we modify the algorithm to compute these shifts from Ti−1,m+1, the processors need not wait for the completion of Ti,m+1. This is called the deferred shift QR (D-QR) algorithm [7], shown as Algorithm 3. Since the D-QR algorithm eliminates the idle time due to waiting for other processors' bulge chasing, we do not have to keep the block size as small as possible. Thus the optimal value of M is m, in which case the cost of inter-processor synchronization is smallest. The drawback of the D-QR algorithm is that its order of convergence is lower than that of the conventional M-QR algorithm [7], because the D-QR algorithm computes the shifts from an older matrix, that is, from Ti−1,m+1 instead of Ti,m+1. However, when the number of processors is large, the advantage of reducing the idle time can more than compensate for the deterioration of convergence. In fact, in the computational environment used in [8], it is reported that the D-QR algorithm consistently outperforms the conventional M-QR algorithm when the number of processors is 32. The relative performance of these algorithms varies depending on the computational environment and the problem size.

2.4 The Fully Pipelined Multishift QR Algorithm
The fully pipelined multishift QR (FPM-QR) algorithm was developed in an effort to improve the convergence of the D-QR algorithm while retaining its parallel efficiency [8]. The algorithm is shown as Algorithm 4. In contrast to the M-QR and D-QR algorithms, this algorithm computes the shifts at each iteration of the j loop. This makes it possible to use shifts that are as new as possible without disturbing the pipeline. The price to pay is that the cost of shift computation increases by a factor of m. However, when the matrix size is sufficiently large, we can hide the time of shift computation by adjusting the block size judiciously [8]. Thus, this algorithm is the best choice if the improvement of convergence is considerable and the matrix size is sufficiently large.
[Algorithm 3: D-QR]
T1,1 = T
Computation of Shift(m, T1,1)
for j = 1, 2, ..., m do
  s0,j = s1,j ← the j-th smallest shift
end for
for i = 1, 2, ... do
  for j = 1, 2, ..., m do
    Ti,j − si−1,j I → Qi,j Ri,j
    Ti,j+1 ← Ri,j Qi,j + si−1,j I
  end for
  Computation of Shift(m, Ti,m+1)
  for j = 1, 2, ..., m do
    si+1,j ← the j-th smallest shift (update of m shifts)
  end for
  Ti+1,1 = Ti,m+1
end for
[Algorithm 4: FPM-QR]
T1,1 = T
Computation of Shift(m, T1,1)
for j = 1, 2, ..., m do
  s1,j ← the j-th smallest shift
end for
for i = 1, 2, ... do
  for j = 1, 2, ..., m do
    Ti,j − si,j I → Qi,j Ri,j
    Ti,j+1 ← Ri,j Qi,j + si,j I
    Computation of Shift(m, Ti,j+1)
    si+1,j ← the j-th smallest shift (update of only one shift)
  end for
  Ti+1,1 = Ti,m+1
end for

3 Performance Modeling
In this section, we present performance models for the three variants of the multishift QR algorithm. We first review the performance model of the conventional M-QR algorithm given in the appendix of [8]. Based on this model, we construct performance models for the other two variants, namely, the D-QR algorithm and the FPM-QR algorithm. In subsection 3.2, we use these models to compare the parallel performance of the three variants theoretically.
3.1 Performance Modeling
We model the execution time of the three kinds of multishift QR algorithms as

$$T_{model} = T_{bulge} + T_{idle} + T_{shift} + T_{sync}, \qquad (1)$$

where Tbulge is the total time of bulge-chasing, Tidle is the total time of processor idling, Tshift is the total time of shift computation, and Tsync is the total time of processor synchronization [8]. The input parameters of this model are as follows:

– n: the order of the matrix,
– m: the number of shifts,
– iavg,m: the average iteration number to get one eigenvalue,
– tbulge: the time of moving a bulge down by one row,
– tshift,m: the time of computing m shifts,
– tsync: the time for one synchronization.
Among these, iavg,m depends on the input matrix, while tbulge, tshift,m and tsync depend on the computational environment. These quantities are determined experimentally. The average iteration number to get one eigenvalue,
iavg,m, is weighted by the length of bulge-chasing. Let $L_{i,j}$ denote the length of bulge-chasing in the i-th multishift QR step with the j-th shift. Also, let $N_{step}$ denote the number of multishift QR steps needed to compute all the eigenvalues. Then the total length of bulge-chasing is $\sum_{i=1}^{N_{step}} \sum_{j=1}^{m} L_{i,j}$. On the other hand, if the single-shift QR algorithm is used and each eigenvalue is computed in only one iteration, the total length of bulge-chasing is $N_L = (n-1) + (n-2) + \cdots + 1 = n(n-1)/2$. Using these quantities, iavg,m is defined as

$$i_{avg,m} = \frac{1}{N_L} \sum_{i=1}^{N_{step}} \sum_{j=1}^{m} L_{i,j}. \qquad (2)$$
To get all eigenvalues, about $i_{avg,m}(n-1)$ (single-shift) QR steps are needed. One multishift QR step corresponds to m single-shift QR steps, so the total number of multishift QR steps, $N_{step}$, is approximately

$$N_{step} = \frac{i_{avg,m}(n-1)}{m}. \qquad (3)$$
Modeling the execution time of the M-QR algorithm. To construct the performance model for the M-QR algorithm, we focus on processor m and derive the expressions for Tbulge, Tidle, Tshift and Tsync [8].

– The total time of bulge-chasing: As can be seen from Fig. 1, the time of bulge-chasing in the i-th multishift QR step is $L_{i,m}\, t_{bulge} \simeq \left(\sum_{j=1}^{m} L_{i,j}\, t_{bulge}\right)/m$. Hence,

$$T_{bulge} = \frac{1}{m} \sum_{i=1}^{N_{step}} \sum_{j=1}^{m} L_{i,j}\, t_{bulge} = \frac{n}{2}\, N_{step}\, t_{bulge}. \qquad (4)$$
– The total time of processor idling: In one multishift QR step of the M-QR algorithm with division number M, the processor idle time is (m − 1)/M times the bulge-chasing time. Thus

$$T_{idle} = \frac{m-1}{M}\, T_{bulge}. \qquad (5)$$
By increasing M, Tidle decreases, but the cost of processor synchronization increases. Thus there is a trade-off between the processor idle time and the synchronization cost; the optimal division number Mopt is derived later.

– The total time of shift computation: The M-QR algorithm updates the shifts at the end of each multishift QR step, so the total time of shift computation can be written as

$$T_{shift} = N_{step}\, t_{shift,m}. \qquad (6)$$
– The total time of processor synchronization: In each multishift QR step, synchronization is necessary whenever a bulge passes through the boundary of the matrix regions. So there are M + m − 1 synchronization points, and two synchronizations (one for bulge arrival and another for bulge departure) are needed at each synchronization point. Hence,

$$T_{sync} = 2N_{step}(M + m - 1)\, t_{sync}. \qquad (7)$$
The total execution time can be written as the sum of the above four terms:

$$T_{M-QR} = \frac{i_{M-QR}(n-1)}{m}\left\{\frac{n}{2}\left(1+\frac{m-1}{M}\right)t_{bulge} + t_{shift,m} + 2(M+m-1)\,t_{sync}\right\}, \qquad (8)$$

where $i_{M-QR}$ is the weighted average iteration number of the M-QR algorithm. The optimal division number Mopt for the M-QR algorithm, which minimizes the execution time, is obtained from $\partial T_{M-QR}/\partial M = 0$. The result is

$$M_{opt} = \sqrt{\frac{n(m-1)\,t_{bulge}}{4\,t_{sync}}}. \qquad (9)$$

The execution time of the M-QR algorithm is given by Eq. (8) with M = Mopt.

Modeling the execution time of the D-QR algorithm. We can use the same framework as explained above for the M-QR algorithm to construct performance models for the other two variants. In the case of the D-QR algorithm, Tbulge and Tshift are the same as those of the M-QR algorithm and are given by Eqs. (4) and (6), respectively. Tidle and Tsync can be modeled as follows:

– The total time of processor idling: Unlike the M-QR algorithm, processor idle time does not exist in the D-QR algorithm except in the first multishift QR step, when the bulges are introduced one by one. Thus the processor idle time is

$$T_{idle} = 0. \qquad (10)$$
– The total time of processor synchronization: There are m processor synchronization points in one D-QR step and two synchronizations are needed at each point. Hence,

$$T_{sync} = 2N_{step}\, m\, t_{sync}. \qquad (11)$$

By summing these four terms, we can write the execution time of the D-QR algorithm as

$$T_{D-QR} = \frac{i_{D-QR}(n-1)}{m}\left\{\frac{n}{2}\,t_{bulge} + t_{shift,m} + 2m\,t_{sync}\right\}, \qquad (12)$$

where $i_{D-QR}$ is the weighted average iteration number of the D-QR algorithm.
Modeling the execution time of the FPM-QR algorithm. In the FPM-QR algorithm, the frequency of shift computation in one multishift QR step is m times that of the other algorithms. However, as explained in subsection 2.4, when the matrix order is sufficiently large, all but one of the shift computations can be overlapped with bulge-chasing; so, when modeling the execution time, we need to take only one of them into account. As a result, the modeled execution time of the FPM-QR algorithm equals that of the D-QR algorithm:

$$T_{FPM-QR} = \frac{i_{FPM-QR}(n-1)}{m}\left\{\frac{n}{2}\,t_{bulge} + t_{shift,m} + 2m\,t_{sync}\right\}, \qquad (13)$$

where $i_{FPM-QR}$ is the weighted average iteration number.
3.2 Analysis of the Performance Gain of the FPM-QR Algorithm
We define the performance gains of the FPM-QR algorithm as

$$f_{M-QR} = \frac{T_{M-QR}}{T_{FPM-QR}}, \qquad f_{D-QR} = \frac{T_{D-QR}}{T_{FPM-QR}}, \qquad (14)$$
where TM−QR, TD−QR, and TFPM−QR are the modeled execution times of the M-QR, D-QR, and FPM-QR algorithms, respectively. These quantities show how many times faster the FPM-QR algorithm is than the other two algorithms.

Comparing the FPM-QR algorithm with the M-QR algorithm. From Eq. (8) with M = Mopt (see Eq. (9)) and Eq. (13), fM−QR is given as

$$f_{M-QR} = \frac{i_{M-QR}}{i_{FPM-QR}}\left(\frac{\frac{n}{2}\left(1+\frac{m-1}{M_{opt}}\right)t_{bulge} + t_{shift,m} + 2(M_{opt}+m-1)\,t_{sync}}{\frac{n}{2}\,t_{bulge} + t_{shift,m} + 2m\,t_{sync}}\right). \qquad (15)$$

We plot the performance gain fM−QR × (iFPM−QR/iM−QR) in Fig. 2. Here, tbulge, tshift,m, and tsync are the values observed in the computational environment to be described in Section 4.

– Properties of the performance gain: As can be seen from Fig. 2, fM−QR increases with the number of processors. This is due to the difference in processor utilization: in the M-QR algorithm there is processor idle time, and it increases with the number of processors, whereas in the FPM-QR algorithm all processors are always kept busy. As a result, the performance difference increases as more processors are used. On the other hand, the effect of the idle time and the synchronization time on the total execution time decreases with the matrix size, so the performance gain decreases as the matrix order increases.

– Predicted performance: If the average iteration number of the FPM-QR algorithm is the same as that of the M-QR algorithm, the graph can be interpreted as the speedup obtained by the FPM-QR algorithm over the M-QR algorithm. In this case, the graph shows that the FPM-QR algorithm with m = 32 for n = 200000 attains about 1.5 times speedup over the M-QR algorithm.
Fig. 2. Performance gain fM−QR × (iFPM−QR/iM−QR) vs. the number of processors, for n = 50,000 and n = 200,000

Table 1. The values of machine-dependent parameters for each m

 m    tbulge     tshift,m    tsync
 4    8.4E-08    2.1E-06     1.9E-06
 8    8.4E-08    7.9E-06     3.7E-06
16    8.4E-08    2.9E-05     4.9E-06
32    8.4E-08    1.1E-04     8.4E-06
Comparing the FPM-QR algorithm with the D-QR algorithm. From Eqs. (12) and (13), the performance gain fD−QR is derived as follows:

$$f_{D-QR} = \frac{i_{D-QR}}{i_{FPM-QR}}. \qquad (16)$$
Eq. (16) shows that the speedup of the FPM-QR algorithm over the D-QR algorithm equals the ratio of the weighted average iteration numbers. This means that the only factor causing the performance difference between the FPM-QR and D-QR algorithms is their convergence properties; the computational environment has nothing to do with it. Since the FPM-QR algorithm is expected to show better convergence than the D-QR algorithm due to its use of newer shifts, it is expected to be faster. These predictions suggested by the performance models will be verified by experiments on a shared-memory parallel machine in the next section.
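The models are simple enough to evaluate directly in a few lines of code. The following C sketch (ours, not part of the paper; the iteration numbers are placeholders, since they depend on the input matrix) evaluates Eqs. (8), (9), (12), (13) and (15) with the m = 32 parameter values from Table 1:

#include <math.h>
#include <stdio.h>

/* Execution-time models of Section 3. n: matrix order, m: number of
 * shifts, i_avg: weighted average iteration number of the algorithm. */
static double t_mqr(double n, double m, double i_avg,
                    double t_bulge, double t_shift, double t_sync)
{
    double M_opt = sqrt(n * (m - 1.0) * t_bulge / (4.0 * t_sync)); /* Eq. (9) */
    return i_avg * (n - 1.0) / m *
           (0.5 * n * (1.0 + (m - 1.0) / M_opt) * t_bulge +
            t_shift + 2.0 * (M_opt + m - 1.0) * t_sync);           /* Eq. (8) */
}

static double t_dqr_fpmqr(double n, double m, double i_avg,
                          double t_bulge, double t_shift, double t_sync)
{
    /* Eqs. (12) and (13) share the same form; only i_avg differs. */
    return i_avg * (n - 1.0) / m *
           (0.5 * n * t_bulge + t_shift + 2.0 * m * t_sync);
}

int main(void)
{
    double t_bulge = 8.4e-8, t_shift = 1.1e-4, t_sync = 8.4e-6; /* Table 1, m = 32 */
    double n = 200000.0, m = 32.0;
    double i_mqr = 1.0, i_fpm = 1.0;  /* matrix-dependent; placeholders */

    double f_mqr = t_mqr(n, m, i_mqr, t_bulge, t_shift, t_sync) /
                   t_dqr_fpmqr(n, m, i_fpm, t_bulge, t_shift, t_sync); /* Eq. (15) */
    printf("predicted gain f_M-QR = %.2f\n", f_mqr);
    return 0;
}

With equal iteration numbers this reproduces the roughly 1.5× gain that Fig. 2 shows for m = 32 and n = 200000.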
4 Experimental Results
We evaluate the accuracy of our performance models explained in Section 3. To this end, we compare the performance gains fM−QR and fD−QR predicted by our models with the actual performance gains fˆM−QR and fˆD−QR. Our computational environment is a Fujitsu PrimePower HPC2500 (CPU: SPARC64 V, 8 Gflops × 32 processors; memory: 512 GB). The programs for the M-QR, D-QR and FPM-QR algorithms were written in C and OpenMP and compiled by the Fujitsu C Compiler with options -Kfast GP2=2 -KOMP. In this case, the values of the machine-dependent parameters tbulge, tshift,m and tsync, which our models need to predict the execution time, are as shown in Table 1. As test matrices, we used the following three matrices defined in [8] (a generator for the first type is sketched after this list):

– Type 1: A tridiagonal matrix whose diagonal elements are all equal to a and whose subdiagonal elements are all equal to b. The exact eigenvalues are λi = a + 2b cos(iπ/(n + 1)) (i = 1, ..., n). We set a = 2 and b = −1.
– Type 2: A random symmetric matrix whose elements are in (−0.5, 0.5).
– Type 5: A tridiagonal matrix with eigenvalues λi = sinh(10i/n) (i = 1, ..., n).
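As referenced above, a minimal C sketch (our code, built from the formula given in the list) for generating the Type 1 matrix and its exact spectrum is:

#include <math.h>

/* Type 1 test matrix: diagonal entries all a, subdiagonal entries all b.
 * d: diagonal (length n), e: subdiagonal (length n-1),
 * lambda: exact eigenvalues lambda_i = a + 2b cos(i*pi/(n+1)).
 * The experiments use a = 2, b = -1. */
void type1_matrix(int n, double a, double b,
                  double *d, double *e, double *lambda)
{
    const double pi = 3.14159265358979323846;
    for (int i = 0; i < n; i++) {
        d[i] = a;
        if (i < n - 1) e[i] = b;
        lambda[i] = a + 2.0 * b * cos((i + 1) * pi / (n + 1));
    }
}

Knowing the exact eigenvalues makes it easy to check the accuracy of the computed spectrum alongside the timing measurements.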
Table 2. Actual performance gain fˆM−QR, predicted performance gain fM−QR, and relative error Rerr = |fM−QR − fˆM−QR| / fˆM−QR

n = 50000
            Type 2                    Type 5                    Type 1
 m   fˆM−QR  fM−QR  Rerr      fˆM−QR  fM−QR  Rerr      fˆM−QR  fM−QR  Rerr
 4    1.03   1.00   0.03       1.10   1.10   0.00       1.16   1.08   0.06
 8    1.11   1.12   0.01       1.21   1.23   0.01       1.42   1.31   0.08
16    1.18   1.26   0.06       1.35   1.37   0.01       1.73   1.58   0.09
32    1.21   1.49   0.24       1.35   1.64   0.22       1.92   2.17   0.13

n = 200000
            Type 2                    Type 5                    Type 1
 m   fˆM−QR  fM−QR  Rerr      fˆM−QR  fM−QR  Rerr      fˆM−QR  fM−QR  Rerr
 4    0.97   0.94   0.03       1.04   1.04   0.00       1.10   1.02   0.08
 8    0.99   1.00   0.01       1.11   1.12   0.01       1.28   1.16   0.09
16    1.06   1.10   0.04       1.21   1.21   0.00       1.49   1.31   0.12
32    1.17   1.29   0.10       1.33   1.40   0.05       1.76   1.69   0.04
We show the results for fM−QR for the matrices of Types 2, 5 and 1 in Table 2, and the results for fD−QR in Table 3. When we use many processors for small matrices of Types 2 and 5 (m = 32, n = 50000 in Tables 2 and 3), the actual performance gains are lower than the predicted ones. This is because the hiding of shift computation does not work completely (see subsection 2.4). Another possible reason is that the speed of data transfer between the memory and the processors limits the performance when the number of processors is large; we need to investigate this possibility further. For the Type 1 matrix, the actual performance gains are higher than predicted. For the other cases, the predicted performance gains agree with the actual ones within a maximum error of 10%. Thus we have confirmed the following tendencies predicted by the models:

– The FPM-QR algorithm shows higher performance than the M-QR algorithm as more processors are used.
– If the FPM-QR algorithm shows better convergence than the D-QR algorithm, then the former shows higher performance than the latter (only the convergence property affects the performance difference).

In the above, we explained the experimental results focusing on the accuracy of our performance prediction models. For more experimental data on various test matrices, including the actual execution times and speedups over the single-shift QR algorithm, the reader is referred to [8]. On the computational platform used here, the FPM-QR algorithm almost always outperforms the M-QR and D-QR algorithms when the number of processors is large. However, the latter algorithms may be the better choice when the number of processors is small. In addition, the M-QR algorithm has an advantage over the other two algorithms when
Table 3. Actual performance gain fˆD−QR, predicted performance gain fD−QR, and relative error Rerr = |fD−QR − fˆD−QR| / fˆD−QR

n = 50000
            Type 2                    Type 5                    Type 1
 m   fˆD−QR  fD−QR  Rerr      fˆD−QR  fD−QR  Rerr      fˆD−QR  fD−QR  Rerr
 4    1.10   1.11   0.00       1.15   1.15   0.00       1.24   1.08   0.12
 8    1.08   1.09   0.01       1.20   1.21   0.00       1.38   1.20   0.14
16    1.04   1.08   0.03       1.23   1.24   0.01       1.49   1.32   0.11
32    0.94   1.09   0.16       1.10   1.28   0.16       1.44   1.44   0.01

n = 200000
            Type 2                    Type 5                    Type 1
 m   fˆD−QR  fD−QR  Rerr      fˆD−QR  fD−QR  Rerr      fˆD−QR  fD−QR  Rerr
 4    1.09   1.10   0.00       1.13   1.13   0.00       1.27   1.11   0.13
 8    1.06   1.08   0.02       1.18   1.18   0.00       1.36   1.17   0.14
16    1.05   1.08   0.03       1.22   1.23   0.00       1.52   1.32   0.13
32    1.05   1.10   0.05       1.24   1.27   0.02       1.66   1.38   0.17
the overhead of inter-processor synchronization is negligible. Our performance models are useful in choosing the right algorithm in these situations.
5 Related Work
There has been a lot of research on performance modeling and prediction of parallel programs. To mention a few, Dackland et al. [9] propose a simple method for performance prediction of linear algebra algorithms: they exploit the natural hierarchical structure of such algorithms and predict the total execution time by modeling the execution times of the constituent BLAS routines and accumulating them. Cuenca et al. [10] used a similar approach to realize automatic tuning of linear algebra programs. Our work is along the lines of these studies and models the performance of the multishift QR algorithms, which are an important class of algorithms for the symmetric eigenvalue problem.
6 Conclusion
In this paper, we constructed performance models for three kinds of multishift QR algorithms. Our models are designed for shared-memory parallel machines and can predict the relative performance of these algorithms as a function of the machine characteristics and the problem size. The prediction error is less than 10% in most cases. Hence our models are expected to be useful for choosing the best algorithm to solve a given problem on a given machine, as well as for optimizing the performance parameters.
Acknowledgements. We would like to express our sincere gratitude to the anonymous reviewers. Their comments helped us to improve the quality of this paper. We also would like to thank Professor Yoshimasa Nakamura and Professor Shao-Liang Zhang for continuous support and the members of Zhang laboratory for fruitful discussion. This work is partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research on Priority Areas, ”i-explosion” (No. 21013014), Grant-in-Aid for Scientific Research (B) (No. 21300013), Grant-in-Aid for Scientific Research (C) (No. 21560065) and Grant-in-Aid for Scientific Research (A) (No. 20246027).
References

1. Golub, G.H., van Loan, C.F.: Matrix Computations. Johns Hopkins University Press (1996)
2. Demmel, J.W.: Applied Numerical Linear Algebra. SIAM, Philadelphia (1997)
3. Sameh, A.H., Kuck, D.J.: A parallel QR algorithm for symmetric tridiagonal matrices. IEEE Trans. Comput. C-26, 147–153 (1977)
4. Bai, Z., Demmel, J.: On a block implementation of Hessenberg QR iteration. Int. J. of High Speed Computing 1, 97–112 (1989)
5. Kaufman, L.: A parallel QR algorithm for the symmetric tridiagonal eigenvalue problem. J. Parallel and Distributed Comput. 3, 429–434 (1994)
6. Bar-On, I., Codenotti, B.: A fast and stable parallel QR algorithm for symmetric tridiagonal matrices. Linear Algebra Appl. 220, 63–95 (1995)
7. Van de Geijn, R.A.: Deferred shifting schemes for parallel QR methods. SIAM J. Matrix Anal. Appl. 14, 180–194 (1993)
8. Miyata, T., Yamamoto, Y., Zhang, S.-L.: A fully pipelined multishift QR algorithm for parallel solution of symmetric tridiagonal eigenproblems. IPSJ Trans. Advanced Computing Systems 1, 14–27 (2008)
9. Dackland, K., Kågström, B.: A hierarchical approach for performance analysis of ScaLAPACK-based routines using the distributed linear algebra machine. In: Madsen, K., Olesen, D., Waśniewski, J., Dongarra, J. (eds.) PARA 1996. LNCS, vol. 1184, pp. 187–195. Springer, Heidelberg (1996)
10. Cuenca, J., Gimenez, D., Gonzalez, J.: Architecture of an automatically tuned linear algebra library. Parallel Computing 30, 187–210 (2004)
A Parallel Solution of Large-Scale Heat Equation Based on Distributed Memory Hierarchy System

Tangpei Cheng, Qun Wang*, Xiaohui Ji, and Dandan Li

School of Information Engineering, China University of Geosciences (Beijing), Beijing 100083, P.R. China
[email protected]
Abstract. A parallel scheme for a distributed memory hierarchy system is presented to solve the large-scale three-dimensional heat equation. Since managing interprocess communication and coordination is the main difficulty with such a system, the local physics/global algebraic object paradigm is introduced. A domain decomposition method is used to partition the modeling area, distributing the intensive computational effort and the large memory requirement. Efficient storage and assembly of the sparse matrix and parallel iterative solution of the linear system are considered and developed. The efficiency and scalability of the parallel program are demonstrated by two experiments on a Linux cluster, in which different preconditioning methods are tested and analyzed. The results demonstrate that this method achieves desirable parallel performance.
1 Introduction

The heat equation is one of the most important mathematical equations and is widely applied in engineering applications such as groundwater simulation and reservoir modeling. Even with the significant advances made in computational algorithms for solving partial differential equations, solving the large-scale heat equation remains a challenge. The difficulty arises from the intensive computational effort and the large memory capacity required for solving the large sparse matrix systems of the discrete equations. Meanwhile, parallel computing has proved to be an efficient way to address the primary limitations of single-processor computers and can significantly advance the modeling capability of researchers. Thus, research on parallel computing for the heat equation has received considerable attention in recent years [1–10]. We note that most of these works deal with improving the stability and convergence rate of algorithms. Nowadays, Linux clusters, among the most well-known DMS (distributed memory hierarchy systems), are increasingly adopted as a cost-effective alternative to classical parallel supercomputers for solving large-scale problems. However, the strong correlation of data and the large amount of global communication among processes in a DMS bring programming complexities which may prevent an application from obtaining optimal parallel performance. This paper describes a parallel method based on

* Corresponding author.
domain decomposition to solve the large-scale three-dimensional heat equation. The primary goal of this research is to present a scalable parallel algorithm which can meet the intense demands on modeling capability, minimize the effort of parallelization, and reduce communication overheads on clusters, especially those based on standard interconnection technologies. This is achieved by optimizing the following procedures: (1) an efficient mesh generation and domain decomposition method; (2) the distributed memory programming model provided in PETSc [11], which automatically manages the data movement and coordination among participating processes through MPI; (3) efficient storage and parallel assembly of the large sparse matrix; (4) the optimized parallel iterative solvers and preconditioners included in PETSc and HYPRE [12] for handling large linear systems of equations.
2 Overview of Three-Dimensional Heat Equation

In general, the three-dimensional heat equation can be written as

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = \frac{\partial u}{\partial t}, \quad (x, y, z) \in \Omega,\ t > 0 \qquad (1)$$
on the domain Ω = (0, M) × (0, N) × (0, L), with the initial condition u = u0 at t = 0 and Dirichlet boundary conditions u = g, (x, y, z) ∈ ∂Ω, t ≥ 0, where u = u(x, y, z, t), u0 = u0(x, y, z), g = g(x, y, z, t). This is a linear, second-order, parabolic partial differential equation. For simplicity, we discretize the left-hand side using the seven-point-stencil finite difference method (unstructured and irregular mesh generation methods are also supported in our method) and set hx = M/IM, hy = N/JN, hz = L/KL; then (1) can be rewritten as

$$\frac{u_{i+1,j,k} - 2u_{i,j,k} + u_{i-1,j,k}}{h_x^2} + \frac{u_{i,j+1,k} - 2u_{i,j,k} + u_{i,j-1,k}}{h_y^2} + \frac{u_{i,j,k+1} - 2u_{i,j,k} + u_{i,j,k-1}}{h_z^2} = \frac{\partial u}{\partial t} \qquad (2)$$
where 0 ≤ i ≤ IM, 0 ≤ j ≤ JN, 0 ≤ k ≤ KL. Considering hx = hy = hz, (2) can be simplified to the following equation:

$$-6u_{i,j,k} + u_{i+1,j,k} + u_{i-1,j,k} + u_{i,j+1,k} + u_{i,j-1,k} + u_{i,j,k+1} + u_{i,j,k-1} = h_x^2 \frac{\partial u}{\partial t} \qquad (3)$$
Applying

$$A U_i = -6u_{i,j,k} + u_{i,j+1,k} + u_{i,j-1,k} + u_{i,j,k+1} + u_{i,j,k-1} \qquad (4)$$

to (3) yields the equation

$$A U_i + U_{i+1} + U_{i-1} = h_x^2 \frac{\partial u}{\partial t}, \qquad (5)$$

which can be written in detail as
$$\begin{bmatrix} A & I_1 & & & \\ I_1 & A & I_1 & & \\ & \ddots & \ddots & \ddots & \\ & & I_1 & A & I_1 \\ & & & I_1 & A \end{bmatrix} \begin{bmatrix} U_1 \\ U_2 \\ \vdots \\ U_{IM-1} \end{bmatrix} = h_x^2 \frac{\partial u}{\partial t}, \qquad (6)$$

where $I_1$ refers to a unit (identity) square matrix of order (JN − 1)(KL − 1) and $U_i = [u_{i,1,1}, u_{i,1,2}, \ldots, u_{i,1,KL-1}, \ldots, u_{i,JN-1,KL-1}]^T$. Applying

$$B U_{ij} = -6u_{i,j,k} + u_{i,j,k+1} + u_{i,j,k-1} \qquad (7)$$

to (4) yields the two equations

$$A U_i = B U_{ij} + U_{i,j+1} + U_{i,j-1} \qquad (8)$$

and

$$A = \begin{bmatrix} B & I_2 & & \\ I_2 & B & I_2 & \\ & \ddots & \ddots & \ddots \\ & & I_2 & B \end{bmatrix}, \qquad (9)$$

where $I_2$ refers to a unit (identity) square matrix of order KL − 1, $U_{ij} = [u_{i,j,1}, \ldots, u_{i,j,KL-1}]^T$ and

$$B = \begin{bmatrix} -6 & 1 & & \\ 1 & -6 & 1 & \\ & \ddots & \ddots & \ddots \\ & & 1 & -6 \end{bmatrix}_{(KL-1)\times(KL-1)}. \qquad (10)$$
For time discretization, the backward Euler method, which is fully implicit, is used. Thus we have

$$\frac{u^{n+1}_{i,j,k} - u^{n}_{i,j,k}}{\Delta t} = \frac{u^{n+1}_{i+1,j,k} - 2u^{n+1}_{i,j,k} + u^{n+1}_{i-1,j,k}}{h_x^2} + \frac{u^{n+1}_{i,j+1,k} - 2u^{n+1}_{i,j,k} + u^{n+1}_{i,j-1,k}}{h_y^2} + \frac{u^{n+1}_{i,j,k+1} - 2u^{n+1}_{i,j,k} + u^{n+1}_{i,j,k-1}}{h_z^2},$$
$$1 \le i \le IM-1,\ 1 \le j \le JN-1,\ 1 \le k \le KL-1,\ n \ge 0,$$
$$u^{n+1}_{i,j,k} = g^{n+1}_{i,j,k},\quad i = 0 \text{ or } i = IM \text{ or } j = 0 \text{ or } j = JN \text{ or } k = 0 \text{ or } k = KL,\ n \ge 0 \qquad (11)$$
where Δt represents the time step, $u^n_{i,j,k} = u_{i,j,k}|_{t=n\Delta t}$ and $g^n_{i,j,k} = g_{i,j,k}|_{t=n\Delta t}$.
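The step from (11) to the matrix form solved in section 3 is worth making explicit; this rearrangement is only implicit in the equations above. With $h_x = h_y = h_z = h$ and $\lambda = \Delta t / h^2$, collecting the unknown time level on the left-hand side gives, for each interior node,

$$(1 + 6\lambda)\,u^{n+1}_{i,j,k} - \lambda\left(u^{n+1}_{i+1,j,k} + u^{n+1}_{i-1,j,k} + u^{n+1}_{i,j+1,k} + u^{n+1}_{i,j-1,k} + u^{n+1}_{i,j,k+1} + u^{n+1}_{i,j,k-1}\right) = u^{n}_{i,j,k},$$

so each backward Euler step amounts to solving one sparse, symmetric positive definite linear system Ax = b with the seven-point coupling pattern described in the next section.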
3 Methodology and Implementation

3.1 Parallel Partitioning of Data

As stated in section 2, a cell-centered finite difference method is applied over the mesh and the time discretization is done with an implicit backward Euler scheme; the next, and critical, step is to develop an efficient domain decomposition method. To obtain optimal performance, the program should balance the computation in each process as well as minimize the communication resulting from the placement of adjacent elements in different processes. In this paper, we used the distributed memory programming model provided in PETSc to create global objects (vectors, matrices, distributed arrays, etc.) which are automatically distributed among several processes by an MPI communicator, which represents a group of processes. Therefore, the effort of parallelization is minimized, as the PETSc library handles it in a highly efficient way that hides the low-level coordination details among processes in a DMS. Examples of such details include overlapping communication and computation, determining the details of various repeated communications, and optimizing the resulting message-passing code. Since our discretization is predicated on structured mesh generation, the domain partitioning method is straightforward: it is natural to consider the block decomposition method based on array distribution, which is already implemented in PETSc DA (distributed arrays). Even so, the graph partitioning algorithms provided in ParMetis [13] are also supported in our implementation for unstructured and irregular meshes. As shown in Fig. 1, each process is typically assigned a contiguous portion of the global objects, named local nodes, which are stored in local process memory. Besides, the calculations on the local portion require data from neighboring partitions, which are termed ghost nodes.
Fig. 1. (a) Domain partitioning for a structured mesh and (b) domain partitioning for an unstructured mesh. Local nodes are the nodes located in the local area, while ghost nodes are those located at the bordering portions owned by neighboring processes.
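The local/ghost-node organization of Fig. 1 is exactly what PETSc's distributed arrays provide. The following minimal C sketch is ours, not the paper's code; it uses the current PETSc API names (the same object was called DA rather than DMDA in PETSc releases contemporary with this paper), and the 100×100×100 grid size is taken from the experiments in section 4:

#include <petscdmda.h>

int main(int argc, char **argv)
{
    DM  da;
    Vec u_global, u_local;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* 3D structured grid distributed over all processes; a star stencil
     * of width 1 matches the seven-point finite-difference coupling. */
    DMDACreate3d(PETSC_COMM_WORLD,
                 DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                 DMDA_STENCIL_STAR,
                 100, 100, 100,                            /* global grid  */
                 PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, /* proc layout  */
                 1,                                        /* dof per node */
                 1,                                        /* stencil width*/
                 NULL, NULL, NULL, &da);
    DMSetUp(da);

    DMCreateGlobalVector(da, &u_global); /* local nodes only              */
    DMCreateLocalVector(da, &u_local);   /* local nodes plus ghost nodes  */

    /* Refresh ghost values from neighboring processes before a local
     * stencil computation. */
    DMGlobalToLocalBegin(da, u_global, INSERT_VALUES, u_local);
    DMGlobalToLocalEnd(da, u_global, INSERT_VALUES, u_local);

    VecDestroy(&u_local);
    VecDestroy(&u_global);
    DMDestroy(&da);
    PetscFinalize();
    return 0;
}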
3.2 The Storage and Assembly of Sparse Matrix

As stated in section 2, the heat equation is discretized into a linear system of the form Ax = b, where A is sparse, symmetric and positive definite. For a rectangular grid structure and "natural ordering" of the unknowns, the matrix A has a 7-diagonal banded nonzero structure. More precisely, its nonzero elements are located along seven diagonals: the principal diagonal and three diagonals above and below it. In our implementation, we utilized the CSR (compressed sparse row) format to store the matrix in order to reduce the memory requirement. The nonzero elements are
stored by rows, along with an array of the corresponding column numbers and an array of pointers to the beginning of each row. To achieve better performance, we preallocate the memory for the sparse matrix instead of allocating it dynamically. As the computational domain becomes larger, assembling the sparse matrix becomes the key issue in achieving high performance. Since the matrix is automatically partitioned into blocks and distributed among processes by PETSc, we only need to insert the local elements for each process in parallel; any non-local elements are sent to the appropriate process during matrix assembly. Defining Istart and Iend as the lower and upper bounds of the global index, respectively, and Ii as the index of the current row, and assigning values to the matrix row by row, then as illustrated in (6)–(10), the parallel assembly of the sparse matrix can be schematically outlined as follows, in which m = IM, n = JN, l = KL:

for Ii := Istart to Iend do
  i := Ii / (n * l);  j := (Ii − i * n * l) / l;  k := Ii − i * n * l − j * l;
  if (i > 0) then Position(Ii, Ii − n * l) := 1;
  if (i < m − 1) then Position(Ii, Ii + n * l) := 1;
  if (j > 0) then Position(Ii, Ii − l) := 1;
  if (j < n − 1) then Position(Ii, Ii + l) := 1;
  if (k > 0) then Position(Ii, Ii − 1) := 1;
  if (k < l − 1) then Position(Ii, Ii + 1) := 1;
  Position(Ii, Ii) := −6;
endfor;
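In PETSc terms, the pseudocode above maps onto per-row insertions over the locally owned range; the following sketch is ours (current PETSc API names, error checking omitted, and IM, JN, KL assumed to be defined elsewhere):

/* Sketch: parallel assembly of the 7-diagonal matrix over the rows
 * owned by this process. A is assumed created with MatCreateAIJ and
 * preallocated with 7 nonzeros per row. */
PetscInt Istart, Iend, Ii;
PetscInt m = IM, n = JN, l = KL;

MatGetOwnershipRange(A, &Istart, &Iend);
for (Ii = Istart; Ii < Iend; Ii++) {
    PetscInt i = Ii / (n * l);
    PetscInt j = (Ii - i * n * l) / l;
    PetscInt k = Ii - i * n * l - j * l;

    if (i > 0)     MatSetValue(A, Ii, Ii - n * l, 1.0, INSERT_VALUES);
    if (i < m - 1) MatSetValue(A, Ii, Ii + n * l, 1.0, INSERT_VALUES);
    if (j > 0)     MatSetValue(A, Ii, Ii - l,     1.0, INSERT_VALUES);
    if (j < n - 1) MatSetValue(A, Ii, Ii + l,     1.0, INSERT_VALUES);
    if (k > 0)     MatSetValue(A, Ii, Ii - 1,     1.0, INSERT_VALUES);
    if (k < l - 1) MatSetValue(A, Ii, Ii + 1,     1.0, INSERT_VALUES);
    MatSetValue(A, Ii, Ii, -6.0, INSERT_VALUES);
}
MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);   /* ships off-process entries */
MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);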
3.3 Iterative Solution of Linear System

As illustrated in section 2, after finite-difference discretization the heat equation reduces to a large-scale system of linear algebraic equations, which can be written in matrix form as Ax = b, in which the coefficient matrix A is sparse, symmetric and positive definite. In general, there are two kinds of methods for solving large linear systems: direct solvers and iterative solvers. Direct solvers, although very efficient for some small problems, have computational limitations when used for large-scale applications [14]: since they inevitably introduce fill-in when decomposing a large sparse matrix, the demand on memory capacity becomes the bottleneck to achieving high performance. Meanwhile, it is generally recognized that the combination of a Krylov subspace method and a preconditioner is at the center of most modern numerical codes for the iterative solution of linear systems. Therefore, combinations of the Krylov subspace methods (such as GMRES and CG) and the preconditioners (such as block Jacobi, overlapping additive Schwarz methods, Euclid, and BoomerAMG) provided in PETSc and HYPRE are integrated in our program for the iterative solution of linear systems. The program source code was written with high-level abstractions based on object-oriented technology, so that it can be easily understood and modified. This approach promotes code reuse and flexibility and, in many
cases, helps to decouple issues of parallelism from algorithm choices and allows the optimal choice for a particular application via a series of runs without recompiling.
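A minimal solver setup corresponding to the best-performing combination found in section 4 — CG preconditioned with block Jacobi and ILU(0) sub-blocks — might look as follows. This is our sketch with current PETSc API names; the tolerances only approximate the convergence criterion stated in the next section, since PETSc's default test compares the residual norm against max(rtol · r0, abstol):

KSP ksp;
PC  pc;

KSPCreate(PETSC_COMM_WORLD, &ksp);
KSPSetOperators(ksp, A, A);
KSPSetType(ksp, KSPCG);        /* conjugate gradient                   */

KSPGetPC(ksp, &pc);
PCSetType(pc, PCBJACOBI);      /* one block per process; sub-blocks
                                  default to ILU(0)                    */

KSPSetTolerances(ksp, 1e-10, 1e-50, PETSC_DEFAULT, PETSC_DEFAULT);
KSPSetFromOptions(ksp);        /* allow run-time solver/preconditioner
                                  selection without recompiling        */

KSPSolve(ksp, b, x);
KSPDestroy(&ksp);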
4 Performance Evaluations

The numerical experiments were carried out on a Linux cluster composed of four nodes, each of which contains an Intel Xeon CPU; the nodes are interconnected by an Ethernet 100 Mbit/s switch. The CPU clock frequency is 3.0 GHz and the memory per node is 2 GB. As CG (the conjugate gradient method) is widely recognized as the standard method for the solution of symmetric positive definite systems, we adopted it as the default iterative method, preconditioned with different preconditioners. The convergence criterion used for the iterative linear solver is based on the l2-norm of the residual: convergence is detected at iteration k if $\|r_k\|_2 < \max(10^{-10} \cdot \|b\|_2,\ 10^{-50})$, where $r_k = b - A x_k$; $r_k$ and $b$ are the residual and the right-hand side, respectively.
Fig. 2. Wall clock time vs. the number of processors for different preconditioners
The first experiment was carried out with a discretization of 100×100×100 spatial grid points by the finite difference method and 100 time steps by the backward Euler method. The wall clock time versus the number of processors for different preconditioners is shown in Fig. 2. We found a significant decrease in computational time, and the difference between the wall clock times for different preconditioners is moderate except for PILU(k). The plot also shows that the block Jacobi method, with one block per processor, each handled by ILU(0), requires the least computational time. Besides, from our previous study, we also found that its memory requirement is always much less than that of the other preconditioners [15]. In particular, we noted that there is no significant performance improvement for PILU(k) from 2 to 4 processors. As
shown in Fig. 3, this can be attributed to the significantly increased effort required to construct the preconditioner, which may counteract the benefit of parallel processing. Another likely reason is that as the size of the local domain becomes smaller, the effectiveness of the preconditioner decreases, which may lead to extra iterations and more computational time.
Fig. 3. Time ratio for constructing the preconditioner vs. number of processors for the PILU(k) preconditioners
We then adopted the Bjacobi+ILU(0) preconditioner in the second experiment and adjusted the spatial discretization from 100×100×16 to 100×100×512 to examine the scalability of our parallel program. From Fig. 4, we note that the speedups for different processor numbers are largely on the rise as the problem scale expands, which demonstrates good scalability. However, we also observed declines in speedup from 100×100×32 to 100×100×64 for 2 processors and from 100×100×256 to 100×100×512 for 4 processors, respectively. This behavior was analyzed extensively; we take the case of 2 processors to illustrate it. As the problem scale per processor increases, there are three effects. First, the proportion of the computational part (the linear solve of the equation) dramatically increases while the share of the input/output and initialization parts declines; therefore the speedup is largely on the rise. Second, the effectiveness of the preconditioner increases, so the iteration count of the linear solver decreases, which in turn decreases the computational overhead. When the size of the local portion is comparatively small, this is the main contributing factor to the sudden increase in speedup as the problem scale rises from 100×100×16 to 100×100×32. Third, as the size of the local domain for each processor increases, more effort is required to solve the linear system and construct the preconditioner, so the convergence rate decreases and the computational time is prolonged. There is often a trade-off between the last two effects, and the latter may become the leading factor as the problem scale expands.
Fig. 4. Speedup vs. problem scale for the combination of CG iterative method, Bjacobi preconditioner and ILU(0) sub-preconditioner
5 Conclusions

Heat equations are widely applied in reservoir simulation and groundwater modeling. In this paper, an efficient parallel method for the large-scale three-dimensional heat equation on a DMS has been developed. Since managing the data movement and coordination between processes is the main difficulty in parallel computing, a distributed memory programming model has been introduced in our implementation, which hides the low-level coordination details among processes. Through the scheme of domain decomposition, the intensive computational effort and the large memory requirement are distributed among processes. To further improve parallel performance, we devoted intensive effort to the storage and assembly of the sparse matrix: a compressed matrix storage format is introduced and parallel assembly of the large-scale sparse matrix is implemented in our work. The experimental results are analyzed extensively and demonstrate that this parallel scheme achieves a great improvement in computational efficiency and that the combination of block Jacobi and ILU(0) is the best-performing preconditioner.
Acknowledgement. This work was partially supported by the Fundamental Research Funds for the Central Universities of China.
References

1. Dawson, C.N., Du, Q., Dupont, T.F.: A Finite Difference Domain Decomposition Algorithm for Numerical Solution of the Heat Equation. Mathematics of Computation 195, 63–71 (1991)
2. Lions, J.L., Maday, Y., Turinici, G.: A "parareal" in time discretization of PDE's. Comptes Rendus de l'Académie des Sciences – Series I – Mathematics 332, 661–668 (2001)
3. Contassot-Vivier, S., Couturier, R., Denis, C., Jézéquel, F.: Efficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method. In: Fourth International Workshop on Parallel Matrix Algorithms and Applications, PMAA'06, pp. 21–22 (2006)
4. Amestoy, P.R., Duff, I.S., Pralet, S., Vömel, C.: Adapting a parallel sparse direct solver to architectures with clusters of SMPs. Parallel Computing 29, 1645–1668 (2003)
5. Atenekeng Kahou, G.A., Grigori, L., Sosonkina, M.: A partitioning algorithm for block-diagonal matrices with overlap. Parallel Computing 34, 332–344 (2008)
6. Dağ, H.: An approximate inverse preconditioner and its implementation for conjugate gradient method. Parallel Computing 33, 83–91 (2007)
7. Couturier, R., Denis, C., Jézéquel, F.: GREMLINS: a large sparse linear solver for grid environment. Parallel Computing 34, 380–391 (2008)
8. Hoefler, T., Gottschling, P., Lumsdaine, A., Rehm, W.: Optimizing a conjugate gradient solver with non-blocking collective operations. Parallel Computing 33, 624–633 (2007)
9. Hernandez, V., Roman, J.E., Tomas, A.: Parallel Arnoldi eigensolvers with enhanced scalability via global communications rearrangement. Parallel Computing 33, 521–540 (2007)
10. Mo, Z.Y., Xu, X.W.: Relaxed RS0 or CLJP coarsening strategy for parallel AMG. Parallel Computing 33, 174–185 (2007)
11. Balay, S., Buschelman, K., Eijkhout, V., Gropp, W., Kaushik, D., Knepley, M., Curfman McInnes, L., Smith, B., Zhang, H.: PETSc Users Manual, http://www-unix.mcs.anl.gov/petsc/petscas/documentation/index.html#Manual
12. HYPRE – the LLNL preconditioner library, http://www.llnl.gov/CASC/hypre
13. Karypis, G., Schloegel, K., Kumar, V.: ParMETIS 1.0: Parallel Graph Partitioning and Sparse Matrix Ordering Library. Technical Report TR-97-060, Department of Computer Science, University of Minnesota (1997)
14. Kelley, C.T.: Iterative Methods for Linear and Nonlinear Equations. SIAM Press, Philadelphia (1995)
15. Cheng, T.P., Ji, X.H., Wang, Q.: An Efficient Parallel Method for Large-scale Groundwater Flow Equation based on PETSc. In: IEEE Youth Conference on Information, Computing and Telecommunications, pp. 190–193 (2009)
A New Metric for On-Line Scheduling and Placement in Reconfigurable Computing Systems Maisam Mansub Bassiri and Hadi Shahriar Shahhoseini Department of Electrical Engineering, Iran University of Science and Technology Tehran, Iran [email protected], [email protected]
Abstract. Reconfigurable computing systems use a reconfigurable processing unit in conjunction with a processor, enabling tasks to be executed in a true multitasking manner. This leads to highly dynamic allocation situations. To manage such systems at runtime, a reconfigurable operating system is needed; on-line scheduling and placement algorithms are its main parts. In this paper, we present a technique for on-line integrated scheduling and placement which focuses on on-line, real-time and non-preemptive reconfigurable computing systems. The main characteristic of our method is a new metric, based on temporal and spatial constraints, for selecting the best feasible placements for arriving tasks. A large variety of experiments has been conducted on the proposed algorithm using synthetic and real tasks. The obtained results show the benefits of this algorithm.
1 Introduction

Reconfigurable computing (RC) systems comprise one or several reconfigurable processing units (RPUs) along with a CPU [1]. An RPU, which may be a modern FPGA, can be reconfigured at run-time and supports partial reconfigurability, which enables part of the FPGA to be reconfigured dynamically without disturbing other parts. Hence RPUs offer high flexibility along with high performance. The run-time reconfigurability and partial reconfigurability of FPGAs allow us to execute tasks in a true multitasking manner. While using RC systems can increase computing performance, it also leads to complex allocation situations for dynamic task sets. This clearly calls for well-defined system services that help to design applications efficiently. Such a set of system services forms a reconfigurable operating system (OS) [2,3]. From the designer's point of view, an operating system is an abstraction layer in the design process that hides the details of the underlying hardware. The benefits of using this abstraction layer are increased design productivity, portability and resource utilization. Of course, these benefits are paid for by area overhead and computation time overhead. This work deals with the resource management part of a reconfigurable OS. In particular, we focus on the problem of scheduling and placement of tasks on the RPU in an RC system. We propose a new metric for on-line scheduling which is based on temporal and spatial constraints.
In the rest of this paper, section 2 discusses the modeling of the RC system, RPU, the tasks and the operating system. Section 3 reviews the related work. In section 4 we discuss on-line scheduling and present our algorithm. Section 5 presents the experimental evaluation of the algorithm and finally, section 6 concludes the paper.
2 Problem Modeling

2.1 RPU and Task Models

A task is a function synthesized to a digital circuit that can be programmed onto the reconfigurable device (RPU). The hardware form of a task has a size and a shape; the size gives the area requirement of the task in reconfigurable units. We assume all task shapes to be rectangular, and the execution times of tasks are known in advance. Generally, each hardware task is defined as a 5-tuple Ti = (wi, hi, ei^h, ai, di), where wi, hi, ei^h, ai and di denote the width, height, hardware (HW) execution time, arrival time and deadline of the task, respectively. The reconfiguration time of the task can be considered as part of its HW execution time. Also, we define the following parameters for each task Ti:

LSTi = di − ei^h,   Si = LSTi − ai,   TIi = di − ai,

where LSTi, Si and TIi denote the latest start time, slack (mobility) and task interval of the task Ti, respectively. Tasks can be real-time or non-real-time; the deadline di is defined only for real-time tasks and determines the latest finish time of the task. Also, tasks can be dependent or independent; dependent tasks may be represented as a directed task graph. In this work we focus on real-time and independent tasks. Task execution can be preemptive or non-preemptive. Although preemption is a very useful technique that yields efficient and fast on-line scheduling algorithms, we believe that it is too expensive for current reconfigurable technology, because preempting and resuming a task requires heavy (large) context switches and additional external memory. Hence, in this work we focus on a non-preemptive task model, i.e., once a task is loaded onto the device, it runs to completion. The mapping of tasks to an RPU strongly depends on the area model of the RPU, so it is necessary to describe our RPU area model. In general, there are four area models: the flexible 2D area model [4,5,6], the partitioned 2D model [7,8], the flexible 1D area model [9] and the partitioned 1D area model. In this paper, we address the problem of on-line scheduling of real-time tasks in the 1D area model. The tasks have known execution times and there are no dependencies among them; such assumptions are the same as those used in related research works. A large variety of applications in embedded reconfigurable computing systems can exploit our proposed algorithm, including reconfigurable co-processor implementation, image and video processing, cryptography, telecommunication and neural network implementation.

2.2 Scheduling and Placement Problem Definition

As stated before, in this work we use the 1D area model for the device area. Therefore the location of each task Ti on the device surface is indexed with
xi, which is the
horizontal position of the bottom-left corner of the rectangular task Ti. We can define the placement and scheduling for a task as follows:

Definition 1. A placement for a task Ti and a set of currently placed tasks Tc is a placement <xi> that satisfies the following constraints in the 1D area model:

xi + wi ≤ W,
hi ≤ H,
∀Tj ∈ Tc : (xi + wi ≤ xj) ∨ (xj + wj ≤ xi).
Here H and W are the height and width of the device surface, respectively. In fact, the tasks must be placed in such a way that they do not overlap with the device boundaries or with other currently placed tasks.

Definition 2. Scheduling assigns a starting time sti to each task Ti ∈ T such that:
∀Tj ∈ T, Tj ≠ Ti : (xi + wi ≤ xj) ∨ (xj + wj ≤ xi) ∨ (sti + ei^h ≤ stj) ∨ (stj + ej^h ≤ sti),
(sti + ei^h) ≤ di.

In an on-line scenario, new tasks arrive during the system's runtime. For each arriving task Ti, the scheduler has to find a feasible start time. If the scheduler and placer cannot find a feasible placement and start time for the arrived task that satisfy the above constraints, the arrived task is rejected. In such real-time systems, it is assumed that when a task is rejected, it is referred to the CPU or to a resource outside the system for execution. The goal of scheduling and placement in the target RC systems is to minimize the task rejection ratio (TRR), i.e., the ratio of the number of rejected tasks to the total number of arrived tasks.
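The two definitions translate directly into feasibility checks. The following C sketch is our own illustration, not the authors' code; the type names and data layout are assumptions:

#include <stdbool.h>

typedef struct {
    int    x, w, h;       /* horizontal position, width, height          */
    double st, e, d;      /* start time, HW execution time, deadline     */
} Task;

/* Definition 1: the candidate must fit the device and must not overlap
 * horizontally with any currently placed task. */
bool placement_feasible(const Task *t, const Task *placed, int nplaced,
                        int W, int H)
{
    if (t->x + t->w > W || t->h > H) return false;
    for (int i = 0; i < nplaced; i++) {
        const Task *p = &placed[i];
        bool disjoint = (t->x + t->w <= p->x) || (p->x + p->w <= t->x);
        if (!disjoint) return false;
    }
    return true;
}

/* Definition 2: every pair of tasks must be disjoint in space or in
 * time, and each task must finish before its deadline. */
bool schedule_feasible(const Task *tasks, int ntasks)
{
    for (int i = 0; i < ntasks; i++) {
        if (tasks[i].st + tasks[i].e > tasks[i].d) return false;
        for (int j = i + 1; j < ntasks; j++) {
            const Task *a = &tasks[i], *b = &tasks[j];
            bool space = (a->x + a->w <= b->x) || (b->x + b->w <= a->x);
            bool time  = (a->st + a->e <= b->st) || (b->st + b->e <= a->st);
            if (!space && !time) return false;
        }
    }
    return true;
}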
3 Related Work

So far, several research works [10,11,12,13,14,15,16,17,18] which address on-line scheduling and placement in RC systems have been carried out. Some of these works, such as [11], focus on real-time and preemptive tasks, but most of them focus on real-time and non-preemptive tasks. Zhou et al. [10] present a technique that uses a time window to find feasible locations for placing on-line real-time tasks; in fact, they improve the stuffing algorithm for scheduling. Ahmadinia et al. [12] present an integrated scheduling and placement algorithm in which the FPGA is divided into slots and the arriving tasks are placed inside one of the slots depending on their execution end time; moreover, the width of the slots is varied during runtime in order to improve the quality of placement. In [14], two techniques, horizon and stuffing, are presented by Steiger, Walder and Platzner. Both techniques use a reservation list and a free-space list for scheduling. The stuffing method schedules tasks into arbitrary free rectangles that will exist in the future, but the horizon scheduler
can only append tasks to the horizons, and it does not guarantee finding a feasible placement even when there is enough free space on the RPU. Marconi et al. [15] present the intelligent stuffing method: they define an alignment status for the free-space segments and use it to decide the task placement position so as to maximize the free-space area. In another work [18], Roman et al. propose a technique in which the FPGA area is divided into four partitions of different sizes. Each partition has an associated queue in which the hardware manager places each arriving task depending on its size, shape and real-time parameters such as its deadline. The algorithm may change the queue selection policy, the partitioning strategy and the sizes or number of partitions at run-time in order to adapt itself to the parameters of the arriving tasks. In [19], Cui et al. present a novel fragmentation metric based on MERs that takes into account the probability distributions of the width and height of future task arrivals instead of just their average values. In addition, they take the time axis into account to minimize the time-averaged area fragmentation (TAAF) during the execution time of the task being placed, and use TAAF to select the best feasible placement among several placement choices.
4 On-Line HW Scheduling Algorithm

HW tasks applied to the scheduler must be scheduled and placed on the RPU. In general, at any given time the tasks in the scheduler fall into three categories: 1) running tasks, which are currently placed and running on the RPU; 2) scheduled tasks, which have been scheduled on the RPU but whose start times lie in the future (these tasks wait in the HW queue); and 3) arrived tasks, which have been applied to the HW scheduler but have not yet been scheduled. There exist two different approaches to task placement: scanning, and management of free/occupied spaces. In free/occupied space management techniques, a list of free/occupied spaces is maintained and updated on each event; to place a task, only the free spaces are searched. Some space management techniques, such as MER (maximal empty rectangle) management [5], are recognition-complete and some are not. We use a free-space management technique to schedule and place the arrived tasks on the RPU. Fig. 1 shows an example of an RPU with two running tasks. A new task (T3) arrives at time $t_3$ and should be placed on the RPU before its deadline $d_3$. In order to place T3, we find the maximal empty rectangles (MERs) [5,12] in the time interval $(t_3, d_3)$. In fact, we use the concept of MER for the 1D area model: although in the 1D area model only one dimension of the RPU (its width) is considered, we take time as the second dimension for defining MERs. Fig. 1a and Fig. 1b show examples of MERs. In Fig. 1b, there are 4 running tasks and one scheduled task (T4). Each MER is indicated by its diagonal line.
Fig. 1. MERs in 1D area model
We denote each MER by $M_i$ and characterize it by the 4-tuple $M_i = (S_i^{MER}, L_i^{MER}, W_i^{MER}, TH_i^{MER})$, where $S_i^{MER}$, $L_i^{MER}$, $W_i^{MER}$ and $TH_i^{MER}$ denote the start time of $M_i$, the horizontal location of the bottom-left corner of $M_i$, the width of $M_i$ and the time height (time duration) of $M_i$, respectively. Also, for each MER we define a setup delay at any given time, indicating the time delay required before the MER can be utilized:

$SD_i^{MER} = S_i^{MER} - t_c$    (4)
where $t_c$ is the current time. For example, in Fig. 1a, $SD_1^{MER}$, $SD_2^{MER}$ and $SD_3^{MER}$ at time $t_3$ are 5, 3 and 0, respectively. Several fast and efficient algorithms, such as [20,21], have been proposed to find MERs and can be used here. We define a feasible MER as follows:

Definition 3. A feasible MER is a MER, $M_j$, which can accommodate the newly
arrived task $T_i$. For this purpose, the following constraints must be satisfied at the arrival time of $T_i$:

Temporal constraints: $(S_i - SD_j^{MER} \ge 0) \land (TH_j^{MER} - e_i^h \ge 0)$    (5)

Spatial constraint: $(W_j^{MER} - w_i) \ge 0$    (6)
Here $S_i$ is the slack of the arrived task as defined in Section 2.1. Generally, a placement algorithm contains two sub-functions: the status holder and the fitter. The status holder maintains the status of the reconfigurable units: whenever a task is placed or removed, it updates the list of free (occupied) reconfigurable units. The fitter selects a site among the several feasible sites for task placement. Several fitting strategies can be applied: best-fit, for example, chooses the smallest free space big enough to accommodate the task, while first-fit selects the first free space that can accommodate the task.
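Definition 3 translates directly into a feasibility test. The sketch below is our own illustration (the field names are assumptions); a MER is feasible for an arriving task exactly when constraints (5) and (6) both hold.

// Illustrative MER record: width W, time height TH, and setup delay
// SD = S^{MER} - t_c as given by Eq. (4).
struct Mer { int W, TH, SD; };

// Definition 3: MER m can accommodate a task with slack 'slack',
// width w and HW execution time e.
bool isFeasibleMer(const Mer& m, int slack, int w, int e) {
    bool temporal = (slack - m.SD >= 0) && (m.TH - e >= 0);  // Eq. (5)
    bool spatial  = (m.W - w >= 0);                          // Eq. (6)
    return temporal && spatial;
}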
In our scheduling and placement algorithm, an important problem is choosing the best MER among the several feasible MERs. Recent works use first-fit, best-fit [14] or fragmentation-aware [19,20] policies for this choice. To illustrate the drawback of these techniques, let us consider the example shown in Fig. 2, where two tasks are running on the RPU. Fig. 2a shows all MERs at time $t_3$. A new task, T3, arrives at time $t_3$ to be scheduled and placed on the RPU, and Fig. 2b shows all feasible placements for T3. According to the first-fit, best-fit or fragmentation-aware methods, M3 is selected for placement. As indicated in the figure, the next task, T4, arrives at time $t_4$. Since T4 has higher urgency than T3 (see the table in the figure), it cannot be placed on the RPU and is rejected. But if task T3 had been placed on M1 or M2, task T4 could have been placed on M3 and would not have been rejected. In fact, the urgency of the future task was not considered in the placement of the current task. Generally, such techniques (first-fit, best-fit and fragmentation-aware) suffer from two main drawbacks: 1) they do not consider the requirements of future tasks when placing currently arrived tasks; and 2) they focus only on satisfying the spatial constraints of placement and do not try to optimize the temporal constraints. In other words, they use only spatial metrics for placement.
Fig. 2. Drawback of first-fit and best-fit policy in task placement: (a) MERs at time $t_3$; (b) feasible placements for T3. Task parameters:

Task | $a_i$ | $d_i$ | $w_i$ | $e_i^h$
T3   |   2   |  11   |   4   |   4
T4   |   3   |   9   |   3   |   5
The aforementioned example shows that the choice among feasible placements for a task affects the probability that later tasks are rejected. It also shows that choosing the best scheduling time for a newly arrived task strongly depends on the characteristics of the next arriving (future) tasks. To the best of our knowledge, none of the proposed on-line scheduling algorithms so far has considered the influence of future tasks on the scheduling and placement of the currently arrived task. In the rest of this paper, we propose an on-line scheduling algorithm in which the characteristics of future tasks are predicted and used to make the correct placement decision for the current task. In other words, the currently arrived task is placed in such a way that there is maximal matching between future task requirements and future MERs (the MERs formed after placement of the current task). We also take temporal constraints, in addition to spatial constraints, into account for task placement. In what follows, we first explain the method of predicting future task requirements and then describe the matching calculation and the selection of the best MER for placement.

4.1 Future Task Prediction
We use the Poisson probability distribution to estimate the probability of task arrivals in the future. The probability of $k$ tasks arriving in the next time interval $\Delta t$ is:

$P_{\Delta t}(k) = \frac{\lambda^k e^{-\lambda}}{k!}$    (7)

where $\lambda$ is the average number of tasks that arrived in past time intervals of length $\Delta t$. From this equation, we can calculate the next time interval in which the probability of one task arriving, $P_{\Delta t}(1)$, is maximal. The following equations show that $\lambda = 1$ maximizes $P_{\Delta t}(1)$:

$P_{\Delta t}(1) = \lambda e^{-\lambda}$    (8)

$\frac{d P_{\Delta t}(1)}{d\lambda} = (1 - \lambda) e^{-\lambda} = 0 \;\Rightarrow\; \lambda = 1$    (9)

$\lambda = 1 \;\Rightarrow\; \frac{n}{P} \times \Delta t = 1 \;\Rightarrow\; \Delta t = \frac{P}{n}$    (10)
Here $n$ is the number of tasks that arrived in the past time window. We use a time window $(t_c - P, t_c)$ of length $P$ to determine the average number of tasks arriving per time unit in the past; as stated before, $t_c$ denotes the current time. By the above equations, in the next $\Delta t$ the probability of one task arriving is maximal. We use $\bar{S}$, $\bar{e}^h$, $\bar{w}$ and $\bar{TI}$ for the slack, HW execution time, width and task interval of the next arriving (future) task. These parameters are determined as follows:
$\bar{S}$ = average slack of the tasks that arrived during the past time window; $\bar{e}^h$ = average HW execution time of those tasks; $\bar{w}$ = average width of those tasks; $\bar{TI}$ = average task interval of those tasks. We use these parameters to select the best feasible MER for placing the current task.

4.2 Selecting the Best MER for Task Placement
Placing the new task in each of the feasible MERs leads to the formation of new MERs in the next $\Delta t$, so-called future MERs, denoted by $MER'$. Our proposed algorithm tries to maximize the matching between the future MERs ($MER'$) and the future task requirements in the next $\Delta t$. It includes the following steps:

1. Each feasible MER ($M_j$) is selected and the currently arrived task is tentatively placed in $M_j$. Afterwards, the next steps are performed.
2. The current time ($t_i$) is moved forward to ($t_i + \Delta t$) and all events (task terminations and task starts) are simulated.
3. All newly formed MERs are determined in the time interval $(t_i + \Delta t,\; t_i + \Delta t + \bar{TI})$; we denote them as the $MER'$ set.
4. The matching function is calculated for the new situation as follows:

$MF_j^{MER} = \sum_{M_k \in MER'} \left( \frac{\bar{S} - SD_k^{MER'}}{\max(\bar{S}, SD_k^{MER'})} + \frac{TH_k^{MER'} - \bar{e}^h}{\max(TH_k^{MER'}, \bar{e}^h)} + \frac{W_k^{MER'} - \bar{w}}{\max(W_k^{MER'}, \bar{w})} \right)$    (11)
In fact, the matching function indicates the degree of matching between the characteristics of the future MERs ($MER'$) and the requirements of future tasks. The higher the matching value, the lower the probability that the next arriving tasks are rejected and the lower the probability of fragmentation in the future.

5. Steps 1 to 4 are repeated for the other feasible MERs, and the matching function is calculated for each MER.
6. The MER with the highest matching value is selected for placing the current task.

Large values of the matching function indicate that the future feasible MERs are large compared to the future tasks, which decreases the probability of fragmentation in the future. In fact, the matching function takes both temporal and spatial metrics into account for task placement.
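As a sketch of step 4, Eq. (11) can be evaluated over the future MER set as follows; the averages come from Section 4.1, and all identifiers are ours rather than the authors'.

#include <algorithm>
#include <vector>

struct FutureMer { double SD, TH, W; };  // setup delay, time height, width

// Eq. (11): matching between the future MERs obtained after tentatively
// placing the current task in MER j, and the predicted future task
// (average slack avgS, HW execution time avgE, width avgW).
double matchingFunction(const std::vector<FutureMer>& futureMers,
                        double avgS, double avgE, double avgW) {
    double mf = 0.0;
    for (const FutureMer& m : futureMers) {
        mf += (avgS - m.SD) / std::max(avgS, m.SD)
            + (m.TH - avgE) / std::max(m.TH, avgE)
            + (m.W  - avgW) / std::max(m.W,  avgW);
    }
    return mf;  // the feasible MER with the largest value is selected
}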
5 Experimental Evaluation

5.1 Simulation Setup
A discrete-time simulation framework was implemented in C++ to experimentally evaluate the performance of the proposed algorithm. The simulated device consists of 96×64 = 6144 reconfigurable units (RCUs), which corresponds to Xilinx's XCV1000 FPGA. We conducted our experiments on three groups of input tasks: 1) task groups A and B: synthetic tasks; 2) task group C: real tasks. In task groups A and B, we use task sets with randomly generated parameters; the probability distributions of the task parameters are based on real tasks in real-world applications [14]. Fifty task sets were generated for each synthetic task group and the simulation results were averaged. In each task set, the number of tasks was set to 50. The parameters of the tasks in the different groups are distributed as shown in Table 1. For each task in a task set, we generate a random arrival time; hence two or more tasks in a task set may have the same arrival time ($a_i$), meaning that they arrive simultaneously. We use a workload parameter instead of the arrival time distribution in the following figures. We denote the set of tasks arriving in the time interval $\Delta T$ by $T$; $\Delta T$ spans from the arrival of the first task to the arrival of the last task plus its execution time. Every task $T_i \in T$ occupies $w_i \times h_i$ reconfigurable units of the device area, but as our area model is 1D, we use only $w_i$ in the workload definition. In (12), $W$ is the total width of the RPU.
$Workload = WL_{T,\Delta T} = \frac{\sum_{T_i \in T} w_i \times e_i^h}{W \times \Delta T}$    (12)
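Eq. (12) translates directly into code; a minimal sketch (identifiers are ours):

#include <vector>

struct ArrivedTask { int w, e; };  // width w_i and HW execution time e_i^h

// Eq. (12): workload of task set T over interval deltaT on a device of
// total width W.
double workload(const std::vector<ArrivedTask>& T, int W, double deltaT) {
    double sum = 0.0;
    for (const ArrivedTask& t : T) sum += t.w * t.e;
    return sum / (W * deltaT);
}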
Table 1. Probability distribution of the parameters of the generated tasks

Task group A:
  Width ($w_i$): [2,15]: 25%, [15,30]: 50%, [30,45]: 25%
  HW execution time ($e_i^h$): [5,30]: 25%, [30,60]: 50%, [60,100]: 25%
  Slack ($d_i - a_i - e_i^h$): [1,30]: 25%, [30,60]: 50%, [60,100]: 25%

Task group B:
  Width ($w_i$): uniform distribution in the interval [2,45]
  HW execution time ($e_i^h$): uniform distribution in the interval [5,100]
  Slack ($d_i - a_i - e_i^h$): uniform distribution in the interval [1,100]
In task group C, we used real tasks performing JPEG encoding and MPEG encoding, the most common tasks in image and video processing. Twenty task sets were generated using these tasks and the simulation results
Fig. 4. Comparison of different scheduling algorithms (FA, FR-C, FR-H, WBS) for input task group A: TRR (%) versus workload
Fig. 5. Comparison of different scheduling algorithms (FA, FR-C, FR-H, WBS) for input task group B: TRR (%) versus workload
have been averaged. We use the workload parameter instead of the task arrival time in the figures. The experiments were carried out for different workloads.

5.2 Simulation Results
We have compared the simulation results of four different algorithms: our proposed future-aware scheduling (FA) algorithm, the window-based stuffing (WBS) algorithm [10], which is the improved version of the stuffing algorithm [14], the fragmentation-aware (FR-H) algorithm presented by Handa [20], and the fragmentation-aware (FR-C) algorithm presented by Cui [19]. Figures 4 to 6 plot TRR (see Section 2.2) against the workload for the different scheduling algorithms and the different task
Fig. 6. Comparison of different scheduling algorithms (FA, FR-C, FR-H, WBS) for input task group C: TRR (%) versus workload
groups. As can be seen, our new metric for on-line scheduling yields a considerable decrease in the task rejection ratio.
6 Conclusion

In this work, we described the problem of on-line scheduling of tasks in RC systems under a real-time, non-preemptive task model and a 1D area model, and reviewed the related work. We then presented our proposed algorithm, which performs on-line look-ahead scheduling based on maximal matching between the requirements of future tasks and the characteristics of future MERs. Simulation results show a considerable improvement in the task rejection ratio for randomly generated task sets and real task sets under different workloads.
References 1. Saha, P., El-Ghazawi, T.: A Methodology for Automating Co-Scheduling for Reconfigurable Computing Systems. In: Proceedings of the 5th IEEE/ACM International Conference on Formal Methods and Models for Codesign, pp. 159–168 (2007) 2. Walder, H., Platzner, M.: Reconfigurable Hardware Operating Systems: From Concepts to Realizations. In: Proc. of 3rd International Conf. on Engineering of Reconfigurable Systems and Architectures, ERSA (2003) 3. Deng, Q., Wei, S., Xu, H., Han, Y., Yu, G.: A Reconfigurable RTOS with HW/SW Coscheduling for SOPC. In: Yang, L.T., Zhou, X.-s., Zhao, W., Wu, Z., Zhu, Y., Lin, M. (eds.) ICESS 2005. LNCS, vol. 3820. Springer, Heidelberg (2005) 4. Brebner, G.: A Virtual Hardware Operating System for the Xilinx XC6200. In: Glesner, M., Hartenstein, R.W. (eds.) FPL 1996. LNCS, vol. 1142, pp. 327–336. Springer, Heidelberg (1996) 5. Bazargan, K., Kastner, R., Sarrafzadeh, M.: Fast Template Placement for Reconfigurable Computing Systems. IEEE Design and Test of Computers 17, 68–83 (2000)
6. Diessel, O., ElGindy, H., Middendorf, M., Schmeck, H., Schmidt, B.: Dynamic scheduling of tasks on partially reconfigurable FPGAs. In: IEEE Proceedings on Computers and Digital Techniques, May 2000, vol. 147, pp. 181–188 (2000) 7. Merino, P., Jacome, M., Lopez, J.C.: A Methodology for Task Based Partitioning and Scheduling of Dynamically Reconfigurable Systems. In: Proc. IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), pp. 324–325 (1998) 8. Marescaux, T., Bartic, A., Dideriek, V., Vernalde, S., Lauwereins, R.: Interconnection Networks Enable Fine-Grain Dynamic Multi-tasking on FPGAs. In: Glesner, M., Zipf, P., Renovell, M. (eds.) FPL 2002. LNCS, vol. 2438, pp. 795–805. Springer, Heidelberg (2002) 9. Xilinx, Inc.: Virtex 2.5 V Field Programmable Gate Arrays (December 2002) 10. Zhou, X.-G., Wang, Y., Huang, X.-Z., Peng, C.-L.: On-line Scheduling of Real-time Tasks for Reconfigurable Computing System. In: Proc. of FPT Conf. (2006) 11. Danne, K., Platzner, M.: A Heuristic Approach to Schedule Periodic Real-Time Tasks on Reconfigurable Hardware. In: Proc. of International Conference on Field Programmable Logic and Applications, August 2005, pp. 568–573 (2005) 12. Ahmadinia, A., Bobda, C., Teich, J.: A dynamic scheduling and placement algorithm for reconfigurable hardware. In: Müller-Schloer, C., Ungerer, T., Bauer, B. (eds.) ARCS 2004. LNCS, vol. 2981, pp. 125–139. Springer, Heidelberg (2004) 13. Walder, H., Platzner, M.: Non-preemptive Multitasking on FPGA: Task Placement and Footprint Transform. In: Proceedings of the 2nd International Conference on Engineering of Reconfigurable Systems and Architectures (ERSA), pp. 24–30. CSREA Press (June 2002) 14. Steiger, C., Walder, H., Platzner, M.: Heuristics for Online Scheduling Real-time Tasks to Partially Reconfigurable Devices. In: Cheung, P.Y.K., Constantinides, G.A. (eds.) FPL 2003. LNCS, vol. 2778, pp. 575–584. Springer, Heidelberg (2003) 15. Marconi, T., Lu, Y., Bertels, K., Gaydadjiev, G.N.: Online hardware task scheduling and placement algorithm on partially reconfigurable devices. In: Woods, R., Compton, K., Bouganis, C., Diniz, P.C. (eds.) ARC 2008. LNCS, vol. 4943, pp. 306–311. Springer, Heidelberg (2008) 16. Steiger, C., Walder, H., Platzner, M., Thiele, L.: Online Scheduling and Placement of Real-time Tasks to Partially Reconfigurable Devices. In: Proc. of the RTSS 2003 Conf., pp. 224–235 (2003) 17. Qiu, W.-D., Zhou, B., Chen, Y., Peng, C.-L.: Fast on-line real-time scheduling algorithm for reconfigurable computing. In: Proceedings of the Ninth International Conference on Computer Supported Cooperative Work in Design, vol. 2, pp. 793–798 (2005) 18. Roman, S., Mecha, H., Mozos, D., Septien, J.: Constant complexity scheduling for hardware multitasking in two dimensional reconfigurable field-programmable gate arrays. Journal of IET Comput. Digit. Tech. 2(6), 401–412 (2008) 19. Cui, J., Gu, Z., Liu, W., Deng, Q.: An efficient algorithm for on-line soft real-time task placement on reconfigurable hardware devices. In: Proc. of the 10th IEEE International Symposium on Object and Component-Oriented Real-Time Distributed Computing (ISORC), pp. 321–328 (2007) 20. Handa, M., Vemuri, R.: Area Fragmentation in Reconfigurable Operating Systems. In: Proc. of ERSA, pp. 77–83 (2004) 21. Cui, J., Deng, Q., He, X., Gu, Z.: An Efficient Algorithm for Online Management of 2D Area of Partially Reconfigurable FPGAs. In: Proc. Design, Automation and Test in Europe (DATE), pp. 129–134 (2007)
Test Data Compression Using Four-Coded and Sparse Storage for Testing Embedded Core*

Zhang Ling 1,2, Kuang Ji-shun 1, and You Zhi-qiang 3
1 College of Computer and Communication, Hunan University, Changsha, China
{forry1230,jshkuang}@hotmail.com
2 Department of Computer Science, Huangshi Institute of Technology, Huangshi, China
3 School of Software, Hunan University, Changsha, 410082, China
[email protected]
Abstract. This paper presents a new test-data compression technique that uses exactly four codewords and sparse storage for testing embedded cores. It provides a significant reduction in test-data volume without any complex algorithm. It targets the precomputed test data of intellectual-property cores in systems-on-chip and does not require any structural information about the cores. In addition, the decompression logic is very small and can be implemented fully independently of the precomputed test-data set. Experimental results for the ISCAS'89 benchmarks illustrate the flexibility and efficiency of the proposed technique.
1 Introduction

With the increasing scale of SoCs (systems-on-a-chip), testing such large circuits is a challenge due to several limiting factors, such as the large volume of test data, long test application time, high power consumption during test and limited ATE (automatic test equipment) bandwidth. Good compression techniques can test the embedded cores in a SoC without exceeding the limits on power, memory and ATE bandwidth, while keeping the test application time reasonable. Built-in self-test (BIST) reduces the dependency on expensive ATE and allows on-chip testing of faults that are easy to test. In practice, however, BIST cannot always replace other test methods because of the long time needed to detect random-pattern-resistant faults [1]. Test-data compression offers a promising solution to the problem of the large volume of test data used for deterministic test. For deterministic test, there is extensive work on test compression aimed at speeding up the ATE-SoC interaction during test. These works compress the precomputed test-data set (TD), often provided by the core vendor, into a smaller test set TE (|TE| < |TD|) which is then stored in the ATE's memory; an on-chip decoder decompresses TE back to TD, which is applied to the system under test.

* This research is supported by the National Natural Science Foundation of China (NSFC) under grants No. 60773207 and 60673085.
Some compression methods reduce test volume and time through structural solutions that require design modification. The Illinois scan architecture (ILS) [2] needs structural redesign, and VirtualScan [3] broadcasts the same vector to several scan chains; it has a limited ability to deterministically assign values, and top-up patterns in bypass mode may be required to improve test coverage. EDT [4] is a linear-decompression-based scheme that exploits the fact that a test cube generated by ATPG for a circuit with a 1-to-n scan configuration usually contains many unspecified bits; it requires redesigning the circuit and suffers fault-coverage loss. Most compression techniques compress TD without requiring any structural information about the embedded core; these are called algorithmic strategies [5]. Such techniques are typically based on statistical codes, run-length codes, and their variants: alternating run-length coding [5], Golomb coding [6], FDR coding [7], EFDR coding [8] and run-length Huffman coding [9] are all based on variable run-length codes, while 9C [1] and selective Huffman coding [10] are based on fixed blocks. In addition, there are also techniques based on LFSR encoding [11] and dictionary-based compression [12]. A particularly attractive feature of the algorithmic strategy for test compression is that it does not require any redesign of the IP cores. In this paper, we propose a new compression solution that uses four codewords and sparse storage for data blocks. It provides a significant reduction in test-data volume, targets the precomputed data of intellectual-property cores in systems-on-chip, does not require any structural information about the cores, and does not perform any test generation or test-set modification. In addition, the decompression logic is very small. The rest of this paper is organized as follows. Section 2 discusses the proposed compression technique using four codewords and sparse storage. Section 3 optimizes the proposed compression technique. The corresponding decoder architecture is given in Section 4. Experimental results are shown in Section 5, and the last section gives a brief conclusion.
2 Proposed Test Data Compression Technique

This section presents the new data compression technique based on four codewords and sparse storage of blocks. The proposed technique classifies test-set blocks into four types: all-0 blocks, all-1 blocks, sparse blocks and other blocks, identified by four codewords, respectively. Sparse blocks are encoded using sparse storage.

2.1 Four-Coded Technique

The proposed technique divides the input test data into k-bit blocks, which are classified into the four types shown in Table 1. The test set of a circuit forms one long test sequence, and all test patterns are assumed to have the same length. Table 1 shows the proposed four-coded technique for fixed-size blocks. The first column lists the four cases into which test-data blocks are classified. An all-0 block is a block in which every bit can be matched to zero; it is indicated by the codeword '0', as shown in the first row of the table. An all-1 block is a block
in which every bit can be matched to one; it is indicated by the codeword '10', as shown in the second row. A sparse block is a block with a very small number of bits matching one; it is indicated by the codeword '110', and a detailed presentation of its sparse storage is given later in the paper. An other block is a block that does not belong to the previous three cases; it is indicated by the codeword '111', and such blocks are stored as they are.

Table 1. Four-coded technique
Case (i)     | Symbol   | Decoder input
All-0 block  | C1 (0)   | 0
All-1 block  | C2 (10)  | 10
Sparse block | C3 (110) | 110 + sparse storage
Other block  | C4 (111) | 111 + the block
2.2 Sparse Storage

Test sets have many don't-care bits, so besides many all-0 blocks and all-1 blocks there are also many blocks with very few ones in them, which we call sparse blocks. For a sparse block we only need to store the locations of its ones, which is the motivation for sparse storage. Assuming the block length is k, we use ⌈log2 k⌉ bits to store the location of each one in the sparse block. An additional ⌈log2 xmax⌉ bits store the number of ones in the block, where xmax is the maximum number of ones allowed in a sparse block. Thus, the total number of bits for storing a sparse block with x ones is x·⌈log2 k⌉ + ⌈log2 xmax⌉. Both sparse blocks and other blocks contain both ones and zeros; to achieve a higher compression ratio, we discriminate between them as follows: if x·⌈log2 k⌉ + ⌈log2 xmax⌉ is smaller than the cost of storing the block as it is, the block is encoded as a sparse block; otherwise it is treated as an other block.
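A software sketch of the resulting block classifier is shown below. It assumes blocks are given as strings over {'0','1','X'} (X = don't care) and restates the sparse-versus-other decision as a cost comparison against storing the raw k-bit block, which is our reading of the truncated condition above; all names are ours.

#include <cmath>
#include <string>

// Returns the codeword for one k-bit block, following Table 1.
// xmax is the maximum number of 1s allowed in a sparse block.
std::string classifyBlock(const std::string& blk, int k, int xmax) {
    int ones = 0, zeros = 0;
    for (char c : blk) { if (c == '1') ++ones; if (c == '0') ++zeros; }
    if (ones  == 0) return "0";    // every bit can be mapped to 0
    if (zeros == 0) return "10";   // every bit can be mapped to 1
    int locBits = (int)std::ceil(std::log2(k));     // bits per 1-location
    int cntBits = (int)std::ceil(std::log2(xmax));  // bits for the 1-count
    // Sparse storage pays off only if it is shorter than the raw block.
    if (ones <= xmax && ones * locBits + cntBits < k) return "110";
    return "111";                  // block stored as-is after the codeword
}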
3 Optimizing the Four-Coded Technique Using Variable-Length Blocks

In the last section, a single fixed block length (k) was used for the entire test-data set. This has advantages and disadvantages in terms of decoder size and compression
ratio. The main advantage is that the decoder is very small, since a single block length is used for the entire test-data set; however, the technique cannot achieve the highest possible compression ratio. This section presents an optimization of the proposed compression technique. Because the technique uses only four symbols, the decoder remains very simple even with variable block lengths.
(a) Data input (32 bits) and the corresponding symbols:
00XX0X0X → 0 (all-0 block)
1XXX1XXX → 10 (all-1 block)
00001000 → 110 (sparse block)
01011101 → 111 (other block)
(b) Sparse block storage: 00001000 is stored as symbol + number of 1s + location of each 1, i.e., 110+0+100 (only one 1, at location 4)
(c) Other block storage: 01011101 is stored as it is: symbol + block
(d) Final decoder input: 0 10 1100100 11101011101 (21 bits)

Fig. 1. A simple example of the proposed compression technique
The basic idea behind the optimization is to find the best k for each test pattern in a test-data set in order to achieve the best compression ratio. In this case, different k's may be obtained for different patterns: for patterns containing a small number of don't-cares, a smaller k may be obtained, while a larger k is expected for patterns with a higher number of don't-cares. Figure 2 shows an example of finding k for optimization. Figure 2(a) shows n=3 patterns (T1, T2 and T3), each containing m=32 bits; these three test patterns form a test sequence of 96 bits. The final compressed data size when using a fixed block size of k=8 for all three test patterns is |TE|=38 bits, as indicated in Figure 2(b). In the next step, we optimize the technique using variable-length blocks. As shown in Figure 2(c), k1=4, k2=8, and k3=32 result in the best compression for test patterns T1, T2 and T3, respectively, and the final compressed data size is 24 in this case. This example shows that using a fixed-length block for a test-data set does not guarantee the best results; in other words, if we use a different k for different patterns, a higher compression ratio will most likely be achieved.
(a) Three patterns:
T1: 00X0 11XX 00XX 1X1X 00XX 1XX1 XXX1 X0XX
T2: 0XXX 0000 11X1 1X1X 1XXX XXX1 1X1X X00X
T3: 0XXX XX0X 000X 0XXX XXXX XXX0 XX00 00XX

(b) Compression using a fixed k=8 for the entire test-data set:
T1 symbols: 110 110 110 110 (codeword size: 6*4=24)
T2 symbols: 0 10 110 0 (codeword size: 1+2+6+1=10)
T3 symbols: 0 0 0 0 (codeword size: 1+1+1+1=4)
Final compressed data size = 38

(c) Compression using different k's for different patterns:
k=4: T1 symbols: 0 10 0 10 0 10 10 0 (codeword size: 12)
k=8: T2 symbols: 0 10 10 110 (codeword size: 11)
k=32: T3 symbols: 0 (codeword size: 1)
Final compressed data size = 24

Fig. 2. Compression using fixed and variable block length
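The search for the best k per pattern amounts to trying each admissible block length and keeping the one that minimizes the encoded size. The sketch below restates the block classification inline to stay self-contained; it is our own illustration of the procedure, not the authors' code.

#include <climits>
#include <cmath>
#include <string>

// Bits needed to encode pattern 'pat' with block length k under the
// four-coded scheme of Section 2 (payload included for sparse/raw blocks).
int encodedSize(const std::string& pat, int k, int xmax) {
    int locBits = (int)std::ceil(std::log2(k));
    int cntBits = (int)std::ceil(std::log2(xmax));
    int total = 0;
    for (size_t i = 0; i < pat.size(); i += k) {
        std::string blk = pat.substr(i, k);
        int ones = 0, zeros = 0;
        for (char c : blk) { if (c == '1') ++ones; if (c == '0') ++zeros; }
        if      (ones == 0)  total += 1;                       // "0"
        else if (zeros == 0) total += 2;                       // "10"
        else if (ones <= xmax && ones * locBits + cntBits < k)
            total += 3 + cntBits + ones * locBits;             // "110"+sparse
        else total += 3 + k;                                   // "111"+raw
    }
    return total;
}

// Best even k that divides the pattern length L (4 <= k <= L).
int bestBlockLength(const std::string& pat, int xmax) {
    int L = (int)pat.size(), bestK = L, bestBits = INT_MAX;
    for (int k = 4; k <= L; k += 2) {
        if (L % k != 0) continue;
        int bits = encodedSize(pat, k, xmax);
        if (bits < bestBits) { bestBits = bits; bestK = k; }
    }
    return bestK;
}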
A test-data set can be viewed as a long sequence of |TD| = n·m bits, where n is the number of test patterns and m is the length of each test pattern. However, for many applications, such as scan, we can regroup the test data into sizes other than m. For more flexibility in compression, we can regroup the test data into L-bit vectors (instead of the original m-bit patterns), as in [1]; in general, L can be smaller or larger than m. Figure 3 gives an example of regrouping test patterns. Based on the three test patterns shown in Fig. 2(a), the test sequence length is |TD| = 96 bits. We create four new test patterns, T1', T2', T3', and T4', each with L=24 bits, as shown in Figure 3(a). If we again use a fixed block size of k=8, the final compressed data size is the same as in Figure 2(b). However, Figure 3(b) shows that using a different k for these patterns provides a higher compression ratio by reducing the final compressed data size: k1=4, k2=8, k3=12, and k4=24 result in the best compression for T1', T2', T3' and T4', respectively. This eventually provides the best compression for the entire test-data set. As in [1], achieving the best compression requires finding two factors: (1) the length of the new groups (also called patterns hereafter), L, and (2) the block length Kj for each L-bit pattern. To reduce the complexity of finding the Kj's and of the decoding structure for
any L’s only those Kj’s which are divisible by L’s(starting from K=4) are considered in our technique. For example, when L=32, only Kj=4, 8, 16, and 32 are tried for each test pattern. Mathematically speaking, Kj is even, 4≤Kj≤L and L MOD Kj =0. When a test-data set with |TD|-bits is divided into L-bit patterns, overall ( ⎡| TD | / L⎤ ) new L-bit pattern are created which is the same as required number of Kj’s, therefore, T D=
∑⎡
|TD |/ L ⎤
j=1
Lj .
Kj is fixed for each pattern but variable across the test-data set, so the decoder needs to know Kj for each L-bit pattern. Since Kj is fixed within an L-bit pattern, it is sent to the on-chip decoder just before the codewords of that pattern. Therefore, one Kj is sent to the on-chip decoder for each L-bit pattern, and overall
⌈|TD|/L⌉ Kj's are sent. Let G denote the total number of distinct divisors of L starting from 4. For example, if L=32, the divisors are Kj=4, 8, 16, and 32; thus G=4 and only two bits are required to identify each Kj. Therefore, the total number of bits sent to the decoder, |TE|, equals the final compressed data size plus ⌈|TD|/L⌉ · ⌈log2 G⌉. The advantages of this technique are that: 1) it is test-data independent and can thus be reused even if the test-data set changes; and 2) it achieves a higher compression ratio than the fixed-pattern technique. The decoding architecture is discussed in the next section.
T1’: 00X0 11XX 00XX 1X1X 00XX 1XX1 T2’: XXX1 X0XX 0XXX 0000 11X1 1X1X T3’:1XXX XXX1 1X1X X00X 0XXX XX0X (a) regroup three patterns into four patterns
K=4:
T1 symbols: 0 10 0 10 0 10 codeword size:9
K=8:
T2 symbols: 110 0 10
codeword size: 6
K=12: T3 symbols: 10 0
codeword size: 3
K=24: T4 symbols: 0
codeword size: 1
(b) compression new patterns using variable block length size
Fig. 3. Example of optimization using regroup test patterns
4 Decompression Architecture

This section presents a small and flexible decompression architecture for the proposed compression technique. After extracting the best Kj for each pattern, the Kj's are sent to the on-chip decoder along with the codewords. Figure 4 shows the decoder architecture for the proposed data-independent compression solution. The decoder consists of a finite-state machine (FSM), counter0, counter1, counter2, a Kj register, an L register and an x register, as indicated in the figure. The ATE sends L to the FSM, which stores it in the L register to be used for the entire test-data set. To transfer the codewords corresponding to an L-bit test pattern to the decoder, the related Kj is sent first; the FSM detects Kj and stores it in the Kj register. The FSM then takes data from Data_in to determine which codeword has been entered. If the input codeword is C1 or C2, all Kj bits are generated as zeros or ones, respectively. If the FSM receives the codeword C3, the ⌈log2 xmax⌉ bits representing the number of 1s in the sparse block are sent to the x register under control of the FSM, and the ⌈log2 k⌉ bits representing each location of a 1 in the sparse block are sent to counter0; the decoder then generates a 1 at the corresponding location and 0s at the other locations under control of counter0 and counter1. After one pattern has been decoded under control of counter2, the next Kj is sent to the FSM and the process repeats until all patterns are decoded.
Fig. 4. Decoder for data-independent technique
5 Experimental Results

In today's large circuits, we expect only 1%-5% of the bits in a test-data set to be specified [13]. In our experiments, the test-data set is assumed to be the only data provided, and no structural information about the core is required; we do not combine ATPG and compression as some other techniques do. The compression ratio (CR) is computed as (|TD|-|TE|)/|TD|.

Table 2. Predetermined test set statistics
circuit | #cells | #vectors | test data volume (bits) | X (%)
S5378   | 214    | 111      | 23754                   | 72.62
S9234   | 247    | 159      | 39273                   | 73.01
S13207  | 700    | 236      | 165200                  | 93.15
S15850  | 611    | 126      | 76986                   | 83.33
S38417  | 1664   | 99       | 164736                  | 68.08
S38584  | 1464   | 136      | 199104                  | 82.28
The experiments were conducted on a series of ISCAS'89 circuits, on a Pentium with a 3.0 GHz processor and 512 MB of memory, using the MinTest test sets [14]. The related test-set statistics are listed in Table 2, including the number of cells, the number of test vectors and the test-data volume; all achieved fault coverages are 100%. The experimental results are given in Table 3 and Table 4.

Table 3. Comparison of test-data volume (TE) for different L's

circuit | TD     | L=15  | L=20  | L=40  | L=60  | L=80  | L=100 | L=200 | L=400 | L=500
S5378   | 23754  | 11491 | 11681 | 11164 | 11015 | 11031 | 11316 | 11306 | 11589 | 11697
S9234   | 39273  | 18817 | 18938 | 18174 | 18023 | 17974 | 18274 | 18080 | 18055 | 18185
S13207  | 165200 | 34698 | 37575 | 29650 | 27065 | 25434 | 24947 | 23992 | 23273 | 23642
S15850  | 76986  | 26845 | 26613 | 24466 | 23796 | 23308 | 23264 | 21976 | 23156 | 23465
S38417  | 164736 | 65679 | 66775 | 63372 | 62607 | 62553 | 65217 | 65609 | 66043 | 68232
S38584  | 199104 | 70307 | 71459 | 65616 | 63810 | 63012 | 63514 | 62786 | 62721 | 63955
Table 4. Comparison of compression ratio for the proposed technique with other test compression techniques

circuit | 9C [1] | RL Huffman [9] | SL Huffman [10] | Block Code [15] | Proposed solution
S5378   | 51.64  | 53.75          | 55.10           | 54.98           | 53.63
S9234   | 50.91  | 47.59          | 54.20           | 51.19           | 54.23
S13207  | 82.31  | 82.51          | 77.00           | 84.89           | 85.91
S15850  | 66.38  | 67.34          | 66.00           | 69.49           | 71.45
S38417  | 60.63  | 64.17          | 59.00           | 59.39           | 62.03
S38584  | 65.53  | 62.40          | 64.10           | 66.86           | 68.50
average | 62.90  | 62.96          | 62.57           | 64.47           | 65.96
Table 3 shows the compressed test-data volumes of the ISCAS'89 benchmarks for different L's using the proposed compression technique; the Kj's are sent to the on-chip decoder along with the codewords. The CRs for different L's are very close because the proposed technique finds the best compression by tuning K for each test pattern; for very small and very large L's such tuning is harder, and the CR values are overall lower. The maximum CR for each benchmark is written in boldface font. In Table 4 we compare our compression ratio with several classical compression techniques: columns two to five give the compression-ratio results for 9C [1], run-length Huffman coding [9], selective Huffman coding [10] and block merging [15]. As is evident from the table, the proposed compression technique provides an improved compression ratio.
6 Conclusion

This paper presented a new test-data compression technique that uses four codewords and sparse storage for testing embedded cores. It provides a significant reduction in test-data volume without any complex algorithm. It targets the precomputed data of intellectual-property cores in systems-on-chip, requires no structural information about the cores, and performs no test generation or test-set modification. In addition, the decompression logic is very small and can be implemented fully independently of the precomputed test-data set. Experimental results for the ISCAS'89 benchmarks illustrate the flexibility and efficiency of the proposed technique.
References 1. Tehranipoor, M., Nourani, M., Chakrabarty, K.: Nine-Coded Compression Technique for Testing Embedded Cores in SoCs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 13, 719–731 (2005) 2. Hsu, F., Butler, K., Patel, J.: A case study on the implementation of the Illinois scan architecture. In: Proc. Int. Test Conf. (ITC'01), pp. 538–547 (2001) 3. Wang, L.-T., Wang, Z., Wen, X., et al.: VirtualScan: Test Compression Technology Using Combinational Logic and One-Pass ATPG. IEEE Design & Test of Computers, 122–129 (2008) 4. Rajski, J., Tyszer, J., Kassab, M., Mukherjee, N.: Embedded Deterministic Test. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 23(5), 776–792 (2004) 5. Chandra, A., Chakrabarty, K.: A Unified Approach to Reduce SOC Test Data Volume, Scan Power and Testing Time. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 22(3), 352–362 (2003) 6. Chandra, A., Chakrabarty, K.: System-on-a-chip Test-data Compression and Decompression Integrated Circuits and Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 20, 355–368 (2001) 7. Chandra, A., Chakrabarty, K.: Frequency-directed Run-length (FDR) Codes with Application to System-on-a-Chip Test Data Compression. In: Proceedings of the 19th IEEE VLSI Test Symposium (VTS 2001), pp. 42–47 (2001)
8. EL-Maleh, A.H.: Test data compression for system-on-a-chip using extended Frequency-Directed Run-Length Code. IET Computers & Digital Techniques, 155–163 (2008) 9. Nourani, M., Tehranipour, M.: RL-Huffman encoding for test compression and power reduction in scan application. ACM Trans. Des. Autom. Electron. Syst., 91–115 (2005) 10. Jas, A., Ghosh-Dastidar, J., Ng, M., Touba, N.: An efficient test vector compression scheme using selective Huffman coding. IEEE Trans. Comput. Aided Des., 797–806 (2003) 11. Jutman, A., Alekejev, I., Raik, J., et al.: Reseeding using Compaction of Pre-Generated LFSR Sub-Sequences, pp. 1290–1295. IEEE, Los Alamitos (2008) 12. Knieser, M., Wolff, F., Papachristou, C., Wyer, D., McIntyre, D.: A technique for high ratio LZW compression. In: Proc. Design Automation Test in Europe, pp. 116–121 (2003) 13. Hiraide, T., Boateng, K., Konishi, H., Itaya, K., Emori, M., Yamanaka, H.: BIST-aided scan test—A new method for test cost reduction. In: Proc. VLSI Test Symp. (VTS'03), pp. 359–364 (2003) 14. Hamzaoglu, I., Patel, J.H.: Test set compaction algorithms for combinational circuits. In: Proc. Int. Conf. Computer-Aided Design, pp. 283–289 (1998) 15. EL-Maleh, A.H.: Efficient test compression technique based on block merging. IET Computers & Digital Techniques, 327–335 (2007)
Extending a Multicore Multithread Simulator to Model Power-Aware Hard Real-Time Systems

José Luis March, Julio Sahuquillo, Houcine Hassan, Salvador Petit, and José Duato

Department of Computer Engineering (DISCA), Universidad Politécnica de Valencia, Valencia, Spain
[email protected], {jsahuqui,husein,spetit,jduato}@disca.upv.es
Abstract. Increasing computational requirements have led to the use of multicore multithread processors in embedded real-time systems. Although these processors are more efficient, they are also more complex and power-hungry, so energy consumption has become a major concern in these systems. In this context, new designs are being researched to deal with these limitations. On the other hand, simulators play an important role in research since they can reliably evaluate different research proposals. In this paper we propose extensions to Multi2Sim, a multicore multithread processor simulator, to support hard real-time systems with dynamic voltage scaling capability. In addition, we discuss the main guidelines for modelling power-aware hard real-time systems, which will be useful for extending other processor simulators with similar purposes. Finally, different partitioning algorithms are compared through simulation experiments to show how they affect system performance.
1 Introduction

Processor evolution has been very fast; a rule similar to Moore's Law would be inconceivable in many other industries. The positive consequences can be seen in the continuous efficiency improvements from embedded systems to high-performance supercomputers, passing through the personal computers that have modified people's lifestyles and daily routines. Taking advantage of the growth in the number of operations that processors can perform, embedded systems now offer better functionality, and complex and important tasks can be assigned to them. That is the case for hard real-time systems, where missing task deadlines is disallowed, for example aeroplane on-board systems, where a failure could mean the loss of many human lives. Due to the complexity of current architectures and the high cost of implementing experimental microprocessor designs just to test them, simulators have become a very useful tool, as their cost is lower and their results are reliable. This paper presents an extension of Multi2Sim [1], a processor simulation framework, to support hard real-time tasks and model a power-aware system. The extensions implemented in the simulator make periodic task repetition possible, with a
deadline-based task priority system. Processor frequencies are also modelled, including the latency penalty when a frequency change occurs. As missing deadlines is not permitted in hard real-time systems, the Dynamic Voltage Scaling (DVS) technique [2] is implemented to allow the system to increase its frequency when required by the workload, using a strategy that ensures deadline fulfilment. A task partitioner module is also added to balance the workload according to a given partitioning algorithm. Among current simulators, Multi2Sim has been chosen because it integrates the most important characteristics of the principal existing simulators as well as interesting additional features: separate functional and timing simulation, cache coherence protocols, sharing strategies for pipeline stages, memory hierarchy configuration and support for multicore multithreading. Besides, Multi2Sim is an open-source project, downloadable from [3], and anyone can start to develop their own extensions. The remainder of this paper is structured as follows. Section 2 presents related work. Section 3 describes the main characteristics of the Multi2Sim simulation model. Section 4 explains the proposed extensions in more detail. Section 5 describes some illustrative simulation examples, focused on power-aware partitioning algorithms. Finally, Section 6 presents some concluding remarks.
2 Related Work

Currently there are many simulation frameworks for testing new processor designs. SimpleScalar [4], which is widely accepted, models a superscalar processor; it has been extended in several ways, including support for multithreading [5][6][7] and energy consumption [8]. SimpleScalar and Multi2Sim are application-only simulators, that is, they directly simulate the behaviour of applications, in contrast with full-system simulators, which boot an operating system on which the applications run. The former have the advantage that no extra load is simulated that could influence the final results, and their overall computational cost is lower; on the other hand, the better accuracy provided by full-system simulators is not usually required. A full-system simulator example is Simics [9], which also has multiple modules developed to extend its functionality, such as GEMS [10]. Among its features, GEMS provides the timing-first approach adopted in Multi2Sim and kept unmodified in the proposed extended power-aware real-time version. The timing-first approach implies a timing module tracing the state of the pipeline during instruction execution, followed by a functional module that actually executes the instructions and guarantees correct execution paths. Regarding power consumption in real-time applications, many research papers have been published, considering either periodic [11] or aperiodic tasks [12], multithread [13] or multicore processors [14]. In [13] the OS is also involved in trying to improve performance on a simultaneous multithreading processor, while in [14] DVS is used to reduce energy in a symmetric multiprocessor. Note that DVS is a commonly implemented technique for energy saving, as in [15], where it is stated that WF (Worst Fit) is the best partitioning heuristic in terms of energy performance. In fact, choosing a proper partitioning algorithm is a critical decision in these systems. To sum up, this is an active research field, and the proposed extensions can help researchers to carry out modifications of other simulators with similar purposes.
3 Multi2Sim Simulation Features

Multi2Sim is a cycle-by-cycle execution-driven simulation framework for evaluating processors. As in real processors, three main parts can be distinguished: the core, the cache hierarchy and the interconnection network. A single real chip can include more than one core (multicore), and each of those cores can be composed of various hardware threads (multithread). Multicore multithread processors can be modelled in this simulator with three multithreading paradigms: Fine-Grain Multithreading (FGMT), Coarse-Grain Multithreading (CGMT) and Simultaneous Multithreading (SMT). Several configurations can also be chosen for the sharing strategies of pipeline stages and resources. Memory accesses imply some latency cycles during which the processor stalls waiting for the required data; if more hierarchy levels need to be accessed to obtain the data, this penalty has a negative influence on processor performance. Therefore, it is important to model different memory hierarchies with diverse sharing strategies of cache levels among cores and threads, in order to evaluate their impact on global system efficiency. Multiple processor units demand a mechanism to guarantee the coherence of data stored in certain cache levels, typically a coherence protocol for sharing data among cores. So, apart from transferring ordinary data through cache levels, the interconnection network should be able to support the load overhead caused by coherence messages. In Multi2Sim, various topologies can be selected to analyze their impact on system performance.

3.1 Simulation Models

Multi2Sim uses three different simulation techniques: functional simulation, detailed simulation and event-driven simulation (timing simulation is performed by the latter two). Functional simulation is implemented as an autonomous module that provides an interface to the rest of the simulator. It does not consider any hardware structures, such as cores or threads; it just deals with software contexts. Its main functions are to create and destroy contexts, perform program loading, enumerate existing contexts and consult their status, execute machine instructions and handle speculative execution. In detailed simulation, the specific hardware microarchitecture is taken into account, with elements such as pipeline structures (stage resources, instruction queues, reorder buffer, etc.), the branch predictor, cache memories and segmented functional units. Each cycle, the detailed simulation module uses the functional simulation module interface to update the context status, and analyzes the recently executed instructions, accounting for the operation latencies caused by hardware structures. With functional and detailed simulation built as independent modules, the implementation of machine instruction behaviour can be centralized in a single file (functional simulation), and the function calls that activate hardware components (detailed simulation) return the latency required for them to complete. In some situations that latency cannot be calculated when the function is called and needs to be simulated cycle by cycle; that is the case for the interconnection network and cache memories.
In that situation, an event-driven simulation module is required to obtain the delays of the message transfers caused by memory accesses.
4 Proposed Extensions

In order to make Multi2Sim able to support real-time tasks and, at the same time, model a power-aware system, several extensions have been implemented in an additional module. This Power-Aware Real-Time (PART) module is in charge of: i) managing periodic task repetition with alternating active and inactive periods; ii) creating a deadline-based task priority system; iii) supporting processor frequencies; iv) modelling latency cycles for frequency changes; and v) providing a task partitioner to distribute the workload among cores. These assignments can be seen as PART sub-modules, and a more detailed description of each is presented below.

4.1 Tasks Repetition

In the original Multi2Sim, tasks (benchmarks) just arrive, run, and finish. Real-time tasks instead enter and leave the system constantly, alternating a number of consecutive active and inactive periods until the end of the simulation (periodic task repetition). When a task finishes, it can be scheduled for another execution period or leave the system (active-inactive transition). In the latter case, the task remains out of the system for some consecutive periods and then enters it again (inactive-active transition). Having active and inactive periods allows system mode changes to be simulated. Figure 1 shows an example of a task with 2 active and 3 inactive periods. In this sub-module, the chosen approach to modelling periodic task repetition is to repeat the program loading process: when the last instruction of a task finishes its commit stage, the task is removed and immediately reloaded, which includes creating the context and restoring data, arguments, environment variables, etc.
Fig. 1. Active and Inactive Periods
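A minimal sketch of the per-task bookkeeping this sub-module needs is given below; the structure and names are our own illustration, not Multi2Sim code.

// A task alternates phases of consecutive active periods (the program is
// reloaded each period) and consecutive inactive periods.
struct PeriodicTask {
    int activePeriods;    // length of an active phase, in periods
    int inactivePeriods;  // length of an inactive phase, in periods
    int remaining;        // periods left in the current phase
    bool active;          // true while inside an active phase
};

// Called at each period boundary: toggle the phase when the current one
// is exhausted; the caller repeats program loading while 'active' holds.
bool onPeriodEnd(PeriodicTask& t) {
    if (--t.remaining == 0) {              // current phase exhausted
        t.active = !t.active;              // active <-> inactive transition
        t.remaining = t.active ? t.activePeriods : t.inactivePeriods;
    }
    return t.active;
}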
4.2 Priorities

In order to model a more realistic system, it is also necessary to increase the number of tasks that the simulator can schedule, because the original Multi2Sim only accepts as many tasks as the total number of threads in the modelled system, that is, the number of threads per core multiplied by the number of cores. This change implies
implementing a task queue for each core, so active tasks can wait for a chance to use the processor when a running task finishes. These are priority queues based on the EDF (Earliest Deadline First) algorithm, although other algorithms can be applied. Coarse-grain multithreading is the assumed paradigm, that is, thread switches occur when a long-latency event (e.g., a main memory access) appears. In this context, task priorities, in the EDF sense, are taken into account: at the beginning of the simulation, the highest-priority tasks are launched for execution. A task switch caused by a long-latency event in the highest-priority thread enables the highest-priority active task among the remaining ones to use the processor. When the long-latency event is resolved, preemption is applied to allow the stalled highest-priority thread to continue execution. In short, the processor must be occupied by the highest-priority task not stuck in a long-latency event. Moreover, preemption must also be applied when a higher-priority task becomes active while all the threads of the core are occupied; in that case, the lowest-priority task among the executing ones is replaced by the new one. This expulsion requires saving the execution state of the replaced task, so it can be resumed later when any mapped task finishes. The aim is to ensure that the tasks with the closest deadlines run regardless of when they arrived in the system. Unlike the other extensions proposed, priority thread switching requires altering the original Multi2Sim pipeline: the instruction fetch stage, where thread switching occurs, must be modified so that task deadlines are considered in the decision of whether to change the current thread.
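The per-core ready queue can be expressed directly with an EDF comparator; a sketch under our own naming:

#include <queue>
#include <vector>

struct RtTask { long long deadline; int id; };

// Earliest deadline = highest priority: order the heap so that the task
// with the smallest deadline is at the top.
struct EdfOrder {
    bool operator()(const RtTask& a, const RtTask& b) const {
        return a.deadline > b.deadline;
    }
};

using ReadyQueue = std::priority_queue<RtTask, std::vector<RtTask>, EdfOrder>;

// Usage: one ReadyQueue per core. q.top() is dispatched next, and a newly
// active task with an earlier deadline preempts the lowest-priority
// running task when all hardware threads are occupied.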
4.3 Frequencies

Multi2Sim does not model the processor clock frequency, but the number of latency cycles requested for a main memory access is a configurable parameter, so the relative speed between CPU and memory is used to model different processor frequency levels. In this way, for a given memory latency, a faster processor stalls for more processor cycles than a slower one; therefore, in each processor cycle there will potentially be more memory events serviced in the slower processor than in the faster one. Once the system is able to run tasks at diverse frequencies, a global DVS technique is assumed in order to control the system speed depending on the workload requirements. The system frequency is only recalculated when the workload changes, that is, when a task enters or leaves the system. Regarding hard real-time simulation, this sub-module must ensure that no deadline is missed; to carry out this assignment, task utilization is used. The utilization u of a task is defined as:

$u = WCET / p$    (1)
Notice that in (1), while the period (p) is constant, the Worst Case Execution Time (WCET) changes depending on the system speed: a low frequency implies a large WCET, increasing the task utilization. In this context, the sub-module implementing DVS works as follows. Starting at the lowest frequency, the utilization of each task is obtained; then, for each core, the utilizations of all tasks allocated to that core are added. This process is repeated at increasing frequencies until the sum is lower than 1 for every core, and that frequency is selected. This strategy not only makes sure all deadlines are fulfilled, but also takes power consumption into account, as the chosen frequency is no higher than strictly necessary.
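The frequency-selection strategy can be sketched as follows: starting from the lowest level, pick the first frequency at which every core's summed utilization stays below 1. WCETs are assumed to scale inversely with frequency, and all identifiers are our own.

#include <vector>

// wcetAtFmax[c][i]: WCET of task i on core c at the highest frequency;
// period[c][i]: its period; freqs: available levels in ascending order.
double selectFrequency(const std::vector<std::vector<double>>& wcetAtFmax,
                       const std::vector<std::vector<double>>& period,
                       const std::vector<double>& freqs) {
    for (double f : freqs) {                       // lowest level first
        bool ok = true;
        for (size_t c = 0; c < wcetAtFmax.size() && ok; ++c) {
            double u = 0.0;                        // summed Eq. (1) per core
            for (size_t i = 0; i < wcetAtFmax[c].size(); ++i)
                u += (wcetAtFmax[c][i] * freqs.back() / f) / period[c][i];
            ok = (u < 1.0);
        }
        if (ok) return f;                          // slowest safe frequency
    }
    return freqs.back();                           // fall back to max level
}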
4.4 Latency of Frequency Changes

Frequency changes are not instantaneous; some time is needed to overcome the voltage difference between the old frequency and the new one. Another consideration is the power overhead caused by these changes. In real processors frequency changes are gradual, but here a simpler approach is modelled: only the old and the new frequencies are considered, with a latency depending on the worst case. If the system requests a frequency rise, it keeps working at the same speed for some penalty cycles before the frequency actually changes; for a decrease request, the process is done in the opposite order, with the frequency change performed first and the penalty cycles accounted afterwards. In both cases, the number of penalty cycles is proportional to the voltage difference between the two frequencies. If the system can run tasks at more than two different frequencies, a single change may pass through some intermediate frequency levels; in that case, frequency changes are modelled by repeating the aforementioned process for each pair of adjacent frequency levels. For example, in a system with 3 frequency levels, a change could be requested from the maximum to the minimum level; this is modelled by changing from the maximum speed to the intermediate one, waiting some penalty cycles, and then changing to the minimum frequency, followed by some more penalty cycles (Figure 2 shows this case and the inverse). All these changes cause additional power consumption. To model this power overhead, it is assumed that the system power consumption is that of the higher of the two frequencies implied in the change, independently of whether it is an increase or a decrease. When the frequency is constant, both power and speed refer to the same frequency, but when a change occurs the worst case is modelled for each magnitude during the transition: the worst case regarding speed is the lower frequency, while the worst case regarding power is the higher frequency.
Fig. 2. Decreasing and Increasing Frequency
4.5 Task Partitioner
The last extension is a task partitioner module. This element is in charge of allocating tasks to the available cores as they arrive at the system. This distribution is a major concern, since it strongly influences system efficiency
in terms of power consumption and task execution time. Many different partitioning algorithms can be applied, and the workload balance among the cores depends on the selected algorithm. Good algorithms are those that balance the workload well, so that the system can work at lower frequencies and reduce power consumption. For instance, in a system with two cores, a well-balanced partitioning could yield a utilization of 0.5 on each core, while another algorithm could allocate the tasks so that one core has a utilization of 0.2 and the other 0.8. The former algorithm is better than the latter: its task distribution allows the system to reduce power consumption by working at a lower global speed (no core can be powered off), whereas with the second distribution the system may be unable to work at any lower frequency, because doing so could risk deadline fulfilment on one of the cores.
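As a sketch of such a partitioner, the following Worst Fit heuristic (one of the algorithms evaluated in the next section) places each arriving task on the least-loaded core; the names are illustrative. First Fit would instead take the first core with enough room, and Best Fit the most loaded core that still fits.

/* Worst Fit: choose the core with the lowest accumulated utilization
 * that can still accept the task, which tends to balance the workload. */
int worst_fit(double core_util[], int num_cores, double task_util)
{
    int best = -1;
    for (int c = 0; c < num_cores; c++)
        if (core_util[c] + task_util < 1.0 &&
            (best < 0 || core_util[c] < core_util[best]))
            best = c;
    if (best >= 0)
        core_util[best] += task_util; /* commit the allocation */
    return best; /* -1: the task fits on no core at this frequency */
}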
5 Simulation Examples
Some simulation experiments were run to show the applications of the new extensions. Three well-known partitioning algorithms are tested: First Fit (FF), Best Fit (BF) and Worst Fit (WF). For these experiments, a multicore processor is considered, with three hardware threads per core, an in-order issue policy and a two-instruction-per-cycle stage bandwidth. Regarding the memory hierarchy, data caches are disabled, so main memory is accessed instead (100-cycle latency).
5.1 Frequencies Configuration
Regarding DVS, three frequency-level configurations are taken into account. The 5L configuration allows the system to work with the five available frequency levels (Table 1). The second one, referred to as 3L, considers three levels (the two extreme cases of 5L and the intermediate one). Finally, in the 2L case only the highest and the lowest frequencies are permitted. Another important issue is the number of latency cycles required each time a frequency change is requested. To obtain this value, the voltage levels of the frequencies involved in the change are taken into account, assuming a 1 mV/1 μs voltage transition rate [16]. Table 1 shows the penalty cycles for each frequency-level switch, depending on the DVS configuration. As explained in Section 4.4, the lower frequency (the worst case) of the two involved in the change spends these penalty cycles, independently of the kind of change (increase or decrease).
Table 1. Frequency Changes Latency Cycles

Freq (MHz)  Energy (pJ/cycle)  Tcycle (μs)  Voltage (V)  Latency Cycles
                                                          5L      3L      2L
500         450.0              0.0020       0.84          -       -       -
400         349.2              0.0025       0.73         44,000   -       -
300         261.5              0.0033       0.62         33,000  66,000   -
200         186.3              0.0050       0.51         22,000   -       -
100         123.8              0.0100       0.3967       11,000  22,000  44,000
For example, if a frequency rise is requested from 100 MHz to 500 MHz under the 3L configuration, the system first works for 22,000 cycles at 100 MHz; when that penalty finishes, it keeps executing tasks at 300 MHz for 66,000 cycles before reaching 500 MHz.
5.2 Mix Analysis
The performance of the system can vary significantly, especially in terms of power consumption, depending on the characteristics of the selected workload and on the planning of active and inactive periods. The simulator has executed a set of tasks from the WCET analysis project benchmarks [17]. Heterogeneous mixes have been designed, made up of tasks (25 different benchmarks are used) with different utilizations, period lengths, numbers of repetitions, and alternations of active and inactive periods. Notice that without enough frequency changes the algorithms could not act in a meaningful way and would not be tested correctly. However, the number of changes should also be limited due to their power overhead; too many changes may imply a consumption rise.
5.3 Heuristics Comparison
These experiments are aimed at comparing how the different partitioning algorithms (FF, BF and WF) influence the system power consumption, since a good algorithm distributes tasks among cores trying to balance the workload as much as possible. To this end, the global utilization must: a) increase and decrease; and b) be high enough. In this way, the balance of the workload has a higher impact on power consumption.
Fig. 3. Relative Consumption
The system has been tested assuming a dual-core processor for mixes 1 and 2, and a quad-core processor for mixes 3 and 4. The results are obtained by multiplying the number of cycles simulated at each frequency by the energy required to work one cycle at that frequency, assuming a Pentium M processor [18]. This value is then compared to the
energy the system uses executing the mix all the time at the maximum speed. For instance, a value of 0.25 means that the energy consumed was 25% of that dissipated working all the time at 500 MHz, that is, 4 times less energy. Figure 3 plots these results. The main conclusion that can be drawn is that, in general, the WF algorithm is a better partitioner than FF and BF in terms of power consumption, because with it the system achieves greater energy savings; moreover, the results obtained by FF and BF are not very different. It can also be noticed that the number of frequencies the system can choose from is a key issue: with more frequency levels, the selected one can be closer to the optimal frequency required, whereas if only a few frequencies are available, the system may have to choose a much higher speed, resulting in a rise of the power consumption. Another interesting remark, especially given the current trend in the industry towards increasing the number of cores, is that the system behaves in a similar way independently of the number of cores.
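The relative-consumption metric of Figure 3 can be reproduced from the per-frequency cycle counts and the Table 1 energies. The sketch below assumes, for simplicity, that the reference run at maximum speed takes the same total number of cycles; the names are illustrative.

/* Energy actually dissipated divided by the energy of running every
 * cycle at the top frequency (e.g., 0.25 means 4 times less energy). */
static const double energy_pj[5] = {123.8, 186.3, 261.5, 349.2, 450.0};

double relative_consumption(const long cycles[5])
{
    double e = 0.0;
    long total = 0;
    for (int f = 0; f < 5; f++) {
        e += cycles[f] * energy_pj[f]; /* energy at each frequency level */
        total += cycles[f];
    }
    return e / ((double)total * energy_pj[4]); /* vs. all-500 MHz baseline */
}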
6 Conclusions
This paper identifies the main issues that must be tackled to support a power-aware hard real-time system in a detailed processor simulator. The underlying ideas can be used to extend other processor simulators to implement systems with analogous characteristics. While the main features of the original Multi2Sim simulator have been kept, some extensions, especially aimed at controlling the real-time task flow and the processor speed, have been introduced. Among the principal extensions implemented are: task repetition, a task priority system, a frequency model, a frequency-change strategy that uses global DVS to ensure deadline fulfilment, and a task partitioner. Some illustrative experiments were also carried out to show the impact of a good choice of partitioning algorithm on the system power consumption, which is a major concern nowadays.
Acknowledgments
This work was supported by Spanish CICYT under Grant TIN2006-15516-C04-01, by Consolider-Ingenio 2010 under Grant CSD2006-00046, by Explora-Ingenio under Grant TIN2008-05338-E, and by Generalitat Valenciana under Grant GV/2009/043.
References
1. Ubal, R., Sahuquillo, J., Petit, S., López, P.: Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors. In: 19th International Symposium on Computer Architecture and High Performance Computing (2007)
2. Hung, C.M., Chen, J.J., Kuo, T.W.: Energy-Efficient Real-Time Task Scheduling for a DVS System with a non-DVS Processing Element. In: 27th IEEE International Real-Time Systems Symposium, pp. 303–312 (2006)
3. The Multi2Sim Simulation Framework Website, http://www.multi2sim.org
4. Burger, D.C., Austin, T.M.: The SimpleScalar Tool Set, Version 2.0. Technical Report CS-TR-1997-1342 (1997)
5. Madon, D., Sanchez, E., Monnier, S.: A Study of a Simultaneous Multithreaded Processor Implementation. In: European Conference on Parallel Processing (1999)
6. Sharkey, J.: M-Sim: A Flexible, Multithreaded Architectural Simulation Environment. Technical Report CS-TR-05-DP01, Department of Computer Science, State University of New York at Binghamton (2005)
7. Tullsen, D.M.: Simulation and Modelling of a Simultaneous Multithreading Processor. In: 22nd Annual Computer Measurement Group Conference (December 1996)
8. Zhang, Y., Parikh, D., Sankaranarayanan, K., Skadron, K., Stan, M.: HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects. Technical Report CS-2003-05, Dept. of Computer Science, Univ. of Virginia
9. Magnusson, P.S., Christensson, M., Eskilson, J., Forsgren, D., Hallberg, G., Hogberg, J., Larsson, F., Moestedt, A., Werner, B.: Simics: A Full System Simulation Platform. IEEE Computer 35(2) (2002)
10. Marty, M.R., Beckmann, B., Yen, L., Alameldeen, A.R., Xu, M., Moore, K.: GEMS: Multifacet's General Execution-driven Multiprocessor Simulator. In: International Symposium on Computer Architecture (2006)
11. Zhu, Y., Mueller, F.: Feedback EDF Scheduling of Real-Time Tasks Exploiting Dynamic Voltage Scaling. Real-Time Systems Journal (December 2005)
12. Sharma, V., Thomas, A., Abdelzaher, T.F., Skadron, K., Lu, Z.: Power-aware QoS Management in Web Servers. In: 24th IEEE Real-Time Systems Symposium, Cancun, Mexico, pp. 52–63 (2003)
13. Cazorla, F.J., Knijnenburg, P.M., Sakellariou, R., Fernández, E., Ramirez, A., Valero, M.: Predictable performance in SMT processors: Synergy between the OS and SMTs. IEEE Transactions on Computers 55(7) (2006)
14. Aydin, H., Yang, Q.: Energy-Aware Partitioning for Multiprocessor Real-Time Systems. In: 17th International Parallel and Distributed Processing Symposium, Workshop on Parallel and Distributed Real-Time Systems (2003)
15. AlEnawy, T., Aydin, H.: Energy-Aware Task Allocation for Rate Monotonic Scheduling. In: 11th IEEE RTAS, Washington, DC, USA, pp. 213–223 (2005)
16. Wu, Q., Reddi, V.J., Wu, Y., Lee, J., Connors, D., Brooks, D., Martonosi, M., Clark, D.W.: A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance. In: MICRO-38 (2005)
17. Mälardalen Real-Time Research Centre (MRTC): WCET analysis project (2009), http://www.mrtc.mdh.se/projects/wcet/benchmarks.html
18. Watanabe, R., Kondo, M., Imai, M., Nakamura, H., Nanya, T.: Task scheduling under performance constraints for reducing the energy consumption of the GALS multi-processor SoC. In: Design Automation and Test in Europe (2007)
Real-Time Linux Framework for Designing Parallel Mobile Robotic Applications Joan Aracil, Carlos Domínguez, Houcine Hassan, and Alfons Crespo Department of Computer Engineering (DISCA) Universidad Politécnica de Valencia Valencia, Spain [email protected]
Abstract. A real-time emotional architecture (RTEA) for building parallel robotic applications is presented. RTEA allows the application developer to focus on the design and implementation of the agent processes, because the architecture itself autonomously decides how much attention to pay to each of these processes. From the functional point of view, an RTEA agent selects and adapts its objectives depending on its physical (actuators) and mental (processing) capabilities. This characteristic makes the architecture a useful solution for applications that have to deal with several simultaneous tasks, that have real-time constraints, and whose objectives are defined in a flexible way. From the viewpoint of the design and development of applications, RTEA defines its different entities as independent modules. This modularity makes it easier for the programmer to develop each part of the project. To control the processing capacity of the agent and to guarantee the fulfilment of the temporal constraints of the processes, RTEA has been implemented in a real-time kernel (rt-linux). Mobile robot experiments have been carried out to show how the emotional system influences the mental organisation of the robot when it performs navigational tasks under different environmental conditions.
1 Introduction
The integration of embedded and real-time systems with the physical world via sensors and actuators (adaptive computing systems) is creating a nascent infrastructure for a technical, economic and social revolution [15]. In most real-time applications, the tasks have different criticality, flexible timing constraints and variable execution times [6]. RTEA is an agent-based architecture for real-time applications where the attention of the mental processes (thoughts) is guided by an emotional system. The advantages of RTEA are twofold. Firstly, from the functional point of view, an RTEA agent selects and adapts its desires (objectives) depending on its physical (actuators) and mental (thoughts) capabilities. That is, its workload is always maintained, in real time, within a capacity range that allows the agent to recover from any situation without endangering system integrity. This characteristic makes the architecture a useful solution for applications that have to deal with simultaneous tasks, that have real-time requirements, and whose objectives are defined in a flexible way. Secondly, from the viewpoint of the design and development of applications, RTEA defines the
different entities (concepts, desires, thoughts, emotions) as independent modules. This modularity makes it easier for the programmer to develop each part of the project. When the programmer defines the specific behaviours of each situation, he or she only has to specify the interfaces that the emotional system and the attention system require to interact with each of the thoughts; after that, the relations among the different modules are solved automatically. A thought is built and maintained to satisfy a desire. When a desire is formulated and introduced into the agenda, the attention system starts a thought associated with that desire. The thought encapsulates a mental process based on the agent method that satisfies the desire. A thought is active while its associated desire is active, so the life cycle of a thought is tied to the life cycle of its associated desire. An RTEA agent has the autonomy to relax its functional requirements if its capacity to accomplish the objectives could endanger its strict security requirements. The attention system establishes the attention criteria to negotiate with the thoughts their dedication based on motivation. The allocated dedication defines the cost and the period of the thought, and the thought adjusts its functional goal to the assigned processing dedication. This flexibility to adjust the functional objectives to the mental capacity makes RTEA useful in applications where the agent has to perform several simultaneous real-time tasks. To control the mental capacity (processor) of the agent and to guarantee the fulfillment of the temporal constraints of the thoughts, the RTEA architecture has been implemented in a real-time kernel (rt-linux). In addition, different tools for the specification of applications and the monitoring of their execution have been developed. After this introduction, Section 2 reviews the most representative work on real-time control applications and emotional robotic models. In Section 3, the main entities of the RTEA architecture are detailed, and the affective system and its influence on the behavioural organisation are described. The implementation of the architecture in rt-linux is presented in Section 4. Two experiments showing the effectiveness of the emotional regulator and how the mobile robot behaviour is affected by different attitudes under different environmental conditions are analysed in Section 5. Finally, the conclusions are summarised in Section 6.
2 Related Work
In the literature there are different papers dealing with task management in hard real-time control applications. Beccari [4] proposed a rate-adaptation scheduling framework for soft real-time tasks characterized by a range of admissible rates: when the rate of some task is required to change, the remaining soft real-time tasks can be adapted based on heuristic criteria. In [14, 10, 11] a feedback control scheduler based on feedback control theory is used with a proper QoS actuator to adjust the task QoS levels in order to minimize the deadline miss ratio. Ramanathan [13] develops an interesting technique for overload management in hard real-time control applications: a scheduling policy deterministically guaranteeing m out of any k periodic task activations, along with a methodology able to minimize the effects of missed control-law updates. In [8, 1] periodic computations are modelled as springs with given elastic coefficients and minimum lengths; the spring elastic coefficients are used to change the rates of the periodic tasks under overload conditions or when variations in task execution rates are requested.
Different emotional architectures can be found in the literature. Moshkina proposed the affect model TAME to assist in creating better human-robot interaction [12]. The model captures the interaction between different time-varying, affect-related phenomena, such as traits, attitudes, moods, and emotions; traits and attitudes determine the robot disposition and are time invariant. A partial integration of the affect model into the MissionLab system has been undertaken. At the MIT Media Lab, Cynthia Breazeal developed one of the first social robots, Kismet [7]. Kismet's motivation system consists of drives and emotions. The affective space is defined by three dimensions: arousal, valence and stance; the emotion is computed as a combination of contributions from drives, behaviours, and perceptions. Arbib proposed a robot composed of a set of basic functions, each with a perceptual schema and access to various motor schemas [2]. The perceptual schema evaluates the current state and sets an urgency-level parameter for activating the motor schemas. A motivational system adjusts the relative weighting of the different functions, raising the urgency level for one system while lowering it for others, depending on the context. In general, research on emotional systems has focused on human-robot interaction [3, 7, 2]. The work presented in this paper focuses instead on emotion as a mechanism for organising parallel behaviours and on its implementation in a real-time system. Aspects such as the organisation of tasks to fulfill the constraints imposed by time, physical limits and energy resources are considered in this paper.
3 Organisation of the Components of RTEA
Figure 1 shows the five principal systems of RTEA: Belief, Affective, Behavioural, Attention and Relation.
Fig. 1. RTEA subsystems and flow
The belief system maintains a logical image of the environment, which the processes in execution permanently read and update. The fundamental elements of this system are the concepts, which represent conscious abstractions of the data. The affective system is the motor of the mental organisation. It manages a set of emotions as the basic mechanism for altering the anima, which represents the degree of
motivation of each of the active thoughts. An emotional state is activated by the assessment of the concepts, and the result is an adjustment of the motivation of the thoughts. The behavioural system defines the behaviour of the robot. The main entity of this system is the thought. A thought is built associated with a special concept called a desire; the motive of the thought is to satisfy the desire. The attention system organises the execution of the processes. This system negotiates with the thoughts in order to obtain the information needed to guarantee their execution (security requirements) and to determine the degree of satisfaction of their desires (functional requirements). Finally, the relation system communicates the agent with its environment. The main entities of RTEA are concepts, desires, thoughts and emotions. The concepts are the main entities of the belief system. They are the fundamental elements of the conscious knowledge that are generated, updated, applied or destroyed by the thoughts. The concepts are not simple data containers; instead, they play an active role in the organisation of the agent conscience. A concept is characterised by Expression (1).
Ci = (vi, δi, Ii, Ωi)    (1)
where vi is the data value, which depends on the application; it is generated by the thoughts related to concept i. δi is the confidence of the concept, a measure of the validity of the value over time; it takes a real value in the range [0, 1] and determines the applicability of the concept at a given time instant. Ii is the importance of the concept; it represents the quality of the value when it is generated by a thought and varies from 0 to 1. Ωi is the set of thoughts that have access to the concept. A desire is represented by a situation in the environment, a situation being a state at a given time. The desires are created by the thoughts in the process of problem solving, and a desire is described in a flexible way: it is defined as the appraisal of a concept based on a satisfaction model. The desire satisfaction model assesses the possible situations in a range from 0 (no satisfaction) to 1 (total satisfaction). Since the appraisal of the satisfaction should be simple and fast to compute, combinations of sigmoid, trapezoidal and linear functions are used. The thoughts are the entities of the behaviour system. They are activated to satisfy the agent desires. A thought is characterised by Expression (2).
Thi = (Ci, mi, Si, Pi, ICi, OCi)    (2)
where Ci is the thought computation time, mi is the motivation value inherited from the desire, Si is the expectation of satisfaction and represents the degree of fulfilment of the desire, Pi is the execution period, ICi is the set of input concepts, and OCi is the set of output concepts related to thought i. Some thoughts interact with the belief system by generating output concepts based on input concepts and other unconscious data; other thoughts change the image of the environment based on the sensory information and send actions to the motor system. The emotion is the regulator entity of the affective system. It represents the description of a situation, and its activation affects the organisation of the conscience of
the agent in order to reach the desired relation between the agent and its environment. An emotion is defined in Expression (3):
ei = (ai, sti, ri)    (3)
where ai is the emotion activation parameter, sti ∈ [0, 1] represents the emotion state, and ri is the response of emotion i, which can consist of changes to thought motivations, changes of attitude, or the creation and destruction of thoughts. The emotions are classified into intrinsic system emotions and application emotions. The former are defined automatically, and each of them is associated with a thought to manage its motivation; they are responsible for organising the execution of the thoughts. The latter are defined by the programmer and are used to solve specific problems such as robot navigation and manipulation; their emotional response consists of the creation and destruction of desires and thoughts.
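Expressions (1)-(3) map naturally onto data structures. The following C sketch is one possible transcription; all type and field names are illustrative assumptions, not taken from the RTEA implementation.

typedef struct thought thought_t;

/* Concept, Expression (1): Ci = (vi, δi, Ii, Ωi) */
typedef struct {
    void *value;          /* vi: application-dependent data value         */
    double confidence;    /* δi in [0,1]: validity of the value over time */
    double importance;    /* Ii in [0,1]: quality when generated          */
    thought_t **omega;    /* Ωi: thoughts with access to the concept      */
    int n_omega;
} concept_t;

/* Thought, Expression (2): Thi = (Ci, mi, Si, Pi, ICi, OCi) */
struct thought {
    double computation;   /* Ci: computation time                         */
    double motivation;    /* mi: inherited from the desire                */
    double satisfaction;  /* Si: expected fulfilment of the desire        */
    double period;        /* Pi: execution period                         */
    concept_t **in, **out; /* ICi, OCi: input and output concepts         */
    int n_in, n_out;
};

/* Emotion, Expression (3): ei = (ai, sti, ri) */
typedef struct emotion {
    double activation;    /* ai: activation parameter                     */
    double state;         /* sti in [0,1]: emotion state                  */
    void (*response)(struct emotion *self); /* ri: adjust motivations,
                             change attitude, or create/destroy thoughts  */
} emotion_t;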
4 RTEA Linux Implementation
RTEA has been implemented in the real-time linux kernel [5]. Figure 2 shows how the components of the architecture have been structured, together with the tools for developing applications. RTEA applications (agent.conf) are specified with the lex and yacc tools. A specific parser (agent.ini) transforms the application entities into real-time linux data structures and tasks. After the initialisation, the components of the emotional agent can be changed or updated with a Linux process (agent.mod). Likewise, monitoring tools (log_event and log_app) have been developed in the Linux space to inspect the state of the agent objects and tasks while the agent executes in the rt-linux space. The Linux/rt-linux communication is performed through fifo channels.
Fig. 2. Architecture implementation
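As an illustration of the fifo-based communication, the sketch below pushes monitoring records from the rt-linux space to a Linux-side reader. It assumes the classic rt-linux FIFO API (rtf_create/rtf_put, with the fifo visible as /dev/rtfN from Linux); the record layout and fifo number are illustrative, not those of the actual tools.

#include <rtl_fifo.h>

#define LOG_FIFO 3   /* appears as /dev/rtf3 on the Linux side */

struct log_event {
    int thought_id;
    double motivation;
    long long timestamp;
};

/* Called once from the rt-linux module initialisation. */
int init_logging(void)
{
    return rtf_create(LOG_FIFO, 4096);   /* 4 KB kernel-side buffer */
}

/* Called from real-time context: non-blocking write of one record,
 * which the log_event Linux process reads with ordinary read() calls. */
void log_motivation(int id, double m, long long now)
{
    struct log_event ev = { id, m, now };
    rtf_put(LOG_FIFO, (char *)&ev, sizeof(ev));
}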
4.1 Task Model and Scheduling
The computational requirements of the emotional model entities (thoughts, attention system, and perception and actuator processes) are variable; that is, both their computation time and their frequency vary. Therefore, a task model composed of fixed and variable periodic tasks has been incorporated into the rt-linux kernel [9]. The rt-linux kernel is adapted to include the two task sets of Equation (4).
T = T^f ∪ T^v .    (4)
T^f is the set of fixed periodic tasks and T^v is the set of variable periodic tasks. The former represents processes that have fixed temporal requirements (e.g., servomotor control), and the latter supports the execution of processes that depend on the environment and/or on the robot dynamics (e.g., perception tasks). A fixed periodic task, Ti^f, is characterised by the temporal parameters of Equation (5).
Ti^f = {Ci, Pi, Di, Pri, φi, mi}    (5)
where Ci is the worst-case computation time, Pi is the period, Di is the deadline, Pri is the priority, φi is the offset and mi is the motivation of task i. The characteristics of a variable periodic task, Ti^v, are expressed with the temporal parameters of Equation (6):

Ti^v = {Ci(k1), Pi(k2), Di(k2), Pri, φi, mi}    (6)
In this case, the computation time Ci is variable and depends on the parameter k1, which could represent, for example, the number of objects in the environment. The Pi and Di attributes are also variable and depend on the parameter k2, which could be the robot speed. The relationship that links the speed and the periods is stated in Equation (7):

Pi = αi · (1 / vc)    (7)
where Pi is the period of task i, αi is a distance constant and vc is the current robot speed. The attention system embeds the rate-monotonic scheduling algorithm to guarantee the execution of the fixed and variable periodic tasks, and hence of the global load of the emotional robot. When the requirements of the robot vary, the variable periodic tasks introduce changes in the run-time workload. To guarantee the schedulability of the new load, the attention system performs a schedulability analysis for fixed-priority pre-emptive systems [9]. If the load is schedulable, it is prepared for execution; otherwise, the dedication negotiation between the attention system and the robot thoughts is launched to decide how to assign the mental dedication to the thoughts.
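A sketch of the variable-task bookkeeping might look as follows. The classic Liu-Layland bound is used here as the schedulability test purely for illustration (the exact analysis of [9] may differ), and all names are assumptions.

#include <math.h>

/* Variable periodic task of Equation (6): the period and deadline
 * shrink as the robot speeds up, following Pi = αi / vc (Equation (7)). */
typedef struct {
    double C;     /* worst-case computation time (may depend on k1) */
    double alpha; /* distance constant αi                            */
    double P, D;  /* current period and deadline                     */
} vtask_t;

void update_periods(vtask_t *t, int n, double robot_speed)
{
    for (int i = 0; i < n; i++) {
        t[i].P = t[i].alpha / robot_speed; /* Equation (7) */
        t[i].D = t[i].P;                   /* implicit deadlines assumed */
    }
}

/* Sufficient rate-monotonic test: sum(Ci/Pi) <= n(2^(1/n) - 1).
 * On failure, the dedication negotiation with the thoughts starts. */
int rm_schedulable(const vtask_t *t, int n)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += t[i].C / t[i].P;
    return u <= n * (pow(2.0, 1.0 / n) - 1.0);
}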
5 Experimental Evaluation
The experiments have been carried out to show how emotional appraisals influence the mental organisation of a mobile robot, and hence its behaviour, when it performs navigational tasks under different environmental conditions. First, the effectiveness of the emotional regulator has been assessed; second, different attitudes have been incorporated and their influence on the degree of satisfaction of the desires has been analysed.
5.1 Problem Statement
The emotional mobile robot navigates in a non-uniform environment and has to reach different desired situations, as can be seen in Figure 3.
Fig. 3. Simulation Framework
Fig. 4. Real-time processing model
The navigational conditions include aspects such as the characteristics of the terrain and the obstacle density. A model of the emotional robot is shown in Figure 4; it can be seen that the tracking force of the robot is controlled by the emotional regulator.
5.2 Emotional Regulator
In the first experiment, the effectiveness of the emotional appraisal learning system of the emotional regulator has been assessed to show how the desires of the mobile robot are satisfied. The types of emotional appraisals that have been trained are fear and valence. Four fear emotions, related to speed, obstacles, battery energy and desire, are considered. The valence, or bravery, emotion is defined as a relation of the mass, the speed, the distance to the objective and the distance to the obstacle. The move thought is responsible for satisfying the situation desires. The evolution of the fear and valence emotion levels is shown in Figure 5a.
Fig. 5. a) Emotional appraisals. b) Move behaviour motivation.
The fear emotion has two contributions in this experiment; to show the effect of the two fear appraisals, their sum is plotted in Figure 5a. It can be observed that the valence emotion starts with the maximum value (1, at t = 0 s), when the robot is far
from the desires, and progressively decreases as the robot approaches them (e.g., t = 250 s, t = 1000 s). The speed fear emotion starts at its lowest value (0, at t = 0 s), since the robot is stopped, and increases as the robot accelerates (t ∈ [0, 300] s). The other fear contribution is added when the robot approaches the obstacles (e.g., 1.5 at t = 300 s). Figure 5b shows the evolution of the move behaviour motivation. This motivation is calculated as a weighted sum of the emotional appraisals that contribute to this behaviour. Since the experiment evaluates the mechanism that adjusts the emotional sensibility, the motivation is obtained directly as the emotional state, and the robot behaviour control is considered proportional to the motivation. Initially the move behaviour has the maximum motivation (1) to reach the desire; while approaching the goal, the motivation progressively decreases (e.g., 0.5 at t = 125 s, 0 at t = 300 s).
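A minimal sketch of that weighted sum is shown below; the weights and the clamping to [0, 1] are illustrative assumptions.

/* Motivation of a behaviour as a weighted sum of the emotional
 * appraisal states that contribute to it. */
double behaviour_motivation(const double state[], const double weight[], int n)
{
    double m = 0.0;
    for (int i = 0; i < n; i++)
        m += weight[i] * state[i];
    if (m < 0.0) m = 0.0; /* clamp to the valid motivation range */
    if (m > 1.0) m = 1.0;
    return m;
}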
5.3 Attitude Influence in the Regulation
The second experiment analyses how the degree of satisfaction of the desires of the mobile robot is influenced by different attitudes embedded in the robot under different navigational conditions. The robot starts from an initial situation (t = 0.0 s, s = 0.0 m) and has to reach the desired situation formulated in Figure 8; that is, it should reach a position in the range [15, 20] meters within the time range [0, 10] seconds. The situation desire is assumed to have an initial importance of 0.5. The move thought is launched to satisfy the situation desire. It competes for the agent attention with a list of thoughts: (P1, 0.7), (P2, 0.6), (P3-move, 0.5 + Δmotivation(urgency)), (P4, 0.4). In each attention cycle the thoughts are prioritized by their motivation, and the satisfaction of the move thought depends on the urgency that influences its motivation. Three different navigational conditions are considered: in the first case the difficulties are constant, in the second they get worse, and in the third they improve. The dedications of P1, P2 and P4 are constant, while move has a dedication that depends on the urgency. The dedication model and the navigational conditions together determine the maximum speed. The urgency is defined as the temporal distance between the expected situation and the desired situation. The minimum-satisfaction situation is t = 10 s, s = 15 m, and the maximum-satisfaction situation is t = 10 s, s = 20 m. The robot has the following attitudes: careless, conservative, insensitive and persuasible. The progress of the robot towards the proposed desire under the considered navigational conditions, faced with the predefined attitudes, is summarised in Figure 6. When the robot is insensitive to the urgency, the desire is unsatisfied because the dedication of the move thought is insufficient: its motivation is lower than that of the rest of the thoughts and has not been increased by the effect of the urgency. The desire satisfaction can be improved when the robot is not insensitive to the urgency. If the robot is careless, the desire is satisfied only when the environmental conditions are favourable (constant or improving). If the robot is conservative, the desire satisfaction improves; however, under worsening conditions the desire remains unsatisfied. When the urgency is considered, the move thought can take precedence over other thoughts, and this extra mental dedication can facilitate the desire satisfaction.
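As a sketch of one attention cycle under this scheme, the code below boosts the motivation of the move thought with an attitude-dependent sensitivity to the urgency and reorders the thoughts; everything here is illustrative. An insensitive attitude would correspond to a sensitivity of 0, and a conservative one to a high sensitivity.

#include <stdlib.h>

typedef struct {
    const char *name;
    double motivation;
} th_t;

static int by_motivation(const void *a, const void *b)
{
    double d = ((const th_t *)b)->motivation - ((const th_t *)a)->motivation;
    return (d > 0) - (d < 0);
}

/* One attention cycle: urgency is the temporal distance between the
 * expected and the desired situation; sensitivity encodes the attitude. */
void attention_cycle(th_t th[], int n, int move, double urgency, double sensitivity)
{
    th[move].motivation = 0.5 + sensitivity * urgency; /* (P3-move, 0.5 + Δ) */
    qsort(th, n, sizeof(th_t), by_motivation);         /* highest first */
}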
Fig. 6. Desire satisfaction
The final satisfaction depends on the unpredictable navigational conditions. The hard environmental conditions of the worsening case make it difficult to reach even the minimal satisfaction levels. But even though the future conditions are not completely predictable, the conservative attitude anticipates the possibility of worse future conditions better than the other attitudes. In this experiment, the conservative attitude of the robot yields the best progress and desire-satisfaction results.
6 Conclusions
A real-time emotional architecture (RTEA) for building parallel robotic applications has been proposed, and the different entities of the architecture have been detailed. RTEA allows the application developer to focus on the design and implementation of the agent behaviours, because the architecture itself autonomously decides how much attention to pay to each of these behaviours. In this approach, since the real-time requirements are crucial, the agent always has to adjust its goals to its processor (mental) capacities. The aim is to maximize the average satisfaction of several desires while keeping the safety requirements in mind, instead of abandoning those desires that cannot be accomplished completely. The implementation of the architecture in rt-linux has been tackled based on a suitable task model and scheduling algorithms. The architecture has several functional blocks and its global validation is difficult, so each functional unit has been evaluated one at a time. Two experiments analysing the degree of fulfillment of the desires of a mobile robot depending on different attitudes and environmental conditions have been carried out. Future work will focus on completing the definition of the robot characters and attitudes of the affective system, and on defining benchmarks to assess their influence on the behavioural regulation.
Acknowledgments
This work was supported by the Universidad Politecnica de Valencia, Vice-Rectorat of Research, under grant UPV-VI/4349.
References
1. Abeni, L., Buttazzo, G.: Hierarchical QoS Management for Time Sensitive Applications. In: Proceedings of the IEEE Real-Time Technology and Applications Symposium, Taipei, Taiwan (May 2001)
2. Arbib, M., Fellous, J.M.: Emotions: from Brain to Robots. TRENDS in Cognitive Sciences 8(12) (December 2004)
3. Arkin, R.: Moving Up the Food Chain: Motivation and Emotion in Behavior-Based Robots. In: Fellous, J.M., Arbib, M.A. (eds.) Who Needs Emotions? The Brain Meets the Robot. Oxford University Press, Oxford (2004)
4. Beccari, G., Caselli, S., Zanichelli, F.: A technique for adaptive scheduling of soft real-time tasks. Real-Time Systems 30(3), 187 (2005)
5. Barabanov, M., Yodaiken, V.: Introducing real-time Linux. Linux Journal 34 (February 1997)
6. Bautista, D., Sahuquillo, J., Hassan, H., Petit, S., Duato, J.: A Simple Power-Aware Scheduling for Multicore Systems when Running Real-Time Applications. In: IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2008 (2008)
7. Breazeal, C.: Emotion and Sociable Humanoid Robots. International Journal of Human Computer Studies 59, 119–155 (2003)
8. Buttazzo, G.C., Lipari, G., Abeni, L.: Elastic task model for adaptive rate control. In: Proc. IEEE Real-Time Systems Symposium, Madrid, Spain (1998)
9. Hassan, H., Simó, J., Crespo, A.: Flexible real-time mobile robotic architecture based on behavioural models. Journal of Engineering Applications of Artificial Intelligence 10(14), 685–702 (2002)
10. Lu, C., Stankovic, J.A., Abdelzaher, T.F., Tao, G., Son, S.H., Marley, M.: Performance specifications and metrics for adaptive real-time systems. In: Proc. IEEE Real-Time Systems Symposium, RTSS'00, Orlando, FL (2000)
11. Lu, C., Stankovic, J., Tao, G., Son, S.: Feedback Control Real-Time Scheduling: Framework, Modeling and Algorithms. Special Issue on Control-Theoretical Approaches to Real-Time Computing, Journal of Real-Time Systems 23(1/3) (May 2002)
12. Moshkina, L., Arkin, R.C.: On TAMEing Robots. In: Proc. IEEE International Conference on Systems, Man and Cybernetics, Georgia Tech., Atlanta, GA, USA, October 5-8 (2003)
13. Ramanathan, P.: Overload management in real-time control applications using (m,k)-Firm Guarantee. IEEE Transactions on Parallel and Distributed Systems 10(6) (1999)
14. Stankovic, J.A., Lu, C., Son, S.H., Tao, G.: The case for feedback control real-time scheduling. In: Proc. Euromicro Conference on Real-Time Systems, ECRTS'99, York, UK (1999)
15. Stankovic, J.A., Lee, I., Mok, A., Rajkumar, R.: Opportunities and obligations for Physical Computing Systems. IEEE Computer 38(11), 23–31 (2005)
Author Index
Abbes, Heithem I-287 Alvarez-Bermejo, Jose Antonio I-454 An, Hong II-32 Aracil, Joan II-454 Araújo, Guido I-499 Arul, Joseph M. I-205 Awwad, Ahmad II-1
Bahig, Hazem M. II-391 Bai, Yuebin I-363, I-391 Bai, Yuein I-324 Baran, Jonathan II-79 Bassiri, Maisam Mansub II-422 Behki, Akshay Kumar I-476 Belabbas, Yagoubi II-112 Bellatreche, Ladjel I-124 Benkrid, Soumia I-124 Bhat, Srivatsa S. I-476 Bossard, Antoine I-511 Brock, Michael II-254 Butelle, Franck I-287 Buyya, Rajkumar I-13, I-266, I-351 Calheiros, Rodrigo N. I-13 Cao, Qi II-235 Cao, Qian II-308 Cérin, Christophe I-287 Chai, Ian I-266 Château, Frédéric II-281 Cha, Youngjoon II-136 Chedid, Fouad B. I-44 Chen, Bo I-79 Chen, Bo-Han II-90 Chen, Ching-Wei II-22 Chen, Gang I-193 Cheng, Tangpei II-413 Chen, Pan I-149 Chen, Quan-Jie II-338 Chen, Shih-Chang I-278 Chen, Tai-Lung I-278 Chen, Tianzhou I-136, II-11 Chen, Tsung-Yun I-205 Chen, Yan II-318 Chiang, Tzu-Chiang I-538 Chikkannan, Eswaran I-266
Cho, Hyeonjoong II-42 Cho, Hyuk I-32 Chung, Hua-Yuan I-205 Chung, Yongwha II-42 Church, Philip II-188 Chu, Tianshu I-174, I-404, I-441 Chu, Wanming I-54 Cong, Ming II-32 Crespo, Alfons II-454 Cuzzocrea, Alfredo I-124 Dai, Jian I-441 Dai, Kui I-149 Domínguez, Carlos II-454 Duato, José II-444 Duncan, Ralph II-52 Emeliyanenko, Pavel I-427 Ercan, Tuncay II-198 Fang, Weiwei I-441 Fang, Yi-Chiun II-166 Fujita, Satoshi II-235 Gong, Chunye I-416 Gong, Zhenghu I-416 Goscinski, Andrzej II-188, II-225, II-254 Guo, Yupeng II-289 Haddad, Bassam II-1 Hai, Ying-Chi II-68 Han, Xiaoming I-113 Hassan, Houcine II-444, II-454 Hayes, Donald II-79 He, Haohu II-308 Hobbs, Michael II-225 Hsieh, Chih-Wei II-297 Hsu, Chih-Hsuan I-299 Hsu, Ching-Hsien I-278 Hsu, Yarsun II-166 Huang, Libo I-226 Huang, Po-Jung II-338 Huang, Tian-Liang II-338 Huang, Xiang II-308 Hu, Changjun II-308
Hung, Chia-Lung II-381 Hu, Wen-Jen II-90, II-121 Hwang, Guan-Jie I-205 Hwang, Wen-Jyi II-381 Inoue, Hirotaka II-146 Iwasaki, Tatsuya II-264 Iwasawa, Nagateru II-264 Jain, Praveen I-476 Jemni, Mohamed II-328 Jiang, Guanjun I-136, II-11 Jiang, Haitao I-79 Jianzhong, Zhang II-244 Jingdong, Xu II-244 Ji-shun, Kuang II-434 Ji, Xiaohui II-413 Jou, Yue-Dar II-275 Ju, Lihan II-11 Jungck, Peder II-52 Kai, Pan II-244 Kaneko, Keiichi I-511, II-264 Kang, Mikyung I-528, I-549 Kang, Min-Jae I-549 Kayed, Ahmad II-1 Kestener, Pierre II-281 Hariprasad, K. I-476 Kim, Dongwan II-348 Kim, Hansoo I-520 Kim, HyungSeok I-520 Kim, Jae-Jin I-520 Kim, Jong-Myon I-487 Kim, Mihye II-348 Kim, Seong-Baeg I-549 Kim, Seong Baeg I-528 Kim, Seongjai II-136 Kojima, Kazumine II-100 Kouki, Samia II-328 Kuo, Chin-Fu II-68 Kuo, Sheng-Hsiu II-297 Kwon, Bongjune I-32 Ladhari, Talel II-328 Lai, Kuan-Chou II-338 Lee, Cheng-Ta II-131 Lee, Eunji II-42 Lee, Hyeongok II-348 Lee, Jonghyun I-520 Lee, Junghoon I-528, I-549
Lee, Liang-Teh II-22 Lee, Shin-Tsung II-22 Lee, Sungju II-42 Lee, You-Jen I-205 Lee, Young Choon I-381 Lefèvre, Christophe II-188 Li, Bo I-404 Li, Dandan II-413 Li, Hui-Ya II-381 Li, Keqiu I-559 Li, Kuan-Ching II-338 Lim, Sang Boem I-520 Lin, Chih-Hao II-121 Lin, Cho-Chin I-299 Lin, Chun Yuan II-178 Lin, Fu-Jiun I-205 Ling, Zhang II-434 Lin, Hua-Yi I-538 Lin, Kai I-559 Lin, Reui-Kuo II-297 Lin, Te-Wei I-91 Lin, Xiaola I-163 Lin, Yeong-Sung II-131 Li, Shigang II-308 Liu, Bing I-215 Liu, Dong II-318 Liu, Feng I-193 Liu, Jie I-416 Liu, Jingwei I-136 Liu, Xiaoguang I-102, I-215, I-236, II-289 Liu, Yi I-441 Li, Wang II-218 Li, Wing-Ning II-79 Li, Yamin I-54 Luan, Zhongzhi I-174, I-404 Luo, Cheng I-324, I-363 Ma, Jianliang I-136, II-11 March, José Luis II-444 Matsumae, Susumu I-186 Mehnert-Spahn, John I-254 Meng, Xiangshan I-559 Meriem, Meddeber II-112 Mihailescu, Marian I-337 Misbahuddin, Syed I-313 Miyata, Takafumi II-401 Mohan, Madhav I-476 Mourelle, Luiza de Macedo II-156 Muthuvelu, Nithiapidary I-266
Nedjah, Nadia II-156 Nguyen, Man I-463 Nicácio, Daniel I-499 Noor, Fazal I-313 Okamoto, Ken II-358 Ok, MinHwan I-246
Pan, Sung Bum II-42 Park, Gyung-Leen I-549 Pathan, Al-Sakib Khan II-208 Peng, Shietung I-54, I-511 Petit, Salvador II-444 Phan, Hien I-463 Porter, Cameron II-79 Prabhu, Vishnumurthy I-476 Qian, Depei I-174, I-404, I-441 Qin, Jin I-416 Qin, Tingting I-215, II-235 Qu, Haiping I-113 Raghavendra, Prakash I-476 Ranjan, Rajiv I-13 Rao, Jinli I-149 Raposo, Sérgio de Souza II-156 Roca-Piera, Javier I-454 Ross, Kenneth II-52 Ryoo, Rina I-520 Sahuquillo, Julio II-444 Saika, Yohei II-358 Sakib, Md. Sabbir Rahman II-208 Salehi, Mohsen Amini I-351 Santana Farias, Marcos II-156 Saquib, Nazmus II-208 Schoettner, Michael I-254 Schweiger, Tom II-79 Shahhoseini, Hadi Shahriar II-422 Sheng, Yonghong I-65 Sheu, Wen-Hann II-297 Shieh, Jong-Jiann II-368 Shih, Kuei-Chung II-178 Shi, Qingsong II-11 Soh, Ben I-463 Song, Xiaoyu I-193 Song, Yongzhi I-215 Sugimoto, Kouki II-358 Sui, Julei I-236
Sun, Chao-Ming II-275 Sun, Tao II-32
Taheri, Javid I-381 Tang, Chuan Yi II-178 Tang, Minghua I-163 Tang, Xingsheng II-11 Tan, Qingping I-193 Teo, Yong Meng I-337 Teyssier, Romain II-281 Thapngam, Theerasak I-1 Thejus, V.M. I-476 Tong, Jiancong I-236 Tseng, Chien-Kai II-166 Wang, Chen I-381 Wang, Chung-Ho I-91 Wang, Dongsheng I-65 Wang, Gang I-102, I-215, I-236, II-289 Wang, Hui I-174 Wang, Qun II-413 Wang, Rui I-174 Wang, Xiuwen I-113 Wang, Yaobin II-32 Wang, Yongjian I-404 Wang, Zhiying I-226 Watanabe, Tatsuro II-264 Wei, Qiying II-235 Wei, Su I-1 Wei, Xin I-391 Wong, Adam II-188 Woo, Jung-Hun I-520 Wu, Dan I-149 Xie, Jing I-416 Xiong, Naixue II-318 Xu, Cong I-363 Xu, Dan I-65 Xu, Lu I-113 Xu, Yun I-79 Yamamoto, Yusaku II-401 Yang, Chao-Tung II-90, II-121 Yang, Hailong I-404 Yang, Jiaoyun I-79 Yang, Laurence T. II-318 Yan-xiang, He II-218 Yıldız, Mehmet II-198 Yoshida, Makoto II-100 Yuan, Hui I-136 Yu, Kun-Ming II-178
Yuntao, Yu II-244 Yu, Shui I-1 Yu, You-Fu II-338 Zhang, Fan II-289 Zhang, Huiyong I-363 Zhang, Jiangang I-113 Zhang, Liang I-324, I-363, I-391 Zhang, Ning II-318 Zhang, Shao-Liang II-401 Zhang, Yuyuan II-318 Zhao, Xin II-289 Zhao, Zhenhai I-215 zhi-qiang, You II-434 Zhong, Ming-Yuan II-368 Zhou, Bing Bing I-381 Zhou, Jianfu I-102 Zhou, Jiayi II-178 Zhou, Wanlei I-1 Zhu, Danfeng I-174 Zomaya, Albert Y. I-381 Zou, Xuecheng I-149